Bioinformatics and computational biology


Bioinformatics, or computational biology, refers to the development of new database methods to store genomic information, the development of computational software and methods to extract, process, and evaluate this information, and the refinement of existing techniques for acquiring genomic data. Finding genes and determining their function, predicting the structure of proteins and RNA (ribonucleic acid) molecules from the available DNA (deoxyribonucleic acid) sequence, and determining the evolutionary relationships among proteins and DNA sequences are also part of bioinformatics.
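As an illustration of what "finding genes" can mean computationally, the following minimal sketch scans a DNA string for open reading frames (a start codon followed in frame by a stop codon). It is a simplified teaching example, not a method described in this article: the function name `find_orfs`, the `min_codons` threshold, and the restriction to the forward strand are all assumptions made for brevity, and practical gene finders use far richer statistical models.

```python
# Stop codons of the standard genetic code, written as DNA triplets.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for ATG...stop stretches on the forward strand.

    `min_codons` is a hypothetical cutoff: the number of codons before the
    stop codon (start codon included) required to report an ORF.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # the three forward reading frames
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                j = i
                # Walk codon by codon until a stop codon or the end of the sequence.
                while j + 3 <= len(dna):
                    if dna[j:j + 3] in STOP_CODONS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, dna[i:j + 3]))
                        break
                    j += 3
                i = j + 3  # resume scanning after the stop codon (or the end)
            else:
                i += 3
    return orfs


if __name__ == "__main__":
    # Toy sequence invented for this example.
    for start, end, seq in find_orfs("CCATGAAATTTGGGTAACC"):
        print(start, end, seq)
```

An open reading frame is only a candidate gene; deciding whether it is actually expressed and what it does requires the further comparison and analysis steps described later in this entry.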

The genome sequences of some bacteria, yeast, a nematode, the fruit fly Drosophila, and several plants have been obtained in the recent past, and many more have been completed or are nearing completion. The initial sequencing (a rough draft) of the human genome was completed in 2000, and the completion of the full genome sequence was announced in April 2003. In May 2006, the sequence of the last chromosome was published in the journal Nature. Although the Human Genome Project has been publicly declared complete, work to refine the data continues. As of 2005, the estimated number of genes in the human genome had been revised to 20,000 to 25,000, down from an earlier estimate of 30,000 to 40,000. Experts predict that it will take geneticists several more years before a precise number can be given.

In addition to this accumulation of nucleotide sequence data, elucidation of the three-dimensional structures of the proteins encoded by the genes has been accelerating. The result is a vast, ever-increasing number of databases and volume of genetic information. The efficient and productive use of this information requires specialized computational techniques and software. Bioinformatics has developed and grown from the need to extract and analyze the reams of genomic information, such as nucleotide sequences and protein structures.

Bioinformatics uses statistical analysis, stepwise computational analysis, and database management tools to search databases of DNA or protein sequences, to separate useful data from background, and to enable comparison of data from diverse databases. This sort of analysis is ongoing. The exploding number of databases, and the variety of experimental methods used to acquire the data, can make comparisons tedious. However, the benefits can be enormous. The immense size and interconnection of biological databases provide a resource for answering biological questions about mapping, gene expression patterns, molecular modeling, and molecular evolution, and for assisting in the structure-based design of therapeutic drugs.

Obtaining information is a multi-step process. Databases are examined, or browsed, by posing complex computational queries. Researchers who have derived a DNA or protein sequence can submit it to public repositories of such information to see whether any stored sequence matches or resembles it. If so, further analysis may reveal a putative structure for the protein encoded by the sequence, as well as a putative function for that protein. Primary databases contain one type of information (only DNA sequence data or only protein sequence data). Four primary databases currently available for these purposes are the European Molecular Biology Laboratory (EMBL) DNA sequence database, GenBank, SwissProt, and the Protein Identification Resource (PIR). Secondary databases contain information derived from other databases. Specialist databases, or knowledge databases, are collections of sequence information, expert commentary, and reference literature. Finally, integrated databases are collections (amalgamations) of primary and secondary databases.
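The idea of submitting a sequence and looking for a match or similarity can be sketched in a few lines. The example below scores a query against every entry of a tiny in-memory "database" using Smith-Waterman local alignment and ranks the hits. Everything here is illustrative: the function names, the scoring values (`match=2`, `mismatch=-1`, `gap=-2`), and the toy sequences are assumptions, and public repositories rely on much faster heuristic search programs rather than an exhaustive scan like this one.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database):
    """Rank database entries by local alignment score, best hit first."""
    hits = [(local_alignment_score(query, seq), name)
            for name, seq in database.items()]
    return sorted(hits, key=lambda hit: (-hit[0], hit[1]))


if __name__ == "__main__":
    # Hypothetical entries; a real repository holds millions of sequences.
    toy_database = {"seq_a": "TTGATTACATT", "seq_b": "CCCCCCCC"}
    for score, name in search("GATTACA", toy_database):
        print(name, score)
```

A high-scoring hit only suggests a relationship; as the text notes, any structure or function inferred from it remains putative until further analysis supports it.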

The area of bioinformatics concerned with protein sequences makes it possible to predict the three-dimensional structures of protein molecules, by use of computer graphics and by comparison with similar proteins whose structures have been determined from crystals. Knowledge of the structure allows the site or sites critical for the function of the protein to be identified. Subsequently, drugs active against such a site can be designed, or the protein can be used to enhance commercial production processes, as in pharmaceutical bioinformatics.

Bioinformatics also encompasses the field of comparative genomics, the comparison of functionally equivalent genes across species. A yeast protein is likely to have the same function as a worm protein with the same amino acid sequence. Alternatively, genes with similar sequences may have divergent functions. Such similarities and differences are revealed by the sequence information. In practice, such knowledge aids in the selection and design of genes to confer a specific function on a product and so enhance its commercial appeal.
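One simple way to quantify how similar two sequences from different species are is percent identity over an existing alignment. The sketch below assumes the two sequences have already been aligned to equal length, with `-` marking gaps; the function name and the short protein fragments are invented for illustration and do not correspond to real yeast or worm proteins.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over the ungapped columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Hypothetical aligned fragments from two species.
    print(round(percent_identity("MKT-LLV", "MKTALIV"), 1))
```

High identity makes a shared function more plausible but does not prove it, which is why similar sequences with divergent functions still have to be distinguished by other evidence.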

The most widely known example of a bioinformatics-driven endeavor is the Human Genome Project (HGP), mentioned earlier. Charles DeLisi, who at the time was Director of the Health and Environmental Research Programs at the U.S. Department of Energy (DOE), began the HGP in 1986. The project was formally established in the United States in 1990 as a joint project of the DOE and the U.S. National Institutes of Health. International cooperation occurred among geneticists from the United States, Japan, Germany, France, and the United Kingdom. Work related to the Human Genome Project has led to dramatic improvements worldwide in molecular biological techniques and in computational tools for studying genomic function.