Proteins@Home carries out research into large-scale protein structure prediction. By increasing our knowledge of proteins, the project will contribute to a better understanding of many diseases and pathologies, and to progress in both medicine and technology.
Proteins@Home project URL: http://biology.polytechnique.fr/proteinsathome/
The inverse protein folding problem: structure prediction and protein design
The amino acid sequence of a protein determines its three-dimensional structure, or 'fold'. Conversely, the three-dimensional structure is compatible with a large, but limited set of amino acid sequences. Enumerating the allowed sequences for a given fold is known as the 'inverse protein folding problem'. We are working to solve this problem for a large number of known protein folds (a representative subset: about 1500 folds). The most expensive step is to build a database of energy functions that describe all these structures. For each structure, we consider all possible sequences of amino acids. Surprisingly, this is computationally tractable, because our energy functions are sums over pairs of interactions. Once this is done, we can explore the space of amino acid sequences in a fast and efficient way, and retain the most favorable sequences. This large-scale mapping of protein sequence space will have applications for predicting protein structure and function, for understanding protein evolution, and for designing new proteins. By joining the project, you will help to build the database of energy functions and advance an important area of science with potential biomedical applications.
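The tractability argument above can be illustrated with a minimal sketch. It assumes the energy of a sequence is a sum of one term per position plus one term per pair of positions. The table layout, the random numbers that fill it, and the function names are hypothetical and are not taken from the project's software. Under that assumption, a structure with N positions needs on the order of N x N x 400 stored numbers, rather than one number for each of the 20^N possible sequences.

```python
import itertools
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the twenty common amino acids


def build_toy_tables(n_positions, seed=0):
    """Fill hypothetical single-position and pair tables with random toy energies."""
    rng = random.Random(seed)
    single = [{a: rng.uniform(-1, 1) for a in AMINO_ACIDS} for _ in range(n_positions)]
    pair = {
        (i, j): {(a, b): rng.uniform(-1, 1) for a in AMINO_ACIDS for b in AMINO_ACIDS}
        for i, j in itertools.combinations(range(n_positions), 2)
    }
    return single, pair


def pair_energy(pair, i, a, j, b):
    """Look up a pair term regardless of the order in which the positions are given."""
    if i < j:
        return pair[(i, j)][(a, b)]
    return pair[(j, i)][(b, a)]


def sequence_energy(seq, single, pair):
    """Total energy: sum over positions plus sum over all pairs of positions."""
    total = sum(single[i][a] for i, a in enumerate(seq))
    for i, j in itertools.combinations(range(len(seq)), 2):
        total += pair_energy(pair, i, seq[i], j, seq[j])
    return total


def mutation_delta(seq, pos, new_aa, single, pair):
    """Energy change for one mutation; only terms that involve pos are recomputed."""
    old_aa = seq[pos]
    delta = single[pos][new_aa] - single[pos][old_aa]
    for other, aa in enumerate(seq):
        if other != pos:
            delta += pair_energy(pair, pos, new_aa, other, aa)
            delta -= pair_energy(pair, pos, old_aa, other, aa)
    return delta


if __name__ == "__main__":
    single, pair = build_toy_tables(5)
    print(sequence_energy("ACDEF", single, pair))
    print(mutation_delta("ACDEF", 2, "W", single, pair))
```

Once the tables exist, scoring a mutation touches only the terms involving the mutated position, which is one plausible reason why exploring sequence space afterwards can be fast.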
Introduction: structure prediction on a genomic scale
|Fig. 1 The polypeptide chain. A closeup shows the chemical form of the "backbone", to which the "side chains" Ri, Ri+1, ..., are attached. The (C=O) and (N-H) groups are linked by the "peptide" bond, which has a partial double bond character, making the (C=O)-(N-H) "peptide group" stiff and planar. The torsion angles Phi and Psi, around single bonds, are soft. The side chains can be any of the twenty common amino acid side chains.|
Over the past decade, the genomes of about 1000 organisms have been entirely sequenced: the exact nucleotide sequence of the DNA that makes up their chromosome(s) has been experimentally determined. These nucleotides encode all the molecules the cells need to produce, including their full complement of proteins. Proteins are the essential actors of the living cell: they serve as biochemical catalysts, motors, and pumps, they read and interpret the genetic message, and they direct the response to external signals or attacks.
Humans, for example, have a genome of about 3.4 billion nucleotides, including about 25,000 genes that code for proteins. A challenge today is to determine the structure and biological function of all the known proteins. Indeed, although the amino acid sequences of millions of proteins have been determined, most of their three-dimensional molecular structures are unknown. Yet the knowledge of these structures is essential to identify, understand, and possibly engineer or modify their biological functions. Predicting the three-dimensional structure from the amino acid sequence is the classic "Protein Folding Problem", one of the most important problems in molecular biology today.
In the cell, the amino acid sequence of a protein uniquely directs it to "fold" into a specific, three-dimensional, molecular structure (Figs. 1, 2). In effect, the amino acid chain has a unique, preferred, three-dimensional arrangement, which corresponds to its lowest possible free energy. It also has the ability to rapidly explore the available conformational space to find this preferred structure. The preferred structure is known as the "native" structure. The ability to fold rapidly into a unique, native structure is an essential and universal property of natural proteins.
|Fig. 2 Folded proteins. Space-filling views of cytochrome c (an electron carrier in the respiratory chain) and hemoglobin (an oxygen carrier in the blood), along with a water molecule, approximately to scale.|
Protein structure prediction is difficult for two main reasons. First, a protein has many degrees of freedom and an almost unlimited set of possible conformations. Second, only about 10 kcal/mol are usually required to denature, or "unfold", a protein. For a small protein of 1000 atoms, this represents 0.01 kcal/mol per atom. This energy can be compared to the average kinetic energy of each atom at room temperature: about 1 kcal/mol. Thus, the most stable, native structure is separated from non-native structures by only a very small free energy difference. Fortunately, some information about the native structure is often available, even if it is limited, and this can lead to a useful prediction despite these difficulties.
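The comparison of energies in the paragraph above can be checked with a few lines of arithmetic. This sketch takes the 10 kcal/mol and 1000 atoms from the text. The room temperature of 298 K and the use of (3/2)RT as the average kinetic energy per atom are assumptions made here for illustration.

```python
R_KCAL_PER_MOL_K = 1.987e-3  # gas constant in kcal/(mol*K)
ROOM_TEMPERATURE_K = 298.0   # assumed value for "room temperature"


def stability_per_atom(unfolding_energy_kcal, n_atoms):
    """Unfolding free energy spread evenly over all atoms, in kcal/mol per atom."""
    return unfolding_energy_kcal / n_atoms


def thermal_energy_per_atom(temperature_k=ROOM_TEMPERATURE_K):
    """Average kinetic energy per atom, (3/2) R T, in kcal/mol."""
    return 1.5 * R_KCAL_PER_MOL_K * temperature_k


if __name__ == "__main__":
    per_atom = stability_per_atom(10.0, 1000)  # values quoted in the text
    thermal = thermal_energy_per_atom()
    print(f"stability per atom: {per_atom:.3f} kcal/mol")
    print(f"thermal energy per atom: {thermal:.2f} kcal/mol")
    print(f"ratio thermal / stability: {thermal / per_atom:.0f}")
```

With these inputs, the thermal energy per atom comes out close to the 1 kcal/mol quoted in the text, roughly two orders of magnitude larger than the stability per atom.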
For example, the amino acid sequence may be similar to the sequence of another protein, whose 3D structure is already known. Proteins with similar sequences are said to be "homologous", and structure prediction in this case reduces to a "homology modelling" problem. This type of prediction is carried out by the Rosetta@Home and Predictor@Home distributed computing projects, described at boinc.bakerlab.org/rosetta and predictor.scripps.edu.
We are exploring a related approach, with a more limited goal. Instead of searching for the optimal conformation, or fold, for a given amino acid sequence, we consider the inverse problem: for a given fold, we search for the best amino acid sequences. With the Protein Folding Problem, we need to search a vast conformational space. With the "Inverse Folding Problem", we need to search the space of amino acid sequences (of a given length). We are solving the inverse folding problem for a representative subset of all known protein structures. This subset includes 1000 structures of protein "domains", collected in the "Structural Classification of Proteins" (SCOP) database. A protein "domain" is a structural unit, made of 50-300 amino acids, which is either a small protein, or a part of a larger protein. Larger proteins are invariably built up from several distinct domains, and a protein domain can often fold into its specific structure by itself, even if it is removed from the rest of the protein to which it belongs.
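To give a sense of the size of the sequence space for domains of 50-300 amino acids, the following sketch counts the sequences of a given length built from the twenty common amino acids. The function names are hypothetical. The count is simply 20 raised to the chain length.

```python
N_AMINO_ACID_TYPES = 20  # the twenty common amino acids


def sequence_space_size(length):
    """Number of distinct amino acid sequences of the given length."""
    return N_AMINO_ACID_TYPES ** length


def decimal_digits(n):
    """Number of decimal digits of a positive integer."""
    return len(str(n))


if __name__ == "__main__":
    for length in (50, 300):  # domain sizes quoted in the text
        size = sequence_space_size(length)
        print(f"length {length}: a number with {decimal_digits(size)} digits")
```

Even for the shortest domains, this number is far too large to enumerate one sequence at a time, so any practical search can only sample a small part of the space.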
What does this have to do with structure prediction? In fact, we want to solve a "fold recognition" problem (Figure 3). For each domain, we consider several million possible sequences, and identify the most favorable. These provide a "signature" of the 3D domain structure, or fold. Indeed, if we consider now a new protein sequence, for which the 3D structure is unknown, we can compare it to our database of computed sequences. If the new sequence is similar to one or more in our database, we can infer that it will adopt the same 3D structure. In effect, we want to identify the fold of the new sequence, and this is the first step towards structure prediction by homology modelling.
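The comparison step described above can be sketched as a lookup against a database of computed sequences. This is a deliberately simplified illustration, not the project's method: it assumes equal-length sequences and scores similarity as the fraction of identical positions, whereas practical fold recognition would typically rely on sequence alignment. The fold names, the toy sequences, and the 0.3 threshold are hypothetical.

```python
def fraction_identical(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def recognize_fold(query, database, threshold=0.3):
    """Return (fold, score) for the most similar stored sequence.

    The fold is None when no stored sequence reaches the threshold.
    """
    best_fold, best_score = None, 0.0
    for fold, sequences in database.items():
        for candidate in sequences:
            if len(candidate) != len(query):
                continue  # this toy comparison cannot handle length differences
            score = fraction_identical(query, candidate)
            if score > best_score:
                best_fold, best_score = fold, score
    if best_score < threshold:
        return None, best_score
    return best_fold, best_score


if __name__ == "__main__":
    # Hypothetical database: fold name -> favorable sequences computed for that fold
    toy_database = {
        "fold_A": ["ACDEFGHIKL", "ACDEFGHIKV"],
        "fold_B": ["WWYYPPGGNN"],
    }
    print(recognize_fold("ACDEFGHIKM", toy_database))
```

A query that matches none of the stored sequences well returns no fold, which corresponds to the case where the database offers no basis for a prediction.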
In addition to structure prediction, another application of these techniques is the construction of new proteins, or "protein design". This technique, extensively developed and applied by the Rosetta project (boinc.bakerlab.org/rosetta), can also be performed with our software. Among the sequences associated with a given protein domain, we can select those that are likely to perform a desired function, such as binding specifically to another protein, or catalyzing a particular chemical reaction. By selecting sequences that stabilize a given fold and, at the same time, are capable of performing a specific chemical or biological function, one performs molecular evolution in the computer. This technique for protein design is referred to as "Directed Evolution". Directed evolution has been successfully used in recent years to develop new biosensors, new catalysts, and to create completely new protein folds.
|Fig. 3 (A) The inverse folding problem. A large number of sequences (left) are tested for their ability to stabilize a given backbone fold (center). Favorable sequences are retained (right). In effect, the sequences are "filtered" through the 3D structure. (B) Fold recognition. Given the amino acid sequence corresponding to a new gene, we try to match it to a large but finite number of known protein structures. The matching can be performed by comparing the new sequence to the sequence families generated (A) for all the known protein domain structures.|
Video about protein structure
This video is not about the Proteins@Home project itself. It is a presentation on the SWISS-MODEL structure prediction for adeno-associated virus serotype 9 (AAV9), and it covers the same general topic of protein structure prediction.