This is another archival repost from the old blog, this one from May 2007.
The Hollow Man was on the television the other day, confirming once again my belief that Hollywood lacks the imagination to make sci-fi even a fraction as interesting as real science. There was, however, a most fantastic little piece of software in it that produced hypothetical proteins and tested their stability and biological properties in silico, complete with festive graphics of exploding ball-and-stick molecular models. It wasn’t quite up to the standards of the animation that the Daleks had conveniently produced to explain their evil plan of Dalek-Human hybrids to The Doctor, but it was certainly an impressive piece of work. Now, before I start to give the impression that I make a habit of watching bad sci-fi, I’ll get to the point. The protein development program from The Hollow Man would be a bloody useful piece of kit, not just for protein engineering, but for a far more pressing matter.
The EMBL-Bank and NCBI GenBank databases (for our purposes the two are mirror sites) are where the products of the sequencing projects, like the Human Genome Project, end up. They currently have just under 90 million entries, and should pass 100 million sometime in the summer. If you were to browse around at random (and I have no idea why you would), find a gene (as opposed to one of the many and varied non-gene entries) and manage to work out what the horribly formatted results meant, you’d find that the entries consist of the name of the gene, the organism it comes from, and a sequence. If you’re lucky, an entry will also contain some references to papers with information about what the gene does, when and where it’s expressed, and what happens when things go wrong. There are other databases, like Mendelian Inheritance in Man, an encyclopaedia of genetic diseases; Swiss-Prot, an annotated database of protein sequences and structures; and the NCBI SNP database, which documents variation, and which is so vast already that I’ve forgotten the number of entries it has, even though I looked it up this morning (all I remember is that the statistic was very well hidden!).
Mendelian Inheritance in Man and Swiss-Prot contain an impressive amount of information about functional genomics, but they hardly make a dent in the raw data of the GenBank and SNP databases. The main reason for this is that sequencing genomes is, these days, quick and cheap compared to working out how they work and what the individual genes are supposed to do. There are several ways of attempting to find out a gene/gene product/protein’s (hereafter abbreviated to “gene”) function, but none of them are perfect, so multiple strategies may have to be employed to find the answer.
The first way is to find out what tissue a gene is expressed in, and at what time: expression profiling. The gene may be switched on at certain stages of development, or in response to stress, hormones, infection, or some other signal from outside the cell, or it might be constantly expressed, or never expressed, and from this information one can build hypotheses about what its function might be. The traditional way of doing this is to start with a gene and look at what events cause it to be expressed. Levels of expression are typically assayed by northern blot or real-time PCR. The details aren’t important for our purposes: these methods simply quantify the amount of a particular transcript in the cell, and thus the amount of gene expression. Generally they allow you to look at many samples in a single experiment, but only one gene at a time. Thus, you could do a time course experiment, harvesting samples on different days during development; or you could do a tissue type experiment, harvesting samples at the same time, but from different organs; or you could see what happens to the expression of the gene after drug or hormone treatment. Generally, however, you need a prior hypothesis with this system, in order to avoid searching for a needle in a haystack. New expression profiling systems have been developed that overcome this problem by looking at many genes at once, and the most famous, microarrays, can now look at the expression levels of every gene in the human genome at the same time. Using this method, one can find interesting things about gene expression that one would never have thought to look for, and identify “expression signatures” that link multiple genes into discrete systems.
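To make the idea of an “expression signature” a little more concrete, here is a minimal sketch of one simple way genes could be grouped: by how closely their expression levels track each other across samples. Everything in it is an assumption for illustration: the gene names, the numbers, the 0.9 threshold and the choice of Pearson correlation are invented here, and real microarray analysis involves normalisation and statistics well beyond this.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-expression information
    return cov / (sd_x * sd_y)

def co_expressed_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above the threshold."""
    genes = sorted(profiles)
    return [
        (g, h)
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
        if pearson(profiles[g], profiles[h]) >= threshold
    ]

# Hypothetical expression levels across five samples (e.g. five days of a
# time course). These values are made up, not real data.
toy_profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0, 10.5],  # rises along with gene_a
    "gene_c": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as gene_a rises
}

pairs = co_expressed_pairs(toy_profiles)
print(pairs)
```

In this toy data only gene_a and gene_b rise and fall together, so they are the only pair reported. A real signature would be built from thousands of genes and many more samples.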
Another way to try to find out what a gene does is to see what happens to the phenotype (the characteristics of the individual) when you break it. There are several ways of causing a gene to be under- or over-expressed, which would likely take me several paragraphs to explain, and aren’t important to this story. These studies can be carried out in cell culture, but cell culture doesn’t have the same characteristics as an individual organism, so can only tell us so much. The studies are therefore often done with worms, flies and mice. The phenotype associated with a broken gene, though, doesn’t necessarily tell us what the gene does when working normally. Many genes are named after the phenotype associated with abnormalities, but the actual function of those genes cannot be said to be “creation of the opposite phenotype”. The Sonic Hedgehog gene takes its name from hedgehog, a fruit fly gene named after a mutant phenotype in which the body is covered in pointed lumps, but its actual function is best described as a cell signal creating a concentration gradient that tells cells in developing organs and limbs where they are on the anterior/posterior (head/tail) or dorsal/ventral (back/front or top/bottom) axis, and thus how they should proceed to develop. A gene I find particularly interesting is Retinoblastoma, named after a childhood cancer of the eye. One could describe its normal function as “preventing cancer of the eye”, or one could describe it as “preventing cell division in the absence of growth factors, by inhibiting the expression of genes required for duplicating DNA.”
Other ways to study a protein’s function include determining its structure, which remains pretty time-consuming and expensive; and finding out where it’s located in the cell and what other things it interacts with, which first requires a knowledge of when it’s expressed, and is still unlikely to tell you the full story! It would therefore be very useful if, instead of having to go through all this, we could simply predict the function of a gene from its sequence. The gene sequence can tell us a few things: it is a perfect representation of the sequence of amino acids in the protein product, for example. We can also predict things about the gene’s function based on the similarity of its sequence to other genes: if two genes have very similar sequences, they are likely to have somewhat similar functions (though they will have around 20% gene similarity by chance; it’s a high sequence similarity of amino acids, the building blocks of proteins, that is important). We can also predict the presence of some protein domains, which may determine function and location in the cell, from the similarity of amino acid sequences to the consensus sequences always associated with certain functional domains. However, the presence of a consensus sequence does not guarantee that it folds to form the predicted functional domain, or that, when the protein is fully folded, the domain is not hidden somewhere inside, unable to interact with anything.
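As a rough illustration of the two sequence-based tricks just described, here is a minimal sketch that scores the similarity of two already-aligned amino acid sequences and scans a protein for a consensus motif. The sequences are short examples chosen for illustration and are not taken from any database entry discussed here. The motif is the Walker A (P-loop) nucleotide-binding consensus, written in PROSITE notation as [AG]-x(4)-G-K-[ST]. Real tools align the sequences first and score substitutions far more carefully than this.

```python
import re

def percent_identity(a: str, b: str) -> float:
    """Percentage of positions that match in two pre-aligned sequences.

    Assumes the sequences are already aligned to the same length;
    gap characters ('-') never count as a match.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)

# Walker A / P-loop consensus, [AG]-x(4)-G-K-[ST], as a regular expression.
WALKER_A = re.compile(r"[AG].{4}GK[ST]")

def find_motif(protein: str, pattern: re.Pattern = WALKER_A) -> list:
    """Return the 0-based start positions of non-overlapping motif matches."""
    return [m.start() for m in pattern.finditer(protein)]

# Illustrative 20-residue sequences; the second differs at one position.
seq_a = "MTEYKLVVVGAGGVGKSALT"
seq_b = "MTEYKLVVVGACGVGKSALT"

identity = percent_identity(seq_a, seq_b)
hits_a = find_motif(seq_a)
hits_b = find_motif(seq_b)
print(identity, hits_a, hits_b)
```

A hit here only says that the letters match the consensus. As noted above, it says nothing about whether that stretch actually folds into a working domain, or ends up buried where it can’t interact with anything.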
We’re making inroads into understanding many of the layers of regulation that control how the genome becomes the proteome, but at the moment, the laws of protein folding, and thus the prediction of protein structure from gene sequence, evade us. Folding@home is a distributed computing project that seeks to determine these rules, and it even comes with a graphical interface that looks a little like something from sci-fi. Perhaps one day it will come up with the rules, and we can develop software that computes protein structures, and from that, what they interact with, how stable they are, and whether they can make us invisible. The conclusion, though, is that if we do work out what the rules are, the first application we should be putting them to is software to go through the EMBL-Bank/GenBank database predicting the protein structure associated with each gene. Next on the blog, I suppose I should talk about how we currently come up with novel engineered proteins, and why they’re useful.