Having a BLAST with Darwin

-or- “(One of many reasons) Why genomics matters”

This is an archival repost of a piece originally published on the old blog in February 2009, during the Darwin 200 celebrations.

In chapter 14 of The Origin, Darwin discussed embryological stages and their utility in classification. This utility derives from the fact that in animals phenotypic variation between species is less complicated at earlier developmental stages, and less influenced by what Darwin calls “special habits”. In some cases, this principle can be taken to extremes. Today, we refine species classification using the gold-standard method of comparative genomics. Comparative genomics involves matching up the As, Cs, Gs, and Ts of orthologous sequences (that is, sequences related by descent in now-separate species) and highlighting where the genes have diverged. But sometimes, the most interesting finding is where the genes have not diverged.

The genomes of animals are stuffed full of sequences which have no obvious purpose, and in which mutation seems to be able to run free without consequence. Over evolutionary time, these sequences are buffeted by random processes, and as species diverge, so these sections of their genomes drift apart. Buried in amongst all of this nonsense, though, are functional sequences, including genes. In these functional sequences, mutations and variation might have all sorts of phenotypic consequences, and genomic variations which have consequences can be subject to selection. There is a tendency, therefore, for such sequences to show very different patterns of variation between species from those of the “junk” sequences. These patterns can reveal important principles in biology, and the patterns themselves are discovered by comparative genomics.

One of the most notorious finds is the homeobox motif.[1] This motif is a gene-subsection found in a set of genes called Hox genes, and it has the rare distinction of being found in such distant groups as vertebrates and insects with an almost identical sequence – it is therefore said to be highly “conserved”. The sequences vary only in that they contain “silent mutations”. Due to the redundancy of the “genetic code”, which maps the 64 possible triplets of nucleotides – gene letters – to the twenty commonly used amino acids – protein letters – many mutations within genes will make absolutely no difference to their carrier’s phenotype, and thus go unnoticed by selection. Such mutations are said to be “silent”. Darwin may have been interested to learn that the Hox genes, which contain this famously conserved motif, happen to be key in determining the layout of the body early in development. There is genetic conformity to match the phenotypic conformity in early development.
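To make the idea of a “silent” mutation concrete, here is a minimal sketch in Python. It builds the standard genetic code (64 triplets mapped to twenty amino acids plus stop signals) and counts, codon by codon, which differences between two coding sequences leave the protein unchanged. The function names and the short example sequences are invented for illustration; they are not real Hox sequences, and the sketch assumes the two sequences are already in the same reading frame and of comparable length.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a coding sequence, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def classify_differences(seq_a, seq_b):
    """Count differing codons as (silent, non_silent).

    Assumes both sequences are in frame and aligned without gaps
    (a simplification made for this illustration).
    """
    silent = non_silent = 0
    usable = min(len(seq_a), len(seq_b)) // 3 * 3
    for i in range(0, usable, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            silent += 1      # same amino acid: invisible at the protein level
        else:
            non_silent += 1  # amino acid changes
    return silent, non_silent


# Toy sequences (hypothetical): CGT -> CGC and AAA -> AAG are both silent.
print(translate("ATGCGTAAA"), translate("ATGCGCAAG"))
print(classify_differences("ATGCGTAAA", "ATGCGCAAG"))
```

Two sequences that differ only by silent changes, as the homeobox does between distant groups, would show all of their differences in the first count and none in the second.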

But the sequences encoding proteins are not the only important functional sequences in the genome. There are plenty of sequences associated with genes – regulators of gene expression, for instance – and a few interesting things that are not associated with genes at all. “Cis-regulatory regions”, for example, are found beside genes, and contain sequences which enable enzymes to attach themselves to the DNA and initiate gene expression. During the past decade, interesting patterns of conservation have been discovered in the cis-regulatory regions associated with many genes, and especially with key developmental genes. These cis-regulatory sequences do not directly map to protein sequences, and so they cannot contain “silent mutations” in the sense that protein-coding regions can. But these sequences can be conserved. They can be very highly conserved. They can be ultraconserved![2][3] Sandelin et al found in 2004, for example, “ultraconserved” cis-regulatory sequences over 1000 bases in length which were identical in human, mouse, and pufferfish, and over 3,500 perfectly identical regulatory regions of 50 bases or more.[4] Those numbers might mean nothing to you, but the bottom line is that if genomes were essays, somebody would be in front of a plagiarism tribunal right now.

In BMC Evolutionary Biology in 2008, Lin et al[5] described a novel collection of ultraconserved regions (UCRs) that they found in the Hox genes of placental mammals. These particular UCRs were not found in the cis-regulatory regions, but are the first to be found in the protein-coding sections of genes. The UCRs are at least 125 bases in length, and are identical in humans, dogs, and mice – indeed, they actually show a greater degree of identity in these species than does the famously conserved homeobox. Lin et al cannot yet explain the importance of the UCRs that they have found, but it is reasonable to assume that important they must be. The thing I find particularly interesting about their story, however, is how the UCRs were found and investigated. Lin et al stumbled upon the conserved regions while looking at a different question – the divergence of Hox genes in mammals. They had retrieved the sequences of orthologous Hox genes for mammalian species such as human, chimp, cow, dog, duck-billed platypus, macaque, mouse, opossum, and rat, along with chicken, pufferfish, and zebrafish for comparison. These are all species whose genomes have been sequenced and made publicly available in government-funded databases. Lin et al used a clever search engine called BLAST to find all of the Hox genes of these species, and then used a classic piece of bioinformatics software, ClustalX, to “align” all of the genes and point out where they do and do not vary between the species.
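Once sequences have been aligned, spotting a candidate ultraconserved region amounts to looking for long runs of alignment columns in which every species carries the same base. The sketch below shows that idea in Python. It is not Lin et al's pipeline, and it does not call BLAST or ClustalX; it assumes you already have the aligned sequences as equal-length strings with "-" marking gaps, and the function name and toy alignment are hypothetical. The default threshold of 125 is taken from the figure quoted above.

```python
def conserved_runs(aligned, min_len=125):
    """Find runs of perfectly identical, gap-free alignment columns.

    `aligned` is a list of equal-length strings from a multiple alignment,
    with "-" for gaps. Returns (start, length) tuples, using 0-based column
    positions, for every run of at least `min_len` columns.
    """
    if not aligned:
        return []
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    runs = []
    start = None
    # Iterate one column past the end so a run touching the end is closed.
    for col in range(width + 1):
        identical = (
            col < width
            and aligned[0][col] != "-"
            and all(seq[col] == aligned[0][col] for seq in aligned)
        )
        if identical:
            if start is None:
                start = col
        elif start is not None:
            if col - start >= min_len:
                runs.append((start, col - start))
            start = None
    return runs


# Toy alignment (invented, far shorter than any real Hox gene).
toy = [
    "ACGTACGTAA",
    "ACGTACGAAA",
    "ACGTAC-TAA",
]
print(conserved_runs(toy, min_len=5))
```

A real analysis would also need to handle upper and lower case, ambiguous bases, and the choice of which species to include, all of which change what counts as "identical".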

Why do I get excited by BLAST and ClustalX, the workhorses of computational biology? Because it’s little studies like this which serve to remind us of why genomics and computational biology are important. Genomics, it seems, has still not recovered from accusations of being overhyped after the biotech bubble burst a decade ago; and computational biology gets all kinds of slander thrown at it – a discipline churning out unreliable results to be dismissed, or a field to turn to in desperation as laboratory studies refuse to give you the answer you want. Perhaps, when you’ve slaved for years as a student and had to fight for the funding to maintain a laboratory, there is something a little frightening about research that requires just a PC, an internet connection, and a clever idea. But Lin et al discovered a whole new category of ultraconserved genomic regions right under the noses of the hundreds of molecular and developmental “wet” biologists who work on these hugely important Hox genes, using just some everyday software, the bulk raw data of genome projects, and the wit to spot an interesting pattern.

BLAST and ClustalX are exciting because with just a sequence alignment, you can demonstrate again and again, in an almost endless variety of ways, just how right and how powerful the theory of evolution is.


  1. ^ Woltering J, Duboule D: Conserved elements within open reading frames of mammalian Hox genes. J Biol 2009, 8:17.
  2. ^ Dermitzakis ET, Reymond A, Antonarakis SE: Conserved non-genic sequences – an unexpected feature of mammalian genomes. Nat Rev Genet 2005, 6:151-157.
  3. ^ Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D: Ultraconserved elements in the human genome. Science 2004, 304:1321-1325.
  4. ^ Sandelin A, Bailey P, Bruce S, Engström PG, Klos JM, Wasserman WW, Ericson J, Lenhard B: Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics 2004, 5:99.
  5. ^ Lin Z, Ma H, Nei M: Ultraconserved coding regions outside the homeobox of mammalian Hox genes. BMC Evol Biol 2008, 8:260. doi:10.1186/1471-2148-8-260

Disclosure: I work for the publisher of three of the journals cited, and handled one of the papers (Lin, 2008). All opinions are my own; this post was written on the train, not in the course of duties; no privileged detail that can’t be found in the paper was disclosed, etc, etc.
