I recently did an interview with New Scientist for what, I am happy to say, was one of the most reasonable popular reviews of "junk DNA" that has appeared in recent times (Pearson 2007). My small section appeared in a box entitled "Survival of the fattest", in which most of the discussion related to diversity in genome size and its causes and consequences. It even included mention of "the onion test", which I proposed as a tonic for anyone who thinks they have discovered "the" functional explanation for the existence of vast amounts of non-coding DNA within eukaryotic genomes. Also thrown in, though not because I said anything about it, was a brief analogy to computer code: "Computer scientists who use a technique called genetic programming to 'evolve' software also find their pieces of code grow ever larger -- a phenomenon called code bloat or 'survival of the fattest'".

I do not follow the literature of computer science, though I am aware that "genetic algorithms" (i.e., program evolution by mutation and selection) are a useful approach to solving complex puzzles. When I read the line about code bloat, my impression was that it probably gave other readers an interesting, though obviously tangential, analogy by which to understand the fact that streamlined efficiency of any coding system, genetic or computational, is not a given when it is the product of a messy process like evolution.

More recently, I have been made aware of an electronic article published in the (non-peer-reviewed) online repository known as arXiv (pr. "archive"; the "X" is really "chi") that takes this analogy to an entirely different level. Indeed, the authors of the paper (Feverati and Musso 2007) claim to use a computer model to provide insights into how some eukaryotic genomes become so bloated. That is, instead of applying biological observations (i.e., naturally evolving genomes can become large) to a computational phenomenon (i.e., programs evolved in silico can become large, too), the authors flipped the situation around and decided that a computer model could provide substantive information about how genomes evolve in nature.

I will state up front that I am rarely (read: never) convinced by proof-by-analogy studies. Yes, modeling can be helpful if it provides a simplified way to test the influence of individual parameters in complex systems, but only insofar as the conclusions are then compared against reality. When it comes to something like genome size evolution, which applies to millions of species (billions if you consider that every species that has ever lived, about 99% of which are extinct, had a genome) and billions of years, one should be very skeptical of a model that involves only a handful of simplified parameters. This is especially true if no effort is made to test the model in the one way that counts: by asking if it conforms to known facts about the real world.

The abstract of the Feverati and Musso (2007) article says the following:
The development of a large non-coding fraction in eukaryotic DNA and the phenomenon of the code-bloat in the field of evolutionary computations show a striking similarity. This seems to suggest that (in the presence of mechanisms of code growth) the evolution of a complex code can't be attained without maintaining a large inactive fraction. To test this hypothesis we performed computer simulations of an evolutionary toy model for Turing machines, studying the relations among fitness and coding/non-coding ratio while varying mutation and code growth rates. The results suggest that, in our model, having a large reservoir of non-coding states constitutes a great (long term) evolutionary advantage.
I will not embarrass myself by trying to address the validity of the computer model itself -- I am but a layman in this area, and I am happy to assume for the sake of argument that it is the single greatest evolutionary toy model for Turing machines ever developed. It does not follow, however, that the authors are correct in their assertion that they "have developed an abstract model mimicking biological evolution".

As I understand it, the simulation is based on devising a pre-defined "goal" sequence, similarity to which forms the basis of selecting among randomly varying algorithms. As algorithms undergo evolution by selection, they tend to accumulate more non-coding elements, and the ones that reach the goal most effectively turn out to be those with an "optimal coding/non-coding ratio" which, in this case, was less than 2%. The implication, not surprisingly, is that genomes evolve to become larger because this improves long-term evolvability by providing fodder for the emergence of new genes.

Before discussing this conclusion, it is worth considering the assumptions that were built into the model. The authors note that:
For the sake of simplicity, we imposed various restrictions on our model that can be relinquished to make the model more realistic from a biological point of view. In particular we decided that:
  1. non-coding states accumulate at a constant rate (determined by the state-increase rate pi) without any deletion mechanism [this is actually two distinct claims rolled into one],
  2. there is no selective disadvantage associated with the accumulation of both coding and non-coding states,
  3. the only mutation mechanism is given by point mutation and it also occurs at a constant rate (determined by the mutation rate pm),
  4. there is a unique ecological niche (defined by the target tape),
  5. population is constant,
  6. reproduction is asexual.
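
To make these six restrictions concrete, here is a minimal sketch in Python. It is emphatically not the Feverati and Musso model, which evolves Turing machines; it is a hypothetical bit-string toy written only to show what the listed assumptions look like in code. Every name in it (fitness, mutate, evolve, coding_fraction), the representation of a "state" as an (active, bit) pair, the truncation selection, and the default parameter values are assumptions made for illustration and do not come from the paper.

```python
import random


def fitness(genome, target):
    """Count positions where the expressed (coding) bits match the target tape."""
    expressed = [bit for active, bit in genome if active]
    return sum(1 for e, t in zip(expressed, target) if e == t)


def mutate(genome, pm, pi, rng):
    """Return a mutated copy of a genome (a list of (active, bit) pairs)."""
    child = []
    for active, bit in genome:
        if rng.random() < pm:
            # Restriction 3: point mutation at a constant rate pm.
            # Here it flips either the bit or the coding/non-coding flag.
            if rng.random() < 0.5:
                bit = 1 - bit
            else:
                active = not active
        child.append((active, bit))
    if rng.random() < pi:
        # Restriction 1: one non-coding state is added at a constant rate pi,
        # and nothing is ever deleted.
        child.append((False, rng.randint(0, 1)))
    return child


def evolve(target, pop_size=50, generations=200, pm=0.01, pi=0.1, seed=0):
    """Evolve a population toward a single fixed target (restriction 4)."""
    rng = random.Random(seed)
    population = [[(True, rng.randint(0, 1))] for _ in range(pop_size)]
    for _ in range(generations):
        # Restriction 2: fitness ignores genome length, so extra states are free.
        ranked = sorted(population, key=lambda g: fitness(g, target), reverse=True)
        parents = ranked[: pop_size // 2]
        # Restrictions 5 and 6: constant population size, one parent per offspring.
        population = [mutate(rng.choice(parents), pm, pi, rng) for _ in range(pop_size)]
    return population


def coding_fraction(genome):
    """Fraction of states in a genome that are currently coding."""
    return sum(1 for active, _ in genome if active) / len(genome)
```

Calling evolve([1, 0, 1, 1, 0, 0, 1, 0]) and then applying coding_fraction to each returned genome gives the coding/non-coding split in this toy. Note where the design choices sit: the fitness function is defined entirely by a pre-set target, and genome length carries no cost, so whatever such a sketch produces is a consequence of those choices rather than evidence about real genomes.
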
As noted, I am fine with considering this a fantastic computer simulation -- it just isn't a simulation that has any resemblance to the biological systems that it purports to mimic. Consider the following:
  • Although some authors have suggested that non-coding DNA accumulates at a constant rate (e.g., Martin and Gordon 1995), this is clearly not generally true. All extant lineages can trace their ancestries back to a single common ancestor, and thus all living lineages (though not necessarily all taxonomic groups) have existed for exactly the same amount of time. And yet the amount of non-coding DNA varies dramatically among lineages, even among closely related ones. Ergo, the rate of accumulation of non-coding DNA differs among lineages. Premise 1 is rejected.
  • The insertion of non-coding elements can be selectively relevant not only in terms of effects on protein-coding genes (many transposable elements are, after all, disease-causing mutagens), but also in terms of bulk effects on cell division, cell size, and associated organism-level traits (Gregory 2005). Premise 2 is rejected.
  • The accumulation of non-coding DNA in eukaryotes does not occur by point mutation, except in the sense that genes that are duplicated may become pseudogenized by this mechanism. Indeed, the model seems only to involve a switch between coding and non-coding elements without the addition of new "nucleotides", which makes it even more distant from true genomes. Moreover, the primary mechanisms of DNA insertion, including gene duplication and inactivation, transposable element insertion, and replication and recombination errors, do not occur at a constant rate. In fact, the presence of some non-coding DNA can have a feedback effect in which the likelihood of additional change is increased, be it by insertions (e.g., into non-coding regions, such that mutational consequences are minimized) or deletions (e.g., illegitimate recombination among LTR elements) or both (e.g., unequal crossing over or replication slippage enhanced by the presence of repetitive sequences). Premise 3 is rejected.
  • Evolution does not have a pre-defined goal. Evolutionary change occurs along trajectories that are channeled by constraints and history, but not by foresight. As long as a given combination of features allows an organism to fill some niche better than alternatives, it will persist. Not only this, but models like the one being discussed are inherently limited in that they include only one evolutionary process: adaptation. Evolution in the biological world also occurs by non-adaptive processes, and this is perhaps particularly true of the evolution of non-coding DNA. It is on these points that the analogy between evolutionary computation and biological evolution fundamentally breaks down. Premise 4 is rejected in the strongest possible terms.
  • Real populations of organisms are not constant in size, though one could argue that in some cases they are held close to the carrying capacity of an available niche. However, this assumes the existence of only one conceivable niche. Real populations can evolve to exploit different niches. Premise 5 is rejected.
  • With a few exceptions (e.g., DNA transposons), transposable elements are sexually transmitted parasites of the genome, and these elements make up the single largest portion of eukaryotic genomes (roughly half of the human genome, for example). Ignoring this fact makes the model inapplicable to the very question it seeks to address. Premise 6 is rejected.

The main problem with proofs-by-analogy such as this is that they disregard most of the characteristics that make biological questions complex in the first place. Non-coding DNA evolves not as part of a simple, goal-directed, constant-rate process, but one typified by the influence of non-adaptive processes (e.g., gene duplication and pseudogenization), selection at multiple levels (e.g., both intragenomic and organismal), and open-ended trajectories. An "evolutionary" simulation this may be, but a model of biological evolution it is not.

Finally, it is essential to note that "non-coding elements make future evolution possible" explanations, though invoked by an alarming number of genome biologists, contradict basic evolutionary principles. Natural selection cannot favour a feature, especially a potentially costly one such as the presence of large amounts of non-coding DNA, because it may be useful down the line. Selection occurs in the here and now, and is based on reproductive success relative to competing alternatives. Long-term consequences are not part of the equation except in artificial situations where there is a pre-determined finish line to which variants are made to race.

That said, there can be long-term consequences in which inter-lineage sorting plays a role. In terms of processes such as alternative splicing and exon shuffling, which rely on the existence of non-coding introns, an effect on evolvability is plausible and may help to explain why lineages of eukaryotes with introns are so common (Doolittle 1987; Patthy 1999; Carroll 2002). However, this is not necessarily linked to total non-coding DNA amount. For a process of inter-lineage sorting to affect genome size more generally, large amounts of non-coding DNA would have to be insufficiently detrimental in the short term to be removed by organism-level selection, and would have to improve lineage survival and/or enhance speciation rates, such that over time one would observe a world dominated by lineages with huge genomes. In principle, this would be compatible with the conclusions of the model under discussion, at least in broad outline. In practice, however, this is undone by evidence that lineages with exorbitant genomes are restricted to narrower habitats (e.g., Knight et al. 2005), are less speciose (e.g., Olmo 2006), and may be more prone to extinction (e.g., Vinogradov 2003) than those with smaller genomes.

Non-coding DNA does not accumulate "so that" it will result in longer-term evolutionary advantage. And even if this explanation made sense from an evolutionary standpoint, it is not the effect that is observed in any case. No computer simulation changes this.



Carroll, R.L. 2002. Evolution of the capacity to evolve. Journal of Evolutionary Biology 15: 911-921.

Doolittle, W.F. 1987. What introns have to tell us: hierarchy in genome evolution. Cold Spring Harbor Symposia on Quantitative Biology 52: 907-913.

Feverati, G. and F. Musso. 2007. An evolutionary model with Turing machines. arXiv:0711.3580v1.

Gregory, T.R. 2005. Genome size evolution in animals. In: The Evolution of the Genome (edited by T.R. Gregory). Elsevier, San Diego, pp. 3-87.

Knight, C.A., N.A. Molinari, and D.A. Petrov. 2005. The large genome constraint hypothesis: evolution, ecology and phenotype. Annals of Botany 95: 177-190.

Martin, C.C. and R. Gordon. 1995. Differentiation trees, a junk DNA molecular clock, and the evolution of neoteny in salamanders. Journal of Evolutionary Biology 8: 339-354.

Olmo, E. 2006. Genome size and evolutionary diversification in vertebrates. Italian Journal of Zoology 73: 167-171.

Patthy, L. 1999. Genome evolution and the evolution of exon shuffling -- a review. Gene 238: 103-114.

Pearson, A. 2007. Junking the genome. New Scientist 14 July: 42-45.

Vinogradov, A.E. 2003. Selfish DNA is maladaptive: evidence from the plant Red List. Trends in Genetics 19: 609-614.


Update: The authors respond on Genomicron.