The existence of functional, non-protein-coding DNA is all too frequently portrayed as a great surprise uncovered by genome sequencing projects, both in large media outlets and in scientific publications that should have better quality control in place. Eric Lander, writing a Human Genome Project 10th anniversary retrospective in Nature, explains the real surprise about non-coding DNA that was revealed by big omics projects. Despite ravings about the newly identified mysteries of the 'dark genome', it remains a fact that functional, non-protein-coding DNA has been known for more than half a century, well before such interesting things as microRNAs, ribozymes, and long ncRNAs were discovered. The diversity of functional (and dubiously functional) RNAs has been genuinely interesting but, in my humble opinion, not nearly as surprising as what was discovered about the relatively small slice of the human genome that shows strong evolutionary conservation (and is therefore most likely to be functional). Lander writes:
The most surprising discovery about the human genome was that the majority of the functional sequence does not encode proteins. These features had been missed by decades of molecular biology, because scientists had no clue where to look.* Comparison of the human and mouse genomes showed a substantial excess of conserved sequence, relative to the neutral rate in ancestral repeat elements [4]. The excess implied that at least 6% of the human genome was under purifying selection over the past 100 million years and thus biologically functional. Protein-coding sequences, which comprise only ~1.5% of the genome, are thus dwarfed by functional conserved non-coding elements (CNEs). Subsequent comparison with the rat and dog genomes confirmed these findings.
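The arithmetic behind those figures is easy to check. The sketch below is only a back-of-the-envelope calculation using the two percentages from the quote; it assumes that essentially all protein-coding sequence falls inside the conserved fraction, and the function and variable names are made up for illustration.

```python
# Figures taken from the quote; both are approximate shares of the whole genome.
conserved_pct = 6.0  # "at least 6%" under purifying selection (a lower bound)
coding_pct = 1.5     # "~1.5%" protein-coding


def conserved_noncoding(conserved, coding):
    """Return (conserved non-coding share in %, non-coding-to-coding ratio).

    Assumption: all protein-coding sequence lies within the conserved fraction.
    """
    noncoding = conserved - coding
    return noncoding, noncoding / coding


noncoding_pct, ratio = conserved_noncoding(conserved_pct, coding_pct)
print(f"conserved non-coding share: ~{noncoding_pct}% of the genome")
print(f"non-coding to coding ratio: about {ratio:.0f}:1")
```

Because 6% is quoted as a lower bound, the resulting ratio should be read as a minimum rather than an exact value.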
In other words, of the conserved (and most likely functional) portion of the genome, the ratio of non-coding to protein-coding DNA is roughly 3:1 (about 4.5% conserved non-coding versus ~1.5% protein-coding). That was mind-blowing. The other, non-conserved ~94% of the genome does contain some interesting features, but most of the real non-coding action is limited to a small ~4.5% slice.
*This particular sentence is flat-out wrong. It can most easily be refuted by a paper such as this one (note the pre-Human Genome Project date):
"An albumin enhancer located 10 kb upstream functions along with its promoter to direct efficient, liver-specific expression in transgenic mice." Pinkert CA, Ornitz DM, Brinster RL, Palmiter RD. Laboratory of Reproductive Physiology, School of Veterinary Medicine, University of Pennsylvania, Philadelphia, 19104. Genes Dev. 1987 May;1(3):268-76.
Abstract: Transgenic mice were used to locate the cis-acting DNA elements that are important for efficient, tissue-specific expression of the mouse albumin gene in the adult. Chimeric genes with up to 12 kb of mouse albumin 5'-flanking region fused to a human growth hormone (hGH) reporter gene were tested. Remarkably, a region located 8.5-10.4 kb upstream of the albumin promoter was essential for high-level expression in adult liver and the region in between -8.5 and -0.3 kb was dispensable. The far-upstream region behaved like an enhancer in that its position and orientation relative to the albumin promoter were not critical; however, it did not function well with a heterologous promoter. Two of four DNase hypersensitive sites found in the 5'-flanking region of the albumin gene map to the far-upstream and promoter regions; the others may reflect regions involved in developmental or environmental control of this gene.