The publication of the complete human genome sequence in 2003 made the genetic code readily accessible to researchers and served as a platform for the countless advances in science, medicine, and sequencing technology that were to follow.

Following the completion of the Human Genome Project, researchers set out to develop a comprehensive ‘map’ of which genes coded for which gene products, or proteins. This allowed researchers to track how a protein was affected when a gene sequence was altered, and what changes resulted within the cell. Subsequently, a number of diseases and disorders were linked to changes, or mutations, within protein-coding genes.

The sequencing of the human genome, therefore, provided a roadmap which researchers have used to navigate the labyrinthine terrain of scientific discovery.

All 25,000 of the protein-coding genes in humans, however, account for less than 2% of the human genome. Since the 1960s the remaining 98% of our genetic material (vast regions of non-protein-coding DNA and stretches of repetitive sequence) has been dismissed as ‘junk’ DNA. If it seems unlikely that we would be carrying such a staggeringly high proportion of irrelevant genetic material, that’s because it is. Not only would the extra baggage take up space within an already cramped nucleus, but every time a cell divides the extra DNA sequences would be replicated along with the coding regions. The cell, therefore, would be wasting a lot of time and energy copying ‘junk’ DNA.

It has since become apparent that these tracts of noncoding DNA are just as essential to our well-being as our genes. Many non-protein-coding regions of the genome act as regulators of gene activity, turning genes on and off. Mutations and alterations in these stretches of noncoding DNA, moreover, have been linked to various diseases and medical conditions. These discoveries have added another degree of complexity to an already intricate biological process.

The view of the genome has evolved in the past several years, and it appears that activity on the genomic level is much more dynamic and fluid than previously believed. And now, researchers are finally developing the technology to match it.

In 2003 the human genome had been literally ‘spelled out’ for us, but scientists still needed a way to make sense of it all, especially in light of these developments. It was no longer enough to know the order of the genetic code; now, the way those 3 billion letters are packaged within the nucleus, the DNA elements controlling the activity of protein-coding genes, and the proteins associated with genetic material are all just as important as the actual ‘instructions’ encoded in our DNA.

The vision of project ENCODE (the Encyclopedia of DNA Elements) is to orchestrate a massive effort to collect, analyze, and organize all of the relevant data into a functional map of the genome.

This past week 30 papers were published in Nature, Genome Research, and Genome Biology, all detailing scientists’ efforts in compiling ENCODE and their results to date. The ENCODE project began in 2003, and nine years later we have still only reached the tip of the iceberg. In compiling data for ENCODE, the definition of a ‘gene’ expanded to include even non-protein-coding DNA, raising the gene count in humans to approximately 50,000.

Researchers in project ENCODE have been astonishingly prolific in data collection and analysis, and in only a short time they have assigned functional characteristics to 80% of the genome.

In the course of data collection, considerable emphasis has been placed on studying the vast stretches of DNA between protein-coding genes. As described earlier, regulatory sites with the capacity to turn genes on and off are frequently found in these regions. The presence of these regulatory regions is not enough to promote or inhibit gene expression, however; this activity is often initiated by the binding of proteins to these regulatory sites. Researchers are actively compiling data on where these regulatory sites are located and which proteins bind to them.

If you remember anything from high school biology class, it is probably that the DNA in every cell in our body, no matter how different the cells may seem, is identical. For the most part this is true, save for those pesky DNA replication errors.

The remarkable variety of differences between cells (physical, functional, and in composition) is therefore the result of different combinations of genes being turned ‘on’ and ‘off’. By analyzing variations between cells, i.e., which sites in the genome are occupied and which protein occupies them, which genes are turned ‘on’ and which are turned ‘off’, we can get a glimpse of the genomic program that makes a cell unique.
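To make the idea of comparing cells concrete, here is a minimal Python sketch. It assumes each cell type can be summarized as a simple table of regulatory sites and the protein bound at each one. The site names, protein names, and cell types are invented for illustration and do not come from ENCODE data.

```python
# Hypothetical occupancy tables: regulatory site -> protein bound there.
# All names below are made up for illustration only.
liver_cell = {"site_A": "protein_1", "site_B": "protein_2", "site_C": "protein_3"}
nerve_cell = {"site_A": "protein_1", "site_C": "protein_4", "site_D": "protein_2"}


def compare_occupancy(cell_a, cell_b):
    """Summarize how two cells differ in which sites are occupied and by what."""
    sites_a, sites_b = set(cell_a), set(cell_b)
    shared = sites_a & sites_b  # sites occupied in both cells
    return {
        "only_in_a": sorted(sites_a - sites_b),
        "only_in_b": sorted(sites_b - sites_a),
        "same_protein": sorted(s for s in shared if cell_a[s] == cell_b[s]),
        "different_protein": sorted(s for s in shared if cell_a[s] != cell_b[s]),
    }


differences = compare_occupancy(liver_cell, nerve_cell)
for category, sites in differences.items():
    print(category, sites)
```

The real datasets are vastly larger and noisier, but the comparison has the same shape: the same genome, with a different pattern of occupied sites in each cell type.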

Many of the proteins bound to these regulatory sites physically interact with proteins bound to distant regions of the genome. To accommodate this partnership, the DNA in between loops out to bring the two partner proteins close enough to stick to each other. The scope of project ENCODE includes measuring the distance between interacting proteins by ‘cutting out’ the loop of DNA in between the two partnered proteins and measuring the number of letters in the loop. This information will help scientists to grasp the range of activity and the interactions between non-protein-coding DNA and protein-coding genes.
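As a rough illustration of what ‘the number of letters in the loop’ means, the sketch below assumes we already know the positions of the two binding sites on the same chromosome. That is a simplification for the sake of the arithmetic, not a description of the laboratory method, and the `BindingSite` type and coordinates are hypothetical.

```python
from typing import NamedTuple


class BindingSite(NamedTuple):
    chrom: str  # chromosome name
    start: int  # first letter of the site (0-based, inclusive)
    end: int    # position just past the last letter of the site (exclusive)


def loop_length(a: BindingSite, b: BindingSite) -> int:
    """Count the DNA letters lying between two binding sites."""
    if a.chrom != b.chrom:
        raise ValueError("sites must be on the same chromosome")
    upstream, downstream = sorted((a, b), key=lambda site: site.start)
    # Overlapping or adjacent sites have no DNA between them.
    return max(0, downstream.start - upstream.end)


# Hypothetical coordinates: a regulatory site and a distant partner site.
regulator = BindingSite("chr1", 1_000, 1_020)
partner = BindingSite("chr1", 51_020, 51_040)
print(loop_length(regulator, partner))
```

The order of the two arguments does not matter, since the function sorts the sites by position before subtracting.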

ENCODE is one of the first major efforts to compile such a broad scope of data on the human genome, and it has certainly set a lofty goal. ENCODE will most likely prove to be a veritable treasure trove of information for scores of scientists, and it is not likely that ENCODE researchers’ efforts will be slowing down anytime soon.