It's done: the independent sequencing and assembly of the six billion base pairs in the genome of one person, Craig Venter of the J. Craig Venter Institute (JCVI), has been completed.
Two general versions of the human genome already exist, but both are a melding of DNA from various people. One version, from Celera Genomics, is a consensus assembly from five individuals, while the government-funded version is a haploid genome based on sequencing from a limited number of individuals.
It seems both versions greatly underestimated human genetic diversity.
The project, under way since 2003, used whole-genome shotgun sequencing and highly accurate long reads from automated Sanger dideoxy DNA sequencing to produce data beyond what was available from Venter's DNA already contained in the Celera version, bringing the final set to 32 million sequences.
From the combined data set of more than 20 billion base pairs, the researchers were able to assemble the human genome with an overall length of 2.810 billion base pairs. The genome was covered 7.5 times, ensuring that each set of contributing chromosomes was covered over 3.2 times for greater than 96% coverage of the two parental genomes. The team at JCVI compared and contrasted the new HuRef diploid genome sequence to earlier versions of published human genomes and found that the HuRef version improved upon both these early versions by providing more and correctly oriented base pairs.
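The coverage figures above can be checked with a short calculation. The sketch below assumes the classic idealized shotgun model in which reads land uniformly at random, so the fraction of bases covered at mean depth c is about 1 - e^(-c). That model is an assumption brought in here for illustration; the article does not say how the team computed its coverage percentages. The function names are hypothetical.

```python
import math

def mean_depth(total_read_bases: float, genome_length: float) -> float:
    """Average number of times each base is sequenced."""
    return total_read_bases / genome_length

def fraction_covered(depth: float) -> float:
    """Expected fraction of bases hit at least once, assuming reads
    fall uniformly at random (idealized Poisson model, an assumption)."""
    return 1.0 - math.exp(-depth)

# Figures quoted in the article.
genome_length = 2.810e9          # assembled length in base pairs
diploid_depth = 7.5              # reported overall coverage

# Read bases implied by 7.5x coverage ("more than 20 billion" in the text).
implied_read_bases = diploid_depth * genome_length

# Splitting the reads evenly between the two parental chromosome sets.
per_haplotype_depth = diploid_depth / 2

print(f"implied read bases: {implied_read_bases / 1e9:.1f} billion")
print(f"per-haplotype depth: {per_haplotype_depth:.2f}x")
print(f"covered at 3.2x: {fraction_covered(3.2):.1%}")
print(f"covered at {per_haplotype_depth:.2f}x: {fraction_covered(per_haplotype_depth):.1%}")
```

Under this simplified model, a depth of a little over 3.2 per parental chromosome set gives roughly 96% coverage, which is in line with the figures reported above.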
Since the HuRef genome is diploid, the two sets of parental chromosomes could be compared directly with each other. One of the most surprising and important findings from this research was the high degree of genetic variation found between the two chromosome sets within a single individual.
“Each time we peer into the human genome, we uncover more valuable insight into our intricate biology,” said Dr. Venter. “With this publication, we have shown that human-to-human variation is more than seven-fold greater than earlier estimates, proving that we are in fact very unique individuals at the genetic level.” He added, “It is clear, however, that we are still at the earliest stages of discovery about ourselves, and only with continued sequencing of more individual genomes will we be able to garner a full understanding of how our genes influence our lives.”
Within the human genome, there are different kinds of DNA variants. The most studied type is single nucleotide polymorphisms, or SNPs. These have long been thought to be the most prevalent and perhaps the most important type of variant implicated in human traits and disease susceptibility. However, in this analysis of Dr. Venter’s genome, the team found a surprising number of other important variants. A total of 4.1 million variants covering 12.3 million base pairs of DNA were uncovered with more than 1.2 million new variants discovered.
Of the 4.1 million variations between chromosome sets, 3.2 million were SNPs, while nearly one million were other kinds of variants, such as insertion/deletions (“indels”), copy number variants, block substitutions, and segmental duplications. While the SNPs outnumbered the non-SNP types of variants, the non-SNP variants involved a larger portion of the genome. This suggests that human-to-human variation is much greater than previously thought. The researchers suggest that much more research needs to be done on these non-SNP variants to better understand their role in individual genomics.
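The contrast between how many variants are SNPs and how much DNA the non-SNP variants span follows from the rounded figures quoted above. The sketch below works it through, assuming each SNP affects exactly one base pair (true by definition of a single nucleotide polymorphism) and that the 12.3 million base pairs include the SNP positions. The results are only as precise as the rounded inputs.

```python
# Rounded figures quoted in the article.
total_variants = 4_100_000     # all variants between chromosome sets
snp_variants = 3_200_000       # single nucleotide polymorphisms
variant_bases = 12_300_000     # base pairs covered by all variants

# Everything that is not a SNP: indels, copy number variants, etc.
non_snp_variants = total_variants - snp_variants

# Assumption: one base pair per SNP, the rest belongs to non-SNP variants.
snp_bases = snp_variants
non_snp_bases = variant_bases - snp_bases

share_of_events = non_snp_variants / total_variants
share_of_bases = non_snp_bases / variant_bases
mean_non_snp_length = non_snp_bases / non_snp_variants

print(f"non-SNP share of variant events: {share_of_events:.0%}")
print(f"non-SNP share of variant bases:  {share_of_bases:.0%}")
print(f"mean non-SNP variant length:     {mean_non_snp_length:.1f} bp")
```

On these numbers, non-SNP variants are a bit over a fifth of the events but account for about three quarters of the variant base pairs.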
According to Sam Levy, Ph.D., lead author and senior scientist at JCVI, “The ability to use unbiased, high throughput sequencing methods, coupled with advance computational analytic methods, enables us to characterize more comprehensively the wide variety of individual genetic variation. This offers us an unprecedented opportunity to study the prevalence and impact of these DNA variants on traits and diseases in human populations.”
Another important benefit of having an individual, diploid genome is the ability to build better and more informed haplotype assemblies. Haplotypes are groups of linked variants. Through the government-sponsored HapMap project, many common haplotypes have been identified; however, these are based on averages of large ethnogeographic populations rather than individuals. Having individual haplotypes would enable researchers to find and understand rarer or individual variants that would explain and help predict diseases in that particular person: a truly personalized, individualized genomics paradigm. In the HuRef analysis, the team used the 4.1 million variant set and new algorithms to build haplotype assemblies that, when compared to the HapMap project, represented longer and more complete linkages. The JCVI researchers expect these assemblies to improve significantly as additional sequence coverage is added to HuRef using a variety of new sequencing technologies.
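To make the idea of a haplotype assembly concrete, here is a toy sketch of how reads that span more than one variant site can link alleles into blocks. It is not the JCVI team's algorithm, which the article does not describe. The greedy majority-vote approach, the function name `phase`, and the site positions and reads are all hypothetical. Alleles at each heterozygous site are coded 0 or 1.

```python
from collections import Counter

def phase(reads, sites):
    """Greedy toy phasing (illustrative only).

    reads: list of dicts mapping site position -> observed allele (0 or 1)
    sites: heterozygous site positions in genomic order
    Returns a list of blocks; each block maps site -> allele on one
    haplotype (the other haplotype carries the opposite allele).
    """
    blocks = []
    current = {}
    for site in sites:
        votes = Counter()
        for read in reads:
            if site not in read:
                continue
            # Use any already-phased site on the same read as a link.
            for prev, hap_allele in current.items():
                if prev in read:
                    same_haplotype = read[prev] == hap_allele
                    allele = read[site] if same_haplotype else 1 - read[site]
                    votes[allele] += 1
        if votes:
            current[site] = votes.most_common(1)[0][0]
        else:
            # No read links this site to the block: start a new block.
            if current:
                blocks.append(current)
            current = {site: 0}  # orientation of a new block is arbitrary
    if current:
        blocks.append(current)
    return blocks

# Hypothetical reads covering five heterozygous sites.
reads = [
    {100: 0, 200: 1},
    {200: 0, 300: 0},
    {100: 1, 200: 0},
    {500: 1, 600: 1},
]
blocks = phase(reads, [100, 200, 300, 500, 600])
print(blocks)
```

In this example no read connects site 300 to site 500, so the result is two separate blocks. More sequence coverage, or longer reads, would add links across such gaps, which is why additional coverage is expected to lengthen the haplotype blocks.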
Long-range haplotype linkages will enable much more complete analysis of human variation and the genetic association with complex human traits, behaviors, and diseases. In the near future, the scientists believe that it will be possible to know from which parent various traits were inherited. Already in this analysis, the JCVI team has found more than 300 disease genes and 4,000 genes overall that exhibit different protein forms. This will be an important area for further study and analysis to determine how these altered proteins affect Dr. Venter’s health status.
Citation: Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, et al. (2007) The diploid genome sequence of an individual human. PLoS Biol 5(10): e254. doi:10.1371/journal.pbio.0050254.