The term "genome" is oft-heard but seldom defined, and indeed has more than one meaning. Little wonder, then, that discussions about genome sequences and comparisons thereof can leave otherwise interested audiences more frustrated than enlightened. "What is a genome?" and "whose genome was sequenced?" are legitimate questions, and what follows is an attempt at clarification that is, by necessity, as much philosophical as scientific.

Definition #1: In a broad sense, a genome can be considered as the collective set of genes, non-coding DNA sequences, and all their variants that are located within the chromosomes of members of a given species. This definition does not consider variation among individuals within a species, and instead relates to distinctions between species. It is possible to apply such a definition because, for the most part, animal species do not share DNA extensively and hence their respective gene pools remain distinct (in fact, this forms the basis for defining species under some views). Thus, even though humans and chimpanzees are about 98% identical in terms of their DNA sequences, there is still such a thing as a "human genome" and a "chimpanzee genome" rather than a continuum with humans and chimps at two mildly divergent extremes. This is even true of far closer (but now extinct) relatives of humans such as Neanderthals; on average, the sections of Neanderthal DNA that have been recovered and sequenced are 99.5% identical to those of humans -- but these, too, are considered to be part of a separate genome.

The genomic similarities described between species are usually based on comparing a few specific regions of DNA from a small number of representative individuals. If other factors are included in the comparison, such as insertions and deletions of DNA, then any two genomes will register a lower level of similarity -- say, 95% for chimpanzees and humans rather than 98%. And indeed, no one would ever mistake a chimpanzee genome for a human genome, in part because they differ in DNA amount and chromosome number (human chromosome 2 is a product of fusion of what remain as two separate chromosomes in other great apes).
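
To make the effect of counting insertions and deletions concrete, here is a minimal sketch in Python. It assumes two sequences that have already been aligned, with gaps written as "-"; the function name and the toy sequences are made up for illustration and are not real human or chimpanzee data. The same alignment gives a lower percent identity once gap columns are counted as differences.

```python
def percent_identity(seq_a: str, seq_b: str, count_gaps: bool) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Gaps are written as '-'. When count_gaps is False, columns containing a
    gap are skipped (substitutions only); when True, each gap column is
    counted as a difference.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(seq_a, seq_b):
        has_gap = a == "-" or b == "-"
        if has_gap and not count_gaps:
            continue  # ignore indel columns entirely
        compared += 1
        if a == b and not has_gap:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns in the alignment")
    return 100.0 * matches / compared


# Hypothetical toy alignment: one substitution and one two-base indel.
toy_a = "ACGTACGTAC--GTACGTACGT"
toy_b = "ACGTACGTACTTGTACGAACGT"

subs_only = percent_identity(toy_a, toy_b, count_gaps=False)
with_indels = percent_identity(toy_a, toy_b, count_gaps=True)
print(f"substitutions only: {subs_only:.1f}%")
print(f"counting indels:    {with_indels:.1f}%")
```

Real genome comparisons involve far more than this (alignment itself, rearrangements, duplicated regions), so the sketch only illustrates why the choice of what to count changes the headline percentage.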

Of course, individuals within species are not genetically identical to one another (monozygotic twins notwithstanding), which leads to definition #2.

Definition #2: Because the DNA sequences of even close family members are not identical, it can also be said that each individual carries a unique genome consisting of the DNA in his or her chromosomes. In this case, the focus is entirely on one species and the important factor is the variability that exists among individuals of that species. In terms of DNA sequences on a large scale, members of the same species are extremely similar: overall, any two human beings are probably about 99.9% the same genetically. Nevertheless, complete genome sequencing, though conducted primarily under definition #1, has revealed two major sources of variation among individuals. The first consists of single nucleotide polymorphisms (SNPs, "snips"), which are as their name implies: differences at the level of single base pairs that are present in at least 1% of the population. It is estimated that there are some 10 million SNPs in the human genome (definition #1), with one occurring about every 100-300 base pairs along the more than 3 billion base pair sequence. The second major source of variation, first described in 2006, consists of copy number variants (CNVs). These involve differences among individuals in the insertion and deletion of larger DNA segments. CNVs have proved to be far more common than anyone would have imagined, and can result in differences not just in sequences but in the sizes of genomes among individuals (up to 20 million base pairs in humans, or about 0.5%).
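
As a rough check on the scale of these numbers, the arithmetic can be sketched in a few lines of Python. The genome size is rounded down to 3 billion base pairs as an assumption, and the function names are hypothetical; the output is order-of-magnitude only, not an independent estimate.

```python
# Assumed round figure for "more than 3 billion" base pairs.
GENOME_SIZE_BP = 3_000_000_000


def sites_at_spacing(genome_size_bp: int, spacing_bp: int) -> int:
    """Number of sites if one occurs every `spacing_bp` base pairs."""
    return genome_size_bp // spacing_bp


def expected_differences(genome_size_bp: int, percent_identity: float) -> int:
    """Differing base pairs between two genomes at a given percent identity."""
    return round(genome_size_bp * (100.0 - percent_identity) / 100.0)


# One variable site every 100 to 300 base pairs, species-wide.
dense = sites_at_spacing(GENOME_SIZE_BP, 100)
sparse = sites_at_spacing(GENOME_SIZE_BP, 300)

# Two individuals who are 99.9% identical at the sequence level.
pairwise = expected_differences(GENOME_SIZE_BP, 99.9)

print(f"sites at 1 per 100 bp: {dense:,}")
print(f"sites at 1 per 300 bp: {sparse:,}")
print(f"differences between two people at 99.9% identity: {pairwise:,}")
```

The two results measure different things: the spacing figure counts positions that vary somewhere in the population (definition #1), while the 99.9% figure counts positions at which two particular individuals differ (definition #2).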

The human genome, definition #1

Two independent research groups reported draft sequences of the human genome in February 2001: the publicly-funded and internationally collaborative Human Genome Project and the private company headed by J. Craig Venter known as Celera Genomics. The interaction of these two initiatives – typically branded as competitive, but also mutually informative – has been discussed many times. The question of how and why the two groups sequenced human DNA is not the subject of interest here – the question at present is whose DNA they analyzed.

The Human Genome Project, being a public effort, had an official policy of releasing all sequence data to public databases within 24 hours of completion, thereby making the information freely available to anyone who carried a copy of the "human genome" in their cells. In keeping with this outlook, the HGP implemented procedures intended to circumvent the focus on individuality and to keep their results in line with definition #1. Thus, they instituted a policy of voluntary donation by dozens of men and women from various ethnic backgrounds, provided samples with random numeric labels, shipped the samples to processing laboratories where they were re-labeled with new randomized codes, destroyed all records of previous labels, and then selected randomly from among the samples. Five to ten samples were collected for every one that was actually assayed, with the source of samples used unknown to both researchers and donors. In other words, the intent was to focus on definition #1 as much as possible and to provide a mixture, or at least a mystery, when it came to the source of the genome sequence of Homo sapiens.

Human nature being what it is, it is likely that most people would find this answer disappointing. Deep down, we want to know whose genome it was. The only information that has been available in this regard is that the largest portion of the source DNA came from a male donor in Buffalo, New York, code named “RPCI-11” (for Roswell Park Cancer Institute, where the genomic library was generated). No name, no other information, and yet somehow it seems satisfying to know that there really is an individual human – a real person in a specific part of the world who walked into a lab, stuck out his arm, and donated his blood – corresponding to all those A’s, T’s, G’s, and C’s.

The situation at Celera was quite different in terms of both data sharing and DNA sampling policies. Celera’s data were not made publicly available during the course of sequencing, and their sampling involved 20 donors, five of whom were selected for analysis -- though evidently not entirely at random. In fact, it was later revealed that Celera’s president and lead investigator, J. Craig Venter, was the primary source of DNA for the sequence. Venter argued that revealing this fact would dispel the myth of a single "human genome" (i.e., an excessive emphasis on definition #1 that ignores the individual uniqueness inherent in definition #2). Others may have felt that sequencing his own genome made the resulting sequence the property of one individual rather than of humanity at large (i.e., adopting definition #2 exclusively at the expense of the broadly shared definition #1).

The human genome, definition #2

While the Human Genome Project and Celera's efforts generated single (partially composite) genome sequences, another major initiative is underway which focuses on variation among individuals at the genomic level; i.e., on definition #2. The International HapMap Project aims to identify associated collections of SNPs known as haplotypes, and currently includes samples from 270 people drawn from four major groups. Thirty sets of "trios" (two parents and a child) have come from the Yoruba people of Ibadan, Nigeria. Forty-five unrelated individuals from Tokyo and 45 from Beijing have provided samples. Thirty trios from residents of the United States with roots in western and northern Europe have also been included. SNP haplotypes may vary among populations and are important in the search for particular genes of medical significance. As with the DNA sequence information of the Human Genome Project, data from the HapMap Project are made freely available. A similar initiative to catalogue human diversity from the perspective of CNVs, the Copy Number Variation Project, has also been launched.
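
The sample count quoted above can be tallied directly. This is only a bookkeeping sketch: the dictionary labels are informal descriptions of the four groups named in the text, not official HapMap population codes.

```python
TRIO_SIZE = 3  # two parents and a child

# Individuals contributed by each of the four groups described above.
hapmap_samples = {
    "Yoruba, Ibadan (30 trios)": 30 * TRIO_SIZE,
    "Tokyo (unrelated individuals)": 45,
    "Beijing (unrelated individuals)": 45,
    "United States, European ancestry (30 trios)": 30 * TRIO_SIZE,
}

total = sum(hapmap_samples.values())
for group, count in hapmap_samples.items():
    print(f"{group}: {count}")
print(f"total: {total}")
```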

“Whose genome?” and individual identity

The question "whose genome was sequenced?" is predicated on concepts of individuality and personhood, which tend to be applied to the members of only a handful of species. Thus, one might be interested in which strain of fruit fly (Drosophila melanogaster), which population of sea urchins (Strongylocentrotus purpuratus), or which varieties of rice (Oryza sativa) had been sequenced, but it would not make sense to ask “who” the fly, sea urchin, or rice plant was. The situation gains complexity when dealing with vertebrate genomes because humans associate closely and emotionally with members of some species and not with others. The desire (or not) to know “who” was sequenced correlates directly with this. By way of example, consider the fact that a single male pufferfish (Takifugu rubripes), a single female chicken (red jungle fowl, Gallus gallus), two female brown rats (Rattus norvegicus), and a small number of female mice (Mus musculus) of the B6 strain have been sequenced, but that there has not been much interest in “who” these individuals were – nor would many people even think to ask the question.

Now consider "man's best friend". Not only is it known that the two reported canine genome sequences were from individual dogs, but it is known who those dogs were: Craig Venter’s poodle, Shadow, and a boxer named Tasha. It was also widely noted that samples for the chimpanzee genome were taken from a captive-born male named Clint who lived at the Yerkes National Primate Research Center in Atlanta, Georgia. Indeed, many a news story reported Clint’s untimely death in 2005 at the young age of 24. One might be tempted to argue that intelligence is the determining factor in this case – dogs and chimps are smart and have personalities, but pufferfish and rats do not. Perhaps. But surely the recently sequenced rhesus macaque (whatever her name was) should qualify under these criteria.

In the end, this post is not meant to be a statement about the apparent arbitrariness of our decisions to grant or deny individuality to members of other species. This is about genomes, and about how definition #1 is applied intuitively and automatically when dealing with a species like mouse or rat, whereas one cannot help but invoke definition #2 when dealing with a dog or human. The fact is that all of these species are composed of variable individuals, each with a unique genome under definition #2. Indeed, it is this variation that makes evolutionary divergence – and thus definition #1 – possible at all.

________

Originally posted on the old Genomicron 4/13/07.