    How many genes do you REALLY have?
    By Michael White | May 6th 2008 04:35 PM

    Welcome to Adaptive Complexity, where I write about genomics, systems biology, evolution, and the connection between science and literature,

    ...

    You've probably heard widely varying estimates for the number of protein-coding genes in the human genome. Back before the genome sequence came out, many scientists guessed that the number was around 100,000. When scientists first looked at the newly completed human genome sequence in 2001, they found about 27,000 genes, and ever since then I have seen estimates ranging from 20,000 to 30,000. The lowest estimate I've ever seen is on the first page of Carl Zimmer's new book, Microcosm, where he states that we have 18,000 protein-coding genes.

    So what is the answer? The current gold standard is a highly curated set of genes called the RefSeq genes, which at this moment includes 26,322 genes (you can check this at the UCSC Genome Browser). That is probably a bit of an overestimate, since it's generally easier to get a gene into a database like this than it is to remove one. A better way to get a good number is to ask someone who looks for genes in genomes. According to one of the best recent estimates, we have about 20,500 genes.

    Why does this number vary so much? Before we had the human genome sequence, people were just guessing, and thus you got estimates of 100,000 protein-coding genes. With DNA sequence in hand, we can systematically search for genes. But genes are broken into pieces called exons, and it can sometimes be hard to tell which set of exons makes up one gene. It's also hard to weed out non-functional pseudogenes, which no longer produce any protein.

    The estimates are settling down, though, so don't expect as much variation in the future. Zimmer's 18,000 is surely too low, but it's not as far off as other estimates that are casually tossed out in scientific papers. Unfortunately, just as we're settling on a number of protein-coding genes, we're finding many new non-protein-coding genes, which means a new debate is just getting started.