My academic adventures have exposed me to a fairly large amount of mathematics.  While every week or three I seem to end up wishing I had taken even more coursework in mathematics, computer science, and statistics, I've nevertheless been pretty grateful for this sort of background as I've become involved with plant genomics research.

Interacting with the bioinformaticians in our lab has given me an appreciation for something that should have occurred to me long ago regarding the way biologists typically analyze microarray data.  It seems like such a little thing, but its effects can potentially ripple through the statistical analysis of any large set of data.  It has slowly dawned on me that most biologists have been inappropriately using the assumption that the tests in a set of microarray data are independent.  This is not a big secret, and statisticians have been working on solutions for a while now.  I'll get to a couple of innovations that people have come up with to correct this fundamental error in a little while, but first I'll explain what I mean when I say that statistical tests in a microarray data set are not independent.

Whaddya Mean, "Not Independent"?

A difference test in statistics -- say, a t-test or an ANOVA -- compares some dependent variable across the levels of one or more independent variables.  In a microarray experiment, the dependent variables being measured are the expression levels of various genes; the independent variable is the experimental treatment.  For example, comparing well-watered plants with drought-stressed plants would involve two levels of one independent variable.  Experimental designs can be notoriously complex, with multiple levels within each of several independent variables, plus interactions between those variables.  Traditional statistical methods deal with this complexity fairly well.  Independent variables are not what I am concerned with here, though.
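To make the one-gene case concrete, here is a minimal sketch of a two-sample comparison for a single gene.  The expression values are made up for illustration, and Welch's t statistic is just one reasonable choice of difference test; turning it into a p-value would additionally need a t distribution, which the Python standard library does not provide.

```python
import statistics

# Hypothetical expression values for a single gene, four plants per treatment.
well_watered = [5.1, 4.9, 5.0, 5.2]
drought = [7.9, 8.2, 8.0, 8.3]

def welch_t(a, b):
    """Welch's t statistic: difference in means divided by its standard error."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.fmean(b) - statistics.fmean(a)) / se

t = welch_t(well_watered, drought)
print(f"t = {t:.2f}")
```

Here the treatment (well-watered versus drought) is the independent variable and the gene's expression is the dependent variable.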

The trouble comes in for microarray experiments when we look at the expression of not just one gene, but of the 40,000 or more genes monitored on a single array.  Typically, for every set of independent variables, we are investigating the effects on 40,000 or more dependent variables that we assume, for the sake of simplicity, to be independent of each other.  This allows us to run our desired test repeatedly, once per gene, just as we would if we were examining only a single dependent variable such as plant growth.  Yet these 40,000 genes do not act independently of each other.  We know of genes, such as transcription factors, that are involved in regulating the expression of other genes.  We also know that all of the genes in a sample came from the same genome (or population of genomes), and are collectively working together to maintain homeostasis in some tissue or organism.  Not only is it impossible to establish the independence of the statistical tests in a microarray data set, we know that the exact opposite is true.  No matter how stringent we make our multiple testing correction (e.g., controlling the false discovery rate, or FDR), the assumption underlying our tests is incorrect.  This is about as wrong as assuming our data are normally distributed when we know for a fact they are bimodal.  Can we really trust the results of the tests if the assumptions don't fit?  Probably not.
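A small simulation can illustrate why this matters.  The sketch below is a toy under stated assumptions, not an analysis of real array data: every gene is null (no treatment effect), each gene has unit variance, a z-test stands in for the t-test, and co-regulation is crudely modeled as one per-sample factor shared by all genes.  The function name and the parameter rho are hypothetical.

```python
import random
import statistics

def false_positive_counts(rho, n_experiments=200, n_genes=200, n_reps=5,
                          alpha=0.05, seed=0):
    """Count null genes called significant in each simulated experiment.

    rho is the share of each gene's variance that comes from a per-sample
    factor common to all genes (0.0 means the genes are independent).
    """
    rng = random.Random(seed)
    std_normal = statistics.NormalDist()
    # Standard error of a difference of two means, each over n_reps
    # unit-variance values.
    se = (2.0 / n_reps) ** 0.5
    shared_w, own_w = rho ** 0.5, (1.0 - rho) ** 0.5
    counts = []
    for _ in range(n_experiments):
        # One shared factor per sample (array); no treatment effect anywhere.
        f_a = [rng.gauss(0, 1) for _ in range(n_reps)]
        f_b = [rng.gauss(0, 1) for _ in range(n_reps)]
        hits = 0
        for _ in range(n_genes):
            a = [shared_w * f + own_w * rng.gauss(0, 1) for f in f_a]
            b = [shared_w * f + own_w * rng.gauss(0, 1) for f in f_b]
            z = (statistics.fmean(a) - statistics.fmean(b)) / se
            p = 2.0 * (1.0 - std_normal.cdf(abs(z)))
            if p < alpha:
                hits += 1
        counts.append(hits)
    return counts

independent = false_positive_counts(rho=0.0)
correlated = false_positive_counts(rho=0.8)
for label, counts in (("independent", independent), ("correlated", correlated)):
    print(label, "mean:", statistics.fmean(counts),
          "variance:", statistics.pvariance(counts))
```

In both settings each individual test has the same nominal 5% false positive rate, but with a shared factor the number of false positives per experiment swings far more widely: many experiments with almost none, and a few with a great many.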

Possible Solutions

I'm aware of one approach that can deal fairly well with this kind of situation, as well as one that I suspect avoids the issue entirely.  I'm sure a good statistician could come up with others.  The first approach involves data reduction (1).  In this approach, we run tests to figure out which particular genes are dependent upon each other's expression (i.e., which genes are coregulated).  One need not determine the direction of causation here, merely the existence of correlation.  Correlated genes can be lumped together and considered as a single "dummy gene" for the sake of the statistical test.  This approach is nice because it can reduce the knowable dependencies in the data.  However, there is the theoretical possibility of conflating two different sets of genes that just happen to be correlated with each other without actually having a biological relationship.  Situations also exist where genes are indirectly or nonlinearly regulated by other genes; such genes could escape detection using this method but would still not be truly statistically independent.
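The lumping idea can be sketched in a few lines.  This is not the procedure from reference (1); it is a deliberately simple greedy grouping by Pearson correlation, and the gene names, expression values, and 0.9 threshold are all hypothetical.  Only positive correlations are grouped here, since averaging anticorrelated profiles would cancel them out.

```python
import statistics

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def group_correlated(expr, threshold=0.9):
    """Greedy grouping: a gene joins the first group whose first member it
    correlates with at or above the threshold, else it starts a new group."""
    groups = []
    for gene, profile in expr.items():
        for group in groups:
            if pearson(profile, expr[group[0]]) >= threshold:
                group.append(gene)
                break
        else:
            groups.append([gene])
    return groups

def dummy_genes(expr, groups):
    """Average each group's profiles, sample by sample, into one dummy gene."""
    return {
        "+".join(group): [statistics.fmean(col)
                          for col in zip(*(expr[g] for g in group))]
        for group in groups
    }

# Hypothetical expression profiles across six samples.
expr = {
    "tf1":     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "target1": [1.1, 2.1, 2.9, 4.2, 5.1, 5.9],
    "target2": [2.0, 4.1, 6.0, 8.2, 9.9, 12.1],
    "other":   [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}
groups = group_correlated(expr)
reduced = dummy_genes(expr, groups)
print(groups)
```

The per-gene test would then be run once per dummy gene rather than once per original gene.  Note that a linear correlation like this one would miss exactly the indirect or nonlinear relationships mentioned above.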

The second approach involves Bayesian statistics (2).  Basically, it starts with a random statistical model to explain the microarray data, makes a random change to that model, and compares the two models.  Whichever model better explains the microarray data in each round is kept for the next round, ultimately generating a model that explains the overall dataset as well as possible given some set number of iterations.  I was introduced to the method last summer by one of its co-developers, and it seems to deal quite effectively with a number of challenges that face other methods of analyzing microarray data.  Nothing comes for free, of course, and particular sorts of experimental designs are necessary for this method to be useful.  Admittedly, Bayesian statistics is one of those topics that has me sighing and wishing I'd had the foresight to pack it into my coursework at some point.  I'm open to peanut-gallery opinions as to whether this method actually does avoid the independence issue in the way I have the fuzzy impression it does.
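The propose-compare-keep loop described above can be sketched as follows.  To be clear about the assumptions: this is plain stochastic hill climbing on a sum-of-squares score, with no priors and no posterior sampling, so it is not the published method in reference (2) and is only meant to show the shape of the iteration.  The model (one expression level per treatment), the data values, and the function names are hypothetical.

```python
import random

def score(model, data):
    """Negative sum of squared errors; higher means the model fits better."""
    return -sum((obs - model[treatment]) ** 2
                for treatment, values in data.items()
                for obs in values)

def fit(data, n_iter=5000, step=0.1, seed=0):
    """Start from a random model, propose random changes, keep improvements."""
    rng = random.Random(seed)
    model = {treatment: rng.uniform(0, 10) for treatment in data}
    best = score(model, data)
    treatments = list(data)
    for _ in range(n_iter):
        # Randomly nudge the level for one treatment.
        candidate = dict(model)
        candidate[rng.choice(treatments)] += rng.gauss(0, step)
        candidate_score = score(candidate, data)
        # Keep whichever model explains the data better.
        if candidate_score > best:
            model, best = candidate, candidate_score
    return model

# Hypothetical expression measurements for one gene under two treatments.
data = {"watered": [5.1, 4.9, 5.0], "drought": [7.9, 8.2, 8.0]}
model = fit(data)
print(model)
```

With this score the loop simply drifts toward the per-treatment means; a genuinely Bayesian analysis would instead weigh candidate models by their posterior probability.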


(1) Qin et al. 2008.  Bioinformatics 24(14): 1583-1589.
(2) Townsend and Hartl 2002.  Genome Biology 3: research0071.1-0071.16.