Advanced Analysis

Once you have a list of significant genes, the direction your analysis takes will vary depending on the questions you're trying to answer. Here we present several common procedures. By themselves, these procedures cannot uncover biological truths about your system - all they can do is suggest new avenues of exploration.

By this point in the process, the biggest problem facing analysts is the sheer volume of the data. Normalization and significance analysis have combined to remove most of the noise and to reduce the tens of thousands of genes you started with to a more manageable list. This list is most likely still fairly large, though, and it would take years to exhaustively research each of the genes you've flagged as "significant". A major aim at this step is therefore to reduce the complexity of your data. There are three major methods for this: Unsupervised Learning (Clustering), Supervised Learning (Discrimination), and Good Old-Fashioned Detective Work.

Unsupervised Learning

Unsupervised Learning is the process of reducing the complexity of a gene list using nothing more than the expression values obtained across your samples. This is what makes it unsupervised - the algorithms have no knowledge of biological systems or even phenotypes. Unsupervised learning usually means some form of clustering, but it also includes other algorithms such as Principal Component Analysis.

The goal of clustering is to produce groups (or clusters) of genes that behave similarly across a set of experiments. It is most common to cluster genes, although clustering samples, or both genes and samples, is also possible. Like any tool, clustering has strengths and weaknesses. It is useful for discovering relationships in data - if several genes behave similarly across samples, might they share a common pathway? It can be useful for predictive purposes - perhaps to find a set of ten to twenty genes that indicate a predisposition to cancer? It does not, however, reveal hidden truths in the data. Clustering has a number of parameters the analyst must tweak, and as such is susceptible to bias. Finally, clustering will always produce an answer - no matter how uniform the data, it will always resolve into clusters. All of this simply means that clustering is a useful tool, but its results should not be considered in any way conclusive.

A heatmap of hierarchically clustered genes. The rows are genes, and the columns are samples. Notice the dendrogram on the left side that defines how the genes cluster.

The most common representation of clusters is a heat map. A heat map is a graphical representation of microarray data in grid form, with columns representing samples and rows representing genes. The intersection of a gene and a sample is colored according to the expression value - usually red indicates high expression, green indicates low expression, and the intensity of the color indicates how high or low it is. The genes are ordered according to the results of the clustering algorithm, so that the resulting image (hopefully) clearly shows the similarity of expression for each cluster across the samples.

The simplest form of clustering is hierarchical clustering. With this technique, each gene starts out in its own cluster. The two nearest clusters are merged, the merge is recorded, and the process repeats. Eventually, all of the genes wind up in one large cluster - which isn't terribly useful. However, we can use the knowledge of when clusters were merged to get a relative idea of how "similar" two genes are. Typically, the series of merges is drawn as a tree-like structure called a dendrogram. This gives an implicit ordering to the genes, and is usually drawn alongside the heat map, next to the axis that holds the genes.
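The merge procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the choice of Euclidean distance and average linkage is an assumption (the text only says "nearby"), the function name hierarchical_cluster is hypothetical, and the expression values are made up.

```python
import numpy as np

def hierarchical_cluster(expr):
    """Naive agglomerative clustering of the rows (genes) of expr.

    Assumes Euclidean distance and average linkage. Returns the merges in
    order as (cluster_a, cluster_b, distance) tuples, where each cluster is
    a tuple of row indices.
    """
    expr = np.asarray(expr, dtype=float)
    # Pairwise Euclidean distances between gene expression profiles.
    diff = expr[:, None, :] - expr[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    clusters = [(i,) for i in range(len(expr))]  # every gene starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))  # track the merge
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical data: 4 genes (rows) measured across 3 samples (columns).
expr = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [3.0, 2.0, 1.0],
    [2.9, 2.2, 1.1],
])
merges = hierarchical_cluster(expr)
for cluster_a, cluster_b, distance in merges:
    print(cluster_a, cluster_b, round(distance, 3))
```

The recorded merge distances are what a dendrogram draws as branch heights: genes that merge early, at a small distance, are the most similar. In practice an established library routine would normally be used instead of this quadratic-per-step loop.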

The other common types of clustering are k-means clustering and the closely related self-organizing map (SOM). With these schemes, the analyst chooses the number of clusters to create, k. Initially, each gene is randomly assigned to one of the k clusters. The software then moves each gene into the cluster that is closest to it and recalculates the position of each cluster, repeating until the clusters are stable.
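As a rough sketch of the k-means loop just described (random assignment, then alternating between recalculating cluster positions and moving genes), consider the following. The function name kmeans, the use of Euclidean distance, the fixed random seed, and the example values are all assumptions made for illustration.

```python
import numpy as np

def kmeans(expr, k, max_iter=100, seed=0):
    """Cluster the rows (genes) of expr into k groups.

    Returns (labels, centroids). Assumes Euclidean distance and n >= k.
    """
    expr = np.asarray(expr, dtype=float)
    n = len(expr)
    rng = np.random.default_rng(seed)
    # Random initial assignment. Cycling through 0..k-1 before shuffling
    # guarantees that no cluster starts out empty.
    labels = rng.permutation(np.arange(n) % k)
    centroids = np.zeros((k, expr.shape[1]))
    for _ in range(max_iter):
        # Recalculate each cluster's position as the mean of its genes.
        for c in range(k):
            members = expr[labels == c]
            if len(members) > 0:  # an emptied cluster keeps its old position
                centroids[c] = members.mean(axis=0)
        # Move every gene to the cluster whose position is closest.
        dist = np.linalg.norm(expr[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels, centroids

# Hypothetical data: 6 genes x 3 samples, forming two obvious groups.
expr = np.array([
    [0.0, 0.1, 0.2],
    [0.1, 0.0, 0.1],
    [0.2, 0.1, 0.0],
    [5.0, 5.1, 4.9],
    [5.1, 4.9, 5.0],
    [4.9, 5.0, 5.1],
])
labels, centroids = kmeans(expr, k=2)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the random start, so only the grouping itself is meaningful. On less cleanly separated data, different seeds can also give different groupings, which is one reason the choice of k and the starting point are sources of analyst bias.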

The final type of unsupervised learning is principal component analysis (PCA). PCA is not really a form of clustering, in that it doesn't actually assign genes to different groups. What it does is allow an analyst to more easily find relationships and groups in complex data. Essentially, PCA attempts to reduce the dimensionality of data. Think about looking at a cloud from multiple angles. Even though it is the same cloud, what you see can change quite drastically as you move around it. PCA finds the "most interesting" view of your data in either 2 or 3 dimensions. By projecting into lower dimensions, it is easier to manually spot clusters of similar genes.
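One common way to compute such a projection is through the singular value decomposition of the centered data. The sketch below assumes that approach; the function name pca_project and the randomly generated expression matrix are hypothetical and stand in for real data.

```python
import numpy as np

def pca_project(expr, n_components=2):
    """Project the rows (genes) of expr onto the leading principal components.

    Returns (scores, explained), where explained holds the fraction of the
    total variance captured by each retained component.
    """
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=0)  # center each sample (column)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    explained = (s ** 2) / (s ** 2).sum()
    return scores, explained[:n_components]

# Hypothetical data: 20 genes x 6 samples. The first 10 genes are shifted
# upward in the first 3 samples so that two groups exist to be found.
rng = np.random.default_rng(0)
expr = rng.normal(size=(20, 6))
expr[:10, :3] += 4.0

scores, explained = pca_project(expr, n_components=2)
print(scores.shape)
print(explained)
```

Plotting the two columns of scores against each other gives the 2-dimensional "view" described above. The explained fractions indicate how much of the variation that view retains.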

Supervised Learning

Supervised Learning uses expression data from microarrays, as well as clinical or phenotypic data about the samples, to automatically discriminate between samples. The most common approach is to use a machine learning construct like a Support Vector Machine, Neural Network, or Genetic Algorithm to train on a set of defined data. Once the algorithm has been properly trained, it can discriminate samples based solely on expression profile - for instance, it may be able to identify the type of leukemia a patient has. For a more thorough overview of supervised learning, consult [Statnikov 2004].
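To make the train-then-discriminate idea concrete, here is a deliberately simple sketch. It uses a nearest-centroid classifier, which is a much simpler stand-in for the Support Vector Machines and Neural Networks mentioned above, not an implementation of them. The function names, the class labels "ALL" and "AML", and all expression values are hypothetical.

```python
import numpy as np

def train_centroids(profiles, labels):
    """Learn one mean expression profile (centroid) per class.

    profiles: samples x genes matrix; labels: one class label per sample.
    """
    profiles = np.asarray(profiles, dtype=float)
    classes = sorted(set(labels))
    labels = np.asarray(labels)
    return {c: profiles[labels == c].mean(axis=0) for c in classes}

def predict(centroids, profile):
    """Assign a new sample to the class with the nearest centroid."""
    profile = np.asarray(profile, dtype=float)
    return min(centroids, key=lambda c: np.linalg.norm(profile - centroids[c]))

# Hypothetical training set: 4 samples x 3 genes with known phenotypes.
train = [
    [8.0, 2.0, 5.0],
    [7.5, 2.5, 5.2],
    [2.0, 8.0, 5.1],
    [2.4, 7.6, 4.9],
]
train_labels = ["ALL", "ALL", "AML", "AML"]

centroids = train_centroids(train, train_labels)
prediction = predict(centroids, [7.0, 3.0, 5.0])  # a new, unlabeled sample
print(prediction)
```

Whatever classifier is used, its accuracy should be estimated on samples that were held out of training; performance on the training set itself is not a reliable guide.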

Good Old-Fashioned Detective Work

Finally, you can research what is known about your significant genes. It's not uncommon to find an interesting pathway or ontology that is well represented among your gene set. What defines "interesting", as well as what you do with the data, is entirely project-dependent.

A fairly simple exercise is to classify each gene in your list according to the pathways it belongs to, and then compare that list of pathways with a reference list for your organism. If 10% of your significant genes belong to a specific pathway, but only 2% of the genes overall do, it may be a sign that the pathway is involved in the process you are studying. A similar analysis can be carried out for ontologies.
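Whether a difference such as 10% versus 2% is more than chance depends on how many genes are involved. One common way to quantify this, assumed here rather than prescribed by the text above, is a one-sided hypergeometric test. The function name enrichment_p_value and the gene counts below are hypothetical, chosen to match the 10% and 2% figures.

```python
from math import comb

def enrichment_p_value(n_genome, n_pathway, n_list, n_overlap):
    """One-sided hypergeometric p-value for pathway over-representation.

    Probability of seeing at least n_overlap pathway genes when n_list genes
    are drawn at random from a genome of n_genome genes, of which n_pathway
    belong to the pathway.
    """
    total = comb(n_genome, n_list)
    upper = min(n_pathway, n_list)
    tail = sum(
        comb(n_pathway, i) * comb(n_genome - n_pathway, n_list - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 10,000 genes in the organism, 200 of them (2%) in the
# pathway; 100 significant genes, 10 of them (10%) in the pathway.
p = enrichment_p_value(n_genome=10_000, n_pathway=200, n_list=100, n_overlap=10)
print(p)
```

A small p-value suggests the pathway is over-represented in the list. When many pathways are tested at once, the p-values need a multiple-testing correction before any of them is interpreted.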