Gene set enrichment
Gene set enrichment (also functional enrichment analysis) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with disease phenotypes. The method uses statistical approaches to identify significantly enriched or depleted groups of genes. Microarray and proteomics results often identify thousands of genes which are used for the analysis.[1]
Researchers performing high-throughput experiments that yield sets of genes (for example, genes that are differentially expressed under different conditions) often want to retrieve a functional profile of that gene set, in order to better understand the underlying biological processes. This can be done by comparing the input gene set to each of the bins (terms) in the Gene Ontology – a statistical test can be performed for each bin to see if it is enriched for the input genes. FunRich [2] can also be used for Gene Ontology enrichment analysis.
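The per-term test described above is often carried out as a one-sided hypergeometric test on the overlap between the input list and the genes annotated to a term. The sketch below illustrates that idea under this assumption; the function name and the toy counts are hypothetical and are not taken from any particular tool.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_term, n_input, n_overlap):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap).

    n_background: number of genes in the background (universe)
    n_term:       background genes annotated to the term (bin)
    n_input:      number of genes in the input list
    n_overlap:    input genes that are annotated to the term
    """
    total = comb(n_background, n_input)
    upper = min(n_term, n_input)
    # Sum the probabilities of seeing n_overlap or more annotated genes.
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_input - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 20,000 background genes, 200 in the term,
# 500 input genes, 15 of which fall in the term.
p = hypergeom_enrichment_p(20000, 200, 500, 15)
print(f"enrichment p-value: {p:.3g}")
```

Because one such test is run per term, the resulting p-values would normally be corrected for multiple testing before any term is reported as enriched.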
Background
While the completion of the Human Genome Project gave researchers an enormous amount of new information, it also left them with the problem of how to interpret and analyze that data. To find genes associated with diseases, researchers used DNA microarrays, which measure the level of gene expression in different cells. They would run microarrays covering thousands of genes and compare the results between two cell categories, e.g. normal cells versus cancerous cells.[3] However, gene-by-gene comparison is not sensitive enough to detect subtle differences in the expression of individual genes, and diseases typically involve entire groups of genes. Multiple genes are linked to a single biological pathway, so it is the additive change in expression within a gene set that leads to the difference in phenotype. Gene Set Enrichment Analysis focuses on changes of expression in groups of genes, and by doing so it addresses the problem of small, individually undetectable changes in the expression of single genes.[3]
Methods of Gene Set Enrichment Analysis
Gene Set Enrichment Analysis (GSEA) uses a priori gene sets that have been grouped together by their involvement in the same biological pathway, or by proximal location on a chromosome.[1] A database of these predefined sets can be found in the Molecular Signatures Database (MSigDB).[4] In GSEA, DNA microarrays, or now RNA-Seq, are still performed and compared between two cell categories, but instead of focusing on individual genes in a long ranked list, L, the focus is put on a gene set.[1] Researchers analyze whether the majority of genes in the set fall at the extremes of this list: the top and bottom of the list correspond to the largest differences in expression between the two cell types. If the gene set falls at either the top (over-expressed) or the bottom (under-expressed), it is thought to be related to the phenotypic differences.
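One common way to build the ranked list L is to score each gene by a signal-to-noise ratio between the two classes, which is the default ranking metric in the Broad Institute GSEA software. The text above does not name a metric, so treat this choice, the function names, and the toy expression values as assumptions in the following sketch.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """(mean_A - mean_B) / (sd_A + sd_B) for a single gene.

    This sketch does not guard against zero standard deviations.
    """
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))


def rank_genes(expression, labels, class_a, class_b):
    """Rank genes from most up in class_a to most up in class_b.

    expression: dict gene -> list of expression values, one per sample
    labels:     list of class labels aligned with the sample order
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == class_a]
        b = [v for v, lab in zip(values, labels) if lab == class_b]
        scores[gene] = signal_to_noise(a, b)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, scores


# Hypothetical toy data: three tumor and three normal samples.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "G1": [9, 10, 11, 1, 2, 3],   # higher in tumor
    "G2": [5, 6, 7, 5, 6, 7],     # unchanged
    "G3": [1, 2, 3, 9, 10, 11],   # higher in normal
}
ranked, scores = rank_genes(expression, labels, "tumor", "normal")
print(ranked)
```

Genes with strongly positive scores end up at the top of the list and genes with strongly negative scores at the bottom, which are the two extremes a gene set is checked against.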
In the method that is typically referred to as standard GSEA, there are three steps involved in the analytical process.[1] The details of the mathematical methods can be found in the appendix at www.pnas.org/content/102/43/15545.full#app-1, but the general steps are summarized below:
1. Calculate the Enrichment Score (ES), which represents the degree to which the genes in the set are over-represented at either the top or bottom of the list. This score is a Kolmogorov–Smirnov-like statistic.[1]
2. Estimate the statistical significance of the Enrichment Score. This calculation is done by a phenotypic-based permutation test in order to produce a null distribution for the ES.[1]
3. Adjust for multiple hypothesis testing for when a large number of gene sets are being analyzed at one time. The Enrichment Scores for each set are normalized and a false discovery rate is calculated.[1]
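The first two steps above can be sketched in a few lines. The enrichment score below is a weighted running sum that steps up at genes in the set and down at genes outside it. Note one deliberate simplification: standard GSEA permutes phenotype labels, whereas this sketch draws random gene sets of the same size, which is easier to show but ignores gene-gene correlation. Function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Signed maximum deviation from zero of the running sum.

    ranked_genes: gene names ordered from most up- to most down-regulated
    scores:       dict gene -> ranking statistic
    gene_set:     set of gene names to test
    weight:       exponent on |score|; 0 gives an unweighted KS-type statistic
    """
    hits = [g in gene_set for g in ranked_genes]
    n_miss = hits.count(False)
    hit_weights = [abs(scores[g]) ** weight if h else 0.0
                   for g, h in zip(ranked_genes, hits)]
    total = sum(hit_weights)
    if total == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")
    running = best = 0.0
    for w, h in zip(hit_weights, hits):
        running += w / total if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def gene_permutation_test(ranked_genes, scores, gene_set, n_perm=1000, seed=0):
    """Return (ES, normalized ES, nominal p) using random gene sets as the null."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set)
    size = sum(g in gene_set for g in ranked_genes)
    null = [
        enrichment_score(ranked_genes, scores, set(rng.sample(ranked_genes, size)))
        for _ in range(n_perm)
    ]
    # Compare only against null scores with the same sign as the observed ES.
    same_sign = [abs(e) for e in null if (e >= 0) == (observed >= 0)]
    if not same_sign:
        return observed, float("nan"), float("nan")
    p_value = sum(e >= abs(observed) for e in same_sign) / len(same_sign)
    nes = observed / mean(same_sign)  # normalize by the mean null magnitude
    return observed, nes, p_value


# Hypothetical toy list of 20 genes with scores 9.5, 8.5, ..., -9.5.
genes = [f"g{i}" for i in range(20)]
scores = {g: 9.5 - i for i, g in enumerate(genes)}
es, nes, p = gene_permutation_test(genes, scores, {"g0", "g1", "g2"}, n_perm=200)
print(es, nes, p)
```

The third step is not shown: with many gene sets, the normalized scores would be compared against the pooled null distribution to estimate a false discovery rate.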
Limitations and Proposed Alternatives to standard GSEA
SEA
When Gene Set Enrichment Analysis was first proposed in 2003 by Mootha et al.,[3] some immediate concerns were raised regarding its methodology. These criticisms led to the use of the correlation-weighted Kolmogorov–Smirnov statistic, the normalized ES, and the false discovery rate calculation, all of which now define standard GSEA.[5] However, GSEA has since also been criticized on the grounds that its null distribution is superfluous and too difficult to be worth calculating, and that its Kolmogorov–Smirnov-like statistic is not as sensitive as the original.[5] As an alternative, the method known as Singular Enrichment Analysis (SEA) was proposed. This method assumes gene independence and uses a t-test, which is simpler to calculate. However, it is thought that these assumptions are in fact too simplifying, and that gene correlation cannot be disregarded.[5]
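A test of this simpler kind can be sketched as a one-sample t statistic on the gene-level scores of the set members, treating the genes as independent. The text does not give the exact form of the test, so the formula, the function name, and the toy scores below are assumptions made purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def set_t_statistic(scores, gene_set):
    """One-sample t statistic (against zero) for the scores of a gene set.

    scores:   dict gene -> gene-level statistic (e.g. a log fold change)
    gene_set: iterable of gene names; genes missing from scores are ignored
    Returns (t, degrees_of_freedom). Assumes the genes are independent
    and that the scores in the set are not all identical.
    """
    values = [scores[g] for g in gene_set if g in scores]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two scored genes in the set")
    t = mean(values) / (stdev(values) / sqrt(n))
    return t, n - 1


# Hypothetical gene-level scores.
toy_scores = {"a": 1.0, "b": 2.0, "c": 3.0, "d": -4.0}
t, df = set_t_statistic(toy_scores, {"a", "b", "c"})
print(t, df)
```

Genes in the same pathway are often co-expressed, so a statistic built on independence can look more significant than the data support, which is the concern raised about this approach.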
SGSE
One other limitation to Gene Set Enrichment Analysis is that the results are very dependent on the algorithm that clusters the genes, and the number of clusters being tested.[6] Spectral Gene Set Enrichment (SGSE) is a proposed, unsupervised test. The method’s founders claim that it is a better way to find associations between MSigDB gene sets and microarray data. The general steps include:
1. Calculating the association between principal components and gene sets.[6]
2. Using the weighted Z-method to calculate association between the gene sets and the spectral structure of the data.[6]
Detailed methodology can be found in the original SGSE publication.[6]
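The weighted Z-method mentioned in step 2 combines several p-values by converting each to a standard normal deviate and taking a weighted sum. The sketch below shows only that combination step; the per-component p-values and the weights (imagined here as a variance share per principal component) are hypothetical inputs, and how SGSE derives them is not reproduced.

```python
from math import sqrt
from statistics import NormalDist


def weighted_z_combine(p_values, weights):
    """Combine one-sided p-values with the weighted Z-method.

    Each p-value must lie strictly between 0 and 1, because the inverse
    normal CDF is undefined at the endpoints.
    """
    norm = NormalDist()
    z_scores = [norm.inv_cdf(1.0 - p) for p in p_values]
    combined_z = (
        sum(w * z for w, z in zip(weights, z_scores))
        / sqrt(sum(w * w for w in weights))
    )
    return 1.0 - norm.cdf(combined_z)


# Hypothetical example: one gene set tested against three principal components.
pc_p_values = [0.01, 0.20, 0.60]
pc_weights = [0.5, 0.3, 0.2]  # assumed share of variance per component
combined_p = weighted_z_combine(pc_p_values, pc_weights)
print(f"combined p-value: {combined_p:.3g}")
```

Giving larger weights to components that explain more variance lets the dominant structure in the data count for more in the combined result.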
Tools for Performing GSEA
GSEA uses complicated statistics, so it requires a computer program to run the calculations. However, because GSEA has become standard practice in the last decade, there are many websites and downloadable programs that will provide the data sets and run the analysis.
PlantRegMap
PlantRegMap provides GO annotation for 165 plant species together with a GO enrichment analysis tool, available at http://plantregmap.cbi.pku.edu.cn/go.php.
MSigDB
The Molecular Signatures Database hosts an extensive collection of annotated gene sets that can be used with most GSEA Software.
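MSigDB collections are commonly distributed as GMT files: tab-separated text with one gene set per line, holding the set name, a description field, and then the member genes. That format is not described in the text above, so treat it, along with the function name and the made-up set names below, as assumptions in this sketch.

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into a dict of set name -> set of genes."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and lines without any genes
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


# Hypothetical GMT content; in practice these lines would come from a file.
example_lines = [
    "HYPOTHETICAL_SET_A\thttp://example.org/a\tTP53\tBRCA1\tMYC\n",
    "HYPOTHETICAL_SET_B\thttp://example.org/b\tEGFR\tKRAS\n",
    "\n",
]
gene_sets = parse_gmt(example_lines)
print(sorted(gene_sets))
```

A dictionary of this shape can be passed set by set to an enrichment routine, with each set's genes matched against the ranked list from the experiment.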
Broad Institute
The Broad Institute website works in cooperation with MSigDB and offers downloadable GSEA software, as well as a general tutorial for those new to this analytical technique, which can be found at http://software.broadinstitute.org/gsea/doc/desktop_tutorial.jsp
DAVID
The Database for Annotation, Visualization and Integrated Discovery (DAVID) is a bioinformatics tool that pools information from most major bioinformatics sources, with the aim of analyzing large gene lists in a high-throughput manner.[7] DAVID goes beyond standard GSEA with additional functions such as switching between gene and protein identifiers on a genome-wide scale.[7] However, the annotations used by DAVID were not updated between January 2010 and October 2016,[8] which can have a considerable impact on the practical interpretation of results obtained in that period. In October 2016, DAVID Knowledgebase version 6.8 was released as a complete rebuild. The DAVID protocol can be found at http://www.nature.com/nprot/journal/v4/n1/full/nprot.2008.211.html.
AmiGO 2
The Gene Ontology Consortium has also developed their own online GO Term Enrichment tool allowing species-specific enrichment analysis versus the complete database, coarser-grained GO slims, or custom references.[9] GO Term Enrichment and documentation can be found at http://amigo.geneontology.org/amigo
FunRich
The Functional Enrichment Analysis (FunRich) tool is mainly used for the functional enrichment and network analysis of OMICS data. The new version of FunRich [2] software can be downloaded at http://funrich.org/download
Applications and Results of GSEA
GSEA and Genome-wide Association Studies
Single-nucleotide polymorphisms, or SNPs, are single-base mutations that may be associated with diseases. One base change has the potential to affect the protein that results from that gene being expressed; however, it may also have no effect at all. Genome-wide association studies compare healthy and disease genotypes to find SNPs that are overrepresented in the disease genomes and might be associated with that condition. Before GSEA, the accuracy of genome-wide SNP association studies was severely limited by a high number of false positives.[10] The GSEA-SNP method is based on the theory that the SNPs contributing to a disease tend to be grouped in a set of genes that are all involved in the same biological pathway. This application of GSEA not only aids in the discovery of disease-associated SNPs, but also helps illuminate the corresponding pathways and mechanisms of the diseases.[10]
GSEA and Spontaneous Preterm Birth
Gene set enrichment methods led to the discovery of new suspect genes and biological pathways related to spontaneous preterm birth (SPTB).[11] Exome sequences from women who had experienced SPTB were compared to those from women in the 1000 Genomes Project, using a tool that scored possible disease-causing variants. Genes with higher scores were then run through different programs to group them into gene sets based on pathways and ontology groups. This study found that the variants were significantly clustered in sets related to several pathways, all suspects in SPTB.[11]
GSEA and Cancer Cell Profiling
Gene Set Enrichment Analysis can be used to understand the changes that cells undergo during carcinogenesis and metastasis. In one study, microarray analysis was performed on renal cell carcinoma metastases, primary renal tumors, and normal kidney tissue, and the data were analyzed using GSEA.[12] This analysis showed significant changes of expression in genes involved in pathways that had not previously been associated with the progression of renal cancer. From this study, GSEA has provided potential new targets for renal cell carcinoma therapy.[12]
GSEA and Schizophrenia
GSEA can be used to help understand the molecular mechanisms of complex disorders. Schizophrenia is a largely heritable disorder, but it is also very complex, and the onset of the disease involves many genes interacting within multiple pathways, as well as the interaction of those genes with environmental factors. For instance, epigenetic changes, like DNA methylation, are affected by the environment, but are also inherently dependent on the DNA itself. DNA methylation is the most well-studied epigenetic change, and was recently analyzed using GSEA in relation to schizophrenia-related intermediate phenotypes.[13] Researchers ranked genes by the correlation between their methylation patterns and each of the phenotypes. They then used GSEA to look for an enrichment of genes that are predicted to be targeted by microRNAs in the progression of the disease.[13]
GSEA and Depression
GSEA can help provide molecular evidence for the association of biological pathways with diseases. Previous studies have shown that long-term depression symptoms are correlated with changes in immune response and inflammatory pathways.[14] A study by Elovainio et al. aimed to find genetic and molecular evidence supporting this association. The researchers took blood samples from the Young Finns Study (participants were depression patients), and used genome-wide expression data together with GSEA to find expression differences in gene sets related to inflammatory pathways. The study found that participants rated as having the most severe depression symptoms also had significant expression differences in those gene sets, a result that supports the association hypothesis.[14]
References
1. Subramanian, Aravind; Tamayo, Pablo; Mootha, Vamsi K.; Mukherjee, Sayan; Ebert, Benjamin L.; Gillette, Michael A.; Paulovich, Amanda; Pomeroy, Scott L.; Golub, Todd R. (2005-10-25). "Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles". Proceedings of the National Academy of Sciences. 102 (43): 15545–15550. doi:10.1073/pnas.0506580102. ISSN 0027-8424. PMC 1239896. PMID 16199517.
2. Pathan, M; Keerthikumar, S; Ang, C. S.; Gangoda, L; Quek, C. Y.; Williamson, N. A.; Mouradov, D; Sieber, O. M.; Simpson, R. J.; Salim, A; Bacic, A; Hill, A; Stroud, D. A.; Ryan, M. T.; Agbinya, J. I.; Mariadasson, J. M.; Burgess, A. W.; Mathivanan, S (2015). "Technical brief FunRich: An open access standalone functional enrichment and interaction network analysis tool". Proteomics. 15: n/a. doi:10.1002/pmic.201400515. PMID 25921073.
3. Mootha, V; et al. (2003). "PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes". Nature Genetics. 34 (3): 267.
4. Liberzon, Arthur; Birger, Chet; Thorvaldsdóttir, Helga; Ghandi, Mahmoud; Mesirov, Jill P.; Tamayo, Pablo (2015-12-23). "The Molecular Signatures Database Hallmark Gene Set Collection". Cell Systems. 1 (6): 417–425. doi:10.1016/j.cels.2015.12.004. ISSN 2405-4712. PMC 4707969. PMID 26771021.
5. Tamayo, Pablo; Steinhardt, George; Liberzon, Arthur; Mesirov, Jill P. (2016-02-01). "The limitations of simple gene set enrichment analysis assuming gene independence". Statistical Methods in Medical Research. 25 (1): 472–487. doi:10.1177/0962280212460441. ISSN 0962-2802. PMC 3758419. PMID 23070592.
6. Frost, H Robert; Li, Zhigang; Moore, Jason H (2015-03-03). "Spectral gene set enrichment (SGSE)". BMC Bioinformatics. 16 (1). doi:10.1186/s12859-015-0490-7. PMC 4365810. PMID 25879888.
7. Huang, Da Wei; Sherman, Brad T; Lempicki, Richard A. "Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources". Nature Protocols. 4 (1): 44–57. doi:10.1038/nprot.2008.211.
8. Wadi L, Meyer M, Weiser J, Stein LD, Reimand J. "Impact of knowledge accumulation on pathway enrichment analysis". doi:10.1101/049288.
9. "Gene Ontology Consortium: going forward". Nucleic Acids Research. 43 (D1): D1049–D1056. 26 November 2014. doi:10.1093/nar/gku1179.
10. Holden, Marit; Deng, Shiwei; Wojnowski, Leszek; Kulle, Bettina (2008-12-01). "GSEA-SNP: applying gene set enrichment analysis to SNP data from genome-wide association studies". Bioinformatics. 24 (23): 2784–2785. doi:10.1093/bioinformatics/btn516. ISSN 1367-4803. PMID 18854360.
11. Manuck, Tracy A.; Watkins, Scott; Esplin, M. Sean; Parry, Samuel; Zhang, Heping; Huang, Hao; Biggio, Joseph R.; Bukowski, Radek; Saade, George. "242: Gene set enrichment investigation of maternal exome variation in spontaneous preterm birth (SPTB)". American Journal of Obstetrics and Gynecology. 214 (1): S142–S143. doi:10.1016/j.ajog.2015.10.280.
12. Maruschke, Matthias; Hakenberg, Oliver W; Koczan, Dirk; Zimmermann, Wolfgang; Stief, Christian G; Buchner, Alexander (2014-01-01). "Expression profiling of metastatic renal cell carcinoma using gene set enrichment analysis". International Journal of Urology. 21 (1): 46–51. doi:10.1111/iju.12183. ISSN 1442-2042.
13. Hass, Johanna; Walton, Esther; Wright, Carrie; Beyer, Andreas; Scholz, Markus; Turner, Jessica; Liu, Jingyu; Smolka, Michael N.; Roessner, Veit (2015-06-03). "Associations between DNA methylation and schizophrenia-related intermediate phenotypes — A gene set enrichment analysis". Progress in Neuro-Psychopharmacology and Biological Psychiatry. 59: 31–39. doi:10.1016/j.pnpbp.2015.01.006. PMC 4346504. PMID 25598502.
14. Elovainio, Marko; Taipale, Tuukka; Seppälä, Ilkka; Mononen, Nina; Raitoharju, Emma; Jokela, Markus; Pulkki-Råback, Laura; Illig, Thomas; Waldenberger, Melanie. "Activated immune–inflammatory pathways are associated with long-standing depressive symptoms: Evidence from gene-set enrichment analyses in the Young Finns Study". Journal of Psychiatric Research. 71: 120–125. doi:10.1016/j.jpsychires.2015.09.017.