September 25, 2019

Gene enrichment analysis: Gene Ontology (GO) and Ingenuity Pathway Analysis (IPA)

GO = gene ontology: the "meaning" of a gene (biological function, molecule type, cell location, disease association, etc.).

Gene enrichment analysis is a search for patterns in gene expression, gene upregulation or downregulation, lists of total significant genes, etc. Most genes are associated with some pathways, for example: ACTB (beta actin) with cell migration, cell proliferation, and mitosis; TP53 (p53) with cell cycle arrest and apoptosis; ALDH2 with aldehyde metabolism; CDC7 and CDK2 with cell cycle control; CDK7 with cell cycle control and transcription.

The analysis looks at the specific genes and their associated biological pathways or molecular functions or disease associations or upstream regulators, then calculates p-values for how often those associations come up compared to how often you would expect them to come up in a random list of genes. If an association comes up really often (like cell cycle control), is it because it has a large list of associated genes since it is a well-studied pathway, or is it because your specific list of genes is "enriched" for that pathway? The analysis helps you determine that. 
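To make "how often you would expect" concrete, here is a minimal Python sketch of the overlap expected by chance. The numbers (20,000 background genes, a 500-gene input list, pathways of 1,000 and 40 genes) are made up for illustration, and the function names are hypothetical.

```python
def expected_overlap(list_size: int, pathway_size: int, background_size: int) -> float:
    """Number of pathway genes expected in a random gene list of this size."""
    return list_size * pathway_size / background_size


def fold_enrichment(observed: int, list_size: int, pathway_size: int,
                    background_size: int) -> float:
    """Observed overlap divided by the overlap expected by chance."""
    return observed / expected_overlap(list_size, pathway_size, background_size)


# Hypothetical example: 500 input genes out of a 20,000-gene background.
BACKGROUND = 20000
MY_LIST = 500

# A large, well-studied pathway (1,000 genes) shows up often just by chance.
print("large pathway, expected by chance:", expected_overlap(MY_LIST, 1000, BACKGROUND))
print("large pathway, fold enrichment for 30 hits:", fold_enrichment(30, MY_LIST, 1000, BACKGROUND))

# A small pathway (40 genes) is rarely hit by chance, so 10 hits stands out.
print("small pathway, expected by chance:", expected_overlap(MY_LIST, 40, BACKGROUND))
print("small pathway, fold enrichment for 10 hits:", fold_enrichment(10, MY_LIST, 40, BACKGROUND))
```

In this toy case the large pathway has more hits (30 vs. 10), but the small pathway is the one that is enriched relative to chance.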

The most common statistical test for this type of analysis is a Fisher's Exact Test, which calculates the significance of an overlap. The overlap you are interested in is [your genes] compared to [the genes associated with a pathway]. The number of genes in each list is taken into account, along with the size of the background gene set, since the likelihood of an overlap is higher with longer lists. You calculate a p-value for every pathway to determine which pathways are significant for your gene list.
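Here is a minimal Python sketch of that overlap test, assuming a one-sided Fisher's Exact Test (which is the same as the upper tail of a hypergeometric distribution). It uses only the standard library. The gene list, the pathway memberships, the background size of 20,000, and the function name are toy assumptions for illustration, not real annotations or the method used by any particular tool.

```python
from math import comb


def enrichment_p_value(overlap: int, list_size: int, pathway_size: int,
                       background_size: int) -> float:
    """One-sided Fisher's Exact Test: chance of seeing at least `overlap`
    pathway genes in a random list of `list_size` genes."""
    max_overlap = min(list_size, pathway_size)
    all_lists = comb(background_size, list_size)
    # Count the random lists that contain k pathway genes, for every k >= overlap.
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / all_lists


# Toy inputs (hypothetical, for illustration only).
background_size = 20000
my_genes = {"CDC7", "CDK2", "CDK7", "ACTB"}
pathways = {
    "cell cycle control": {"CDC7", "CDK2", "CDK7", "TP53"},
    "aldehyde metabolism": {"ALDH2"},
}

# One p-value per pathway.
for name, members in pathways.items():
    hits = len(my_genes & members)
    p = enrichment_p_value(hits, len(my_genes), len(members), background_size)
    print(f"{name}: overlap={hits}, p={p:.3g}")
```

Real tools usually also correct these p-values for the number of pathways tested, which this sketch leaves out.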




There are many resources for a gene enrichment analysis. When using any of the suggested tools, beware that the size of your input matters. For human datasets, 200-2000 genes is generally considered a good range for analysis. You can go out of this range a little (e.g. 150-2500) but too few or too many genes will make the overlap analysis less interesting or biologically meaningful. For example, what does it matter if you have a 100% overlap with the "T-Cell Development" pathway if your input contains 60% of the genes in the entire human genome? You probably have 100% overlap with many pathways in that case, but you won't know which are false positives just due to the excessive input! Your results won't be useful for biological interpretation. 

Too few input genes, and you will get the opposite problem. You will not get many overlaps at all, and most of your overlaps will be random chance overlaps to large pathways.
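The small-input problem can be sketched with a little arithmetic in Python. This is an illustration under assumed numbers (a 20,000-gene background, a 10-gene input list, pathways of 40 and 1,000 genes), and the function name is hypothetical.

```python
from math import comb


def chance_of_any_overlap(list_size: int, pathway_size: int,
                          background_size: int) -> float:
    """Probability that a random gene list shares at least one gene with a pathway."""
    lists_with_no_overlap = comb(background_size - pathway_size, list_size)
    return 1 - lists_with_no_overlap / comb(background_size, list_size)


# Hypothetical example: only 10 input genes out of a 20,000-gene background.
BACKGROUND = 20000
TINY_LIST = 10

for pathway_size in (40, 1000):
    p_any = chance_of_any_overlap(TINY_LIST, pathway_size, BACKGROUND)
    print(f"pathway of {pathway_size} genes: chance of >=1 random overlap = {p_any:.3f}")
```

With so few input genes, a random list almost never touches a small pathway, while a single hit in a large pathway is common by chance alone.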



----------------------------------------------------------------------------------------------

Resources to run enrichment analysis


The Gene Ontology Resource

http://geneontology.org/

Cost: free 

Interface: website tool, though several R packages exist also. For DNA methylation array data, use R package missMethyl since it takes into account that genes have multiple methylation probes.

What it does: takes your list of genes and provides enrichment analysis for GO terms. Each individual gene is associated with some GO terms. Enrichment analysis tells you if your specific list of genes has GO terms that appear more often than you would expect from a random list of genes.

There are different GO term categories:
  • biological process = what is it involved in? (e.g. apoptosis, DNA replication, translation)
  • molecular function = what does it do? (e.g. helicase, RNA-binding, copper-binding, receptor)
  • cellular component = where is it? (e.g. nuclear, extracellular, peroxisome)

Pro: open source, free, doesn't require programming knowledge.

Con: gene ontology terms can be vague, does not provide direction (e.g. do these genes repress or promote apoptosis?)

~ ~ ~

Ingenuity Pathway Analysis or "IPA" (QIAGEN)

Ingenuity Pathway Analysis (IPA) is gene enrichment analysis software from QIAGEN. They hold live webinars every now and then. If you look under “Support”, then “Training” on their main page, you can see some recorded webinars.

Cost: licensed. Cedars-Sinai has an institutional license for its researchers. Request an account through the service center.

Interface: Java-based desktop software

What it does: takes your list of genes plus additional data from RNA-sequencing if you have it (P-values, FDR values, expression intensity, fold-change direction and magnitude) and provides enrichment analysis for a few different categories. It performs enrichment analysis for:

  • Canonical pathways (e.g. EIF2 signaling, mitochondrial dysfunction)
  • Diseases and biological functions (e.g. tumor morphology, infectious disease, cancer)
  • Molecular and cellular functions (e.g. cell death and survival, protein synthesis, cellular development, cellular movement)
  • Physiological systems development and function (e.g. organismal survival, cardiovascular system development and function, tissue development)
  • Upstream regulators (genes, proteins, endogenous chemicals, and drugs that are known to regulate your list of genes)

IPA also identifies networks that fit many of your input genes, and adds genes (or proteins or chemicals) not on your list which are also part of the networks.


Pro: more specific terms than GeneOntology.org, predicts directions of enriched terms if enough is known about your input genes and enough of them go in the same direction, doesn't require programming knowledge.

Con: proprietary database, license isn't free so users tend to be scientists with institutional access. Cedars-Sinai pays for a group license so that a certain number of users can log in at once. If the limit is reached, you will get an error and won't be able to use the software until someone logs off. 

Software download link: 
https://analysis.ingenuity.com/pa/installer/select

Software download is free, but you won't be able to use IPA without a license (not free).

~ ~ ~

KEGG: Kyoto Encyclopedia of Genes and Genomes

https://www.genome.jp/kegg/

Cost: small fee for maintenance costs. It used to be free.

Interface: R package or desktop tool, might be outdated.

This is another gene annotation database (organized around pathways rather than GO terms), but I don't use it.

Recommended reading: "Don't use KEGG!!!" by Genome Spot (2024)

~ ~ ~

Gene Set Enrichment Analysis (GSEA) R package


Cost: free

Interface: R package or desktop tool


Read:



The previous methods don't require programming knowledge. GSEA is a popular method that can be run from an R package.

~ ~ ~

Last updated 11/17/2023 to add figures, then 9/12/2024 to add GSEA
