March 17, 2022

Why do we adjust p-values for multiple comparisons in big data analysis?

For RNA-sequencing, DNA methylation arrays, GWAS, and other big data projects, we run a statistical test to compare two groups for each row in the data matrix (each gene, each methylation site, each SNP, etc). That gives us a p-value for each reading. 

However, with big data, p<0.05 is only suggestive, not sufficiently strict. When working with thousands (miRNA-seq), tens of thousands (mRNA-seq or total RNA-seq), or >450,000 sites (DNA methylation), the occurence of false positives is much greater. This webcomic shows why we don't rely only on simple p-values to make conclusions: "Significant" (xkcd #882)

To reduce risk of false positives and narrow down to significant values, we adjust the original p-values for multiple comparisons. These two methods are popular:

  • Benjamini-Hochberg's False Discovery Rate (FDR)
    • Most common p-value adjustment for RNA-seq data.
    • Sometimes used for DNA methylation data.
    • Citation: Benjamini, Y., & Hochberg, Y. (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing." Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289-300.
    • In R, make a new "FDR" column in your complete dataset using the full list of p-values, before adding any filters for expression or fold-change or otherwise subsetting the data:
      data$FDR = p.adjust(data$pvalues, method="BH")
      or, synonymously,
      data$FDR = p.adjust(data$pvalues, method="fdr")
    • The R package DESeq2 creates a "padj" column using the p.adjust() R function as above. 

  • Bonferroni correction
    • Much stricter (more conservative) than the Benjamini-Hochberg FDR method.
    • Leaves less significant values overall. Drastically reduces false positives (type I error), but increases false negatives (type II error).
    • Sometimes used for DNA methylation data if other quality control steps suggest the data has high inflation (higher risk of false positives).
    • Not typically used for RNA-seq.
    • In R, make a new "bonferroni" column in your complete dataset using the full list of p-values, before adding any filters for expression or fold-change or otherwise subsetting the data:
      data$bonferroni = p.adjust(data$pvalues, method="bonferroni")

Additional reading:

No comments:

Post a Comment

Bookmarks: single cell RNA-seq tutorials and tools

These are my bookmarks for single cell transcriptomics resources and tutorials. scRNA-seq introductions How to make R obj...