March 17, 2022

Why do we adjust p-values for multiple comparisons in big data analysis?

For RNA-sequencing, DNA methylation arrays, GWAS, and other big data projects, we run a statistical test to compare two groups for each row in the data matrix (each gene, each methylation site, each SNP, etc). That gives us a p-value for each reading. 

However, with big data, p<0.05 is only suggestive, not sufficiently strict. When working with thousands (miRNA-seq), tens of thousands (mRNA-seq or total RNA-seq), or >450,000 sites (DNA methylation), the occurence of false positives is much greater. This webcomic shows why we don't rely only on simple p-values to make conclusions: "Significant" (xkcd #882)

To reduce risk of false positives and narrow down to significant values, we adjust the original p-values for multiple comparisons. These two methods are popular:

  • Benjamini-Hochberg's False Discovery Rate (FDR)
    • Most common p-value adjustment for RNA-seq data.
    • Sometimes used for DNA methylation data.
    • Citation: Benjamini, Y., & Hochberg, Y. (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing." Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289-300.
    • In R, make a new "FDR" column in your complete dataset using the full list of p-values, before adding any filters for expression or fold-change or otherwise subsetting the data:
      data$FDR = p.adjust(data$pvalues, method="BH")
      or, synonymously,
      data$FDR = p.adjust(data$pvalues, method="fdr")
    • The R package DESeq2 creates a "padj" column using the p.adjust() R function as above. 

  • Bonferroni correction
    • Much stricter (more conservative) than the Benjamini-Hochberg FDR method.
    • Leaves less significant values overall. Drastically reduces false positives (type I error), but increases false negatives (type II error).
    • Sometimes used for DNA methylation data if other quality control steps suggest the data has high inflation (higher risk of false positives).
    • Not typically used for RNA-seq.
    • In R, make a new "bonferroni" column in your complete dataset using the full list of p-values, before adding any filters for expression or fold-change or otherwise subsetting the data:
      data$bonferroni = p.adjust(data$pvalues, method="bonferroni")

Additional reading:

No comments:

Post a Comment

How to format final figures for publication

General figure guidelines File types and file sizes TIFF images with LZW compression to reduce the file size PDF files for vector images Not...