For RNA-sequencing, DNA methylation arrays, GWAS, and other big data projects, we run a statistical test to compare two groups for each row in the data matrix (each gene, each methylation site, each SNP, etc). That gives us a p-value for each reading.
However, with big data, p<0.05 is only suggestive, not sufficiently strict. When working with thousands (miRNA-seq), tens of thousands (mRNA-seq or total RNA-seq), or >450,000 sites (DNA methylation), the occurence of false positives is much greater. This webcomic shows why we don't rely only on simple p-values to make conclusions: "Significant" (xkcd #882)
To reduce risk of false positives and narrow down to significant values, we adjust the original p-values for multiple comparisons. These two methods are popular:
- Benjamini-Hochberg's False Discovery Rate (FDR)
- Most common p-value adjustment for RNA-seq data.
- Sometimes used for DNA methylation data.
- Citation: Benjamini, Y., & Hochberg, Y. (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing." Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289-300.
- In R, make a new "FDR" column in your complete dataset using the full list of p-values, before adding any filters for expression or fold-change or otherwise subsetting the data:
data$FDR = p.adjust(data$pvalues, method="BH")
or, synonymously,
data$FDR = p.adjust(data$pvalues, method="fdr") - The R package DESeq2 creates a "padj" column using the p.adjust() R function as above.
- Bonferroni correction
- Much stricter (more conservative) than the Benjamini-Hochberg FDR method.
- Leaves less significant values overall. Drastically reduces false positives (type I error), but increases false negatives (type II error).
- Sometimes used for DNA methylation data if other quality control steps suggest the data has high inflation (higher risk of false positives).
- Not typically used for RNA-seq.
- In R, make a new "bonferroni" column in your complete dataset using the full list of p-values, before adding any filters for expression or fold-change or otherwise subsetting the data:
data$bonferroni = p.adjust(data$pvalues, method="bonferroni")
Additional reading:
- Mohieddin Jafari & Naser Ansari-Pour. (2019) "Why, When and How to Adjust Your P Values?" Cell J. 20(4): 604–607. doi: 10.22074/cellj.2019.5992
- R's p.adjust() function documentation
No comments:
Post a Comment