September 2, 2023

Vocabulary for reading DNA methylation papers

 If you are new to the field of DNA methylation, here is some introductory vocabulary.

  • 27k array - short for Infinium HumanMethylation27 BeadChip (approx 27,000 methylation sites). An early DNA methylation array that samples CpG sites across the human genome. It is no longer used.
  • 450k array - short for Infinium HumanMethylation450 BeadChip (>450,000 methylation sites). This was the gold standard for array-based DNA methylation measurements for several years, though the EPIC array is now available with more methylation sites.
  • Beta value and M value are two terms used to describe the methylation measurement.
    • Beta values range from 0 to 1 and describe the proportion of methylation for a specific site in the sample, from completely unmethylated to completely methylated. 
    • M values can be negative or positive. They are the log2 ratio of methylated to unmethylated signal, which is approximately log2(beta / (1 - beta)), so an M value of 0 corresponds to a beta value of 0.5. M values are generally more useful than beta values as input for statistical models. If you want to identify differentially methylated probes (DMPs), the analysis is typically run on M values.
  • "Beta" or sometimes "delta beta", annoyingly, can also refer to the model coefficient from a generalized linear model. In that case, it describes whether a site is more or less methylated in group1 versus group2. The sign (positive or negative) indicates which group has higher methylation. Consult the manuscript to figure out the direction.
  • Bisulfite conversion - this is a chemical change done by the lab to make it easier to distinguish unmethylated versus methylated sites on the DNA. It is done before the DNA methylation measurement step, whether that be bisulfite sequencing or bisulfite PCR or a methylation array. Bisulfite conversion does NOT mean bisulfite sequencing.
  • Bisulfite sequencing - this is a method of measuring DNA methylation in a sample, using bisulfite conversion followed by DNA sequencing. 
  • Bonferroni correction - this is a way of correcting p-values when you have a lot of measurements and thus a higher risk of false positives. The Bonferroni method is much stricter than the FDR method.
    • P<0.05 is "nominally" significant, meaning it seems significant for a single site or gene but if you're looking at many sites or genes then you have a higher chance that your data has some false positives, and therefore P<0.05 isn't good enough for big data. You need to adjust for multiple comparisons to cut down the risk of false positives.
    • FDR<0.05 is actually significant for big data.
    • Bonferroni<0.05 is actually significant for big data and much stricter than FDR<0.05.
    • Bonferroni<0.05 data points are also going to meet criteria for FDR<0.05 and P<0.05, since Bonferroni<0.05 is stricter than both.
  • CpG - dinucleotide grouping of a cytosine base followed by a guanine base on the same DNA strand. It is not the same as C and G binding across DNA strands. CpG methylation (with a methyl group on the C base) is what people usually mean when they say DNA methylation, since it's the most common form of DNA methylation.
  • CpH - shorthand for CpA, CpC, and CpT, because H=A,C,T in nucleic acid notation (IUPAC codes). CpH methylation is rare but happens.
  • Custom array - some research groups measure DNA methylation at sites of their choosing by working with a company to design custom arrays. 
  • Differentially methylated probe (DMP) - refers to specific methylation sites that are statistically different between group1 versus group2. For example, I could say that cg05451842 (a specific CpG location) is more methylated in group1 with a P-value of 4E-9 (another way of writing 4×10^-9). The "differentially methylated" part implies that a statistical comparison was performed. DMPs are identified with a generalized linear model using a matrix of values (all samples measured and all probes on the array) to find which probes are statistically significantly different between group1 and group2.
  • Differentially methylated region (DMR) - refers to DNA regions that have statistically different methylation between group1 versus group2. For example, I could say that there is a DMR upstream of ZNF300 that is significant and that this DMR contains 15 methylation sites.
  • DMP  - see differentially methylated probe (it's a specific cytosine location on the genome)
  • DMR  - see differentially methylated region (regions are composed of several probes)
  • DNA methylation - this typically refers to the addition of a methyl (CH3) group onto a cytosine base (C base) of DNA. The most common locations of DNA methylation are dinucleotide CpG sites, where the cytosine base is followed immediately by a guanine base on the same strand. There also exist CpH sites, where the H is shorthand for either A, C, or T. CpH methylation (CpA, CpC, and CpT methylation) is rare but happens. We are not exactly sure what it does but it appears slightly more common in stem cells.
  • DVPs - differentially variable positions. These are not the same as DMPs, which compare average methylation between groups. DVPs compare the variability of methylation between groups. For example, methylation at a particular site might vary a lot within the control group, while the experimental study group is always very methylated or very unmethylated at that site, with much less within-group variation than the control group. This might tell you that this specific site is correlated with the studied condition, even if it is not directly causal by itself, since the control group also has subjects who are hyper- or hypomethylated at this site.
  • EPIC array - short for Infinium MethylationEPIC BeadChip (>860,000 methylation sites, exact number varies depending on the EPIC array version). The EPIC array covers nearly twice as many methylation sites as the older 450k array. As of writing (8/2023), it is the most comprehensive DNA methylation array for the human genome.
  • FDR or False Discovery Rate = Benjamini-Hochberg's False Discovery Rate. This is a way of correcting p-values when you have a lot of measurements and thus a higher risk of false positives. The FDR value is like a stricter p-value. It is a number between 0 and 1, for example "significance was defined as FDR<0.05", but is sometimes shown as a percentage such as "significance was identified as FDR less than 5%", which means the same thing.
    • All FDR<0.05 data points are also P<0.05
    • But not all P<0.05 data points reach FDR<0.05
  • Genome-wide significance means a methylation site is highly significant at a strict threshold that takes into account the total number of methylation sites measured on an array that samples across the whole genome. For the EPIC array, Mansell et al 2019 did the math and suggested p-value < 9E-8 as a threshold to define genome-wide significance. In practice, this produces very similar results to Bonferroni<0.05, plus or minus a methylation site.
  • Globally significant - interchangeable with genome-wide significant. 
  • Imprinted site or imprinted region - site or region where the offspring's methylation status is inherited from a specific parent's allele due to methylation regulation during reproduction (including egg and sperm methylation maintenance) and DNA replication post-fertilization (including biases in which allele is used to spread methylation signatures).
    • Imprinting is the result of different evolutionary pressures. Imprinting may suggest evolutionary conflict over what is more beneficial for the mother's genes versus the father's genes. For example, in times of famine and war, is it more beneficial for a pregnant woman's body to miscarry to conserve her limited resources (and maybe increase her or her other children's chance of survival) or to devote more resources to keep the fetus and pass on her (and the father's) genes? 
    • See Court et al 2014, "Genome-wide parent-of-origin DNA methylation analysis reveals the intricacies of human imprinting and suggests a germline methylation-independent mechanism of establishment". Genome Res. 2014 Apr;24(4):554-69. doi:10.1101/gr.164913.113. [PMID: 24402520, PMCID: PMC3975056]
  • Island, "CpG Island" - this is a region of DNA that is enriched for (has more than normal frequency for) CpG sites.
  • LINE assays = LINE-1 assays = a method of comparing overall methylation between two or more groups. Human DNA has repeats of the LINE-1 retrotransposon sequence all throughout our genome and measuring how much they are methylated is used as an indirect way to estimate overall methylation of all genomic DNA. The LINE assay is an antibody and fluorescence assay to compare the proportion of methylated and unmethylated LINE-1 DNA regions. It is done with genomic DNA, a 96-well plate, a PCR machine, and a plate reader. You don't get specific information about other genes or methylation sites, only the LINE-1 transposons. It is not a replacement for bisulfite sequencing or array measurements. LINE assays are used when you suspect that group1 has more methylation overall (everywhere) compared to group2. 
  • "Methylation site" and "probe" are often used interchangeably by methylation array papers, but they are not exactly the same. The methylation site is the actual location on the DNA where the C base is methylated. Methylation sites exist regardless of DNA methylation measurement method. A "probe" is a way that methylation array technology identifies specific methylation sites using binding chemistry. The array has an annotation file from the manufacturer with information on different probe sites, with probeIDs used to identify methylation sites, for example "cg03513874" and "cg05451842" are the names of different probes.
  • M value - the log2 ratio of methylated to unmethylated signal at a probe, commonly computed during DNA methylation array data processing (for example with the minfi R package). A matrix of M values (rows=probeIDs, columns=samples) is used to run analysis to identify differentially methylated probes.
  • Masking - a masked probe is one flagged because the methylation measurement at that specific site is unreliable. Masking happens when other factors affect the measurement, not just the presence or absence of a methyl group at that probe site. Generally, probes that are known to have masking problems are removed before final analysis. For the EPIC array, Zhou et al 2016 published a quality analysis paper and EPIC array manifest in Nucleic Acids Research that includes masking information. Dr Wanding Zhou also self-publishes updated versions of his EPIC array manifest on his website.
  • Parent-of-origin, also see "imprinting". This term is used when the research subject's methylation status at a specific CpG site or DNA region is inherited from either the mother or the father's allele, rather than random chance. 
  • Q-Q plot = Quantile-Quantile plot = a plot created after running a generalized linear model and getting p-values for probeIDs. The observed p-values are sorted and plotted (usually as -log10 values) against the p-values expected if no site were truly associated. The plot is used to estimate inflation (exaggeration of significance) in your p-values. You want most points to follow the x=y line as closely as possible at the beginning (these are not significant points), with a tail at the end that suddenly jumps up (these are your significant points). If everything is above the x=y line, then you have inflation and every methylation site is more "significant" than expected, which is unlikely to describe the real biology. If you have high inflation, your model might not be accounting for confounding variables, and you should consider re-running it with covariates included.
  • Reference genome
    • hg19 = GRCh37 = the older human reference genome that a lot of DNA methylation papers still use
    • hg38 = GRCh38 = the newer human reference genome
    • It is important to know the reference genome because the chromosome location of a methylation site can differ between references, since sequence is added, removed, or corrected from one version to the next. The chromosome location is the "address" of the methylation site, and the reference genome is the map version that you are using to locate that address.
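
To make the relationship between beta values and M values (see the entries above) concrete, here is a minimal Python sketch of the commonly used conversion M = log2(beta / (1 - beta)). The function names are made up for illustration. Real pipelines usually compute both values from raw probe intensities, often with a small offset, so treat this as an approximation and not any package's exact method.

```python
import math


def beta_to_m(beta: float) -> float:
    """Convert a beta value (strictly between 0 and 1) to an M value."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must be strictly between 0 and 1")
    return math.log2(beta / (1.0 - beta))


def m_to_beta(m: float) -> float:
    """Convert an M value back to a beta value."""
    return 2.0 ** m / (2.0 ** m + 1.0)


# Hypothetical beta values for three probes in one sample.
example_betas = [0.1, 0.5, 0.8]
example_ms = [beta_to_m(b) for b in example_betas]

for b, m in zip(example_betas, example_ms):
    print(f"beta={b:.2f} -> M={m:+.3f}")
```

A half-methylated site (beta of 0.5) sits at M of 0, sites that are mostly unmethylated get negative M values, and sites that are mostly methylated get positive M values.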
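
The nesting described under the Bonferroni correction and FDR entries (every Bonferroni<0.05 point also meets FDR<0.05, and every FDR<0.05 point also meets P<0.05) can be checked on a toy example. This is a plain-Python sketch with made-up p-values and illustrative function names. A real analysis of hundreds of thousands of probes would normally rely on a statistics package instead.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply each p by the number of tests, cap at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (often reported as "FDR")."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for five hypothetical probes.
pvals = [0.001, 0.012, 0.02, 0.045, 0.6]
bonf = bonferroni(pvals)
fdr = benjamini_hochberg(pvals)

for p, q, b in zip(pvals, fdr, bonf):
    print(f"P={p:<6} FDR={q:.4f} Bonferroni={b:.4f}")
```

In this toy data, four probes pass P<0.05, three pass FDR<0.05, and only one passes Bonferroni<0.05.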
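
For the Q-Q plot entry, the points being plotted can be sketched as follows. The function name and the example p-values are hypothetical, and the (i + 0.5) / n formula for the expected quantiles is one common convention among several, so this is an illustration and not a definitive recipe.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) pairs of -log10(p) for a Q-Q plot.

    Expected values assume p-values are uniform between 0 and 1 (no real signal).
    """
    n = len(pvalues)
    observed = sorted(pvalues)  # smallest (most significant) first
    points = []
    for i, p in enumerate(observed):
        expected_p = (i + 0.5) / n  # expected quantile if nothing is associated
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points


# Hypothetical example: perfectly "null" p-values versus inflated ones.
null_pvals = [(i + 0.5) / 10 for i in range(10)]
inflated_pvals = [p ** 2 for p in null_pvals]  # every p-value smaller than expected

null_points = qq_points(null_pvals)
inflated_points = qq_points(inflated_pvals)
```

The null points fall on the x=y line, while every inflated point sits above it, which is the pattern that suggests an unaccounted-for confounder.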

Suggested lectures:

--------

Updated 10/4/2023
