Here I list available resources and annotation files for Illumina methylation arrays, which measure DNA methylation at cytosine nucleotides.
There are three different probeType column values: "cg", "ch", "rs"
- "cg" - CpG sites, the most common DNA methylation sites
- "ch" - CpH sites (where the C base is followed by something other than a G base, H=A,T,C). Rare methylation sites. They are slightly more common in stem cells, but still not as common as CpG sites.
- "rs" - SNP probes, which target single nucleotide polymorphisms (SNPs) rather than methylation sites. These should be excluded from most analyses.
- If your data has no probeType column, you can identify these sites from the probeID prefix. For example, in R, grepl("^cg", df$probeID) returns a logical (TRUE/FALSE) vector with one value per entry of df$probeID. The caret symbol (^) in the regular expression "^cg" anchors the match to the beginning of the text, so the probeID "cg07881041" returns TRUE.
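The prefix matching described above can be sketched as follows. This is a minimal illustration, not a fixed recipe: the data frame df and every probe ID in it other than "cg07881041" are made up, and the real "ch" and "rs" IDs on your array may look different.

```r
# Toy data frame standing in for real data; only "cg07881041" comes from the
# text above, the other IDs are made up for illustration.
df <- data.frame(
  probeID = c("cg07881041", "ch.1.1234567F", "rs10796216", "cg00000029"),
  stringsAsFactors = FALSE
)

# One logical vector per prefix; "^" anchors the match to the start of the ID.
is_cg <- grepl("^cg", df$probeID)
is_ch <- grepl("^ch", df$probeID)
is_rs <- grepl("^rs", df$probeID)

# Rebuild a probeType column from the prefixes (NA if no prefix matches).
df$probeType <- ifelse(is_cg, "cg",
                ifelse(is_ch, "ch",
                ifelse(is_rs, "rs", NA)))

# Keep only the CpG probes.
df_cg <- df[is_cg, , drop = FALSE]
table(df$probeType)
```

Building the logical vectors first makes it easy to count each probe type before deciding what to drop.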
Official Illumina manifest files for methylation arrays
A "manifest" is a list of all items (sometimes samples, in this case methylation probes) and information about them. Illumina provides manifests for each of their methylation arrays with information such as probeID, probe type, probe chemistry details, chromosome position, and associated genes.
Infinium MethylationEPIC BeadChip (>860k sites)
- Annotation files (Infinium MethylationEPIC Product Files, get the "manifest" files)
- Infinium MethylationEPIC v1.0 B5 Manifest File (CSV Format), Mar 13, 2020
- GRCh36, GRCh37, and GRCh38 human reference genomes
- See annotation website for pdf explaining each annotation column
Infinium HumanMethylation450K BeadChip
- Annotation files (Infinium HumanMethylation450K v1.2 Product Files, get the "manifest" files)
- HumanMethylation450K v1.2 Manifest File (CSV Format), May 23, 2013
- GRCh36 and GRCh37 human reference genomes
- Annotation spreadsheet (Excel spreadsheet), Aug 27, 2013
- Uses GRCh36 human genome reference
Programming tip: how to import an official Illumina manifest into R quickly
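## 450k manifest
#file.manifest = "humanmethylation450_15017482_v1-2.csv"
## EPIC manifest
file.manifest = "infinium-methylationepic-v-1-0-b5-manifest-file.csv.gz"
## Confirm the file is in the working directory before reading it
file.exists(file.manifest)
## [1] TRUE
## Reading a .gz file directly with fread requires the R.utils package
annot = data.table::fread(input = file.manifest,
                          header = TRUE,             # first line read holds the column names
                          skip = 7,                  # skip the 7 lines above the column names
                          strip.white = TRUE,        # trim leading/trailing whitespace
                          na.strings = c("", "NA"),  # treat empty fields and "NA" as missing
                          data.table = FALSE)        # return a data.frame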
I highly recommend using the R package data.table and its function fread when loading DNA methylation array files. It is much faster than the base read.csv function for spreadsheets with >450,000 rows such as these.
I set the argument data.table=FALSE so that the function returns a data.frame object instead of a data.table object. I prefer working with either data frames or matrices in R.
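The skip=7 setting works because the column names sit on line 8 of the manifest. As a hedged sketch, here is one way to locate that header line instead of hard-coding it, using base R only. The toy file contents, positions, and the name toy_annot are made up for illustration; only the 7-line preamble and the "IlmnID" first column follow the manifest described above.

```r
# Write a small made-up file that mimics the manifest layout:
# 7 preamble lines, then the column names, then one row per probe.
toy_lines <- c(
  "Illumina, Inc.,,",
  "[Heading],,",
  "Descriptor File Name,toy.bpm,",
  "Assay Format,Infinium 2,",
  "Date Manufactured,1/1/2000,",
  "Loci Count,3,",
  "[Assay],,",
  "IlmnID,Name,CHR,MAPINFO",
  "cg07881041,cg07881041,19,1000",
  "cg00000029,cg00000029,16,2000",
  "rs10796216,rs10796216,10,"
)
toy_path <- tempfile(fileext = ".csv")
writeLines(toy_lines, toy_path)

# Find the line that starts with the first column name, then skip
# everything above it.
first_lines <- readLines(toy_path, n = 50)
header_line <- grep("^IlmnID,", first_lines)[1]

toy_annot <- read.csv(toy_path,
                      skip = header_line - 1,
                      header = TRUE,
                      na.strings = c("", "NA"),
                      stringsAsFactors = FALSE)
dim(toy_annot)
```

For the full manifest, data.table::fread can do the same search itself when skip is given as a string, for example skip = "IlmnID". The toy_annot object here is separate from the annot object used in the rest of this post.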
dim(annot)
## [1] 865918 52
colnames(annot)
## [1] "IlmnID"
## [2] "Name"
## [3] "AddressA_ID"
## [4] "AlleleA_ProbeSeq"
## [5] "AddressB_ID"
## [6] "AlleleB_ProbeSeq"
## [7] "Infinium_Design_Type"
## [8] "Next_Base"
## [9] "Color_Channel"
## [10] "Forward_Sequence"
## [11] "Genome_Build"
## [12] "CHR"
## [13] "MAPINFO"
## [14] "SourceSeq"
## [15] "Strand"
## [16] "UCSC_RefGene_Name"
## [17] "UCSC_RefGene_Accession"
## [18] "UCSC_RefGene_Group"
## [19] "UCSC_CpG_Islands_Name"
## [20] "Relation_to_UCSC_CpG_Island"
## [21] "Phantom4_Enhancers"
## [22] "Phantom5_Enhancers"
## [23] "DMR"
## [24] "450k_Enhancer"
## [25] "HMM_Island"
## [26] "Regulatory_Feature_Name"
## [27] "Regulatory_Feature_Group"
## [28] "GencodeBasicV12_NAME"
## [29] "GencodeBasicV12_Accession"
## [30] "GencodeBasicV12_Group"
## [31] "GencodeCompV12_NAME"
## [32] "GencodeCompV12_Accession"
## [33] "GencodeCompV12_Group"
## [34] "DNase_Hypersensitivity_NAME"
## [35] "DNase_Hypersensitivity_Evidence_Count"
## [36] "OpenChromatin_NAME"
## [37] "OpenChromatin_Evidence_Count"
## [38] "TFBS_NAME"
## [39] "TFBS_Evidence_Count"
## [40] "Methyl27_Loci"
## [41] "Methyl450_Loci"
## [42] "Chromosome_36"
## [43] "Coordinate_36"
## [44] "SNP_ID"
## [45] "SNP_DISTANCE"
## [46] "SNP_MinorAlleleFrequency"
## [47] "Random_Loci"
## [48] "MFG_Change_Flagged"
## [49] "CHR_hg38"
## [50] "Start_hg38"
## [51] "End_hg38"
## [52] "Strand_hg38"
If you are working with GRCh37 (hg19) human genome reference, you want to use columns "CHR" and "MAPINFO" (equivalent to "Start_hg19") for probe position when making Manhattan plots.
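A small sketch of picking the position columns by genome build. The helper probe_positions, the toy data frame annot_small, and all coordinates are hypothetical; the column names are the ones listed in the manifest above. I assume here that CHR_hg38 carries a "chr" prefix while CHR does not, so check your own file.

```r
# Toy annotation with made-up coordinates, using the manifest's column names.
annot_small <- data.frame(
  Name       = c("cg07881041", "cg00000029", "cg00000108"),
  CHR        = c("19", "16", "3"),
  MAPINFO    = c(1000L, 2000L, 3000L),
  CHR_hg38   = c("chr19", "chr16", "chr3"),
  Start_hg38 = c(1100L, 2100L, 3100L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return probe ID, chromosome, and position for one build.
probe_positions <- function(annot, build = c("hg19", "hg38")) {
  build <- match.arg(build)
  if (build == "hg19") {
    chr <- annot$CHR
    pos <- annot$MAPINFO
  } else {
    # Strip a leading "chr" so both builds use the same chromosome labels;
    # sub() leaves the value unchanged if there is no prefix.
    chr <- sub("^chr", "", annot$CHR_hg38)
    pos <- annot$Start_hg38
  }
  data.frame(probeID = annot$Name, chr = chr, pos = pos,
             stringsAsFactors = FALSE)
}

pos_hg19 <- probe_positions(annot_small, "hg19")
pos_hg38 <- probe_positions(annot_small, "hg38")
```

Returning the same three columns for both builds means the plotting code does not need to know which build was chosen.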
Additional resources for the EPIC array
- Zhou W, Laird PW and Shen H. Comprehensive characterization, annotation and innovative use of Infinium DNA Methylation BeadChip probes. Nucleic Acids Research, 2017
- Discusses the chemistry of each probe
- Identifies probes that are unreliable due to various masking issues
- Provides an EPIC array annotation file with a column summarizing masking issues. Keep only MASK_general==FALSE probes to drop unreliable probes. R code line:
- probes = subset(probes, MASK_general==FALSE)
- Dr Wanding Zhou's updated EPIC array annotations (github)
- More up to date annotations, including an updated MASK_general column
- Fixes redundant gene symbols
- Files available in GRCh37 (hg19) and GRCh38 (hg38) versions of the human genome
- To keep only autosomal probes, drop chromosomes X, Y, MT, and NA.
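The two filters above (masking and autosomes) can be combined as in this sketch. The MASK_general column is the one named above; the chromosome column name chrm, the "chr" prefix, and all rows in the toy probes data frame are assumptions for illustration, so adjust them to match the annotation file you download.

```r
# Toy annotation table; the chromosome column name ("chrm") and all values
# are assumptions, not copied from a real file.
probes <- data.frame(
  probeID = c("cg00000029", "cg00000108", "cg00000109",
              "cg00000165", "cg00000236", "cg00000289"),
  chrm = c("chr16", "chrX", "chr3", "chrY", "chrM", NA),
  MASK_general = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Step 1: drop probes flagged as unreliable.
probes <- subset(probes, MASK_general == FALSE)

# Step 2: keep autosomes only. Matching against chr1..chr22 removes
# X, Y, the mitochondrial chromosome, and NA in one step, because
# %in% returns FALSE for NA.
autosomes <- paste0("chr", 1:22)
probes_auto <- probes[probes$chrm %in% autosomes, , drop = FALSE]
probes_auto
```

Keeping an explicit list of chromosomes 1 to 22 has the same effect as dropping X, Y, MT, and NA, but it does not depend on how the mitochondrial chromosome is labelled.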
Workflow with minfi
- From .idat files to MethylSet - an example with the 450k array, an older array than the EPIC array.
- .idat files are the raw data from each sample
- MethylSet objects contain information about methylated and unmethylated signals
- SWAN = Subset-quantile within array normalization
- Input a MethylSet object, get back a MethylSet object
Video lectures
- "Statistics for Genomics: DNA methylation" [YouTube, Rafael Irizarry, July 2012]
- Explains the definition & identification of CpG islands using Hidden Markov Models
Analysis tutorials
- "A cross-package Bioconductor workflow for analysing methylation array data" by Jovana Maksimovic, Belinda Phipson and Alicia Oshlack
- 2: Differential methylation analysis
- 2.7: Probe-wise differential methylation analysis using a matrix of M values with limma
- 2.8: Differential methylation analysis of regions. Three methods discussed.
- charm::dmrFind - slow "unless you have the computer infrastructure to parallelise them, as they use permutations to assign significance"
- minfi::bumphunter (Jaffe et al. 2012; Aryee et al. 2014) - slow "unless you have the computer infrastructure to parallelise them, as they use permutations to assign significance"
- DMRcate::dmrcate (Peters et al. 2015) - "As it is based on limma, we can directly use the design and contMatrix we previously defined."
- 2.9: Customising visualisations of methylation data
- 3.1: Gene ontology testing
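Section 2.7 of the workflow above works on M values rather than beta values. As a minimal sketch, assuming the usual logit transform M = log2(beta / (1 - beta)): the function names beta_to_m and m_to_beta, the offset, and the beta values below are made up for illustration and are not taken from the workflow.

```r
# Toy beta matrix: rows are probes, columns are samples (values made up).
beta <- matrix(c(0.10, 0.50, 0.90,
                 0.25, 0.50, 0.75),
               nrow = 3,
               dimnames = list(c("cg00000029", "cg00000108", "cg00000109"),
                               c("s1", "s2")))

# Hypothetical helper: convert beta values to M values.
# The small offset keeps beta away from exactly 0 or 1, where the
# transform would return -Inf or Inf.
beta_to_m <- function(beta, offset = 1e-6) {
  beta <- pmin(pmax(beta, offset), 1 - offset)
  log2(beta / (1 - beta))
}

# Hypothetical helper: convert M values back to beta values.
m_to_beta <- function(m) {
  2^m / (2^m + 1)
}

M <- beta_to_m(beta)
beta_back <- m_to_beta(M)
```

A beta value of 0.5 maps to an M value of 0, and the transform stretches values near 0 and 1, which is why M values are often preferred for linear modelling.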
Interpreting Quantile-Quantile (Q-Q) plots
- How to describe the shapes of different QQ plots
- What QQ plot shapes tell us about data distributions [pdf]
- Comparing histograms vs QQ plots to show skewed data
- Inflation in QQ plots is a problem for the EPIC array and for methylation data in general, because DNA methylation array data do not fully satisfy the independence and normal distribution assumptions behind the expected QQ line. Citation: Mansell et al. (2019) Guidance for DNA methylation studies: statistical insights from the Illumina EPIC array. BMC Genomics. 20:399
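One common way to put a number on inflation is the genomic inflation factor (lambda), the median observed chi-square statistic divided by the median expected under the null. This is a sketch of that standard calculation, not a method taken from the citation above; the function name inflation_factor and the p-values are made up for illustration.

```r
# Hypothetical helper: genomic inflation factor from a vector of p-values.
inflation_factor <- function(p) {
  p <- p[!is.na(p)]
  # Convert p-values to 1-df chi-square statistics.
  chisq <- qchisq(p, df = 1, lower.tail = FALSE)
  # Ratio of observed median to the null median.
  median(chisq) / qchisq(0.5, df = 1)
}

# Evenly spaced p-values behave like a well-calibrated null: lambda near 1.
p_null <- ppoints(1001)
lambda_null <- inflation_factor(p_null)

# Squaring pushes p-values toward 0, mimicking inflation: lambda above 1.
p_inflated <- p_null^2
lambda_inflated <- inflation_factor(p_inflated)

# Coordinates for a QQ plot of the inflated p-values.
expected <- -log10(ppoints(length(p_inflated)))
observed <- -log10(sort(p_inflated))
# plot(expected, observed); abline(0, 1)
```

A lambda close to 1 means the bulk of the QQ plot follows the diagonal, while values well above 1 correspond to the early upward departure described above.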
Useful websites
- UCSC Genome Browser
- UCSC Genome Browser - Table Browser - used to pull information for lists of genes or regions
- LocusZoom - Create Plots of Genetic Data
-----------------------------------------------
Last updated April 29, 2024