Notes from Tania: DNA methylation: Illumina methylation array annotation sources

Here I list available resources and annotation files for Illumina methylation arrays, which measure DNA methylation at cytosine nucleotides.

There are three different probeType column values: "cg", "ch", "rs"

"cg" - CpG sites, the most common DNA methylation sites
"ch" - CpH sites (where the C base is followed by something other than a G base, H=A,T,C). Rare methylation sites. They are slightly more common in stem cells, but still not as common as CpG sites.
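"rs" - probes that target single nucleotide polymorphisms (SNPs) rather than methylation sites. They are named by SNP rs ID and are mainly useful for checking sample identity. These should be excluded from most methylation analyses.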
If you don't have the probeType column in your data, you can identify these sites by their probeID prefixes. For example, in R, grepl("^cg", df$probeID) matches the regular expression "^cg" against each text value in column df$probeID and returns a logical vector of TRUE/FALSE values, one per row. The caret symbol (^) in "^cg" means that the match must be at the beginning of the text to return TRUE. An example probeID that would return TRUE is "cg07881041".
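As a minimal sketch of the prefix matching described above, the snippet below rebuilds a probeType column from a small toy data frame. Only "cg07881041" comes from these notes. The other two probe IDs are made up for illustration, and the object names (df, is_cg, df_cg) are placeholders.

```r
# Toy probe IDs: "cg07881041" is the example from the text; the "ch" and "rs"
# IDs are invented placeholders and may not exist on any array.
df <- data.frame(
  probeID = c("cg07881041", "ch.1.0000001F", "rs0000001"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where the ID starts with "cg"
is_cg <- grepl("^cg", df$probeID)

# Rebuild a probeType column from the first two characters of each ID
df$probeType <- substr(df$probeID, 1, 2)

# Keep CpG probes only
df_cg <- df[is_cg, , drop = FALSE]

# Count probes of each type
table(df$probeType)
```

The same logical vector can be negated, for example df[!grepl("^rs", df$probeID), ], to drop the SNP probes instead.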

Official Illumina manifest files for methylation arrays

A "manifest" is a list of all items (sometimes samples, in this case methylation probes) and information about them. Illumina provides manifests for each of their methylation arrays with information such as probeID, probe type, probe chemistry details, chromosome position, and associated genes.

Infinium MethylationEPIC BeadChip (>860k sites)

Annotation files (Infinium MethylationEPIC Product Files, get the "manifest" files)
Infinium MethylationEPIC v1.0 B5 Manifest File (CSV Format), Mar 13, 2020
NCBI36 (hg18), GRCh37 (hg19), and GRCh38 (hg38) human reference genomes
See annotation website for pdf explaining each annotation column

Illumina HumanMethylation450K BeadChip (450k sites)

Annotation files (Infinium HumanMethylation450K v1.2 Product Files, get the "manifest" files)
HumanMethylation450K v1.2 Manifest File (CSV Format), May 23, 2013
NCBI36 (hg18) and GRCh37 (hg19) human reference genomes

Infinium HumanMethylation27 BeadChip (27k sites)

Annotation spreadsheet (Excel spreadsheet), Aug 27, 2013
Uses the NCBI36 (hg18) human reference genome

Programming tip: how to import an official Illumina manifest into R quickly

## 450k manifest
#file.manifest = "humanmethylation450_15017482_v1-2.csv"  

## EPIC manifest
file.manifest = "infinium-methylationepic-v-1-0-b5-manifest-file.csv.gz"  

file.exists(file.manifest)

## [1] TRUE

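## skip=7 skips the header block above the column names in the EPIC manifest
annot = data.table::fread(input = file.manifest,
                          header = TRUE,
                          skip = 7,
                          strip.white = TRUE,
                          na.strings = c("", "NA"),  # treat empty cells and "NA" as missing
                          data.table = FALSE)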

I highly recommend using the R package data.table and its function fread when loading DNA methylation array files. It is much faster than the base read.csv function for spreadsheets with >450,000 rows such as these.

I set the argument data.table=FALSE so that the function returns a data frame instead of a data.table object. I prefer working with either data frames or matrices in R.

dim(annot)

## [1] 865918     52

colnames(annot)

##  [1] "IlmnID"                               
##  [2] "Name"                                 
##  [3] "AddressA_ID"                          
##  [4] "AlleleA_ProbeSeq"                     
##  [5] "AddressB_ID"                          
##  [6] "AlleleB_ProbeSeq"                     
##  [7] "Infinium_Design_Type"                 
##  [8] "Next_Base"                            
##  [9] "Color_Channel"                        
## [10] "Forward_Sequence"                     
## [11] "Genome_Build"                         
## [12] "CHR"                                  
## [13] "MAPINFO"                              
## [14] "SourceSeq"                            
## [15] "Strand"                               
## [16] "UCSC_RefGene_Name"                    
## [17] "UCSC_RefGene_Accession"               
## [18] "UCSC_RefGene_Group"                   
## [19] "UCSC_CpG_Islands_Name"                
## [20] "Relation_to_UCSC_CpG_Island"          
## [21] "Phantom4_Enhancers"                   
## [22] "Phantom5_Enhancers"                   
## [23] "DMR"                                  
## [24] "450k_Enhancer"                        
## [25] "HMM_Island"                           
## [26] "Regulatory_Feature_Name"              
## [27] "Regulatory_Feature_Group"             
## [28] "GencodeBasicV12_NAME"                 
## [29] "GencodeBasicV12_Accession"            
## [30] "GencodeBasicV12_Group"                
## [31] "GencodeCompV12_NAME"                  
## [32] "GencodeCompV12_Accession"             
## [33] "GencodeCompV12_Group"                 
## [34] "DNase_Hypersensitivity_NAME"          
## [35] "DNase_Hypersensitivity_Evidence_Count"
## [36] "OpenChromatin_NAME"                   
## [37] "OpenChromatin_Evidence_Count"         
## [38] "TFBS_NAME"                            
## [39] "TFBS_Evidence_Count"                  
## [40] "Methyl27_Loci"                        
## [41] "Methyl450_Loci"                       
## [42] "Chromosome_36"                        
## [43] "Coordinate_36"                        
## [44] "SNP_ID"                               
## [45] "SNP_DISTANCE"                         
## [46] "SNP_MinorAlleleFrequency"             
## [47] "Random_Loci"                          
## [48] "MFG_Change_Flagged"                   
## [49] "CHR_hg38"                             
## [50] "Start_hg38"                           
## [51] "End_hg38"                             
## [52] "Strand_hg38"

If you are working with the GRCh37 (hg19) human reference genome, use columns "CHR" and "MAPINFO" for probe position when making Manhattan plots. "MAPINFO" plays the role of a "Start_hg19" column.
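A small sketch of picking the position columns by genome build follows. The helper name probe_positions and all values in annot_toy are invented for illustration. Only the column names come from the manifest listing above. The "chr" prefix on CHR_hg38 is an assumption, and it is worth checking Illumina's column documentation for whether Start_hg38 uses the same coordinate convention as MAPINFO.

```r
# Toy stand-in for a few manifest rows; every value here is invented.
annot_toy <- data.frame(
  IlmnID     = c("cg00000001", "cg00000002"),
  CHR        = c("1", "X"),
  MAPINFO    = c(1000L, 2000L),
  CHR_hg38   = c("chr1", "chrX"),   # assumed "chr" prefix
  Start_hg38 = c(1500L, 2500L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return probe ID, chromosome, and position for one build
probe_positions <- function(annot, build = c("hg19", "hg38")) {
  build <- match.arg(build)
  chr_col <- if (build == "hg19") "CHR" else "CHR_hg38"
  pos_col <- if (build == "hg19") "MAPINFO" else "Start_hg38"
  data.frame(
    probeID = annot$IlmnID,
    chr     = sub("^chr", "", annot[[chr_col]]),  # harmonise chromosome labels
    pos     = annot[[pos_col]],
    stringsAsFactors = FALSE
  )
}

pos19 <- probe_positions(annot_toy, "hg19")
pos38 <- probe_positions(annot_toy, "hg38")
```

Keeping the build choice in one function makes it harder to mix hg19 chromosomes with hg38 positions by accident when preparing plot input.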

Additional resources for the EPIC array

Zhou W, Laird PW and Shen H. Comprehensive characterization, annotation and innovative use of Infinium DNA Methylation BeadChip probes. Nucleic Acids Research, 2017

Discusses the chemistry of each probe
Identifies probes that are unreliable due to various masking issues
Provides an EPIC array annotation file with a column summarizing masking issues. Keep only MASK_general==FALSE probes to drop unreliable probes. R code line:

probes = subset(probes, MASK_general==FALSE)

Dr Wanding Zhou's updated EPIC array annotations (github)

More up to date annotations, including an updated MASK_general column
Fixes redundant gene symbols
Files available in GRCh37 (hg19) and GRCh38 (hg38) versions of the human genome
To keep only autosomal probes, drop probes on chromosomes X, Y, and MT, and probes with a missing (NA) chromosome.
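The two filters above (masking and autosomes) can be combined in one step. This is a sketch on an invented toy table: the column name chrm, the "chr" prefix, and the probe IDs are assumptions, so check the actual column names in the annotation file you download. Only MASK_general is taken from these notes.

```r
# Toy annotation table; IDs, chromosomes, and mask flags are invented.
probes <- data.frame(
  probeID      = c("cg00000001", "cg00000002", "cg00000003",
                   "cg00000004", "cg00000005"),
  chrm         = c("chr1", "chrX", "chr22", NA, "chrM"),  # hypothetical column
  MASK_general = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Autosomes are chromosomes 1 to 22
autosomes <- paste0("chr", 1:22)

# %in% returns FALSE for NA, so probes with a missing chromosome or a
# missing mask flag are dropped along with X, Y, and MT probes
keep <- probes$chrm %in% autosomes & probes$MASK_general %in% FALSE

probes_clean <- probes[keep, , drop = FALSE]
```

Using %in% rather than == avoids NA rows appearing in the result when either column has missing values.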

Workflow with minfi

From .idat files to MethylSet - an example with the 450k array, an older array than the EPIC array.

.idat files are the raw data from each sample
MethylSet objects contain information about methylated and unmethylated signals
SWAN = Subset-quantile within array normalization

Input a MethylSet object, get back a MethylSet object

Video lectures

"Statistics for Genomics: DNA methylation" [YouTube, Rafael Irizarry, July 2012]

Explains the definition & identification of CpG islands using Hidden Markov Models

Analysis tutorials

"A cross-package Bioconductor workflow for analysing methylation array data" by Jovana Maksimovic*, Belinda Phipson and Alicia Oshlack

2: Differential methylation analysis
2.7: Probe-wise differential methylation analysis using a matrix of M values with limma
2.8: Differential methylation analysis of regions. Three methods discussed.

charm::dmrFind - slow "unless you have the computer infrastructure to parallelise them, as they use permutations to assign significance"
minfi::bumphunter (Jaffe et al. 2012; Aryee et al. 2014) - slow "unless you have the computer infrastructure to parallelise them, as they use permutations to assign significance"
DMRcate::dmrcate (Peters et al. 2015) - "As it is based on limma, we can directly use the design and contMatrix we previously defined."

2.9: Customising visualisations of methylation data
3.1: Gene ontology testing
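Section 2.7 of the tutorial works on a matrix of M values. As a rough sketch, M values are the log2 ratio of methylated to unmethylated signal, which can be written in terms of beta values as log2(beta / (1 - beta)). The function name beta_to_m, the offset argument, and the toy matrix below are my own placeholders, not code from the tutorial.

```r
# Convert beta values (proportion methylated, 0 to 1) to M values.
# The offset is an assumed safeguard that keeps the result finite
# when a beta value is exactly 0 or 1.
beta_to_m <- function(beta, offset = 1e-6) {
  beta <- pmin(pmax(beta, offset), 1 - offset)
  log2(beta / (1 - beta))
}

# Toy beta matrix: rows are probes, columns are samples (invented values)
beta <- matrix(c(0.5, 0.8, 0.2, 0.5), nrow = 2,
               dimnames = list(c("cg00000001", "cg00000002"),
                               c("sample1", "sample2")))

m <- beta_to_m(beta)
```

A beta of 0.5 maps to an M value of 0, and values above or below 0.5 map to positive or negative M values respectively.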

Interpreting Quantile-Quantile (Q-Q) plots

How to describe the shapes of different QQ plots
What QQ plot shapes tell us about data distributions [pdf]
Comparing histograms vs QQ plots to show skewed data
Inflation in QQ plots is a problem for the EPIC array and methylation data in general. DNA methylation array data doesn't truly satisfy the data independence and normal distribution assumptions for a perfect QQ plot. Citation: Mansell et al. (2019) Guidance for DNA methylation studies: statistical insights from the Illumina EPIC array. BMC Genomics. 20:399
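One common way to put a number on QQ plot inflation is the genomic inflation factor (lambda): the median observed chi-square statistic divided by the median expected under the null. The sketch below is a generic calculation, not necessarily the approach used by Mansell et al. The function name inflation_factor and the example p-values are made up for illustration.

```r
# Genomic inflation factor from a vector of p-values.
# Each p-value is converted to a 1-df chi-square statistic; lambda near 1
# suggests no inflation, values well above 1 suggest inflated test statistics.
inflation_factor <- function(p) {
  p <- p[!is.na(p)]
  chisq <- qchisq(p, df = 1, lower.tail = FALSE)
  median(chisq) / qchisq(0.5, df = 1)
}

# Evenly spaced p-values stand in for a well-behaved null distribution
p_null <- ppoints(1000)
lambda_null <- inflation_factor(p_null)

# Squaring pushes p-values toward 0 to mimic an inflated analysis
p_inflated <- p_null^2
lambda_inflated <- inflation_factor(p_inflated)
```

For the QQ plot itself, the expected values are -log10(ppoints(length(p))) and the observed values are -log10(sort(p)).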

Useful websites

UCSC Genome Browser
UCSC Genome Browser - Table Browser - used to pull information for lists of genes or regions
LocusZoom - Create Plots of Genetic Data

-----------------------------------------------

Last updated April 29, 2024


January 5, 2021
