January 5, 2021

DNA methylation: Illumina methylation array annotation sources

Here I list available resources and annotation files for Illumina methylation arrays, which measure DNA methylation at cytosine nucleotides.

There are three different probeType column values: "cg", "ch", "rs"

  • "cg" - CpG sites, the most common DNA methylation sites
  • "ch" - CpH sites (where the C base is followed by something other than a G base, H=A,T,C). Rare methylation sites. They are slightly more common in stem cells, but still not as common as CpG sites.
  • "rs" - probes that target single nucleotide polymorphisms (SNPs) rather than methylation sites. They are useful for checking sample identity, but should be excluded from most methylation analyses.
  • If you don't have the probeType column in your data, you can identify the probe type from the probeID prefix. For example, grepl("^cg", df$probeID) in R returns a logical vector (TRUE/FALSE) with one value per entry of df$probeID. The caret symbol (^) in the regular expression "^cg" anchors the match to the beginning of the text. An example probeID that would return TRUE is "cg07881041".
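As a minimal sketch of the prefix approach, the example below derives a probeType column from the first two characters of each probeID and keeps only the CpG probes. Apart from "cg07881041", the probe IDs are made up for illustration, and df is a toy data frame rather than a real annotation file.

```r
# Toy data frame; every probeID except "cg07881041" is invented for illustration.
df <- data.frame(
  probeID = c("cg07881041", "ch.1.0000001F", "rs0000001", "cg00000001"),
  stringsAsFactors = FALSE
)

# The first two characters of the probeID give the probe type ("cg", "ch", "rs").
df$probeType <- substr(df$probeID, 1, 2)

# Logical vector: TRUE where the probeID starts with "cg".
is_cg <- grepl("^cg", df$probeID)

# Keep only the CpG probes.
df_cg <- df[is_cg, , drop = FALSE]

# Count probes of each type.
print(table(df$probeType))
```

The same logical vector can be used to drop "rs" and "ch" probes from a matrix of methylation values, as long as its rows are in the same order as df.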

Official Illumina manifest files for methylation arrays

A "manifest" is a list of all items (sometimes samples, in this case methylation probes) and information about them. Illumina provides manifests for each of their methylation arrays with information such as probeID, probe type, probe chemistry details, chromosome position, and associated genes.  

Infinium MethylationEPIC BeadChip (>860k sites)
  • Annotation files (Infinium MethylationEPIC Product Files, get the "manifest" files)
  • Infinium MethylationEPIC v1.0 B5 Manifest File (CSV Format), Mar 13, 2020
  • NCBI36 (hg18), GRCh37 (hg19), and GRCh38 (hg38) human reference genomes
  • See annotation website for pdf explaining each annotation column

Infinium HumanMethylation450K BeadChip (>450k sites)
  • Annotation files (Infinium HumanMethylation450K v1.2 Product Files, get the "manifest" files)
  • HumanMethylation450K v1.2 Manifest File (CSV Format), May 23, 2013
  • NCBI36 (hg18) and GRCh37 (hg19) human reference genomes


Programming tip: how to import an official Illumina manifest into R quickly

## 450k manifest
#file.manifest = "humanmethylation450_15017482_v1-2.csv"  

## EPIC manifest
file.manifest = "infinium-methylationepic-v-1-0-b5-manifest-file.csv.gz"  

file.exists(file.manifest)
## [1] TRUE
annot = data.table::fread(input = file.manifest,
                          header=TRUE, 
                          skip=7,
                          strip.white = TRUE,
                          na.strings=c("","NA"),
                          data.table = FALSE)

I highly recommend the fread function from the R package data.table for loading DNA methylation array files. It is much faster than base R's read.csv for spreadsheets with >450,000 rows such as these.

I set the argument data.table=FALSE so that the function returns a data frame instead of a data.table object. I prefer working with either data frames or matrices in R.
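The skip=7 value depends on how many preamble lines sit above the header row, so it is worth checking for each manifest file. The sketch below locates the header row by searching for the first column name ("IlmnID") instead of hard-coding the number. It uses a tiny made-up file and base R only, so toy_lines, toy_file, and all of the values inside are invented for illustration.

```r
# Toy stand-in for a manifest: a few preamble lines, then the header row.
# The preamble text, probe IDs, and positions are invented for illustration.
toy_lines <- c(
  "Toy preamble line 1",
  "Toy preamble line 2",
  "[Assay]",
  "IlmnID,Name,CHR,MAPINFO",
  "cg00000001,cg00000001,1,100",
  "rs0000001,rs0000001,2,200"
)
toy_file <- tempfile(fileext = ".csv")
writeLines(toy_lines, toy_file)

# Look at the top of the file and find the line that starts with "IlmnID,".
header_line <- grep("^IlmnID,", readLines(toy_file, n = 50))[1]

# Skip every line before the header row.
toy_annot <- read.csv(toy_file,
                      skip = header_line - 1,
                      header = TRUE,
                      strip.white = TRUE,
                      na.strings = c("", "NA"),
                      stringsAsFactors = FALSE)

print(header_line)
print(toy_annot)
```

The computed header_line - 1 can be passed to fread as skip in the same way. fread also accepts a string for skip (for example skip="IlmnID"), which searches for that text and starts reading on the line that contains it.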

dim(annot)
## [1] 865918     52

colnames(annot)
##  [1] "IlmnID"                               
##  [2] "Name"                                 
##  [3] "AddressA_ID"                          
##  [4] "AlleleA_ProbeSeq"                     
##  [5] "AddressB_ID"                          
##  [6] "AlleleB_ProbeSeq"                     
##  [7] "Infinium_Design_Type"                 
##  [8] "Next_Base"                            
##  [9] "Color_Channel"                        
## [10] "Forward_Sequence"                     
## [11] "Genome_Build"                         
## [12] "CHR"                                  
## [13] "MAPINFO"                              
## [14] "SourceSeq"                            
## [15] "Strand"                               
## [16] "UCSC_RefGene_Name"                    
## [17] "UCSC_RefGene_Accession"               
## [18] "UCSC_RefGene_Group"                   
## [19] "UCSC_CpG_Islands_Name"                
## [20] "Relation_to_UCSC_CpG_Island"          
## [21] "Phantom4_Enhancers"                   
## [22] "Phantom5_Enhancers"                   
## [23] "DMR"                                  
## [24] "450k_Enhancer"                        
## [25] "HMM_Island"                           
## [26] "Regulatory_Feature_Name"              
## [27] "Regulatory_Feature_Group"             
## [28] "GencodeBasicV12_NAME"                 
## [29] "GencodeBasicV12_Accession"            
## [30] "GencodeBasicV12_Group"                
## [31] "GencodeCompV12_NAME"                  
## [32] "GencodeCompV12_Accession"             
## [33] "GencodeCompV12_Group"                 
## [34] "DNase_Hypersensitivity_NAME"          
## [35] "DNase_Hypersensitivity_Evidence_Count"
## [36] "OpenChromatin_NAME"                   
## [37] "OpenChromatin_Evidence_Count"         
## [38] "TFBS_NAME"                            
## [39] "TFBS_Evidence_Count"                  
## [40] "Methyl27_Loci"                        
## [41] "Methyl450_Loci"                       
## [42] "Chromosome_36"                        
## [43] "Coordinate_36"                        
## [44] "SNP_ID"                               
## [45] "SNP_DISTANCE"                         
## [46] "SNP_MinorAlleleFrequency"             
## [47] "Random_Loci"                          
## [48] "MFG_Change_Flagged"                   
## [49] "CHR_hg38"                             
## [50] "Start_hg38"                           
## [51] "End_hg38"                             
## [52] "Strand_hg38"

If you are working with the GRCh37 (hg19) human reference genome, use columns "CHR" and "MAPINFO" for probe positions when making Manhattan plots. "MAPINFO" holds the hg19 position; the manifest has no separate "Start_hg19" column.
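One way to keep the genome build explicit is a small helper that returns the chromosome and position columns for the build you ask for. This is only a sketch: probe_positions is a hypothetical function name, toy_annot holds made-up IDs and positions, and the "chr" prefix in CHR_hg38 is an assumption that the helper strips if present. Check Illumina's column documentation for the coordinate conventions of each column before plotting real data.

```r
# Toy annotation with the manifest's column names; IDs and positions are made up.
toy_annot <- data.frame(
  IlmnID     = c("cg00000001", "cg00000002", "ch.1.0000001F"),
  CHR        = c("19", "X", "1"),
  MAPINFO    = c(300L, 100L, 200L),
  CHR_hg38   = c("chr19", "chrX", "chr1"),   # "chr" prefix is an assumption
  Start_hg38 = c(310L, 110L, 210L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick the chromosome/position columns for one genome build.
probe_positions <- function(annot, build = c("hg19", "hg38")) {
  build <- match.arg(build)
  if (build == "hg19") {
    chr <- annot$CHR
    pos <- annot$MAPINFO
  } else {
    # Remove a leading "chr" so both builds use the same chromosome labels.
    chr <- sub("^chr", "", annot$CHR_hg38)
    pos <- annot$Start_hg38
  }
  data.frame(probeID = annot$IlmnID, chr = chr, pos = pos,
             stringsAsFactors = FALSE)
}

pos_hg19 <- probe_positions(toy_annot, "hg19")
pos_hg38 <- probe_positions(toy_annot, "hg38")
print(pos_hg19)
```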

Additional resources for the EPIC array

  • Zhou W, Laird PW and Shen H. Comprehensive characterization, annotation and innovative use of Infinium DNA Methylation BeadChip probes. Nucleic Acids Research, 2017.
    • Discusses the chemistry of each probe
    • Identifies probes that are unreliable due to various masking issues
    • Provides an EPIC array annotation file with a column summarizing masking issues. Keep only MASK_general==FALSE probes to drop unreliable probes. R code line:
      • probes = subset(probes, MASK_general==FALSE)
  • Dr Wanding Zhou's updated EPIC array annotations (github) 
    • More up to date annotations, including an updated MASK_general column
    • Fixes redundant gene symbols
    • Files available in GRCh37 (hg19) and GRCh38 (hg38) versions of the human genome
    • To subset to autosomes only, drop probes on chromosomes X, Y, and MT, and probes with a missing (NA) chromosome.
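The two filters above (MASK_general and autosomes) can be combined in one step. The sketch below uses a toy data frame: the probe IDs are made up, and chrm is a placeholder column name, so substitute whatever the chromosome column is called in the annotation file you downloaded. The helper strips a leading "chr" so that labels such as "chr1" and "1" are treated the same.

```r
# Toy probe annotation; probe IDs are made up and "chrm" is a placeholder column name.
toy_probes <- data.frame(
  probeID      = c("cg00000001", "cg00000002", "cg00000003",
                   "cg00000004", "cg00000005", "cg00000006"),
  chrm         = c("chr1", "chrX", "chr2", "chrM", NA, "chr22"),
  MASK_general = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Normalize chromosome labels, then list the autosomes as "1" to "22".
chr_label <- sub("^chr", "", toy_probes$chrm)
autosomes <- as.character(1:22)

# Keep probes that are not masked and sit on an autosome.
# %in% returns FALSE for NA, so probes with a missing chromosome are dropped too.
keep <- !is.na(toy_probes$MASK_general) &
  !toy_probes$MASK_general &
  chr_label %in% autosomes

toy_filtered <- toy_probes[keep, , drop = FALSE]
print(toy_filtered)
```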

Workflow with minfi

  • From .idat files to MethylSet - an example with the 450k array, an older array than the EPIC array.
    • .idat files are the raw data from each sample
    • MethylSet objects contain information about methylated and unmethylated signals
    • SWAN = Subset-quantile within array normalization
      • Input a MethylSet object, get back a MethylSet object
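The steps listed above could be sketched as a small wrapper function. This is a rough outline, not the code from the linked example: idat_to_swan and idat_dir are hypothetical names, and it assumes that the Bioconductor package minfi is installed and that idat_dir is a folder containing the paired Red/Grn .idat files for each sample.

```r
# Hypothetical helper: raw .idat files -> SWAN-normalized MethylSet.
idat_to_swan <- function(idat_dir) {
  if (!requireNamespace("minfi", quietly = TRUE)) {
    stop("Please install the Bioconductor package 'minfi' first.")
  }
  # Read the .idat files found in the folder into an RGChannelSet.
  rg_set <- minfi::read.metharray.exp(base = idat_dir)
  # Convert red/green intensities to methylated/unmethylated signals
  # without normalization; the result is a MethylSet.
  m_set <- minfi::preprocessRaw(rg_set)
  # SWAN normalization: MethylSet in, MethylSet out.
  minfi::preprocessSWAN(rg_set, mSet = m_set)
}
```

Calling idat_to_swan("path/to/idat_folder") would then return the normalized MethylSet, where the path is a placeholder for your own data folder.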

Video lectures


Analysis tutorials


Interpreting Quantile-Quantile (Q-Q) plots


Useful websites

-----------------------------------------------
Last updated April 29, 2024
