May 14, 2021

R code snippets from Tania (in progress)

Little things I re-use often.

Timestamp code


timenow = function() {
time1 = format(Sys.time(),"%Y-%m-%d") #makes a date stamp
time2 = format(Sys.time(),"%H%M%p") #makes a time stamp
x = paste0(time1,"-",time2)
return(x)
} #/end function timenow

timenow()

Example output: 

"2021-05-14-0825AM"


Calculate means for a dataframe from multiple columns

df$mean.of.three.groups.means = base::rowMeans(df[, c("mean.group1", "mean.group2", "mean.group3")], na.rm=TRUE)

 

knitr header to automatically set the date and format with a floating table of contents:




The single quotes in line 4 are from the [~ `] key above the Tab key, not the [" '] key on your keyboard.

The indentation and new lines after "output" are important for knitr to run properly.


Summary columns for a matrix (min, mean, max, stdev, variance)

Example: selecting an expression threshold for RNA-seq

The example below uses a TPM matrix with expression values for all samples and all genes from RNA-sequencing. I want summary information for female and male samples for chromosome Y genes. The female samples don't have chromosome Y (confirmed by karyotype tests) so in theory their chromosome Y signal should be zero. However, sequencing is not that clean. There is some low level signal ("sequencing noise"). By looking at the TPM values of chromosome Y genes in females, I can identify a TPM threshold to drop sequencing noise, and then assume that anything above that threshold in any chromosome is real signal ("expressed genes").    

Select only chromosome Y genes from a data frame with multiple columns describing all genes (including "chr" and "ensemblID"). Pseudoautosomal genes were removed ahead of time. The pseudoautosomal regions of chromosome Y can't be distinguished from chromosome X, so they aren't useful for selecting an expression threshold.
chrY = subset(allgenes.annotated, chr=="Y")

TPM matrix subsets: rows = ensembl gene IDs, columns = sample IDs
tpm.females.Y = tpm[rownames(tpm) %in% chrY$ensemblID, female.patients]
tpm.males.Y = tpm[rownames(tpm) %in% chrY$ensemblID, male.patients]

Create a summary file
## checkpoint before making dataframe, want TRUE
all(rownames(tpm.females.Y) == rownames(tpm.males.Y))
## make summary dataframe
tpm.Y.summary = data.frame(
  row.names=rownames(tpm.females.Y),  ## ensembl transcript ids
  check.rows = TRUE, ## check for consistency of length and names
  stringsAsFactors = FALSE,
                
  ## Females group        
  min.Female.TPM = apply(tpm.females.Y, 1,  FUN=min),
  mean.Female.TPM = rowMeans(tpm.females.Y, na.rm=T),
  max.Female.TPM = apply(tpm.females.Y, 1,  FUN=max),
  
  stdev.Female.TPM = apply(tpm.females.Y, 1,    FUN=function(x){stats::sd(x, na.rm=TRUE)}),
  variance.Female.TPM = apply(tpm.females.Y, 1, FUN=function(x){stats::var(x, na.rm=TRUE)}),
  
  
  ## Males group
  min.Male.TPM = apply(tpm.males.Y, 1,  FUN=min),
  mean.Male.TPM = rowMeans(tpm.males.Y, na.rm=T),
  max.Male.TPM = apply(tpm.males.Y, 1,  FUN=max),
  
  stdev.Male.TPM = apply(tpm.males.Y, 1,    FUN=function(x){stats::sd(x, na.rm=TRUE)}),
  variance.Male.TPM = apply(tpm.males.Y, 1, FUN=function(x){stats::var(x, na.rm=TRUE)})
  )  ##/data.frame

dim(tpm.Y.summary)

It is necessary to use apply() if you want to apply a function to either rows or columns in a dataset or matrix, where 1=rows, 2=columns.

  • This produces the minimum value per row (per gene) from all columns: apply(data, 1, FUN=min
  • This produces the minimum value in the whole dataset: min(data)
  • na.rm=TRUE tells R to ignore any missing values.


Install all dependencies for a package

Example for ggplot2:

deps = tools::package_dependencies("ggplot2", recursive = TRUE)$ggplot2
for (dep in deps) {
  try(install.packages(dep))
}



Fix error "unable to open connection to X11 display"

This worked for me:
options(bitmapType='cairo')

You can test that it worked by testing a quick plot:
x = 1:5
y = 5*x
plot(x,y)

You can also see if X11 is FALSE by typing:
capabilities()


Split GENCODE's version of the ensembl IDs and identify pseudoautosomal genes

GENCODE marks genes that map to the pseudoautosomal regions of chromosome Y with a prefix "ENSGR" and assigns the regular "ENSG0" to the chromosome X version. The sequences are the same. Ensembl.org does not recognize the ENSGR version since its ID assignments must be for unique sequences (GENCODE is human annotated).

## split "ENSG00000000003.14_TSPAN6" into "ENSG00000000003.14" and "TSPAN6", separator="_"
print("Split the column names, e.g. ENSG00000000003.14_TSPAN6")
df <- df0 %>% tidyr::separate(Ensembl_full, c("Ensembl_ID_version", "Gene"), 
                              sep="_", extra="merge", remove=FALSE)

  ## show what happens if we have extra separators (row indexes taken from the R warning)
  #print(df[c(17980,18549,18758,19676,22380),c(1:6,ncol(df)-1,ncol(df))])   
      ## ENSG00000223695.1 RP4-633O19__A.1 (second part doesn't get separated)
  
## split Ensembl Gene ID to remove version decimal, e.g. "ENSG00000223695.1", remove ".1"
print("Make a clean copy of the Ensembl Gene ID, e.g. remove .1 from ENSG00000223695.1")
df <- df %>% tidyr::separate(Ensembl_ID_version, c("GENCODE_ensembl", NA), 
                             sep="\\.", extra="warn", remove=TRUE)     
## identify pseudoautosomal X/Y genes ------------------------------
pseudo.ensemblY = df[grepl("ENSGR", df$GENCODE_ensembl),]$GENCODE_ensembl
pseudo.ensemblX = gsub("ENSGR", "ENSG0", pseudo.ensemblY)

## create column for X and Y genes in the pseudoautosomal region
df$pseudoautosome = df$GENCODE_ensembl %in% c(pseudo.ensemblX, pseudo.ensemblY)
print("Number of rows with pseudoautosome genes? Expect 48")
print(table(df$pseudoautosome))
        # FALSE  TRUE 
        # 25264    48 



Last updated December 2021

No comments:

Post a Comment

How to format final figures for publication

General figure guidelines File types and file sizes TIFF images with LZW compression to reduce the file size PDF files for vector images Not...