Notes from Tania: R code snippets from Tania (in progress)

Little things I re-use often.

Timestamp code

timenow = function() {
time1 = format(Sys.time(),"%Y-%m-%d") #makes a date stamp
time2 = format(Sys.time(),"%H%M%p") #makes a time stamp
x = paste0(time1,"-",time2)
return(x)
} #/end function timenow
timenow()

Example output:

"2021-05-14-0825AM"

Calculate means for a dataframe from multiple columns

df$mean.of.three.groups.means = base::rowMeans(df[, c("mean.group1", "mean.group2", "mean.group3")], na.rm=TRUE)

knitr header to automatically set the date and format with a floating table of contents:

The single quotes in line 4 are from the [~ `] key above the Tab key, not the [" '] key on your keyboard.

The indentation and new lines after "output" are important for knitr to run properly.

Summary columns for a matrix (min, mean, max, stdev, variance)

Example: selecting an expression threshold for RNA-seq

The example below uses a TPM matrix with expression values for all samples and all genes from RNA-sequencing. I want summary information for female and male samples for chromosome Y genes. The female samples don't have chromosome Y (confirmed by karyotype tests) so in theory their chromosome Y signal should be zero. However, sequencing is not that clean. There is some low level signal ("sequencing noise"). By looking at the TPM values of chromosome Y genes in females, I can identify a TPM threshold to drop sequencing noise, and then assume that anything above that threshold in any chromosome is real signal ("expressed genes").

Select only chromosome Y genes from a data frame with multiple columns describing all genes (including "chr" and "ensemblID"). Pseudoautosomal genes were removed ahead of time. The pseudoautosomal regions of chromosome Y can't be distinguished from chromosome X, so they aren't useful for selecting an expression threshold.

chrY = subset(allgenes.annotated, chr=="Y")

TPM matrix subsets: rows = ensembl gene IDs, columns = sample IDs

tpm.females.Y = tpm[rownames(tpm) %in% chrY$ensemblID, female.patients]
tpm.males.Y = tpm[rownames(tpm) %in% chrY$ensemblID, male.patients]

Create a summary file

## checkpoint before making dataframe, want TRUE
all(rownames(tpm.females.Y) == rownames(tpm.males.Y))

## make summary dataframe
tpm.Y.summary = data.frame(
  row.names=rownames(tpm.females.Y),  ## ensembl transcript ids
  check.rows = TRUE, ## check for consistency of length and names
  stringsAsFactors = FALSE,
                
  ## Females group        
  min.Female.TPM = apply(tpm.females.Y, 1,  FUN=min),
  mean.Female.TPM = rowMeans(tpm.females.Y, na.rm=T),
  max.Female.TPM = apply(tpm.females.Y, 1,  FUN=max),
  
  stdev.Female.TPM = apply(tpm.females.Y, 1,    FUN=function(x){stats::sd(x, na.rm=TRUE)}),
  variance.Female.TPM = apply(tpm.females.Y, 1, FUN=function(x){stats::var(x, na.rm=TRUE)}),
  
  
  ## Males group
  min.Male.TPM = apply(tpm.males.Y, 1,  FUN=min),
  mean.Male.TPM = rowMeans(tpm.males.Y, na.rm=T),
  max.Male.TPM = apply(tpm.males.Y, 1,  FUN=max),
  
  stdev.Male.TPM = apply(tpm.males.Y, 1,    FUN=function(x){stats::sd(x, na.rm=TRUE)}),
  variance.Male.TPM = apply(tpm.males.Y, 1, FUN=function(x){stats::var(x, na.rm=TRUE)})
  )  ##/data.frame

dim(tpm.Y.summary)

It is necessary to use apply() if you want to apply a function to either rows or columns in a dataset or matrix, where 1=rows, 2=columns.

This produces the minimum value per row (per gene) from all columns: apply(data, 1, FUN=min)
This produces the minimum value in the whole dataset: min(data)
na.rm=TRUE tells R to ignore any missing values.

Install all dependencies for a package

Example for ggplot2:

deps = tools::package_dependencies("ggplot2", recursive = TRUE)$ggplot2
for (dep in deps) {
try(install.packages(dep))
}

[Source]

Fix error "unable to open connection to X11 display"

This worked for me:

options(bitmapType='cairo')

You can test that it worked by testing a quick plot:

x = 1:5

y = 5*x

plot(x,y)

You can also see if X11 is FALSE by typing:

capabilities()

Split GENCODE's version of the ensembl IDs and identify pseudoautosomal genes

GENCODE marks genes that map to the pseudoautosomal regions of chromosome Y with a prefix "ENSGR" and assigns the regular "ENSG0" to the chromosome X version. The sequences are the same. Ensembl.org does not recognize the ENSGR version since its ID assignments must be for unique sequences (GENCODE is human annotated).

## split "ENSG00000000003.14_TSPAN6" into "ENSG00000000003.14" and "TSPAN6", separator="_"
print("Split the column names, e.g. ENSG00000000003.14_TSPAN6")
df <- df0 %>% tidyr::separate(Ensembl_full, c("Ensembl_ID_version", "Gene"), 
                              sep="_", extra="merge", remove=FALSE)

  ## show what happens if we have extra separators (row indexes taken from the R warning)
  #print(df[c(17980,18549,18758,19676,22380),c(1:6,ncol(df)-1,ncol(df))])   
      ## ENSG00000223695.1 RP4-633O19__A.1 (second part doesn't get separated)
  
## split Ensembl Gene ID to remove version decimal, e.g. "ENSG00000223695.1", remove ".1"
print("Make a clean copy of the Ensembl Gene ID, e.g. remove .1 from ENSG00000223695.1")
df <- df %>% tidyr::separate(Ensembl_ID_version, c("GENCODE_ensembl", NA), 
                             sep="\\.", extra="warn", remove=TRUE)

## identify pseudoautosomal X/Y genes ------------------------------
pseudo.ensemblY = df[grepl("ENSGR", df$GENCODE_ensembl),]$GENCODE_ensembl
pseudo.ensemblX = gsub("ENSGR", "ENSG0", pseudo.ensemblY)

## create column for X and Y genes in the pseudoautosomal region
df$pseudoautosome = df$GENCODE_ensembl %in% c(pseudo.ensemblX, pseudo.ensemblY)
print("Number of rows with pseudoautosome genes? Expect 48")
print(table(df$pseudoautosome))
        # FALSE  TRUE 
        # 25264    48

Last updated December 2021

Notes from Tania

May 14, 2021

R code snippets from Tania (in progress)