Little things I re-use often.
Timestamp code
It uses date formats from R.
timenow = function() {
time1 = format(Sys.time(),"%Y-%m-%d") #makes a date stamp
time2 = format(Sys.time(),"%H%M%p") #makes a time stamp
x = paste0(time1,"-",time2)
return(x)
} #/end function timenowtimenow()
Example output:
"2021-05-14-0825AM"
Calculate means for a dataframe from multiple columns
df$mean.of.three.groups.means = base::rowMeans(df[, c("mean.group1", "mean.group2", "mean.group3")], na.rm=TRUE)
knitr header to automatically set the date and format with a floating table of contents:
The indentation and new lines after "output" are important for knitr to run properly.
Summary columns for a matrix (min, mean, max, stdev, variance)
Example: selecting an expression threshold for RNA-seq
The example below uses a TPM matrix with expression values for all samples and all genes from RNA-sequencing. I want summary information for female and male samples for chromosome Y genes. The female samples don't have chromosome Y (confirmed by karyotype tests) so in theory their chromosome Y signal should be zero. However, sequencing is not that clean. There is some low level signal ("sequencing noise"). By looking at the TPM values of chromosome Y genes in females, I can identify a TPM threshold to drop sequencing noise, and then assume that anything above that threshold in any chromosome is real signal ("expressed genes").
Select only chromosome Y genes from a data frame with multiple columns describing all genes (including "chr" and "ensemblID"). Pseudoautosomal genes were removed ahead of time. The pseudoautosomal regions of chromosome Y can't be distinguished from chromosome X, so they aren't useful for selecting an expression threshold.
chrY = subset(allgenes.annotated, chr=="Y")
TPM matrix subsets: rows = ensembl gene IDs, columns = sample IDs
tpm.females.Y = tpm[rownames(tpm) %in% chrY$ensemblID, female.patients]
tpm.males.Y = tpm[rownames(tpm) %in% chrY$ensemblID, male.patients]
Create a summary file
## checkpoint before making dataframe, want TRUE
all(rownames(tpm.females.Y) == rownames(tpm.males.Y))
## make summary dataframe
tpm.Y.summary = data.frame(
row.names=rownames(tpm.females.Y), ## ensembl transcript ids
check.rows = TRUE, ## check for consistency of length and names
stringsAsFactors = FALSE,
## Females group
min.Female.TPM = apply(tpm.females.Y, 1, FUN=min),
mean.Female.TPM = rowMeans(tpm.females.Y, na.rm=T),
max.Female.TPM = apply(tpm.females.Y, 1, FUN=max),
stdev.Female.TPM = apply(tpm.females.Y, 1, FUN=function(x){stats::sd(x, na.rm=TRUE)}),
variance.Female.TPM = apply(tpm.females.Y, 1, FUN=function(x){stats::var(x, na.rm=TRUE)}),
## Males group
min.Male.TPM = apply(tpm.males.Y, 1, FUN=min),
mean.Male.TPM = rowMeans(tpm.males.Y, na.rm=T),
max.Male.TPM = apply(tpm.males.Y, 1, FUN=max),
stdev.Male.TPM = apply(tpm.males.Y, 1, FUN=function(x){stats::sd(x, na.rm=TRUE)}),
variance.Male.TPM = apply(tpm.males.Y, 1, FUN=function(x){stats::var(x, na.rm=TRUE)})
) ##/data.frame
dim(tpm.Y.summary)
It is necessary to use apply() if you want to apply a function to either rows or columns in a dataset or matrix, where 1=rows, 2=columns.
- This produces the minimum value per row (per gene) from all columns: apply(data, 1, FUN=min)
- This produces the minimum value in the whole dataset: min(data)
- na.rm=TRUE tells R to ignore any missing values.
Install all dependencies for a package
Example for ggplot2:
deps = tools::package_dependencies("ggplot2", recursive = TRUE)$ggplot2for (dep in deps) {try(install.packages(dep))}
[Source]
Fix error "unable to open connection to X11 display"
This worked for me:
options(bitmapType='cairo')
You can test that it worked by testing a quick plot:
x = 1:5
y = 5*x
plot(x,y)
You can also see if X11 is FALSE by typing:
capabilities()
Split GENCODE's version of the ensembl IDs and identify pseudoautosomal genes
GENCODE marks genes that map to the pseudoautosomal regions of chromosome Y with a prefix "ENSGR" and assigns the regular "ENSG0" to the chromosome X version. The sequences are the same. Ensembl.org does not recognize the ENSGR version since its ID assignments must be for unique sequences (GENCODE is human annotated).
## split "ENSG00000000003.14_TSPAN6" into "ENSG00000000003.14" and "TSPAN6", separator="_"
print("Split the column names, e.g. ENSG00000000003.14_TSPAN6")
df <- df0 %>% tidyr::separate(Ensembl_full, c("Ensembl_ID_version", "Gene"),
sep="_", extra="merge", remove=FALSE)
## show what happens if we have extra separators (row indexes taken from the R warning)
#print(df[c(17980,18549,18758,19676,22380),c(1:6,ncol(df)-1,ncol(df))])
## ENSG00000223695.1 RP4-633O19__A.1 (second part doesn't get separated)
## split Ensembl Gene ID to remove version decimal, e.g. "ENSG00000223695.1", remove ".1"
print("Make a clean copy of the Ensembl Gene ID, e.g. remove .1 from ENSG00000223695.1")
df <- df %>% tidyr::separate(Ensembl_ID_version, c("GENCODE_ensembl", NA),
sep="\\.", extra="warn", remove=TRUE)
## identify pseudoautosomal X/Y genes ------------------------------
pseudo.ensemblY = df[grepl("ENSGR", df$GENCODE_ensembl),]$GENCODE_ensembl
pseudo.ensemblX = gsub("ENSGR", "ENSG0", pseudo.ensemblY)
## create column for X and Y genes in the pseudoautosomal region
df$pseudoautosome = df$GENCODE_ensembl %in% c(pseudo.ensemblX, pseudo.ensemblY)
print("Number of rows with pseudoautosome genes? Expect 48")
print(table(df$pseudoautosome))
# FALSE TRUE
# 25264 48
Last updated December 2021
No comments:
Post a Comment