Notes from Tania: R programming pre-tutorial: differential expression analysis with DESeq2 and total RNA-seq data from Gonzalez et al 2018

Here, I throw data at you and help you practice differential expression analysis using the data from Gonzalez et al 2018 and tips for the official DESeq2 R package tutorial. This dataset is small enough that you can run it on a personal laptop with 8 GB of RAM and an average computer processor. Enjoy!

Prerequisites:

R programming: How to get started

R programming lesson #1: load data, subset, and write to a new file

Download un-normalized ("raw") gene counts from NCBI GEO:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109082

Supplementary file: "GSE109082_genecounts.txt.gz"
Download using the http link. You can read this file into R just like a csv spreadsheet with function df=read.csv(...)

By the way, you don't need to call your data df. That's just a standard example variable name for dataframes used in many R tutorials.

The counts data is a matrix of gene read counts per sample.

Rows = genes

Columns = samples

Here, the counts are already integers because this data comes from gene alignment.

Side note: some RNA-seq primary analysis uses "pseudo-alignment" methods (e.g. kallisto) and results in a counts matrix that has decimal numbers. You can't use that for DESeq2, which expects integers. In cases like that, you need to round the counts and convert to integers before continuing.

The genes are all labeled with Ensembl Gene IDs ("ENSG....") and you can look up more details that here: https://ensembl.org/

Download the sample manifest:

"Gonzalez2018_sample_info_for_DESeq2.xlsx" [download here]

Subset to the 39 samples with column "sexdiffs.39villi.FINAL"=="OK" for the sex differences analysis.

Alternatively, if I didn't give you the sample spreadsheet, you could download a different and uglier version from NCBI GEO directly using these instructions:

Go to the NCBI GEO page for your dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109082
Under section "Download family" near the bottom, click "Series Matrix File(s)"
Download file "GSE109082_series_matrix.txt.gz"
Open with software that can de-compress gzip files, e.g. PeaZip or 7-Zip. Extract the .txt file.
Right-click the .txt file and "Open with..." Excel software.
Scroll down until you see sample data. Here it starts at row 44 with "!Sample_title": "F01 rna-seq", "F02 rna-seq", "F03 rna-seq", etc. Very inconvenient but the information is there.
Scroll around until you find the information you need. For this sex differences analysis, that is row 54, "!Sample_characteristics_ch1", which has values such as "Sex: Female", "Sex: Female", etc.
All the information submitted during the NCBI GEO upload is there. It's just not pretty or particularly easy to use!

Alternatively 2, you can download NCBI GEO metadata for each sample using R package GEOquery. See this YouTube tutorial by chatomics: "How to Download GEO Metadata Easily".

Download the original citation:

Gonzalez, Tania L., Tianyanxin Sun, Alexander F. Koeppel, Bora Lee, Erica T. Wang, Charles R. Farber, Stephen S. Rich, Lauren W. Sundheimer, Rae A. Buttle, Yii-Der Ida Chen, Jerome I. Rotter, Stephen D. Turner, John Williams III, Mark O. Goodarzi, Pisarska MD. "Sex differences in the late first trimester human placenta transcriptome."Biol Sex Differ 2018 Jan 15. 9:4. [PMID: 29335024; PMCID: PMC5769539; open access]

There are three DESeq2 analyses performed:

Sex differences comparison with all samples - main results
Sex differences subanalysis with early CVS samples - supplemental
Sex differences subanalysis with late CVS samples - supplemental

Computer requirements:

Differential expression analysis uses counts tables, not raw RNA-seq files, so you actually don't need that much computer RAM. For small sample sizes such as this example (n=39 total), 8 GB RAM is plenty.

For larger datasets (over 60 samples), you will have an easier time with 16 GB RAM. You might still be able to use 8 GB RAM, but your computer will take much longer on some steps (e.g. 30 min vs 3 min) or you might get "out of memory" errors.

Some data transformations for visualization (making plots) require more RAM than the differential expression analysis itself. For example, variance stabilizing transformation (vst) needs much less computer memory than a regularized log (rlog) transformation. For learning purposes, just use vst when you follow along in the tutorial. Later, when you have access to more memory, you can decide which to use. I find that rlog helps highlight outliers better than vst, so it can be useful for initial quality control. Either one is fine, though.

Follow tutorial for DESeq2 differential expression analysis, with suggested changes:

I recommend the tutorial from the authors of the DESeq2 R package:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

However, the tutorial over-complicates the data input. You don't need the "tximport", "readr", "tximportData" or "pasilla" R packages. Skip those steps and import the data like this:

cts = as.matrix(read.csv("GSE109082_genecounts.txt.gz", sep="\t", header=TRUE, stringsAsFactors=FALSE))

coldata = read.csv("Gonzalez2018_sample_info_for_DESeq2.csv", skip=3, header=TRUE, stringsAsFactors=FALSE)

Subset coldata to only the 39 samples before continuing. You can do that with R or manually.

Follow starting at section "Count matrix input" after the part where you create the cts object (use my instructions above instead of theirs). When you make dds object, use column Sex instead of column condition. This column is directly coming from the coldata object which has the sample data.

Reminder that you can view the first 6 rows of coldata using function head(coldata) or open the whole spreadsheet as a new tab in RStudio using View(coldata)

Don't View(cts) since the counts matrix is much larger! Reminder that you can see the dimensions (number rows, number columns) of your datasets by using dim(cts) or dim(coldata) before deciding to view data.

Skip the other input sections of the tutorial. You only need to input data once.

During differential expression analysis with the dds object, set optional parameter betaPrior to TRUE to get more similar results to the publication. Gonzalez et al 2018 data was run with an older version of DESeq2 before they changed the size factor estimate calculation and introduced separate function lfcShrink. Thus, you won't get exactly the same results with the newer DESeq2 R package, but results will be closer to the publication if you do the following:

dds = DESeq(dds,
test="Wald",
betaPrior=TRUE)

This is a good reminder of why you should keep track of R package versions and variables used (even default variables) since these can change at a later date, affecting results and reproducibility.

At section "Note on factor levels" - reminder that you are using Sex instead of condition for the code to replicate the sex differences analysis in Gonzalez et al 2018. Just replace it everywhere they say condition. You also need to replace the example levels c("untreated","treated") with your actual levels in column Sex. Those are going to be "F" and "M". In the Gonzalez et al 2018 paper, all fold-changes are F/M, so ref="M" when you relevel. The reference group is the denominator for fold-change.

Skip section "Speed-up and parallelization thoughts"

Skip section "Independent hypothesis weighing"

You can do the rest. Try it out and see how it goes! Take your time and take notes in your code with #comments after the hashtag

Final tips

Make sure to double check your work, search for definitions for new terms, look through the Gonzalez et al 2018 paper if you need more sample information, etc. Check your fold-changes compared to mine to make sure that you correctly set your reference group. Since this is sex differences data, you can also directly check that Y-linked chromosome genes are upregulated in the male group as expected.

Due to a major update to the DESeq2 R package around 2018, results are overall similar but not exactly. A few borderline genes are no longer significant or have become significant when they weren't before. However, the top hits should be similar.

Your results will still have Ensembl Gene IDs but we can link those to more gene information later (gene name, chromosome, gene description, etc). Stephen Turner has an R package for gene annotation called annotables which is very useful. If he doesn't have the Ensembl release (genome version) you need, there is also R package biomaRt from Ensembl.

Notes from Tania

August 1, 2024

R programming pre-tutorial: differential expression analysis with DESeq2 and total RNA-seq data from Gonzalez et al 2018