Notes from Tania: How to download raw sequencing data using NCBI GEO accessions

This BiteSize Bio YouTube tutorial is very helpful for understanding RNA-seq repositories and what data is stored at NCBI GEO versus the SRA.

For my walk-through example, I am using our placenta single cell sequencing published in 2020.

Data details

Tianyanxin Sun, Tania L Gonzalez, ..., Margareta D Pisarska. "Sexually dimorphic crosstalk at the maternal-fetal interface", Journal of Clinical Endocrinology & Metabolism, Volume 105, Issue 12, December 2020, Pages e4831–e4847.

Published: https://doi.org/10.1210/clinem/dgaa503

Preprint: https://www.biorxiv.org/content/10.1101/641118v1

Sequencing data deposited into NCBI GEO superseries GSE131875

GSE131696: Single cell RNA-sequencing of CVS samples

N=6 first trimester placenta, 3 female and 3 male
Tissue is from leftover donated chorionic villus sampling, a prenatal genetic diagnosis exam in late first trimester
All patients delivered full term live and healthy babies

How to get data from NCBI GEO and SRA

The NCBI GEO superseries includes two sequencing projects, one of which is not single cell sequencing.

Start at the NCBI GEO submission page for the actual sequencing you want. In this example, use the single cell sequencing:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131696

NCBI GEO contains metadata for the sequencing as well as processed files, including counts, which you can use for differential expression analysis with R packages such as DESeq2, edgeR, or limma. If that is all you need, download it from the very bottom of the NCBI GEO page, under "Supplementary file". The extension .gz indicates a gzip-compressed file (it is like .zip or .7z, but more common on Linux). RStudio can read .gz files directly on Windows and Mac OS, without any decompression needed.
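If you prefer Python over R, the same idea applies: a .gz file can be read without unpacking it first. This is only a minimal sketch using Python's standard library. The function name read_gzipped_table is my own, and the tab-delimited layout with one header row is an assumption, so check the actual supplementary file from the GEO page before relying on it.

```python
import csv
import gzip


def read_gzipped_table(path, delimiter="\t"):
    """Read a gzip-compressed delimited text file into a header and a list of rows.

    Assumes the first line is a header row (an assumption, not a GEO guarantee).
    """
    # "rt" opens the compressed file in text mode, so no manual decompression is needed.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        rows = [row for row in reader]
    return header, rows
```

For example, header, rows = read_gzipped_table("counts.tsv.gz") would load a file with that (hypothetical) name. All values come back as text, so convert counts to integers before doing any math.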

However, if you need raw sequencing data, then click on "More..." to get information on the individual samples in this sequencing project. There are only 6 samples, so I'll show how to download data from an individual sample. Pick the first one, sample ID 72406, accession GSM3814116.

Click on "GSM3814116" to reach this webpage:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3814116

For raw sequencing data, follow the SRA link from NCBI GEO. Scroll down to the bottom of the individual sample's NCBI GEO page and click the SRA accession "SRX5890937".

That takes you to the Sequence Read Archive (SRA) page, which is where the sequence data is actually stored. See information about this specific sample ID 72406:

https://www.ncbi.nlm.nih.gov/sra?term=SRX5890937

Click on the "Run" link.

Click on "SRR9116838" to get to:

https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR9116838
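The clicking path so far goes GSE to GSM to SRX to SRR, and each page has a predictable web address. As a rough sketch, the URL patterns can be filled in from the accession prefix. The patterns are copied from the links in this post and the function name accession_url is my own; NCBI could change these addresses at any time, so treat this as a convenience, not an official API.

```python
# URL patterns copied from the links in this walk-through (NCBI may change them).
GEO_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={acc}"
SRA_EXPERIMENT_URL = "https://www.ncbi.nlm.nih.gov/sra?term={acc}"
SRA_RUN_URL = "https://trace.ncbi.nlm.nih.gov/Traces/sra/?run={acc}"


def accession_url(accession):
    """Return the web page URL for a GEO or SRA accession, based on its prefix."""
    acc = accession.strip().upper()
    if acc.startswith(("GSE", "GSM")):  # GEO series or GEO sample
        return GEO_URL.format(acc=acc)
    if acc.startswith("SRX"):  # SRA experiment
        return SRA_EXPERIMENT_URL.format(acc=acc)
    if acc.startswith("SRR"):  # SRA run
        return SRA_RUN_URL.format(acc=acc)
    raise ValueError(f"Unrecognized accession prefix: {accession!r}")


# The four accessions used in this example, in the order they are visited.
for acc in ["GSE131696", "GSM3814116", "SRX5890937", "SRR9116838"]:
    print(acc, accession_url(acc))
```

This only builds the addresses. It does not look up which SRX or SRR belongs to a given GSM, so you still need the web pages (or another tool) for that step.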

The SRR9116838 run page gives more information about the raw sequencing data, including that it has 46.9% GC content (G and C bases, versus A and T bases).
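To make the GC content number concrete, here is a small sketch of how a GC percentage can be computed for a single sequence. This is not how the SRA calculates its summary. The function name gc_percent is my own, and skipping ambiguous bases such as N is my assumption.

```python
def gc_percent(sequence):
    """Return the percentage of G and C bases among the A, C, G, T bases in a sequence.

    Ambiguous bases such as N are ignored (an assumption made for this sketch).
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / total


print(gc_percent("ATGCGCTA"))  # 4 of 8 bases are G or C -> 50.0
```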

Notice that there are tabs at the top of the page: Metadata, Alignment, Analysis, Reads, and Data access.

Click on the "Data access" tab to find links to download the raw sequencing data. Click on your preferred download source.

NCBI = National Center for Biotechnology Information, part of the National Institutes of Health (NIH)
AWS = Amazon Web Services

That page is where you can get the raw data. For this sample, the SRA-formatted data is 2.8Gb. The download time estimate jumps between 16 and 22 minutes as the download speed changes.
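The time estimate is just file size divided by download speed. As a back-of-the-envelope sketch, assuming the "2.8Gb" shown on the page means 2.8 gigabytes in decimal units (my assumption), speeds of roughly 2 to 3 MB/s give times in the same range as the browser's estimate. The function name download_minutes and the example speeds are mine, not measured values.

```python
def download_minutes(size_gb, speed_mb_per_s):
    """Estimate download time in minutes for a file size in GB at a speed in MB/s."""
    if speed_mb_per_s <= 0:
        raise ValueError("speed must be positive")
    size_mb = size_gb * 1000  # decimal units: 1 GB = 1000 MB
    return size_mb / speed_mb_per_s / 60  # seconds -> minutes


# Hypothetical speeds, chosen only to illustrate how the estimate moves.
for speed in (2.0, 2.5, 3.0):
    print(f"{speed} MB/s -> {download_minutes(2.8, speed):.1f} min")
```

Multiply by the number of samples to get a rough total if you download them one after another.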

The original format is also available on this page. It is usually a BAM or FASTQ file.
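If you end up with a FASTQ file, a quick sanity check is to count the reads and see whether the download looks complete. This is a minimal sketch, assuming the common layout of exactly 4 lines per read (header, sequence, "+", qualities). The function name count_fastq_reads is my own.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file (plain or .gz), assuming 4 lines per record."""
    # Pick the opener from the file extension; gzip.open reads .gz directly.
    opener = gzip.open if str(path).endswith(".gz") else open
    n_lines = 0
    with opener(path, "rt") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, e.g. a trailing newline at the end
                n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    return n_lines // 4
```

A line count that is not a multiple of 4 usually means the download was cut off, although multi-line FASTQ files (rare) would also trip this check.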

Downloading samples individually is very doable for 6 samples. However, if you need to download many more samples, consider switching to the NIH's SRA Toolkit for batch downloading. Also see the SRA Explorer.
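For batch downloading, the SRA Toolkit provides the command-line programs prefetch (downloads the SRA-formatted file) and fasterq-dump (converts it to FASTQ). Below is a hedged sketch that only builds and prints the command lines for a list of run accessions, so you can review them before running anything. The function name sra_toolkit_commands and the output folder "fastq" are my own choices, and the list contains only the one run from this walk-through.

```python
import shlex


def sra_toolkit_commands(run_accessions, outdir="fastq"):
    """Build prefetch and fasterq-dump command lines for each SRA run accession."""
    commands = []
    for acc in run_accessions:
        # Step 1: download the SRA-formatted file for this run.
        commands.append(["prefetch", acc])
        # Step 2: convert it to FASTQ, writing paired reads to separate files.
        commands.append(["fasterq-dump", acc, "--split-files", "--outdir", outdir])
    return commands


runs = ["SRR9116838"]  # add the other run accessions from the SRA pages here
for cmd in sra_toolkit_commands(runs):
    print(shlex.join(cmd))  # printed only; nothing is downloaded by this script
```

To actually run them, the SRA Toolkit must be installed and on your PATH. You can then paste the printed lines into a terminal, or pass each command list to subprocess.run(cmd, check=True). Check the toolkit's own documentation for the options that suit your data type.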

Notes from Tania

November 16, 2021
