November 16, 2021

How to download raw sequencing data using NCBI GEO accessions

This BiteSize Bio YouTube tutorial is very helpful for understanding RNA-seq repositories and what data is stored at NCBI GEO versus the SRA.

For my walk-through example, I am using our placenta single cell sequencing published in 2020. 

Data details


How to get data from NCBI GEO and SRA

The NCBI GEO super series includes two sequencing projects, one of which is not single cell sequencing.

Start at the NCBI GEO submission page for the actual sequencing you want. In this example, use the single cell sequencing:

NCBI GEO contains metadata for the sequencing as well as processed files, including counts, which you can use for differential expression analysis with R packages such as DESeq2, edgeR, or limma. If that is all you need, download it from the very bottom of the NCBI GEO page, under "Supplementary file". The extension .gz indicates a gzip-compressed file (it is like .zip or .7z, and is the usual compression format on Linux). R, and therefore RStudio, can read .gz files directly on Windows and macOS, without any separate decompression step.
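If you work in Python rather than R, a gzip-compressed counts table can also be read without decompressing it first. This is only a minimal sketch: the file name, the tab-delimited layout, and the gene and sample names are made up for illustration, and the real supplementary file from GEO may be laid out differently.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def read_counts(path):
    """Read a gzip-compressed, tab-delimited counts table.

    Assumes (hypothetically) one header row, gene names in the first
    column, and integer counts in the remaining columns.
    """
    with gzip.open(path, mode="rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[1:]
        counts = {row[0]: [int(v) for v in row[1:]] for row in reader if row}
    return samples, counts


# Build a tiny made-up .gz file so the example runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_counts.tsv.gz"
    with gzip.open(demo, mode="wt", newline="") as handle:
        handle.write("gene\tsample_A\tsample_B\nGENE1\t10\t0\nGENE2\t3\t7\n")
    samples, counts = read_counts(demo)

print(samples)
print(counts["GENE2"])
```

For a real analysis, a data-frame library would likely be more convenient than plain dictionaries; the point here is just that the .gz file can be opened as is.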




However, if you need raw sequencing data, click on "More..." to get information on the individual samples in this sequencing project. There are only 6 samples, so I'll show how to download data from an individual sample. Pick the first one, sample ID 72406, accession GSM3814116.




Click on "GSM3814116" to reach this webpage:

For raw sequencing data, follow the SRA link from NCBI GEO. Scroll down to the bottom on the individual sample's NCBI GEO page and click the SRA accession "SRX5890937":



That takes you to the Sequence Read Archive (SRA) page, which is where the sequence data is actually stored. See information about this specific sample ID 72406:




Click on the "Run" link. 
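The pages visited so far can also be reached by building the URL from the accession instead of clicking through. This is a rough sketch: the GEO address follows the pattern of the sample pages, while the SRA search address is my assumption about a form that resolves an accession, so check both in a browser before relying on them.

```python
from urllib.parse import urlencode

# Base addresses; the SRA search form is an assumption, not an official API.
GEO_BASE = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"
SRA_BASE = "https://www.ncbi.nlm.nih.gov/sra/"


def geo_url(accession):
    """Return the GEO page URL for a GSE or GSM accession."""
    return f"{GEO_BASE}?{urlencode({'acc': accession})}"


def sra_url(accession):
    """Return an SRA search URL for an SRX or SRR accession."""
    return f"{SRA_BASE}?{urlencode({'term': accession})}"


print(geo_url("GSM3814116"))
print(sra_url("SRX5890937"))
```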






This page gives more information about the raw sequencing data, including that it has 46.9% GC content (the share of G and C bases, as opposed to A and T bases).
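To make the GC content number concrete, here is a small sketch of how such a percentage can be computed from a sequence. It counts only called A, C, G, and T bases and skips anything else (such as N); that choice is my assumption and may not match exactly how the SRA page calculates its figure.

```python
def gc_content(sequence):
    """Return the percentage of G and C bases among called A/C/G/T bases."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        raise ValueError("sequence contains no A, C, G, or T bases")
    gc = sum(base in "GC" for base in called)
    return 100.0 * gc / len(called)


# A made-up 8-base read: 4 of the 8 bases are G or C.
print(gc_content("ATGCGCTA"))
```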

Notice that there are tabs at the top of the page: Metadata, Alignment, Analysis, Reads, and Data access.

Click on the "Data access" tab to find links to download the raw sequencing data. Click on your preferred download source.
  • NCBI = National Center for Biotechnology Information, part of the National Institutes of Health
  • AWS = Amazon Web Services




That page is where you can get the raw data. For this sample, the SRA-formatted data is 2.8Gb. The download time estimate fluctuates between 16 and 22 minutes as the download speed changes.
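The changing time estimate is just file size divided by the current download speed. The sketch below assumes the 2.8Gb shown on the page means about 2.8 gigabytes (2.8e9 bytes), and the two speeds are illustrative values I picked, not measurements.

```python
def estimated_minutes(size_bytes, bytes_per_second):
    """Return the estimated download time in minutes at a constant speed."""
    if bytes_per_second <= 0:
        raise ValueError("download speed must be positive")
    return size_bytes / bytes_per_second / 60


size_bytes = 2.8e9  # assumption: 2.8Gb on the page means ~2.8 gigabytes

# Two illustrative speeds in megabytes per second.
for mb_per_second in (2.1, 2.9):
    minutes = estimated_minutes(size_bytes, mb_per_second * 1e6)
    print(f"{mb_per_second} MB/s -> about {minutes:.0f} minutes")
```

Under that assumption, speeds of roughly 2 to 3 MB/s give estimates in the same range as the browser showed.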



The original format is also available on this page. It is usually a BAM or FASTQ file.

Downloading samples individually is very doable for 6 samples. However, if you need to download many more samples, consider switching to the NIH's SRA Toolkit for batch downloading. Also see the SRA Explorer.
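As a rough idea of what batch downloading with the SRA Toolkit could look like, the sketch below builds one prefetch and one fasterq-dump command per run. The run accessions are hypothetical placeholders (I have not listed the real SRR numbers for these samples here), the output folder name is made up, and single cell data may need additional fasterq-dump options, so treat this as a starting point only.

```python
import shutil
import subprocess


def build_commands(run_accessions, outdir="fastq"):
    """Return the SRA Toolkit commands for each run as argument lists."""
    commands = []
    for run in run_accessions:
        commands.append(["prefetch", run])  # fetch the .sra file
        commands.append(["fasterq-dump", run, "--outdir", outdir])  # convert to FASTQ
    return commands


def download_all(run_accessions, outdir="fastq"):
    """Run the commands in order; requires the SRA Toolkit on the PATH."""
    for tool in ("prefetch", "fasterq-dump"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found; install the SRA Toolkit first")
    for command in build_commands(run_accessions, outdir):
        subprocess.run(command, check=True)


# Hypothetical placeholder accessions; replace with the real SRR run IDs.
runs = ["SRR0000001", "SRR0000002"]
for command in build_commands(runs):
    print(" ".join(command))
```

Printing the commands first makes it easy to check them before calling download_all, which actually starts the downloads.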
