This BiteSize Bio YouTube tutorial is very helpful for understanding RNA-seq repositories and which data are stored at NCBI GEO versus the SRA.
Data details
- Tianyanxin Sun, Tania L Gonzalez, ..., Margareta D Pisarska. "Sexually dimorphic crosstalk at the maternal-fetal interface", Journal of Clinical Endocrinology & Metabolism, Volume 105, Issue 12, December 2020, Pages e4831–e4847.
- Published: https://doi.org/10.1210/clinem/dgaa503
- Sequencing data deposited into NCBI GEO superseries GSE131875
- GSE131696: Single cell RNA-sequencing of CVS samples
- N=6 first trimester placenta, 3 female and 3 male
- Tissue is from leftover donated chorionic villus sampling, a prenatal genetic diagnosis exam in late first trimester
- All patients delivered full term live and healthy babies
How to get data from NCBI GEO and SRA
The NCBI GEO superseries includes two sequencing projects, one of which is not single cell sequencing.
Start at the NCBI GEO submission page for the actual sequencing you want. In this example, use the single cell sequencing, GSE131696.
NCBI GEO contains metadata for the sequencing as well as processed files, including counts that you can use for differential expression analysis with R packages such as DESeq2, edgeR, or limma. If that is all you need, download it from the very bottom of the NCBI GEO page, under "Supplementary file". The extension .gz indicates a gzip-compressed file (similar to .zip or .7z, and common on Linux). R, and therefore RStudio, can read .gz files directly on Windows and macOS, with no separate decompression step.
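If you work in Python rather than R, a .gz file can be read the same way, without decompressing it first. The sketch below is a minimal example using only the standard library. It assumes the supplementary file is a tab-delimited text table with one header row, which you should check against the actual file. The function name read_counts is made up for this illustration.

```python
import csv
import gzip


def read_counts(path, delimiter="\t"):
    """Read a gzip-compressed delimited text file into (header, rows).

    Assumption: the file is a plain text table with one header row.
    """
    # Text mode ("rt") decompresses on the fly; nothing extra is written to disk.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        rows = [row for row in reader]
    return header, rows
```

Call it as read_counts("your_downloaded_file.txt.gz"), replacing the placeholder with the name of the file you downloaded from the "Supplementary file" section.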
However, if you need raw sequencing data, then click on "More..." to get information on the individual samples in this sequencing project. There are only 6 samples, so I'll show how to download data from an individual sample. Pick the first one, sample ID 72406, accession GSM3814116.
Click on "GSM3814116" to open the page for that sample.
For raw sequencing data, follow the SRA link from NCBI GEO. Scroll down to the bottom of the individual sample's NCBI GEO page and click the SRA accession "SRX5890937".
The SRA accession link takes you to the Sequence Read Archive (SRA) page, which is where the sequence data is actually stored. It shows information about this specific sample, ID 72406.
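The pages visited so far can also be opened directly from their accession numbers. The sketch below builds the links for the series, the sample, and the SRA experiment. The URL patterns are an assumption based on how NCBI links commonly look and may change. The helper names geo_url and sra_url are made up for this illustration.

```python
from urllib.parse import urlencode

# Assumed URL patterns; verify them in a browser before relying on them.
GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"
SRA_SEARCH_URL = "https://www.ncbi.nlm.nih.gov/sra"


def geo_url(accession):
    """Link to a GEO series (GSE...) or sample (GSM...) page."""
    return f"{GEO_ACC_URL}?{urlencode({'acc': accession})}"


def sra_url(accession):
    """Link to an SRA search for an experiment (SRX...) accession."""
    return f"{SRA_SEARCH_URL}?{urlencode({'term': accession})}"


# Accessions taken from the walkthrough above.
for label, url in [
    ("series", geo_url("GSE131696")),
    ("sample", geo_url("GSM3814116")),
    ("SRA experiment", sra_url("SRX5890937")),
]:
    print(f"{label}: {url}")
```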
Click on the "Run" link.
Click on "SRR9116838" to open the page for that run.
This page gives more information about the raw sequencing data, including its GC content of 46.9% (the share of bases that are G or C rather than A or T).
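To make the GC content number concrete, here is a small sketch of how such a percentage can be computed from a sequence. It is an illustration only, not necessarily how the SRA calculates its figure. The choice to ignore ambiguous bases such as N is an assumption made here.

```python
def gc_fraction(sequence):
    """Return the fraction of A/C/G/T bases that are G or C.

    Assumption: ambiguous bases such as N are ignored.
    """
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    gc_count = sum(1 for b in bases if b in "GC")
    return gc_count / len(bases)


# Toy read: 4 of the 8 bases are G or C.
print(f"{gc_fraction('GGCCAATT'):.1%}")
```

A value of 46.9% means that slightly fewer than half of the sequenced bases are G or C.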
Notice that there are tabs at the top of the page: Metadata, Alignment, Analysis, Reads, and Data access.
Click on the "Data access" tab to find links to download the raw sequencing data. Click on your preferred download source.
- NCBI = National Center for Biotechnology Information, part of the National Institutes of Health (NIH)
- AWS = Amazon Web Services
That page is where you can get the raw data. For this sample, the SRA-formatted data is 2.8Gb. The download time estimate jumps between 16 and 22 minutes as the download speed changes.
The original format is also available on this page. It is usually a BAM or FASTQ file.
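The download estimate is simply file size divided by download speed, which is why it moves as the speed changes. The sketch below works through that arithmetic. It assumes that "2.8Gb" means roughly 2.8 gigabytes (2.8e9 bytes), which is my reading of the page and not something it states. The function names are made up for this illustration.

```python
def download_minutes(size_bytes, bytes_per_second):
    """Estimated download time in minutes at a constant speed."""
    return size_bytes / bytes_per_second / 60


def implied_speed(size_bytes, minutes):
    """Average speed in bytes per second implied by a time estimate."""
    return size_bytes / (minutes * 60)


SIZE_BYTES = 2.8e9  # assumption: "2.8Gb" read as 2.8 gigabytes

for minutes in (16, 22):
    speed_mb = implied_speed(SIZE_BYTES, minutes) / 1e6
    print(f"{minutes} min corresponds to about {speed_mb:.1f} MB/s")
```

Under that assumption, estimates of 16 to 22 minutes correspond to speeds of roughly 2.1 to 2.9 MB/s.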
Downloading samples individually is very doable for 6 samples. However, if you need to download many more samples, consider switching to NCBI's SRA Toolkit for batch downloading. Also see the SRA Explorer.
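As a rough idea of what batch downloading looks like, the sketch below builds SRA Toolkit commands for a list of run accessions. It assumes the toolkit's prefetch and fasterq-dump programs are installed and on your PATH. The output folder name "fastq" and the function names are made up for this illustration. Only SRR9116838 comes from this walkthrough, so look up the other run accessions on their own SRA pages. By default the script only prints the commands (a dry run) so you can check them first.

```python
import shutil
import subprocess


def build_commands(run_accessions, outdir="fastq"):
    """Return one prefetch and one fasterq-dump command per run accession."""
    commands = []
    for run in run_accessions:
        # prefetch downloads the SRA-formatted file for the run.
        commands.append(["prefetch", run])
        # fasterq-dump converts the run to FASTQ files in outdir.
        commands.append(["fasterq-dump", "--outdir", outdir, run])
    return commands


def run_commands(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
            continue
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} not found; is the SRA Toolkit installed?")
        subprocess.run(cmd, check=True)


# Add the remaining run accessions to this list after looking them up.
run_commands(build_commands(["SRR9116838"]))
```

Call run_commands with dry_run=False to actually download. Single cell data can need extra fasterq-dump options, so check the SRA Toolkit documentation for your data type.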