Notes from Tania: Study questions for sequencing research articles

Study questions for a sequencing article:

What is the article citation? Include the full article title, a PubMed link to the article, and a link to the article's original publisher.

Who is the corresponding author and what is their institutional affiliation? The corresponding author is usually the last author on the author list. They will have an asterisk or another symbol by their name. This is the person who oversees the project and makes publication decisions.

What samples did the study use (tissue type and gestational age and source of sample)? Define any acronyms used. Some examples:

Placenta - what gestational age? If pre-delivery, were the samples from terminations or from continuing pregnancies?
Cord blood - this is collected at delivery. Were the pregnancies "full term" or "pre-term" or otherwise described?
Plasma - when collected?
Serum - when collected?

Did they collect the samples themselves? If they are analyzing data from another paper, what is the citation for that other study? Check methods to see if they indicate an NCBI GEO accession ID (begins with "GSE").
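If the methods section is long, a pattern search can help spot GEO series accessions. This is a minimal sketch, assuming the methods text has been pasted into a string. The function name `find_geo_series` and the sample text (including the placeholder accession in it) are made up for illustration.

```python
import re

# GEO series accessions are written as "GSE" followed by digits.
GSE_PATTERN = re.compile(r"\bGSE\d+\b")

def find_geo_series(text):
    """Return unique GSE accession IDs in order of first appearance."""
    seen = []
    for match in GSE_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen

# Hypothetical methods text; the accession is a placeholder, not a real citation.
methods = "Raw data were deposited in NCBI GEO (GSE000000). See GSE000000 for details."
print(find_geo_series(methods))
```

Any ID found this way still needs to be checked against the paper to see whether it is the authors' own deposit or a dataset they reused.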

What sequencing or array method(s) did they use? Some examples:

total RNA-sequencing: sequences all RNAs after depletion of ribosomal RNAs.
mRNA-sequencing: only sequences the polyadenylated RNAs (with polyA tails) using oligo(dT) primers or probes during the library preparation.
miRNA-sequencing, microRNA-sequencing, or small RNA-sequencing: sequences small RNAs within a specific size range.
ATAC-sequencing: sequencing to identify accessible chromatin regions, where DNA is available for protein binding. This is used to find regulatory DNA regions and identify how genes might be controlled.
Bisulfite sequencing: a sequencing-based method to measure DNA methylation
DNA methylation array: an array-based method to measure DNA methylation, but which one? The two most common arrays are the 450k array (older) and the EPIC array (twice the size at 860k sites). Some papers also use custom arrays.

Vocabulary to help distinguish methods:

"Total RNA" = all RNA isolated from the sample before any depletion or enrichment step. The use of the term "total RNA" does not indicate the sequencing method. Most RNA in cells is actually ribosomal RNA so pre-sequencing steps usually include a negative selection (remove rRNA) or a positive selection (pick up only polyA-tail RNA).
Library preparation is the step to convert RNA into cDNA before RNA-sequencing. The cDNA "library" is what is actually sequenced.
DNA is bisulfite converted for methylation arrays also, so the word "bisulfite" in methods does not by itself indicate bisulfite sequencing.

Did they use bulk sequencing, single cell sequencing, single nuclei sequencing, or multiple methods? Bulk means a whole sample was processed at once. Single cell and single nuclei sequencing require separating (or "dissociating") the tissue and capturing individual cells or nuclei.

Vocabulary:

10x Genomics = company that provides many lab materials and analysis tools for single cell and single nuclei sequencing
Cell Ranger = software from 10x Genomics for primary analysis of single cell/nuclei sequencing
PCA = principal components analysis. It uses the most variable genes to see how samples separate.
PISA = alternative to Cell Ranger for primary analysis of single cell/nuclei sequencing
primary analysis = processing raw sequencing reads (FASTQ files): mapping them to a genome reference and counting reads per gene for each cell or nucleus, which creates a count matrix
secondary analysis = using the count matrix to analyze the biology of the samples (for example, cell composition and differential expression analysis)
UMAP = uniform manifold approximation and projection. Similar to PCA, it uses the most variable genes to see how cells separate. The cells may be color-coded by cell type.
Seurat = software for secondary analysis of single cell/nuclei data
scRNA = single cell RNA
snRNA = single nuclei RNA

How were the samples stored between collection and nucleotide isolation (or collection and tissue dissociation)?

How were samples frozen? Did they use a cryopreservative or stabilizing solution such as RNAlater, CryoStor CS10, CryoStor CS5, DMSO, or glycerol? Do they specify a percentage such as 10% DMSO?
What temperature was used to store samples long-term?

For bulk studies --

What method was used for DNA or RNA isolation?
Was it a commercial kit (which one) or a phenol/chloroform method or something else?
How much DNA or RNA did they use per sample for library preparation? For example, 500 ng or 1 µg?

For single cell or single nuclei studies --

Was there a red blood cell (RBC) lysis step?
Was there a cell sorting step and if so, what population of cells did they capture and how did they identify it?
During data processing, what thresholds did they use to filter out cells? For example, only keeping cells with < 5% of reads coming from mitochondrial genes.
What cell types did they identify? What cell markers (genes) did they use to identify these cell types?
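
To make the mitochondrial threshold concrete, here is a toy sketch of that kind of filter in plain Python. It assumes human gene symbols, where mitochondrial genes start with "MT-". The function names, the counts, and the placeholder gene names GENE1 and GENE2 are made up for illustration. Real studies would do this step in a package such as Seurat, with thresholds chosen by the authors.

```python
def percent_mito(cell_counts, mito_prefix="MT-"):
    """Percent of a cell's total counts that come from mitochondrial genes."""
    total = sum(cell_counts.values())
    if total == 0:
        return 0.0
    mito = sum(n for gene, n in cell_counts.items() if gene.startswith(mito_prefix))
    return 100.0 * mito / total

def filter_cells(cells, max_percent_mito=5.0):
    """Keep only cells below the mitochondrial percentage threshold."""
    return {
        barcode: counts
        for barcode, counts in cells.items()
        if percent_mito(counts) < max_percent_mito
    }

# Made-up counts per gene for two cells (not real data).
cells = {
    "cell_A": {"MT-CO1": 2, "GENE1": 58, "GENE2": 40},    # 2% mitochondrial
    "cell_B": {"MT-CO1": 30, "MT-ND1": 10, "GENE1": 60},  # 40% mitochondrial
}
kept = filter_cells(cells)
print(sorted(kept))
```

When reading a paper, note both the threshold and how many cells were removed by it, if they report that.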

What study groups were compared in the analysis? Describe how the groups were defined.

How many samples are in each group? Define any acronyms or uncommon words. Note that some articles list two numbers: the starting number of samples, then the number of samples they used for the final analysis after describing their criteria for inclusion/exclusion.

Final number analyzed in sequencing or array analysis?
How many cells or nuclei? Per sample, per group, or total number?

How did they define significance? What statistics and thresholds did they use?

Example statistical tests: Wilcoxon rank-sum test, Fisher's Exact Test. Example multiple-testing corrections: Benjamini-Hochberg false discovery rate (FDR), Bonferroni.
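
To see why the choice of correction matters, the sketch below applies Bonferroni and Benjamini-Hochberg to the same p-values. The p-values are made up, the function names are hypothetical, and the 0.05 cutoff is only an example. Papers will normally use a statistics package rather than hand-written code.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so adjusted values never
    # increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.001, 0.02, 0.03, 0.5]  # made-up raw p-values for four genes
bh_significant = [q < 0.05 for q in benjamini_hochberg(p_values)]
bonf_significant = [q < 0.05 for q in bonferroni(p_values)]
print(bh_significant)
print(bonf_significant)
```

With these numbers, Benjamini-Hochberg calls three genes significant and Bonferroni calls only one, so the same data can give different gene counts depending on the correction used.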

How many significant genes, probes, and/or regions did they find?

RNA-seq studies will report differentially expressed genes (DEGs) and use words like "upregulated" and "downregulated", or "[group1]-biased" or "[group2]-biased" to indicate direction of higher expression.
DNA methylation studies will report differentially methylated probes (DMPs) and regions (DMRs).
The exact vocabulary might vary from paper to paper (e.g. "CpG sites" instead of "probes").

What genes did they highlight? Pick 1-3 genes and briefly discuss the main results. Keep in mind that saying a gene is "significant" doesn't give enough information. What about it is significant?

Bad: "ZNF300 was significant."
Good: "ZNF300 gene expression was significantly higher in females, compared to males."
Better: "In first trimester human placenta, ZNF300 gene expression was 1.58-fold higher in females, compared to males, with FDR<0.05."
(You don't need to repeat the sample type or species every time, but it's helpful to clarify at least once. The magnitude of the change is helpful to record, too, especially when other genes might be >10-fold different.)
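
Some papers report fold change and others report log2 fold change, so it helps to be able to convert between them. A minimal sketch, assuming you have a mean expression value for each group. The two means below are hypothetical numbers chosen to give the 1.58-fold example above, and the function names are made up.

```python
import math

def fold_change(mean_group1, mean_group2):
    """Ratio of group 1 expression to group 2 expression."""
    return mean_group1 / mean_group2

def log2_fold_change(mean_group1, mean_group2):
    """Same ratio on the log2 scale; 0 means no difference."""
    return math.log2(mean_group1 / mean_group2)

# Hypothetical normalized expression means (not from any paper).
female_mean = 158.0
male_mean = 100.0

fc = fold_change(female_mean, male_mean)
log2fc = log2_fold_change(female_mean, male_mean)
print(f"{fc:.2f}-fold higher, log2 fold change = {log2fc:.2f}")
```

A 1.58-fold difference is a log2 fold change of about 0.66, while a 10-fold difference is about 3.32. Check which scale a paper uses before comparing magnitudes.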

What were their main conclusions? Write 1-5 bullet point notes or sentences. What are the highlights of the paper? What did they accomplish and find?

--------------------------------------------------------

Last updated July 7, 2025 - added more single cell vocabulary

June 12, 2023
