October 10, 2024

Bookmarks: single cell RNA-seq tutorials and tools

These are my bookmarks for single cell transcriptomics resources and tutorials. Whenever I find something useful, I add it.

scRNA-seq introductions

How to make R objects for single cell data, e.g. SingleCellExperiment, SummarizedExperiment  

Getting Started with Seurat v4  (Satija lab tutorials list)

  • Many tutorials here, for different scRNA-seq goals

Guided clustering tutorial with 3000 PBMC cells

  • Setup Seurat object
  • Standard pre-processing workflow & quality control
  • Data normalization
  • Identifying highly variable features (genes)
  • Clustering, UMAP/tSNE plots
  • Differential gene expression analysis

Basics of single cell analysis with Bioconductor

University of Cambridge intro to single cell RNA-seq analysis

  • Identification of low-quality cells using MADs values



Lectures, textbooks, video tutorials, & interpretation

Cell filtering

  • Doublets have higher nFeature_RNA and nCount_RNA, but no change for percent_mt (biostatsquid tutorial)
  • Cells with high percent_mt are more likely to be dead or dying cells (Marquez-Jurado 2018

Determining the optimal number of clusters with elbow plots

DoubletFinder tutorial & DoubletFinder github 

  • DoubletFinder assumes you've done full pre-processing: preliminary QC (filter out lowest quality cells and genes), run NormalizeData, FindVariableGenes, ScaleData, RunPCA, and RunTSNE (source).
  • Split the Seurat object by sample name first
  • Run DoubletFinder for each individual sample (that's why you need the split object)
  • Run it before merging samples: "Do not apply DoubletFinder to aggregated scRNA-seq data representing multiple distinct samples (e.g., multiple 10X lanes)." You want to run DoubletFinder on each sample individually to avoid the algorithm incorrectly "finding" doublets that can't exist, e.g. doublets formed by two separate samples that were not captured together (see discussion). 
  • OSCA.advanced tutorial on doublet detection using a SingleCellExperiment object and scDblFinder R package. "The findDoubletClusters() function from the scDblFinder package identifies clusters with expression profiles lying between two other clusters (Bach et al. 2017). "
  • BiostatsSQUID tutorial on using scDoublFinder starting with a Seurat object. It converts to a SingleCellExperiment object also.

National Cancer Institute: "Getting Started with Seurat: Differential Expression and Classification"

  • Seurat differential expression analysis with FindAllMarkers and FindMarkers
  • Cell-level cell type annotation using SingleR


Orchestrating Single-Cell Analysis (OSCA) with Bioconductor by Robert Amezquita, Aaron Lun, Stephanie Hicks, Raphael Gottardo (2021)

OSCA Basics: Basics of Single-Cell Analysis with Bioconductor by Robert Amezquita, Aaron Lun, Stephanie Hicks, Raphael Gottardo (2024)
  • Quality control - various QC metrics, identifying & removing low quality cells, diagnostic plots
  • Normalization - library size, deconvolution, spike-ins, scaling and log-transformation
  • Feature selection - quantifying variation, sequencing noises, batch effects, etc
  • Dimensionality reduction - PCA plots
  • Clustering - k means clustering, hierarchical clustering, subclustering
  • Marker gene detection - dot plots, expression plots
  • Cell type annotation - using other references, specific genes, markers, diagnostic heatmaps
  • Using references
  • Annotation diagnostics
  • Using multiple references
  • Exploiting cell ontology
  • Example dataset from pancreas

scran:
  • Automated PC choice
  • Graph-based clustering
  • Identifying marker genes
  • Detecting correlated genes
  • Converting to other formats allows for pseudobulk analysis with edgeR or DESeq2

Seurat:

10x Genomics tutorial:

Loupe Browser from 10x Genomics

YouTube tutorials:

Data visualization - types of plots and how to make them

Data visualization methods in Seurat  - ridge plots, violin plots, feature plots, dot plots, heatmaps, visualizing coexpression

Split Dot Plot  - color code by an additional variable such as a condition

Dot Plots: "How to make a multi-group dotplot for single-cell RNAseq data" - including how to add different color coding, how to split by groups, how to add color gradients, clustered dot plots

Clustered dot plot using ComplexHeatmap

Let's Plot 7: Clustered Dot Plots in the ggverse  (Eye Informatician)

Gene Ontology analysis and data visualization (Melbourne Bioinformatics)

Custom formatting all Seurat plots from Seurat functions. Example code:

plots <- VlnPlot(object = combined, features = c('Arg1', 'Tnf'), split.by = 'group', pt.size = 0, combine = FALSE, log = TRUE)
plots <- lapply(
  X = plots,
  FUN = function(p) p + ggplot2::scale_fill_manual(values = c('red', 'black'))
)
CombinePlots(plots = plots, legend = 'right')


tSNE vs UMAP, two methods to show clustering:


SCpubr - an R package to make publication ready plots for single cell RNA-seq

  • Dim plots - dimensional reduction, similar to PCA or UMAP plots
  • Feature plots - dim plot with a continuous scale for gene expression visualization across clusters
  • Nebulosa plots - computes a density plot for specific gene markers so you can see where they are most expressed
  • Bee Swarm plots
  • Violin plots
  • Ridge plots - multiple violin plots together
  • Dot plots - show gene expression of different markers across different clusters
  • Bar plots
  • Box plots
  • Geyser plots
  • Alluvian plots
  • Sankey plots
  • Chord Diagram plots - circos plots
  • Volcano plots


Cell labeling, label transfer, single cell reference mapping

Mapping and annotating query datasets (Satija lab, Oct 2023)

Web Resources for Cell Type Annotation  (10x Genomics Analysis Guide, 2024)

Azimuth: App for reference based single cell analysis - helps annotate clusters. You can upload the Seurat object .rds file to the app and get predictions. Troubleshoot error with Seurat v5.
 

Cell proportions


R package: scProportionTest (github, not peer-reviewed)
  • Explanation from the author:
    • Monte-Carlo/permutation test
    • Random sample some number of cells for each condition
    • Pool cells between samples for each condition
    • Randomly segregate the cells back into the two conditions, maintaining original sample sizes
    • Re-calculate proportion differences, compare to actual observed proportion differences
    • Repeat 10,000 times (R package default is 1000)
  • Critique
    • Direction of change may not match that obtained from a boxplot
    • Each cell is an observation (would be better if each sample was an observation?)

R package speckle's propeller method (peer-reviewed)

Compositional analysis - Python based analysis of cell proportion differences.

"If we ignore the compositionality of the data, and use univariate methods like Wilcoxon rank-sum tests or scDC, a method which performs differential cell-type composition analysis by bootstrap resampling[Cao et al., 2019], we may falsely perceive cell-type population shifts as statistically sound effects, although they were induced by inherent negative correlations of the cell-type proportions."

"While single-cell datasets of sufficient size and replicate number have only been around for a few years, the same statistical property has also been discussed in the context of microbial analysis[Gloor et al., 2017]. There, some popular approaches include ANCOM-BC [Lin and Peddada, 2020] and ALDEx2 [Fernandes et al., 2014]. However, these approaches often struggle with single-cell datasets due to the small number of experimental replicates. This issue has been tackled by scCODA[Büttner et al., 2021], which we are going to introduce and apply to our dataset in the following section."

  • scCODA = Bayesian approach. Assumes a log-linear relationship between covariates and cell abundance.

Combining samples

Theory - 

Q & A -

Tutorials -


How to define batch -
This assumes you have a small spreadsheet "donor_metadata" which includes rows=samples and columns=metadata including a column labeled "batch". The "ID" columns are the sample names and these should match the IDs used during import of individual Seurat datasets.

rownames(donor_metadata) <- donor_metadata$ID

## Create dataframe with batch info for every cell
cellBatch = dplyr::left_join(
  x = data.frame(
            rownames = rownames(pbmc@meta.data),
            ID = pbmc@meta.data$orig.ident),
  y = donor_metadata[, c("ID", "batch")],
  by = "ID")
head(cellBatch)


How to assign Azimuth labels and split layers by batch - 
## Azimuth labeling
Layers(pbmc)
pbmc <- JoinLayers(pbmc)  ## to fix Azimuth error
pbmc <- Azimuth::RunAzimuth(pbmc, reference = "pbmcref")
pbmc

Layers(pbmc)

## See cell type annotations added
head(pbmc@meta.data, 10)

## Split layers only AFTER running Azimuth. Define the column to use for batches.
pbmc[["RNA"]] <- split(pbmc[["RNA"]], f = pbmc$batch)
Layers(pbmc[["RNA"]])

## Run normalizations and scaling
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc)
pbmc <- ScaleData(pbmc)
pbmc <- RunPCA(pbmc)
 

Batch correction (integration)

13.3.1 Batch correction: canonical correlation analysis (CCA) using Seurat (Broad Institute) - old method, but code is still helpful for learning

  • Uses separate Seurat objects (old way)

Integrative analysis in Seurat v5 - recommended new method

  • Streamlined one-line integrative analysis (new way)
  • Uses one Seurat object created by merging different Seurat objects, then splitting layers to define batches
  • Includes example code for different integration methods, including CCA and Harmony

Harmony R package  (Korsunsky et al 2019 Nature Methods) - method for batch correction of single cell data

Benchmarking atlas-level data integration in single-cell genomics (Luecken et al 2021, Nature Methods)

  • scANVI, Scanorama and scVI perform best for scRNA-seq
  • scATAC-seq integration performance depends on feature space (genes) & most methods performed poorly for scATAC
  • "scATAC-seq batch effects were only consistently overcome by LIGER and Harmony, which prioritize batch removal over conservation of biological variation."

"Revisiting the issue of differential expression after data integration" (GitHub discussion, 2022)

  • "I understand that the current recommendation from the Seurat authors is that differential expression (DE) analysis should NOT be performed using the integrated data, but on the original RNA data with or w/o log normalization depending on DE algorithms used."
  • "But as far as I understand, the uncorrected values should still contain the original batch effects intrinsic to the data, and if the datasets/batches are very different, such a comparison would be meaningless (i.e. batch and biological differences are completely confounded). Am I missing anything?"
  • "...another limit is that corrected expression values are only returned for the "integration features" (by default 2000 genes). There is yet another related issue that the Seurat authors recommend against using the pearson residues from SCTransform for DE (#4032)."
  • Response: "If we have more than one samples per condition, we could perform batch effect correction at the pseudo bulk level like what we do in a bulk RNAseq analysis."


Differential Expression Testing

Differential expression testing (Seurat)

  • p_val : p-value (unadjusted)
  • avg_log2FC : log fold-change of the average expression between the two groups. Positive values indicate that the feature is more highly expressed in the first group.
  • pct.1 : The percentage of cells where the feature is detected in the first group
  • pct.2 : The percentage of cells where the feature is detected in the second group
  • p_val_adj : Adjusted p-value, based on Bonferroni correction using all features in the dataset.


Differential expression across conditions (Seurat integration subsection)

Ignore data integration for differential expression analysis. Per Satija lab:

"We had anticipated extending Seurat to actively support DE using the pearson residuals of sctransform, but have decided not to do so. In some cases, Pearson residuals may not be directly comparable across different datasets, particularly if there are batch effects that are unrelated to sequencing depth. While it is possible to correct these differences using the SCTransform-based integration workflow for the purposes of visualization/clustering/etc., we do not recommend running differential expression directly on Pearson residuals. Instead, we recommend running DE on the standard RNA assay."


Receptor-Ligand interactions

LIANA: a LIgand-receptor ANalysis frAmework  - an R package and python tool for identifying and scoring receptor-ligand interactions in datasets


Spatial transcriptomics

Analysis of spatial datasets (Sequencing-based)

Analysis of spatial datasets (Imaging-based)

STELLAR ( Python based tool ) from Stanford to annotate single cell data, can be used for cross tissue and cross donor spatial transcriptomics data

Visium HD Multi-sample Integration Analysis: an R Tutorial in Google Colab



Velocity analysis (trajectory analysis)

  • Citation: "RNA velocity of Single Cells", La Manno et al 2018
  • Estimation basis: the relative abundance of nascent (unspliced) and mature (spliced) mRNA --> rate of RNA splicing and degradation
  • Need BAM files for RNA velocity analysis
  • Compare spliced and unspliced genes in BAM files
  • Latent time represents the cell's internal clock
  • Basic knowledge of Python programming Linux command line is needed

Also called "trajectory analysis":


Different tools:
  • Use velocyto (takes 2-3 hours to run) - use BAM files to get .loom files of spliced/unspliced counts
    • velocyto.py for Python
    • velocyto.R for R

  • Alternatively, use Seurat R package and scVelo Python tool.


scATAC-seq

  • Gene Ontology analysis
  • GeneActivity function from the Signac package to quantify activity from ATAC data
  • CoveragePlot function to plot peaks around a gene, from a specific cell type


Multiomics: scRNA-seq and scATAC-seq

Integrating scRNA-seq and scATAC-seq data  (Satija lab)

Integrating scRNA-seq and scATAC-seq data, transfer labels from RNA to ATAC data (Melbourne Bioinformatics

Integrative analysis in Seurat v5 (Satija lab, Oct 2023)

"For this vignette, we use a dataset of human PBMC profiled with seven different technologies , profiled as part of a systematic comparative analysis (pbmcsca). The data is available as part of our SeuratData package."

Azimuth annotation for scRNA-seq and scATAC-seq data (Satija lab)

Signac: a comprehensive R package for the analysis of single-cell chromatin data (Stuart lab)

Muon: Python-based single nuclei multiomics tool - for sn/scRNA, snATAC and CITE-seq multiomics



Scanpy and AnnData format (Python)

  • pooch package used to download preprocessed data
  • adata.X =  the active data matrix being used for analysis (e.g. normalized data). Most functions including data visualization tools will use the .X data unless told otherwise.
  • adata.obs = cell annotations (rows correspond to .X)
  • adata.obs_names = index of cell names
  • adata.obs_keys() = get list of all cell annotations available, e.g. n_genes, percent_mito, n_counts
  • adata.obs["is_low_quality"] = adata.obs["percent_mito"] > 0.03 #how to add cell annotation
  • adata.var = gene annotations (columns correspond to (.X)
  • adata.var_names = index of gene names

SnapATAC2 tutorial by kzhang.org

  • Reads h5ad objects
  • All elements are lazily loaded (requires less RAM)


Tips for reducing h5ad file size from biomage.net:

  • use compression=gzip in write_h5ad function
    • adata.write_h5ad(fn, compression='gzip')
  • matrix data stored as float32 rather than float64
  • any metadata columns (obs/var) that are 64bit can probably become 32bit
  • any string columns often are much smaller if made categorical
  • remove any extra ‘layers’ that aren’t desired in the final version
  • ensure X & raw.X are both sparse.csr_matrix if more than 50% values are zeroes


Seurat vs Scanpy back and forth

Seurat = R tool

Scanpy = Python tool

Conversion from Seurat to AnnData changes metadata column

cellgeni/schard (github link): load scanpy h5ad into R as list, SingleCellExperiment, or Seurat object

  • R package [recommended by]
  • By default loads only X, obs (cell metadata), var (genes), and obsm (reduced dimensions info)
  • Can also load images for spatial datasets
  • Allows loading just metadata instead of the whole object

cellgeni/sceasy (github link): convert between single cell formats

  • R packages
  • Needs R package reticulate which also needs Python

Scarf - an alternative to Seurat and Scanpy for computers with less RAM

Citation: Dhapola et al 2022, Nature Communicationshttps://www.nature.com/articles/s41467-022-32097-3

  • "Using six scRNA-Seq datasets with increasing cell numbers, we found that under the same analysis parameters (see Methods), Scarf had substantially lower memory consumption than Scanpy and Seurat (Fig. 1b). Importantly, using less than 16 GB of memory (RAM) (commonly available in modern laptops), even the largest set of four million cells were efficiently processed by Scarf."
  • Python package
  • pip install scarf[extra]

Scarf download and tutorials:

https://github.com/parashardhapola/scarf

https://osf.io/cbu6a/

https://scarf.readthedocs.io/en/latest/vignettes/cell_subsampling_tutorial.html


scRNA-seq data analysis for non-programmers

Galaxy  - software for nonprogrammers to use for scRNA-seq analysis


Loupe Browser from 10x Genomics 

Background reading - general



Background reading - placenta, endometrium

  • Arutyunyan A,... Vento-Tormo R. "Spatial multiomics map of trophoblast development in early pregnancy." Nature, 2023. [PMID: 36991123; PMCID: PMC10076224]
    • Human placenta and decidua frozen into blocks for spatial experiments
    • Tissue cryopreserved with cold OCT medium and flash-frozen using a dry ice-isopentane slurry.
    • Single nuclei used for multiomics (snRNA-seq, snATAC-seq)

  • Ji K, Chen L, ..., Liu H. "Integrating single-cell RNA sequencing with spatial transcriptomics reveals an immune landscape of human myometrium during labour." Clin Trans Med, 2022. [PMID: 37095651 ; PMCID: PMC10126311 ]
    • Human myometrial tissue collected during C-section deliveries (singleton, uncomplicated full term)
    • n=6 TIL, term in labor
    • n=6 TNL, tern in non-labor
    • Tissue was washed with PBS, minced and enzymatically dissociated briefly:
      3 mg/ml collagenase IV, 2 mg/ml papain , and 120 Units/ml DNases I ) at 37C for 20 min . Cell suspension was passed through stacked 70-30um filters, then passed through the Dead Cell Removal Kit (Miltenyi). Washed with PBS + 0.04% BSA twice.

  • Koel M, Krjutskov, ... Altmae S. "Human endometrial cell-type-specific RNA sequencing provides new insights into the embryo–endometrium interplay." Human Reproduction , 2022 . [PMID: 36339249 ; PMCID: PMC9632455 ]
    • Human endometrium cells sorted with FACS, then bulk RNA-seq
    • n=16 healthy women from Estonia and Spain, mean age 29.7, normal BMI, no hormonal medication for 3 months; normal serum levels of progesterone, prolactin, and testosterone; negative for STIs, no uterine pathologies or endometriosis or PCOS, at least one live birth.
    • Per woman, n=2 endometrial biopsies within the same menstrual cycle (early secretory & mid-secretory/receptive phases)
    • NCBI GEO accession GSE97929 (32 samples): 16 paired endometrial samples 

  • Sun T*,  Gonzalez TL*, ..., Pisarska MD. “Sexually dimorphic crosstalk at the maternal-fetal interface.” J Clin Endocrinol Metab, 2020. [PMID: 32772088 ; PMCID: PMC7571453 ] *co-first authors.
    • Human placenta at late first trimester during CVS appointments
    • NCBI GEO accession GSE131696  (6 samples) = Single cell RNA-seq
    • NCBI GEO accession GSE131874  (8 samples) = Bulk total RNA-seq of matched decidua and placenta
    • Tissue was washed with PBS, minced and enzymatically dissociated:
      300U/ml collagenase , 0.25%  trypsin , and 200μg/ml DNase I  at  37C for 90 min . Cells spun 1200 rpm for 10 min, resuspended in Chang medium (which contains 16% serum), and treated with 1x red blood cell lysis buffer for 15 min, then cells were washed again and strained through a 70um filter. [ Details ]

No comments:

Post a Comment

How to format final figures for publication

General figure guidelines File types and file sizes TIFF images with LZW compression to reduce the file size PDF files for vector images Not...