Notes from Tania

October 10, 2024

Bookmarks: single cell RNA-seq tutorials and tools

These are my bookmarks for single cell transcriptomics resources and tutorials.

scRNA-seq introductions

How to make R objects for single cell data, e.g. SingleCellExperiment, SummarizedExperiment

How to take a spreadsheet with a matrix and convert it to the format needed for many other single cell RNA-seq tutorials, e.g. if you download a .csv.gz file from NCBI GEO
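As a rough illustration of the spreadsheet-to-matrix step, here is a minimal R sketch. The file name, gene names, and values are made up as a stand-in for a real GEO download, and the SingleCellExperiment step only runs if that Bioconductor package happens to be installed.

```r
# Stand-in for a downloaded GEO file: a tiny genes-by-cells table saved as .csv.gz
# (file name, gene names, and values are made up for illustration)
csv_path <- file.path(tempdir(), "example_counts.csv.gz")
toy <- data.frame(
  gene  = c("GeneA", "GeneB", "GeneC"),
  cell1 = c(0, 5, 2),
  cell2 = c(3, 0, 1),
  cell3 = c(1, 1, 0)
)
con <- gzfile(csv_path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read the compressed spreadsheet; the first column becomes the row names (genes)
df <- read.csv(gzfile(csv_path), row.names = 1, check.names = FALSE)

# Convert the data frame to a numeric matrix: rows = genes, columns = cells
counts <- as.matrix(df)
print(counts)

# Optional: wrap the matrix in a SingleCellExperiment object, if the package is available
if (requireNamespace("SingleCellExperiment", quietly = TRUE)) {
  sce <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = counts))
  print(sce)
}
```

With a real download, you would skip the toy-file part and point read.csv at your own file.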

Getting Started with Seurat v4 (Satija lab tutorials list)

Many tutorials here, for different scRNA-seq goals

Guided clustering tutorial with 3000 PBMC cells

Setup Seurat object
Standard pre-processing workflow & quality control
Data normalization
Identifying highly variable features (genes)
Clustering, UMAP/tSNE plots
Differential gene expression analysis

Basics of single cell analysis with Bioconductor

University of Cambridge intro to single cell RNA-seq analysis

Identification of low-quality cells using MADs values

Orchestrating Single-Cell Analysis with Bioconductor
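To give a feel for the MAD-based filtering mentioned in the Cambridge course entry above, here is a minimal base R sketch. The library sizes are simulated, and the cutoff of 3 MADs below the median on the log scale is an assumption for illustration, not a rule taken from the linked tutorials.

```r
set.seed(1)

# Simulated total counts per cell (library sizes): 95 typical cells plus
# 5 cells with very few counts. These are made-up values, not real data.
lib_size <- c(rlnorm(95, meanlog = 9, sdlog = 0.3), 50, 80, 120, 150, 200)

# Work on the log scale so the distribution is roughly symmetric
log_lib <- log10(lib_size)

# Median and median absolute deviation (base R's mad() rescales by 1.4826 by default)
med <- median(log_lib)
mad_val <- mad(log_lib)

# Flag cells whose library size is more than n_mads MADs below the median
n_mads <- 3  # assumed cutoff
low_quality <- log_lib < med - n_mads * mad_val

table(low_quality)
```

The same pattern could be applied to other per-cell QC metrics, such as the number of detected genes.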

How to get MD5 checksums to detect data corruption (for bioinformatics data curation)

What are MD5 checksums? Checksums are nonsense text strings used to "summarize" a file version. No matter the size of the file (1 KB or 30 GB), the checksum algorithm gives you a conveniently short nonsense string of letters and numbers. The exact same file will give you the exact same checksum every time. If you change a single character or pixel, you will get a different checksum.

MD5 is a specific popular algorithm to get checksums.

Why use checksums? The purpose of checksums is to notice data corruption, especially when downloading files from or uploading files to a server. Every time you transfer files between computers, there is risk of data corruption. For small files, the risk is small and you'll most likely notice, for example if your email attachment download fails due to an internet interruption.

For large files such as raw sequencing data files, it's a bigger issue and you might not notice right away (or ever) if the last few RNA-seq reads are missing from a file with >30 million reads. Therefore, the best practice when downloading new sequencing data is to create MD5 checksums yourself and compare them with the MD5 checksums created by the originating computer (the sequencing core's server). They should be the same. If not, something went wrong during file transfer! Try re-downloading the data.

Similarly, when you upload sequencing data to a public repository (e.g. NCBI GEO), you provide MD5 checksums so that the receivers (NCBI's data curators) can confirm the upload was successful.
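The compare-after-transfer idea can be sketched in R with tools::md5sum(). This is only an illustration: the file names and contents are made up, and the "transfer problem" is faked by overwriting one file with a truncated version.

```r
# Stand-in "sequencing files" (made-up names and contents)
data_dir <- file.path(tempdir(), "md5_demo")
dir.create(data_dir, showWarnings = FALSE)
files <- file.path(data_dir, c("sample1.fastq", "sample2.fastq"))
writeLines("@read1\nACGT\n+\nIIII", files[1])
writeLines("@read1\nTTGA\n+\nIIII", files[2])

# Checksums as the sender would report them (computed before the pretend transfer)
expected <- tools::md5sum(files)

# Fake a corrupted transfer: the second file loses its last lines
writeLines("@read1\nTT", files[2])

# Checksums computed by the receiver after the transfer
observed <- tools::md5sum(files)

# Side-by-side comparison: ok = FALSE means the file should be transferred again
check <- data.frame(
  file     = basename(files),
  expected = unname(expected),
  observed = unname(observed),
  ok       = unname(expected == observed)
)
print(check)
```

In practice, the expected column would come from the checksum list provided by the sequencing core rather than being computed on your own computer.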

How to get an MD5 checksum for an individual file? On the Linux terminal, use the md5sum command followed by the file name. As an example, I created a text file containing only the phrase "hello pretend this is sequencing data". The checksum for that file is "b088d8d4d1d831af2d8d16147389aa7d". If I change the first letter to uppercase, the checksum completely changes.
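The same experiment can be tried from R with tools::md5sum(). This is a minimal sketch, not the original terminal example, and the temporary file name is made up. The checksum printed here may differ from the one quoted above if the file contents are not byte-for-byte identical (for example, a trailing newline).

```r
# Write the example phrase to a temporary text file
demo_file <- file.path(tempdir(), "pretend_sequencing.txt")
writeLines("hello pretend this is sequencing data", demo_file)
md5_lower <- unname(tools::md5sum(demo_file))

# Change only the first letter to uppercase and compute the checksum again
writeLines("Hello pretend this is sequencing data", demo_file)
md5_upper <- unname(tools::md5sum(demo_file))

print(md5_lower)
print(md5_upper)

# A one-character change is enough to give a different checksum
print(md5_lower == md5_upper)
```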

Intro to human blood components, cell types, and cell markers

Blood components as separated by density gradient

1. Plasma: yellow layer on top, 55% of blood volume

2. Buffy coat (leukocytes and platelets): white layer in the middle, <1% of blood volume

White blood cells (leukocytes) = granulocytes (neutrophils, eosinophils, basophils) and PBMCs/agranulocytes (monocytes, dendritic cells, T cells, B cells, NK cells)
Platelets (thrombocytes) = the blood clotting cells, the smallest blood cell type
Related: buffy cone refers to the leukocyte-rich waste product from plateletpheresis procedures. It isn't the same as buffy coat, which is a layer from the density gradient, but both are sources of leukocytes.

3. Red blood cells (erythrocytes): red layer at the bottom, 44-45% of blood volume.

Adult red blood cells are anucleated (don't have a nucleus). The nucleus is lost during cell maturation.
Fetal red blood cells can still have a nucleus, and nucleated red blood cells are common in cord blood. This affects cell density and makes it more difficult to use density gradient medium protocols to separate cord blood into the three layers, compared to adult peripheral blood.

Learn more about blood components and blood cell types:

Blood (human), MACS Handbook by Miltenyi Biotec. Includes useful figures and tables.
"What is buffy coat in blood? Buffy coat preparation and buffy coat cell extraction" by Akadeum Life Science

Buffy coat components (leukocytes and platelets)

R programming pre-tutorial: differential expression analysis with DESeq2 and total RNA-seq data from Gonzalez et al 2018

Here, I throw data at you and help you practice differential expression analysis using the data from Gonzalez et al 2018 and tips for the official DESeq2 R package tutorial. This dataset is small enough that you can run it on a personal laptop with 8 GB of RAM and an average computer processor. Enjoy!

Prerequisites:

R programming: How to get started

R programming lesson #1: load data, subset, and write to a new file

Download un-normalized ("raw") gene counts from NCBI GEO:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109082

Supplementary file: "GSE109082_genecounts.txt.gz"
Download using the http link. You can read this file into R just like a csv spreadsheet, using the function read.csv, e.g. df = read.csv(...)

By the way, you don't need to call your data df. That's just a standard example variable name for dataframes used in many R tutorials.
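Here is a minimal sketch of that loading step. Because the real download is not bundled here, the code first writes a tiny made-up stand-in file (the gene IDs, sample names, and counts are placeholders). I have not checked the delimiter of the real GEO file, so treat the comma-separated layout as an assumption; for a tab-delimited file, adding sep = "\t" would be the usual change.

```r
# Tiny stand-in for the downloaded counts file (all names and values are made up)
counts_path <- file.path(tempdir(), "example_genecounts.txt.gz")
toy <- data.frame(
  gene_id  = c("ENSG_example_1", "ENSG_example_2", "ENSG_example_3"),
  sample_1 = c(12L, 0L, 250L),
  sample_2 = c(7L, 3L, 198L)
)
con <- gzfile(counts_path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv can read the gzip-compressed file directly;
# row.names = 1 uses the first column (gene IDs) as row names
df <- read.csv(counts_path, row.names = 1)

# Quick checks: rows = genes, columns = samples, and all counts are integers
print(dim(df))
print(head(df))
print(all(sapply(df, is.integer)))
```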

The counts data is a matrix of gene read counts per sample.

Rows = genes

Columns = samples

Here, the counts are already integers because this data comes from read alignment, where whole reads are counted per gene.

Side note: some RNA-seq primary analysis uses "pseudo-alignment" methods (e.g. kallisto) and results in a counts matrix that has decimal numbers. You can't use that for DESeq2, which expects integers. In cases like that, you need to round the counts and convert to integers before continuing.
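The rounding step described in the side note can be sketched in base R as follows. The matrix is a made-up example of decimal counts, not real kallisto output.

```r
# Made-up decimal counts: rows = genes, columns = samples
est_counts <- matrix(
  c(10.4, 0.6, 3.2, 120.49, 7.51, 0),
  nrow = 3,
  dimnames = list(c("gene1", "gene2", "gene3"), c("sampleA", "sampleB"))
)

# Round to the nearest whole number (the values are still stored as doubles)
int_counts <- round(est_counts)

# Change the storage type to integer while keeping the matrix shape and names
storage.mode(int_counts) <- "integer"

print(int_counts)
print(is.integer(int_counts))
```

Setting storage.mode keeps the row and column names, which as.integer() on its own would drop.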

The genes are all labeled with Ensembl Gene IDs ("ENSG....") and you can look up more details about them here: https://ensembl.org/

How to autoclave items for the lab

Autoclaving is used to sterilize items for lab and medical use. An autoclave is a large pressurized oven. It's safe when used properly, but precautions are necessary to avoid burns and explosions.

R code to identify the FDR=0.05 equivalent P-value (for plotting purposes)

For Manhattan plots and other plots where you're plotting -log10(P-values) or untransformed P-values, you sometimes want to draw a line to identify when FDR=0.05 is reached. But where do you draw the line? Below is R code that identifies the nearest P-value to FDR=0.05. You can then use variable FDR5.equiv.P to create a line on your Manhattan plot.

Code assumes you have a data frame with two columns, "Pval" and "FDR".
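A minimal sketch of this idea, not necessarily identical to the original code: the P-values are simulated, and the FDR column is filled in with Benjamini-Hochberg adjustment via p.adjust(), which is an assumption about how your FDR values were calculated.

```r
set.seed(42)

# Simulated P-values: mostly null results plus some small P-values (made-up data)
pvals <- c(runif(900), rbeta(100, 1, 200))

# Data frame with the two expected columns, "Pval" and "FDR"
df <- data.frame(Pval = pvals)
df$FDR <- p.adjust(df$Pval, method = "BH")  # assumed FDR method

# Find the row whose FDR is nearest to 0.05 and take its P-value
idx <- which.min(abs(df$FDR - 0.05))
FDR5.equiv.P <- df$Pval[idx]
print(FDR5.equiv.P)

# Example use on a plot of -log10(P-values):
# plot(-log10(df$Pval), ylab = "-log10(P)")
# abline(h = -log10(FDR5.equiv.P), lty = 2)
```

A stricter alternative would be to take the largest P-value whose FDR is still at or below 0.05; which one you prefer depends on whether the line should mark the nearest point or the last significant one.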

Advice: Flash talks

What is a flash talk?

Flash talks are brief formal presentations about your research, usually 1-3 minutes. They are mini oral presentations to accompany research posters or journal articles. Other names include lightning talks, speed talks, and rapid-fire talks. Their goal is to provide big picture ideas and make your research interesting to a broad audience.

They are similar to elevator pitches, except flash talks include a visual element (usually 1-3 PowerPoint slides). Sometimes the visual element is your full poster and nothing else, in which case you want to design your poster with this in mind. Add large visuals and large text.

Advice for flash talks (video links):

Winning Tips for Preparing a Successful Three-Minute Thesis 3MT® Presentation, OhioUPhysics, YouTube [12:32]

How to give a flash talk - tips and tricks for scientists, European Molecular Biology Laboratory (EMBL), YouTube [2:58]

The perfect pitch - explaining your research in one minute, Kungl. Ingenjörsvetenskapsakademien IVA, YouTube [7:32]

October 10, 2024

Bookmarks: single cell RNA-seq tutorials and tools

scRNA-seq introductions

September 20, 2024

How to get MD5 checksums to detect data corruption (for bioinformatics data curation)

September 16, 2024

Intro to human blood components, cell types, and cell markers

Blood components as separated by density gradient

Buffy coat components (leukocytes and platelets)

August 1, 2024

R programming pre-tutorial: differential expression analysis with DESeq2 and total RNA-seq data from Gonzalez et al 2018

Prerequisites:

Download un-normalized ("raw") gene counts from NCBI GEO:

June 20, 2024

How to autoclave items for the lab

May 11, 2024

R code to identify the FDR=0.05 equivalent P-value (for plotting purposes)

March 26, 2024

Advice: Flash talks

What is a flash talk?

Advice for flash talks (video links):
