Notes from Tania: Resources


February 13, 2024

Free Databases and Software for Molecular Biology, Genetics, and Bioinformatics [2024]

DNA, RNA, and protein sequences - databases
DNA, RNA, and protein sequences - software
Sequence alignments
Molecular interactions and pathways
Single cell datasets
Database-to-database ID conversions
Image software
General software
Science article repositories
(Click on "Read more..." first before using these anchor links.)

DNA, RNA, and protein sequences - databases

Ensembl - genome reference website with gene annotations. Go here for gene, genome sequence, and splicing isoform information. The Ensembl Gene IDs are useful for RNA-seq primary and secondary analysis. For example: the ID for the gene encoding actin beta is ENSG00000075624.

Computer-generated IDs for unique sequences. The pseudoautosomal genes from chromosomes X and Y are not duplicated and only represented in chromosome X.

GENCODE - genome reference website with gene annotations. Human and mouse data.

Human-annotated IDs. The pseudoautosomal genes from chromosomes X and Y are given separate IDs, keeping "ENSG0..." for the chromosome X copy (as per Ensembl.org) and replacing the first zero with an R for the chromosome Y copy, "ENSGR..."
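If you need to match GENCODE's chromosome Y pseudoautosomal IDs back to their Ensembl counterparts, the "replace the first zero with an R" rule above can be reversed with a few lines of Python. This is only a sketch: the function name, the optional ".version" suffix handling, and the example ID are my own assumptions for illustration, not part of either database.

```python
import re

# "ENSGR" followed by 10 digits: the R stands in for the first zero of the
# usual 11-digit Ensembl gene number. The optional ".N" version suffix is an
# assumption about how the IDs may appear in an annotation file.
PAR_Y_PATTERN = re.compile(r"^ENSGR(\d{10})(\.\d+)?$")


def to_ensembl_gene_id(gene_id: str) -> str:
    """Convert an 'ENSGR...' ID to the matching 'ENSG0...' ID.

    IDs that do not follow the ENSGR pattern are returned unchanged.
    """
    match = PAR_Y_PATTERN.match(gene_id)
    if match is None:
        return gene_id
    digits = match.group(1)
    version = match.group(2) or ""
    return "ENSG0" + digits + version


# "ENSGR0000123456" is a made-up ID used only to show the format.
for gene_id in ["ENSG00000075624", "ENSGR0000123456"]:
    print(gene_id, "->", to_ensembl_gene_id(gene_id))
```

Regular Ensembl IDs pass through untouched, so the function can be applied to a whole ID column.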

UC Santa Cruz Genome Browser - database of genomic DNA annotations for various species. It previously only had annotations for the older human genome reference (hg19, also called GRCh37), but annotations for the newer version (hg38, GRCh38) are being added.

Los Angeles County: Statistics, Maps, Resources, Moving Information

Bookmarks of useful links for new Los Angeles County residents.

Email lists for Los Angeles County residents

Department of Consumer & Business Affairs - the best newsletter: concise, informative, and infrequent. It covers topics such as rent control changes, housing laws, emergency resources, how to find Covid test sites, etc. Very straightforward and useful. Highly recommended.
LA County emergency alerts - emergency messages, shelter-in-place orders, evacuation orders, etc.
City-specific emergency alert systems

Safety/health websites

LAcounty.gov/emergency/ - check here for any emergency situation. The county posts evacuation maps here.
Governor's Office of Emergency Services - input an address to see natural hazard risk warnings (e.g. ground at higher risk of liquefaction in an earthquake)
CAL FIRE incidents map - lists and shows maps for all local wildfires
Los Angeles County weather service - government weather info
ready.lacounty.gov - emergency or disaster preparedness resource, supply kit ideas
ready.lacounty.gov/flooding - for flood specific information and advice
covid19.lacounty.gov - case statistics, vaccination sites, testing sites
readyforwildfire.org - preparation for wildfires

Tools for non-coders using large spreadsheets with gene data

Excel

Comma separated value (csv) spreadsheets vs Excel files

"Save As" a new Excel file (.xls, .xlsx) whenever you receive data in a .csv format.
The .csv format is a simple text file. You can open it with Notepad or Excel. It's great for programming, but not for working with Excel's many tools.
The .csv format does not support color coding, bold/italics/font formatting, multiple sheets, plots, data filtering, merged cells, or many other Excel features. You can lose data if you filter your csv file and save it (only the visible rows will remain) or if you make new sheets and save it (only the sheet you're using will save).
Avoid merging cells. It makes it hard for your bioinformatics collaborators (like me) to use your spreadsheets.
Avoid adding information only as color-coding or formatting. That should only be used to highlight things visually, but the information should be available as text in a column for long-term storage.
Make column header names unique and short whenever possible. It makes it easier to load spreadsheets into R.
Beware when opening .csv spreadsheets in Excel if they have gene name columns. Excel interprets some genes as dates instead of text. Sort your gene column by A-to-Z and then see my tutorial to fix this issue on your spreadsheets.
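For collaborators who do code, here is a rough Python sketch of how to scan a gene column for values that look like Excel's date conversions. The date patterns ("1-Mar", "Sep-02"), the column name, and the example rows are assumptions for illustration. The exact text Excel produces depends on its version and regional settings.

```python
import csv
import io
import re

# Month abbreviations that commonly show up after Excel converts gene
# symbols into dates (assumed patterns, not an exhaustive list).
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(
    rf"^(\d{{1,2}}-({MONTHS})|({MONTHS})-\d{{1,2}})$", re.IGNORECASE
)


def find_date_like_genes(csv_text, gene_column):
    """Return (spreadsheet row number, value) for suspicious gene names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        value = row[gene_column].strip()
        if DATE_LIKE.match(value):
            hits.append((row_number, value))
    return hits


# Hypothetical spreadsheet content, saved as .csv text.
example = "gene,log2FC\nACTB,0.1\n1-Mar,2.3\nTP53,-1.2\n2-Sep,0.7\n"
print(find_date_like_genes(example, "gene"))
```

The check only flags suspicious values. Recovering the original gene symbol still requires the unmodified source file.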

Sorting and filtering a spreadsheet

Highlight the table's entire header row
Select Data: Sort & Filter : Sort A to Z
Sort the table by column by using the little triangles that appear on each column in the sort area.
Be very careful when adding or moving columns to the front or the end of the table! The new columns may not be in the "block" of sorted data, so you can scramble the data when you sort it and the new columns don't sort with it. Highlight the whole table again including the new/moved columns and re-select the sort area.
Always save a copy of data before sorting in case data scrambling happens. If using other people's data, save your own copy to sort. Don't edit other people's original data.
Beware: don't paste values onto a filtered spreadsheet. The pasted values will also overwrite the hidden rows between the visible ones.
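For comparison, this is roughly how a bioinformatics collaborator might sort the same kind of table in Python. It is a minimal sketch with made-up column names and values. The point is that each row moves as one unit, and the original text is never modified, which is the same goal as the Excel precautions above.

```python
import csv
import io


def sorted_copy(csv_text, sort_column, numeric=False):
    """Return the rows sorted by one column; every row stays intact."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if numeric:
        return sorted(rows, key=lambda row: float(row[sort_column]))
    return sorted(rows, key=lambda row: row[sort_column])


# Hypothetical table: gene name, fold change, adjusted p-value.
original = "gene,log2FC,padj\nTP53,-1.2,0.01\nACTB,0.1,0.90\nCDK2,2.3,0.001\n"

by_gene = sorted_copy(original, "gene")
print([row["gene"] for row in by_gene])
print([row["log2FC"] for row in by_gene])  # values travel with their gene
```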

DNA methylation: Illumina methylation array annotation sources

Here I list available resources and annotation files for Illumina methylation arrays, which measure cytosine nucleotide DNA methylation.

There are three different probeType column values: "cg", "ch", "rs"

"cg" - CpG sites, the most common DNA methylation sites
"ch" - CpH sites (where the C base is followed by something other than a G base, H=A,T,C). Rare methylation sites. They are slightly more common in stem cells, but still not as common as CpG sites.
"rs" - methylation sites that overlap single nucleotide polymorphisms (SNPs). These should be excluded from most analysis.
If you don't have the probeType column in your data, you can identify these sites by their probeID prefixes. For example, in R, grepl("^cg", df$probeID) applies the regular expression "^cg" and returns a vector of TRUE/FALSE values, one for each text value in column df$probeID. The caret symbol (^) in "^cg" means that the text match needs to be at the beginning to be TRUE. An example probeID that would return TRUE is "cg07881041".
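If you work in Python rather than R, the same prefix check can be sketched as below. The function name is my own, and the "ch" and "rs" example IDs are made up to show the prefixes. Only "cg07881041" comes from the text above.

```python
KNOWN_PREFIXES = {"cg", "ch", "rs"}


def probe_type(probe_id):
    """Classify a probe by the first two characters of its probeID."""
    prefix = probe_id[:2].lower()
    return prefix if prefix in KNOWN_PREFIXES else "other"


# Only the first ID is real; the others are hypothetical placeholders.
probe_ids = ["cg07881041", "ch.1.1234567", "rs1234567"]

is_cg = [probe_id.startswith("cg") for probe_id in probe_ids]
print(is_cg)
print([probe_type(probe_id) for probe_id in probe_ids])
```

The is_cg list plays the same role as the TRUE/FALSE result from grepl and can be used to keep only the CpG probes.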

Official Illumina manifest files for methylation arrays

A "manifest" is a list of all items (sometimes samples, in this case methylation probes) and information about them. Illumina provides manifests for each of their methylation arrays with information such as probeID, probe type, probe chemistry details, chromosome position, and associated genes.

Infinium MethylationEPIC BeadChip (>860k sites)

Annotation files (Infinium MethylationEPIC Product Files, get the "manifest" files)
Infinium MethylationEPIC v1.0 B5 Manifest File (CSV Format), Mar 13, 2020
NCBI Build 36 (hg18, sometimes written as GRCh36), GRCh37, and GRCh38 human reference genomes
See annotation website for pdf explaining each annotation column

Illumina HumanMethylation450K BeadChip (450k sites)

Annotation files (Infinium HumanMethylation450K v1.2 Product Files, get the "manifest" files)
HumanMethylation450K v1.2 Manifest File (CSV Format), May 23, 2013
NCBI Build 36 (hg18, sometimes written as GRCh36) and GRCh37 human reference genomes

Infinium HumanMethylation27 BeadChip (27k sites)

Annotation spreadsheet (Excel spreadsheet), Aug 27, 2013
Uses the NCBI Build 36 (hg18, sometimes written as GRCh36) human genome reference

Bookmarks: Tutorials for bioinformatics & computational science tools

Links to free tools and tutorials. To be updated occasionally...

How to create publication quality images (high resolution images)

Journals typically request that images for publication be at 300 dpi resolution for photos (e.g. microscope images), 600 dpi for combination images (photos with text or line art), and 1200 dpi for line art. PowerPoint and Excel default settings don't meet these requirements.
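The dpi requirement is simple arithmetic: pixels = printed size in inches x dpi. Here is a small Python sketch of that calculation. The 3.5 x 2 inch figure size is just an example I picked, so check your journal's actual column widths.

```python
def required_pixels(width_in, height_in, dpi):
    """Pixel dimensions needed to print at a given size and resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(pixels_wide, print_width_in):
    """Resolution an existing image will have at a given printed width."""
    return pixels_wide / print_width_in


# Example figure size (an assumption): 3.5 inches wide, 2 inches tall.
for label, dpi in [("photo", 300), ("combination", 600), ("line art", 1200)]:
    print(label, required_pixels(3.5, 2.0, dpi))

# Checking an existing image: 1050 pixels wide, printed 3.5 inches wide.
print(effective_dpi(1050, 3.5))
```

Running the check in the other direction (effective_dpi) is a quick way to see whether an exported image is large enough before submitting it.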

Use these tips to get high resolution images:

Gene enrichment analysis: Gene Ontology (GO) and Ingenuity Pathway Analysis (IPA)

GO = gene ontology. The "meaning" of the gene (biological function, molecule type, cell location, disease association, etc).

Gene enrichment analysis is a search for patterns in lists of genes, such as expressed genes, upregulated or downregulated genes, or all significant genes. Most genes are associated with some pathways, for example: ACTB (beta actin) with cell migration, cell proliferation, and mitosis; TP53 (p53) with cell cycle arrest and apoptosis; ALDH2 with aldehyde metabolism; CDC7 with cell cycle control; CDK2 with cell cycle control; CDK7 with cell cycle control and transcription; etc.

The analysis looks at the specific genes and their associated biological pathways or molecular functions or disease associations or upstream regulators, then calculates p-values for how often those associations come up compared to how often you would expect them to come up in a random list of genes. If an association comes up really often (like cell cycle control), is it because it has a large list of associated genes since it is a well-studied pathway, or is it because your specific list of genes is "enriched" for that pathway? The analysis helps you determine that.

The most common statistical test for this type of analysis is a Fisher's Exact Test, which calculates the significance of an overlap. The overlap you are interested in is [your genes] compared to [the genes associated with a pathway]. The number of genes in each list is taken into account, since an overlap is more likely with longer lists. You calculate a p-value for every pathway to determine which pathways are significant for your gene list.
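To make the overlap calculation concrete, here is a minimal Python sketch of a one-sided Fisher's Exact Test, written as the upper tail of the hypergeometric distribution. The function name and all of the gene counts are my own assumptions for illustration. Real tools differ in how they choose the background gene set and how they correct for testing many pathways.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) if n_selected genes were drawn at random.

    n_background: all genes that could have been selected
    n_pathway:    background genes annotated to the pathway
    n_selected:   genes in your list
    n_overlap:    genes in your list that are also in the pathway
    """
    total = comb(n_background, n_selected)
    max_overlap = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 100 significant genes out of 20,000, with 5 of them
# in a pathway. Only the pathway size differs between the two calls.
p_small_pathway = enrichment_p_value(20000, 50, 100, 5)
p_large_pathway = enrichment_p_value(20000, 2000, 100, 5)
print(p_small_pathway, p_large_pathway)
```

The same overlap of 5 genes is far more surprising for the 50-gene pathway than for the 2,000-gene pathway, which is how the test separates a truly enriched pathway from one that simply has a long gene list.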

Free Databases and Software for Molecular Biology, Genetics, and Basic Bioinformatics [2016]

SEQUENCE DATABASES AND TOOLS

Uniprot.org - Go here to find information on specific proteins. Find information on functional and structural domains, calculate pI, calculate molecular weight, find homologs, get expression information and protein ontology notes. "The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information."
Ensembl - Go here for gene, genome sequence, and splicing isoform information. The Ensembl Gene IDs are useful for RNA-seq analysis. For example: the ID for the gene encoding actin beta is ENSG00000075624.
GenBank - NIH Genetic Sequence Database - Use this to find annotated sequence data for genes of interest. For example, look up the cDNA sequence of actin beta, click on "Send:" in the top right corner, choose "Complete Record", choose destination "File", and download format "Genbank" or "Genbank (full)". That GenBank file can be opened with any sequence software that supports annotations (e.g. SnapGene, SeqBuilder, Benchling, ApE) and the result will be the annotation showing up right next to the DNA sequence. My favorite free software for viewing annotated files is SnapGene Viewer.
Primer-BLAST - Use this tool to design specific primers for a gene of interest for qPCR or regular PCR.
GeneMANIA.org - Make protein signaling and interaction networks. Input a list of proteins and see if they are known to interact. The results are from published data, so unstudied proteins will mistakenly look like they don't interact with anything. For an example network, try this input:

CDX2
MMP2
HLA-G

The Bio-Analytic Resource (BAR) for Plant Biology - Useful tools specifically geared for plant biologists. Genome browsers, expression mappers (eFP browsers), etc.
bioDBnet: db2db - Database to Database Conversions - Really useful for RNA-seq and other large-scale experiments! If you only have a list of gene names or IDs, use db2db to generate a list of gene name synonyms, gene descriptions, biotypes (e.g. protein_coding, lincRNA), accession IDs in different databases, etc. Try using the Ensembl Gene ID input for actin beta (ENSG00000075624).
Clustal W - Multiple Sequence Alignment - Use this to check multiple sequences (DNA or protein). I like this version of Clustal W specifically because it gives me the ability to alter the parameters for the alignment. If you are aligning similar sequences, but one of them has an intron or another interruption, the default parameters will result in a poor alignment. In order to improve it, reduce the "gap extension penalty" so that the alignment score doesn't become awful due to the intron interrupting one of the sequences. Otherwise, the winning alignment will be a useless one full of 1-3 base gaps all over the place. When aligning differently spliced sequences that are otherwise expected to be similar, keep the "gap open penalty" high and the "gap extension penalty" low to get a better result. This is what I do when I manually check the sequencing results of an unknown splice product against the genomic sequence.
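A quick way to see why the two penalties behave differently is to add them up by hand. The Python sketch below assumes a common affine scoring convention (the first gap position costs the "gap open penalty" and each additional position costs the "gap extension penalty"). The penalty values are made up and are not Clustal W's defaults, and a real aligner also counts matches and mismatches.

```python
def gap_cost(gap_lengths, gap_open, gap_extend):
    """Total penalty for a set of gaps under an affine gap model."""
    return sum(gap_open + gap_extend * (length - 1) for length in gap_lengths)


one_intron_gap = [300]        # a single 300-base gap spanning the intron
scattered_gaps = [3] * 100    # the same 300 bases split into 3-base gaps

for gap_extend in (5, 0.5):   # hypothetical high and low extension penalties
    print(
        "extension penalty", gap_extend,
        "| single gap:", gap_cost(one_intron_gap, 10, gap_extend),
        "| scattered gaps:", gap_cost(scattered_gaps, 10, gap_extend),
    )
```

Lowering the extension penalty makes the single long gap cheap, while the scattered short gaps stay expensive because each one still pays the high open penalty. That is the reasoning behind keeping the "gap open penalty" high and the "gap extension penalty" low for spliced sequences.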

SOFTWARE

SnapGene Viewer showing annotated map of EGFP-HyP5SM.

SnapGene can find common features to fix poorly-annotated plasmid files.

SnapGene Viewer - This is my favorite sequence annotation software. This software will read the annotated files generated by GenBank (.gb) as well as sequence formats from for-pay software like the LaserGene Suite's SeqBuilder. If uploading the raw sequence of a plasmid, SnapGene Viewer will search for common sequences and helpfully annotate known promoters, epitope tags, selectable markers, origins of replication, reporter genes, terminators, and many other sequences. It is useful for complex annotations (allowing multiple notes, color coding, breaks in the annotated sequence, etc). It also reads DNA chromatogram files (.abi) and can be used to analyze sequencing results, but the free version does not allow for easy alignment to a reference gene. For DNA sequencing analysis, I prefer Benchling.

Checking restriction digest sites in DsRed2 using SerialCloner.

SerialCloner - Great offline tool for cloning, especially restriction digest cloning. I also use it to check if my primers are specific to my gene or if they may bind another part of the plasmid. Its sequence alignment tool will find and report less-than-perfect matches on either strand so that I can manually decide if a primer will work for me. Con: It only shows its best match.
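If you want to see every possible binding site rather than only the best one, the idea can be sketched in a few lines of Python. This is not how SerialCloner works internally. It is a simple mismatch count on both strands, with a made-up primer and template, and it ignores things that matter for real primers (such as mismatches near the 3' end and melting temperature).

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_binding_sites(primer, template, max_mismatches=2):
    """List (strand, start, mismatches) for every near-match in the template.

    "+" means the primer sequence itself appears in the given strand.
    "-" means its reverse complement appears there.
    """
    primer = primer.upper()
    template = template.upper()
    hits = []
    for strand, query in (("+", primer), ("-", reverse_complement(primer))):
        for start in range(len(template) - len(query) + 1):
            window = template[start:start + len(query)]
            mismatches = sum(a != b for a, b in zip(query, window))
            if mismatches <= max_mismatches:
                hits.append((strand, start, mismatches))
    return hits


# Hypothetical 6-base "primer" and short template, for illustration only.
primer = "ACGTTG"
template = "GGACGTTGAAAAACAACGTCC"
for hit in find_binding_sites(primer, template, max_mismatches=2):
    print(hit)
```

Positions are 0-based starts in the given template strand. Real primers are longer, so the allowed mismatch count should be chosen with the primer length in mind.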

Looking at the consensus sequences of plant L5 ribosomal proteins with BioEdit.

BioEdit - A useful tool for generating and saving sequence alignments. No longer maintained, but still worth a download. It was designed for Windows XP, but I have used it with Windows 7/8/10. If you have trouble, right-click the .exe file, select "Properties", go to the "Compatibility" tab, and run under compatibility settings for Windows XP.

ONLINE SOFTWARE

Benchling - Online sequence annotation, alignment, and sequence data analysis software. The annotation capabilities are very simple right now, so I prefer SnapGene for that. I mainly use Benchling for checking sequencing results. Upload the DNA chromatogram file (.abi) from the sequencing results and the predicted DNA (e.g. the gene you are cloning into a plasmid) as reference. Align the two sequences and check the chromatogram for SNPs.

Notes from Tania

February 13, 2024

Free Databases and Software for Molecular Biology, Genetics, and Bioinformatics [2024]

Table of contents

DNA, RNA, and protein sequences - databases

September 8, 2022

Los Angeles County: Statistics, Maps, Resources, Moving Information

Email lists for Los Angeles County residents

Safety/health websites

June 23, 2021

Tools for non-coders using large spreadsheets with gene data

Excel

Sorting and filtering a spreadsheet

January 5, 2021

DNA methylation: Illumina methylation array annotation sources

Official Illumina manifest files for methylation arrays

October 15, 2020

Bookmarks: Tutorials for bioinformatics & computational science tools

July 11, 2020

How to create publication quality images (high resolution images)

September 25, 2019

Gene enrichment analysis: Gene Ontology (GO) and Ingenuity Pathway Analysis (IPA)

October 25, 2016

Free Databases and Software for Molecular Biology, Genetics, and Basic Bioinformatics [2016]

SEQUENCE DATABASES AND TOOLS

SOFTWARE

ONLINE SOFTWARE

Bookmarks: single cell RNA-seq tutorials and tools