September 14, 2020

How to download FASTA files from NCBI

FASTA format

FASTA files are plain text files that store sequence information. Each record has two parts, a title line followed by the sequence (which may wrap across several lines):

>title line starting with a greater-than sign

sequence with no blank lines or special characters


FASTA file extensions are usually .fasta or .fa, but .txt also works because, either way, they are just plain text files. Open them with a text editor such as Notepad. For large sequences, Notepad++ handles big files better. Word or any other program that reads text files will work as well.
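Because FASTA is plain text, it can also be read with a few lines of code. The following is a minimal sketch in Python, using only the standard library; the function name `parse_fasta` and the two example records are made up for illustration and are not real NCBI entries.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {title: sequence} for every record in FASTA-formatted lines."""
    records: Dict[str, str] = {}
    title = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A title line starts a new record (the ">" itself is dropped).
            title = line[1:]
            records[title] = ""
        elif title is not None:
            # Sequence lines are joined, so wrapped sequences become one string.
            records[title] += line
    return records


# Hypothetical records, for illustration only.
example = """>seq1 hypothetical example record
ATGGCC
TTAGGC
>seq2 second hypothetical record
GGGTTT
"""

records = parse_fasta(example.splitlines())
for title, seq in records.items():
    print(title, len(seq))
```

To read a saved file instead, pass an open file handle, e.g. `with open("my_sequence.fa") as handle: records = parse_fasta(handle)` (the file name here is a placeholder).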

Select an accession ID

  1. Search for your gene of interest in the database NCBI Gene, e.g. DDX3X.
  2. Go to the section "NCBI Reference Sequences (RefSeq)" and click on an accession ID.



The most useful NCBI accession ID prefixes are those for curated sequences derived from GenBank cDNA and EST data:

  • NR_ = RNA transcript (noncoding or partial or undescribed)
  • NM_ = mRNA transcript (polyadenylated RNA transcript)
  • NP_ = protein sequence

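Other NCBI accessions come from eukaryotic genome annotation pipelines. These are predicted models that map to the genome but haven't been curated yet (and thus may be less useful):
  • XR_ = predicted non-coding RNA transcript
  • XM_ = predicted mRNA transcript
  • XP_ = predicted protein sequence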
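The prefix rules above can be sketched as a small lookup in Python. This is only an illustration: the table covers just the six prefixes listed here, the function name `describe_accession` is made up, and the accession numbers in the usage lines are placeholders rather than real records.

```python
# Prefix -> (molecule type, status), limited to the six prefixes listed above.
REFSEQ_PREFIXES = {
    "NR_": ("RNA transcript", "curated"),
    "NM_": ("mRNA transcript", "curated"),
    "NP_": ("protein sequence", "curated"),
    "XR_": ("RNA transcript", "predicted"),
    "XM_": ("mRNA transcript", "predicted"),
    "XP_": ("protein sequence", "predicted"),
}


def describe_accession(accession: str) -> str:
    """Describe an accession ID based on its first three characters."""
    prefix = accession[:3].upper()
    if prefix not in REFSEQ_PREFIXES:
        return f"{accession}: prefix not covered by this table"
    molecule, status = REFSEQ_PREFIXES[prefix]
    return f"{accession}: {status} {molecule}"


# Placeholder accession numbers, for illustration only.
print(describe_accession("NM_000000.0"))
print(describe_accession("XP_000000.0"))
print(describe_accession("AB000000.0"))
```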

Download a FASTA file (.fa text file) from NCBI Nucleotide

After clicking on the accession ID, you'll be taken to NCBI Nucleotide, which shows the sequence information, e.g. the page for DDX3X.


Click on "Send to:" in the upper right corner.



Select Complete Record, then File, then format "FASTA". Click Create File.
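As an alternative to clicking through the website, the same FASTA record can be requested from NCBI's E-utilities `efetch` service. This is a different route from the "Send to:" menu described above, shown here as a minimal Python sketch. The helper names `build_efetch_url` and `download_fasta` are made up for this example, and the accession `NM_000000.0` is a placeholder that should be replaced with the accession ID you selected.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, db: str = "nuccore") -> str:
    """Build an efetch URL that returns the record as FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def download_fasta(accession: str, out_path: str, db: str = "nuccore") -> None:
    """Fetch one record from NCBI and save it as a text file (needs internet)."""
    with urlopen(build_efetch_url(accession, db)) as response:
        text = response.read().decode("utf-8")
    with open(out_path, "w") as handle:
        handle.write(text)


# Placeholder accession; only the URL is built here, nothing is downloaded.
print(build_efetch_url("NM_000000.0"))
```

To actually save a file, call something like `download_fasta("NM_000000.0", "my_sequence.fa")` with a real accession ID. For protein accessions (NP_ or XP_), pass `db="protein"`.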




The resulting FASTA file can be opened with any text file reader. Here are the first few lines for DDX3X:


Retrieve a FASTA sequence (no download)

If you just want to copy and paste, and don't need a saved file, click on the "FASTA" link on the upper left side.






Alignments

Clustal Omega & Clustal W

You can manually compare your FASTA-formatted sequence with other FASTA sequences by aligning them with Clustal Omega (the more recent version) or Clustal W.

Use Clustal W if comparing intron-containing and intronless sequences (e.g. genomic versus coding sequence) because you can reduce the Gap Open Penalty to zero, so that the gaps at intron positions are not penalized for opening, which gives a better alignment.
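To see why the gap open penalty matters for intron gaps, here is a small Python sketch that scores an already-aligned pair of sequences. It is a simplified scoring scheme written for illustration, not Clustal W's actual algorithm, and the match, mismatch and gap values are made-up numbers rather than Clustal W defaults. The two short sequences are also invented.

```python
def alignment_score(top: str, bottom: str, match: float = 1.0,
                    mismatch: float = -1.0, gap_open: float = -10.0,
                    gap_extend: float = -0.5) -> float:
    """Score two aligned sequences of equal length ('-' marks a gap).

    Each run of gap columns costs gap_open once, plus gap_extend per column.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0.0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            if not in_gap:
                score += gap_open  # charged once, when the gap starts
            score += gap_extend    # charged for every gap column
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# Invented example: "genomic" contains a 6-base intron that "coding" lacks.
genomic = "ATGGTAAGTCCC"
coding = "ATG------CCC"

print(alignment_score(genomic, coding))                # default gap_open
print(alignment_score(genomic, coding, gap_open=0.0))  # gap_open set to zero
```

With the gap open penalty at zero, the alignment that places one gap over the intron scores higher, so an aligner is more willing to choose it.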

NCBI BLAST

The FASTA sequence can also be used with NCBI BLAST tools to compare your sequence against whole databases.
  • Nucleotide blast (blastn) - compares your nucleotide sequence to known nucleotide sequences.
  • blastx - translates your nucleotide sequence input to amino acids, then compares the results to known protein sequences. Conveniently, blastx tries all 6 reading frames of your input (3 forward, 3 reverse).
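To make "all 6 reading frames" concrete, here is a minimal Python sketch that translates a nucleotide sequence in three forward and three reverse-complement frames using the standard genetic code. It only illustrates the idea and is not how blastx is implemented. The function names are made up, the frame labels (+1 to +3, -1 to -3) are simply offsets chosen for this example, and the input sequence is invented.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq: str) -> dict:
    """Translate the 3 forward and 3 reverse-complement reading frames."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# Invented 6-base input, for illustration only.
frames = six_frame_translation("ATGGCC")
for name, protein in frames.items():
    print(name, protein)
```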
