Notes from Tania: How to get MD5 checksums to detect data corruption (for bioinformatics data curation)

What are MD5 checksums? Checksums are nonsense text strings used to "summarize" a file version. No matter the size of the file (1 KB or 30 GB), the checksum algorithm gives you a conveniently short nonsense string of letters and numbers. The exact same file will give you the exact same checksum every time. If you change a single character or pixel, you will get a different checksum.

MD5 is a specific popular algorithm to get checksums.

Why use checksums? The purpose of checksums is to notice data corruption, especially when downloading files from or uploading files to a server. Every time you transfer files between computers, there is a risk of data corruption. For small files, the risk is small and you'll most likely notice, for example if your email attachment download fails due to an internet interruption.

For large files such as raw sequencing data files, it's a bigger issue and you might not notice right away (or ever) if the last few RNA-seq reads of a file with >30 million reads are missing. Therefore, the best practice when downloading new sequencing data is to create MD5 checksums yourself and compare them with the MD5 checksums created by the originating computer (the sequencing core's server). They should be the same. If not, something went wrong during file transfer! Try re-downloading the data.

Similarly, when you upload sequencing data to a public repository (e.g. NCBI GEO), you provide MD5 checksums so that the receivers (NCBI's data curators) can confirm the upload was successful.

How to get an MD5 checksum for an individual file? See example below using the Linux terminal. I created a text file containing only the phrase "hello pretend this is sequencing data". The checksum for that file is "b088d8d4d1d831af2d8d16147389aa7d". If I change the first letter to uppercase, the checksum completely changes.
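A terminal session along these lines can be sketched as follows. The filename hello.txt and the scratch folder are made up for illustration. The exact checksum depends on every byte in the file, including whether it ends with a newline, so the values printed here are not guaranteed to equal the one quoted above.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create a tiny stand-in "data" file (hello.txt is a made-up name).
echo "hello pretend this is sequencing data" > hello.txt
md5sum hello.txt

# Keep only the checksum (the first field of the output) for comparison.
sum_before=$(md5sum hello.txt | awk '{print $1}')

# Change the first letter to uppercase and compute the checksum again.
echo "Hello pretend this is sequencing data" > hello.txt
md5sum hello.txt
sum_after=$(md5sum hello.txt | awk '{print $1}')

# A one-letter change gives a completely different checksum.
if [ "$sum_before" != "$sum_after" ]; then
    echo "The checksum changed after editing the file"
fi
```

Each md5sum call prints one line: the checksum, then the filename.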

RNA-seq data curation example

Download data files from the sequencing core to a hard drive.

Option 1: Manually download files one by one from a website interface (tedious but ok if only a few files).
Option 2: Sign into the FTP server and download files using FTP software such as FileZilla. The sequencing core will need to give you the login information. This is better! FTP software makes downloading many files easier and also gives you warnings if it detects file transfer failures.

Identify the files containing raw RNA-seq data

Connect your hard drive to a computer running Linux. It doesn't need to be a fancy expensive server. If you have an old desktop or laptop, you can install Linux on it.

FASTQ is the most common filetype for RNA-seq raw data. It is essentially a huge text file. Look for extension .fastq.gz for gzip-compressed raw data. This may be one file per sample for bulk mRNA-seq, or two if RNA-seq reads are paired, or more for single cell RNA-seq data sequenced deeply.
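To find the FASTQ files from the terminal, something like the sketch below may help. It builds a scratch folder with three tiny stand-in files (the folder and sample names are invented), lists and counts everything ending in .fastq.gz, and peeks at the first read. On a real hard drive you would skip the setup lines and run the find and gzip commands from your data folder.

```shell
# Scratch directory with stand-in files; all names here are made up.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p run1/sampleA run1/sampleB

# Tiny gzip-compressed FASTQ files with one read each (4 lines per read).
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' | gzip > run1/sampleA/sampleA_R1.fastq.gz
printf '@read1\nTTGCAAGC\n+\nIIIIIIII\n' | gzip > run1/sampleA/sampleA_R2.fastq.gz
printf '@read1\nGGCATCGA\n+\nIIIIIIII\n' | gzip > run1/sampleB/sampleB_R1.fastq.gz

# List every .fastq.gz file under the current folder, including subfolders.
find . -type f -name "*.fastq.gz" | sort

# Count them, to compare against the number of files you expect.
n_fastq=$(find . -type f -name "*.fastq.gz" | wc -l | tr -d ' ')
echo "Found $n_fastq FASTQ files"

# Peek at the first read without decompressing the file on disk.
first_read=$(gzip -dc run1/sampleA/sampleA_R1.fastq.gz | head -n 4)
echo "$first_read"
```

Counting the files is a quick sanity check before computing checksums: paired-end data should give two files per sample.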

How to get MD5 checksums for many files at once?

If you have a desktop interface on Linux, open the "Files" window and navigate to your hard drive, then right-click the desired data folder to open a terminal window.

Get MD5 checksums for files with extension ".fastq.gz" using one of the following two methods:

If all your files are in one folder, simply use the following code to make a file "md5sums.txt":

md5sum *.fastq.gz > md5sums.txt

If your files are in different subfolders, first locate the files with the ".fastq.gz" extension, then feed that list to the md5sum command, then append the results to a text file named "md5sums_for_fastq.txt":

find . -name "*.fastq.gz" -exec md5sum {} \; >> md5sums_for_fastq.txt

Since the point is to compare YOUR checksums with the checksums created by the sequencing core, make sure your checksum filenames are unique and not overwriting any existing checksum files that you downloaded along with the data, otherwise that defeats the purpose.
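One way to guard against overwriting, sketched under some assumptions: the core's checksum file is called md5sums.txt (a guess, yours may be named differently), and your own output gets a distinct dated name (the "my_md5sums_" prefix is just a hypothetical naming scheme). The script refuses to write if that file already exists.

```shell
# Scratch directory with stand-in files; all names here are made up.
workdir=$(mktemp -d)
cd "$workdir"
echo "pretend reads" | gzip > sample1.fastq.gz

# Stand-in for a checksum file that came with the download.
echo "checksum-from-core  sample1.fastq.gz" > md5sums.txt

# Pick a distinct name for YOUR checksums (hypothetical naming scheme).
out="my_md5sums_$(date +%Y%m%d).txt"

# Only write the file if nothing with that name exists yet.
if [ -e "$out" ]; then
    echo "Refusing to overwrite existing file: $out" >&2
else
    md5sum *.fastq.gz > "$out"
    echo "Wrote checksums to $out"
fi
```

With this pattern the core's md5sums.txt stays untouched, and running the script twice on the same day does not silently replace your first result.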

FASTQ files are large, so getting the MD5 checksums may take a while. Let the Linux command run over lunch or overnight if you have many files. Don't interrupt the process.

Compare checksums

Once you have two checksum files, you can compare checksums from the source server (from the sequencing core) and the hard drive where you saved the data (your version). If you find a mismatch, re-download that file and check again. If you keep getting mismatches, contact the sequencing core.
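If the core's checksum file is in the usual md5sum output format, the md5sum command can also check your downloaded files against it directly with the -c option. The sketch below uses stand-in files and a made-up filename (core_md5sums.txt), and simulates a corrupted download to show what a mismatch looks like. It assumes GNU md5sum as found on most Linux systems.

```shell
# Scratch directory with stand-in files; all names here are made up.
workdir=$(mktemp -d)
cd "$workdir"
echo "pretend reads 1" | gzip > sample1.fastq.gz
echo "pretend reads 2" | gzip > sample2.fastq.gz

# Stand-in for the checksum file provided by the sequencing core.
md5sum sample1.fastq.gz sample2.fastq.gz > core_md5sums.txt

# Check the files on disk against the core's checksums.
md5sum -c core_md5sums.txt

# Simulate a corrupted download of the second file.
echo "truncated" > sample2.fastq.gz

# md5sum -c exits with a non-zero status if any file fails the check.
if md5sum -c core_md5sums.txt > check_report.txt 2>/dev/null; then
    echo "All files match"
else
    echo "Mismatch found, re-download these files:"
    grep -v ': OK$' check_report.txt
fi
```

Run md5sum -c from the folder where the filenames listed in the checksum file can be found, otherwise it reports the files as missing rather than mismatched.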

See example below with the core's checksums text file (left) and mine (right). The samples are in the same order and checksums match.

Filenames were cropped out of the image but the file format is:

MD5checksum1 filename1.fastq.gz
MD5checksum2 filename2.fastq.gz
MD5checksum3 filename3.fastq.gz
Etc...
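Instead of comparing two checksum files by eye, you can let the terminal do it. This is a minimal sketch, assuming both files use the format above and list the filenames written the same way (for example, both without a leading "./"). The names core_md5sums.txt and my_md5sums.txt are made up. Sorting by filename first means the two files do not need to be in the same order.

```shell
# Scratch directory with stand-in files; all names here are made up.
workdir=$(mktemp -d)
cd "$workdir"
echo "pretend reads 1" > sample1.fastq.gz
echo "pretend reads 2" > sample2.fastq.gz

# Two checksum files for the same data, listed in a different order.
md5sum sample1.fastq.gz sample2.fastq.gz > my_md5sums.txt
md5sum sample2.fastq.gz sample1.fastq.gz > core_md5sums.txt

# Sort both files by filename (the second column).
sort -k 2 core_md5sums.txt > core_sorted.txt
sort -k 2 my_md5sums.txt > mine_sorted.txt

# diff exits with status 0 only if the sorted files are identical.
if diff core_sorted.txt mine_sorted.txt > md5_diff.txt; then
    result="all checksums match"
else
    result="checksums differ, see md5_diff.txt"
fi
echo "$result"
```

If the files differ, md5_diff.txt lists the lines that do not agree, which tells you which files to re-download.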

Notes from Tania

September 20, 2024
