November 22, 2022

Advice: Raw data storage for bioinformatics (terabytes of data)

Good habits and advice

  • Download sequencing data from the sequencing core or company within 30 days of receiving it. They won't store it forever! 
  • Keep multiple copies of your important data.
  • Don't open your original results spreadsheet files with Excel. Make copies specifically to open with Excel, to avoid autosaving corrupted or filtered data. Beware Excel's automatic date conversion of gene names (for example, SEPT1 and MARCH1 can be silently turned into dates).
  • Always eject or unmount data drives to prevent data corruption. Do this before you unplug them physically.
  • Check MD5 checksums. This helps you detect data corruption during downloads, storage, uploads, etc. If you find a mismatch, you can re-download or re-upload to correct it.
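
For the Excel advice above, here is a minimal Python sketch of how you could make the working copy. The function name `make_excel_copy`, the `_excel_copy` suffix, and the `results.csv` file name are my own placeholders, not anything standard. It also marks the original as read-only, which is an extra precaution I'm assuming you want.

```python
import os
import shutil
import stat
from pathlib import Path


def make_excel_copy(original, suffix="_excel_copy"):
    """Copy `original` next to itself and mark the original read-only.

    Returns the path of the working copy that is safe to open in Excel.
    """
    original = Path(original)
    working_copy = original.with_name(original.stem + suffix + original.suffix)
    shutil.copy2(original, working_copy)  # copy2 also preserves timestamps
    # Keep the working copy writable, even if the original was already read-only.
    os.chmod(working_copy, stat.S_IREAD | stat.S_IWRITE)
    # Remove write permission from the original so an accidental save fails.
    os.chmod(original, stat.S_IREAD)
    return working_copy


# Example (hypothetical file name):
# make_excel_copy("results.csv")  # creates results_excel_copy.csv
```

Open only the `_excel_copy` file in Excel and leave the original untouched.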

Drive formatting


  • FAT32
    • Oldest of the three file systems listed here
    • Common file system for USB flash drives (traditionally max 32 GB drive size)
    • Partitions can now be up to 2 TB; previously the practical limit was 32 GB. Don't use this format for drives larger than that or you will "lose" the extra space.
    • Individual files cannot be over 4 GB, which is a problem for bioinformatics
    • Don't use for bioinformatics raw data backups
    • Do use this format for your smart TVs: format a 32 GB USB drive as FAT32 and load it with videos to view with friends and family
  • NTFS
    • Modern file system that Windows prefers
    • Read/write on Windows XP to present version
    • Read-only on Mac OS X and some Linux distributions (limitation!)
      • Works fine on Ubuntu 20.04 LTS (per my own experience)
    • Large file size and partition size limits, so good for bioinformatics data storage
    • Supports change journals (good if computer crashes)
    • Does not unmount with computer hibernation which can be a data loss problem
    • Ubuntu isn't good at repairing damaged partitions
    • Fully shut down the Windows computer before disconnecting the drive, otherwise you may have issues mounting the drive on Ubuntu
  • exFAT
    • Introduced in 2006 and still under patent
    • Optimized for flash drives, similar to FAT32 but with fewer limitations
    • Read/write capabilities on Windows XP to present version, and on Mac OS X
    • Linux may require additional software to read/write exFAT drives (due to patents, support isn't included with the default install files)
    • Large file size and partition size limits, so good for bioinformatics data storage
    • No journaling so more prone to data corruption than NTFS

Generally: use NTFS or exFAT. See if your collaborators have a preference, otherwise either one is ok.
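
If you are stuck with a FAT32 drive, a quick check like the one below can tell you which files won't fit before you start copying. This is only a sketch: the function name `files_too_big_for_fat32` and the folder name in the example are placeholders I made up, and the limit used is the FAT32 maximum file size of 4 GiB minus 1 byte.

```python
from pathlib import Path

# FAT32 stores file sizes in 32 bits, so the largest file is 4 GiB minus 1 byte.
FAT32_MAX_FILE_BYTES = 2**32 - 1


def files_too_big_for_fat32(folder, limit=FAT32_MAX_FILE_BYTES):
    """Return (path, size_in_bytes) for every file under `folder` larger than `limit`."""
    too_big = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                too_big.append((path, size))
    return too_big


# Example (hypothetical folder name):
# for path, size in files_too_big_for_fat32("raw_data"):
#     print(f"{size / 1e9:.1f} GB  {path}")
```

Compressed FASTQ and BAM files regularly exceed this limit, which is why NTFS or exFAT is the safer choice.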


Checksums

Whenever you download or upload large data between servers and drives, check data integrity using checksums. Read: How to get MD5 checksums to detect data corruption.

If downloading, the sequencing facility or data repository should provide their own checksums. After the download finishes, generate your own checksum file from your copies and compare it to theirs.
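
One way to make your own checksum file is with Python's built-in `hashlib`. This is a minimal sketch, not the only way to do it: the function names, the `*.fastq.gz` pattern, and the `my_checksums.md5` file name are assumptions of mine. Files are read in chunks so that very large files don't have to fit in memory, and the output uses the same "checksum, two spaces, filename" layout that the `md5sum` tool writes.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it 1 MiB at a time."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def write_md5_manifest(folder, pattern="*.fastq.gz", manifest_name="my_checksums.md5"):
    """Write one 'checksum  filename' line per matching file in `folder`."""
    folder = Path(folder)
    manifest = folder / manifest_name
    lines = [f"{md5_of_file(path)}  {path.name}" for path in sorted(folder.glob(pattern))]
    manifest.write_text("".join(line + "\n" for line in lines))
    return manifest


# Example (hypothetical folder name):
# write_md5_manifest("raw_data")
```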

Split the MD5 checksum text file into two columns, checksum and filename, using Excel's "Convert Text to Columns" feature. Assuming your checksums and the originating server's checksums are in the same order, you can quickly check equality with Excel, for example if comparing checksums in cell A2 and cell D2:

    =A2=D2

Excel will evaluate this as TRUE or FALSE. You want everything to be TRUE.
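
If the two checksum files are not in the same order, or you have hundreds of files, a short script can do the comparison by filename instead. The sketch below assumes both files use the `md5sum` layout (checksum, whitespace, filename); the function names and the example file names are placeholders of mine.

```python
import os


def read_md5_manifest(path):
    """Parse 'checksum  filename' lines into a {filename: checksum} dictionary."""
    checksums = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            checksum, filename = line.split(maxsplit=1)
            # md5sum prefixes the name with "*" in binary mode; drop it and any folder part.
            filename = os.path.basename(filename.lstrip("*"))
            checksums[filename] = checksum.lower()
    return checksums


def compare_manifests(theirs_path, mine_path):
    """Return (mismatched, missing) filename lists, using their manifest as the reference."""
    theirs = read_md5_manifest(theirs_path)
    mine = read_md5_manifest(mine_path)
    mismatched = sorted(n for n in theirs if n in mine and theirs[n] != mine[n])
    missing = sorted(n for n in theirs if n not in mine)
    return mismatched, missing


# Example (hypothetical file names):
# mismatched, missing = compare_manifests("facility_checksums.md5", "my_checksums.md5")
```

Both lists should come back empty. Anything in `mismatched` needs to be downloaded again, and anything in `missing` was never downloaded (or never checksummed) on your side.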

Initializing hard drives or SSD drives

If you buy a new HDD or SSD and your computer does not recognize it when plugged in, you may need to "initialize it". Instructions for Windows 10 | Microsoft official instructions

***Note these instructions are for new and empty drives. Initializing and formatting deletes data.***

Search for and open "Disk Management" ("Create and format hard disk partitions")
If your disk is plugged into USB, you'll get a window saying "You must initialize a disk before Logical Disk Manager can access it."

Choose a partition style: pick GPT (GUID Partition Table), since it supports volumes larger than 2 TB and is likely what you want for bioinformatics raw data storage.



Disk 1 will change from "Not Initialized" to "Initializing" to "Online".

Next, right-click the newly initialized disk and select "New Simple Volume"


This opens a pop-up. Select Next...




Select the maximum simple volume size in the next window (default is usually the maximum). Select Next...




Assign a drive letter: avoid C and D and anything else that may already be in use.




Format partition: pick exFAT or NTFS. I prefer NTFS.

Format... and done! Your drive should be recognized by Windows now.





CMR vs SMR - two types of hard drive recording technologies

When purchasing, check that the product specifications mention the drive's read and write speeds, NOT just the theoretical speed of the USB connection. Although USB 3.0 is nominally capable of 5 Gbit/second (about 625 MB/second), you are limited by the hard drive's underlying technology. You're not going to actually get that speed with any HDD, no matter what the marketing department tells you.




For data storage, avoid HDD that use shingled magnetic recording (SMR) technology. SMR drives hold incoming data in a cache, write it out, then fill the cache again and write it out, repeatedly. This is painfully slow when transferring large (>20 GB) files. SMR drives are slow to read/write and get slower over time, often 15-40 MB/second.

For faster read/write speeds with large files, get HDD that use conventional magnetic recording (CMR) technology. These can reach 2-4 times the speed of SMR-based drives, around 30-120 MB/second.

The smaller 2.5" HDD drives (5 TB max) tend to use slower SMR technology and often also have slower rotational speeds (5400 rpm). 

The larger 3.5" HDD drives with higher capacities (>5 TB) tend to use faster CMR technology and might have higher rotational speeds (7200 rpm). Check for advertised read/write speeds when selecting which drive to buy.

For HDD, USB 3.0 is fine. You won't get an extra speed boost with USB 3.1 or 3.2 because the underlying technology is the limiting factor, not the USB speed.
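
To see what these speeds mean in practice, here is a quick back-of-the-envelope calculation in Python. It assumes a sustained transfer speed (real transfers fluctuate) and decimal units (1 TB = 1000 GB, 1 GB = 1000 MB); the function name `transfer_hours` is just a label I picked.

```python
def transfer_hours(size_gb, speed_mb_per_s):
    """Hours needed to move `size_gb` gigabytes at a sustained `speed_mb_per_s`."""
    seconds = size_gb * 1000 / speed_mb_per_s
    return seconds / 3600


# Speeds taken from the SMR and CMR ranges quoted above.
for label, speed in [("SMR, slow end", 15), ("SMR, fast end", 40), ("CMR, fast end", 120)]:
    print(f"{label:>14}: {transfer_hours(1000, speed):5.1f} hours per TB")
```

At 15 MB/second, one terabyte takes about 18.5 hours. At 120 MB/second, it takes about 2.3 hours.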

Cold storage (long-term in a drawer somewhere)

HDD are ideal for long-term data storage, especially if you need "cold storage", meaning leaving HDD drives in an air conditioned office drawer somewhere, unpowered ("cold"). They can last for years unpowered and are less likely to fail unpowered than SSD drives. However, even with HDD, it is good practice to check your data drives by connecting them to a computer (and power) once a year.

SSD drives are reliable for "hot" (powered) storage, meaning connected to a computer regularly. Because SSD drives are so much faster than any HDD drive, I recommend using SSD for the actual data analysis and other day-to-day work, then HDD for cold storage. 


---------------------------------------
Updated:
4/29/2025 to add Excel formulas and more screenshots.
6/4/2025 to add CMR vs SMR notes.
8/28/2025 to add cold storage section.
