Good habits and advice
- Download sequencing data from the sequencing core or company within 30 days of receiving it. They won't store it forever!
- Keep multiple copies of your important data.
- Don't open your original results spreadsheet files with Excel. Make copies specifically for Excel so that autosave can't overwrite the originals with corrupted or filtered data. Beware that Excel silently converts some gene names to dates (e.g. SEPT1 and MARCH1 become 1-Sep and 1-Mar).
- Always eject or unmount data drives to prevent data corruption. Do this before you unplug them physically.
- Check MD5 checksums. This helps you detect data corruption during downloads, storage, uploads, etc. If you find a mismatch, you can re-download or re-upload to correct it.
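As a rough illustration of the Excel gene name issue, the sketch below flags entries in a gene list that look like dates. The function name and the two patterns it checks ("1-Sep" style and "Sep-01" style) are assumptions for illustration; Excel can display converted dates in other formats depending on locale and cell formatting, so treat a clean result as a hint rather than proof.

```python
import re

# Month abbreviations Excel commonly uses when it displays a converted gene symbol as a date.
_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
# Matches values such as "1-Sep" or "Sep-01" (assumed display formats, not an exhaustive list).
_DATE_LIKE = re.compile(
    rf"^(\d{{1,2}}-({_MONTHS})|({_MONTHS})-\d{{1,2}})$", re.IGNORECASE
)


def find_date_like_names(gene_names):
    """Return the entries that look like Excel-converted dates."""
    return [name for name in gene_names if _DATE_LIKE.match(name.strip())]
```

For example, `find_date_like_names(["TP53", "1-Sep", "MARCH1"])` returns `["1-Sep"]`, which suggests that column was opened and saved with Excel at some point.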
Drive formatting
- FAT32 (not recommended for bioinformatics)
- Oldest file format
- Common file format for USB flash drives (traditionally 32 GB or smaller)
- Partitions can now be up to 2 TB (the limit was previously 32 GB). Don't use this format for higher-capacity drives or you will "lose" the space beyond that limit.
- Individual files cannot be larger than 4 GB, which is a problem for bioinformatics
- Don't use for bioinformatics raw data backups
- Do use this format for USB flash drives to use with your smart TVs (e.g. to view videos, photos)
- NTFS
- Modern file system that Windows prefers
- Great for external drives that will be used with both Windows and Linux.
- Read/write on Windows XP to present version
- Read/write on Ubuntu 20 and later at least (per my own experience)
- Read-only on macOS (Mac OS X) by default
- Large file size and partition size limits, so good for bioinformatics data storage.
- Supports change journals (good if computer crashes, helps prevent data corruption). This makes NTFS better than exFAT if you don't need MacOS support.
- Does not unmount from Linux when the computer hibernates, which can cause data loss. Linux isn't good at repairing damaged NTFS partitions.
- Fully shut down the Windows computer before disconnecting the drive, otherwise you may have issues mounting the drive on Linux.
- Beware that NTFS is more prone to file fragmentation, so occasionally run disk defragmentation on any HDD that is NTFS formatted.
- Beware that NTFS drives can lock to read-only if the computer goes to sleep (to prevent data corruption), and you might have issues writing files until you reboot the computer. Personally, I notice this problem occasionally in Windows, but it seems to affect only my ability to create or edit Word, Excel, and PowerPoint files. Strangely, I can still create files with R or Python, or save Microsoft Office files to my desktop and then copy/paste them to the partially locked external drive. Rebooting always fixes the issue and returns full read/write access.
- exFAT
- Introduced in 2006 and still under patent
- Optimized for flash drives, similar to FAT32 but with fewer limitations
- It has great cross-platform compatibility. Best for external drives that will be passed around to Windows, Linux, and MacOS users.
- Read/write on Windows (XP to present version) and macOS (Mac OS X)
- Older Linux systems may require additional software to read/write exFAT drives (historically due to patents); recent Linux kernels include exFAT support
- Large file size and partition size limits, so acceptable for bioinformatics data storage.
- Con: No journaling so more prone to data corruption than NTFS. If you unplug the drive while it is writing a file, the whole drive may be corrupted, not just that individual file. For this reason, I don't recommend it for storing bioinformatics data if NTFS will work.
- Con: It does not store file permission information. You can't make any files read-only. All users have read/write access, so don't use for internal drives on computers with multiple users.
- ext4
- Linux-only file format, optimized for fast read/write performance (less file fragmentation, faster speeds than NTFS or exFAT)
- Best for Linux internal drives, especially any that will be used for large file data analysis. Anecdotally, some Linux applications only work with ext4 and not NTFS.
- Con: ext4-formatted drives won't be readable with Windows or MacOS, so don't use ext4 for external data drives that you need to use with other operating systems.
Generally: use NTFS for external drives used with both Windows and Linux, or for internal drives for Windows; use exFAT for external data drives if you have collaborators that use MacOS also; use ext4 for internal drives for Linux.
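Before copying data onto a FAT32 drive, it can help to check whether any file exceeds the 4 GB per-file limit mentioned above. The following is a minimal sketch, assuming the limit is 2**32 - 1 bytes (FAT32 stores file sizes in a 32-bit field); the function name and the `limit` parameter are made up for illustration.

```python
from pathlib import Path

# Assumed FAT32 per-file limit: sizes are stored in a 32-bit field, so 2**32 - 1 bytes.
FAT32_MAX_FILE_BYTES = 2**32 - 1


def files_too_large_for_fat32(folder, limit=FAT32_MAX_FILE_BYTES):
    """Return (path, size_in_bytes) for every file under folder that is larger than limit."""
    oversized = []
    for path in Path(folder).rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                oversized.append((path, size))
    return oversized
```

If this returns anything for a folder of FASTQ or BAM files, that folder needs an NTFS, exFAT, or ext4 destination instead.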
Checksums
Whenever you download or upload large data between servers and drives, check data integrity using checksums. Read: How to get MD5 checksums to detect data corruption.
If downloading, the sequencing facility or data repository should provide its own checksums. Generate a separate checksum file from your downloaded copies and compare it to theirs.
Split the MD5 checksum text file into two columns, checksum and filename, using Excel's "Convert Text to Columns" feature. Assuming your checksums and the originating server's checksums are in the same order, you can quickly check equality with Excel, for example if comparing checksums in cell A2 and cell D2:
=A2=D2
Excel will evaluate this as TRUE or FALSE. You want everything to be TRUE.
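If you would rather not use Excel for the comparison, the same check can be sketched in Python. This is only an illustration: it assumes checksum files in the common `md5sum` layout (checksum, whitespace, filename, one file per line), and the function names are hypothetical. It matches entries by file name instead of by row order, so the two lists don't need to be sorted the same way.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_checksum_file(path):
    """Parse lines of the form '<md5>  <filename>' into a {filename: md5} dict."""
    checksums = {}
    for line in Path(path).read_text().splitlines():
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2:
            checksum, name = parts
            # md5sum marks binary mode with a leading '*'; keep only the base file name.
            checksums[Path(name.lstrip("*")).name] = checksum.lower()
    return checksums


def find_mismatches(expected, actual):
    """Return file names whose checksum is missing or differs between the two dicts."""
    return sorted(name for name in expected if actual.get(name) != expected[name])
```

Here `expected` would come from the facility's checksum file and `actual` from your own; an empty list from `find_mismatches` is the equivalent of every Excel cell reading TRUE. Keeping only the base file name is a simplification that breaks if two folders contain files with the same name.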
Initializing hard drives or SSD drives
If you buy a new HDD or SSD and your computer does not recognize it when plugged in, you may need to "initialize it". Instructions for Windows 10 | Microsoft official instructions
***Note these instructions are for new and empty drives. Initializing and formatting deletes data.***
Search for and open "Disk Management" ("Create and format hard disk partitions")
If your disk is plugged into USB, you'll get a window saying "You must initialize a disk before Logical Disk Manager can access it."
Choose a partition style: pick GPT (GUID Partition Table) since it supports volumes larger than 2TB, so it's likely what you want for bioinformatics raw data storage.
Disk 1 will change from "Not Initialized" to "Initializing" to "Online".
Next, right-click the newly initialized disk and select "New Simple Volume"
Select the maximum simple volume size in the next window (default is usually the maximum). Select Next...
Assign a drive letter: avoid C and D and anything else that may already be in use.
Format partition: pick exFAT or NTFS. I prefer NTFS.
Format... and done! Your drive should be recognized by Windows now.
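The 2 TB figure in the partition style step comes from simple arithmetic, sketched below. It assumes the older MBR partition style with 32-bit sector counts and 512-byte sectors; drives with larger sectors change the numbers, and the names here are illustrative only.

```python
# MBR stores partition sizes as 32-bit sector counts; 512-byte sectors are assumed here.
SECTOR_BYTES = 512
MBR_MAX_SECTORS = 2**32

mbr_limit_bytes = MBR_MAX_SECTORS * SECTOR_BYTES  # 2**41 bytes
mbr_limit_tib = mbr_limit_bytes / 2**40  # the same limit expressed in TiB


def needs_gpt(drive_bytes):
    """Return True when a drive is too large to address fully with MBR."""
    return drive_bytes > mbr_limit_bytes
```

Under these assumptions the limit works out to 2 TiB (about 2.2 TB), so an 8 TB drive needs GPT while a 1 TB drive would work with either style.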
CMR vs SMR - two types of hard drive recording technologies
When purchasing, check that the product specifications mention the drive's read and write speeds, NOT just the theoretical speed of the USB connection. Although USB3 is capable of very fast speeds, you are limited by the hard drive's underlying technology. You're not going to actually get USB 3.0's advertised 5 Gbit/second (gigabits, not gigabytes) with any HDD, no matter what the marketing department tells you.
"WD 5TB Passport - slow writes? (5-10 MB/s)" (Reddit r/DataHoarder post)
Best Hard Drives 2025: Our top HDD picks for desktop PCs, NAS, and more (tomshardware.com)
Find if your drive is SMR by HDD model number (NAScompares.com)
For data storage, avoid HDD that use shingled magnetic recording (SMR) technology. An SMR drive holds incoming data in a cache, writes it out, then refills the cache and writes again, repeatedly. It is painfully slow when transferring large (>20 GB) files. SMR drives are slow to read/write, often 15-40 MB/second, and get slower over time.
For faster read/write speeds with large files, get HDD that use conventional magnetic recording (CMR) technology. These can reach 2-4 times the speed of SMR-based drives, around 30-120 MB/second.
The smaller 2.5" HDD drives (5 TB max) tend to use slower SMR technology and often also have slower rotational speeds (5400 rpm).
The larger 3.5" HDD drives with higher capacities (>5 TB) tend to use faster CMR technology and might have higher rotational speeds (7200 rpm). Check for advertised read/write speeds when selecting which drive to buy.
For HDD, USB 3.0 is fine. You won't get an extra speed boost with USB 3.1 or 3.2 because the underlying technology is the limiting factor, not the USB speed.
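To get a feel for what these speeds mean in practice, here is a back-of-the-envelope estimate of copy time. It is a sketch that assumes a constant sustained speed and decimal units (1 GB = 1000 MB); real transfers fluctuate, and the function name is invented for illustration.

```python
def transfer_hours(size_gb, speed_mb_per_s):
    """Estimate hours needed to copy size_gb gigabytes at a sustained speed in MB/second."""
    seconds = (size_gb * 1000) / speed_mb_per_s
    return seconds / 3600
```

Using the speeds quoted above, copying 500 GB takes roughly 9.3 hours at 15 MB/second (slow SMR) but only about 1.2 hours at 120 MB/second (fast CMR).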
Cold storage (long-term in a drawer somewhere)
HDD are ideal for long-term data storage, especially if you need "cold storage", meaning leaving HDD drives in an air conditioned office drawer somewhere, unpowered ("cold"). They can last for years unpowered and are less likely to fail unpowered than SSD drives. However, even with HDD, it is good practice to check your data drives by connecting them to a computer (and power) once a year.
SSD drives are reliable for "hot" (powered) storage, meaning connected to a computer regularly. Because SSD drives are so much faster than any HDD drive, I recommend using SSD for the actual data analysis and other day-to-day work, then HDD for cold storage.
Additional reading
Genome Spot: 10 quick tips for genomics data management (2021)
r/Bioinformatics (Reddit community)
r/DataHoarder (Reddit community)
Just tell me what to buy for long-term storage
For single cell or long read data (which has many large individual files):
- 3.5" HDD with 6 TB or more.
- 7200 rpm preferred: faster read/write speeds especially important for transferring large individual files.
- USB 3.0 is sufficient. The limiting factor for speed is the HDD drive itself, so USB 3.1 and 3.2 won't actually give you higher speeds. Get the better USB if available but don't pay more for it.
- CMR drive (faster), not SMR drive (sluggish) for the underlying HDD technology -- look up the model number online and ask "CMR or SMR?"
- WD_Black and other "gaming" drives tend to be fast and have great read/write speeds. The faster read/write speed is especially important if you have many large individual files (e.g. single cell data or long read sequencing).
- For example:
- WD Black 8TB D10 Game Drive Desktop External Hard Drive for PS4/Xbox One/PC/Mac USB 3.2 (WDBA3P0080HBK-NESN, 3.5" external drive)
- WD Black 12TB D10 Game Drive Desktop External Hard Drive for Xbox USB 3.2 (WDBA5E0120HBK-NESN, 3.5" external drive)
- WD Black 8TB Gaming Internal Hard Drive (WD8002FZBX, 3.5" internal drive)
For bulk sequencing data (fewer large individual files), you can save money by loosening the requirements:
- 2.5" or 3.5" HDD with 4 TB or more, maybe SSD if you can power it on once a year.
- 5400 rpm (cheaper, slower) is sufficient for bulk RNA-seq if you don't plan to access the data much. If the rpm isn't advertised, assume it's 5400 rpm.
- CMR vs SMR technology is less important for bulk sequencing because the raw data requires less space, so slower speeds are less of a headache. CMR is still better.
- For example:
---------------------------------------
Updated:
4/29/2025 to add Excel formulas and more screenshots.
6/4/2025 to add CMR vs SMR notes.
8/28/2025 to add cold storage section.
1/12/2026 to add suggested products





