Good habits and advice
- Download sequencing data from the sequencing core or company within 30 days of receiving it. They won't store it forever!
- Keep multiple copies of your important data.
- Don't open your original results spreadsheet files with Excel. Make copies specifically to open with Excel so that autosave can't write corrupted data back to the originals. Beware of Excel's automatic date conversion of gene names (e.g. SEPT2 or MARCH1 can be turned into calendar dates).
- Always eject or unmount data drives to prevent data corruption. Do this before you unplug them physically.
- Check MD5 checksums. This helps you detect data corruption during downloads, storage, uploads, etc. If you find a mismatch, you can re-download or re-upload to correct it.
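The advice above about keeping Excel away from your original files can be scripted. This is only a minimal sketch in Python: the function name `make_excel_copy` and the `_excel_copy` suffix are my own placeholders, not part of any tool. It copies the file and then marks the original read-only so it is harder to overwrite by accident.

```python
import shutil
import stat
from pathlib import Path


def make_excel_copy(original, suffix="_excel_copy"):
    """Copy `original` next to itself and mark the original read-only.

    The function name and suffix are hypothetical, chosen for illustration.
    """
    original = Path(original)
    # e.g. results.csv -> results_excel_copy.csv
    copy_path = original.with_name(original.stem + suffix + original.suffix)
    shutil.copy2(original, copy_path)  # copy2 also preserves timestamps
    # Remove write permission from the original
    original.chmod(stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)
    # Make sure the working copy stays writable
    copy_path.chmod(copy_path.stat().st_mode | stat.S_IWRITE)
    return copy_path
```

Open the returned copy in Excel and leave the original alone. On Windows, `chmod` only toggles the read-only flag, which should be enough for this purpose.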
Drive formatting
- FAT32
  - Oldest of these three file systems
  - Common file system for USB flash drives
  - Individual files cannot be over 4 GB, which is a problem for some raw data files from sequencing
  - Partitions must be less than 8 TB, so don't use this format for larger capacity drives
  - Don't use for bioinformatics raw data backups
- NTFS
  - Modern file system that Windows prefers
  - Read/write on Windows XP to present version
  - Read-only on Mac OS X and some Linux distributions (limitation!)
  - Works fine on Ubuntu 20.04 LTS (per my own experience)
  - Large file size and partition size limits, so good for bioinformatics data storage
  - Supports change journals (good if computer crashes)
  - Does not unmount with computer hibernation, which can be a data loss problem
  - Ubuntu isn't good at repairing damaged partitions
- exFAT
  - Introduced in 2006 and still under patent
  - Optimized for flash drives, similar to FAT32 but with fewer limitations
  - Read/write capabilities on Windows XP to present version, Mac OS X
  - Older Linux distributions may require additional software to read/write exFAT drives (due to patents, support wasn't included with the default install files); recent Linux kernels include a native exFAT driver
  - Large file size and partition size limits, so good for bioinformatics data storage
  - No journaling, so more prone to data corruption than NTFS
Generally: use NTFS or exFAT. See if your collaborators have a preference, otherwise either one is ok.
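If you are stuck with a FAT32 drive, it helps to know in advance which files will not fit under the 4 GB per-file limit mentioned above. Here is a rough Python sketch; the function name `files_too_big_for_fat32` is a placeholder of mine, and the limit is written as 2**32 - 1 bytes on the assumption that FAT32 stores file sizes in a 32-bit field.

```python
from pathlib import Path

# Assumed limit: FAT32 stores the file size in 32 bits (4 GiB minus 1 byte)
FAT32_MAX_FILE_BYTES = 2**32 - 1


def files_too_big_for_fat32(folder, limit=FAT32_MAX_FILE_BYTES):
    """Return (path, size_in_bytes) for every file under `folder` larger than `limit`."""
    too_big = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                too_big.append((path, size))
    return too_big
```

Calling `files_too_big_for_fat32("path/to/raw_data")` (a made-up path) returns an empty list if everything would fit. Any file it reports needs an NTFS or exFAT drive instead.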
Checksums
Whenever you download or upload large data between servers and drives, check data integrity using checksums. Read: How to get MD5 checksums to detect data corruption.
If downloading, the sequencing facility or data repository should provide their own checksums. Generate your own checksum file from the files you received and compare it against theirs.
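One way to generate your own checksums is with Python's standard `hashlib` module. This is a sketch, not the only way to do it: the function names, the `*.fastq.gz` pattern, and the output file name `my_checksums.md5` are placeholders I picked for illustration. The file is read in chunks so that large sequencing files do not have to fit in memory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_file(folder, pattern="*.fastq.gz", output_name="my_checksums.md5"):
    """Write one 'checksum  filename' line per matching file in `folder`.

    The pattern and output name are hypothetical defaults; adjust to your data.
    """
    folder = Path(folder)
    lines = [f"{md5_of_file(p)}  {p.name}\n" for p in sorted(folder.glob(pattern))]
    output_path = folder / output_name
    output_path.write_text("".join(lines))
    return output_path
```

The output uses the "checksum, two spaces, filename" layout, so it should be directly comparable with checksum files made by the `md5sum` command.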
Split the MD5 checksum text file into two columns, checksum and filename, using Excel's "Convert Text to Columns" feature. Assuming your checksums and the originating server's checksums are in the same order, you can quickly check equality with Excel, for example if comparing checksums in cell A2 and cell D2:
=A2=D2
Excel will evaluate this as TRUE or FALSE. You want everything to be TRUE.
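The Excel comparison only works if both lists are in the same order. If they are not, a small Python sketch like the one below can match checksums by file name instead. The function names are mine, and I'm assuming both files use the "checksum filename" layout and that file names are unique once any folder prefix is dropped.

```python
def parse_checksum_text(text):
    """Map filename -> checksum from lines shaped like 'checksum  filename'."""
    checksums = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        checksum, filename = parts
        # Drop md5sum's binary-mode '*' marker and any folder prefix
        filename = filename.strip().lstrip("*").replace("\\", "/").split("/")[-1]
        checksums[filename] = checksum.lower()
    return checksums


def compare_checksums(mine_text, theirs_text):
    """Return (mismatched filenames, filenames present in only one list)."""
    mine = parse_checksum_text(mine_text)
    theirs = parse_checksum_text(theirs_text)
    mismatched = sorted(f for f in mine.keys() & theirs.keys() if mine[f] != theirs[f])
    missing = sorted(mine.keys() ^ theirs.keys())
    return mismatched, missing
```

Both returned lists should be empty. Anything in the first list is a file to re-download or re-upload; anything in the second is a file that one side never listed.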
Initializing hard drives or SSD drives
If you buy a new HDD or SSD and your computer does not recognize it when plugged in, you may need to "initialize" it. Instructions for Windows 10 | Microsoft official instructions
***Note these instructions are for new and empty drives. Initializing and formatting deletes data.***
Search for and open "Disk Management" ("Create and format hard disk partitions")
If your disk is plugged into USB, you'll get a window saying "You must initialize a disk before Logical Disk Manager can access it."
Choose a partition style: pick GPT (GUID Partition Table) since it supports volumes larger than 2TB, so it's likely what you want for bioinformatics raw data storage.
Disk 1 will change from "Not Initialized" to "Initializing" to "Online".
Next, right-click the newly initialized disk and select "New Simple Volume"
Select the maximum simple volume size in the next window (default is usually the maximum). Select Next...
Assign a drive letter: avoid C and D and anything else that may already be in use.
Format partition: pick exFAT or NTFS. I prefer NTFS.
Format... and done! Your drive should be recognized by Windows now.
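The 2TB figure in the partition style step comes from simple arithmetic. As I understand it, the older MBR style stores sector counts in 32-bit fields, so with the usual 512-byte sectors it tops out at 2^32 x 512 bytes. A tiny Python sketch of that calculation follows; the function names are placeholders, and the 512-byte default is an assumption that may not hold for every drive.

```python
# Assumption: MBR stores sector counts in 32-bit fields
MBR_MAX_SECTORS = 2**32


def mbr_max_bytes(sector_size=512):
    """Largest capacity MBR can address for a given sector size."""
    return MBR_MAX_SECTORS * sector_size


def needs_gpt(capacity_bytes, sector_size=512):
    """True if a drive of this capacity is too large for MBR."""
    return capacity_bytes > mbr_max_bytes(sector_size)
```

For example, `needs_gpt(5 * 10**12)` is True for a 5 TB drive, while a 1 TB drive would fit under either partition style.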
CMR vs SMR - two types of hard drive recording technologies
"WD 5TB Passport - slow writes? (5-10 MB/s)" (Reddit r/DataHoarder post)
Best Hard Drives 2025: Our top HDD picks for desktop PCs, NAS, and more (tomshardware.com)
Find if your drive is SMR by HDD model number (NAScompares.com)
tl;dr: For drives that you're going to use for regular read/write work (e.g. if you're doing analysis off these drives and regularly accessing files), get conventional magnetic recording (CMR) drives for better performance. Shingled magnetic recording (SMR) drives are slower, especially for sustained or random writes, and can get slower over time. If you're only using the drives for cold storage (backup and forget) and you are more interested in saving money, then either is ok long-term.
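If you suspect a drive is writing as slowly as in the Reddit post above, you can get a rough number with a short Python test. This is only a sketch: the function name `measure_write_speed`, the temporary file name, and the default sizes are my own choices, and operating system caching can still skew the result.

```python
import os
import time
from pathlib import Path


def measure_write_speed(folder, total_mb=256, chunk_mb=8):
    """Write about `total_mb` of random data into `folder` and return MB/s."""
    test_file = Path(folder) / "write_speed_test.tmp"  # hypothetical file name
    chunk = os.urandom(chunk_mb * 1024 * 1024)  # random bytes, so compression cannot help
    written_mb = 0
    start = time.perf_counter()
    try:
        with open(test_file, "wb") as handle:
            while written_mb < total_mb:
                handle.write(chunk)
                written_mb += chunk_mb
            handle.flush()
            os.fsync(handle.fileno())  # ask the OS to push the data to the drive
        elapsed = time.perf_counter() - start
    finally:
        test_file.unlink(missing_ok=True)  # clean up the temporary file
    return written_mb / elapsed
```

Point `folder` at the external drive and compare the result with a drive you trust. A small test may finish before any slowdown shows up, so larger `total_mb` values are probably more informative.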
Additional reading
Genome Spot: 10 quick tips for genomics data management (2021)
Reddit thread: "How/where do you store your data?" (2020)
---------------------------------------
Updated 4/29/2025 to add Excel formulas and more screenshots.
Updated 6/4/2025 to add CMR vs SMR notes.