Good habits and advice
- Download sequencing data from the sequencing core or company within 30 days of receiving it. They won't store it forever!
- Keep multiple copies of your important data.
- Don't open your original results spreadsheet files in Excel. Make copies specifically for Excel so that autosave can't corrupt the originals. Beware Excel's date autoconversion of gene names: symbols such as SEPT1 or MARCH1 can be silently turned into calendar dates.
- Always eject or unmount data drives to prevent data corruption. Do this before you unplug them physically.
- Check MD5 checksums. This helps you detect data corruption during downloads, storage, uploads, etc. If you find a mismatch, you can re-download or re-upload to correct it.
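As a rough illustration of the checksum habit, here is a minimal Python sketch that computes a file's MD5 and compares it to the value supplied with the data. The function names `md5_of_file` and `matches_expected` are my own, not from any particular tool, and this assumes your provider reports plain hex MD5 strings.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so multi-gigabyte files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare a file's MD5 with the checksum reported by the data provider."""
    # Normalize whitespace and letter case before comparing.
    return md5_of_file(path) == expected_md5.strip().lower()
```

Run `matches_expected` once after downloading and again after every copy to a backup drive. A `False` result means that copy should be re-downloaded or re-copied.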
Drive formatting
- FAT32
  - Oldest of the three file systems listed here
  - Common file system for USB flash drives
  - Individual files cannot be over 4 GB, which is a problem for some raw data files from sequencing
  - Partitions must be less than 8 TB, so don't use this format for larger-capacity drives
  - Don't use for bioinformatics raw data backups
- NTFS
  - Modern file system that Windows prefers
  - Read/write on Windows XP through current versions
  - Read-only on Mac OS X and some Linux distributions (limitation!)
  - Works fine on Ubuntu 20.04 LTS (per my own experience)
  - Large file size and partition size limits, so good for bioinformatics data storage
  - Supports change journals (good if the computer crashes)
  - Does not unmount when the computer hibernates, which can cause data loss
  - Ubuntu isn't good at repairing damaged NTFS partitions
- exFAT
  - Introduced in 2006 and patented by Microsoft, which published the specification in 2019
  - Optimized for flash drives, similar to FAT32 but with fewer limitations
  - Read/write on Windows XP through current versions, and on Mac OS X
  - Older Linux installs may require additional software to read/write exFAT drives (because of the patents, support wasn't included in the default install); Linux kernel 5.4 and later include an exFAT driver
  - Large file size and partition size limits, so good for bioinformatics data storage
  - No journaling, so more prone to data corruption than NTFS
Generally: use NTFS or exFAT. Check whether your collaborators have a preference; otherwise either one is fine.
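If you are stuck with a FAT32 drive, it can help to check in advance which files won't fit. The sketch below walks a folder and lists files above the FAT32 per-file limit. The function name `files_larger_than` is hypothetical, and the limit assumes the standard FAT32 maximum of 4 GiB minus one byte.

```python
import os

# FAT32 stores each file's size in a 32-bit field, so the largest
# possible file is 4 GiB minus 1 byte.
FAT32_MAX_FILE_BYTES = 2**32 - 1


def files_larger_than(root, limit=FAT32_MAX_FILE_BYTES):
    """Return (path, size_in_bytes) for every file under root bigger than limit."""
    too_large = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Skip broken symlinks and files that cannot be read.
                continue
            if size > limit:
                too_large.append((path, size))
    return too_large
```

Calling `files_larger_than` on your raw data folder before copying returns a list of the files that need an NTFS or exFAT drive instead. An empty list means everything fits.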
Initializing hard drives or SSDs
If you buy a new HDD or SSD and your computer does not recognize it when plugged in, you may need to "initialize" it. Instructions for Windows 10 | Microsoft official instructions
***Note these instructions are for new and empty drives. Initializing and formatting deletes data.***
Search for and open "Disk Management" ("Create and format hard disk partitions")
If your disk is plugged in via USB, a window will appear saying "You must initialize a disk before Logical Disk Manager can access it."
Choose a partition style: pick GPT (GUID Partition Table) since it supports volumes larger than 2 TB, so it's likely what you want for bioinformatics raw data storage.
The disk (for example, "Disk 1") will change from "Not Initialized" to "Initializing" to "Online".
Next, right-click the newly initialized disk and select "New Simple Volume"
Select the simple volume size: select the maximum simple volume size in the next window (default is usually the maximum).
Assign a drive letter: avoid C and D and anything else that may already be in use.
Format partition: pick exFAT or NTFS. I prefer NTFS.
Format... and done! Your drive should be recognized by Windows now.
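For the curious, the 2 TB threshold in the partition-style step comes from MBR, the older alternative to GPT offered in the same dialog, which records partition sizes as 32-bit sector counts. Here is a quick arithmetic sketch, assuming the common 512-byte logical sector size (drives that report 4096-byte logical sectors have a higher ceiling). `needs_gpt` is a hypothetical helper name.

```python
SECTOR_BYTES = 512      # assumed logical sector size
MBR_ADDRESS_BITS = 32   # MBR partition entries hold 32-bit sector counts

# Largest size MBR can describe: 2^32 sectors of 512 bytes each.
mbr_limit_bytes = (2**MBR_ADDRESS_BITS) * SECTOR_BYTES
mbr_limit_tib = mbr_limit_bytes / 2**40


def needs_gpt(drive_bytes, sector_bytes=SECTOR_BYTES):
    """Return True if a drive of this size exceeds what MBR can address."""
    return drive_bytes > (2**MBR_ADDRESS_BITS) * sector_bytes


print(f"MBR limit: {mbr_limit_bytes} bytes ({mbr_limit_tib} TiB)")
print("8 TB drive needs GPT:", needs_gpt(8 * 10**12))
```

An 8 TB backup drive is well past the MBR ceiling, which is why GPT is the safer default for raw data storage.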
Additional reading
Genome Spot: 10 quick tips for genomics data management (2021)
Reddit thread: "How/where do you store your data?" (2020)
---
Last updated Sept 12, 2024