November 22, 2022

Advice: Raw data storage for bioinformatics (terabytes of data)

Good habits and advice

  • Download sequencing data from the sequencing core or company within 30 days of receiving it. They won't store it forever! 
  • Keep multiple copies of your important data.
  • Don't open your original results spreadsheet files with Excel - make copies specifically to open with Excel, so an autosave can't corrupt the originals. Beware Excel's automatic date conversion of gene names (for example, SEPT1 or MARCH1 can be silently turned into dates).
  • Always eject or unmount data drives to prevent data corruption. Do this before you unplug them physically.
  • Check MD5 checksums. This helps you detect data corruption during downloads, storage, uploads, etc. If you find a mismatch, you can re-download or re-upload to correct it.
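
As a rough illustration of the MD5 check above, here is a minimal Python sketch using only the standard library. The function names (md5sum, verify) and the example file name are my own placeholders, not part of any sequencing core's tooling. It reads the file in chunks so a large raw data file doesn't have to fit in memory.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 matches the checksum you were given."""
    # Normalize whitespace and case, since checksum files vary in formatting
    return md5sum(path) == expected_md5.strip().lower()


# Example call (hypothetical file name and checksum):
# verify("sample_R1.fastq.gz", "5d41402abc4b2a76b9719d911017c592")
```

If verify returns False, that is the cue to re-download or re-upload the file and check again.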

Drive formatting


  • FAT32
    • Oldest of the three file systems listed here
    • Common file system for USB flash drives
    • Individual files cannot exceed 4 GB, which is a problem for some raw data files from sequencing
    • Partitions must be less than 8 TB, so don't use this format for larger capacity drives
    • Don't use for bioinformatics raw data backups
  • NTFS
    • Modern file system that Windows prefers
    • Read/write on Windows XP to present version
    • Read-only on Mac OS X and some Linux distributions (limitation!)
      • Works fine on Ubuntu 20.04 LTS (per my own experience)
    • Large file size and partition size limits, so good for bioinformatics data storage
    • Supports change journals (good if computer crashes)
    • Does not unmount when the computer hibernates, which can cause data loss
    • Ubuntu isn't good at repairing damaged NTFS partitions
  • exFAT
    • Introduced in 2006 and still under patent
    • Optimized for flash drives, similar to FAT32 but with fewer limitations
    • Read/write capabilities on Windows XP to present version, Mac OS X
    • Older Linux installs may require additional software to read/write exFAT drives (due to patents, support wasn't included with the default install files); Linux kernel 5.7 and later include a native exFAT driver
    • Large file size and partition size limits, so good for bioinformatics data storage
    • No journaling so more prone to data corruption than NTFS

Generally: use NTFS or exFAT. See if your collaborators have a preference, otherwise either one is ok.
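
To see whether the FAT32 4 GB file limit would actually bite for a given data folder, something like the following Python sketch can list the files that are too big. The function name files_over_limit is just a placeholder I made up, and the limit constant assumes the usual FAT32 maximum file size of 2^32 - 1 bytes.

```python
import os

# Assumption: FAT32 stores file sizes in a 32-bit field, so the largest file is 2**32 - 1 bytes
FAT32_MAX_FILE_BYTES = 2**32 - 1


def files_over_limit(root, limit=FAT32_MAX_FILE_BYTES):
    """Yield (path, size_in_bytes) for each file under root that is larger than limit."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(full_path)
            except OSError:
                # Skip files that vanish or can't be read (e.g. broken links)
                continue
            if size > limit:
                yield full_path, size


# Example call (hypothetical folder name):
# for path, size in files_over_limit("raw_data"):
#     print(path, size)
```

If this prints anything for your raw data folder, FAT32 is ruled out for that drive, and NTFS or exFAT is the way to go.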

Initializing hard drives or SSD drives

If you buy a new HDD or SSD and your computer does not recognize it when plugged in, you may need to "initialize" it. See: Instructions for Windows 10 | Microsoft official instructions

***Note: these instructions are for new and empty drives. Initializing and formatting deletes data.***

Search for and open "Disk Management" ("Create and format hard disk partitions")
If your disk is plugged into USB, you'll get a window saying "You must initialize a disk before Logical Disk Manager can access it."

Choose a partition style: pick GPT (GUID Partition Table). It supports volumes larger than 2 TB, so it's likely what you want for bioinformatics raw data storage.
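
For the curious, the 2 TB figure comes from the other partition style Windows offers, MBR, which stores sector counts in 32 bits. Here is a small Python sketch of that arithmetic. It assumes 512-byte logical sectors (some drives use 4096), and the function names are placeholders of mine.

```python
def mbr_limit_bytes(sector_bytes=512):
    """Largest size addressable by MBR: 2**32 sectors of the given size."""
    return 2**32 * sector_bytes


def needs_gpt(drive_bytes, sector_bytes=512):
    """Return True if a drive is too large for MBR to address fully."""
    return drive_bytes > mbr_limit_bytes(sector_bytes)


# With 512-byte sectors the MBR limit is 2**32 * 512 = 2,199,023,255,552 bytes,
# i.e. exactly 2 TiB (about 2.2 TB as drive makers count it).
# Example: a drive sold as 4 TB has about 4 * 10**12 bytes:
# needs_gpt(4 * 10**12)
```

Any drive bigger than that limit needs GPT to use its full capacity, and picking GPT for smaller drives does no harm on current versions of Windows.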



Disk 1 will change from "Not Initialized" to "Initializing" to "Online".

Next, right-click the newly initialized disk and select "New Simple Volume"



Select the simple volume size: choose the maximum in the next window (the default is usually the maximum).

Assign a drive letter: avoid C and D and anything else that may already be in use.

Format partition: pick exFAT or NTFS. I prefer NTFS.

Format... and done! Your drive should be recognized by Windows now.


---
Last updated Sept 12, 2024
