June 23, 2021

Tools for non-coders using large spreadsheets with gene data

Excel

Comma separated value (csv) spreadsheets vs Excel files 
  • "Save As" new Excel file (.xls, .xlsx) whenever you receive data in a .csv format. 
  • The .csv format is a simple text file. You can open it with Notepad or Excel. It's great for programming, but not for working with Excel's many tools.
  • The .csv format does not support color coding, bold/italics/font formatting, multiple sheets, plots, data filtering, merged cells, or many other Excel features. You can lose data if you filter your csv file and save it (only the visible rows will remain) or if you make new sheets and save it (only the sheet you're using will save). 
  • Avoid merging cells. It makes it hard for your bioinformatics collaborators (like me) to use your spreadsheets.
  • Avoid adding information only as color-coding or formatting. That should only be used to highlight things visually, but the information should be available as text in a column for long-term storage.
  • Make column header names unique and short whenever possible. It makes it easier to load spreadsheets into R.
  • Beware when opening .csv spreadsheets in Excel if they have gene name columns. Excel interprets some genes as dates instead of text. Sort your gene column by A-to-Z and then see my tutorial to fix this issue on your spreadsheets.

Sorting and filtering a spreadsheet

  • Highlight the table's entire header row
  • Select Data: Sort & Filter : Sort A to Z
  • Sort the table by column by using the little triangles that appear on each column in the sort area.
  • Be very careful when adding or moving columns to the front or the end of the table! The new columns may not be in the "block" of sorted data, so you can scramble the data when you sort it and the new columns don't sort with it. Highlight the whole table again including the new/moved columns and re-select the sort area.
  • Always save a copy of data before sorting in case data scrambling happens. If using other people's data, save your own copy to sort. Don't edit other people's original data.
  • Beware, don't paste values onto a filtered spreadsheet (they will overwrite values between the visible rows also)

Creating a Pivot table

  • Great for making summary tables from spreadsheets with individual sample rows
    • "How many IVF samples are on plates 1, 2, and 3?"
    • "How many female NIFT samples are on each plate? How many males? What's the breakdown for IVF?"
    • You can answer these questions by repeated filtering, but it's easier with a Pivot table.
  • Study IDs should be set to "summarize value field by" count, not sum.


Creating a histogram


Adding columns to data subsets from larger spreadsheets

  • Scenario: you have a list of 50 gene names and want to add data from a larger spreadsheet of 68,000 genes. You CAN copy/paste all the data, but it's a lot of work.
  • Try using INDEX and MATCH functions from Excel
  • If you expect mismatches, try adding this as well
  • =IFERROR(yourindexmatchformulashere, "message if no match found")


Find values from a small list in a larger list

Example: "Are the genes in this list of 40 genes found in this spreadsheet of 68,000 genes?"
  • Use the COUNTIF function to find matches
  • It might look something like this:
    =COUNTIF('[full_RNAseq_spreadsheet.xlsx]Sheet1'!$B$2:$B$68001, "*" & E2 & "*")>0
    • Where E2 contains the gene you're looking for, but it might be in a list like this:
    • gene1;gene2;gene3;gene4 
    • Beware this will provide false positives if you are looking for "mygene" and the list contains "mygene-like4" or "mygenereceptor2" or something like that.

Beware the Excel calendar gene issue

Be aware whenever you open a csv spreadsheet file: Excel changes some gene names to calendar dates, e.g. MARCH1, SEPT9.

This is why you should never delete the Ensembl gene ID column (or other gene IDs) from your spreadsheets. Don't rely on only the gene names. If you notice the calendar gene problem in your data, you can use the Ensembl ID and bioDBnet to fix the problem manually. If you set the cell format as "Text" and then fix the affected gene symbols, they stay fixed in .xlsx or .xls spreadsheets.

This is a known problem that plagues genetics and molecular biology, see:

Data annotations with Ensembl Gene IDs

BioMart.org web interface - allows you to add annotations for dozens to hundreds of genes. For thousands, it's better to use the R package biomaRt (requires programming knowledge).

bioDBnet: biological DataBase network - allows you to input gene symbols or Ensembl IDs and output various additional information, e.g. Ensembl biotype, gene synonyms, chromosome, protein information from UniProt, etc. Human taxon ID = 9606. Not all databases are for human data. 

No comments:

Post a Comment

Bookmarks: single cell RNA-seq tutorials and tools

These are my bookmarks for single cell transcriptomics resources and tutorials. scRNA-seq introductions How to make R obj...