April 21, 2022

R code: Reading single spreadsheets and merging them into a multi-sheet Excel file

This R code example shows how to automatically compile multiple csv spreadsheets (or csv-formatted simple Excel spreadsheets) into one multi-sheet Excel spreadsheet. Personally, I use this to combine datasets from Ingenuity Pathways Analysis for supplemental files. I can do it manually but I always worry that I'll accidentally open the wrong analysis file or mis-name the tab. 

This way, I have a record of what files became what named sheets. I can also quickly re-compile a new Excel results file if I rerun the analysis by just changing the filenames in this code.

Packages

I prefer gdata for reading Excel sheets (better support for special characters) and openxlsx for writing the multi-sheet Excel file. 

Although gdata requires Perl installed on your computer, it does not require any knowledge of the Perl programming language. Use Strawberry Perl for Windows. Linux already has Perl.

The openxlsx package does not require any additional installations.
#install.packages("gdata")
#install.packages("openxlsx")

library(gdata)  ## for read.xls() function; requires perl
library(openxlsx)

Data input with R package gdata

Ignore “wide character in print at…” warnings. They are from the Perl code that read.xls() uses and I can’t turn them off. They didn't cause a problem.
## function to load data and edit columns -----------------
get_data = function(myfile, mysheet=1) {
    ipa = gdata::read.xls(
            xls = myfile, 
            header = TRUE,
            stringsAsFactors = FALSE, 
            fileEncoding = "UTF-8",
            skip = 0,
            method = "csv",
            na.strings = c("NA", "", "-", " "),
            sheet = mysheet)
  
  print(dim(ipa))
## [1] 612   5
  ## edit column names
  colnames(ipa) = c("Pathway", "-Log10P", "Ratio", "Zscore", "Molecules")

  
  ## make Pathway an ordered factor so ggplot doesn't reorder it 
  ## don't really need it here but it's helpful if you are making plots too
  ipa$Pathway = factor(ipa$Pathway, levels=ipa$Pathway)
  
  ## data preview - make sure you're using the right data!
  cat("\n\nPreview:\n\n")
  print(ipa[1:9,1:4])
  
  return(ipa)
}

## load data -----------------
top15cvs = get_data(
  myfile = "2021-11-18run/protein_top15-expr_First/canonical-FirstTop15p.xls")
## Preview:
## 
##                                   Pathway -Log10P Ratio Zscore
## 1                          EIF2 Signaling    73.1 0.558     NA
## 2               Mitochondrial Dysfunction    67.3 0.614     NA
## 3               Oxidative Phosphorylation    62.6 0.739     NA
## 4               Sirtuin Signaling Pathway    38.5 0.360     NA
## 5 Regulation of eIF4 and p70S6K Signaling    33.7 0.425     NA
## 6          Huntington's Disease Signaling    31.6 0.335     NA
## 7                          mTOR Signaling    29.5 0.368     NA
## 8          Protein Ubiquitination Pathway    28.0 0.320     NA
## 9             Estrogen Receptor Signaling    26.0 0.264     NA
top15pl = get_data(
  myfile = "2021-11-18run/protein_top15-expr_Third/canonical-ThirdTop15.xls")
## [1] 579   5
## 
## 
## Preview:
## 
##                                   Pathway -Log10P Ratio Zscore
## 1                          EIF2 Signaling    72.6 0.554     NA
## 2               Mitochondrial Dysfunction    46.9 0.509     NA
## 3               Oxidative Phosphorylation    41.5 0.595     NA
## 4                          mTOR Signaling    32.4 0.382     NA
## 5               Sirtuin Signaling Pathway    29.8 0.318     NA
## 6 Regulation of eIF4 and p70S6K Signaling    28.8 0.391     NA
## 7          Huntington's Disease Signaling    20.9 0.278     NA
## 8                    Phagosome Maturation    19.8 0.346     NA
## 9             Estrogen Receptor Signaling    19.5 0.234     NA
Et cetera. I loaded several files this way.


Alternative: Data input with R package openxlsx

Here is my get_data function re-written to use openxlsx::read.xlsxl

get_data = function(myfile, mysheet=1) {
    mytable = openxlsx::read.xlsx(
            xlsxFile = myfile,
            colNames = TRUE,
            rowNames = FALSE,
            startRow = 1,
            na.strings = c("NA", "", "-", " "),
            sheet = mysheet)
  return(mytable)
}

Create multi-sheet Excel output with R package openxlsx


## sheet names and dataframes to use for the final file -------------
dataset_names = list(
  
  'A-top15-First' = top15cvs,
  'B-top15-Third' = top15pl,
  'C-DEGs-First-Upreg' = D1up,
  'D-DEGs-Third-Upreg' = D3up, 
  'E-SEGs' = SEGs
)


## make Excel filename with date stamp --------------
fn = paste0("SI_File_IPA_Enriched_Canonical_Pathways_", Sys.Date(), ".xlsx"); fn
## [1] "SI_File_IPA_Enriched_Canonical_Pathways_2022-01-21.xlsx"

## make Excel file -------------------------------
  openxlsx::write.xlsx(
  dataset_names, file = fn
)
This step produces the actual .xlsx multi-sheet file. Open it to check that the sheet orders are correct. Spot check some of the data (doublecheck some columns and rows manually to make sure it's the data you expect) and then you're done!

Session

sessionInfo()
This final step outputs your operating system, R version, loaded packages and the package versions. It is a good record keeping step to add to your RMarkdown files.

No comments:

Post a Comment

Bookmarks: single cell RNA-seq tutorials and tools

These are my bookmarks for single cell transcriptomics resources and tutorials. scRNA-seq introductions How to make R obj...