January 31, 2019

R programming: How to get started

R is a programming language popular for statistics and bioinformatics.

Installation

  • Install R, the free behind-the-scenes code that allows your computer to run R scripts
  • Install RStudio, the user interface software
    • Get free RStudio Desktop. Install this after you install R (not before).
    • Ignore the paid versions, RStudio Desktop Pro and RStudio Server. You don't need those.
    • Default settings are ok.
  • Once you have both installed, you can open RStudio Desktop and it will automatically detect your installed version of R.

Simple practice

  • Open RStudio and try replicating the console commands below. Note that integer sequences are written as beginning:ending, with a colon between the two numbers (e.g. 1:10 means 1,2,3,4,5,6,7,8,9,10).
  • To assign values to variables, R allows either an equal sign or an arrow:
    • x <- 1:10
    • x = 1:10
    • Both lines assign the integers 1 to 10 to the variable x.
  • Once you have assigned the values to x, you can now create a new variable y which uses the values of x.
    • y <- x^2
    • This creates variable y with the values 1,4,9,16,25,36,49,64,81,100.
  • With two numeric variables of equal length like our example x and y above, you can create a scatter plot.
    • plot(x,y)
  • The plot function also allows some customization such as selecting the point colors.
    • plot(x,y,col="red")
    • Note that "red" must be in quotes because it is a value, not a variable. You are passing the value "red" to the function's argument col. This would also work:
    • mycolor = "blue"
    • plot(x,y,col=mycolor)
    • In this case you don't need the quotes since you are passing the value of variable mycolor to the argument col. That value is "blue", which R understands as one of its named colors.





  • You can learn more about the plot function (or any function) by looking at the R help files. 
    • Type "?" and the function name into the console window and hit Enter. For example: 
    • ?plot
    • This opens a Help tab with information about the function.


  • Read the ?plot help window and try formatting the plot in different ways.
    • col = color for points (can be named values like "red" or hex codes)
    • pch = point type (asterisks are pch=8)
  • Try different math functions.
  • Comments in R begin with a hashtag (#). Anything after the "#" on that line is not run.
  • Variable values can be set with an equal sign (=) or an arrow (<-) in R.
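
The practice steps above can be collected into one short script. This is only a minimal sketch: the point shape and the plot title are arbitrary choices for illustration.

```r
# Lines starting with # are comments and are not run
x <- 1:10          # the integers 1 to 10
y <- x^2           # the squares of x
print(y)

# Scatter plot with red asterisks (pch = 8); main sets the plot title
plot(x, y, col = "red", pch = 8, main = "y = x^2")
```

Paste the lines into the RStudio console one at a time, or save them as a script and run the whole file.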


Lessons from Tania


R packages

R comes with a pre-loaded set of functions, practice datasets, and variables (e.g. pi = 3.141593). This set of tools is called base R.

You can also download additional tools by installing packages. Using packages requires two steps:
  1. Install the package. This is like installing software on your computer, but you do it through RStudio. You only need to do it once, unless there's an update and you want to install the new version.
  2. Load the package. Do this every time you open a new R session and need the package. This is like opening already-installed software. If you close RStudio, go to lunch, and come back, you will need to load the package again. If you try to run a package-specific function, you will get an error message if you forgot to load the package.
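
The two steps might look like this in code. This is a sketch that uses ggplot2 as the example package; the variable names pkg and installed are only for illustration. By default, install.packages() downloads from CRAN.

```r
pkg <- "ggplot2"   # example package name

# Step 1 (once): install. Commented out so it is not re-run every time.
# install.packages(pkg)

# Step 2 (every session): load, but only if the package is installed
installed <- requireNamespace(pkg, quietly = TRUE)
if (installed) {
  library(pkg, character.only = TRUE)
} else {
  message("Package '", pkg, "' is not installed yet; run install.packages() first.")
}
```

In everyday use you would normally just write library(ggplot2); the check is there so the example runs without an error even when the package is missing.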
The two main sources of packages for scientists are:


Once you install and load a package, you have access to its functions. For example, once you load ggplot2, you may use its functions such as qplot.

x=1:10
y=x^2
qplot(x,y)

You may sometimes wish to clarify the source of the function, especially when two packages have different functions with the same name or when you are using multiple packages and want to remember which functions come from which packages. With R, you can do this with two colons separating the package and function names.

x=1:10
y=x^2
ggplot2::qplot(x,y)


By the way, when you use qplot, you will get a warning message that it is "deprecated". This means that it is outdated and might not be supported in future updates.
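
One way to avoid the deprecation warning is to build the same scatter plot with ggplot() and geom_point(), the general-purpose ggplot2 functions. The sketch below assumes ggplot2 is installed and skips the plot otherwise. The variable names has_ggplot2 and p are only for illustration.

```r
x <- 1:10
y <- x^2

# Only draw the plot if ggplot2 is installed
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
if (has_ggplot2) {
  # ggplot() takes a dataframe; aes() maps its columns to the x and y axes
  p <- ggplot2::ggplot(data.frame(x = x, y = y), ggplot2::aes(x = x, y = y)) +
    ggplot2::geom_point()
  print(p)
}
```

The package::function form is used throughout, so the example works even if you forgot to load ggplot2 with library().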

Load spreadsheet data

Text-based spreadsheets import (recommended)

It is easier to load your spreadsheet data as a text file (.txt), comma-delimited file (.csv), or tab-delimited (.tsv) file. You only need base R and it loads faster than Excel files.

Two methods to load text-based files (.txt, .csv, .tsv):

  1. Manually load using RStudio: File >> Import Dataset 
  2. Load through the RStudio console using one of the lines of code below:
df = read.table("myfile.txt", stringsAsFactors=FALSE) 
df = read.csv("myfile.csv", stringsAsFactors=FALSE)
df = read.csv("myfile.tsv", stringsAsFactors=FALSE, sep="\t")
df = read.csv(file="myfile.csv", header=TRUE, skip=2, stringsAsFactors=F)


Use header=TRUE if your data begins with a row of column titles. Don't put logical values (TRUE/FALSE) in quotes: header="TRUE" gives R a character string where it expects a logical value.

Use skip if the header isn't on row #1. I like adding table titles and descriptions in the first two rows of my files, so I use skip=2 to tell R that my dataframe begins with a header on row 3. Don't put numbers in quotes, or else R will interpret them as characters instead of numbers.

Use sep to specify how values are separated (sep="\t" for tab-separated values). The default for read.csv is comma-separated (sep=","), so you don't need to include it for comma-separated files. You can view other defaults by typing ?read.csv into the RStudio console.

Use stringsAsFactors=FALSE almost always, especially if some columns have all unique values (e.g. Ensembl Gene IDs). Before R 4.0.0 the default was stringsAsFactors=TRUE; since R 4.0.0 the default is FALSE, but setting it explicitly keeps your code working the same way on older versions.
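
To see how header and skip work together, here is a small self-contained sketch. It writes a made-up csv file (two title rows, then a header on row 3) to a temporary location and reads it back. The file contents and column names are invented for illustration.

```r
# Create a small example file: two title rows, then the header on row 3
f <- tempfile(fileext = ".csv")
writeLines(c(
  "Table 1: example results",
  "Made-up values for illustration only",
  "gene,pvalue",
  "geneA,0.01",
  "geneB,0.20"
), f)

# skip=2 ignores the two title rows; header=TRUE reads row 3 as column names
df <- read.csv(file = f, header = TRUE, skip = 2, stringsAsFactors = FALSE)
str(df)   # gene is a character column, pvalue is numeric
```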

R recognizes both F and FALSE, as well as both T and TRUE. However, it is good practice to spell out the whole word since R allows users to overwrite F and T and use them as variable names (which can cause problems later), but FALSE and TRUE cannot be overwritten.
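
A quick sketch of why the shortcuts are risky. The variable name result is only for illustration.

```r
T <- 0        # allowed: T is an ordinary variable name
isTRUE(T)     # FALSE, because T now holds the number 0

# Assigning to TRUE is not allowed, so R stops with an error
result <- tryCatch(TRUE <- 0, error = function(e) "cannot assign to TRUE")
print(result)

rm(T)         # delete our T so that T means TRUE again
isTRUE(T)     # TRUE
```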

Excel spreadsheets import

Beware: loading Excel spreadsheets takes much longer than simpler text-based formats. For small spreadsheets, the difference is negligible (seconds). For large datasets with thousands of columns, you may have issues (about 1 minute for a csv file vs 30+ minutes for the Excel file).

I occasionally use the read.xls function from the gdata R package to load Excel spreadsheets. It requires perl installed on your computer, but you don't need to know any perl to use it. 

#install.packages("gdata") ##commented out so it won't run
## to run the above line, remove the hashtag in front of it
library(gdata)  ## load the gdata package
df = gdata::read.xls("myspreadsheet.xlsx", header=TRUE,
                        stringsAsFactors=FALSE, skip=0,
                        sheet=1) 

The sheet part specifies which sheet (number) you want to use, since Excel spreadsheets can have different sheets. If you leave it out, the first sheet is used. Most other settings are optional.

The skip=0 is optional (skip=0 is the default). It lets the function know that your header is on row 1.

I extended the function name above by specifying gdata::read.xls() so that R knows which package to use. This is not necessary if there is only one version of read.xls() but important if you have two packages that use the same function name.

If you get a perl-related error, you may need to specify the path where perl is installed. Example:

perl = "C:/Strawberry/perl/bin/perl5.26.1.exe"

df = read.xls("myspreadsheet.xlsx", header=TRUE,
              stringsAsFactors=FALSE, skip=0, sheet=1,
              na.strings="NA",
              method="csv", perl=perl)

The na.strings part tells the function which cell values should be treated as not available (NA). Otherwise, they may be imported as the string "NA" (literally the letters "N" and "A") rather than as a missing value.

The method part specifies the method to use for data import. See alternatives by inputting ?read.xls at the console to read the function help page.

If you get an error "duplicate 'row.names' not allowed", fix it by removing columns with commas or changing your import method.

Dataframe manipulation

Quick functions to learn more about your dataframe

In R help pages, df is used as the example variable name for dataframe objects.

names(df)   #print a list of header names

head(df)    #print the first 6 rows of your dataframe

tail(df)    #print the last 6 rows of your dataframe

dim(df)     #print the number of rows and columns

View(df)    
#open a spreadsheet within RStudio (note capitalized View, not view)

View(df[1:10,1:50])
#open a spreadsheet within RStudio, but just view the first 10 rows and 50 columns

summary(df)
#provides simple summary values for columns, e.g. min, max, median, mean, quartiles

table(df$columnName)
#counts number of rows that match each value in the column, good for groups, not gene names

table(df$columnName1, df$columnName2)
#counts the overlap of categories in these two columns

table(df$columnName1, df$columnName2, useNA="always")
#counts the overlap of categories in these two columns, including any NA values

Checking for NA values is useful to make sure you aren't losing data due to errors, e.g. due to a math function applied to a string value "500" resulting in NA, when actually you meant the number value 500.

str(df)          
#shows the data type of each column, e.g. numeric, character, logical, factor

The str function is useful for troubleshooting. If you forget to add stringsAsFactors=FALSE to the spreadsheet import, factor columns may cause issues in your code. 

To convert columns to different data types, use these functions and manually check for any error warnings:
  • df$columnName = as.character(df$columnName) ##string values
  • df$columnName = as.numeric(df$columnName) ##number values
  • df$columnName = as.factor(df$columnName)  ##factor values
  • df$columnName = as.logical(df$columnName) ##TRUE/FALSE
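
One conversion deserves a manual check: calling as.numeric() directly on a factor returns the internal level codes, not the numbers you see printed. The sketch below uses made-up values, and the variable names ids and converted are only for illustration.

```r
ids <- factor(c("500", "20", "3"))

# Directly converting a factor gives the level codes
# (levels are sorted as text: "20", "3", "500")
as.numeric(ids)                  # 3 1 2

# Converting to character first gives the intended numbers
as.numeric(as.character(ids))    # 500 20 3

# Text that is not a number becomes NA (R also prints a warning,
# silenced here with suppressWarnings)
converted <- suppressWarnings(as.numeric(c("500", "n/a")))
is.na(converted)                 # FALSE TRUE
```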

Subsetting dataframes

Subset a dataframe df into a smaller dataframe dfsub like this:
  • dfsub = subset(df, pvalue<0.05)
  • dfsub = subset(df, biotype=="protein_coding")
  • dfsub = subset(df, tissue=="CVS" | tissue=="Placenta")  #example 3
  • dfsub = subset(df, tissue=="CVS" & pvalue<0.05) #example 4

This assumes that dataframe df has columns named pvalue, biotype, and tissue.

The last two examples use boolean operators for OR (|) and AND (&). Example 3 will get rows where column tissue has either value "CVS" or value "Placenta". Example 4 will get rows only if they have both tissue value "CVS" as well as pvalue<0.05, but not rows where either condition is false.
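
Here is a small runnable sketch of the OR and AND examples, using a made-up dataframe (the gene names and values are invented). The last line uses %in%, a base R operator that can replace a chain of OR conditions on the same column.

```r
df <- data.frame(
  gene   = c("geneA", "geneB", "geneC", "geneD"),
  tissue = c("CVS", "Placenta", "Liver", "CVS"),
  pvalue = c(0.01, 0.20, 0.03, 0.50),
  stringsAsFactors = FALSE
)

# OR: rows where tissue is CVS or Placenta (geneA, geneB, geneD)
either <- subset(df, tissue == "CVS" | tissue == "Placenta")

# AND: rows where tissue is CVS and pvalue is below 0.05 (geneA only)
both <- subset(df, tissue == "CVS" & pvalue < 0.05)

# Same rows as the OR example, written with %in%
same <- subset(df, tissue %in% c("CVS", "Placenta"))
```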

Another way to subset dataframes is using their row and column indexes:

  • dfsub = df[1:200, 1:10]
  • dfsub = df[, 1:10]
  • dfsub = df[c(1,4,8,20,3), c("pvalue", "tissue")]

Example 1 gets the first 200 rows and the first 10 columns.

Example 2 gets all rows and the first 10 columns.

Example 3 gets only row numbers 1, 4, 8, 20, and 3 (with 3 placed at the end) and columns pvalue and tissue.
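
The same idea on a tiny made-up dataframe, as a sketch (the values are invented, and the row and column numbers are smaller than in the examples above so they fit the data):

```r
df <- data.frame(
  gene   = c("geneA", "geneB", "geneC", "geneD"),
  tissue = c("CVS", "Placenta", "Liver", "CVS"),
  pvalue = c(0.01, 0.20, 0.03, 0.50),
  stringsAsFactors = FALSE
)

first_two <- df[1:2, 1:2]     # first 2 rows and first 2 columns
all_rows  <- df[, 1:2]        # all rows, first 2 columns
picked    <- df[c(4, 1), c("pvalue", "tissue")]   # row 4, then row 1

dim(first_two)   # 2 2
dim(all_rows)    # 4 2
picked$pvalue    # 0.50 0.01
```

Rows come back in the order you list them, which is why row 4 appears before row 1 in picked.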


Merging dataframes with base R

Use the merge function with two dataframes (df1, df2) and a column used to match values.

  • dfmerged = merge(x=df1, y=df2, by="ensembl", all=TRUE) #example1
  • dfmerged = merge(x=df1, y=df2, by="ensembl", all=FALSE)
  • dfmerged = merge(x=df1, y=df2, by="ensembl", all.x=TRUE)
  • dfmerged = merge(x=df1, y=df2, by.x="ensembl", by.y="Ensembl_ID", all.x=TRUE)




Example 1 merges the datasets using values in column ensembl, and returns all rows even if they don't match.

Example 2 merges the datasets using values in column ensembl, and only returns matches.

Example 3 merges the datasets using values in column ensembl, and returns all rows in df1 even if they didn't have a match. This is useful if your df1 is a small spreadsheet of genes of interest and df2 is a large dataset (e.g. a download from Human Protein Atlas).

Example 4 is like example 3, except it specifies different column names to use for matching. Use this if your spreadsheets don't name their columns the same way.
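
To make the four merge examples concrete, here is a sketch with two tiny made-up dataframes. The IDs ENSG1 to ENSG4 are placeholders, not real Ensembl IDs, and the column names symbol and expression are invented.

```r
df1 <- data.frame(ensembl = c("ENSG1", "ENSG2", "ENSG3"),
                  symbol  = c("a", "b", "c"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ensembl    = c("ENSG2", "ENSG3", "ENSG4"),
                  expression = c(10, 20, 30),
                  stringsAsFactors = FALSE)

m_all   <- merge(x = df1, y = df2, by = "ensembl", all = TRUE)    # 4 rows
m_match <- merge(x = df1, y = df2, by = "ensembl", all = FALSE)   # 2 rows
m_left  <- merge(x = df1, y = df2, by = "ensembl", all.x = TRUE)  # 3 rows

# Example 4: the ID column has a different name in the second dataframe
df2b <- df2
names(df2b)[1] <- "Ensembl_ID"
m_names <- merge(x = df1, y = df2b, by.x = "ensembl", by.y = "Ensembl_ID",
                 all.x = TRUE)

m_names$expression   # NA 10 20 (ENSG1 has no match in df2b)
```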

Remember to check dimension values afterward to see how things change:

dim(dfmerged); dim(df1); dim(df2)

Also check if there are duplicate row values. If the two lengths below are not the same, then you have duplicates.

length(unique(dfmerged$ensembl))
length(dfmerged$ensembl)
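
If the two lengths differ, base R's duplicated() function can show which values are repeated. Below is a sketch with made-up data in which one ID appears twice in the second dataframe (the IDs and the transcript column are invented).

```r
df1 <- data.frame(ensembl = c("ENSG1", "ENSG2"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ensembl    = c("ENSG1", "ENSG1", "ENSG2"),
                  transcript = c("t1", "t2", "t3"),
                  stringsAsFactors = FALSE)

dfmerged <- merge(x = df1, y = df2, by = "ensembl", all = TRUE)

length(unique(dfmerged$ensembl))   # 2
length(dfmerged$ensembl)           # 3, so there is a duplicate

# duplicated() marks the second and later copies of each value
repeated <- dfmerged$ensembl[duplicated(dfmerged$ensembl)]
print(repeated)                    # "ENSG1"
```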

View your data after merging. Check at least the first 20 rows:

View(head(dfmerged, 20))  #the V is uppercase


Specify base::merge if you loaded packages that have their own merge function (the base R function is then "masked"). Adding the base:: in front will tell R to use the default (base R) merge function. Likewise, you can add someOtherPackage::merge instead if you want to use a different merge function.

dfmerged = base::merge(x=df1, y=df2, by="ensembl", all=TRUE)

Write data

Write dataframes to spreadsheets

write.csv(dfmerged, "my merged spreadsheet.csv")

write.csv(dfmerged, file="my merged spreadsheet.csv", row.names=F) 

write.table(dfmerged, file="my merged spreadsheet.tsv", sep="\t",  quote=F, fileEncoding="UTF-16LE")


    To write to .csv, you just need your dataframe and a filename. You can optionally add row.names=FALSE if your row names aren't meaningful (e.g. 1,2,3,4,5,...) and you don't want them saved as a column.

    To write .tsv files, same thing, but add sep="\t" to tell R to tab-separate values.

    Optional: quote=FALSE removes quotation marks from string values in the final spreadsheet.

    Optional: fileEncoding is something I've only needed once, to fix an error with a GWAS Linux tool not correctly reading a spreadsheet. I had to re-make the spreadsheet with R in Windows and specify the file encoding.
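
To see what row.names=FALSE changes, here is a sketch that writes a tiny made-up dataframe to a temporary folder and reads it back. The data, the second file name, and the variable names are invented for illustration.

```r
dfmerged <- data.frame(ensembl = c("ENSG1", "ENSG2"),
                       pvalue  = c(0.01, 0.20),
                       stringsAsFactors = FALSE)

f_plain  <- file.path(tempdir(), "my merged spreadsheet.csv")
f_rownum <- file.path(tempdir(), "with row names.csv")

write.csv(dfmerged, file = f_plain, row.names = FALSE)
write.csv(dfmerged, file = f_rownum)    # default keeps the row names

back_plain  <- read.csv(f_plain, stringsAsFactors = FALSE)
back_rownum <- read.csv(f_rownum, stringsAsFactors = FALSE)

names(back_plain)    # "ensembl" "pvalue"
names(back_rownum)   # "X" "ensembl" "pvalue" (row names became column X)
```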


Write session info to a time-stamped text file

timestamp = format(Sys.time(), "%Y%m%d-%H%M")  ## e.g. 20200809-0925 
f.session = paste0("sessionInfo_",timestamp,".txt")
writeLines(capture.output(sessionInfo()), con=f.session)


This code saves the output of sessionInfo() which tells you the version of R and the version of any loaded R package. It is useful information to keep for lab notebooks and publications.


Example output of sessionInfo()


Functions Introduction

Example plot function:

makeAPlot = function(requiredValue1, requiredValue2, Value3="blue", Value4=TRUE, whereline=5, mylinecol="yellow") {

   plot(requiredValue1, requiredValue2, col=Value3)
   abline(v=whereline, col=mylinecol)

   if(Value4==TRUE) {print("Hello!")}
   return(requiredValue1+5)
}


Example run:

x = 1:10 

y = x^2 

makeAPlot(x,y, mylinecol="red")


Output:



[1] "Hello!"

[1] 6  7  8  9 10 11 12 13 14 15


Notes:
  • When creating the function, indicate all your variables within function() parentheses and indicate also if they have a default value. For example, "blue" is the default for Value3.
  • Any variables without a default value are required and must be defined when calling the function.
  • Any variables with a default value aren't required. If you don't provide a value, the default will be used. In the example, I made my line red instead of the default yellow.
  • Recommended: use return() at the end of the function to indicate the value you want back. 
    • Otherwise you get back the value of the last expression that was run, e.g. "Hello!"
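
The notes on required and default values can be tried without a plot. The function below, addOffset, is a made-up example, not part of base R.

```r
addOffset <- function(values, offset = 5, verbose = TRUE) {
  if (verbose) print("Hello!")
  return(values + offset)
}

a <- addOffset(1:3)                                # prints "Hello!"; a is 6 7 8
b <- addOffset(1:3, offset = 10, verbose = FALSE)  # b is 11 12 13

# Leaving out the required argument causes an error, caught here with tryCatch
err <- tryCatch(addOffset(verbose = FALSE),
                error = function(e) "values is required")
print(err)
```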

functionForColumns = Vectorize(functionForSingleValue)
  • If you create a function functionForSingleValue but want to apply it to a column, then use the Vectorize() function to create a new function
  • This is useful for more complex functions where apply() can't be used
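
A sketch of Vectorize() in use. The function classifyPvalue is a made-up example: its if statement only checks a single value, so it does not work as intended on a whole column until it is wrapped.

```r
# Works on one value at a time
classifyPvalue <- function(p) {
  if (p < 0.05) "significant" else "not significant"
}

# Vectorize() returns a new function that applies it to each value in turn
classifyPvalues <- Vectorize(classifyPvalue)

labels <- classifyPvalues(c(0.01, 0.20, 0.04))
print(labels)   # "significant" "not significant" "significant"
```

With a dataframe you could then write something like df$label = classifyPvalues(df$pvalue), assuming df has a pvalue column.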


Free R courses and learning tools

Courses:

R plotting resources:


R references:


----------------------------------------------------
Last updated: June 27, 2023
