August 24, 2022

R programming lesson #1: load data, subset, and write to a new file

See also: "How to get started with R programming"

Purpose: 

Learn to clean an Excel spreadsheet with RNA-seq data, re-save as a .csv file, import into R, look at the data, subset to significant genes, and write the smaller spreadsheet to a new .csv file.


Start RStudio and set your default working directory:

Go to "Tools" and "Global Options":


Within "Global Options", click "Browse" to set your default working directory. Pick your folder for working with R. This is the folder where RStudio will look for files and save files by default. 



Click "OK" after setting your new default working directory.

Close RStudio and then open it again. Your working directory should be whatever you selected under Global Options. Check that it changed by typing this into the Console and hitting [Enter]:

getwd()

Capitalization matters for code so make sure you write lowercase getwd() and not Getwd() or getWD() or some variation.


Get a list of files and folders in your current directory:

dir() #outputs a list of directory contents

Note - you can write anything after the hashtag (#) and R will ignore it. This is how you can take notes and write explanations about your code, also called "adding comments" or "commenting your code". This is good practice and you should do it whenever you code. Commenting your code helps you remember why you did something and helps other people understand your code. 

These next three lines give you the same result:  

dir()

dir() #some comment or random text that RStudio ignores

dir("./") #this third one includes a file path shorthand that means "this folder"


You can also check the contents (files and folder) one level ABOVE your directory:

dir("../")

The file path shorthand "../" is telling RStudio to look one level ABOVE your current folder, kind of like the up arrow button on your navigation window, but without actually changing your location:



If you want to see what is two levels above, try this:

dir("../../")

Note that you might get an error if you are already at the top level or if there isn't anything one or two levels above you. 

For example, if your working directory is "C:/" then there isn't anything at "../" or "../../"

However, if your working directory is "C:/Dropbox/Tania/code/", then

"./" is "C:/Dropbox/Tania/code/", the same folder

"../" is ""C:/Dropbox/Tania/"

"../../" is "C:/Dropbox/"

"../../../" is "C:/"

"../../../../" is nowhere so you'll get an error message.


Make folders and change work directories

Make a folder named "data" inside your current directory and navigate to it:

dir.create("data") #creates a new directory (folder)


Check that it worked by checking your directory contents once again:

dir()


Go to that new folder and make it your new working directory with EITHER one of these lines, but NOT both:

setwd("data") #navigates to the new directory

setwd("./data") #another equivalent way to navigate to the new directory

Do NOT run both versions of the setwd function code above or else you are telling R to enter the "data" folder twice, sequentially, which won't work the second time unless you have a folder at "data/data". You will get the error cannot change working directory if you ask R to use a folder that doesn't exist.


Now exit the "data" folder by making your working directory one level up from your current location:

setwd("../")


Alternatively, you can also navigate to a working directory by using the complete directory address, assuming the path exists. Use your own folder location between the quotes, not my exact example below:

setwd("C:/Users/Tania/Dropbox/Rcode/")


Error troubleshooting example:

Error "cannot change working directory" means your path doesn't exist. Check capitalizations and spelling. You can look for the error by checking various parts of your path, for example:

dir.exists("C:/Users/Tania/") # does this path exist?

TRUE

dir.exists("C:/Users/Tania/Dropbox") # does this path exist?

TRUE

dir.exists("C:/Users/Tania/Dropbox/Rcode") # does this path exist?

FALSE


My next step here would be to check why I got FALSE at the last step. Maybe the last folder is actually named "rcode" or "R-code" or "R code"? Capitalization, hyphens, spaces all matter.


Download the following file to folder "data":

  • "Additional File 2: Table S1" Excel file from Gonzalez et al. "Sex differences in the late first trimester human placenta transcriptome", Biology of Sex Differences vol 9, article 4 (2018) 

Clean Additional File 2 for use with R programming:

Open Additional File 2 in Excel and scroll to the bottom.

Delete all the non-table rows at the bottom, starting from the row immediately after the table. You need to delete the empty rows between the table and the extra information as well, or R may give you errors when you try to import the data. 






You want a completely rectangular data table. Extra rows at the top are ok. Notice that the column names are on row #3. That's the table header. During import, we will skip the first two rows because of this.



"Save As" a new file formatted as a "comma-separated values" or comma delimited file (gonzalez2018-tableS1.csv). The csv file extension matters because it tells the computer the file type.

Don't save as an Excel .xlsx file.




Excel will warn you that the .csv format does not allow "some features" (color coding, filtering, multiple sheets, figures, formulas, etc) - yes, that is ok if you are saving as a new file. Click "yes".



Warning: if you are working with a multi-sheet or filtered spreadsheet, saving from an Excel file to csv will ONLY save the rows that meet your filter criteria in the current sheet. Other sheets will be lost. Color coding, graphs, and formulas will also be lost. Always "File: Save As" a new file with a different file name when saving to csv format to avoid data loss. Don't overwrite your original file.

The rest of the code assumes you are saving the spreadsheet to a subfolder called "data", and that your code is not also in that folder. If your code and the spreadsheet are in the same folder then remove "./data/" from the spreadsheet file name in the examples below.


Create a new R script and import data:

In RStudio, File: New File: R script

Save a new file outside your data folder. For example, "merge-script_2022-08.R"
(Update the year and month for the current date.)




Load the Additional File 2 from Gonzalez et al. 2018 with this code:

mydata = read.csv("./data/gonzalez2018-tableS1.csv", skip=2,
                  header=TRUE, stringsAsFactors=FALSE)

This is also ok:

mydata = read.csv(file="./data/gonzalez2018-tableS1.csv", skip=2,
                  header=TRUE, stringsAsFactors=FALSE)

Highlight the code and use keys [Ctrl] + [Enter] to run only the highlighted lines.

Results will show up on the Console window. The data finishes running when you see the ">" symbol as the last line on the console window, meaning R is ready for the next command.

Notes on what you just ran:
  • TRUE and FALSE need to be capitalized. R interprets these not as words, but as binary values (yes/no, 1 or 0). R also understands T and F, where T=TRUE and F=FALSE. We are using header=TRUE because our table has column names.
  • We are using skip=2 because the header begins on row #3 for this specific example. If the header began on row #1, you could leave out the skip variable and R would assume the default (skip=0).
  • Strings are pieces of text like "apple" or "2" or "hello world". The stringsAsFactors=FALSE tells the program to not convert the string values to categorical/factor values. R will instead decide for each column if the values are strings (text) or numerical (numbers). Values that are missing will become NA (not available) values. 
  • Variable order doesn't matter if you specifically name each variable, e.g. file="your file name"
  • Variable order DOES matter if you leave out the variable name, e.g. "your file name".

You can learn more about the read.csv function by typing this into the console:
?read.csv

A help window will show up explaining the function. Any function variable that is followed by an equal sign and a value has a default value. If there is no default value, then the variable is required and the function won't work without you giving it a value. For example, the variable file in read.csv() is required. The variable header is not required, but R automatically assumes that the data doesn't have column names. Because our table does have column names, we replace the default header value with header=TRUE during our data import.

Below, we see that stringsAsFactors=FALSE is the default value. That is a recent change for R version 4.0 and above, but previously stringsAsFactors=TRUE was the default value. It's good practice to keep setting the variable value you want since some computers still run R versions 3.2.



Troubleshooting "no such file or directory" error: 
If you try to read the spreadsheet with read.csv and get an error, first check your work directory and your file path. For example, the error below for write.csv is happening because I am already inside work directory "~/data" (see arrow), so when I try to read a file in "data/..." I am accidentally telling R to look inside a second folder named "data" inside the first "data" folder, or "~/data/data/".   


You can solve this error by either 
  1. backing out of the "data" folder using setwd("../"), or 
  2. removing "data/" or "./data/" from the file value, instead using file="gonzalez2018-tableS1.csv"

If that doesn't work, check your working directory location and contents: 
getwd()  # where are you?
dir()       # what files are there?

Also check any contents inside your data folder:
dir("data") 
This outputs a list of contents. Is the file there? Is the filename spelled correctly in your code, including capitalization and spaces and hyphens? Do you have the correct file extension, ".csv"?

If you are having issues running dir("data"), remember that capitalization, quotes, and spacing are important, so dir(" data") and dir(data) and dir("Data") are all different code. You also won't be able to run dir("data") if you don't see the data folder when you run dir() by itself.

You can also check if R thinks the file exists:
x = "gonzalez2018-tableS1.csv"
file.exists(x)

y = "data/gonzalez2018-tableS1.csv"
file.exists(y)

You should get TRUE if you have the correct file location and filename. If you get FALSE, then the error is with your directory location or filename, not with the read.csv or write.csv functions.

Once there are no more errors and made variable mydata by reading in data, continue...
Check number of rows and columns with the dimensions function:
dim(mydata)

When loading data, it is useful to check the number of rows and columns to make sure you loaded the full dataset. If you check Gonzalez et al. 2018, you will see we found 14,250 expressed genes in late first trimester placenta. This file has all of those expressed genes.

Check column names using either of these lines:
colnames(my data)
names(mydata)



In this file, the first 22 columns are overall values with either gene information (such as chromosome "Chr") or analysis information (such as false discovery rate "FDR"). Columns 23-61 have log2FPKM expression values for the 39 sequenced samples. Columns 62-100 have counts normalized for sequencing depth also called baseMeans in the DESeq2 analysis. Here, F=female and M=male.  


Check the first five values of particular columns like this:
head(mydata$Ensemble.Gene.ID)
head(mydata$Gene.Symbol)

Check the first 10 rows for columns 5 to 9 as follows:
mydata[1:10, 5:9]



Count how many genes meet the criteria FDR<0.05 using the table function. Add the option useNA="always" to make sure you also count any rows that do not have FDR values. R will evaluate FDR<0.05 as either TRUE or FALSE or not applicable (NA).
table(mydata$FDR<0.05, useNA="always")

Check the unique values for column "Biotype":
unique(mydata$Biotype)

Check the unique values for column "Biotype.category":
unique(mydata$Biotype.category)

Count how many genes are FDR<0.05 per Biotype.category:
table(mydata$Biotype.category, mydata$FDR<0.05, useNA="always")



Create a column for fold change. The dataset already has a log2FoldChange column, so you can undo the log as follows:
mydata$FC = 2^mydata$log2FoldChange


Create an inverse fold change column:
mydata$InverseFC = 1/mydata$FC


For Gonzalez et al. 2018, fold change refers to the expression ratio of females to males, so FC=2.5 means a gene is 2.5-fold upregulated in females, or 2.5 times more expressed in females compared to males. If a gene has FC=0.2, then it is 0.2-fold "upregulated" in females, except it isn't really upregulated in females because the expression is higher in males. The fold change is less than one.

Inverse fold change is the expression ratio of males to females, which is more useful to use for genes that are actually downregulated in females. A gene with FC=0.2 will have InverseFC=5, meaning it is 5-fold upregulated in males, compared to females.

Check how many of the 58 significant genes fall in one direction vs the other:
table(mydata$FC>1, mydata$FDR<0.05, useNA="always")

Notice that there are some NA (not available or empty) values. This is important information because it affects how we compare the data.

The ifelse function

Create a column for upregulation direction. For this particular RNA-seq dataset, fold change is coded as females/males so I will use that information to define the direction. For other datasets, check the manuscript or ask the data owners.

mydata$direction = ifelse(mydata$FC>1, "Females", "Males")

The ifelse() function works by evaluating the condition, in this case if FC>1, then returning the first value if TRUE or else the second value if FALSE. 

It is rare, but some genes might have FC=1 which means the expression is the same between females and males. In those cases, my code will say that FC>1 is FALSE and put "Males" in the direction value, which I don't want because those genes are not really upregulated in either direction. 

To fix this possible problem, evaluate if FC equals one (using the double equal sign to get back TRUE or FALSE) and re-make the direction column:
mydata$direction = ifelse(mydata$FC==1, "Same", mydata$direction)

If FC==1 is TRUE, then I overwrite the direction value with "Same", otherwise I keep whatever value was already there from the previous line of code.

Other examples with the ifelse() function:
x = 1
ifelse(x==1, "x is one", "x is not one")

The first line with x=1 single equal sign assigns the value to x.

The second line with two equal signs evaluates if x equals 1 and returns TRUE or FALSE. You can find out what R thinks by evaluating this in the console:

x == 7 # use double equal signs to check equality, or else you will overwrite the value
x  # what is the value of x?

More examples:
x = 2
ifelse(x>3, "x is greater than three", "x is less than or equal to three")

a = 1
b = 0
c = 50
x = a + b + c
ifelse(x>3, "x is greater than three", "x is not greater than three")


x = log10(50); x
ifelse(x>3, "x is greater than three", "x is not greater than three")

Here, I assign x the value of log10(50) and then print out the new value of x in the same line. R needs the semicolon to separate the two operations or else it will print an error. Instead of the semicolon, I could have used a new line, too.

x = NA
x
ifelse(x>3, "x is greater than three", "x is less than or equal to three")

R returns NA here because it cannot tell you if NA>3. This is good to know because the alternative might be that R tells you "x is less than or equal to three" since x>3 isn't true. We don't want either value in this case.

When making our "direction" column earlier, we would be misleading ourselves if genes with fold change NA had direction "Males" since that FC>1 isn't true but FC<1 and FC==1 also aren't true. It could cause problems with later data filtering and counting steps.

x = "potato" 
ifelse(x>3, "x is greater than three", "x is less than or equal to three")

Unexpectedly, R thinks that "potato" is greater than three. Double check this:
"potato" > 3

That's weird! Try to remember these weird assumptions as you learn to code. They can help you troubleshoot your code when it doesn't work as expected. In this case, this happens because R is giving the string value (the word) a numeric value.

Subset Gonzalez 2018 data to genes which are FDR<0.05

Subsetting data, or taking a section of a larger dataset, is a useful skill when you want to focus on only rows that meet certain criteria.

Get a sub-dataset from mydata using the subset() function and filtering to anything in column "FDR" less than 0.05.
signif = subset(mydata, FDR<0.05)

How many rows and columns do you have? You should have 58 rows. 
dim(signif)

That's not too many rows. It won't crash your computer to view everything in dataset signif, so take advantage and open the spreadsheet in RStudio like this:
View(signif) #note that View is capitalized

When the data is larger, sometimes I open only a few rows, for example view the first 100 rows of the original dataset mydata like this:
View(mydata[1:100, ]) #note that View is capitalized

Note that R always puts the rows first, then the columns. Here, I am selecting all rows from 1 to 100, then adding a comma and leaving a blank space. This tells R to show all columns.

You can also view a specific list of columns like this:
View(mydata[1:100, c("Symbol", "Chr", "direction", "FDR")])

I am still viewing rows 1 to 100, but now I am telling R to show me a list of columns which I specifically named. The function c() concatenates these values into a list so that R understands that I want all of these columns. You can see this list here:

You can also save the list to a variable and use the variable to select the columns:
mySelectedColumns = c("Symbol", "Chr", "direction", "FDR", "Biotype")
mySelectedColumns #show the list values
View(mydata[1:100, mySelectedColumns]) #open a spreadsheet



Going back to the dataset signif...

Count the genes in each direction:
table(signif$direction)


Count the genes in each direction and this time check if there are any NA values:
table(signif$direction, useNA="always")


Count the genes in each direction, per chromosome:
table(signif$direction, signif$Chr)

Count the genes in each direction, per biotype:
table(signif$direction, signif$Biotype)

Count the genes in each direction, per biotype, but flip the order so it is easier to read:
table(signif$Biotype, signif$direction)

If you get errors, check that your capitalization is correct. The columns need to be named exactly as in the dataset. You can check column names again using names(signif)


Get a list of the genes upregulated in females:
subset(signif, direction=="Females")$Symbol

What this is doing is subsetting dataset signif to rows with value "Females" in column "direction", then giving you the results of column "Symbol" which is the column with gene names.

If you get errors, first make sure that you are using two equal signs (which tells R to evaluate if the values are equal). If that's not the problem, make sure that the values in direction exist. Check all possible value in column "direction":
unique(signif$direction)

Similarly, get a list of the genes upregulated in males:
subset(signif, direction=="Males")$Symbol

In these two subsets above, R is making the subset but not saving it anywhere. If you want to save the subsets for later use, you can do so by putting them in new variables, for example:

signif.F = subset(signif, direction=="Females")

signif.M = subset(signif, direction=="Males")

Then you can get the values in column "Symbol" like this:

signif.F$Symbol

signif.M$Symbol




Save the subset of significant genes and submit as proof of completion

You started with a spreadsheet of 14,250 genes. You subset to 58 significant genes. Save this smaller spreadsheet to R like this:

write.csv(signif, file = "data/gonzalez2018-tableS1_significant.csv", row.names=TRUE)

I am telling R which dataset I want to save (signif) and what file path and filename I want (folder "data", filename "gonzalez2018-tableS1_significant.csv"). If you get an error, make sure that folder "data" exists. Check what exists in your current directory with dir()

Save the file with row.names=TRUE first, open the spreadsheet with Excel and look at the first column. In this specific example, I didn't assign row names to be anything useful The row numbers correspond to the original order in mydata but otherwise they are not very meaningful and they just add an extra column I don't want to keep.

Close the file and re-save with row.names=FALSE. Open the file. That extra column is now gone.

If you're doing this as an exercise for me, send me the signif subset csv file without row names to show me you completed this part.

-----------------------------------
Updated 3/10/2024

No comments:

Post a Comment

Bookmarks: single cell RNA-seq tutorials and tools

These are my bookmarks for single cell transcriptomics resources and tutorials. scRNA-seq introductions How to make R obj...