See also: "How to get started with R programming"
Purpose:
Learn to clean an Excel spreadsheet with RNA-seq data, re-save as a .csv file, import into R, look at the data, subset to significant genes, and write the smaller spreadsheet to a new .csv file.
Start RStudio and set your default working directory:
In RStudio, open "Global Options" (under the "Tools" menu) and click "Browse" to set your default working directory. Pick the folder you want to use for working with R. This is the folder where RStudio will look for files and save files by default.
Click "OK" after setting your new default working directory.
Close RStudio and then open it again. Your working directory should be whatever you selected under Global Options. Check that it changed by typing this into the Console and hitting [Enter]:
getwd()
Capitalization matters for code so make sure you write lowercase getwd() and not Getwd() or getWD() or some variation.
Get a list of files and folders in your current directory:
dir() #outputs a list of directory contents
Note - you can write anything after the hashtag (#) and R will ignore it. This is how you can take notes and write explanations about your code, also called "adding comments" or "commenting your code". This is good practice and you should do it whenever you code. Commenting your code helps you remember why you did something and helps other people understand your code.
These next three lines give you the same result:
dir()
dir() #some comment or random text that RStudio ignores
dir("./") #this third one includes a file path shorthand that means "this folder"
You can also check the contents (files and folders) one level ABOVE your directory:
dir("../")
The file path shorthand "../" tells RStudio to look one level ABOVE your current folder, kind of like the up arrow button in a file navigation window, but without actually changing your location.
If you want to see what is two levels above, try this:
dir("../../")
Note that you cannot go higher than the top level of your drive. If you are already at the top, R will usually not give an error; it either shows the top level again or returns an empty result, character(0).
For example, if your working directory is "C:/" then there isn't anything above it at "../" or "../../"
However, if your working directory is "C:/Dropbox/Tania/code/", then
"./" is "C:/Dropbox/Tania/code/", the same folder
"../" is "C:/Dropbox/Tania/"
"../../" is "C:/Dropbox/"
"../../../" is "C:/"
"../../../../" is nowhere new, because there is nothing above "C:/".
Make folders and change working directories
Make a folder named "data" inside your current directory and navigate to it:
dir.create("data") #creates a new directory (folder)
Check that it worked by checking your directory contents once again:
dir()
Go to that new folder and make it your new working directory with EITHER one of these lines, but NOT both:
setwd("data") #navigates to the new directory
setwd("./data") #another equivalent way to navigate to the new directory
Do NOT run both versions of the setwd() code above, or else you are telling R to enter the "data" folder twice in a row, which won't work the second time unless you have a folder at "data/data". You will get the error "cannot change working directory" if you ask R to use a folder that doesn't exist.
Now exit the "data" folder by making your working directory one level up from your current location:
setwd("../")
Alternatively, you can also navigate to a working directory by using the complete directory address, assuming the path exists. Use your own folder location between the quotes, not my exact example below:
setwd("C:/Users/Tania/Dropbox/Rcode/")
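One way to avoid that error is to check that a folder exists before moving into it, and to remember where you came from. The sketch below uses a "data" folder inside R's temporary directory purely as a stand-in; the variable names target and previous are made up for this example:

```r
# Hypothetical folder for illustration: "data" inside R's temporary directory
target = file.path(tempdir(), "data")

# Create the folder only if it is not already there
if (!dir.exists(target)) {
  dir.create(target)
}

# setwd() invisibly returns the previous working directory, so save it
previous = setwd(target)
getwd()          # now inside the "data" folder

setwd(previous)  # go back to where you started
getwd()
```

Saving the previous directory means you can always return to it, even if you are not sure how many levels you moved.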
Error troubleshooting example:
Error "cannot change working directory" means your path doesn't exist. Check capitalization and spelling. You can find the mistake by checking your path one piece at a time, for example:
dir.exists("C:/Users/Tania/") # does this path exist?
TRUE
dir.exists("C:/Users/Tania/Dropbox") # does this path exist?
TRUE
dir.exists("C:/Users/Tania/Dropbox/Rcode") # does this path exist?
FALSE
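Checking each piece by hand gets tedious for long paths. As a rough sketch, the same check can be wrapped in a small function. Note that check_path is a made-up helper for this example, not something built into R, and the "Rcode/scripts" path below is only an illustration:

```r
# Hypothetical helper (not part of base R): lists every level of a path,
# from the top down, and reports whether each level exists
check_path = function(path) {
  steps = character(0)
  current = path
  repeat {
    steps = c(current, steps)        # put each parent folder in front
    parent = dirname(current)
    if (parent == current) break     # reached the top, so stop
    current = parent
  }
  data.frame(path = steps, exists = dir.exists(steps), stringsAsFactors = FALSE)
}

# Example: a folder that should not exist inside R's temporary directory
check_path(file.path(tempdir(), "Rcode", "scripts"))
```

The first row that says FALSE is the part of the path to look at for a spelling or capitalization mistake.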
Download the following file to folder "data":
- "Additional File 2: Table S1" Excel file from Gonzalez et al. "Sex differences in the late first trimester human placenta transcriptome", Biology of Sex Differences vol 9, article 4 (2018)
Clean Additional File 2 for use with R programming:
Open Additional File 2 in Excel and scroll to the bottom.
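Delete all the non-table rows at the bottom, starting from the row immediately after the table. You need to delete the empty rows between the table and the extra information as well, or R may give you errors when you try to import the data. Then save the cleaned sheet as a .csv file in your "data" folder (this tutorial uses the name "gonzalez2018-tableS1.csv").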
Create a new R script and import data:
- TRUE and FALSE need to be capitalized. R interprets these not as words, but as logical (binary) values (yes/no, 1 or 0). R also understands T and F, where T=TRUE and F=FALSE, but spelling out TRUE and FALSE is safer because T and F can be overwritten with other values. We are using header=TRUE because our table has column names.
- We are using skip=2 because the header begins on row #3 for this specific example. If the header began on row #1, you could leave out the skip argument and R would assume the default (skip=0).
- Strings are pieces of text like "apple" or "2" or "hello world". The stringsAsFactors=FALSE tells the program to not convert the string values to categorical/factor values. R will instead decide for each column if the values are strings (text) or numerical (numbers). Missing values in number columns will become NA (not available) values, while empty text cells are read in as empty strings.
- Argument order doesn't matter if you specifically name each argument, e.g. file="your file name"
- Argument order DOES matter if you leave out the argument name, e.g. "your file name".
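If R cannot find the file, your working directory may already be the "data" folder. You can fix this by either:
- backing out of the "data" folder using setwd("../"), or
- removing "data/" or "./data/" from the file value, instead using file="gonzalez2018-tableS1.csv"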
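The notes above describe the arguments header, skip, and stringsAsFactors, which belong to the read.csv() function. Here is a minimal sketch of how they might fit together. So that it runs anywhere, it first writes a tiny made-up file (demo-tableS1.csv, with invented gene names and values) that has two extra rows above the header; the last comment shows what the call could look like for the real file, assuming it was saved as "gonzalez2018-tableS1.csv" inside "data":

```r
# Made-up stand-in for the real spreadsheet: two title rows, then the header
demo_file = file.path(tempdir(), "demo-tableS1.csv")
writeLines(c(
  "Table S1. Example title row,,,",
  ",,,",
  "Symbol,Chr,direction,FDR",
  "GENE1,chrX,up,0.001",
  "GENE2,chr1,down,0.2",
  "GENE3,chrY,up,0.03"
), demo_file)

# Named arguments: skip the first 2 rows, use row 3 as column names
mydata = read.csv(file = demo_file, header = TRUE, skip = 2,
                  stringsAsFactors = FALSE)

# Same call with the first two arguments given by position (file, then header)
mydata2 = read.csv(demo_file, TRUE, skip = 2, stringsAsFactors = FALSE)

names(mydata)  # column names
dim(mydata)    # number of rows, then number of columns
head(mydata)   # first few rows
str(mydata)    # the type R chose for each column

# With the real file, the call might look like this:
# mydata = read.csv(file="data/gonzalez2018-tableS1.csv", header=TRUE, skip=2, stringsAsFactors=FALSE)
```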
The ifelse function
Subset Gonzalez 2018 data to genes with FDR<0.05
Get a sub-dataset from mydata using the subset() function and filtering to anything in column "FDR" less than 0.05.
signif = subset(mydata, FDR<0.05)
How many rows and columns do you have? You should have 58 rows.
dim(signif)
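If you want to try the filtering step without the real file, here is a small sketch using a made-up table (the gene names and FDR values are invented for illustration). It also shows a bracket version of the same filter, and how each one treats a missing FDR value:

```r
# Made-up rows for illustration only; not the real Gonzalez et al. table
demo = data.frame(
  Symbol = c("GENE1", "GENE2", "GENE3", "GENE4"),
  FDR = c(0.001, 0.2, 0.03, NA),
  stringsAsFactors = FALSE
)

# subset() keeps rows where the condition is TRUE and drops rows where FDR is NA
demo_signif = subset(demo, FDR < 0.05)
dim(demo_signif)    # rows first, then columns
nrow(demo_signif)   # just the rows
ncol(demo_signif)   # just the columns

# Bracket version: which() is needed here to drop the NA row like subset() does
demo_signif2 = demo[which(demo$FDR < 0.05), ]
identical(demo_signif$Symbol, demo_signif2$Symbol)
```

Without which(), the bracket version would keep an extra row full of NA values for the gene with the missing FDR, which is one reason subset() is convenient for beginners.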
That's not too many rows. It won't crash your computer to view everything in dataset signif, so take advantage and open the spreadsheet in RStudio like this:
View(signif) #note that View is capitalized
When the data is larger, I sometimes open only a few rows. For example, view the first 100 rows of the original dataset mydata like this:
View(mydata[1:100, ]) #note that View is capitalized
Note that R always puts the rows first, then the columns. Here, I am selecting all rows from 1 to 100, then adding a comma and leaving a blank space. This tells R to show all columns.
You can also view a specific list of columns like this:
View(mydata[1:100, c("Symbol", "Chr", "direction", "FDR")])
I am still viewing rows 1 to 100, but now I am telling R to show me only the columns which I specifically named. The function c() combines these values into a vector (a simple list of values) so that R understands that I want all of these columns. You can see this list by running it on its own:
c("Symbol", "Chr", "direction", "FDR")
You can also save the list to a variable and use the variable to select the columns:
mySelectedColumns = c("Symbol", "Chr", "direction", "FDR", "Biotype")
mySelectedColumns #show the list values
View(mydata[1:100, mySelectedColumns]) #open a spreadsheet
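View() opens a spreadsheet window, which only works in an interactive session such as RStudio. The sketch below uses the same row and column selections but prints the results to the Console instead. The table demo and its values are made up for this example:

```r
# Made-up table for illustration only
demo = data.frame(
  Symbol = c("GENE1", "GENE2", "GENE3", "GENE4"),
  Chr = c("chrX", "chr1", "chrY", "chr2"),
  direction = c("up", "down", "up", "down"),
  FDR = c(0.001, 0.2, 0.03, 0.5),
  stringsAsFactors = FALSE
)

demo[1:2, ]                      # rows 1 to 2, all columns
demo[ , c("Symbol", "FDR")]      # all rows, two named columns
demo[1:2, c("Symbol", "FDR")]    # rows 1 to 2, two named columns

# Save the column names to a variable and reuse it
cols = c("Symbol", "Chr", "FDR")
demo[1:2, cols]

head(demo, 2)                    # another way to see the first 2 rows
```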
Going back to the dataset signif...
Count the genes in each direction:
table(signif$direction)
Count the genes in each direction and this time check if there are any NA values:
table(signif$direction, useNA="always")
Count the genes in each direction, per chromosome:
table(signif$direction, signif$Chr)
Count the genes in each direction, per biotype:
table(signif$direction, signif$Biotype)
Count the genes in each direction, per biotype, but flip the order so it is easier to read:
table(signif$Biotype, signif$direction)
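To see how these table() calls behave, including what useNA does, here is a small sketch on a made-up table. The gene directions and biotype labels below are invented for illustration and are not taken from the real dataset:

```r
# Made-up table for illustration only; one direction is missing on purpose
demo = data.frame(
  direction = c("up", "down", "up", NA, "down"),
  Biotype = c("protein_coding", "lincRNA", "protein_coding",
              "protein_coding", "lincRNA"),
  stringsAsFactors = FALSE
)

# One-way count; rows with NA are silently left out by default
table(demo$direction)

# Same count, with an extra column showing how many values are NA
counts = table(demo$direction, useNA = "always")
counts

# Two-way count: biotype in rows, direction in columns
by_type = table(demo$Biotype, demo$direction)
by_type

# Pull out a single cell by row name and column name
by_type["lincRNA", "down"]
```

Whatever you put first in table() becomes the rows and whatever you put second becomes the columns, which is why flipping the order changes how easy the result is to read.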
If you get errors, check that your capitalization is correct. The columns need to be named exactly as in the dataset. You can check column names again using names(signif)
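The last step listed under Purpose is writing the smaller spreadsheet to a new .csv file. One common way to do this is write.csv(). The sketch below uses a made-up stand-in for signif and saves it to R's temporary directory; the file names "demo-signif.csv" and "gonzalez2018-signif.csv" are only suggestions:

```r
# Made-up stand-in for the signif dataset
demo_signif = data.frame(
  Symbol = c("GENE1", "GENE3"),
  FDR = c(0.001, 0.03),
  stringsAsFactors = FALSE
)

# Hypothetical output file inside R's temporary directory
out_file = file.path(tempdir(), "demo-signif.csv")

# row.names=FALSE stops R from adding an extra column of row numbers
write.csv(demo_signif, file = out_file, row.names = FALSE)

file.exists(out_file)   # check that the file was created

# Read it back in to confirm it looks right
check = read.csv(out_file, stringsAsFactors = FALSE)
dim(check)

# With the real data, the call might look like this:
# write.csv(signif, file="gonzalez2018-signif.csv", row.names=FALSE)
```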