Done by: Low Yi Xiang
R is an open-source language with many libraries available. It is widely used for various purposes, such as data wrangling, visualization, modeling and even building this deck!
R also has rich libraries beyond data wrangling: it can build dashboards, rich interactive graphics, or slides such as the one you are looking at right now!
Head over to https://cran.r-project.org/bin/ or Google "Install R" and click on the first link.
Choose the OS your machine is on, then download and install the software.
Head over to the RStudio website https://www.rstudio.com/products/rstudio/download2/, download RStudio (free license) and install it.
Launch RStudio and you should arrive at the image below:
Click on the top left icon with the green "plus" sign to launch a script
You can create a folder of your choice and click on the save icon (floppy disk). A pop-up menu should guide you along.
Alternatively, if you prefer to save a new copy at a different location, do the usual File -> Save as -> ...
Things we will go through in the next 1-2 hours!
Datatypes (characters, numeric, integers, factors) (part 2)
Dataframes (part 2)
Intermediate R (part 2)
In any programming language, brackets are very important (e.g., every bracket must be closed and commas must be used carefully).
In R, you can do basic calculations, as in most scientific computing languages such as Matlab or Python.
1+1 #addition
10-2 #subtraction
100+2 - 4 #addition and subtraction
224*2 #multiplication
84/4 #division
2^7 #power
"a"
"b"
"this is a cat"
In R, you can assign values or calculations to variable names with the "<-" or "=" symbol.
one <- 1
two <- 2
three <- 3
cat <- "cat"
dog <- "dog"
You can print them out by typing their names or with the print command.
print(cat)
## [1] "cat"
You can perform calculations with Variables, for example:
one+one
two/three
three*two + one
However, if you have not defined the variable, an error is returned
four
## Error in eval(expr, envir, enclos): object 'four' not found
You can also assign new variables to these calculations
four = two*two
four = three+one
print(four)
## [1] 4
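If you are unsure whether a variable has been defined yet, base R's exists() can check for you. This is only a small sketch; the name undefined_demo_variable is made up for illustration and assumes nothing with that name exists in your session.

```r
# exists() takes the variable name as a string and returns TRUE or FALSE
defined_before <- exists("undefined_demo_variable")

# Assign a value (hypothetical variable, used only for this demo)
undefined_demo_variable <- 2 * 2
defined_after <- exists("undefined_demo_variable")

print(defined_before)
print(defined_after)

# Assigning to an existing name simply overwrites the old value
undefined_demo_variable <- 3 + 1
print(undefined_demo_variable)
```

Checking first avoids the "object not found" error shown earlier.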
You can write conditions to check variables, including other kinds of variables such as lists, matrices and dataframes. Recall that brackets are important, so watch out for them!
The if statement allows you to check a condition, and the condition must return TRUE or FALSE. Examples of conditions can be found below:
animal <- "cat"
print(animal == "cat")
## [1] TRUE
There are other functions (more on that later) and comparison symbols, such as != (not equal), >= (greater than or equal to) as well as <= (smaller than or equal to).
one <- 1 ; two <-2
two <= one
## [1] FALSE
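To see all the comparison symbols side by side, here is a small sketch that reuses the one and two variables from above. The vector name comparisons is just for illustration.

```r
one <- 1
two <- 2

# Each comparison returns a single TRUE or FALSE
comparisons <- c(
  equal            = one == two,
  not_equal        = one != two,
  smaller          = one <  two,
  greater          = one >  two,
  smaller_or_equal = one <= two,
  greater_or_equal = one >= two
)
print(comparisons)

# Comparisons also work on text; == needs an exact, case-sensitive match
case_match <- "cat" == "Cat"
print(case_match)
```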
Here is an example of an if statement. (animal == "cat") is the condition, and the round brackets are required.
In addition, notice the curly brackets, which enclose the code that runs when the condition is TRUE.
animal <- "cat"
if(animal == "cat"){
print("your animal is a cat!")
}
## [1] "your animal is a cat!"
What happens when you declare a variable animal <- "cat" and write an if statement that prints "your animal is a bird" if the animal is a bird?
Write another if statement that prints "your animal is not a dog" if the animal is not a dog.
animal <- "cat"
if(animal == "bird"){
print("your animal is a bird")
}
Nothing is printed out!
animal <- "cat"
if(animal != "dog"){
print("your animal is not a dog")
}
## [1] "your animal is not a dog"
What if, for question 1.1, you wanted the if statement to also print out "your animal is not a bird" when the condition is FALSE?
Example:
animal <- "cat"
if(animal == "bird"){
print("your animal is a bird")
}else{
print("your animal is not a bird")
}
## [1] "your animal is not a bird"
What if you have multiple conditions you want to check? One way to do it is to write multiple else-if statements.
animal <- "cat"
if(animal == "bird"){
print("your animal is a bird")
}else if(animal == "dog"){
print("your animal is a dog")
}else{
print("your animal is not a bird or dog")
}
## [1] "your animal is not a bird or dog"
There is also an ifelse function that is fairly convenient for simple tasks. In R, you can type ?ifelse, or any function name with a question mark in front, to access its documentation.
Try the command ?ifelse and read the documentation now.
Here's an additional example:
first_digit <- 2
second_digit <- 4
ifelse(first_digit > second_digit, "first digit is bigger", "second digit is bigger")
## [1] "second digit is bigger"
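One handy property of ifelse is that it checks every element of a vector at once (vectors are covered in more detail later). A minimal sketch, where digits, threshold and size_labels are made-up names for illustration:

```r
digits <- c(2, 7, 4, 9)
threshold <- 5

# ifelse() tests each element and picks the matching label for it
size_labels <- ifelse(digits > threshold, "bigger than 5", "5 or smaller")
print(size_labels)
```

A normal if statement only looks at a single TRUE/FALSE value, so ifelse is the convenient choice when you have many values to label.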
You should notice by now that all of these statements require a condition that returns either TRUE or FALSE.
You can combine these conditions with "and" (&, &&) and "or" (|, ||) operators, or assign them to variables.
digit1 <- 1 ;digit2 <- 2 ; digit3 <-3
condition1 <- digit2 >= digit1
condition2 <- digit3 <= digit2
print(condition1 & condition2 ) #TRUE AND FALSE = FALSE
## [1] FALSE
print(condition1 || condition2) #TRUE OR FALSE = TRUE
## [1] TRUE
#you can then stack them together.
condition3 <- condition1 || condition2
if(condition3){
#code here
}
## NULL
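The example above mixes & and ||, so here is a short sketch of how the single and double versions differ. The variable names values and in_range are only for illustration.

```r
digit1 <- 1; digit2 <- 2; digit3 <- 3
condition1 <- digit2 >= digit1   # TRUE
condition2 <- digit3 <= digit2   # FALSE

# && and || work on single TRUE/FALSE values, which is what if() expects
both_single   <- condition1 && condition2
either_single <- condition1 || condition2

# & and | work element by element on whole vectors
values <- c(1, 5, 10)
in_range <- values > 2 & values < 8

# ! flips TRUE to FALSE and vice versa
negated <- !condition2

print(both_single)
print(either_single)
print(in_range)
print(negated)
```

As a rule of thumb, use && and || inside if statements and & and | when working with vectors.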
In R, you sometimes want to store multiple values, such as observations of people's heights. You can declare them by using c(). More information can be found with ?c.
random_numbers <- c(1,6,3,1,8,5,7,9,0,2)
You can perform math operations on vectors, try them out!
sum(random_numbers) #find the sum
mean(random_numbers) #find the average
mode(random_numbers) #careful: returns the storage mode ("numeric"), not the most frequent item
min(random_numbers) #find the minimum number
max(random_numbers) #find the maximum number
random_numbers*2 #multiply by 2
random_numbers -1 #subtract 1 from each element
There is a lot more functionality available for vectors, but usually you will google it as you need it along the way.
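Base R has no built-in function for the most frequent value, because mode() reports how a value is stored. One possible sketch uses table() to count each value; the names counts and most_frequent are just for illustration.

```r
random_numbers <- c(1,6,3,1,8,5,7,9,0,2)

# table() counts how often each distinct value occurs
counts <- table(random_numbers)

# Keep the value(s) with the highest count; names(counts) holds the values as text
most_frequent <- as.numeric(names(counts)[counts == max(counts)])
print(most_frequent)

# mode() only tells you the storage type of the vector
storage_type <- mode(random_numbers)
print(storage_type)
```

If several values are tied for the highest count, this sketch returns all of them.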
Some functions work on a whole vector, such as finding its length with length(vector).
You can also index specific parts of a vector with square brackets. Multiple elements can be extracted with another vector as follows:
random_numbers[1] #extract the first element
## [1] 1
random_numbers[c(4,6)] #extract the 4th and 6th element
## [1] 1 5
You can also use a TRUE/FALSE vector
greater_than_two <- random_numbers > 2
greater_than_two
## [1] FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
random_numbers[greater_than_two]
## [1] 6 3 8 5 7 9
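A few more indexing tricks that follow the same idea, shown as a small sketch (the variable names on the left are made up for illustration):

```r
random_numbers <- c(1,6,3,1,8,5,7,9,0,2)

# Combine two conditions to keep values greater than 2 and smaller than 8
between <- random_numbers[random_numbers > 2 & random_numbers < 8]

# which() returns the positions where the condition is TRUE
positions <- which(random_numbers > 2)

# A negative index drops that element instead of selecting it
without_first <- random_numbers[-1]

print(between)
print(positions)
print(without_first)
print(length(random_numbers))
```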
One limitation of vectors is that they can only store individual elements: combining two vectors with c() flattens them into one. Suppose that you want to store 2 different vectors together:
a <- c(1,2,3)
b <- c(4,5,6)
new_vect <- c(a,b)
print(new_vect)
## [1] 1 2 3 4 5 6
Lists overcome this problem:
new_list <- list(a,b)
print(new_list)
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 4 5 6
Lists can be indexed in a similar way to vectors; however, extracting an element needs double square brackets: list[[element]].
new_list[[c(1)]]
## [1] 1 2 3
In Lists, it is also possible to assign names and index them by their names. For example:
new_list2 <- list(first_A=a,second_B=b)
new_list2[["first_A"]]
## [1] 1 2 3
In lists, indexing multiple elements is slightly different: use single square brackets, which return a list.
new_list[c(1,2)]
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 4 5 6
Using double square brackets with a vector means that you are taking sub-elements.
new_list[[c(1,2)]] #taking the first element, then take the second element.
## [1] 2
Usually to take sub-elements, one would extract out each element and use appropriate indexing for that class. In this example the class happens to be a vector.
new_list[[1]][c(1,2)]
## [1] 1 2
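To make the difference between single and double square brackets concrete, here is a small sketch using a named list like the one above. The variables single, double and dollar are only for illustration.

```r
a <- c(1,2,3)
b <- c(4,5,6)
new_list2 <- list(first_A = a, second_B = b)

single <- new_list2["first_A"]     # still a list, with one element inside
double <- new_list2[["first_A"]]   # the numeric vector itself
dollar <- new_list2$second_B       # shortcut for new_list2[["second_B"]]

print(class(single))
print(class(double))
print(dollar)
```

Use double square brackets (or $) when you want to work with the element itself, for example to do maths on it.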
Lists have apply functions that are beyond the scope of this course. You can type ?lapply to find out more.
Do note that with the introduction of dataframes and the dplyr package (part 2), most people prefer using dataframes rather than lists when it comes to manipulating data.
Nevertheless, lists are still very important and can be extremely useful as they can store many different variables. In fact, they can store just about anything, such as models, dataframes and functions. They can also be used for functional programming, which is considered an advanced topic.
If you are familiar with programming, you will understand that functions are very important.
For those who are not familiar, functions are essentially reusable pieces of code that you apply to variables, so the same process always produces the same kind of output.
In other words, there are times when you need to apply the same code multiple times. This is where functions are useful, as they (1) reduce the code you need, (2) improve readability and (3) save time.
For instance, suppose you need to take a variable, square it, subtract the original value, and add 8.
x <- 2
x <- x^2- x + 8
x
## [1] 10
Suppose you need to do it multiple times; a function might be a better approach.
x<-2
function_example <- function(x){
x <- x^2-x+8
return(x)
}
x <- function_example(x)
As seen in the previous slides, functions are essentially pieces of code that you can re-use without typing them out again. A function has three parts: the declaration (the function name and its input variables), the body, and the output, which is given by return().
In the earlier example, function_example is the function name, x is the input variable, x <- x^2 - x + 8 is the body, and return(x) is the output.
Here's another (trivial) example. Suppose we have two numbers and multiply them together with a function:
multiply_two_numbers <- function(x,y){
new_number <- x*y
return(new_number)
}
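Once a function is defined, you call it by its name with the inputs in round brackets. A short sketch of calling the function above (it is repeated here so the snippet runs on its own; the result variable names are made up):

```r
multiply_two_numbers <- function(x, y) {
  new_number <- x * y
  return(new_number)
}

# Inputs are matched by position ...
product <- multiply_two_numbers(3, 4)

# ... or by name, in which case the order does not matter
named_product <- multiply_two_numbers(y = 10, x = 2)

# Because * works on vectors, the function does too
vector_product <- multiply_two_numbers(c(1, 2, 3), 2)

print(product)
print(named_product)
print(vector_product)
```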
The code below shows the outline of a function:
<function_name> <- function(input_1,input_2, ... , input_n){ #notice the brackets
#your code here
.
.
.
return(<return the variables you require>)
}
Write a function that takes in a string variable and checks whether it is a dog or a cat; otherwise it returns the string "it is neither a dog nor a cat".
Write a function that takes in a vector of numerical values and returns a list of results containing the length, max, min, and mean.
test_cat_dog <- function(animal){
if(animal == "dog"){
return("your animal is a dog")
}else if(animal == "cat"){
return("your animal is a cat")
}else{
return("your animal is neither a cat nor a dog")
}
}
test_cat_dog("dog")
## [1] "your animal is a dog"
test_cat_dog("bird")
## [1] "your animal is neither a cat nor a dog"
summary_stats <- function(x){
length_x <- length(x)
mean_x <- mean(x)
min_x <- min(x)
max_x <- max(x)
return_list <- list(length = length_x, mean = mean_x, min = min_x, max = max_x)
return(return_list)
}
summary_stats(c(1,2,3,4,5))
## $length
## [1] 5
##
## $mean
## [1] 3
##
## $min
## [1] 1
##
## $max
## [1] 5
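Since summary_stats returns a named list, you can store the result and pull out individual values by name. A minimal sketch (the function is rewritten compactly here so the snippet runs on its own; stats is a made-up variable name):

```r
summary_stats <- function(x) {
  return(list(length = length(x), mean = mean(x), min = min(x), max = max(x)))
}

stats <- summary_stats(c(1, 2, 3, 4, 5))

# Access single results with $ or with double square brackets
print(stats$mean)
print(stats[["max"]])

# names() lists everything that was returned
print(names(stats))
```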
There are other features of functions in R, such as inheritance, functional programming or specifying default values, which are out of scope of this course.
To recap, functions are an extremely useful way to write neater and shorter code. As a rule of thumb, if you need to write the same code twice or more, it is probably a good idea to write a function for it.
Sometimes you will be using functions that are built by others in the form of packages. It is thus important that you know how to write, read and call functions to help your data analysis in R!
There are times when you need to do some task over and over again. In this case, you should think of loops!
There are two kinds of loops: for loops and while loops.
for loops run over a fixed set and perform some task for each item in it. More examples will be shown later.
while loops run until a certain stopping condition is satisfied.
In for loops, the structure is as follows:
for(i in 1:10){ #do something for ten times
#do something
}
It is also possible to specify a vector to 'loop' through:
student_names <- c("Mary","John","Peter","Berry")
for(i in student_names){
print(i)
#code to do task related to each student's name.
}
## [1] "Mary"
## [1] "John"
## [1] "Peter"
## [1] "Berry"
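Sometimes you need the position of each item as well as the item itself. One common way, sketched below, is to loop over the positions with seq_along() and index the vector inside the loop. The vector numbered is a made-up name used to collect the results.

```r
student_names <- c("Mary","John","Peter","Berry")

# Prepare an empty character vector of the right length to store results
numbered <- character(length(student_names))

for (i in seq_along(student_names)) {   # i takes the values 1, 2, 3, 4
  numbered[i] <- paste(i, student_names[i])
  print(numbered[i])
}
```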
Specify a vector of 1:100 and, using a for loop, compute the sum of all numbers in this range that are divisible by 3.
Hint 1: the remainder of a division can be found with the modulo operator %% in R, e.g. 4 %% 2 is 0, while 4 %% 3 is 1.
Hint 2: you can use a variable to keep track of the running sum.
numbers <- 1:100
running_sum <- 0
for(i in numbers){
running_sum <- running_sum + i #keeping track of the total sum.
}
print(running_sum)
## [1] 5050
Answer:
numbers <- 1:100
running_sum <- 0
for(i in numbers){
if(i %%3 ==0){
running_sum <- running_sum + i #keeping track of the total sum.
}
}
print(running_sum)
## [1] 1683
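As a cross-check, the same answer can be computed without a loop by combining the TRUE/FALSE indexing from the vectors section with sum(). This is just an alternative sketch, not a replacement for the loop exercise.

```r
numbers <- 1:100

# TRUE for every number whose remainder after dividing by 3 is zero
divisible_by_three <- numbers %% 3 == 0

# Keep only those numbers and add them up
vectorised_sum <- sum(numbers[divisible_by_three])
print(vectorised_sum)
```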
While loops are generally used when you want to achieve a task but are uncertain how many steps it will take. The structure is as follows:
condition <- TRUE
while(condition){ #notice the brackets
#do some stuff
#if the condition is fulfilled, change it to FALSE and the while loop stops running.
}
As an example:
i=1 ; condition <- TRUE
while(condition){
print(i) ; i<- i+1
if(i == 3){
condition<-FALSE
}
}
## [1] 1
## [1] 2
Be careful with while loops as you might encounter infinite loops - the loop will run forever since the condition will never be false!
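One possible safeguard is to cap the number of iterations, so the loop stops even if the condition never turns FALSE. In this sketch, max_iterations and iterations are made-up names and 1000 is an arbitrary limit.

```r
i <- 1
condition <- TRUE
iterations <- 0
max_iterations <- 1000   # arbitrary safety limit, chosen for illustration

# The loop stops when the task is done OR when the safety limit is reached
while (condition && iterations < max_iterations) {
  iterations <- iterations + 1
  i <- i * 2             # keep doubling i
  if (i > 100) {
    condition <- FALSE   # task achieved, stop the loop
  }
}

print(i)
print(iterations)
```

If iterations ever equals max_iterations after the loop, that is a hint the stopping condition was never reached.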
Using a while loop, find out how many numbers you need to count through, starting from 1, before the running sum of the numbers divisible by three reaches at least 500.
Hint:
sum_required <- 500
condition <- TRUE
i <- 1 #start from 1
running_sum <- 0
while(condition){
#check if i is divisible by three
#if yes, running_sum <- running_sum+i
#check if running_sum exceeds sum_required
i <- i+1 #add 1 to "i" to start the next iteration.
}
Answer:
sum_required <- 500
condition <- TRUE
i <- 1 #start from 1
running_sum <- 0
while(condition){
if(i %% 3 == 0 ){
running_sum <- running_sum +i
}
if(running_sum >= sum_required){
condition <- FALSE
}
i<- i+1
}
print(i - 1) #i was increased once more after the last number was added
## [1] 54
Bonus Challenge: how many numbers divisible by 3 were actually added up to reach a sum of at least 500?
sum_required <- 500
condition <- TRUE
i <- 1 #start from 1
running_sum <- 0
counter <- 0
while(condition){
if(i %% 3 == 0 ){
running_sum <- running_sum +i
counter <- counter+1 #just add in a counter here to see when is this condition triggered.
}
if(running_sum >= sum_required){
condition <- FALSE
}
i<- i+1
}
print(counter)
## [1] 18
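You can double-check the loop answers without a loop by using cumsum(), which returns the running sum of a vector. This is only a verification sketch; multiples, running_sums, count_needed and last_number are made-up names.

```r
# All numbers up to 99 that are divisible by three
multiples <- seq(3, 99, by = 3)

# Running sum after each multiple: 3, 9, 18, ...
running_sums <- cumsum(multiples)

# Position of the first running sum that reaches at least 500
count_needed <- which(running_sums >= 500)[1]

# The multiple that was added last, and the sum reached at that point
last_number <- multiples[count_needed]
final_sum <- running_sums[count_needed]

print(count_needed)
print(last_number)
print(final_sum)
```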
There are two additional commands that are useful: break and next.
The break command stops the loop from running, while the next command skips the rest of the current iteration and moves on to the next one.
for( i in 1:4){
if(i == 3){break}
print(i)
}
## [1] 1
## [1] 2
for( i in 1:4){
if(i == 3){next}
print(i)
}
## [1] 1
## [1] 2
## [1] 4
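Both commands can be used in the same loop. The sketch below adds up odd numbers only (next skips the even ones) and stops as soon as the total goes above 10 (break). The variable total is a made-up name.

```r
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) {
    next                 # even number: skip to the next iteration
  }
  total <- total + i     # only odd numbers reach this line
  if (total > 10) {
    break                # stop the whole loop once the total exceeds 10
  }
}

print(total)
print(i)   # the value of i when the loop stopped
```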
R receives contributions from people all over the world. At the time of writing there are 8992 packages available on CRAN, not counting other libraries on GitHub.
You can browse the available packages on CRAN by date or by name.
In the next part we will be playing around with dataframes. Please run the following code in your console.
install.packages("dplyr")
install.packages("tidyr")
install.packages("packrat")
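If you are not sure whether some of these packages are already installed, a sketch like the one below installs only the missing ones. It assumes you have an internet connection, and it only installs when run in an interactive session (such as the RStudio console). The names packages, is_installed and missing_packages are made up for illustration.

```r
packages <- c("dplyr", "tidyr", "packrat")

# requireNamespace() returns TRUE if a package is installed and can be loaded
is_installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
missing_packages <- packages[!is_installed]
print(missing_packages)

# Install only what is missing, and only when typing at the console
if (length(missing_packages) > 0 && interactive()) {
  install.packages(missing_packages)
}
```

After installing, a package is loaded into your session with library(), for example library(dplyr).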