Done by: Low Yi Xiang
A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s, b.
Example:
name <- c("mary","john","tom")
grades <- c(89,60,74)
df_student_grades <- data.frame(name,grades)
print(df_student_grades)
## name grades
## 1 mary 89
## 2 john 60
## 3 tom 74
This tutorial will teach you the basics of manuiplating csv (excel) files within R to perform analysis, joining data, etc.
This tutorial also assumes you have installed Rstudio and know how to install packages.
Dataframes can be index by their row and columns using a matrix like structure.
For example to extract the first row and first column value:
df_student_grades[1,1]
## [1] mary
## Levels: john mary tom
If only the first entry (row) is of interest:
df_student_grades[1,]
## name grades
## 1 mary 89
Or if only the grades are of interest:
df_student_grades[,2] #the second column represent the grades
## [1] 89 60 74
To extract the column names of a dataframe:
names(df_student_grades)
## [1] "name" "grades"
We can also use column names instead of numerical value to extract a column. Note* this is also true for rows.
df_student_grades[,"grades"]
## [1] 89 60 74
We can edit a specific cell in the dataframe:
df_student_grades[3,"grades"] <- 80 #edit grades for the 3rd row
df_student_grades[3,"grades"]
## [1] 80
Please install the following libraries by running these codes:
install.packages(c("dplyr","tidyr","datasets","ggplot2"))
Please check that you can install & run these libraries.
library(dplyr);
library(tidyr);
library(datasets);
library(ggplot2)
In R, you can set the working directory you wish to find the files. Please create a folder named R_dataframe
in a location you prefer.
Then, set your working directory as follows: (slightly different for different OS users)
setwd('/Users/lowyix/Desktop/R_dataframe/')
Alternatively, you can nagivate to the help bar and click Session - Set Working Directory - Choose directory
and then select the folder R_dataframe
To get your current working directory,
getwd()
## [1] "/Users/lowyix/Desktop/R_dataframe/"
To List all the files in the working directory, you can:
list.files()
## character(0)
lets load a dataset from one of the libraries; diamonds
diamonds_df <- diamonds #if the code fails to work, check that you have loaded ggplot2
To write out an excel (csv) file,
write.csv(diamonds_df,"diamonds_data.csv", row.names = FALSE)
list.files()
## [1] "diamonds_data.csv"
More information can be found with:
?write.csv
diamonds_df <- read.csv("diamonds_data.csv")
head(diamonds_df) #more on this later
## Source: local data frame [6 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
In R, you can manage your files too, such as deleting the files in the directory
file.remove("diamonds_data.csv")
There are other packages avaliable on CRAN to read/write excel tables and other formats such as xls or xlsx.
However generally CSV is the prefered format as it is machine readable. (Remember that formats such as XLSX allows merge cells, multiple sheets per file etc)
Example of such packages are
In R Dataframes, there are mainly 3 types of columns, Factors, Characeers and Integer/Numeric.
You can convert one class to another with commands like
as.integer(x)
as.numeric(x)
as.character(x)
as.factor(x)
Integer values means whole numbers only. For instance,
as.integer(1.6)
## [1] 1
Numeric values means "floats" or "decimals". You can convert "strings" or integers to numerics
string_one <- "1.6"
as.numeric(string_one)
## [1] 1.6
as.integer(string_one)
## [1] 1
Character types are essentially "strings". For instance:
as.character(c(1,2,3))
## [1] "1" "2" "3"
Factors are essentially categories or "Groups" within the data. There are particularly useful when you want to do factor level operations - generally used when plotting graphs.
They are also useful when you want to restrict operations within a Dataframe Factor Column.
as.factor(c(1,2,3))
## [1] 1 2 3
## Levels: 1 2 3
In the dataframe diamonds_df, the "str" command runs a quick summary and we can see the "cut","color" and "clarity" are factor columns.
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
The different types of cut of a diamonds are:
diamonds_df %>% select(cut) %>% distinct()
## Source: local data frame [5 x 1]
##
## cut
## (fctr)
## 1 Ideal
## 2 Premium
## 3 Good
## 4 Very Good
## 5 Fair
If one tries to edit / add data that is not within the 5 categories, an error will occur. Observe:
diamonds_df[1,"cut"] #this should return a value "Ideal"
## Source: local data frame [1 x 1]
##
## cut
## (fctr)
## 1 Ideal
Suppose the value is changed to another category within the list, such as Premium,
diamonds_df[1,"cut"] <- "Premium" #this does not produce any error
The "NA" value will be generated instead:
diamonds_df[1,"cut"]
## Source: local data frame [1 x 1]
##
## cut
## (fctr)
## 1 Premium
If the value is not within the list:
diamonds_df[1,"cut"] <- "testing" #this does not produce any error
## Warning in `[<-.factor`(`*tmp*`, iseq, value = "testing"): invalid factor
## level, NA generated
diamonds_df[1,"cut"]
## Source: local data frame [1 x 1]
##
## cut
## (fctr)
## 1 NA
This is particularly useful when you want to make sure certain columns only accept certain values.
Lastly, reset the diamonds_df to its orginial value with the following code.
diamonds_df <- diamonds
Just as an side note, it is possible to change the "levels" of the columns.
The default settings in R is such that they will arrange the levels according to numeric/alphabetical order. Sometimes this may not be ideal (especially in plots), this can be corrected by the following steps:
new_diamonds_df<- ggplot2::diamonds
levels(new_diamonds_df$cut)
## [1] "Fair" "Good" "Very Good" "Premium" "Ideal"
In the case of diamonds, suppose we want the order Premium, Very Good, Good, Ideal, and Fair instead, we define a new numeric reflecting the previous levels shown above.
Note* - If you do not understand R indexing it is ok, the point of this slide is to show levels manipulation.
new_ordering <- c(4,3,2,5,1)
levels(new_diamonds_df$cut)[new_ordering] #observe the new changes
## [1] "Premium" "Very Good" "Good" "Ideal" "Fair"
After defining the numeric vector, replace the orignial column with the new order:
new_diamonds_df$cut <- factor(new_diamonds_df$cut,
levels = levels(new_diamonds_df$cut)[new_ordering])
levels(new_diamonds_df$cut)
## [1] "Premium" "Very Good" "Good" "Ideal" "Fair"
We can see that the levels are readjusted to the levels we desire!
DPLYR is a package written by Hadley Wickham who is currently Chief Scientist at RStudio.
It is a fast, consistent tool for working with data frame like objects, both in memory and out of memory.
Note* - The Cheatsheet covers a significant portion of DPLYR and a short description of each function.
Lets look at the dataset again:
head(diamonds_df)
## Source: local data frame [6 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
As well as the names:
names(diamonds_df)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
Suppose we want to look at diamonds with color 'D':
diamonds_df_color_D <- filter(diamonds_df,color == 'D')
head(diamonds_df_color_D,3)
## Source: local data frame [3 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 0.23 Very Good D VS2 60.5 61 357 3.96 3.97 2.40
## 2 0.23 Very Good D VS1 61.9 58 402 3.92 3.96 2.44
## 3 0.26 Very Good D VS2 60.8 59 403 4.13 4.16 2.52
Suppose we want to look at diamonds with a larger carat, say size 2 and above:
diamonds_df_carat_abv2 <- filter(diamonds_df,carat >= 2)
head(diamonds_df_carat_abv2,3)
## Source: local data frame [3 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 2.00 Premium J I1 61.5 59 5051 8.11 8.06 4.97
## 2 2.06 Premium J I1 61.2 58 5203 8.10 8.07 4.95
## 3 2.14 Fair J I1 69.4 57 5405 7.74 7.70 5.36
You can combine 'and' conditions or 'or' conditions as follows:
Example 'and':
diamonds_df_color_D_and_carat_abv2 <- filter(diamonds_df, color == "D" & carat >= 2)
head(diamonds_df_color_D_and_carat_abv2,3)
## Source: local data frame [3 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 2.00 Good D SI2 63.8 57 10528 7.93 7.88 5.04
## 2 2.02 Premium D SI2 61.1 57 12108 8.12 8.05 4.94
## 3 2.00 Premium D SI2 59.3 62 12576 8.12 8.06 4.80
Example 'or'
diamonds_df_color_D_or_carat_abv2 <- filter(diamonds_df, color == "D" & carat >= 2)
head(diamonds_df_color_D_or_carat_abv2,3)
## Source: local data frame [3 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 2.00 Good D SI2 63.8 57 10528 7.93 7.88 5.04
## 2 2.02 Premium D SI2 61.1 57 12108 8.12 8.05 4.94
## 3 2.00 Premium D SI2 59.3 62 12576 8.12 8.06 4.80
You can select Columns with the "select" command. For example:
diamond_df_selected <- select(diamonds_df,carat,cut)
head(diamond_df_selected,3)
## Source: local data frame [3 x 2]
##
## carat cut
## (dbl) (fctr)
## 1 0.23 Ideal
## 2 0.21 Premium
## 3 0.23 Good
There are other kind of commands that can be used in conjunction with Dplyr, for example "contains":
diamonds_df_contains_ri <- select(diamonds_df, contains("ri"))
head(diamonds_df_contains_ri,3)
## Source: local data frame [3 x 2]
##
## clarity price
## (fctr) (int)
## 1 SI2 326
## 2 SI1 326
## 3 VS1 327
Example 'starts_with':
diamonds_df_start_with_c <- select(diamonds_df, starts_with("c"))
head(diamonds_df_start_with_c,3)
## Source: local data frame [3 x 4]
##
## carat cut color clarity
## (dbl) (fctr) (fctr) (fctr)
## 1 0.23 Ideal E SI2
## 2 0.21 Premium E SI1
## 3 0.23 Good E VS1
Example 'ends_with':
diamonds_df_ends_with_t <- select(diamonds_df, ends_with("t"))
head(diamonds_df_ends_with_t,3)
## Source: local data frame [3 x 2]
##
## carat cut
## (dbl) (fctr)
## 1 0.23 Ideal
## 2 0.21 Premium
## 3 0.23 Good
More information can be found in the Cheatsheet
To understand piping, lets first understand a mock example by creating a function.
multiply_each_other <- function(x,y){
return(x*y)
}
Then, we can call this function as follows:
multiply_each_other(2,4)
## [1] 8
With Piping, we can perform a similar operation as follows:
2 %>% multiply_each_other(4)
## [1] 8
In other words, when you pipe an object to another function, it "fills" up the first argument of the new function.
Observe the two codes we used earlier:
df1 <- select(diamonds_df,carat)
head(df1,3)
## Source: local data frame [3 x 1]
##
## carat
## (dbl)
## 1 0.23
## 2 0.21
## 3 0.23
vs
df2 <- diamonds_df %>% select(carat)
head(df2,3)
## Source: local data frame [3 x 1]
##
## carat
## (dbl)
## 1 0.23
## 2 0.21
## 3 0.23
The advantage of piping means it allows us to do chaining. To explain this concept, lets assume that you have to select some columns and filter by certain conditions. You would either:
df_select <- select(diamonds_df, starts_with("c"))
df_select_filter <- filter(df_select, carat >=2)
head(df_select_filter,3)
## Source: local data frame [3 x 4]
##
## carat cut color clarity
## (dbl) (fctr) (fctr) (fctr)
## 1 2.00 Premium J I1
## 2 2.06 Premium J I1
## 3 2.14 Fair J I1
OR:
df_select_filter2 <- filter(select(diamonds_df, starts_with("c")),carat>=2)
head(df_select_filter2,3)
## Source: local data frame [3 x 4]
##
## carat cut color clarity
## (dbl) (fctr) (fctr) (fctr)
## 1 2.00 Premium J I1
## 2 2.06 Premium J I1
## 3 2.14 Fair J I1
Notice the code can be a little hard to read - What if you have multiple functions to run?
This is where Chaining is very useful by combining piping,
Observe:
df4 <- diamonds_df %>%
select(starts_with("c")) %>%
filter(carat >= 2)
head(df4,3)
## Source: local data frame [3 x 4]
##
## carat cut color clarity
## (dbl) (fctr) (fctr) (fctr)
## 1 2.00 Premium J I1
## 2 2.06 Premium J I1
## 3 2.14 Fair J I1
In the subsequent examples, you will see more of piping in action.
Additional Note* - Piping exists in many other packages, such as tidyr, ggvis, leaflet and many other libraries.
To sort the data by a particular column,
diamonds_df %>% arrange(carat) %>% head(3)
## Source: local data frame [3 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
## 2 0.2 Premium E VS2 59.8 62 367 3.79 3.77 2.26
## 3 0.2 Premium E VS2 59.0 60 367 3.81 3.78 2.24
You can arrange by two columns as well, for example by carat and clarity:
diamonds_df %>% arrange(carat,clarity) %>% head(3)
## Source: local data frame [3 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
## 2 0.2 Premium E VS2 59.8 62 367 3.79 3.77 2.26
## 3 0.2 Premium E VS2 59.0 60 367 3.81 3.78 2.24
To sort in a decreasing order, use the "desc" function:
diamonds_df %>% arrange(carat,desc(clarity)) %>% head(3)
## Source: local data frame [3 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 0.2 Premium E VS2 59.8 62 367 3.79 3.77 2.26
## 2 0.2 Premium E VS2 59.0 60 367 3.81 3.78 2.24
## 3 0.2 Premium E VS2 61.1 59 367 3.81 3.78 2.32
Another example:
diamonds_df %>% arrange(desc(carat),desc(price)) %>% head(3)
## Source: local data frame [3 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 5.01 Fair J I1 65.5 59 18018 10.74 10.54 6.98
## 2 4.50 Fair J I1 65.8 58 18531 10.23 10.16 6.72
## 3 4.13 Fair H I1 64.8 61 17329 10.00 9.85 6.43
Creating new columns can be done with the "mutate" or "transmute"" columns. The difference between transmute and mutate is transmute drops its orginial columns.
For example, lets compute the price per carat:
diamonds_df %>% mutate(price_per_carat = price/carat) %>%
select(starts_with('c'),price_per_carat) %>% head(3)
## Source: local data frame [3 x 5]
##
## carat cut color clarity price_per_carat
## (dbl) (fctr) (fctr) (fctr) (dbl)
## 1 0.23 Ideal E SI2 1417.391
## 2 0.21 Premium E SI1 1552.381
## 3 0.23 Good E VS1 1421.739
With transmute:
diamonds_df %>% transmute(price_per_carat = price/carat) %>% head(3)
## Source: local data frame [3 x 1]
##
## price_per_carat
## (dbl)
## 1 1417.391
## 2 1552.381
## 3 1421.739
You can combine ifelse / functions along with mutate functions.
For example you want to create a new column to differentiate large diamonds (abv 1 carat) and small diamonds.
set.seed(5) #set seed allows everyone to have the same results
diamonds_df %>%
mutate(big_or_small = ifelse(carat >=1, "big","small")) %>%
select(starts_with('c'),big_or_small) %>%
sample_n(5) %>%
head(5)
## Source: local data frame [5 x 5]
##
## carat cut color clarity big_or_small
## (dbl) (fctr) (fctr) (fctr) (chr)
## 1 1.00 Good H VS2 big
## 2 0.41 Premium G VS1 small
## 3 0.70 Good E SI1 small
## 4 1.01 Very Good F VS2 big
## 5 0.90 Very Good D SI1 small
You can use functions as well:
carat_big_small <- function(x){
temp = ifelse(x>=1,"big","small")
return(temp)
}
set.seed(5) #set seed allows everyone to have the same results
diamonds_df %>%
mutate(big_or_small = carat_big_small(carat)) %>%
select(starts_with('c'),big_or_small) %>%
sample_n(5) %>%
head(5)
## Source: local data frame [5 x 5]
##
## carat cut color clarity big_or_small
## (dbl) (fctr) (fctr) (fctr) (chr)
## 1 1.00 Good H VS2 big
## 2 0.41 Premium G VS1 small
## 3 0.70 Good E SI1 small
## 4 1.01 Very Good F VS2 big
## 5 0.90 Very Good D SI1 small
There are numerous summarize functions, lets look at some examples:
diamonds_df %>% summarise(mean(table))
## Source: local data frame [1 x 1]
##
## mean(table)
## (dbl)
## 1 57.45718
You can include multiple functions as follows:
diamonds_df %>% summarise(mean = mean(table),
median = median(table),
unique_carat = n_distinct(carat),
max_carat = max(carat))
## Source: local data frame [1 x 4]
##
## mean median unique_carat max_carat
## (dbl) (dbl) (int) (dbl)
## 1 57.45718 57 273 5.01
group_by should be a familar operation for those who use SQL, let's see an example first:
We want to know the relationship between the cut and the price.
diamonds_df %>%
group_by(cut) %>%
summarise(mean(price))
## Source: local data frame [5 x 2]
##
## cut mean(price)
## (fctr) (dbl)
## 1 Fair 4358.758
## 2 Good 3928.864
## 3 Very Good 3981.760
## 4 Premium 4584.258
## 5 Ideal 3457.542
What about the color?
diamonds_df %>%
group_by(color) %>%
summarise(mean(price))
## Source: local data frame [7 x 2]
##
## color mean(price)
## (fctr) (dbl)
## 1 D 3169.954
## 2 E 3076.752
## 3 F 3724.886
## 4 G 3999.136
## 5 H 4486.669
## 6 I 5091.875
## 7 J 5323.818
Suppose you want to find out the max price within each group
diamonds_df %>%
group_by(color, cut) %>%
summarise(max_price = max(price)) %>%
arrange(max_price) %>%
head(10)
## Source: local data frame [10 x 3]
## Groups: color [2]
##
## color cut max_price
## (fctr) (fctr) (int)
## 1 D Fair 16386
## 2 D Good 18468
## 3 D Very Good 18542
## 4 D Premium 18575
## 5 D Ideal 18693
## 6 E Fair 15584
## 7 E Good 18236
## 8 E Premium 18477
## 9 E Ideal 18729
## 10 E Very Good 18731
We can see that (E, Very Good) has a higher value than (D, Fair). This is because when using the group_by operation, it finds out the max price within each group.
Sometimes you will need to "ungroup()" your data to remove the grouping, lets see an example:
Question : find out the highest average price within each group and then list the top 5 groups with the highest everage price.
df5 <- diamonds_df %>%
group_by(color, cut) %>%
summarise(mean_price = mean(price)) %>%
arrange(desc(mean_price))
df5 %>% head(6)
## Source: local data frame [6 x 3]
## Groups: color [2]
##
## color cut mean_price
## (fctr) (fctr) (dbl)
## 1 D Fair 4291.061
## 2 D Premium 3631.293
## 3 D Very Good 3470.467
## 4 D Good 3405.382
## 5 D Ideal 2629.095
## 6 E Fair 3682.312
We can see that (E,fair) has a higher price than (D,Ideal).
We can fix this with the "ungroup" option:
diamonds_df %>%
group_by(color, cut) %>%
summarise(mean_price = mean(price)) %>%
ungroup() %>%
arrange(desc(mean_price)) %>%
head(10)
## Source: local data frame [10 x 3]
##
## color cut mean_price
## (fctr) (fctr) (dbl)
## 1 J Premium 6294.592
## 2 I Premium 5946.181
## 3 I Very Good 5255.880
## 4 H Premium 5216.707
## 5 H Fair 5135.683
## 6 J Very Good 5103.513
## 7 I Good 5078.533
## 8 J Fair 4975.655
## 9 J Ideal 4918.186
## 10 I Fair 4685.446
Now the highest 10 mean prices belong to J, I, H cuts. One can conclude that color affects the price more than the cut.
Question:Find out which cut has the highest price within each color.
Answer:
diamonds_df %>%
group_by(color,cut) %>%
summarise(mean_price = mean(price), count = n()) %>%
ungroup() %>%
group_by(color)%>%
filter(mean_price == max(mean_price))
## Source: local data frame [7 x 4]
## Groups: color [7]
##
## color cut mean_price count
## (fctr) (fctr) (dbl) (int)
## 1 D Fair 4291.061 163
## 2 E Fair 3682.312 224
## 3 F Premium 4324.890 2331
## 4 G Premium 4500.742 2924
## 5 H Premium 5216.707 2360
## 6 I Premium 5946.181 1428
## 7 J Premium 6294.592 808
There are two ways to join data set, either by joining, or binding. For joining data there is left_join, right_join, inner_join, full_join, semi_join, anti_join.
Define the following variables:
data_A <- data.frame(names = c("amber","bryan","diana"), grades = c(84,93,56))
data_B <- data.frame(names = c("amber","bryan","charlie"), gender = c("F","M","M"))
left_join : join matching rows from B to A
data_A %>% left_join(data_B, by = "names")
## Warning in left_join_impl(x, y, by$x, by$y): joining factors with different
## levels, coercing to character vector
## names grades gender
## 1 amber 84 F
## 2 bryan 93 M
## 3 diana 56 <NA>
Recall that dealing with factor with differnet levels will result in a warning message. To overcome this, lets change them to character levels:
data_A <- data_A %>% mutate(names = as.character(names))
data_B <- data_B %>% mutate(names = as.character(names))
left_join: join matching rows from b to a
data_A %>% left_join(data_B, by = "names")
## names grades gender
## 1 amber 84 F
## 2 bryan 93 M
## 3 diana 56 <NA>
right_join:Join matching rows from a to b.
data_A %>% right_join(data_B, by = "names")
## names grades gender
## 1 amber 84 F
## 2 bryan 93 M
## 3 charlie NA M
inner_join:Join data. Retain only rows in both sets.
data_A %>% inner_join(data_B, by = "names")
## names grades gender
## 1 amber 84 F
## 2 bryan 93 M
More in the next slide:
full_join: Join data. Retain all values, all rows.
data_A %>% full_join(data_B, by = "names")
## names grades gender
## 1 amber 84 F
## 2 bryan 93 M
## 3 diana 56 <NA>
## 4 charlie NA M
semi_join: All rows in a that have a match in b.
data_A %>% semi_join(data_B, by = "names")
## names grades
## 1 amber 84
## 2 bryan 93
anti_join: All rows in a that do not have a match in b.
data_A %>% anti_join(data_B, by = "names")
## names grades
## 1 diana 56
You can bind datasets together as well:
bind_rows:
data_C <- data.frame(names = c("cheryl","elieen","faith"), grades = c(40,60,100)) %>%
mutate(names = as.character(names))
data_A %>% bind_rows(data_C)
## Source: local data frame [6 x 2]
##
## names grades
## (chr) (dbl)
## 1 amber 84
## 2 bryan 93
## 3 diana 56
## 4 cheryl 40
## 5 elieen 60
## 6 faith 100
bind_cols:
data_A %>% bind_cols(data_C)
## Source: local data frame [3 x 4]
##
## names grades names grades
## (chr) (dbl) (chr) (dbl)
## 1 amber 84 cheryl 40
## 2 bryan 93 elieen 60
## 3 diana 56 faith 100
Window Functions are essentially functions that performs specific analysis.
more details can be found here
Suppose you want to find your cumulative sales each month:
data_sales <- data.frame(month = 1:12, sales = rnorm(12,10000,2000))
data_sales %>% mutate(cumulative_sales = cumsum(sales))
## month sales cumulative_sales
## 1 1 11054.889 11054.89
## 2 2 11740.625 22795.51
## 3 3 7551.755 30347.27
## 4 4 9952.436 40299.70
## 5 5 10297.745 50597.45
## 6 6 8330.116 58927.57
## 7 7 12430.554 71358.12
## 8 8 12006.911 83365.03
## 9 9 11169.850 94534.88
## 10 10 8493.949 103028.83
## 11 11 9899.260 112928.09
## 12 12 13649.063 126577.15
There are two types of data, "wide" and "long"/"narrow" format.
The following shows a "wide" format.
names | weight | height |
---|---|---|
Amber | 60 | 152 |
Bryan | 90 | 184 |
Charlie | 80 | 178 |
The following shows a 'long' format:
names | variable | measurements |
---|---|---|
Amber | weight | 60 |
Bryan | weight | 90 |
Charlie | weight | 80 |
Amber | height | 152 |
Bryan | height | 184 |
Charlie | height | 178 |
More information can be found here. There are many uses cases where you need wide over long data.
For examples, refer here
Lets look at a more complicated example with the following mock dataset of 2 students over 2 years:
## names grades_Q1_2015 grades_Q2_2015 grades_Q3_2015 grades_Q4_2015
## 1 amber 60 54 90 87
## 2 bryan 80 83 74 76
## grades_Q1_2016 grades_Q2_2016 grades_Q3_2016 grades_Q4_2016
## 1 60 58 40 93
## 2 79 87 85 81
Suppose i want to find out the average score of each year, i have to write the following code:
data_E %>% transmute(Y2015 = 0.25*(grades_Q1_2015+grades_Q2_2015+grades_Q3_2015+grades_Q4_2015),
Y2016 = 0.25*(grades_Q1_2016+grades_Q2_2016+grades_Q3_2016+grades_Q4_2016))
## Y2015 Y2016
## 1 72.75 62.75
## 2 78.25 83.00
This requires lots of coding + sometimes it may not be just 8 columns in this example ; what if the data for sales is over a 10 years period and a monthly breakdown is required?
This is when converting wide to long data can be extremely helpful for further analysis. Usually as a rule of thumb, long format is for analysis while wide format is for viewing consumption (i.e on a presentation).
For the same analysis in the previous slide:
data_E %>%
gather(key = variable, value = score,
grades_Q1_2015,grades_Q2_2015,grades_Q3_2015,grades_Q4_2015,
grades_Q1_2016,grades_Q2_2016,grades_Q3_2016,grades_Q4_2016) %>%
separate(variable, c("grades","quarter","year"), sep = "_") %>% #More information in next slides
group_by(names,year) %>%
summarise(mean(score))
## Source: local data frame [4 x 3]
## Groups: names [?]
##
## names year mean(score)
## (fctr) (chr) (dbl)
## 1 amber 2015 72.75
## 2 amber 2016 62.75
## 3 bryan 2015 78.25
## 4 bryan 2016 83.00
Lets explain the first part how "gather" works
data_E %>%
gather(key = variable, value = score,
grades_Q1_2015,grades_Q2_2015,grades_Q3_2015,grades_Q4_2015,
grades_Q1_2016,grades_Q2_2016,grades_Q3_2016,grades_Q4_2016) %>%
head(2)
## names variable score
## 1 amber grades_Q1_2015 60
## 2 bryan grades_Q1_2015 80
OR:
data_E %>%
gather(key = variable, value = score,
-names) %>% #notice the hypen sign
head(2)
## names variable score
## 1 amber grades_Q1_2015 60
## 2 bryan grades_Q1_2015 80
In other words, you specify the new key column name, the new value column name, followed by either the key columns (with the minus sign) or the value columns.
Heres another example with the iris dataset which is found in the documentation of gather:
By specifying the "value" columns:
mini_iris <- iris[c(1, 51, 101), ]
gather(mini_iris, key = flower_att, value = measurement,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
## Species flower_att measurement
## 1 setosa Sepal.Length 5.1
## 2 versicolor Sepal.Length 7.0
## 3 virginica Sepal.Length 6.3
## 4 setosa Sepal.Width 3.5
## 5 versicolor Sepal.Width 3.2
## 6 virginica Sepal.Width 3.3
## 7 setosa Petal.Length 1.4
## 8 versicolor Petal.Length 4.7
## 9 virginica Petal.Length 6.0
## 10 setosa Petal.Width 0.2
## 11 versicolor Petal.Width 1.4
## 12 virginica Petal.Width 2.5
By specifying the "key" columns:
gather(mini_iris, key = flower_att, value = measurement, -Species)
## Species flower_att measurement
## 1 setosa Sepal.Length 5.1
## 2 versicolor Sepal.Length 7.0
## 3 virginica Sepal.Length 6.3
## 4 setosa Sepal.Width 3.5
## 5 versicolor Sepal.Width 3.2
## 6 virginica Sepal.Width 3.3
## 7 setosa Petal.Length 1.4
## 8 versicolor Petal.Length 4.7
## 9 virginica Petal.Length 6.0
## 10 setosa Petal.Width 0.2
## 11 versicolor Petal.Width 1.4
## 12 virginica Petal.Width 2.5
Usually the column names in a database or an excel sheet is uniformly labeled, such as the example provided:
names(data_E)
## [1] "names" "grades_Q1_2015" "grades_Q2_2015" "grades_Q3_2015"
## [5] "grades_Q4_2015" "grades_Q1_2016" "grades_Q2_2016" "grades_Q3_2016"
## [9] "grades_Q4_2016"
After the gather step, we arrive at and we want to split the names to 3 different columns:
data_E %>%
gather(key = variable, value = score,
-names) %>% #notice the hypen sign
head(2)
## names variable score
## 1 amber grades_Q1_2015 60
## 2 bryan grades_Q1_2015 80
We can achieve this by the "separate" function:
The way to use it is separate(col_name,new_col_names, separator)
data_F <- data_E %>%
gather(key = variable, value = score,
-names) %>% #notice the hypen sign
separate(variable, c("grades","quarter","year"), sep = "_")
data_F %>% head(2)
## names grades quarter year score
## 1 amber grades Q1 2015 60
## 2 bryan grades Q1 2015 80
you can type ?separate in your console to find out more! the opposite of this is unite.
Example unite:
unite(new_col_name,col_to_merge(s),separator)
data_G <- data_F %>% unite(variable,grades,quarter,year,sep="_")
data_G %>% head(5)
## names variable score
## 1 amber grades_Q1_2015 60
## 2 bryan grades_Q1_2015 80
## 3 amber grades_Q2_2015 54
## 4 bryan grades_Q2_2015 83
## 5 amber grades_Q3_2015 90
you can type ?unite in your console to find out more!
The opposite of gather is spread: gather(variable_column,value_column)
data_G %>% spread(key = variable, value = score)
## names grades_Q1_2015 grades_Q1_2016 grades_Q2_2015 grades_Q2_2016
## 1 amber 60 60 54 58
## 2 bryan 80 79 83 87
## grades_Q3_2015 grades_Q3_2016 grades_Q4_2015 grades_Q4_2016
## 1 90 40 87 93
## 2 74 85 76 81