6 Data Manipulation

6.1 Content

While we have already discussed some methods for data manipulation, such as indexing, subsetting and modifying data frames, the majority of R users approach this task using a dedicated collection of packages called tidyverse, introduced by Hadley Wickham, a statistician from New Zealand. While it may seem like this chapter covers tools for performing tasks that you are already familiar with, tidyverse follows a different philosophy than traditional R and has many advantages, including better code readability and efficiency. As before, to install and load the package, simply run the code below. Remember that you only need to call install.packages once, to download all the required files from CRAN. library needs to be called every time you start a new R session to attach the library functions to your working environment.

install.packages("tidyverse")
library(tidyverse)

Reading the data

In this chapter, we will use the same dataset we’ve used in the exploratory analysis chapter, which presents individual-level information about a sample of students. However, tidyverse offers an improved set of functions for reading in data, as part of the readr subpackage. They work fairly similarly to read.csv introduced before, but have some advantages (for example, they read text columns as character vectors rather than factors, without having to include the stringsAsFactors argument discussed in chapter 3). All the readr reading functions start with read_ and are used for different file types. The most common one is read_csv (which you can use in exactly the same way as read.csv), however in this case we use read_delim, which allows us to read a file with any delimiter. In the case of the math dataset, the values within each row are separated by a semicolon (which you can check by opening the file in a text editor). Because of that, we specify the semicolon via the second argument, delim, as in the example below:

math <- read_delim("data/student/student-mat-data.csv", delim = ";")

As you can see, running the function returns a message showing the specification of each column that was read - col_character() refers to character columns, while col_double() means numeric columns. You can also force a column to be read as a specific type. For example, you may want a text column, such as sex, to be read as a factor. As mentioned above, the default setting of the read_ functions is to read all numbers as numeric variables and all text as character variables.

class(math$sex)
[1] "character"

To read the sex column as a factor, we can read the data again, this time specifying the col_types argument. The col_types argument takes a list as input, in which we specify the type of selected columns - for example, sex = col_factor() tells readr to read sex as a factor:

math <- read_delim("data/student/student-mat-data.csv", delim = ";",
                 col_types = list(sex = col_factor(levels = c("M","F"))))
class(math$sex)
[1] "factor"

You may wonder why not simply use the as.factor or factor function:

math$sex <- factor(math$sex, levels = c("M","F"))

This is exactly equivalent; however, the previous way of doing it is more explicit and concise. Anyone (including you) who reads your analysis will immediately know which columns you have specified to be which type.

The final facet of the readr package functions is that they load the data as a slightly different data type than the usual read.csv:

math <- read.csv("data/student/student-mat-data.csv", sep = ";")
class(math)
[1] "data.frame"
math <- read_delim("data/student/student-mat-data.csv", delim = ";",
                 col_types = list(sex = col_factor(levels = c("M","F"))))
class(math)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
is_tibble(math)
[1] TRUE
is.data.frame(math)
[1] TRUE

The data comes loaded as a tibble (tbl for short). Tibbles are a special kind of data frame implemented by the tidyverse package - the only thing you need to know about them for now is that everything you have learned about data frames so far applies to tibbles. They offer some improvements, which will not be discussed here.
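As a quick illustration (a minimal sketch using a toy data frame rather than the math data), you can convert between the two types with as_tibble and as.data.frame:

```r
# Convert a plain data frame to a tibble; a tibble is still a data
# frame, just with extra classes and nicer printing behaviour.
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

class(tb)          # "tbl_df" "tbl" "data.frame"
is_tibble(tb)      # TRUE
is.data.frame(tb)  # TRUE - everything you know about data frames applies
```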

The pipe operator

Perhaps the most important innovation offered by the tidyverse package is the so-called pipe operator %>%. Its use may feel a bit quirky at first, but it is extremely useful and is widely used by most modern R users.

As you have learned so far, R evaluates each function from the inside out. For example, we can get the number of missing values in a data frame by running:

sum(is.na(math))
[1] 10

This essentially performs two steps - first, runs is.na on the math data frame, which returns a table filled with logical values, FALSE when a given entry is not missing and TRUE when it is. Then sum takes this table as input and adds up the values in it (treating FALSE as 0 and TRUE as 1). In many cases, such statements can get long, difficult to read and error-prone, especially when keyword arguments are specified.

The same operation may be done using the pipe operator. In this case, rather than evaluating the sequence of functions from within, they are evaluated left to right. The %>% operator can be understood as a way of passing the output of the thing on the left to the thing on the right as the first argument:

math %>% is.na() %>% sum()
[1] 10

In this example, the math data frame is passed to the is.na function, and then the output is passed to the sum function, which returns exactly the same result. As in the case of the regular call, you may store the output in a variable:

number_missing <- math %>% is.na() %>% sum()
number_missing
[1] 10

Before continuing we drop the missing observations:

math <- math[complete.cases(math), ]

While this may feel slightly unintuitive in this case, it comes in very handy when performing long sequences of operations on data, as we will see in the following sections.
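Incidentally, the missing-value removal above also has a tidyverse equivalent that pipes nicely: tidyr's drop_na function. A sketch, applied to the math tibble loaded earlier:

```r
# drop_na() removes rows containing missing values; with no arguments
# it checks every column, or you can restrict it to selected columns.
math <- math %>% drop_na()   # equivalent to math[complete.cases(math), ]
# math %>% drop_na(G3)       # drop only rows where G3 is missing
```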

Dataset manipulation

Three key functions are most commonly used for dataset manipulation in the tidyverse package: mutate, select and filter, coming from the dplyr sub-package. They are used as follows:

  • mutate is used to modify and create columns in data frames
  • select is used to select columns by name
  • filter is used to select rows given a set of logical values

All three functions take the data frame as the first argument. For example, we can create a new column, average, by adding together grades for all three years and dividing them by 3:

math <- mutate(math, average = (G1 + G2 + G3)/3)

Again, this is equivalent to:

math$average <- (math$G1 + math$G2 + math$G3)/3

Both operations create a variable called average by adding G1, G2 and G3 together and dividing them by three. Note, however, that in the case of mutate there’s no need to use the $ operator, as you pass the math data frame as the first argument, so the function knows that the names G1, G2 and G3 refer to columns of this particular data frame. The mutate function returns the same data frame, but with the additional column average. Most commonly, it is used with the pipe operator:

math <- math %>% mutate(average = (G1 + G2 + G3)/3)

In this case, through the %>% we pass the math data frame to mutate, which creates a new column and returns the updated data.frame, which is stored in math.

filter allows you to filter rows of the data frame, which is an operation similar to indexing. For example, we can get students with an average grade higher than 18 by:

math %>% filter(average > 18)
# A tibble: 6 × 34
  school sex     age address famsize Pstatus  Medu  Fedu Mjob     Fjob     reason   guardian traveltime studytime
  <chr>  <fct> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr>    <chr>    <chr>    <chr>         <dbl>     <dbl>
1 GP     M        15 U       GT3     T           4     4 services teacher  course   father            1         2
2 GP     M        16 U       GT3     T           4     3 health   services reputat… mother            1         4
3 GP     M        15 U       LE3     A           4     4 teacher  teacher  course   mother            1         1
4 GP     M        15 U       LE3     T           4     2 teacher  other    course   mother            1         1
5 GP     F        18 U       GT3     T           2     2 at_home  at_home  other    mother            1         3
6 MS     F        18 R       LE3     T           4     4 other    other    reputat… mother            2         3
# ℹ 20 more variables: failures <dbl>, schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
#   nursery <chr>, higher <chr>, internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>,
#   Dalc <dbl>, Walc <dbl>, health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>, average <dbl>

This entire operation could be done in one step, by:

math %>%
  mutate(average = (G1 + G2 + G3)/3) %>%
  filter(average > 18)
# A tibble: 6 × 34
  school sex     age address famsize Pstatus  Medu  Fedu Mjob     Fjob     reason   guardian traveltime studytime
  <chr>  <fct> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr>    <chr>    <chr>    <chr>         <dbl>     <dbl>
1 GP     M        15 U       GT3     T           4     4 services teacher  course   father            1         2
2 GP     M        16 U       GT3     T           4     3 health   services reputat… mother            1         4
3 GP     M        15 U       LE3     A           4     4 teacher  teacher  course   mother            1         1
4 GP     M        15 U       LE3     T           4     2 teacher  other    course   mother            1         1
5 GP     F        18 U       GT3     T           2     2 at_home  at_home  other    mother            1         3
6 MS     F        18 R       LE3     T           4     4 other    other    reputat… mother            2         3
# ℹ 20 more variables: failures <dbl>, schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
#   nursery <chr>, higher <chr>, internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>,
#   Dalc <dbl>, Walc <dbl>, health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>, average <dbl>

Note that this time we haven’t modified the data frame - the variable average is created only temporarily, so that we can use it as a filter.

This query returns rather a lot of columns. Suppose we wanted to narrow our search and only see the guardians of the students with an average higher than 18. We could use the select function, which simply selects columns of the data frame by name:

math %>%
  mutate(average = (G1 + G2 + G3)/3) %>%
  filter(average > 18) %>%
  select(guardian)
# A tibble: 6 × 1
  guardian
  <chr>   
1 father  
2 mother  
3 mother  
4 mother  
5 mother  
6 mother  

This leaves us with a data frame of one column, showing guardians of the students with the best marks.
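select is not limited to a single name - it also accepts several columns, ranges of adjacent columns, and helper functions. The calls below are a sketch against the math data:

```r
# Several ways of selecting columns with select():
math %>% select(guardian, sex, average)  # several columns by name
math %>% select(G1:G3)                   # a range of adjacent columns
math %>% select(starts_with("G"))        # names starting with "g"/"G"
                                         # (case-insensitive by default)
math %>% select(-school)                 # everything except school
```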

Reshaping data

It’s not uncommon for social scientific data to be longitudinal in nature. This means that data on a given unit of observation (for example a country, household or individual) is observed on multiple variables (for example GDP, income, well-being) over a period of time. Such data can come in two formats - long and wide.

Wide data format - in the wide data format, each column represents a variable. For example, the table below presents grades of three students over three academic years in the wide format, with each column representing a separate year.

student     1     2     3
Thomas   2.34  3.87  2.03
Mary     3.87  4.58  2.70
David    3.83  3.92  4.00

Long data format - in the long data format, one column stores the name of the variable and another stores the value of the corresponding variable. This format is sometimes more useful for particular types of analysis, such as panel data models, and for visualization. You can see the student scores in the long format below:

student  year  grade
Thomas      1   2.34
Thomas      2   3.87
Thomas      3   2.03
Mary        1   3.87
Mary        2   4.58
Mary        3   2.70
David       1   3.83
David       2   3.92
David       3   4.00

We can see an example of such data by loading the dataset gdp.csv, which contains GDP per capita over several years for a couple of European countries:

gdp <- read_csv("data/world_bank/gdp.csv")
head(gdp)
# A tibble: 6 × 23
  country ccode `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013`
  <chr>   <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Germany DEU   27209. 28381. 29179. 29875. 31305. 31794. 34119. 36250. 37802. 36851. 38979. 42542. 43360. 44994.
2 Denmark DNK   28669. 29450. 30640. 30787. 32909. 34150. 37289. 38966. 41278. 40370. 43032. 44403. 44804. 46727.
3 Spain   ESP   21592. 22959. 24372. 25019. 26120. 27607. 30683. 32436. 33263. 32123. 31704. 31868. 31720. 32434.
4 France  FRA   26100. 27502. 28524. 28142. 29034. 30499. 32429. 34086. 35095. 34711. 35927. 37441. 37679. 39524.
5 United… GBR   26413. 27757. 29069. 30262. 31965. 32668. 34761. 35597. 36660. 35030. 36368. 37162. 38312. 39971.
6 Greece  GRC   19524. 20964. 22616. 23871. 25437. 25578. 28515. 29290. 30856. 30388. 28169. 26141. 25284. 26098.
# ℹ 7 more variables: `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>, `2019` <dbl>,
#   X65 <lgl>

We can see that the data contains country names, as well as GDP values for the years 2000 to 2019 in the wide format. To reshape the data into the long format, we can use the pivot_longer function, which comes with the tidyr package, another element of the tidyverse suite. In pivot_longer, we specify the dataset as the first argument (which is usually piped to the function), followed by the names of the columns that contain the wide-format variables (assuming they are in order, this can be specified with the names of the left-most and right-most columns, separated by a colon). Note that in this example, we also wrap the column names in backticks, since the variables are named using numbers. The names_to argument specifies the name of the variable which will be used to store the names of the re-formatted columns (in our example - years) and the values_to argument specifies the name of the variable which will be used to store the values (GDP per capita).

gdp_long <- gdp %>% pivot_longer(`2000`:`2019`, 
                                 names_to = "year", values_to = "gdp_pc")
head(gdp_long)
# A tibble: 6 × 5
  country ccode X65   year  gdp_pc
  <chr>   <chr> <lgl> <chr>  <dbl>
1 Germany DEU   NA    2000  27209.
2 Germany DEU   NA    2001  28381.
3 Germany DEU   NA    2002  29179.
4 Germany DEU   NA    2003  29875.
5 Germany DEU   NA    2004  31305.
6 Germany DEU   NA    2005  31794.

As you can see, the function produces data in a long format, with only 5 columns, but 140 rows, as opposed to the wide data, which consists of only 7 rows, but 23 columns.

In some cases, your data might come in a long format, yet you might want to reshape it into wide. This can be done using the pivot_wider function, which works exactly opposite to pivot_longer. We first specify the data by piping it to the function, then use the names_from argument to specify the variable containing the variable names and values_from to specify the variable containing the values. We end up with essentially the same data frame that we started with, although, as the all.equal output shows, the column order and some column types differ slightly after the round trip.

gdp_wide <- gdp_long %>% pivot_wider(names_from = "year", values_from = "gdp_pc")
head(gdp_wide)
# A tibble: 6 × 23
  country  ccode X65   `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012`
  <chr>    <chr> <lgl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Germany  DEU   NA    27209. 28381. 29179. 29875. 31305. 31794. 34119. 36250. 37802. 36851. 38979. 42542. 43360.
2 Denmark  DNK   NA    28669. 29450. 30640. 30787. 32909. 34150. 37289. 38966. 41278. 40370. 43032. 44403. 44804.
3 Spain    ESP   NA    21592. 22959. 24372. 25019. 26120. 27607. 30683. 32436. 33263. 32123. 31704. 31868. 31720.
4 France   FRA   NA    26100. 27502. 28524. 28142. 29034. 30499. 32429. 34086. 35095. 34711. 35927. 37441. 37679.
5 United … GBR   NA    26413. 27757. 29069. 30262. 31965. 32668. 34761. 35597. 36660. 35030. 36368. 37162. 38312.
6 Greece   GRC   NA    19524. 20964. 22616. 23871. 25437. 25578. 28515. 29290. 30856. 30388. 28169. 26141. 25284.
# ℹ 7 more variables: `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>,
#   `2019` <dbl>
all.equal(gdp_wide, gdp)
 [1] "Names: 21 string mismatches"                                                           
 [2] "Attributes: < Names: 1 string mismatch >"                                              
 [3] "Attributes: < Length mismatch: comparison on first 2 components >"                     
 [4] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
 [5] "Attributes: < Component \"class\": 3 string mismatches >"                              
 [6] "Attributes: < Component 2: Modes: numeric, externalptr >"                              
 [7] "Attributes: < Component 2: Lengths: 7, 1 >"                                            
 [8] "Attributes: < Component 2: target is numeric, current is externalptr >"                
 [9] "Component 3: Modes: logical, numeric"                                                  
[10] "Component 3: target is logical, current is numeric"                                    
[11] "Component 4: Mean relative difference: 0.04964384"                                     
[12] "Component 5: Mean relative difference: 0.04797772"                                     
[13] "Component 6: Mean relative difference: 0.02722602"                                     
[14] "Component 7: Mean relative difference: 0.05492863"                                     
[15] "Component 8: Mean relative difference: 0.03197982"                                     
[16] "Component 9: Mean relative difference: 0.08535064"                                     
[17] "Component 10: Mean relative difference: 0.04919092"                                    
[18] "Component 11: Mean relative difference: 0.04411082"                                    
[19] "Component 12: Mean relative difference: 0.02757141"                                    
[20] "Component 13: Mean relative difference: 0.05151733"                                    
[21] "Component 14: Mean relative difference: 0.04769779"                                    
[22] "Component 15: Mean relative difference: 0.01895478"                                    
[23] "Component 16: Mean relative difference: 0.03867191"                                    
[24] "Component 17: Mean relative difference: 0.03078083"                                    
[25] "Component 18: Mean relative difference: 0.02475497"                                    
[26] "Component 19: Mean relative difference: 0.05288349"                                    
[27] "Component 20: Mean relative difference: 0.05189977"                                    
[28] "Component 21: Mean relative difference: 0.03430557"                                    
[29] "Component 22: Mean relative difference: 0.04530826"                                    
[30] "Component 23: Modes: numeric, logical"                                                 
[31] "Component 23: target is numeric, current is logical"                                   

Joining

The final data manipulation technique that we will discuss in this chapter is joining. In many cases our dataset will come in two or more separate files, each containing different variables for the same units of observation. This is the case with all the data coming from World Bank Open Data, where information about each indicator comes in a separate csv file. For example, suppose we have data on GDP and population density of some countries:

gdp <- read_csv("data/world_bank/gdp.csv")
pop <- read_csv("data/world_bank/pop_dens.csv")
head(gdp)
# A tibble: 6 × 23
  country ccode `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013`
  <chr>   <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Germany DEU   27209. 28381. 29179. 29875. 31305. 31794. 34119. 36250. 37802. 36851. 38979. 42542. 43360. 44994.
2 Denmark DNK   28669. 29450. 30640. 30787. 32909. 34150. 37289. 38966. 41278. 40370. 43032. 44403. 44804. 46727.
3 Spain   ESP   21592. 22959. 24372. 25019. 26120. 27607. 30683. 32436. 33263. 32123. 31704. 31868. 31720. 32434.
4 France  FRA   26100. 27502. 28524. 28142. 29034. 30499. 32429. 34086. 35095. 34711. 35927. 37441. 37679. 39524.
5 United… GBR   26413. 27757. 29069. 30262. 31965. 32668. 34761. 35597. 36660. 35030. 36368. 37162. 38312. 39971.
6 Greece  GRC   19524. 20964. 22616. 23871. 25437. 25578. 28515. 29290. 30856. 30388. 28169. 26141. 25284. 26098.
# ℹ 7 more variables: `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>, `2019` <dbl>,
#   X65 <lgl>
head(pop)
# A tibble: 6 × 23
  country ccode `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013`
  <chr>   <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 France  FRA    111.   112.   113.   114.   115.   115.   116.   117.   118.   118.   119.   119.   120.   121. 
2 United… GBR    243.   244.   245.   247.   248.   250.   252.   253.   255.   257.   259.   261.   263.   265. 
3 Ireland IRL     55.2   56.1   57.1   58.0   59.1   60.4   62.0   63.9   65.2   65.8   66.2   66.5   66.8   67.1
4 Italy   ITA    194.   194.   194.   195.   196.   197.   198.   199.   200.   201.   202.   202.   202.   205. 
5 Poland  POL    125.   125.   125.   125.   125.   125.   125.   124.   124.   125.   124.   124.   124.   124. 
6 Portug… PRT    112.   113.   114.   114.   115.   115.   115.   115.   115.   115.   115.   115.   115.   114. 
# ℹ 7 more variables: `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>, `2019` <lgl>,
#   X65 <lgl>

Suppose we want to analyze the relationship between population density and GDP per capita. To do that, it would be convenient to merge these two datasets into one containing the variables gdp_pc and pop_dens. We can achieve this by using joins.

First, we pivot the data into the long format:

gdp_long <- gdp %>% pivot_longer(`2000`:`2019`, 
                                 names_to = "year", values_to = "gdp_pc")
pop_long <- pop %>% pivot_longer(`2000`:`2019`, 
                                 names_to = "year", values_to = "pop_dens")

Note that the countries in the two datasets are different:

unique(gdp_long$country)
[1] "Germany"        "Denmark"        "Spain"          "France"         "United Kingdom" "Greece"        
[7] "Poland"        
unique(pop_long$country)
[1] "France"         "United Kingdom" "Ireland"        "Italy"          "Poland"         "Portugal"      
identical(unique(gdp_long$country), unique(pop_long$country))
[1] FALSE

To join two data frames, we need an ID variable (or a set of variables) that will identify observations and allow us to match them. In this example, the country and year variables are perfect candidates, since each combination corresponds to one observation from a given country in a given period. We would then say that we join the two datasets on country and year.

There are three fundamental ways in which we can approach this:

  • Inner join is used to keep only the observations where the variables we are joining on appear in both datasets. Rows whose identifying variables don’t match any observation in the other dataset are dropped from the result. This join is used when we care primarily about the completeness of our data. The order of the data frames does not matter when performing an inner join.
dat <- inner_join(gdp_long, pop_long, by = c("country", "ccode", "year"))
head(dat)
# A tibble: 6 × 7
  country ccode X65.x year  gdp_pc X65.y pop_dens
  <chr>   <chr> <lgl> <chr>  <dbl> <lgl>    <dbl>
1 France  FRA   NA    2000  26100. NA        111.
2 France  FRA   NA    2001  27502. NA        112.
3 France  FRA   NA    2002  28524. NA        113.
4 France  FRA   NA    2003  28142. NA        114.
5 France  FRA   NA    2004  29034. NA        115.
6 France  FRA   NA    2005  30499. NA        115.

We can see that the new dataframe dat contains both gdp_pc and pop_dens variables. Furthermore, only the countries present in both datasets were kept:

unique(dat$country)
[1] "France"         "United Kingdom" "Poland"        
  • Left join is used to keep all the observations from the first dataset (the one on the left of the joining function), whether or not they have a match in the second. This is done primarily when we care about keeping all the observations from the first (left) dataset. Observations for which no corresponding identifying values were found are filled with missing values:
dat <- left_join(gdp_long, pop_long, by = c("country", "ccode", "year"))
head(dat)
# A tibble: 6 × 7
  country ccode X65.x year  gdp_pc X65.y pop_dens
  <chr>   <chr> <lgl> <chr>  <dbl> <lgl>    <dbl>
1 Germany DEU   NA    2000  27209. NA          NA
2 Germany DEU   NA    2001  28381. NA          NA
3 Germany DEU   NA    2002  29179. NA          NA
4 Germany DEU   NA    2003  29875. NA          NA
5 Germany DEU   NA    2004  31305. NA          NA
6 Germany DEU   NA    2005  31794. NA          NA

As we can see in this example, the resulting dataset contains missing values for the countries that were not present in the pop_long dataset. All the countries from the gdp_long dataset were kept:

all.equal(unique(gdp_long$country), unique(dat$country))
[1] TRUE
  • Full join is used to keep observations from both data frames, producing missing values whenever an observation is missing in one of them.
dat <- full_join(gdp_long, pop_long, by = c("country", "ccode", "year"))
head(dat)
# A tibble: 6 × 7
  country ccode X65.x year  gdp_pc X65.y pop_dens
  <chr>   <chr> <lgl> <chr>  <dbl> <lgl>    <dbl>
1 Germany DEU   NA    2000  27209. NA          NA
2 Germany DEU   NA    2001  28381. NA          NA
3 Germany DEU   NA    2002  29179. NA          NA
4 Germany DEU   NA    2003  29875. NA          NA
5 Germany DEU   NA    2004  31305. NA          NA
6 Germany DEU   NA    2005  31794. NA          NA

As a result of a full_join, all countries included in either of the datasets are kept:

unique(c(dat$country))
 [1] "Germany"        "Denmark"        "Spain"          "France"         "United Kingdom" "Greece"        
 [7] "Poland"         "Ireland"        "Italy"          "Portugal"      
unique(c(gdp_long$country, pop_long$country))
 [1] "Germany"        "Denmark"        "Spain"          "France"         "United Kingdom" "Greece"        
 [7] "Poland"         "Ireland"        "Italy"          "Portugal"      

There are some other joining techniques, such as Filtering joins (semi_join and anti_join), as well as nest_join. You can read more about these in the documentation by typing ?join into the console.
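As a brief sketch of the filtering joins mentioned above: semi_join keeps the rows of the left data frame that have a match in the right one, while anti_join keeps the rows that do not. For example, to list the countries present in gdp_long but absent from pop_long:

```r
# anti_join() returns rows of gdp_long with no match in pop_long;
# distinct() then collapses them to unique country names.
gdp_long %>%
  anti_join(pop_long, by = c("country", "ccode", "year")) %>%
  distinct(country)
```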

Aggregating data

While we have discussed summary statistics that can be used to summarize data, it’s often very useful to compare their values across groups, rather than only look at one number describing an entire dataset. The tidyverse allows us to calculate summary statistics of variables through the summarise function. For example, to get the average GDP of countries in our data:

gdp_long %>% summarise(gdp_avg = mean(gdp_pc))
# A tibble: 1 × 1
  gdp_avg
    <dbl>
1  33486.

This is no different from using gdp_long$gdp_pc %>% mean(), other than that it returns a tibble rather than a scalar value. However, the summarise function is most powerful in conjunction with the group_by function. As the name suggests, the group_by function divides the data frame into groups using one or more of its variables. On the surface, it doesn’t appear to alter much:

gdp_long_grouped <- gdp_long %>% group_by(country)
gdp_long_grouped
# A tibble: 140 × 5
# Groups:   country [7]
   country ccode X65   year  gdp_pc
   <chr>   <chr> <lgl> <chr>  <dbl>
 1 Germany DEU   NA    2000  27209.
 2 Germany DEU   NA    2001  28381.
 3 Germany DEU   NA    2002  29179.
 4 Germany DEU   NA    2003  29875.
 5 Germany DEU   NA    2004  31305.
 6 Germany DEU   NA    2005  31794.
 7 Germany DEU   NA    2006  34119.
 8 Germany DEU   NA    2007  36250.
 9 Germany DEU   NA    2008  37802.
10 Germany DEU   NA    2009  36851.
# ℹ 130 more rows
pop_long %>%
  group_by(country) %>%
  summarise(avg_pop = mean(pop_dens, na.rm = TRUE), 
            sd_pop = sd(pop_dens, na.rm = TRUE))
# A tibble: 6 × 3
  country        avg_pop sd_pop
  <chr>            <dbl>  <dbl>
1 France           118.   3.60 
2 Ireland           63.9  4.89 
3 Italy            200.   4.67 
4 Poland           124.   0.301
5 Portugal         114.   1.13 
6 United Kingdom   258.  10.3  
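summarise accepts any function that collapses a vector into a single value, and the helper n() returns the number of rows in each group. A sketch using gdp_long:

```r
# Count the observations and find the highest GDP per capita
# within each country:
gdp_long %>%
  group_by(country) %>%
  summarise(n_years = n(),
            max_gdp = max(gdp_pc, na.rm = TRUE))
```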

Similarly, we could compare each country’s average, maximum and minimum GDP growth over the past years:

gdp_long %>% 
  group_by(country) %>%
  mutate(gdp_growth = (gdp_pc - lag(gdp_pc)) / lag(gdp_pc)) %>%
  summarise(avg_growth = mean(gdp_growth, na.rm = TRUE), 
            max_growth = max(gdp_growth, na.rm = TRUE), 
            min_growth = min(gdp_growth, na.rm = TRUE)) %>%
  arrange(-avg_growth)
# A tibble: 7 × 4
  country        avg_growth max_growth min_growth
  <chr>               <dbl>      <dbl>      <dbl>
1 Poland             0.0636     0.109      0.0361
2 Denmark            0.0398     0.0919    -0.0220
3 Germany            0.0391     0.0914    -0.0252
4 Spain              0.0364     0.111     -0.0343
5 France             0.0344     0.0633    -0.0134
6 United Kingdom     0.0330     0.0641    -0.0445
7 Greece             0.0264     0.115     -0.0730

In this case, we first group the data frame by country. We then use mutate to compute gdp_growth by subtracting the GDP from one year before (obtained using the lag function) and dividing this difference by the lagged GDP. Note that the mutate function also respects the grouping defined by group_by - the lag() is computed within each country. We then use summarise to apply the summary functions. Finally, we use arrange to sort the output according to the negative value of the average GDP growth, i.e. in decreasing order.

Since the behaviour of group_by affects many operations performed on a data frame, it is important to call ungroup() at the end of our operations if we assign the data frame to a new name - performing operations on grouped tibbles can lead to surprising results. Coming back to our example, suppose we wanted to obtain the deviation of each country’s GDP growth from the global mean of this growth. First, we would obtain the growth rates:

gdp_growths <- gdp_long %>% 
                group_by(country) %>%
                mutate(gdp_growth = (gdp_pc - lag(gdp_pc)) / lag(gdp_pc))

#store demeaned values:
dem <- gdp_growths %>% 
  mutate(growth_dem = gdp_growth - mean(gdp_growth, na.rm = TRUE))

# store demeaned values calculated after ungrouping
dem_ungr <- gdp_growths %>% 
  ungroup() %>%
  mutate(growth_dem = gdp_growth - mean(gdp_growth, na.rm = TRUE))

all.equal(dem$growth_dem, dem_ungr$growth_dem)
[1] "Mean relative difference: 0.3395083"

Using the all.equal function, we can see that the resulting variables are different. This is because, if the tibble is still grouped, the expression mutate(growth_dem = gdp_growth - mean(gdp_growth, na.rm = TRUE)) subtracts the group mean from each observation, whereas if it’s ungrouped, it calculates the global mean and subtracts it. While the difference may seem trivial in this example, forgetting to ungroup a tibble is a common error, and it is crucial to remember to ungroup the tibble after finishing the grouped operations. Also note that calling group_by() on an already grouped tibble discards the previous grouping and applies the new one instead.
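If you do want to keep the existing grouping and add another grouping variable instead of replacing it, group_by offers the .add argument (in dplyr versions before 1.0 it was called add). A sketch with the gdp_growths tibble, which is grouped by country:

```r
# group_by() replaces the grouping by default; .add = TRUE extends it.
# group_vars() reports the current grouping variables.
gdp_growths %>% group_by(year) %>% group_vars()              # "year" only
gdp_growths %>% group_by(year, .add = TRUE) %>% group_vars() # "country" "year"
```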

6.2 Summary

  • The tidyverse package is a set of tools for data manipulation commonly used in R.
  • readr offers a convenient interface for reading data into R and explicitly specifying column classes through functions such as read_csv, read_delim or read_tsv.
  • a tibble is a special kind of data frame implemented in the tidyverse package. While it offers some new functionalities, working with tibbles is very similar to working with traditional R data frames.
  • the pipe operator %>% passes values into functions - instead of evaluating expressions from the inside out, they are evaluated from left to right.
  • select is used to select columns from a data frame based on their names.
  • filter is used to filter rows of a data frame based on a logical statement.
  • mutate is used to modify the values of existing columns or to create new columns in a data frame.
  • Data can often come in a long or a wide format. You can use pivot_longer and pivot_wider to switch between these data formats.
  • You can merge two data frames by a common column (or a set of columns) using dplyr’s join functions. Depending on the way you want to treat non-matching records, you may use a left_join, inner_join or a full_join.
  • You can aggregate data using the summarise function. It works best with group_by, which lets you compute statistics by group, where groups are defined by a variable or a set of variables. Remember to call ungroup() on the tibble afterwards to avoid errors.
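The differences between the three join types summarised above can be seen on two toy tibbles (hypothetical data, with id as the joining column):

```r
library(dplyr)

left_tbl  <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
right_tbl <- tibble(id = c(2, 3, 4), y = c("B", "C", "D"))

left_join(left_tbl, right_tbl, by = "id")   # ids 1, 2, 3; y is NA for id 1
inner_join(left_tbl, right_tbl, by = "id")  # ids 2, 3 only - no NAs introduced
full_join(left_tbl, right_tbl, by = "id")   # ids 1, 2, 3, 4; NAs on both sides
```

In other words: inner_join keeps only the matched rows, left_join additionally keeps unmatched rows from the left table, and full_join keeps unmatched rows from both tables, filling the missing columns with NA.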

Functions list

function package description
c() base combine values/vectors into a vector
filter() dplyr filter rows of a data frame according to a logical condition
lag() dplyr lag observations by n periods
all.equal() base check if two R objects are (nearly) equal
class() base retrieve the class of an R object (e.g. ‘numeric’, ‘list’, ‘data.frame’)
factor() base create a factor
identical() base check if two R objects are identical
is.data.frame() base check if an object is a data frame
is.na() base check if a value is NA/elements of a vector are NA
list() base create a list
max() base get the maximum of a vector
mean() base get the mean of a vector
min() base get the minimum of a vector
sum() base get the sum of numeric values or a vector
unique() base get the unique elements of a vector
arrange() dplyr sort rows of a data frame according to a variable/combination of variables
full_join() dplyr join keeping all observations from both data frames
group_by() dplyr group a tibble/data frame by a variable. All further tidyverse operations are performed group-wise
inner_join() dplyr join keeping only observations with the joining variable present in both data frames
left_join() dplyr join keeping all observations from the left data frame
mutate() dplyr modify/create a column in a data frame
select() dplyr select columns from a tibble/data frame
summarise() dplyr collapse the dataset to a summary statistic. Usually used with group_by()
ungroup() dplyr remove the grouping from a tibble
col_factor() readr specify column type as factor. Used in readr’s reading functions
read_csv() readr read a tibble from a csv file
read_delim() readr read a tibble from a delimited file
complete.cases() stats return a logical vector indicating which observations are complete (no missing values)
sd() stats get the standard deviation of a vector
is_tibble() tibble check if an object is a tibble
pivot_longer() tidyr reshape data wide → long
pivot_wider() tidyr reshape data long → wide
head() utils print the first n (default 6) rows of the data
read.csv() utils read a csv file into a data frame. Specify stringsAsFactors = FALSE to keep all string columns as characters

6.3 Exercises

  1. Select true statements about joins and correct the false statements:

A. Full join will always produce a data frame with more rows than inner join.
B. Inner join will never produce any new missing values.
C. Full join will always produce at least as many rows as left join.
D. Left join is used when we want to ensure that the observations from the left data frame are always matched by a non-missing element from the right data frame.

The remaining exercises will use the student data, which consists of three .csv files: math.csv, which contains grades achieved by each student in each of the years in a long format, port.csv, with the same data for Portuguese classes, and info.csv, which contains demographic variables.

  2. Load the math and port data into R. Specify the id variable to be a character column.

  3. Load the info data into R and join it with the grades datasets. Make sure to keep only the records of students who have reported grades in both Portuguese and math. Use the suffix argument of the join function to name the duplicated variables (G1, G2 and G3) _por for Portuguese grades and _mat for math grades.

  4. Create a vector called port_missing which contains student IDs that are present in the math dataset but missing from the port dataset.

  5. Compare the average performance in Portuguese and math in the third year between the schools.

  6. Did the student grades improve on average in each subject? Was the improvement better in one school than the other? (HINT: you might want your data in a long format and the lag function used at the end of the chapter. Store the data in a new tibble called students_imp, as it will be useful for the next exercises.)

  7. Did the grade improvements vary across schools?

  8. Which form of support - famsup or schoolsup - leads to a higher average improvement in students’ grades in each subject?

  9. Does either of the two schools seem more effective in raising the average grade through school support?

The solutions for the exercises will be available here on 2021-01-07.