6 Data Manipulation
6.1 Content
While we have already discussed some methods for data manipulation, such as indexing, subsetting and modifying data frames, the majority of R users approach these tasks using a dedicated collection of packages called tidyverse, introduced by Hadley Wickham, a statistician from New Zealand. While it may seem like this chapter covers tools for performing tasks that you are already familiar with, tidyverse follows a different philosophy than traditional R and has a lot of advantages, including better code readability and efficiency. As before, to install and load the package, simply run the code below. Remember that you only need to call install.packages once, to download all the required files from CRAN. library needs to be called every time you start a new R session to attach the library functions to your working environment.
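The two calls described above look like this:

```r
# download the package from CRAN - this only needs to be run once
install.packages("tidyverse")

# attach the package - run this at the start of every R session
library(tidyverse)
```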
Reading the data
In this chapter, we will use the same dataset we used in the exploratory analysis chapter, which presents individual-level information about a sample of students. However, tidyverse offers an improved set of functions for reading in the data as part of the readr subpackage. They work similarly to read.csv introduced before, but have some advantages (for example, they read text columns as character vectors rather than factors, without having to include the stringsAsFactors argument, which was discussed in chapter 3). All the readr reading functions start with read_ and are used for different file types. The most common one is read_csv (which you can use in exactly the same way as read.csv); however, in this case we use read_delim, which allows us to read a file with any delimiter. In the case of the math dataset, the values in each row are separated by a semicolon (which you can check by opening the file in a plain-text editor). Because of that, we specify the delim argument as in the example below:
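A minimal version of the call looks like this (column-type specification is introduced below):

```r
# read the semicolon-delimited student dataset
math <- read_delim("data/student/student-mat-data.csv", delim = ";")
```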
As you can see, running the function returns a message showing the specification of each column that was read - col_character() refers to character columns, while col_double() means numeric columns. You can also force each column to be read as a specific type. For example, you may want a text column - such as sex - to be read as a factor. As mentioned above, the default setting of the read_ functions is to read all numbers as numeric variables and all text as character variables.
To read the sex column as a factor, we can read the data again, this time specifying the col_types argument. The col_types argument takes a list as input, in which we specify the type of selected columns - for example, sex = col_factor():
math <- read_delim("data/student/student-mat-data.csv", delim = ";",
col_types = list(sex = col_factor(levels = c("M","F"))))
class(math$sex)
[1] "factor"
You may wonder why not simply use the as.factor
or factor
function:
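A sketch of that alternative, converting the column after the data has been read:

```r
# base R alternative: convert the column to a factor in place
math$sex <- as.factor(math$sex)
```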
This is exactly equivalent; however, specifying the types while reading the data is more explicit and concise. Anyone (including you) who reads your analysis will immediately know which columns you have specified to be which type.
The final facet of the readr package functions is that they load the data as a slightly different data type than the normal read.csv:
math <- read_delim("data/student/student-mat-data.csv", delim = ";",
col_types = list(sex = col_factor(levels = c("M","F"))))
class(math)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
The data comes loaded as a tibble (tbl for short). Tibbles are a special kind of data frame implemented by the tidyverse package. The only thing you need to know about them for now is that everything you have learned about data frames so far also applies to tibbles. They offer some improvements, which will not be discussed here.
The pipe operator
Perhaps the most important innovation offered by the tidyverse package is the so-called pipe operator %>%. Its use may feel a bit quirky at first, but it is extremely useful and widely used by modern R users.
As you have learned so far, R evaluates nested function calls from the inside out. For example, we can get the number of missing values in a data frame by running:
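The call would look like this:

```r
# is.na() returns a logical table, sum() adds it up
sum(is.na(math))
```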
This essentially performs two steps. First, it runs is.na on the math data frame, which returns a table filled with logical values: FALSE when a given entry is not missing and TRUE when it is. Then sum takes this table as input and adds up the values in it (treating FALSE as 0 and TRUE as 1). In many cases, such statements can get long, difficult to read and error-prone, especially when keyword arguments are specified.
The same operation may be done using the pipe operator. In this case, rather than evaluating the sequence of functions from within, they are evaluated left to right. The %>% operator can be understood as a way of passing the output of the thing on the left to the thing on the right as the first argument:
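For example:

```r
# the same computation, written as a left-to-right pipeline
math %>% is.na() %>% sum()
```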
In this example, the math
data frame is passed to the is.na
function, and then the output is passed to the sum
function, which returns exactly the same result. As in the case of the regular call, you may store the output in a variable:
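For instance (the name n_missing is arbitrary):

```r
# store the count of missing values in a variable
n_missing <- math %>% is.na() %>% sum()
```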
Before continuing we drop the missing observations:
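One way to do this is with tidyr's drop_na() - a sketch, as the original code for this step is not shown and other approaches (such as filtering on complete.cases(), which appears in the functions list at the end of the chapter) work equally well:

```r
# keep only the rows with no missing values
math <- math %>% drop_na()
```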
While this may feel slightly unintuitive in this case, it comes in very handy when performing long sequences of operations on data, as we will see in the following sections.
Dataset manipulation
Three key functions are most commonly used for dataset manipulation in the tidyverse package: mutate, select and filter, coming from the dplyr sub-package. They are used as follows:
- mutate is used to modify and create columns in data frames
- select is used to select columns by name
- filter is used to select rows given a set of logical values
All three functions take the data frame as their first argument. For example, we can create a new column, average, by adding together the grades for all three years and dividing them by 3:
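For example:

```r
# mutate takes the data frame first, then the new column definition
math <- mutate(math, average = (G1 + G2 + G3)/3)
```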
Again, this is equivalent to:
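A sketch of the base R version:

```r
# base R equivalent, using the $ operator
math$average <- (math$G1 + math$G2 + math$G3)/3
```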
Both operations create a variable called average by adding G1, G2 and G3 together and dividing the sum by three. Note, however, that in the case of mutate there is no need to use the $ operator - you pass the math data frame as the first argument, so the function knows that the names G1, G2 and G3 refer to columns of that particular data frame. The mutate function returns the same data frame, but with the additional column average. Most commonly, it is used with the pipe operator:
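For example:

```r
# pipe math into mutate and store the result back in math
math <- math %>% mutate(average = (G1 + G2 + G3)/3)
```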
In this case, through the %>% we pass the math data frame to mutate, which creates a new column and returns the updated data frame, which is stored in math.
filter allows you to filter the rows of a data frame - an operation similar to indexing. For example, we can get the students with an average grade higher than 18:
math %>% filter(average > 18)
# A tibble: 6 × 34
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime
<chr> <fct> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 GP M 15 U GT3 T 4 4 services teacher course father 1 2
2 GP M 16 U GT3 T 4 3 health services reputat… mother 1 4
3 GP M 15 U LE3 A 4 4 teacher teacher course mother 1 1
4 GP M 15 U LE3 T 4 2 teacher other course mother 1 1
5 GP F 18 U GT3 T 2 2 at_home at_home other mother 1 3
6 MS F 18 R LE3 T 4 4 other other reputat… mother 2 3
# ℹ 20 more variables: failures <dbl>, schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
# nursery <chr>, higher <chr>, internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>,
# Dalc <dbl>, Walc <dbl>, health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>, average <dbl>
This entire operation could be done in one step, by:
math %>%
mutate(average = (G1 + G2 + G3)/3) %>%
filter(average > 18)
# A tibble: 6 × 34
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime
<chr> <fct> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 GP M 15 U GT3 T 4 4 services teacher course father 1 2
2 GP M 16 U GT3 T 4 3 health services reputat… mother 1 4
3 GP M 15 U LE3 A 4 4 teacher teacher course mother 1 1
4 GP M 15 U LE3 T 4 2 teacher other course mother 1 1
5 GP F 18 U GT3 T 2 2 at_home at_home other mother 1 3
6 MS F 18 R LE3 T 4 4 other other reputat… mother 2 3
# ℹ 20 more variables: failures <dbl>, schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
# nursery <chr>, higher <chr>, internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>,
# Dalc <dbl>, Walc <dbl>, health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>, average <dbl>
Note that this time we haven’t modified the data frame - the variable average
is created only temporarily, so that we can use it as a filter.
This query shows rather too many columns. Suppose we wanted to narrow our search and only see the guardians of the students with an average higher than 18. We could use the select function, which simply selects columns of a data frame by name:
math %>%
mutate(average = (G1 + G2 + G3)/3) %>%
filter(average > 18) %>%
select(guardian)
# A tibble: 6 × 1
guardian
<chr>
1 father
2 mother
3 mother
4 mother
5 mother
6 mother
This leaves us with a data frame of one column, showing guardians of the students with the best marks.
Reshaping data
It is common for social scientific data to be longitudinal in nature. This means that a given unit of observation (for example a country, a household or an individual) is observed on multiple variables (for example GDP, income, well-being) over a period of time. Such data can come in two formats - long and wide.
Wide data format - in the wide format, each column represents a variable. For example, the table below shows the grades of three students over three academic years in the wide format; each column represents a separate year.
student | 1 | 2 | 3 |
---|---|---|---|
Thomas | 2.34 | 3.87 | 2.03 |
Mary | 3.87 | 4.58 | 2.70 |
David | 3.83 | 3.92 | 4.00 |
Long data format - in the long format, one column holds the name of the variable and another holds the value of the corresponding variable. This format is sometimes more useful for particular types of analysis, such as panel data models, and for visualization. You can see the student scores in the long format below:
student | year | grade |
---|---|---|
Thomas | 1 | 2.34 |
Thomas | 2 | 3.87 |
Thomas | 3 | 2.03 |
Mary | 1 | 3.87 |
Mary | 2 | 4.58 |
Mary | 3 | 2.70 |
David | 1 | 3.83 |
David | 2 | 3.92 |
David | 3 | 4.00 |
We can see an example of such data by loading the dataset gdp.csv, which contains GDP per capita over several years for a couple of European countries:
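The file can be read with read_csv (the exact path is an assumption - adjust it to wherever gdp.csv is stored):

```r
# hypothetical path to the file
gdp <- read_csv("data/gdp.csv")
```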
head(gdp)
# A tibble: 6 × 23
country ccode `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Germany DEU 27209. 28381. 29179. 29875. 31305. 31794. 34119. 36250. 37802. 36851. 38979. 42542. 43360. 44994.
2 Denmark DNK 28669. 29450. 30640. 30787. 32909. 34150. 37289. 38966. 41278. 40370. 43032. 44403. 44804. 46727.
3 Spain ESP 21592. 22959. 24372. 25019. 26120. 27607. 30683. 32436. 33263. 32123. 31704. 31868. 31720. 32434.
4 France FRA 26100. 27502. 28524. 28142. 29034. 30499. 32429. 34086. 35095. 34711. 35927. 37441. 37679. 39524.
5 United… GBR 26413. 27757. 29069. 30262. 31965. 32668. 34761. 35597. 36660. 35030. 36368. 37162. 38312. 39971.
6 Greece GRC 19524. 20964. 22616. 23871. 25437. 25578. 28515. 29290. 30856. 30388. 28169. 26141. 25284. 26098.
# ℹ 7 more variables: `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>, `2019` <dbl>,
# X65 <lgl>
We can see that the data contains country names, as well as GDP values for the years 2000 to 2019, in the wide format. To reshape the data into the long format, we can use the pivot_longer function, which comes with the tidyr package, another element of the tidyverse suite. In pivot_longer, we specify the dataset as the first argument (which is usually piped into the function), followed by the names of the columns that contain the wide-format variables (assuming they are in order, this can be specified with the names of the left-most and right-most columns, separated by a colon). Note that in this example we also wrap the names in backticks, since the variables are named using numbers. The names_to argument specifies the name of the variable which will store the names of the re-formatted columns (in our example - years) and the values_to argument specifies the name of the variable which will store the values (GDP per capita).
gdp_long <- gdp %>% pivot_longer(`2000`:`2019`,
names_to = "year", values_to = "gdp_pc")
head(gdp_long)
# A tibble: 6 × 5
country ccode X65 year gdp_pc
<chr> <chr> <lgl> <chr> <dbl>
1 Germany DEU NA 2000 27209.
2 Germany DEU NA 2001 28381.
3 Germany DEU NA 2002 29179.
4 Germany DEU NA 2003 29875.
5 Germany DEU NA 2004 31305.
6 Germany DEU NA 2005 31794.
As you can see, the function produces data in the long format, with only 5 columns but 140 rows, as opposed to the wide data, which consists of only 7 rows but 23 columns.
In some cases, your data might come in the long format, yet you might want to reshape it into the wide format. This can be done using the pivot_wider function, which works exactly the opposite way to pivot_longer. We first specify the data by piping it into the function, and then use the names_from argument to specify the variable containing the column names and values_from to specify the variable containing the values. We end up with essentially the data frame we started with, although, as the all.equal output below shows, the column order and some column types differ, so the two objects are not strictly identical.
gdp_wide <- gdp_long %>% pivot_wider(names_from = "year", values_from = "gdp_pc")
head(gdp_wide)
# A tibble: 6 × 23
country ccode X65 `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012`
<chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Germany DEU NA 27209. 28381. 29179. 29875. 31305. 31794. 34119. 36250. 37802. 36851. 38979. 42542. 43360.
2 Denmark DNK NA 28669. 29450. 30640. 30787. 32909. 34150. 37289. 38966. 41278. 40370. 43032. 44403. 44804.
3 Spain ESP NA 21592. 22959. 24372. 25019. 26120. 27607. 30683. 32436. 33263. 32123. 31704. 31868. 31720.
4 France FRA NA 26100. 27502. 28524. 28142. 29034. 30499. 32429. 34086. 35095. 34711. 35927. 37441. 37679.
5 United … GBR NA 26413. 27757. 29069. 30262. 31965. 32668. 34761. 35597. 36660. 35030. 36368. 37162. 38312.
6 Greece GRC NA 19524. 20964. 22616. 23871. 25437. 25578. 28515. 29290. 30856. 30388. 28169. 26141. 25284.
# ℹ 7 more variables: `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>,
# `2019` <dbl>
all.equal(gdp_wide, gdp)
[1] "Names: 21 string mismatches"
[2] "Attributes: < Names: 1 string mismatch >"
[3] "Attributes: < Length mismatch: comparison on first 2 components >"
[4] "Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
[5] "Attributes: < Component \"class\": 3 string mismatches >"
[6] "Attributes: < Component 2: Modes: numeric, externalptr >"
[7] "Attributes: < Component 2: Lengths: 7, 1 >"
[8] "Attributes: < Component 2: target is numeric, current is externalptr >"
[9] "Component 3: Modes: logical, numeric"
[10] "Component 3: target is logical, current is numeric"
[11] "Component 4: Mean relative difference: 0.04964384"
[12] "Component 5: Mean relative difference: 0.04797772"
[13] "Component 6: Mean relative difference: 0.02722602"
[14] "Component 7: Mean relative difference: 0.05492863"
[15] "Component 8: Mean relative difference: 0.03197982"
[16] "Component 9: Mean relative difference: 0.08535064"
[17] "Component 10: Mean relative difference: 0.04919092"
[18] "Component 11: Mean relative difference: 0.04411082"
[19] "Component 12: Mean relative difference: 0.02757141"
[20] "Component 13: Mean relative difference: 0.05151733"
[21] "Component 14: Mean relative difference: 0.04769779"
[22] "Component 15: Mean relative difference: 0.01895478"
[23] "Component 16: Mean relative difference: 0.03867191"
[24] "Component 17: Mean relative difference: 0.03078083"
[25] "Component 18: Mean relative difference: 0.02475497"
[26] "Component 19: Mean relative difference: 0.05288349"
[27] "Component 20: Mean relative difference: 0.05189977"
[28] "Component 21: Mean relative difference: 0.03430557"
[29] "Component 22: Mean relative difference: 0.04530826"
[30] "Component 23: Modes: numeric, logical"
[31] "Component 23: target is numeric, current is logical"
Joining
The final data manipulation technique that we will discuss in this chapter is joining. In many cases, a dataset comes in two or more separate files, each containing different variables for the same units of observation. This is the case with all the data coming from World Bank Open Data, where information about each indicator comes in a separate csv file. For example, suppose we have data on GDP and population density of some countries:
head(gdp)
# A tibble: 6 × 23
country ccode `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Germany DEU 27209. 28381. 29179. 29875. 31305. 31794. 34119. 36250. 37802. 36851. 38979. 42542. 43360. 44994.
2 Denmark DNK 28669. 29450. 30640. 30787. 32909. 34150. 37289. 38966. 41278. 40370. 43032. 44403. 44804. 46727.
3 Spain ESP 21592. 22959. 24372. 25019. 26120. 27607. 30683. 32436. 33263. 32123. 31704. 31868. 31720. 32434.
4 France FRA 26100. 27502. 28524. 28142. 29034. 30499. 32429. 34086. 35095. 34711. 35927. 37441. 37679. 39524.
5 United… GBR 26413. 27757. 29069. 30262. 31965. 32668. 34761. 35597. 36660. 35030. 36368. 37162. 38312. 39971.
6 Greece GRC 19524. 20964. 22616. 23871. 25437. 25578. 28515. 29290. 30856. 30388. 28169. 26141. 25284. 26098.
# ℹ 7 more variables: `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>, `2019` <dbl>,
# X65 <lgl>
head(pop)
# A tibble: 6 × 23
country ccode `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 France FRA 111. 112. 113. 114. 115. 115. 116. 117. 118. 118. 119. 119. 120. 121.
2 United… GBR 243. 244. 245. 247. 248. 250. 252. 253. 255. 257. 259. 261. 263. 265.
3 Ireland IRL 55.2 56.1 57.1 58.0 59.1 60.4 62.0 63.9 65.2 65.8 66.2 66.5 66.8 67.1
4 Italy ITA 194. 194. 194. 195. 196. 197. 198. 199. 200. 201. 202. 202. 202. 205.
5 Poland POL 125. 125. 125. 125. 125. 125. 125. 124. 124. 125. 124. 124. 124. 124.
6 Portug… PRT 112. 113. 114. 114. 115. 115. 115. 115. 115. 115. 115. 115. 115. 114.
# ℹ 7 more variables: `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>, `2019` <lgl>,
# X65 <lgl>
Suppose we want to analyze the relationship between population density and GDP per capita. To do that, it would be convenient to merge these two datasets into one containing both the gdp_pc and pop_dens variables. We can achieve this by using joins.
First, we pivot the data into the long format:
gdp_long <- gdp %>% pivot_longer(`2000`:`2019`,
names_to = "year", values_to = "gdp_pc")
pop_long <- pop %>% pivot_longer(`2000`:`2019`,
names_to = "year", values_to = "pop_dens")
Note that the countries in the two datasets are different:
unique(gdp_long$country)
[1] "Germany" "Denmark" "Spain" "France" "United Kingdom" "Greece"
[7] "Poland"
To join two data frames, we need an ID variable (or a set of variables) that identifies observations and allows us to match them. In this example, the country and year variables are perfect candidates, since each combination corresponds to one observation from a given country in a given period. We would then say that we join the two datasets on country and year.
There are three fundamental ways in which we can approach this:
- Inner join is used to keep only the observations whose values of the joining variables appear in both datasets. The rows whose identifying variables don’t match any observation in the other dataset are dropped from the result. This join is used when we care primarily about the completeness of our data. The order of the data frames does not matter when performing an inner join.
dat <- inner_join(gdp_long, pop_long, by = c("country", "ccode", "year"))
head(dat)
# A tibble: 6 × 7
country ccode X65.x year gdp_pc X65.y pop_dens
<chr> <chr> <lgl> <chr> <dbl> <lgl> <dbl>
1 France FRA NA 2000 26100. NA 111.
2 France FRA NA 2001 27502. NA 112.
3 France FRA NA 2002 28524. NA 113.
4 France FRA NA 2003 28142. NA 114.
5 France FRA NA 2004 29034. NA 115.
6 France FRA NA 2005 30499. NA 115.
We can see that the new data frame dat contains both the gdp_pc and pop_dens variables. Furthermore, only the countries present in both datasets were kept:
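We can verify this by listing the countries of the joined data frame:

```r
# only the countries appearing in both gdp_long and pop_long remain
unique(dat$country)
```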
- Left join is used to keep all the observations from the first dataset (the one on the left of the joining function), matched, where possible, with observations from the second. This is done primarily when we care about keeping all the observations from the first (left) dataset. Where no corresponding identifying values are found, missing values are produced:
dat <- left_join(gdp_long, pop_long, by = c("country", "ccode", "year"))
head(dat)
# A tibble: 6 × 7
country ccode X65.x year gdp_pc X65.y pop_dens
<chr> <chr> <lgl> <chr> <dbl> <lgl> <dbl>
1 Germany DEU NA 2000 27209. NA NA
2 Germany DEU NA 2001 28381. NA NA
3 Germany DEU NA 2002 29179. NA NA
4 Germany DEU NA 2003 29875. NA NA
5 Germany DEU NA 2004 31305. NA NA
6 Germany DEU NA 2005 31794. NA NA
As we can see in this example, the resulting dataset contains missing values for the countries that were not present in the pop_long
dataset. All the countries from the gdp_long
dataset were kept:
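Listing the countries confirms this:

```r
# all seven countries from gdp_long are kept
unique(dat$country)
```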
- Full join is used to keep observations from both data frames, producing missing values whenever an observation is missing from one of them.
dat <- full_join(gdp_long, pop_long, by = c("country", "ccode", "year"))
head(dat)
# A tibble: 6 × 7
country ccode X65.x year gdp_pc X65.y pop_dens
<chr> <chr> <lgl> <chr> <dbl> <lgl> <dbl>
1 Germany DEU NA 2000 27209. NA NA
2 Germany DEU NA 2001 28381. NA NA
3 Germany DEU NA 2002 29179. NA NA
4 Germany DEU NA 2003 29875. NA NA
5 Germany DEU NA 2004 31305. NA NA
6 Germany DEU NA 2005 31794. NA NA
As a result of a full_join, all countries included in either of the datasets are kept:
unique(c(dat$country))
[1] "Germany" "Denmark" "Spain" "France" "United Kingdom" "Greece"
[7] "Poland" "Ireland" "Italy" "Portugal"
unique(c(gdp_long$country, pop_long$country))
[1] "Germany" "Denmark" "Spain" "France" "United Kingdom" "Greece"
[7] "Poland" "Ireland" "Italy" "Portugal"
There are some other joining techniques, such as filtering joins (semi_join and anti_join), as well as nest_join. You can read more about these in the documentation by typing ?join into the console.
Aggregating data
While we have discussed summary statistics that can be used to summarize data, it is often very useful to compare their values across groups, rather than only look at one number describing an entire dataset. The tidyverse allows us to calculate summary statistics of variables through the summarise function. For example, to get the average GDP of the countries in our data:
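For example (the column name avg_gdp is arbitrary):

```r
# collapse the whole dataset to a single summary statistic
gdp_long %>% summarise(avg_gdp = mean(gdp_pc))
```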
This is no different from using gdp_long$gdp_pc %>% mean(), other than that it returns a tibble rather than a scalar value. However, the summarise function is most powerful in conjunction with the group_by function. As the name suggests, group_by divides the data frame into groups using one (or more) of its variables. On the surface, it doesn’t appear to alter much:
gdp_long_grouped <- gdp_long %>% group_by(country)
gdp_long_grouped
# A tibble: 140 × 5
# Groups: country [7]
country ccode X65 year gdp_pc
<chr> <chr> <lgl> <chr> <dbl>
1 Germany DEU NA 2000 27209.
2 Germany DEU NA 2001 28381.
3 Germany DEU NA 2002 29179.
4 Germany DEU NA 2003 29875.
5 Germany DEU NA 2004 31305.
6 Germany DEU NA 2005 31794.
7 Germany DEU NA 2006 34119.
8 Germany DEU NA 2007 36250.
9 Germany DEU NA 2008 37802.
10 Germany DEU NA 2009 36851.
# ℹ 130 more rows
pop_long %>%
group_by(country) %>%
summarise(avg_pop = mean(pop_dens, na.rm = TRUE),
sd_pop = sd(pop_dens, na.rm = TRUE))
# A tibble: 6 × 3
country avg_pop sd_pop
<chr> <dbl> <dbl>
1 France 118. 3.60
2 Ireland 63.9 4.89
3 Italy 200. 4.67
4 Poland 124. 0.301
5 Portugal 114. 1.13
6 United Kingdom 258. 10.3
Similarly, we could compare each country’s average, maximum and minimum GDP growth over the past years:
gdp_long %>%
group_by(country) %>%
mutate(gdp_growth = (gdp_pc - lag(gdp_pc)) / lag(gdp_pc)) %>%
summarise(avg_growth = mean(gdp_growth, na.rm = TRUE),
max_growth = max(gdp_growth, na.rm = TRUE),
min_growth = min(gdp_growth, na.rm = TRUE)) %>%
arrange(-avg_growth)
# A tibble: 7 × 4
country avg_growth max_growth min_growth
<chr> <dbl> <dbl> <dbl>
1 Poland 0.0636 0.109 0.0361
2 Denmark 0.0398 0.0919 -0.0220
3 Germany 0.0391 0.0914 -0.0252
4 Spain 0.0364 0.111 -0.0343
5 France 0.0344 0.0633 -0.0134
6 United Kingdom 0.0330 0.0641 -0.0445
7 Greece 0.0264 0.115 -0.0730
In this case, we first group the data frame by country. We then use mutate to compute gdp_growth by subtracting the GDP from one year before (obtained using the lag function) and dividing this difference by the lagged GDP. Note that the mutate function also respects the grouping defined in group_by - the lag() is computed within each country. We then use summarise to apply the aggregation functions. Finally, we use arrange to sort the output by the negative value of the average GDP growth, i.e. in decreasing order.
Since the behaviour of group_by affects many operations performed on a data frame, it is important to call ungroup() at the end of our operations if we assign the data frame to a new name - performing operations on grouped tibbles can lead to surprising results. Coming back to our example, suppose we wanted to obtain the deviation of each country’s GDP growth from the global mean of this growth. First, we would obtain the growths:
gdp_growths <- gdp_long %>%
group_by(country) %>%
mutate(gdp_growth = (gdp_pc - lag(gdp_pc)) / lag(gdp_pc))
# store the demeaned values:
dem <- gdp_growths %>%
mutate(growth_dem = gdp_growth - mean(gdp_growth, na.rm = TRUE))
# demeaned values calculated after ungrouping:
dem_ungr <- gdp_growths %>%
ungroup() %>%
mutate(growth_dem = gdp_growth - mean(gdp_growth, na.rm = TRUE))
all.equal(dem$growth_dem, dem_ungr$growth_dem)
[1] "Mean relative difference: 0.3395083"
We can see, using the all.equal function, that the resulting variables are different. This is because, if the tibble is still grouped, the mutate(growth_dem = gdp_growth - mean(gdp_growth, na.rm = TRUE)) expression subtracts the group mean from each observation, whereas if it is ungrouped, it subtracts the global mean. While it may seem trivial in this example, forgetting to ungroup a tibble is a common error, and it is crucial to remember to ungroup the tibble after you finish operating on it. Also note that calling group_by() on an already grouped tibble discards the previous grouping and applies the new one instead.
6.2 Summary
- The tidyverse package is a set of tools for data manipulation commonly used in R. readr offers a convenient interface for reading data into R and explicitly specifying column classes through functions such as read_csv, read_delim or read_tsv.
- A tibble is a special kind of data frame implemented in the tidyverse package. While it offers some new functionalities, working with tibbles is very similar to working with traditional R data frames.
- The pipe operator %>% passes values into functions - instead of evaluating expressions from the inside out, they are evaluated from left to right.
- select is used to select columns from a data frame based on their names.
- filter is used to filter rows from a data frame based on a logical statement.
- mutate is used to modify the values of existing columns or to create new columns in a data frame.
- Data can often come in a long or wide format. You can use pivot_longer and pivot_wider to switch between these formats.
- You can merge two data frames by a common column (or a set of columns) using dplyr’s join functions. Depending on how you want to treat non-matching records, you may use a left_join, an inner_join or a full_join.
- You can aggregate data using the summarise function. It works best with group_by, which allows you to get statistics by group, following a grouping defined by a variable or a set of variables. Remember to call ungroup() on the tibble afterwards, to avoid errors.
Functions list
function | package | description |
---|---|---|
c() | base | combine values/vectors into a vector |
filter() | dplyr | filter out rows of a data frame according to a logical vector |
lag() | dplyr | lag observations by n periods |
all.equal() | base | check whether two R objects are (nearly) equal, reporting any differences |
class() | base | retrieve class of an R object (e.g. ‘numeric’, ‘list’, ‘data.frame’) |
factor() | base | create a factor |
identical() | base | check if two R objects are identical |
is.data.frame() | base | check if an object is a data frame |
is.na() | base | check if a value is NA/elements of vector are NA |
list() | base | create a list |
max() | base | get maximum of a vector |
mean() | base | get mean of a vector |
min() | base | get minimum of a vector |
sum() | base | get sum of numeric values or a vector |
unique() | base | get unique elements |
arrange() | dplyr | sort values of a data frame according to a variable/combination of variables |
full_join() | dplyr | join matching all observations from both data frames |
group_by() | dplyr | group tibble/data.frame by a factor variable. All further tidyverse operations are performed group-wise |
inner_join() | dplyr | join matching only observations with joining variable in both data frames |
left_join() | dplyr | join matching all observations from the left data frame |
mutate() | dplyr | modify/create a column in a data frame |
select() | dplyr | select columns from a tibble/data frame |
summarise() | dplyr | collapse the dataset to a summary statistic. Usually used with group_by() |
ungroup() | dplyr | remove grouping from a tibble/data frame |
col_factor() | readr | specify column type as factor. Used in readr’s functions. |
read_csv() | readr | read tibble from a csv file |
read_delim() | readr | read tibble from a delimited file |
complete.cases() | stats | return a logical vector indicating which observations are complete |
sd() | stats | Get standard deviation of a vector |
is_tibble() | tibble | check if an object is a tibble |
pivot_longer() | tidyr | reshape data wide → long |
pivot_wider() | tidyr | reshape data long → wide |
head() | utils | print the first n (default 6) rows of a data frame |
read.csv() | utils | read a csv file into a data frame. Specify stringsAsFactors = FALSE to keep all string columns as characters |
6.3 Exercises
- Select true statements about joins and correct the false statements:
A. Full join will always produce a dataframe with more rows than inner join.
B. Inner join will never produce any new missing values.
C. Full join will always produce at least as many rows as left join.
D. Left join is used when we want to ensure that the observations from the left data frame are always matched by a non-missing element from the right data frame.
The remaining exercises will use the student data, which consists of three .csv files: math.csv
, which contains grades achieved by each student in each of the years in a long format, port.csv
, with the same data for Portuguese classes, and info.csv
, which contains demographic variables.
Load the math and port
data to R. Specify the id variable to be a character column.
Load the
info
data into R and join it with the grades datasets. Make sure to keep only the records of the students who have reported grades in both Portuguese and math. Use the suffix
argument of the join function to specify the names of the duplicated variables (G1, G2 and G3) to be _por
for Portuguese grades and _mat
for math grades.
Create a vector called
port_missing
which contains the student IDs that are present in the math
dataset but missing from the port
dataset.
Compare the average performance in Portuguese and Math in the third year between the schools:
Did the student grades improve on average in each subject? Was the improvement better in one school than the other? (HINT: you might want your data in a long format and use the
lag
function introduced at the end of the chapter. Store the data in a new tibble called
, as it will be useful for next exercises).Did the grade improvements vary across schools?
Which form of support -
famsup
orschoolsup
lead to higher average improvement in students’ grades in each subject?Does any of the two schools seem more effective in rising the average grade through school support?
The solutions for the exercises will be available here on 2021-01-07.