3 Data structures
3.1 Content
In the previous chapter you became familiar with the most common data structure in R programming: the vector. In this section, you will be introduced to some more advanced data structures that are often used in R, in particular the data.frame, which is the most common way of storing and manipulating data in R.
Generally, the best way to examine any R object is the str() function, which returns the contents of the object along with its class. For example, calling it on a numeric vector with three elements prints a line starting with num [1:3]: num indicates that it is a numeric vector and [1:3] tells us that its index ranges from 1 to 3.
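As a minimal sketch of such a check (the vector x is a hypothetical example, not taken from the chapter's data):

```r
# A hypothetical numeric vector with three elements
x <- c(1, 2, 3)

# str() prints the type, the index range and the first values
str(x)
#  num [1:3] 1 2 3
```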
Data Frames
In the previous chapter’s exercises you manipulated data related to some basic development indicators of several countries. When we’re dealing with multiple variables represented by multiple vectors, it’s often very useful to store them together as one entity - a data.frame. Data frames can simply be thought of as tables, where each column is a vector with its own unique name. In this case, we can store the information about countries in a data frame called dev_data.
country <- c("Argentina", "Georgia", "Mexico",
"Philippines", "Turkey", "Ukraine")
eys <- c(17.6, 15.4, 14.3, 12.7, 16.4, 15.1)
mys <- c(10.6, 12.8, 8.6, 9.4, 7.7, 11.3)
lexp <- c(76.5, 73.6, 75, 71.1, 77.4, 72)
gni <- c(17611, 9570, 17628, 9540, 24905, 7994)
dev_data <- data.frame(country, eys, mys, lexp, gni)
We can use the head function to see the first 6 rows of the data (in the toy example we have above it might seem unnecessary, but it is useful to get an overview of all the variables when the data consists of potentially thousands of rows).
head(dev_data)
country eys mys lexp gni
1 Argentina 17.6 10.6 76.5 17611
2 Georgia 15.4 12.8 73.6 9570
3 Mexico 14.3 8.6 75.0 17628
4 Philippines 12.7 9.4 71.1 9540
5 Turkey 16.4 7.7 77.4 24905
6 Ukraine 15.1 11.3 72.0 7994
The str()
function is also very useful to get an overview of the variables included in a dataframe:
str(dev_data)
'data.frame': 6 obs. of 5 variables:
$ country: chr "Argentina" "Georgia" "Mexico" "Philippines" ...
$ eys : num 17.6 15.4 14.3 12.7 16.4 15.1
$ mys : num 10.6 12.8 8.6 9.4 7.7 11.3
$ lexp : num 76.5 73.6 75 71.1 77.4 72
$ gni : num 17611 9570 17628 9540 24905 ...
To access a column stored in a dataframe, you can use the $
operator.
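A short sketch of column access with $ might look as follows. The data frame is rebuilt here with only two of its columns so that the snippet runs on its own.

```r
# Reduced copy of dev_data (only the columns needed for this illustration)
dev_data <- data.frame(
  country = c("Argentina", "Georgia", "Mexico",
              "Philippines", "Turkey", "Ukraine"),
  gni = c(17611, 9570, 17628, 9540, 24905, 7994)
)

# Extract one column as a plain vector
dev_data$gni
# [1] 17611  9570 17628  9540 24905  7994
```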
Similarly, we can use the same operator to create a new column:
dev_data$log_gni <- log(dev_data$gni)
dev_data$log_gni
[1] 9.776279 9.166388 9.777244 9.163249 10.122824 8.986447
As in the case of vectors, data frames can be indexed to retrieve values stored at specific positions. Since a data frame is a table, each position in it is associated with two indices: the first references the row and the second the column. For example, dev_data[2, 3] retrieves the value from the second row and the third column of dev_data. Note that this is identical to dev_data$mys[2], because mys is the third column in dev_data.
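The two ways of reaching the same cell can be checked side by side. This sketch rebuilds the first three columns of dev_data so that it is self-contained.

```r
# First three columns of dev_data, in their original order
dev_data <- data.frame(
  country = c("Argentina", "Georgia", "Mexico",
              "Philippines", "Turkey", "Ukraine"),
  eys = c(17.6, 15.4, 14.3, 12.7, 16.4, 15.1),
  mys = c(10.6, 12.8, 8.6, 9.4, 7.7, 11.3)
)

by_position <- dev_data[2, 3]   # second row, third column
by_name <- dev_data$mys[2]      # second element of the mys column

by_position
# [1] 12.8
by_position == by_name
# [1] TRUE
```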
By leaving one of the indices empty, we can also retrieve an entire row or column of a data frame:
dev_data[1, ] #get first row
country eys mys lexp gni log_gni
1 Argentina 17.6 10.6 76.5 17611 9.776279
Data frames can also be indexed with integer vectors. Such indexing will always return a smaller data frame. For example, to retrieve rows 1 and 5 from columns 2 and 3, we can do:
Similarly, character vectors referencing the column names can be used to subset a data frame. To achieve a similar result to the one above, one could also type:
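Both forms of indexing can be sketched as follows, again on a reduced copy of dev_data. The object names by_index and by_label are introduced only for this illustration.

```r
dev_data <- data.frame(
  country = c("Argentina", "Georgia", "Mexico",
              "Philippines", "Turkey", "Ukraine"),
  eys = c(17.6, 15.4, 14.3, 12.7, 16.4, 15.1),
  mys = c(10.6, 12.8, 8.6, 9.4, 7.7, 11.3)
)

# Rows 1 and 5, columns 2 and 3, selected by position
by_index <- dev_data[c(1, 5), c(2, 3)]

# The same rows, with the columns selected by name
by_label <- dev_data[c(1, 5), c("eys", "mys")]

by_index                       # eys and mys for Argentina and Turkey
all.equal(by_index, by_label)  # both calls give the same data frame
```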
We can also use logical indexing to subset data frames. Recall from the previous chapter that we can check which values of a given vector satisfy a certain condition by:
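A sketch of such a check on the gni column (the threshold of 10000 matches the one used in the subsetting example that follows):

```r
gni <- c(17611, 9570, 17628, 9540, 24905, 7994)
dev_data <- data.frame(gni)

# One TRUE/FALSE value per row
dev_data$gni > 10000
# [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE
```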
We can then use the output generated by the above code to index the dev_data data frame and obtain the rows with GNI per capita larger than 10000:
dev_data[dev_data$gni > 10000, ]
country eys mys lexp gni log_gni
1 Argentina 17.6 10.6 76.5 17611 9.776279
3 Mexico 14.3 8.6 75.0 17628 9.777244
5 Turkey 16.4 7.7 77.4 24905 10.122824
There are many useful functions that work in combination with data frames. Below, there are several examples:
The with command allows us to evaluate column names in the context of a given data frame. This means that we do not have to reference the data frame name whenever we use one of its columns. Suppose we wanted to calculate the UN’s Education Index as in the previous section’s exercises and assign its values to a new column in dev_data, dev_data$edu_ind. This could be done by:
However, in many circumstances this will require you to reference the name of the data frame you are using multiple times, often making the code long and unreadable. To avoid it, it’s often useful to do:
The with function takes the name of the data frame as its first argument and the operation you want to perform as the second argument.
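The two approaches described above can be compared in a short sketch. The formula used here, the mean of eys / 18 and mys / 15, is an assumption about the Education Index from the previous chapter; it does reproduce the edu_ind values shown in the str() output later in this section.

```r
dev_data <- data.frame(
  eys = c(17.6, 15.4, 14.3, 12.7, 16.4, 15.1),
  mys = c(10.6, 12.8, 8.6, 9.4, 7.7, 11.3)
)

# Explicit version: the data frame name is repeated for every column
dev_data$edu_ind <- (dev_data$eys / 18 + dev_data$mys / 15) / 2

# with() version: columns are referenced by name only
edu_with <- with(dev_data, (eys / 18 + mys / 15) / 2)

all.equal(dev_data$edu_ind, edu_with)
# [1] TRUE
round(edu_with[1:3], 3)
# [1] 0.842 0.854 0.684
```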
Similarly, to subset a data frame by multiple variables, the subset() command can be used:
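As an illustration, the sketch below keeps the countries with GNI per capita above 10000 and more than 15 expected years of schooling. The two conditions are chosen arbitrarily for this example.

```r
dev_data <- data.frame(
  country = c("Argentina", "Georgia", "Mexico",
              "Philippines", "Turkey", "Ukraine"),
  eys = c(17.6, 15.4, 14.3, 12.7, 16.4, 15.1),
  gni = c(17611, 9570, 17628, 9540, 24905, 7994)
)

# Column names are evaluated inside dev_data, as with with()
rich_educated <- subset(dev_data, gni > 10000 & eys > 15)
rich_educated   # the rows for Argentina and Turkey
```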
Factors
If you run str(dev_data) in a version of R older than 4.0.0, you will notice that the country variable is a vector type that we haven’t encountered earlier: a factor (from R 4.0.0 onwards, data.frame() keeps it as a character vector, as in the output above). Factors are a specific type of vector used to store values that come from a prespecified set, called factor levels. For example, suppose we have two character vectors storing the names of students and their year. We can use factor() to create a factor vector from a character vector. This can be done for any other type of vector as well.
name <- c("Thomas","James","Kate","Nina","Robert","Andrew","John")
year_ch <- c("Freshman","Freshman","Junior","Sophmore","Freshman","Senior","Junior")
year_ch
[1] "Freshman" "Freshman" "Junior" "Sophmore" "Freshman" "Senior" "Junior"
year <- factor(year_ch)
year
[1] Freshman Freshman Junior Sophmore Freshman Senior Junior
Levels: Freshman Junior Senior Sophmore
We can view the unique levels of the factor using the levels()
function:
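For the year factor created above, a call to levels() could look like this (the factor is rebuilt so that the snippet runs on its own):

```r
year_ch <- c("Freshman", "Freshman", "Junior", "Sophmore",
             "Freshman", "Senior", "Junior")
year <- factor(year_ch)

# Unique levels, sorted alphabetically by default
levels(year)
# [1] "Freshman" "Junior"   "Senior"   "Sophmore"
```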
A crucial difference between factor and character vectors is that the former have an underlying integer representation. This means that their levels have an ordering, which is alphabetical by default. We can see that by using the coercion function as.numeric on the year factor.
year
[1] Freshman Freshman Junior Sophmore Freshman Senior Junior
Levels: Freshman Junior Senior Sophmore
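The integer codes behind the factor can be inspected as in the following sketch, which rebuilds year with its default, alphabetical levels:

```r
year_ch <- c("Freshman", "Freshman", "Junior", "Sophmore",
             "Freshman", "Senior", "Junior")
year <- factor(year_ch)

# Each value is replaced by the position of its level
as.numeric(year)
# [1] 1 1 2 4 1 3 2
```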
Note that the ordering of the values corresponds with the ordering obtained by the levels()
function. This matters in some circumstances (such as when using factor variables in regression models, discussed in the Linear Regression section of the course). It’s a good practice to explicitly pass the factor levels to the factor()
constructor. For example, in our case, “Sophmore” comes as the last value of the factor, even though it would make more sense for it to be second. Explicit creation of the factor levels can be seen below:
year_ch <- c("Freshman","Freshman","Junior",
"Sophmore","Freshman","Senior","Junior")
year <- factor(year_ch, levels = c("Freshman","Sophmore","Junior","Senior"))
We can now see that the ordering of the levels is different, and so is the underlying numeric representation of the factor:
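A sketch of this comparison, using the explicitly ordered factor defined above:

```r
year_ch <- c("Freshman", "Freshman", "Junior", "Sophmore",
             "Freshman", "Senior", "Junior")
year <- factor(year_ch,
               levels = c("Freshman", "Sophmore", "Junior", "Senior"))

levels(year)
# [1] "Freshman" "Sophmore" "Junior"   "Senior"

# "Sophmore" is now coded as 2 rather than 4
as.numeric(year)
# [1] 1 1 3 2 1 4 3
```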
Note that we cannot change the value of a factor vector to any other than the pre-specified levels:
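One way to see this is sketched below. The value "Graduate" is chosen here only as an example of a string that is not among the levels.

```r
year_ch <- c("Freshman", "Freshman", "Junior", "Sophmore",
             "Freshman", "Senior", "Junior")
year <- factor(year_ch,
               levels = c("Freshman", "Sophmore", "Junior", "Senior"))

# R issues a warning about an invalid factor level and stores NA instead
year[1] <- "Graduate"

is.na(year[1])
# [1] TRUE
```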
The warning message returned by R means that the value we were trying to assign to the factor is not one of the predefined levels (i.e. “Freshman”, “Junior”, “Senior” and “Sophmore”) and thus an NA missing value was generated.
However, if we know that a level that currently has no values attached to it will be needed in the future, NAs can be avoided by explicitly creating unused levels when constructing the factor vector.
year_ch <- c("Freshman","Freshman","Junior",
"Sophmore","Freshman","Senior","Junior")
year <- factor(year_ch, levels = c("Freshman","Sophmore","Junior","Senior", "Graduate"))
Here, we have created the variable with 5 levels: Freshman, Sophmore, Junior, Senior, Graduate, even though only 4 of them are actual values of the factor. As a result, we can assign the value “Graduate” without producing NAs. The relevance of having empty factor levels will become apparent in the next part of the book when discussing Cross-Tabulation.
year[1] <- "Graduate"
year
[1] Graduate Freshman Junior Sophmore Freshman Senior Junior
Levels: Freshman Sophmore Junior Senior Graduate
We can also rename the levels of an existing factor, by using the levels<-
command. This can be done either to specific levels of a factor…
year <- factor(year_ch, levels = c("Freshman","Sophmore","Junior","Senior"))
levels(year)[1] <- "Fresher"
year
[1] Fresher Fresher Junior Sophmore Fresher Senior Junior
Levels: Fresher Sophmore Junior Senior
…or to all the levels:
levels(year) <- c("First","Second","Third","Final")
year
[1] First First Third Second First Final Third
Levels: First Second Third Final
This way, all the values of the factor are changed very quickly.
Finally, there’s some confusion about the difference between factor()
and as.factor()
functions. In many contexts, these can be used equivalently, since both create a factor vector from a numeric or a character vector. However, some important differences include:
- factor() allows you to pass the factor levels explicitly at construction, whereas as.factor() assigns them by default.
- The behaviour of the two functions is different when they are passed factors with empty levels. For example, let’s create the year factor as earlier and only keep the first three values. In this case, the Sophmore and Senior levels are unused.
year_char <- c("Freshman","Freshman","Junior",
"Sophmore","Freshman","Senior","Junior")
year <- factor(year_char, levels = c("Freshman","Sophmore","Junior","Senior"))
year <- year[1:3]
year
[1] Freshman Freshman Junior
Levels: Freshman Sophmore Junior Senior
Passing the year
vector to as.factor
will not change anything in the vector’s structure:
However, using factor()
constructor on an existing factor vector is a convenient way to drop unused levels (when it’s desirable):
- as.factor() tends to be quicker when numeric or character vectors are passed to it. The two commands also treat NA levels slightly differently. You can read more about it in this Stack Overflow post.
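The difference in how the two functions handle unused levels, described in the second point above, can be sketched as follows:

```r
year_char <- c("Freshman", "Freshman", "Junior", "Sophmore",
               "Freshman", "Senior", "Junior")
year <- factor(year_char,
               levels = c("Freshman", "Sophmore", "Junior", "Senior"))
year <- year[1:3]

# as.factor() returns an existing factor unchanged: all four levels remain
levels(as.factor(year))
# [1] "Freshman" "Sophmore" "Junior"   "Senior"

# factor() rebuilds the factor and drops the unused levels
levels(factor(year))
# [1] "Freshman" "Junior"
```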
Finally, some R functions, such as the data.frame constructor, converted all character vectors to factors by default in versions of R before 4.0.0. You can check how your version behaves by examining the dev_data data frame we created earlier:
str(dev_data)
'data.frame': 6 obs. of 7 variables:
$ country: chr "Argentina" "Georgia" "Mexico" "Philippines" ...
$ eys : num 17.6 15.4 14.3 12.7 16.4 15.1
$ mys : num 10.6 12.8 8.6 9.4 7.7 11.3
$ lexp : num 76.5 73.6 75 71.1 77.4 72
$ gni : num 17611 9570 17628 9540 24905 ...
$ log_gni: num 9.78 9.17 9.78 9.16 10.12 ...
$ edu_ind: num 0.842 0.854 0.684 0.666 0.712 ...
In the output above, country appears as chr, which is what R 4.0.0 and later produce. In older versions, country would instead be a factor with 6 levels, one for each country name. This doesn’t make too much sense, as the column is unlikely to have any repeating values. To avoid this behaviour regardless of the R version, we can set the stringsAsFactors optional argument in the data.frame function explicitly to FALSE. This way, all the character vectors remain character variables in the data frame.
dev_data <- data.frame(country, eys, mys, lexp, gni, stringsAsFactors = FALSE)
str(dev_data)
'data.frame': 6 obs. of 5 variables:
$ country: chr "Argentina" "Georgia" "Mexico" "Philippines" ...
$ eys : num 17.6 15.4 14.3 12.7 16.4 15.1
$ mys : num 10.6 12.8 8.6 9.4 7.7 11.3
$ lexp : num 76.5 73.6 75 71.1 77.4 72
$ gni : num 17611 9570 17628 9540 24905 ...
Reading and writing the data
Reading from CSV
While so far we’ve created small and simple datasets by manually typing them into the scripts, the usual way of loading data into R is through external files. The most common format used to store data for R analysis is a CSV file, which stands for Comma Separated Values. This essentially means that the data is represented as a text file, in which values are separated by commas to indicate their relative positions - for example, a csv file with 5 columns will have 4 commas to separate them in each row.
In the example below, we read in data on Human Development Indicators for 209 countries for 2018 obtained from the UN Human Development Reports. You can download the file used in the example from here.
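A minimal sketch of such a call is shown here. It assumes that dev2018.csv has been saved in a folder called data inside the working directory; the file.exists() guard is added only so that the snippet does not fail when the file is absent.

```r
# Relative path: assumes a "data" folder in the current working directory
path <- "data/dev2018.csv"

if (file.exists(path)) {
  dev <- read.csv(path)
  head(dev)
} else {
  message("File not found at ", path, "; download it first")
}
```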
In the above example, the first argument specifies the path to the file as a string, i.e. enclosed in quotation marks. The file can be read:
1. using an absolute path - for example dev <- read.csv("C:/Users/yourusername/Documents/dev2018.csv") in Windows or dev <- read.csv("/Users/yourusername/Documents/dev2018.csv") in macOS. In this case, you need to provide the full path to where the file is located on the computer.
2. using a relative path, as in the above example. In this case, R will search for the file starting from your current working directory. The working directory is simply the folder on your computer in which R looks for data. RStudio usually sets one default working directory (this can be changed under Tools -> Global Options -> Set Default Working Directory). This means that every time you open RStudio or restart your R session (as described in Chapter 1), the working directory is set to this default. You can also change the working directory manually by executing the setwd() function from your script or the console.
You can also get your current working directory by using the getwd()
function:
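The sketch below prints the current working directory, switches to a temporary folder and then switches back. The use of tempdir() is only for illustration; any existing folder would do.

```r
old_wd <- getwd()   # remember where we started
old_wd              # path printed here depends on your computer

setwd(tempdir())    # change to a temporary folder
getwd()

setwd(old_wd)       # restore the original working directory
```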
While some users tend to include setwd("path/to/project") at the beginning of their scripts, this is potentially problematic, as whenever you move your data or script to another folder, errors are likely to occur. Therefore, it is good practice to always set the working directory to the location of your R source script and keep the data in the same folder as your source script. This can be done by choosing Session > Set Working Directory > To Source File Location in RStudio.
Note that in this case, it is assumed that you have selected “Set Working Directory” > “To Source File Location” from the “Session” tab in RStudio, as discussed in the Introduction, and that the directory of the source file has a folder called “data” in which the dev2018.csv file is stored. Alternatively, dev <- read.csv("dev2018.csv") would read the file directly from your working directory. You could also use dev <- read.csv("C:/Users/yourusername/Documents/dev2018.csv") in Windows or dev <- read.csv("/Users/yourusername/Documents/dev2018.csv") in macOS to read the data file from an arbitrary folder using its absolute path. Similarly to the data.frame constructor, we can also use the stringsAsFactors argument to ensure all character variables are read as strings.
You can also save data to .csv files by using the write.csv function, which takes the data frame as its first argument and the string specifying the path to which you want to save the file as the second argument. For example, suppose we want to keep only the first 40 rows of the data and store it in a separate file.
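A self-contained sketch of this step is given below. Because the real dev data may not be loaded, a hypothetical stand-in data frame with 100 rows is used, and the file is written to a temporary folder; the names dev_small and out_path are also introduced just for this example.

```r
# Hypothetical stand-in for the dev data frame
dev <- data.frame(id = 1:100, value = sqrt(1:100))

# Keep the first 40 rows only
dev_small <- dev[1:40, ]

# Write to a temporary location; row.names = FALSE avoids an extra index column
out_path <- file.path(tempdir(), "dev_small.csv")
write.csv(dev_small, out_path, row.names = FALSE)

# Read the file back to confirm what was stored
nrow(read.csv(out_path))
# [1] 40
```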
Reading from other formats
While csv is the most common format, data often comes in many other variants - common examples include Stata’s .dta files or SPSS’ .sav, as well as the .xlsx Excel format. Some R packages offer functionalities precisely to deal with such files.
So far, we have only used the built-in functionalities offered by R. While their range is pretty extensive and the ones covered in this course are only the tip of the iceberg, much more than that is offered by user-made packages, which offer new functions useful for specific tasks. The official R packages are available through CRAN. To use a package it needs to be installed first and then loaded. For example, to use an example package named foo, you should first run install.packages("foo")
to download the package files from CRAN and install it and then put library(foo)
in your R script to load it into R. Note that while installation has to be done only once, you have to load the library every time you use it. That’s why you should always put the library calls at the top of your R script. If you use a function from that package without loading it first, your R script execution will fail! Please also note that you pass the package name as a string (i.e. in quotation marks) to install.packages, but without them to library.
Coming back to our example, we can use the R package haven
to load Stata, SPSS and SAS files. You can see an example below:
Similarly, the data can be written using:
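The following sketch writes a small hypothetical data frame to a Stata file with write_dta and reads it back with read_dta. It assumes the haven package is installed; the requireNamespace() guard simply skips the example otherwise, and the temporary file path is chosen for illustration.

```r
# Small hypothetical data frame used only for this round trip
df <- data.frame(country = c("Argentina", "Georgia"),
                 gni = c(17611, 9570))

if (requireNamespace("haven", quietly = TRUE)) {
  dta_path <- file.path(tempdir(), "dev_small.dta")

  haven::write_dta(df, dta_path)     # save in Stata format
  back <- haven::read_dta(dta_path)  # load it again
  print(back)
}
```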
Other alternatives offered by the haven package include read_sav or read_xpt
. Other packages useful for reading unusual data types include readxl
for reading Excel files and foreign
for a broader choice of file types.
Missing values
As mentioned earlier, the NA
missing value constant is particularly important in R. Real-life data that you are likely to deal with most of the time when using R in practice is often imperfect and missingness should be addressed as one of the first steps of the analysis process.
The is.na() command can be used to determine whether a value of an R object is missing. It returns TRUE for each element which is missing.
To count NAs in an R object we can leverage the fact that TRUE values are also interpreted as 1 and use the sum function:
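Both steps can be sketched on a small hypothetical vector:

```r
# Hypothetical vector with two missing values
x <- c(4, NA, 7, NA, 1)

# TRUE marks the missing positions
is.na(x)
# [1] FALSE  TRUE FALSE  TRUE FALSE

# TRUE counts as 1, so the sum is the number of NAs
sum(is.na(x))
# [1] 2
```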
You can also verify whether an object contains NA
s using the anyNA
function. Let’s check if the HDI data we have loaded contains any missing values:
The function returns TRUE. Therefore there is some missingness in the data.
Another useful function for missing data analysis is complete.cases
. As the name suggests, given a data frame it returns a logical vector with TRUE
for each row which doesn’t contain missing values. We can verify which observations are the cause of the data missingness:
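Since the HDI file may not be at hand, the sketch below applies anyNA and complete.cases to a small hypothetical data frame instead. All values in it are made up for illustration.

```r
# Hypothetical data frame with missing values in two rows
toy <- data.frame(
  country = c("A", "B", "C"),
  gni = c(100, NA, 300),
  lexp = c(70, 65, NA)
)

anyNA(toy)                 # is anything missing at all?
# [1] TRUE

complete.cases(toy)        # which rows have no missing values?
# [1]  TRUE FALSE FALSE

toy[!complete.cases(toy), ]   # rows B and C, which contain NAs
```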
Lists
The final key R data structure covered in this section is the list. Similarly to data frames, lists can be thought of as containers to store other data structures.2 However, unlike data frames, they are less strict in terms of their contents - a list can store vectors of different lengths, data frames and even other lists. Lists are created with the list() constructor.
my_list <- list(names = c("Tom","James","Tim"), values = 1:20)
my_list
$names
[1] "Tom" "James" "Tim"
$values
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
You can extract elements from a list using their names or their numeric index. To extract a single element from a list, double square brackets [[ ]] are used, rather than the single brackets used for vectors.
If a list is indexed with single brackets, it returns a one-element list, rather than the object stored in it:
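The contrast between the two kinds of brackets can be seen in this sketch, which reuses the my_list object defined above:

```r
my_list <- list(names = c("Tom", "James", "Tim"), values = 1:20)

# Double brackets return the stored object itself
my_list[["names"]]
# [1] "Tom"   "James" "Tim"
class(my_list[[2]])
# [1] "integer"

# Single brackets return a list containing that object
class(my_list["names"])
# [1] "list"
length(my_list["names"])
# [1] 1
```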
You can also extract elements from a list using the $ operator, similarly to data.frames. Finally, you can assign values to lists in the same way as in the case of vectors or data.frames:
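A brief sketch of both operations follows. The new element flag and the replacement name "Jim" are hypothetical and serve only as examples.

```r
my_list <- list(names = c("Tom", "James", "Tim"), values = 1:20)

# Extraction with $
my_list$names
# [1] "Tom"   "James" "Tim"

# Assignment: add a new element and modify an existing one
my_list$flag <- TRUE
my_list[["names"]][2] <- "Jim"

names(my_list)
# [1] "names"  "values" "flag"
my_list$names
# [1] "Tom" "Jim" "Tim"
```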
3.2 Summary
- Data Frames are one of the most common data structures in R. You can think about them as an Excel spreadsheet with tables with rows and columns or as a list of variables (vectors of equal length), each with its unique name. Each value in a data frame has two indexes - one for the row number and one for the column number. For example, df[2,3] retrieves the value in the second row from the third column.
- Factors are a special kind of vectors that can only take a pre-specified set of values, determined when the factor is created.
- csv files are the most common way of storing data in R. You can load the data from them using read.csv and save the data using write.csv.
- working directory is the folder in your computer where R looks for files by default. You can check it using the getwd() function, change it using setwd() or set it to the location where your current script is located by selecting Session > Set Working Directory > To Source File Location.
- packages are sets of new functions developed for R by external developers, extending its functionalities. You can install them using install.packages and load them using the library function.
- missing values are marked in R by the NA token. There are many useful functions created to detect missing values, such as is.na, complete.cases or anyNA.
- lists are another type of R data structure. They can be thought of as containers, which can be used to store arbitrary elements at each position.
3.3 Exercises
- The following code returns an error. Why? Check what happens if we set b to 1:5 instead of 1:3. Explain this behaviour.
What is the difference between character and factor vectors in R? In what situations might you prefer one over the other?
To complete this exercise, load the
dev2018.csv
data into R.
What proportion of the rows are complete?
Store all the non-missing rows in a data.frame called dev_clean.
For dev_clean, compute the HDI following the method outlined in the previous chapter.
Use indexing to retrieve:
- countries with HDI greater than 0.7 or GNI per capita greater than 10000
- 10 countries with the largest GNI
- 10 countries with the shortest life expectancy at birth
- the development data for Poland
- countries with Education Index higher than Life Expectancy Index
- The UN categorizes the countries into 4 groups based on their HDI value - very high human development \(HDI \geq 0.8\), high human development \(0.8 > HDI \geq 0.7\), medium human development \(0.7 > HDI \geq 0.55\) and low human development \(0.55 > HDI\). Based on these thresholds, create a list called hdi_groups, with element names "vhigh", "high", "med", "low", with each containing a data frame only with observations corresponding to its respective HDI group. How many rows (as a fraction of total data.frame size) does each of these groups consist of?
- The following operation returns a warning and the result is not quite as you would expect. Why? How would you replace the first element of the list with the 1:5 sequence so that the warning doesn’t appear? Name the two ways this could be done.
The solutions for the exercises will be available here on 2020-11-12.
More specifically, the data.frame class is a special type of list - you can verify that by running the typeof function with a data frame as input.