5 Key Programming Concepts

5.1 Content

To make data analysis more efficient, it is important to understand a few key programming concepts. In the first part of this section we discuss for loops and if statements. These are so-called “control flow statements”, which are common to almost all programming languages. The second part discusses the creation and basic usage of functions. Finally, the third part goes through sapply() and the related family of functions, a common tool used in R to apply a function to each element of an object.

Control flow statements

For loops

For loops are essentially a way of telling the programming language “perform the operations I ask you to do N times”. A for loop in R begins with a for() statement, followed by an opening curly brace { on the same line, which opens the body of the loop. After this, usually on a new line, you place the code you want to execute. In the last line, you close the for loop with a closing curly brace }. You can execute the for loop by placing the cursor either on the for statement (first line) or on the closing brace (last line) and running it like any other code. Below, you can see a for loop printing the string "Hello world" 5 times:

for(i in 1:5) {
  print("Hello world")
}
[1] "Hello world"
[1] "Hello world"
[1] "Hello world"
[1] "Hello world"
[1] "Hello world"

The i in the for statement is a variable that sequentially takes each value of the object (usually a vector) specified on the right-hand side of the in keyword. In the majority of cases, the object is a sequence of integers, as in the example below, where i takes the value of each element of the vector 1:5 in turn and prints it.

for(i in 1:5) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
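
As a small illustrative sketch (the countries vector below is made up for this example and is not taken from the dev data), the object on the right-hand side of in does not have to be a sequence of integers. Here the loop variable takes each string of a character vector in turn:

```r
# Hypothetical character vector, used only for illustration
countries <- c("France", "Chad", "Comoros")

name_lengths <- integer()   # empty vector to collect the results
for (country in countries) {
  # country holds one string per iteration
  name_lengths <- c(name_lengths, nchar(country))
}
name_lengths
```

The loop runs three times, once per element, and name_lengths ends up holding the number of characters in each name.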

A for loop could be used to add a constant to each element of a vector:

x <- c(4, 5, 1, 2, 9, 8, 0, 5, 3)
x
[1] 4 5 1 2 9 8 0 5 3
#for all integers between 1 and length of vector x:
for(i in 1:length(x)) { 
  x[i] <- x[i] + 5
}
x
[1]  9 10  6  7 14 13  5 10  8

However, in R this is redundant because of vectorization (see the section on vectors in chapter 2). The loop above is equivalent to:

x <- c(4, 5, 1, 2, 9, 8, 0, 5, 3)
x + 5
[1]  9 10  6  7 14 13  5 10  8

This is not only simpler, but also more efficient.
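
The sketch below checks that both approaches give the same result. It also uses seq_along(x) in place of 1:length(x), which is a common defensive habit rather than something this chapter requires: for an empty vector, 1:length(x) counts down from 1 to 0 and the loop body would run twice, whereas seq_along(x) yields an empty sequence. The names looped, vectorized and empty are chosen only for this illustration.

```r
x <- c(4, 5, 1, 2, 9, 8, 0, 5, 3)

# Loop version, iterating over the positions of the vector
looped <- x
for (i in seq_along(looped)) {
  looped[i] <- looped[i] + 5
}

# Vectorized version
vectorized <- x + 5

# Both approaches produce the same vector
same_result <- identical(looped, vectorized)
same_result

# Behaviour on an empty vector
empty <- numeric()
bad_index <- 1:length(empty)    # the sequence 1, 0
good_index <- seq_along(empty)  # an empty sequence
```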

Another, more practical application of the for loop is to examine all columns of a data frame for missing values:

dev <- read.csv("data/un_data/dev2018.csv",
                stringsAsFactors = FALSE)
missing <- numeric() #create empty numeric vector
for (i in 1:length(dev)){
  missing[i] <- sum(is.na(dev[,i])) #get sum of missing for ith column
  names(missing)[i] <- names(dev)[i] #name it with ith column name
}
missing
country     eys     gni    lexp     mys 
      0       1       3       3       5 

From this, we can see that there are 0 missing values in the country name, 1 missing value in the expected years of schooling variable, 3 missing values each in gni and life expectancy, and 5 missing values in mean years of schooling.
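
If you do not have the dev file at hand, the same pattern can be tried on a small made-up data frame. The sketch below assumes a hypothetical toy data set with three columns. It also pre-allocates the result vector and copies the column names once, which is an alternative to naming the elements inside the loop.

```r
# Hypothetical toy data standing in for the dev data set
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  eys = c(10.1, NA, 12.3, 15.0),
  gni = c(1746, NA, NA, 30000),
  stringsAsFactors = FALSE
)

missing <- numeric(ncol(toy))  # one slot per column, filled with zeros
names(missing) <- names(toy)   # name the slots after the columns
for (i in seq_along(toy)) {
  missing[i] <- sum(is.na(toy[[i]]))  # number of NAs in the ith column
}
missing
```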

While this is a bit more useful than the previous example, R still offers a shorthand method for such problems, which is discussed in more detail in the last part of this chapter. In general, due to vectorization, for loops are rarely needed in simple data analysis in R. However, they are a core element of programming as such, so it’s important to understand them. In fact, vectorized operations still rely on loops in the background; those loops simply run in R's faster, compiled internal code rather than in R code you write yourself.

If statements

If statements are another crucial programming concept. They allow you to perform a computation conditionally on a logical statement. In other words, an operation is performed or skipped depending on the value of a logical expression. If statements in R are constructed in the following way:

if (logical_expression) {
  operations
}

Here, logical_expression must be an expression that evaluates to a single logical value, for example x > 5, country == "France" or is.na(x). The operations are performed if and only if logical_expression evaluates to TRUE. The simplest possible example would be:

x <- 2
if (x > 0) {
  print("the value is greater than 0")
}
[1] "the value is greater than 0"

x <- -2
if (x > 0) {
  print("the value is greater than 0")
}
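
One case worth keeping in mind is a missing value in the condition. A comparison involving NA evaluates to NA rather than TRUE or FALSE, and if stops with an error when given NA. A minimal sketch of one way to guard against this, using a variable name (is_positive) made up for this example:

```r
x <- NA
comparison <- x > 0   # NA, neither TRUE nor FALSE

# if (x > 0) { ... } would stop with an error here,
# so we first check that x is not missing
is_positive <- FALSE
if (!is.na(x) && x > 0) {
  is_positive <- TRUE
}
is_positive
```

Because && evaluates its right-hand side only when the left-hand side is TRUE, the comparison x > 0 is never reached for a missing x.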

If is naturally complemented by the else clause, i.e. the operations that should be performed otherwise. The general form of such a statement is:

if (logical_expression) {
  operations
} else {
  other_operations
}

In this case, R first checks if the logical_expression evaluates to TRUE, and if it doesn’t, performs the other_operations. For example:

x <- -2
if (x > 0) {
  print("the value is greater than 0")
} else {
  print("the value is less than or equal to 0")
}
[1] "the value is less than or equal to 0"

Finally, else if allows you to provide another condition to be evaluated. The general form of such a statement would be:

if (logical_statement) { 
  operation
} else if (other_logical_statement) {
  other_operation
} else {
  yet_another_operation
}

Here, R first checks the logical_statement. If it’s TRUE, it performs the operation; if it’s FALSE, it proceeds to check the other_logical_statement. If the second one is TRUE, it performs the other_operation, and if it’s FALSE, it performs yet_another_operation. An extension of the previous example:

x <- 2
if (x > 0) {
  print("The value is positive")
} else if (x < 0) {
  print("The value is negative") 
} else {
  print("The value is 0")
}
[1] "The value is positive"
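
To see all three branches in action, the same chain can be placed inside a for loop. This is only an illustrative sketch; the vector of test values and the names signs and label are made up for the example.

```r
signs <- character()   # empty vector to collect one label per value
for (x in c(2, -2, 0)) {
  if (x > 0) {
    label <- "positive"
  } else if (x < 0) {
    label <- "negative"
  } else {
    label <- "zero"
  }
  signs <- c(signs, label)
}
signs
```

Each value triggers exactly one branch: the first condition that is TRUE wins, and the final else catches whatever remains.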

If-else statements can be used to conditionally replace values. For example, suppose that we want to create a variable that is 1 when country is France and 0 otherwise. We could do that by:

dev$france <- 0
for (i in 1:nrow(dev)) {
  if (dev$country[i] == "France") {
    dev$france[i] <- 1
  }
}

dev$france[dev$country == "France"]
[1] 1

Again, because of vectorization, R offers a shorthand for this, through the ifelse() function:

dev$france <- ifelse(dev$country == "France", 1, 0)
dev$france[dev$country == "France"]
[1] 1

When you look at the documentation ?ifelse, you can see that it takes three arguments: test, yes and no. The test argument is the logical condition, the same as logical_statement in the if statement, with the subtle difference that it can evaluate to a logical vector rather than a single logical value. The yes argument gives the values returned where test is TRUE and the no argument gives the values returned where test is FALSE. You can see this in the example below:

ifelse(c(TRUE, FALSE, FALSE, TRUE), "yes", "no")
[1] "yes" "no"  "no"  "yes"
ifelse(c(TRUE, FALSE, FALSE, TRUE), 1, 0)
[1] 1 0 0 1
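
Calls to ifelse() can also be nested to handle more than two outcomes, mirroring the if, else if, else chain shown earlier. The sketch below uses a made-up vector; note that a missing value in test yields a missing value in the result.

```r
x <- c(2, -2, 0, NA)

# The inner ifelse() is only used where x > 0 is FALSE
sign_label <- ifelse(x > 0, "positive",
                     ifelse(x < 0, "negative", "zero"))
sign_label
```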

Functions

R is often described as a functional programming language - as you have already seen, almost all of the operations performed are done using functions. It is also possible to create our own, custom functions by combining other functions and data structures. This is done using the function keyword. The general syntax of a function looks as follows:

function_name <- function(arg1, arg2) {
  output <- operations(arg1, arg2)
  output
}

As with any R object, you can use almost any name instead of function_name. Arguments are separated by commas (arg1, arg2 in the above example). These are the objects you pass to your function and on which it performs some operations. Again, the arguments can have arbitrary names, but you need to use them consistently within the function. Finally, most functions return a value: this is the value of the last expression evaluated within the function (output in the above example).

After creating the function we can run it, exactly the same way as we would with any of R’s built-in functions. A simple example could return the number of missing values in an object:

count_na <- function(x) {
  sum(is.na(x))
}

count_na(dev$mys)
[1] 5
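
The returned value can also be stated explicitly with return(). The sketch below, using hypothetical function names and a made-up vector, shows that both styles give the same result; which one to use is largely a matter of taste.

```r
# Implicit return: the value of the last expression is returned
count_na_implicit <- function(x) {
  sum(is.na(x))
}

# Explicit return: the same value, stated with return()
count_na_explicit <- function(x) {
  return(sum(is.na(x)))
}

v <- c(1, NA, 3, NA)
n_implicit <- count_na_implicit(v)
n_explicit <- count_na_explicit(v)
c(n_implicit, n_explicit)
```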

We could also implement our own summary statistics function, similar to describe() discussed in the previous chapter:

summary_stats <- function(x) {
  if (is.numeric(x)) {
    list(Mean = mean(x, na.rm = TRUE),
         SD = sd(x, na.rm = TRUE),
         IQR = IQR(x, na.rm = TRUE))
  } else if (is.character(x)) {
    list(Length = length(x),
         Mean_Nchar = mean(nchar(x)))
  } else if (is.factor(x)) {
    list(Length = length(x),
         Nlevels = length(levels(x)))
  }
}

Let’s walk through the above function. Given a vector x, the function:

  1. Checks whether x is a numeric vector. If so, returns a list of its mean, standard deviation and interquartile range.

  2. Else, checks if x is a character vector. If so, returns a list containing its length and average number of characters.

  3. Else, checks if x is a factor. If so, returns a list containing its length and number of levels.

We can see how it works below:

summary_stats(c(1, 2, 3, 10))
$Mean
[1] 4

$SD
[1] 4.082483

$IQR
[1] 3
summary_stats(dev$country)
$Length
[1] 195

$Mean_Nchar
[1] 9.902564
summary_stats(as.factor(dev$country))
$Length
[1] 195

$Nlevels
[1] 195

Keyword arguments

Many of the functions used in R come with so-called default arguments - this was already mentioned in the section on sorting. When defining our own functions, we can make use of that functionality as well. For example, the count_na function can be modified in the following way:

count_na <- function(x, proportion = TRUE) {
  num_na <- sum(is.na(x))
  if (proportion == TRUE) {
    num_na/length(x)
  } else {
    num_na
  }
}

The proportion argument controls whether the function returns the number of NAs as a count or as a proportion of the entire vector:

count_na(dev$gni)
[1] 0.01538462
count_na(dev$gni, proportion = TRUE) #same as above
[1] 0.01538462
count_na(dev$gni, proportion = FALSE)
[1] 3
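
Arguments can be supplied by position or by name. The following sketch repeats the definition on a made-up vector so that it can be run on its own; the condition is written as if (proportion), which behaves the same as proportion == TRUE for a single TRUE or FALSE value.

```r
count_na <- function(x, proportion = TRUE) {
  num_na <- sum(is.na(x))
  if (proportion) {
    num_na / length(x)
  } else {
    num_na
  }
}

v <- c(1, NA, 3, NA)   # hypothetical vector with two missing values

default_call <- count_na(v)                          # default: proportion
positional_call <- count_na(v, FALSE)                # second argument by position
named_call <- count_na(proportion = FALSE, x = v)    # by name, in any order
c(default_call, positional_call, named_call)
```

Naming the arguments makes a call longer but easier to read, especially when a function has several optional arguments.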

There are a couple of reasons why functions are frequently used when analyzing data:

  1. To avoid repetition - often, you need to perform the same operation repeatedly, sometimes on a data frame with tens or hundreds of columns, or even on multiple data frames. Wrapping the operation in a function lets you avoid re-writing the same code over and over again (which always increases the chance of an error occurring).

  2. To enhance clarity - when you perform a long and complicated series of operations on a dataset, it’s often much easier to break it down into functions. Then, when you come back to your code after a long time, it is much easier to read a call such as recode_missing_values(data), with the recode_missing_values function defined somewhere else, as you don’t need to go through your code step by step, but only understand what particular functions return.

  3. To improve performance - while most of the operations we’ve seen in R take fractions of a second, larger data can often lead to longer computation times. Functions can be combined with other tools to make computation more elegant and quicker - some of these methods are discussed in the next section.

Sapply

Recall the code we used to check each column of our data frame for missingness in the for loops section:

missing <- numeric() #create empty numeric vector
for (i in 1:length(dev)){
  missing[i] <- sum(is.na(dev[,i])) #get sum of missing for ith column
  names(missing)[i] <- names(dev)[i] #name it with ith column name
}

We could re-write it using our new knowledge of functions, such that:

count_na <- function(x) {
  sum(is.na(x))
}

missing <- numeric()
for (i in 1:length(dev)) {
  missing[i] <- count_na(dev[,i])
  names(missing)[i] <- names(dev)[i]
}
missing
country     eys     gni    lexp     mys  france 
      0       1       3       3       5       0 

While this may look a bit fancier, it actually takes more code to perform this operation, and it doesn’t differ much in terms of clarity. The exact same result can be achieved using the sapply() function. sapply() takes two main arguments: an R object, such as a vector or a data frame, and a function. It then applies the function to each element of this object (i.e. each value in the case of vectors, each column/variable in the case of data frames).

sapply(dev, count_na)
country     eys     gni    lexp     mys  france 
      0       1       3       3       5       0 

The result is exactly the same as in the previous case. sapply() applied the count_na function to each column of the dev dataset.

When using short, simple functions, sapply() can be even more concise, as we can define our function without giving it a name. In the example below, instead of defining count_na separately, we define it directly within the sapply() call (i.e. inside the parentheses). This yields the same result.

sapply(dev, function(x) sum(is.na(x)))
country     eys     gni    lexp     mys  france 
      0       1       3       3       5       0 
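
The same calls can be tried without the dev file. The sketch below assumes a small made-up data frame (toy) and also shows two further uses: applying a function to each value of a plain vector, and passing an extra argument (here na.rm = TRUE) on to the applied function after the function name.

```r
# Hypothetical numeric data frame used only for illustration
toy <- data.frame(eys = c(10.1, NA, 12.3),
                  gni = c(1746, 4057, NA))

# Anonymous function applied to each column
na_counts <- sapply(toy, function(x) sum(is.na(x)))

# Extra arguments after the function are passed on to it
col_means <- sapply(toy, mean, na.rm = TRUE)

# On a plain vector, the function is applied to each value
roots <- sapply(c(1, 4, 9), sqrt)

na_counts
col_means
roots
```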

Consider the function below. What do you expect it to return? Try going through each element of the code separately. You can check how the rowSums command works by typing ?rowSums into the R console.

quartile <- function(x) {
  quantiles <- quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE)
  comparisons <- sapply(quantiles, function(y) y <= x)
  rowSums(comparisons) + 1
}

The function takes a vector as input and computes three quantiles of its values: 25%, 50% and 75%. You may recall from the previous chapter that quantiles are cut points that divide a variable into ranges containing equal proportions of the data set. The resulting quantiles vector consists of three values, corresponding to the three quartile cut points. We then use sapply on these three values to compare each of them with every value of the x vector. As a result, we obtain an n x 3 matrix, where n is the length of x: one row for each value of x and one column for each quantile. For each value of x we thus get three logical values. Each of them is TRUE when the corresponding value of x is greater than or equal to the quantile and FALSE when it is lower than the quantile. We can then sum the results by row, using rowSums. This gives a vector with values of 0, 1, 2 and 3. Its value is 0 if the corresponding value of x is less than all three quantiles, 1 if it is greater than or equal to only the 0.25 quantile, 2 if it is also greater than or equal to the 0.5 quantile and 3 if it is greater than or equal to all of them. Finally, we add 1 to each, so that they correspond to true quartile numbers (1st quartile, rather than 0th quartile, etc.).
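
To make the intermediate steps concrete, here is a small worked sketch on a made-up vector of the integers 1 to 8. The steps of the function are repeated outside of it so that each intermediate object can be inspected.

```r
quartile <- function(x) {
  quantiles <- quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE)
  comparisons <- sapply(quantiles, function(y) y <= x)
  rowSums(comparisons) + 1
}

x <- 1:8   # hypothetical input vector

# Step 1: the three cut points (2.75, 4.5 and 6.25 for this vector)
quantiles <- quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE)

# Step 2: one row per value of x, one column per cut point
comparisons <- sapply(quantiles, function(y) y <= x)
dim(comparisons)

# Step 3: count the cut points each value reaches, then add 1
result <- quartile(x)
result
```

The values 1 and 2 fall below every cut point and land in quartile 1, while 7 and 8 reach all three cut points and land in quartile 4.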

We can then use the split function, which takes a data frame and a vector as input and splits the data frame into several parts, each with the same value of the splitting variable. As a result, we obtain dev_split, a list which stores 4 data frames, each containing only the countries in the respective quartile of expected years of schooling.

dev_split <- split(dev, quartile(dev$eys))
head(dev_split[[1]])
                    country  eys  gni lexp mys france
1               Afghanistan 10.1 1746 64.5 3.9      0
14               Bangladesh 11.2 4057 72.3 6.1      0
27             Burkina Faso  8.9 1705 61.2 1.6      0
33 Central African Republic  7.6  777 52.8 4.3      0
34                     Chad  7.5 1716 54.0 2.4      0
38                  Comoros 11.2 2426 64.1 4.9      0
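
The behaviour of split() is easiest to see on a tiny example. The sketch below assumes a made-up data frame with a grouping column; the names toy, parts and group_sizes are chosen only for this illustration.

```r
# Hypothetical data frame with a grouping variable
toy <- data.frame(id = 1:6,
                  group = c("a", "b", "a", "c", "b", "a"),
                  stringsAsFactors = FALSE)

# One data frame per distinct value of the grouping vector
parts <- split(toy, toy$group)

names(parts)                       # the group labels
group_sizes <- sapply(parts, nrow) # number of rows in each part
group_sizes
parts[["a"]]                       # rows belonging to group "a"
```

The result is a list whose elements are named after the values of the splitting variable, which is why dev_split[[1]] above returns the countries in the first quartile.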

You can then look at descriptive statistics of each of the quartiles using:

sapply(dev_split, summary)
$`1`
   country               eys              gni             lexp            mys            france 
 Length:47          Min.   : 5.000   Min.   :  777   Min.   :52.80   Min.   :1.600   Min.   :0  
 Class :character   1st Qu.: 8.700   1st Qu.: 1611   1st Qu.:60.80   1st Qu.:3.700   1st Qu.:0  
 Mode  :character   Median : 9.700   Median : 2318   Median :64.30   Median :4.850   Median :0  
                    Mean   : 9.415   Mean   : 3579   Mean   :63.89   Mean   :4.861   Mean   :0  
                    3rd Qu.:10.550   3rd Qu.: 3731   3rd Qu.:67.00   3rd Qu.:6.075   3rd Qu.:0  
                    Max.   :11.200   Max.   :17796   Max.   :75.10   Max.   :9.800   Max.   :0  
                                     NA's   :1                       NA's   :1                  

$`2`
   country               eys             gni              lexp            mys             france 
 Length:50          Min.   :11.30   Min.   :   660   Min.   :58.90   Min.   : 3.100   Min.   :0  
 Class :character   1st Qu.:11.80   1st Qu.:  4232   1st Qu.:68.03   1st Qu.: 6.500   1st Qu.:0  
 Mode  :character   Median :12.30   Median :  6903   Median :71.50   Median : 7.850   Median :0  
                    Mean   :12.22   Mean   : 10788   Mean   :70.39   Mean   : 7.869   Mean   :0  
                    3rd Qu.:12.70   3rd Qu.: 11578   3rd Qu.:73.83   3rd Qu.: 9.475   3rd Qu.:0  
                    Max.   :13.00   Max.   :110489   Max.   :80.10   Max.   :11.600   Max.   :0  
                                                     NA's   :2       NA's   :2                   

$`3`
   country               eys             gni             lexp            mys             france 
 Length:47          Min.   :13.10   Min.   : 3317   Min.   :63.90   Min.   : 5.500   Min.   :0  
 Class :character   1st Qu.:13.65   1st Qu.:10694   1st Qu.:74.53   1st Qu.: 8.600   1st Qu.:0  
 Mode  :character   Median :14.30   Median :14356   Median :76.05   Median : 9.900   Median :0  
                    Mean   :14.19   Mean   :22644   Mean   :75.45   Mean   : 9.883   Mean   :0  
                    3rd Qu.:14.70   3rd Qu.:26054   3rd Qu.:76.88   3rd Qu.:11.200   3rd Qu.:0  
                    Max.   :15.10   Max.   :99732   Max.   :82.10   Max.   :12.600   Max.   :0  
                                    NA's   :1       NA's   :1       NA's   :1                   

$`4`
   country               eys             gni             lexp            mys            france    
 Length:50          Min.   :15.20   Min.   : 9570   Min.   :72.40   Min.   : 7.70   Min.   :0.00  
 Class :character   1st Qu.:15.68   1st Qu.:24906   1st Qu.:77.25   1st Qu.:10.43   1st Qu.:0.00  
 Mode  :character   Median :16.35   Median :34918   Median :81.20   Median :12.25   Median :0.00  
                    Mean   :16.79   Mean   :35322   Mean   :79.66   Mean   :11.55   Mean   :0.02  
                    3rd Qu.:17.40   3rd Qu.:45698   3rd Qu.:82.38   3rd Qu.:12.70   3rd Qu.:0.00  
                    Max.   :22.10   Max.   :83793   Max.   :84.70   Max.   :14.10   Max.   :1.00  

While working in R and looking for help online, you may stumble upon other variants of the sapply() function. Essentially, all R functions with apply in their name serve the same purpose: applying a function to each element of an object. lapply() is a less user-friendly version of sapply(), which always returns a list, not a vector. vapply() forces the user to specify the type of the output, which makes its behaviour more predictable and can make it slightly faster. tapply() applies a function to the values of a vector within groups determined by another variable - a similar procedure to what we did using split() and sapply(), but in fewer steps.
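
The differences between these variants can be sketched on a small made-up data frame (toy below is hypothetical and only loosely modelled on the dev data):

```r
# Hypothetical data used only for illustration
toy <- data.frame(eys = c(10, 12, 16, 18),
                  gni = c(1000, 3000, 30000, 50000),
                  group = c("low", "low", "high", "high"),
                  stringsAsFactors = FALSE)
numeric_cols <- toy[c("eys", "gni")]

# lapply(): always returns a list, one element per column
as_list <- lapply(numeric_cols, mean)

# vapply(): like sapply(), but the expected type and length of
# each result (here one number) must be declared up front
as_vector <- vapply(numeric_cols, mean, numeric(1))

# tapply(): applies the function to a vector within groups
by_group <- tapply(toy$gni, toy$group, mean)

as_list
as_vector
by_group
```

If a function applied through vapply() returns something other than the declared type, R stops with an error instead of silently changing the shape of the output.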

5.2 Summary

- For loops allow you to perform the same operation multiple times over a range of values of one variable. They are constructed using for(i in vector). Their use in R is relatively rare due to vectorization.

- If statements control whether an operation is performed depending on the value of a logical condition. To conditionally modify values of vectors, use ifelse(test, yes, no).

- Functions can be created by the user to combine multiple operations into shorter pieces of code, which helps to avoid repetition. They take one or many arguments and return one value.

- sapply() is a function used for applying a function to each element of a vector or each column of a data frame. You may find other versions of it, with apply in their name, which perform the same task with slight alterations.

Functions list

| function | package | description |
|---|---|---|
| count_na() | .GlobalEnv (user-defined) | count missing values in a vector, optionally as a proportion |
| quartile() | .GlobalEnv (user-defined) | assign each value of a vector to its quartile |
| summary_stats() | .GlobalEnv (user-defined) | return summary statistics of a vector depending on its type |
| c() | base | combine values/vectors into a vector |
| names() | base | retrieve names of a list/vector |
| as.factor() | base | coerce a vector to factor |
| ifelse() | base | return a or b depending on the value of test |
| is.character() | base | check if a vector is character |
| is.factor() | base | check if a vector is of class ‘factor’ |
| is.na() | base | check if a value is NA/elements of a vector are NA |
| is.numeric() | base | check if a vector is numeric |
| length() | base | get number of elements in a vector or list |
| levels() | base | get levels of a factor |
| list() | base | create a list |
| mean() | base | get mean of a vector |
| nchar() | base | get number of characters in a string |
| nrow() | base | get number of rows of a data frame |
| numeric() | base | initialize a numeric vector |
| print() | base | print object to the console |
| rowSums() | base | get sums of a data frame by rows |
| sapply() | base | apply function to each element of a list |
| split() | base | split list based on a function/vector |
| sum() | base | get sum of numeric values or a vector |
| IQR() | stats | obtain the inter-quartile range of a vector |
| quantile() | stats | obtain empirical quantiles of a vector |
| sd() | stats | get standard deviation of a vector |
| head() | utils | print first n (default 6) rows of the data |
| read.csv() | utils | read a csv file to data frame. Specify stringsAsFactors = FALSE to keep all string columns as characters |

5.3 Exercises

  1. Suppose we pass a data frame to the summary_stats function. What would the function return? Why?

  2. Use the summary_stats function to summarize each variable from the iris dataset. You can load it using data(iris).

  3. Use a for loop to create a scatter plot of the Sepal.Width and Sepal.Length attributes from the iris dataset, with each flower species (specified by the Species variable) having a different marker and color. To do that, use the skeleton code below.

data(iris)
plot(-99, -99, xlim = c(min(iris$Sepal.Width), max(iris$Sepal.Width)),
     ylim = c(min(iris$Sepal.Length), max(iris$Sepal.Length)))
for (...) {
  points(x = , y = , col = , pch = )
}

  4. Create a function called detect_outliers that takes a vector and a quantile threshold as arguments and returns the indices of the values that can be considered outliers given this threshold (i.e. lie above the nth quantile or below the (100 - n)th quantile).

  5. Extend this function so that it returns rows of the data frame that contain outliers in any of the (numerical) variables.

  6. Apply the function from the above exercise to the dev dataset. Which countries can be considered outliers?

  7. Recall the quartile function from the examples above. Can you extend it so that:

quartile <- function(x) {
  quantiles <- quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE)
  comparisons <- sapply(quantiles, function(y) y <= x)
  rowSums(comparisons) + 1
}
  • it splits a variable into an arbitrary number of ranges with equal proportions (for example into deciles).

  • it returns a sensible default when a value of the vector is missing. What could such “sensible default” be? Make it the default value when specifying function arguments.

  • Try applying your new function to the dev dataset and splitting it into parts using split. You may then compare the descriptive statistics of each part using the lapply function.

The solutions for the exercises will be available here on 2021-01-07.