11 R Markdown

11.1 Content

One of the more persuasive arguments for using R (over, say, SPSS) is its ability to easily make work reproducible. This means that you could give your RScript to another person, and they would be able to replicate, step by step, your data preparation and analysis, obtaining the same results. This is not only a fundamental part of any scientific enquiry, but also helpful to you, should you wish to replicate your own work at a later stage. Take it from me: after a few months you will have forgotten any data management procedure or steps used in an analysis. How do you achieve this? You can, of course, use annotations in your RScript to explain to others and yourself what you have done in each step. And in fact you should do exactly that as a matter of routine. But we can go a step further than that.

You might have asked yourself when studying the previous Chapters how to get all the great output in the form of Tables and Figures into an essay, or your dissertation. The answer to this is Markdown. It lets you create reproducible essays / articles with great ease, and even has a feature to create your bibliography and take care of your referencing. Intrigued? Then read on!

By the way, this very webpage is also created with R Markdown.

Introduction

So, what is R Markdown? As promised in the introduction,

R Markdown provides a unified authoring framework for data science, combining your code, its results, and your prose commentary. R Markdown documents are fully reproducible and support dozens of output formats, like PDFs, Word files, slideshows, and more. (https://r4ds.had.co.nz/r-markdown.html)

Usage

The use of R Markdown is very flexible and allows you to reveal as much of the coding as you wish. You can simply regard it as a text processing programme with the added benefit of being able to include the output of your analysis seamlessly in the document you produce. All modules with a coding element in PAIS require you to also submit the RScript, or indeed, the Markdown file (this is an .Rmd file). So it is a good idea to start writing your assessments in Markdown for those modules.

Looking ahead, you can use Markdown to “collaborate with other data scientists who are interested in both your conclusions, and how you reached them (i.e. the code).” Or you can see Markdown as “an environment in which to do data science, as a modern day lab notebook where you can capture not only what you did, but also what you were thinking.” (https://r4ds.had.co.nz/r-markdown.html)

The Components of an R Markdown Document

To start with a Markdown document, you need to click “File”, “New File”, and “R Markdown”. R will ask you what type of output you wish to create. We will assume that you want to create a document here, either in pdf format, or MS Word. In what follows, I am creating a pdf (which is preferable, because it doesn’t mess up the formatting), but the process is almost identical for a Word file.

This selection process opens a new window in R with three components:

YAML
Text
Code Chunks

I will take you through these components in turn now.

The YAML

The bit at the top of the window is called YAML, which originally stood for “Yet Another Markup Language” (today the acronym is officially read as “YAML Ain’t Markup Language”). This part determines the formatting and settings of your document. For now, it is OK to accept the defaults (amending the title and author of the document, of course), but it’s worthwhile investing a little more time into this at a later stage to personalise the document to your preferences.
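
For orientation, the YAML of a freshly created document typically looks something like the sketch below. The title, author and date are placeholders of my own; your template will show whatever you entered when creating the document.

```yaml
---
title: "My Essay"
author: "Your Name"
date: "2024-01-01"
output: pdf_document
---
```

The three dashes at the top and bottom mark where the YAML starts and ends; everything below the second set of dashes is the document itself.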

Text

Basically, you can just write the text as you always would. You will have noted that there is no formatting menu appearing as in Word, for example, to make text bold, etc. We will do this with code and I will explain how to do this further below.

R Markdown is not WYSIWYG (What You See Is What You Get). Much rather, R needs to compile your document once you have finished writing it. In this process - which is called “knitting” - R formats the text, executes all the code chunks (see below), and includes their output in the document. Start by compiling the template R has given you when you created the document. For this, simply press the “knit” button in the task bar:

When you knit the document for the first time, R will ask you where to save the document. It is sensible to treat the Markdown document just like an RScript: create a new folder which will serve as the working directory for the document.

The Markdown template is trying to be helpful in that it introduces you to a number of formatting options, such as headings, links, bold setting, etc. But I will take you through these in a more systematic way now. Let’s start with headings.

Headings

You create a heading by preceding it with a hashtag (#). The number of hashtags determines the level of the heading:

# Heading 

## Sub-Heading

### Sub-Sub Heading

As a default, R does not number these. But you can add numbering by changing the YAML slightly:

I have set the output type “pdf_document” to a new line, indented it, and finished the line with a colon. This flags to R that there is more formatting coming for this document type. The next line is self-explanatory, I think.
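
The amended YAML described here would look roughly like the sketch below. The title and author are placeholders, and I am assuming that the “self-explanatory” line is the pdf_document option number_sections, which switches on the numbering of headings.

```yaml
---
title: "My Essay"
author: "Your Name"
output:
  pdf_document:
    number_sections: true
---
```

Note that the indentation matters: number_sections is indented further than pdf_document, which tells R that the option belongs to that output type.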

Note that I have reduced the number of hashtags in the document to one, to make “R Markdown” a level 1 heading.

Emphasis

The word “Knit” is set in bold in the sample document, and this is achieved by sandwiching the word in two asterisks on either side. Here is a list of the most commonly used modes of emphasising text:

*italic*

**bold**

\texttt{courier}

\underline{underline}

Links

If you wish to include a hyperlink, you can just copy and paste it into the document, and Markdown will recognise it as such. URLs look ugly, though, and so it sometimes makes sense to hide them behind a caption. For this, you put the text you wish to be displayed in the document in square brackets and the URL in round brackets directly after:

[Analysing Quantitative Data with R](https://drfloreiche.github.io/)

Lists

To insert a bullet point list, simply start a bullet point with a dash. If you want to create sub-bullet points, you need to indent the line with 4 spaces (in RStudio, where a tab inserts 2 spaces by default, that is 2 tabs).

- 
    - 
    -

Numbered Lists

The same principle applies to numbered lists:

1. 
    a. 
    b. 
2.

Line Breaks

Once you finish a paragraph and you want to start a new one, you need to have one clear line in between.

This will not produce a line break:

line 1
line 2

But this will:

line 1

line 2

Block Quotes

If you are quoting more than 40 words from a source, the quote needs to be indented as a block. To achieve this in Markdown, simply precede the quote with “>”. The block will turn green (only in R, not in your document) when you do so.

Equations

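If you have ever tried to set an equation in MS Word, you probably wanted to frisbee your computer out of the window – it is cumbersome and painful. In R, it’s a breeze. You have two ways to start an equation. Either you wrap it in $ signs (with no space directly after the opening or directly before the closing $ sign, otherwise it will not be recognised as an equation):

$equation$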

or you start an equation environment (if you are doing the LaTeX course you will recognise this):

\begin{equation}
equation
\end{equation}

To give you an example, the command

\begin{equation}
Y = \beta_{0} + \beta_{1} x_{i} + \epsilon
\end{equation}

results in:

\[ Y = \beta_{0} + \beta_{1} x_{i} + \epsilon \tag{11.1} \]

Note that subscripts are produced by underscores (_), and superscripts are produced with a hat (^).
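
As a small illustration of both (my own example, not part of the model above), a squared predictor with an observation index on every term could be written as:

```latex
\begin{equation}
Y_{i} = \beta_{0} + \beta_{1} x_{i}^{2} + \epsilon_{i}
\end{equation}
```

Here `_{i}` sets the subscript i and `^{2}` sets the superscript 2; the curly brackets tell LaTeX which characters belong to the sub- or superscript.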

To suppress the numbering of the equation you need to include an asterisk in the equation environment:

\begin{equation*}
Y = \beta_{0} + \beta_{1} x_{i} + \epsilon
\end{equation*}

\[ Y = \beta_{0} + \beta_{1} x_{i} + \epsilon \]

List of Symbols

Especially for equations you will need to use a range of special characters and symbols. You can find a good compilation of symbols here:

https://latex.wikia.org/wiki/List_of_LaTeX_symbols

Code Chunks

One of the big perks of working with R Markdown is to include code and / or output in the document you are writing. If you want to include R Code in Markdown you need to wrap it in:

```{r}

```

Rather than typing this out every time, there is a handy shortcut:

Mac: Option + Command + I
Windows: Ctrl + Alt + I

Let me give you an example. If you want to calculate the sum of 5 and 3 you type:

```{r}
5+3
```

This will result in:

5+3
[1] 8

There are a number of options available for code chunks, because sometimes you will want to hide the code and only display the result, or you only want to show the code without executing it, etc. You set these options inside the curly brackets at the start of the chunk, after the r.

For example, if you want to display the code, but not execute it, you include `eval=FALSE` in the chunk:

```{r eval=FALSE}
5+3
```

Very often, R packages have messages that have no business in finished documents. To suppress them you type:

```{r message=FALSE}
library(tidyverse)
```

Or if you want a mix of the two:

```{r message=FALSE, eval=FALSE}
library(tidyverse)
```

If you create a plot with R, then you would most likely want to suppress the code, and only display the output. In this case you would call:

```{r echo=FALSE}
# x and y must already exist, e.g. created in an earlier chunk
plot(x, y)
```
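
If most chunks in a document need the same options, you can set them once near the top instead of repeating them in every chunk. The following is a minimal sketch, assuming a chunk placed directly after the YAML; the chunk label setup and the particular defaults are my own choices for illustration, while knitr::opts_chunk$set() is the knitr function that stores the defaults.

```{r setup, include=FALSE}
# Defaults for all following chunks: hide code, package messages and warnings
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```

Any individual chunk can still override these defaults, for example with echo=TRUE in its own header.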

Figures and Graphs

You can also use code chunks to include figures and graphs which you have not created with R. For this you use the knitr package. Note that the file you wish to include needs to be placed in the same working directory as the Rmd file (or in a sub-folder, but then you need to include the path in the code chunk). The option out.width= determines the width of the figure in a percentage of the width of the text.

```{r echo=FALSE, out.width='75%'}
knitr::include_graphics('./filename.png')
```

You should always add a caption to a figure or table so that the reader knows what is being displayed.

```{r echo=FALSE, out.width='75%', fig.cap="\\label{fig:test}Test Caption"}
knitr::include_graphics('./filename.png')
```

Since we have labelled the figure, we can now refer to it in the text as follows:

\ref{fig:test}

So for example this:

```{r echo=FALSE, out.width='75%', fig.cap="\\label{fig:spell}RStudio Task Bar"}
knitr::include_graphics('./spell.png')
```

As we can see in Figure \ref{fig:spell}

turns into

[Image: RStudio task bar]

Figure 11.1: RStudio Task Bar

As we can see in Figure 11.1

Useful Stuff

Let me give you some useful tricks and tweaks that will make your work with Markdown a lot easier.

Spell-Checker

First up is the spell-checker which you can run by clicking the respective icon in the task bar.

Useful Commands

There are some commands that will help you to achieve the layout you want:

New page

\newpage

Centering a Line

\begin{center}
Text to be centred.
\end{center}

Bibliography

Right, now we come to the option to include a bibliography. Automatically. Properly formatted, according to PAIS style guide. In seconds. By simply pressing a button. Sounds good?

Well, there is no free lunch, unfortunately. This feature involves a little bit more coding language and a little work in LaTeX.

The first thing we need to do is to adjust our YAML a little, so that R can do the magic. We specify the citation engine (natbib), the style in which we want the bibliography to be formatted (apalike), and the file from which R shall pull the information for the references we include (R.bib). It is this last item that I will focus on next.
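
A sketch of what such a YAML could look like is given below. The title and author are placeholders; citation_package, biblio-style and bibliography are the R Markdown field names I am assuming correspond to the three settings just described.

```yaml
---
title: "My Essay"
author: "Your Name"
output:
  pdf_document:
    citation_package: natbib
biblio-style: apalike
bibliography: R.bib
---
```

The file name given under bibliography has to match the name of the .bib file exactly, including upper and lower case.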

The .bib file

The R.bib file is a repository containing the information about every source you wish to include in a document. This file is not actually written in R, but in BibTeX, the bibliography format that comes with LaTeX, which you can download and install as follows:

Mac: MacTeX (http://www.tug.org/mactex/)
Windows: MiKTeX (https://miktex.org/)

These contain a complete TeX system with LaTeX itself and editors to write documents. If you want more information, then you can go to https://www.latex-project.org/get/

Open the editor that comes with the distribution, and create a new document with the file ending .bib (not .tex). Save this as “R.bib”. Now, every citation you wish to include in the document needs to have a reference in the .bib file, such as:

@book{grolemund:2016,
  author={Garrett Grolemund and Hadley Wickham},
  title={R for Data Science},
  publisher={O'Reilly Media},
  year={2016}}

The start of this chunk determines what type of source we are dealing with (here: book), followed by a unique identifier which you are free to set however you want (here: grolemund:2016). My own system is to state the surname of the first author and the year of publication, separated by a colon. The remaining items in this chunk are fairly self-explanatory, I think. You will have to do this for each source. This is a little work at the beginning, but you can use the file again and again, and as you add over time, it will become more and more comprehensive. By the way, rather than randomly adding these entries to the .bib file, I order mine alphabetically by surname of author, just as you would in the bibliography proper.
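
For R itself you do not have to type the entry by hand: R can print a ready-made one which you can paste into the .bib file. Here is a minimal sketch using the base R functions citation() and toBibtex(); the object name r_entry is my own. Note that the generated entry comes without an identifier, so you still need to add one directly after the opening curly bracket.

```r
# Ask R for the reference to R itself and convert it to .bib format
r_entry <- toBibtex(citation())

# Print the entry to the console, ready to be copied into R.bib
print(r_entry)
```

The same idea should work for an installed package, for example with citation("stargazer") in place of citation().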

If you want to learn more about LaTeX, there is another Moodle Skills Module called Academic Writing in LaTeX available.

Once you are done, you need to place the .bib file into the working directory of the document. Then you add a level one heading at the very end of the document, called # References. That’s it. R will now produce the bibliography under this heading for you. But how does R know what you have cited?

Citations in Markdown

In order to make the automated citation and bibliography possible, there is a certain way in which you need to refer to the sources in the text. For example, if you wanted to refer to page 361 in the book by Grolemund, then you would type:

Text [@grolemund:2016, p. 361]

Or if you want to refer to multiple places, you could write:

Text [@grolemund:2016, pp. 33-35, 38-39 and *passim*].

When you have mentioned the author in the text already, it is necessary to suppress the author in the citation. You can do that by adding a minus sign in front of the reference:

Grolemund and Wickham write that ... [-@grolemund:2016]

So, how does all of this look in practice? If you deliver the following input:

# Introduction

Text [@grolemund:2016, p. 361]

# References

R turns this into:

Neat, ey?

stargazer

I have little to no patience for students (or academics!) who are fiddling around in MS Word and are using all sorts of fancy options to make their Tables and Figures as unreadable as possible. You can tell that I am quite particular when it comes to the tabular display of statistical output.

When you study for a degree in Political Science, you are acquiring the skills to write professionally about a topic of the discipline, and to analyse a research question within its remit. James A. Stimson puts it very aptly in his article Writing in Political Science when he says:

You are a professional author. Learn to use the tools of authorship or choose a profession for which you are better suited. (p. 10)

Stimson says this in reference to Tables in particular, as there are some principles that any academic author needs to observe. Let me quote again from his article here:

Table design is important, and often done badly. It requires you to think about what the reader knows and wants to know from your work and then very carefully lay out the table to tell the story. (…)

Tables should always be composed so that a reader can pick one up and understand its content, without having read the text. That means it must be fully self-contained, depending on nothing that is explained only in the text. The opposite is also true; a reader should be able to skip the table and understand the analysis completely from the text. (p.10)

To observe most of the principles set out in Stimson’s article (please read it, it’s worth its weight in gold), R has a snazzy package that helps you on the way. It is called stargazer and is pretty much the best invention since sliced bread:

stargazer is an R package that creates LaTeX code, HTML code and ASCII text for well-formatted regression tables, with multiple models side-by-side, as well as for summary statistics tables, data frames, vectors and matrices. (https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf)

stargazer is very easy to use, supports a large number of statistical models and renders the most beautiful tables. Let’s do an example, using the replication data for Fearon and Laitin’s famous 2003 article analysing the determinants of civil war. Let’s load the necessary packages:

library(haven)
library(tidyverse)
library(pROC)
library(stargazer)

do a little data prep:

fearon <- read_dta("data/fearon.dta")

fearon$onset[fearon$onset==4] <- NA

fearon$onset <- as.factor(fearon$onset)

and estimate three different models, including their ROC curves (see Chapter 11):

model1 <- glm(onset ~ gdpenl,
             family = binomial(link = logit),
             na.action = na.exclude,
             data = fearon)

prob_model1 <- predict(model1, type="response")
fearon$prob_model1 <- unlist(prob_model1)

roc_model1 <- roc(fearon$onset, fearon$prob_model1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

model2 <- glm(onset ~ gdpenl + lpopl1,
             family = binomial(link = logit),
             na.action = na.exclude,
             data = fearon)

prob_model2 <- predict(model2, type="response")
fearon$prob_model2 <- unlist(prob_model2)

roc_model2 <- roc(fearon$onset, fearon$prob_model2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

model3 <- glm(onset ~ gdpenl + lpopl1 + lmtnest,
             family = binomial(link = logit),
             na.action = na.exclude,
             data = fearon)

prob_model3 <- predict(model3, type="response")
fearon$prob_model3 <- unlist(prob_model3)

roc_model3 <- roc(fearon$onset, fearon$prob_model3)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Now we use stargazer to put the results into a table. All you have to do is to call the stargazer function and list the models you wish to include in the table. I am suppressing the annoying header here, and am adjusting the font size, too.

```{r results='asis', echo=FALSE, message=FALSE, tab.cap = NULL}
stargazer(model1, model2, model3,
          header=F, 
          font.size = "tiny")
```

Output

You will note that stargazer adds the asterisks that Stimson is so strongly (and correctly) opposed to. The rating of lower p-values with increasing numbers of asterisks suggests a certain goal (like wanting to stay in a five-star hotel), but this goal is nonsense. If you want to know more about this, I recommend an excellent module called “Introduction to Quantitative Political Analysis I” where you learn about the p-value and the Type I and Type II errors. stargazer’s practice of including these asterisks is therefore misguided, but since most journals insist on this nonsense, it is worthwhile to get used to it. It is often necessary to adjust the font size, just as I have here, and you can choose from the following, which are ordered in descending size (in font.size you pass the name without the backslash, as in "tiny" above):

\Huge
\huge
\LARGE
\Large
\large
\normalsize
\small
\footnotesize
\scriptsize
\tiny

This is already looking quite neat, but we are not done yet. Next up is the leftmost column, also called the “stub”, which contains the variable labels.

The usual problem is that the names are too brief to convey what the indicator is. (And remember the rule about being self-contained: if the reader needs to page back to find out what some ambiguous name stands for, you have violated the rule and caused reader impatience.) Abbreviate nothing. And never ever ever use computer variable names to stand for concepts. These are personal code words that convey no meaning to readers. (p.10)

So let’s change them. I will ignore the {r results='asis', echo=F, message=F, tab.cap = NULL} environment now, and only show you the actual stargazer script to be used.

stargazer(model1, model2, model3,
          header=F, 
          font.size = "tiny", 
          covariate.labels = c("GDP per capita (in logged 1985\\$ thousands)",
                               "Population (logged, in thousands)",
                               "Mountainous Terrain (logged \\% of total)"),
          dep.var.labels   = "Onset of Civil War")

Better. Now, the table contains information on some summary statistics at the bottom we are not interested in. But it does not contain information on the ROC curve which we are interested in. Let’s start by removing the unwanted ones:

stargazer(model1, model2, model3,
          header=F, 
          font.size = "tiny", 
          covariate.labels = c("GDP per capita (in logged 1985\\$ thousands)",
                               "Population (logged, in thousands)",
                               "Mountainous Terrain (logged \\% of total)"),
          dep.var.labels   = "Onset of Civil War",
          omit.stat = c("aic", "ll"))

For a full list of abbreviations for statistics, see page 22 of https://cran.r-project.org/web/packages/stargazer/stargazer.pdf. Now we can add the area under the ROC curve (labelled “ROC Curve” in the table).

stargazer(model1, model2, model3,
          header=F, 
          font.size = "tiny", 
          covariate.labels = c("GDP per capita (in logged 1985\\$ thousands)",
                               "Population (logged, in thousands)",
                               "Mountainous Terrain (logged \\% of total)"),
          dep.var.labels   = "Onset of Civil War",
          omit.stat = c("aic", "ll"),
          add.lines = list(c("ROC Curve", auc(roc_model1), 
                              auc(roc_model2), auc(roc_model3))))

Nobody can be petty enough to want this many decimal places, and so let us shorten this to two:

stargazer(model1, model2, model3,
          header=F, 
          font.size = "tiny", 
          covariate.labels = c("GDP per capita (in logged 1985\\$ thousands)",
                               "Population (logged, in thousands)",
                               "Mountainous Terrain (logged \\% of total)"),
          dep.var.labels   = "Onset of Civil War",
          omit.stat = c("aic", "ll"),
          add.lines = list(c("ROC Curve", round(auc(roc_model1),2), 
                              round(auc(roc_model2),2), 
                              round(auc(roc_model3),2))))

You can also control the number of decimal places with the digits = option more generally. For example, if you want to round everything to two decimal places, you would set digits = 2.
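
To see what digits = does without needing the civil war data, you can try it on a small stand-in model. The sketch below uses R’s built-in mtcars data and a linear model purely for illustration (demo_model is my own name); type = "text" makes stargazer print a plain-text table to the console, which is handy for a quick check before knitting.

```r
library(stargazer)

# Stand-in model on R's built-in mtcars data (not the civil war data)
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

# type = "text" prints a plain-text table to the console;
# digits = 2 rounds the reported numbers to two decimal places
stargazer(demo_model, type = "text", digits = 2, header = FALSE)
```

In the actual document you would leave out type = "text", so that stargazer produces the LaTeX code needed for the pdf.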

Lastly, let’s label the Table, so that a reader knows what is being shown.

stargazer(model1, model2, model3,
          header=F, 
          font.size = "tiny", 
          covariate.labels = c("GDP per capita (in logged 1985\\$ thousands)",
                               "Population (logged, in thousands)",
                               "Mountainous Terrain (logged \\% of total)"),
          dep.var.labels   = "Onset of Civil War",
          omit.stat = c("aic", "ll"),
          add.lines = list(c("ROC Curve", round(auc(roc_model1),2), 
                              round(auc(roc_model2),2), 
                              round(auc(roc_model3),2))),
          title = "Determinants of Civil War (Fearon and Laitin, 2003)")

This is a beautiful and informative Table which should be the standard to which you work.

11.2 Summary

YAML - originally “Yet Another Markup Language”; the header that sets the formatting of the document

Functions list

11.3 Exercises

The solutions for the exercises will be available here on 2022-03-11.