Worksheet Week 1

Self-Assessment Questions1

  1. What do cross-tabulations do?
  2. Can we use continuous variables for cross-tabulations?
  3. What are the strengths and weaknesses of cross-tabulations?
  4. Why do we calculate the \(\chi^2\)-value as \(\chi^{2} = \Sigma \frac{(f_{o}-f_{e})^{2}}{f_{e}}\) ?
  5. How does the \(\chi^2\) distribution differ from the t- and normal distribution?

Please stop here and don’t go beyond this point until we have compared notes on your answers.


Calculations by Hand

I have given you an example of a cross-tabulation in the lecture. Consider the following Table:

Calculate the Expected Values and fill in the following table:

  • Calculate the \(\chi^{2}\)-value
  • How many degrees of freedom does this table have? Why?
  • Using the \(\chi^2\) Table, what is the p-value?
  • Are mode of transport and year of study independent in the population?

Cross-Tabulations in R

If you have joined PO12Q without suffering through PO11Q, then please work through these two worksheets before proceeding with the present material:

Data Set

  • We are working with the World Development Indicators this week. Data are taken from World Bank (2024), Boix et al. (2018), and Marshall & Gurr (2020). For more information on the democracy coding by Boix et al. please download the corresponding paper.
  • Download the data WDI_PO12Q.csv and place it in a new working directory for this week
  • The variables we will be working with today are as follows:
Table 1: WDI Codebook
variable label
Country Name Country Name
Country Code Country Code
year year
democracy 0 = Autocracy, 1 = Dictatorship (Boix et al., 2018)
gdppc GDP per capita (constant 2010 US$)
gdpgrowth Absolute growth of per capita GDP to previous year (constant 2010 US Dollars)
enrl_gross School enrollment, primary (% gross)
enrl_net School enrollment, primary (% net)
agri Employment in agriculture (% of total employment) (modeled ILO estimate)
slums Population living in slums (% of urban population)
telephone Fixed telephone subscriptions (per 100 people)
internet Individuals using the Internet (% of population)
tax Tax revenue (% of GDP)
electricity Access to electricity (% of population)
mobile Mobile cellular subscriptions (per 100 people)
service Services, value added (% of GDP)
oil Oil rents (% of GDP)
natural Total natural resources rents (% of GDP)
literacy Literacy rate, adult total (% of people ages 15 and above)
prim_compl Primary completion rate, total (% of relevant age group)
infant Mortality rate, infant (per 1,000 live births)
hosp Hospital beds (per 1,000 people)
tub Incidence of tuberculosis (per 100,000 people)
health_ex Current health expenditure (% of GDP)
ineq Income share held by lowest 10%
unemploy Unemployment, total (% of total labor force) (modeled ILO estimate)
lifeexp Life expectancy at birth, total (years)
urban Urban population (% of total population)
polity5 Combined Polity V score

Loading the Data

setwd("~/PO12Q/Seminars/PO12Q_Seminar_Week 1")
wdi <- read.csv("WDI_PO12Q.csv")

You can copy the code from this page by hovering over the code chunk and clicking the icon in the top-right hand corner. You can then paste it into your RScript.

Guided Example

  • We are now going to use the WDI, and produce a crosstab of Life expectancy at birth, total (years) by GDP per capita (constant 2010 US$), using the variables lifeexp and gdppc
  • We first need to recode both variables into factor variables, as these are continuous.
  • First up is lifeexp
  • Recoding: as a refresher from PO11Q, we are specifying the new data frame as the old one, then we add a pipe, and call the function mutate. Therein, we create a new variable called lifecat, which will be an ordered factor, cutting the variable life at the stated intervals, and labeling these levels accordingly.
library(tidyverse)

wdi <- wdi %>% 
  mutate(lifecat=
           ordered(
             cut(lifeexp, breaks=c(0,77,84), 
                 labels=c("low","high"))))

Make a habit of adding a note underneath each code chunk in your RScript (with a ‘#’) in which you translate the code into plain English.

  • Next up is gdppc which we are recoding into a factor variable with three levels:
wdi <- wdi %>% 
  mutate(gdpcat=
           ordered(
             cut(gdppc, breaks=c(0,3861.5,25260, 150000), 
                 labels=c("low","medium", "high"))))
  • Let us now see whether the level of GDP has an influence on life expectancy
  • State the null and the alternative hypothesis (directional)
  • Run the cross-tabulation, by calling:
wdi_table <- with(wdi, table(gdpcat, lifecat))
wdi_table
        lifecat
gdpcat   low high
  low     69    0
  medium  54   16
  high     3   27
  • Now perform a chi-squared test, using the command2
Xsq <- chisq.test(wdi_table, correct=FALSE)
Xsq

    Pearson's Chi-squared test

data:  wdi_table
X-squared = 89.702, df = 2, p-value < 2.2e-16

As on PO11Q, I have prepared flashcards to help you learn R functions. These have their own section each week in this companion, but are also available on the more general POQ Flashcard Page.

  • Are the results statistically significant? At what level?
  • What do these results mean for our hypotheses?
  • Produce the table of expected frequencies, using the command
round(Xsq$expected, 2)
        lifecat
gdpcat     low  high
  low    51.44 17.56
  medium 52.19 17.81
  high   22.37  7.63
  • You can compare that to the observed ones:
round(Xsq$observed, 2)
        lifecat
gdpcat   low high
  low     69    0
  medium  54   16
  high     3   27

You can see that this is identical to producing the cross-tabulation in the first place:

wdi_table
        lifecat
gdpcat   low high
  low     69    0
  medium  54   16
  high     3   27
  • In plain English: What have we found out?

Exercises

  1. Let us find out whether the completion of primary school influences unemployment rates.
    1. State the null and directional alternative hypothesis for this test.

    2. Create a new variable primary_fac using the prim_compl variable. Cut it into three categories “low”, “medium”, and “high”, cutting prim_compl at its first quartile, and its mean.

    3. Apply the same procedure to unemploy, creating a new variable called unemp_fac.

    4. Create a cross-tabulation assessing the dependence of unemployment on primary completion rate. unemp_fac.

    5. Test whether the dependence is statistically significant. unemp_fac.

  2. Repeat steps 1a) to 1e) for two variables of your own choice.

Homework for Week 2

  • Read the items marked “essential” on the reading list (see Talis)
  • Work through this and next week’s “Methods, Methods, Methods” Sections.
  • Work through this week’s flashcards to familiarise yourself with the relevant R functions.
  • Find an example for each NEW function and apply it in R to ensure it works
  • Complete the Week 1 Moodle Quiz

Solutions

You can find the Solutions in the Downloads Section.



  1. Some of the content of this worksheet is taken from Reiche (forthcoming).↩︎

  2. I am using the correct=FALSE option here to reproduce the \(\chi^{2}\)-value you would get if you calculated this by hand. Technically, \(\chi^{2}\) is only an approximation of the hypergeometric distribution which would deliver an exact test. You can get the precise value by applying Yates’ continuity correction with correct=TRUE.↩︎