Worksheet Week 1
Self-Assessment Questions1
- What do cross-tabulations do?
- Can we use continuous variables for cross-tabulations?
- What are the strengths and weaknesses of cross-tabulations?
- Why do we calculate the \(\chi^2\)-value as \(\chi^{2} = \Sigma \frac{(f_{o}-f_{e})^{2}}{f_{e}}\) ?
- How does the \(\chi^2\) distribution differ from the t- and normal distribution?
Please stop here and don’t go beyond this point until we have compared notes on your answers.
Calculations by Hand
I have given you an example of a cross-tabulation in the lecture. Consider the following Table:
Calculate the Expected Values and fill in the following table:
- Calculate the \(\chi^{2}\)-value
- How many degrees of freedom does this table have? Why?
- Using the \(\chi^2\) Table, what is the p-value?
- Are mode of transport and year of study independent in the population?
Cross-Tabulations in R
If you have joined PO12Q without suffering through PO11Q, then please work through these two worksheets before proceeding with the present material:
Data Set
- We are working with the World Development Indicators this week. Data are taken from World Bank (2024), Boix et al. (2018), and Marshall & Gurr (2020). For more information on the democracy coding by Boix et al. please download the corresponding paper.
- Download the data WDI_PO12Q.csv and place it in a new working directory for this week
- The variables we will be working with today are as follows:
variable | label |
---|---|
Country Name | Country Name |
Country Code | Country Code |
year | year |
democracy | 0 = Autocracy, 1 = Dictatorship (Boix et al., 2018) |
gdppc | GDP per capita (constant 2010 US$) |
gdpgrowth | Absolute growth of per capita GDP to previous year (constant 2010 US Dollars) |
enrl_gross | School enrollment, primary (% gross) |
enrl_net | School enrollment, primary (% net) |
agri | Employment in agriculture (% of total employment) (modeled ILO estimate) |
slums | Population living in slums (% of urban population) |
telephone | Fixed telephone subscriptions (per 100 people) |
internet | Individuals using the Internet (% of population) |
tax | Tax revenue (% of GDP) |
electricity | Access to electricity (% of population) |
mobile | Mobile cellular subscriptions (per 100 people) |
service | Services, value added (% of GDP) |
oil | Oil rents (% of GDP) |
natural | Total natural resources rents (% of GDP) |
literacy | Literacy rate, adult total (% of people ages 15 and above) |
prim_compl | Primary completion rate, total (% of relevant age group) |
infant | Mortality rate, infant (per 1,000 live births) |
hosp | Hospital beds (per 1,000 people) |
tub | Incidence of tuberculosis (per 100,000 people) |
health_ex | Current health expenditure (% of GDP) |
ineq | Income share held by lowest 10% |
unemploy | Unemployment, total (% of total labor force) (modeled ILO estimate) |
lifeexp | Life expectancy at birth, total (years) |
urban | Urban population (% of total population) |
polity5 | Combined Polity V score |
Loading the Data
You can copy the code from this page by hovering over the code chunk and clicking the icon in the top-right hand corner. You can then paste it into your RScript.
Guided Example
- We are now going to use the WDI, and produce a crosstab of Life expectancy at birth, total (years) by GDP per capita (constant 2010 US$), using the variables
lifeexp
andgdppc
- We first need to recode both variables into factor variables, as these are continuous.
- First up is
lifeexp
- Recoding: as a refresher from PO11Q, we are specifying the new data frame as the old one, then we add a pipe, and call the function
mutate
. Therein, we create a new variable calledlifecat
, which will be an ordered factor, cutting the variablelife
at the stated intervals, and labeling these levels accordingly.
library(tidyverse)
wdi <- wdi %>%
mutate(lifecat=
ordered(
cut(lifeexp, breaks=c(0,77,84),
labels=c("low","high"))))
Make a habit of adding a note underneath each code chunk in your RScript (with a ‘#’) in which you translate the code into plain English.
- Next up is
gdppc
which we are recoding into a factor variable with three levels:
wdi <- wdi %>%
mutate(gdpcat=
ordered(
cut(gdppc, breaks=c(0,3861.5,25260, 150000),
labels=c("low","medium", "high"))))
- Let us now see whether the level of GDP has an influence on life expectancy
- State the null and the alternative hypothesis (directional)
- Run the cross-tabulation, by calling:
wdi_table <- with(wdi, table(gdpcat, lifecat))
wdi_table
lifecat
gdpcat low high
low 69 0
medium 54 16
high 3 27
- Now perform a chi-squared test, using the command2
Xsq <- chisq.test(wdi_table, correct=FALSE)
Xsq
Pearson's Chi-squared test
data: wdi_table
X-squared = 89.702, df = 2, p-value < 2.2e-16
As on PO11Q, I have prepared flashcards to help you learn R functions. These have their own section each week in this companion, but are also available on the more general POQ Flashcard Page.
- Are the results statistically significant? At what level?
- What do these results mean for our hypotheses?
- Produce the table of expected frequencies, using the command
- You can compare that to the observed ones:
You can see that this is identical to producing the cross-tabulation in the first place:
- In plain English: What have we found out?
Exercises
- Let us find out whether the completion of primary school influences unemployment rates.
State the null and directional alternative hypothesis for this test.
Create a new variable
primary_fac
using theprim_compl
variable. Cut it into three categories “low”, “medium”, and “high”, cuttingprim_compl
at its first quartile, and its mean.Apply the same procedure to
unemploy
, creating a new variable calledunemp_fac
.Create a cross-tabulation assessing the dependence of unemployment on primary completion rate.
unemp_fac
.Test whether the dependence is statistically significant.
unemp_fac
.
- Repeat steps 1a) to 1e) for two variables of your own choice.
Homework for Week 2
- Read the items marked “essential” on the reading list (see Talis)
- Work through this and next week’s “Methods, Methods, Methods” Sections.
- Work through this week’s flashcards to familiarise yourself with the relevant R functions.
- Find an example for each NEW function and apply it in R to ensure it works
- Complete the Week 1 Moodle Quiz
Some of the content of this worksheet is taken from Reiche (forthcoming).↩︎
I am using the
correct=FALSE
option here to reproduce the \(\chi^{2}\)-value you would get if you calculated this by hand. Technically, \(\chi^{2}\) is only an approximation of the hypergeometric distribution which would deliver an exact test. You can get the precise value by applying Yates’ continuity correction withcorrect=TRUE
.↩︎