Worksheet Week 8
Self-Assessment Questions
- Compare bivariate regression and multiple regression.
- Give an example of a relationship to which you could apply multiple linear regression.
- How do you interpret partial slope coefficients?
- How do you interpret adjusted R-Squared in regression analysis?
Please stop here and don’t go beyond this point until we have compared notes on your answers.
Multiple Regression in R – Guided Example
Download the WDI_PO12Q.csv data set. Data are taken from World Bank (2024), Boix et al. (2018), and Marshall & Gurr (2020).
Put it into an appropriate working directory for this seminar, create a dedicated R script, and save the script in the same working directory.
Import the data set into R:
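If the CSV file sits in your working directory, the import could look like this (the object name `wdi` is the one used in the regressions later in this worksheet):

```r
# Read the data set from the working directory into a data frame
# called wdi (the name used in the regressions below)
wdi <- read.csv("WDI_PO12Q.csv")

# Quick sanity checks: dimensions and the first few rows
dim(wdi)
head(wdi)
```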
Once again, Table 8 gives an overview of the available variables and their respective labels:
variable | label |
---|---|
Country Name | Country Name |
Country Code | Country Code |
year | year |
democracy | 0 = Autocracy, 1 = Democracy (Boix et al., 2018) |
gdppc | GDP per capita (constant 2010 US$) |
gdpgrowth | Absolute growth of per capita GDP compared to the previous year (constant 2010 US$) |
enrl_gross | School enrollment, primary (% gross) |
enrl_net | School enrollment, primary (% net) |
agri | Employment in agriculture (% of total employment) (modeled ILO estimate) |
slums | Population living in slums (% of urban population) |
telephone | Fixed telephone subscriptions (per 100 people) |
internet | Individuals using the Internet (% of population) |
tax | Tax revenue (% of GDP) |
electricity | Access to electricity (% of population) |
mobile | Mobile cellular subscriptions (per 100 people) |
service | Services, value added (% of GDP) |
oil | Oil rents (% of GDP) |
natural | Total natural resources rents (% of GDP) |
literacy | Literacy rate, adult total (% of people ages 15 and above) |
prim_compl | Primary completion rate, total (% of relevant age group) |
infant | Mortality rate, infant (per 1,000 live births) |
hosp | Hospital beds (per 1,000 people) |
tub | Incidence of tuberculosis (per 100,000 people) |
health_ex | Current health expenditure (% of GDP) |
ineq | Income share held by lowest 10% |
unemploy | Unemployment, total (% of total labor force) (modeled ILO estimate) |
lifeexp | Life expectancy at birth, total (years) |
urban | Urban population (% of total population) |
polity5 | Combined Polity V score |
- Let’s return to the example from Week 7. First, re-run the regression I used back then.
wdi_life <- lm(gdppc ~ lifeexp, data = wdi)
summary(wdi_life)
Call:
lm(formula = gdppc ~ lifeexp, data = wdi)

Residuals:
   Min     1Q Median     3Q    Max
-18457  -9877  -4187   4963  78242

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -94306.8    10406.7  -9.062 3.33e-16 ***
lifeexp       1503.2      144.4  10.412  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14690 on 167 degrees of freedom
  (26 observations deleted due to missingness)
Multiple R-squared: 0.3936, Adjusted R-squared: 0.39
F-statistic: 108.4 on 1 and 167 DF, p-value: < 2.2e-16
- Interpret the coefficients.
Interpreting a coefficient
The order in which to interpret a coefficient is as follows:
- Is it significant? If not, you cannot reject the null hypothesis: the data provide no evidence of an influence.
- If it is significant, you can interpret its size and direction according to the statistical model (for example, slope coefficient vs. partial slope coefficient).
- What does the coefficient mean for the hypothesis? Look at the direction. Is the direction as predicted by the hypothesis? Then you have evidence to support the hypothesis. If the direction is the inverse, the evidence contradicts the hypothesis, even though the coefficient is significant.
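The checklist above can also be applied programmatically: the coefficient matrix returned by `summary()` contains the estimates and p-values, so significance and direction can be read straight from it. A sketch using R's built-in `mtcars` data, since the pattern is the same for any `lm` object:

```r
# Fit a simple bivariate model on built-in data
m <- lm(mpg ~ wt, data = mtcars)

# The coefficient matrix: estimate, std. error, t value, p-value
ctab <- coef(summary(m))

# Step 1: is the slope significant at the 5% level?
p_slope <- ctab["wt", "Pr(>|t|)"]
significant <- p_slope < 0.05

# Step 2: if so, inspect the size and direction of the estimate
slope <- ctab["wt", "Estimate"]
direction <- ifelse(slope > 0, "positive", "negative")

significant
direction
```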
- Now we use a different regressor, say “Urban population (% of total)”.
- Specify the null and the alternative hypotheses.
wdi_urban <- lm(gdppc ~ urban, data = wdi)
summary(wdi_urban)
Call:
lm(formula = gdppc ~ urban, data = wdi)

Residuals:
   Min     1Q Median     3Q    Max
-28588 -10681  -3110   6863 152303

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -16983.43    3872.34  -4.386 1.98e-05 ***
urban          542.03      61.74   8.779 1.42e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19120 on 176 degrees of freedom
  (17 observations deleted due to missingness)
Multiple R-squared: 0.3045, Adjusted R-squared: 0.3006
F-statistic: 77.07 on 1 and 176 DF, p-value: 1.417e-15
- Interpret the coefficients.
- What does this mean for the hypotheses?
- If we want to assess the influence of both independent variables together, we type:
wdi_joint <- lm(gdppc ~ lifeexp + urban, data = wdi)
summary(wdi_joint)
Call:
lm(formula = gdppc ~ lifeexp + urban, data = wdi)

Residuals:
   Min     1Q Median     3Q    Max
-22570  -8825  -2143   5594  74445

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -74786.14   10690.13  -6.996 6.17e-11 ***
lifeexp       1012.17     172.70   5.861 2.41e-08 ***
urban          273.74      59.13   4.629 7.37e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13860 on 166 degrees of freedom
  (26 observations deleted due to missingness)
Multiple R-squared: 0.463, Adjusted R-squared: 0.4565
F-statistic: 71.55 on 2 and 166 DF, p-value: < 2.2e-16
- Interpret the coefficients.
Bear in mind for your interpretation that these are partial slope coefficients!
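One way to see what “partial” means in practice: predictions from the joint model change with one regressor while the other is held fixed. A sketch, assuming the `wdi_joint` model fitted above:

```r
# Partial slope of lifeexp: predicted gdppc for a one-year increase
# in life expectancy while urbanisation is held constant at 50%
newdata <- data.frame(lifeexp = c(70, 71), urban = c(50, 50))
preds <- predict(wdi_joint, newdata)

# The difference between the two predictions equals the
# partial slope coefficient of lifeexp
diff(preds)
```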
- Because I am nice, I am producing an overview of all three regressions in the following table:
Dependent variable: per capita GDP

|  | (1) | (2) | (3) |
|---|---|---|---|
| Life Expectancy | 1,503.211*** |  | 1,012.168*** |
|  | (144.371) |  | (172.696) |
| Urbanisation |  | 542.033*** | 273.735*** |
|  |  | (61.743) | (59.134) |
| Constant | -94,306.810*** | -16,983.430*** | -74,786.150*** |
|  | (10,406.730) | (3,872.336) | (10,690.130) |
| Observations | 169 | 178 | 169 |
| R² | 0.394 | 0.305 | 0.463 |
| Adjusted R² | 0.390 | 0.301 | 0.456 |

Note: * p<0.1; ** p<0.05; *** p<0.01
Drawing on what you have learned about the `stargazer` package in the additional exercises last week, replicate this table.
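A sketch of the `stargazer` call (the package is assumed to be installed; the label arguments shown are one plausible set):

```r
library(stargazer)

# Text output for the console; use type = "html" or type = "latex"
# for a report-ready version of the table
stargazer(wdi_life, wdi_urban, wdi_joint,
          type = "text",
          dep.var.labels = "per capita GDP",
          covariate.labels = c("Life Expectancy", "Urbanisation"))
```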
- How have the slope coefficients changed? Why?
- Which model explains the level of GDP best? Why?
- Specify the SRF for Model 3, paying special attention to notation.
- Now assume we want to know whether education has a bearing on the level of GDP. We call:
wdi_lit <- lm(gdppc ~ literacy, data = wdi)
summary(wdi_lit)
and also add it to the joint model:
wdi_joint2 <- lm(gdppc ~ lifeexp + urban + literacy, data = wdi)
summary(wdi_joint2)
This should lead to these results:
Dependent variable: per capita GDP

|  | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|
| Life Expectancy | 1,503.211*** |  | 1,012.168*** |  | 973.985** |
|  | (144.371) |  | (172.696) |  | (411.375) |
| Urbanisation |  | 542.033*** | 273.735*** |  | 227.126** |
|  |  | (61.743) | (59.134) |  | (83.427) |
| Literacy |  |  |  | 258.407** | -221.731 |
|  |  |  |  | (97.381) | (142.455) |
| Constant | -94,306.810*** | -16,983.430*** | -74,786.150*** | -12,967.000 | -55,246.840*** |
|  | (10,406.730) | (3,872.336) | (10,690.130) | (8,529.810) | (19,311.130) |
| Observations | 169 | 178 | 169 | 40 | 39 |
| R² | 0.394 | 0.305 | 0.463 | 0.156 | 0.497 |
| Adjusted R² | 0.390 | 0.301 | 0.456 | 0.134 | 0.454 |

Note: * p<0.1; ** p<0.05; *** p<0.01
- In Model 5 the coefficient for `literacy` has turned insignificant. Reproduce the results in Table 4 to find out which variable takes away the significance.
Dependent variable: per capita GDP

|  | (1) | (2) | (3) |
|---|---|---|---|
| Literacy | 258.407** | -223.305 | 28.390 |
|  | (97.381) | (154.620) | (99.706) |
| Life Expectancy |  | 1,483.785*** |  |
|  |  | (397.567) |  |
| Urbanisation |  |  | 315.375*** |
|  |  |  | (77.633) |
| Constant | -12,967.000 | -77,851.340*** | -12,399.610* |
|  | (8,529.810) | (18,924.060) | (7,189.937) |
| Observations | 40 | 39 | 40 |
| R² | 0.156 | 0.390 | 0.417 |
| Adjusted R² | 0.134 | 0.357 | 0.385 |

Note: * p<0.1; ** p<0.05; *** p<0.01
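One way to set up the three models behind the table above (the object names here are illustrative):

```r
# Literacy alone, then literacy combined with each of the
# other two regressors in turn
m_lit    <- lm(gdppc ~ literacy,           data = wdi)
m_lit_le <- lm(gdppc ~ literacy + lifeexp, data = wdi)
m_lit_ur <- lm(gdppc ~ literacy + urban,   data = wdi)

# Compare the coefficient tables side by side
lapply(list(m_lit, m_lit_le, m_lit_ur), function(m) coef(summary(m)))
```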
- What can we conclude from this investigation?
- Does the variable `infant` have the same effect? What do you conclude from this?
- Which measurement explains GDP better, `lifeexp` or `infant`?
Dependent variable: per capita GDP

|  | (1) | (2) |
|---|---|---|
| Life Expectancy | 1,503.211*** |  |
|  | (144.371) |  |
| Infant Mortality |  | -524.111*** |
|  |  | (73.885) |
| Constant | -94,306.810*** | 26,467.130*** |
|  | (10,406.730) | (2,257.385) |
| Observations | 169 | 178 |
| R² | 0.394 | 0.222 |
| Adjusted R² | 0.390 | 0.218 |

Note: * p<0.1; ** p<0.05; *** p<0.01
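To compare the two bivariate models numerically, the fit measures can be pulled from the model summaries. A sketch, assuming `wdi_life` from above; the name of the infant-mortality model is chosen here for illustration:

```r
# Fit the infant-mortality model
wdi_infant <- lm(gdppc ~ infant, data = wdi)

# Adjusted R-squared of both bivariate models; the higher value
# indicates the better-fitting model for the level of GDP
c(lifeexp = summary(wdi_life)$adj.r.squared,
  infant  = summary(wdi_infant)$adj.r.squared)
```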
Multiple Regression in R – Independent Analysis
Before you start with these, please pause and let Flo/Luis know that you are done. We will compare notes on your answers up to this point, to make sure that you are on the right track for the independent exercises.
- Use the `wdi` data frame. Set `polity5` as the dependent variable, and choose three sensible variables which you believe could influence democracy. Note that the Polity V Score codes regimes from -10 (indicating perfect autocracy) to +10 (indicating perfect democracy).
- State the null and alternative hypotheses for each of the independent variables chosen.
- Plot two of the bivariate models in a scatter plot (black points) with a fitted regression line (in red). Use base R or `ggplot`. Does the direction of influence agree with your hypotheses?
- Run all possible regression models. Put the results into a suitable table, noting the p-values of each coefficient in brackets underneath. Try all different combinations of your variables to see which of the three is the best to explain democracy.
- Specify the Sample Regression Function (SRF) for models 1, 4, and 7.
- Interpret the intercept and one of the slope coefficients in models 1, 4, and 7.
- Interpret the model fit measure for models 1, 4, and 7.
- What do we conclude with respect to the hypotheses stated in Exercise 2 from this analysis?
- Collate a PowerPoint (Keynote) presentation with one slide for each of the preceding eight points. We will discuss this in Week 9.
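For the plotting exercise, the base-R pattern looks like this (`gdppc` is used here purely as a placeholder; substitute whichever regressor you chose):

```r
# Scatter plot (black points) with fitted regression line (red);
# gdppc stands in for your chosen independent variable
m <- lm(polity5 ~ gdppc, data = wdi)
plot(polity5 ~ gdppc, data = wdi, pch = 16, col = "black",
     xlab = "GDP per capita", ylab = "Polity V score")
abline(m, col = "red")
```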
Homework for Week 9
- There is no separate reading for the Week 9 seminar
- Work through this week’s flashcards to familiarise yourself with the relevant R functions.
- Find an example for each NEW function and apply it in R to ensure it works
- Complete the Week 8 Moodle Quiz
- Work through the Week 9 “Methods, Methods, Methods” Section.
- Revise the material of Weeks 1-5 and note questions you have whilst doing this.
Some of the content of this worksheet is taken from Reiche (forthcoming).