Worksheet Week 8

Self-Assessment Questions¹

  1. Compare bivariate regression and multiple regression.
  2. Give an example of the relationship in which you could apply multiple linear regression.
  3. How do you interpret partial slope coefficients?
  4. How do you interpret adjusted R-Squared in regression analysis?

Please stop here and don’t go beyond this point until we have compared notes on your answers.


Multiple Regression in R – Guided Example

  • Download the WDI_PO12Q.csv data set. Data are taken from World Bank (2024), Boix et al. (2018), and Marshall & Gurr (2020).

  • Save it in an appropriate working directory for this seminar, then create a dedicated R script and save it in the same working directory.

  • Import the data set into R:
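
A minimal sketch of the import step, assuming the file is saved as WDI_PO12Q.csv in the current working directory and is a comma-separated file with a header row. The data frame is called wdi to match the regressions further down; the helper name import_wdi is hypothetical.

```r
# Hypothetical helper: read the seminar data set from a CSV file.
# Assumes a comma-separated file with a header row.
import_wdi <- function(path = "WDI_PO12Q.csv") {
  if (!file.exists(path)) {
    stop("File not found: ", path,
         ". Check your working directory with getwd().")
  }
  read.csv(path)
}

# Only attempt the import if the file is actually present
if (file.exists("WDI_PO12Q.csv")) {
  wdi <- import_wdi()
  str(wdi)  # quick check of variable names and types
}
```

If the import fails, getwd() shows where R is currently looking for the file.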

  • Once again, Table 8 gives an overview of the available variables and their labels:

Table 8: WDI Codebook
variable      label
Country Name  Country Name
Country Code  Country Code
year          year
democracy     0 = Autocracy, 1 = Democracy (Boix et al., 2018)
gdppc         GDP per capita (constant 2010 US$)
gdpgrowth     Absolute growth of per capita GDP to previous year (constant 2010 US Dollars)
enrl_gross    School enrollment, primary (% gross)
enrl_net      School enrollment, primary (% net)
agri          Employment in agriculture (% of total employment) (modeled ILO estimate)
slums         Population living in slums (% of urban population)
telephone     Fixed telephone subscriptions (per 100 people)
internet      Individuals using the Internet (% of population)
tax           Tax revenue (% of GDP)
electricity   Access to electricity (% of population)
mobile        Mobile cellular subscriptions (per 100 people)
service       Services, value added (% of GDP)
oil           Oil rents (% of GDP)
natural       Total natural resources rents (% of GDP)
literacy      Literacy rate, adult total (% of people ages 15 and above)
prim_compl    Primary completion rate, total (% of relevant age group)
infant        Mortality rate, infant (per 1,000 live births)
hosp          Hospital beds (per 1,000 people)
tub           Incidence of tuberculosis (per 100,000 people)
health_ex     Current health expenditure (% of GDP)
ineq          Income share held by lowest 10%
unemploy      Unemployment, total (% of total labor force) (modeled ILO estimate)
lifeexp       Life expectancy at birth, total (years)
urban         Urban population (% of total population)
polity5       Combined Polity V score
  • Let’s return to the example from Week 7. First, re-run the regression I used then.
wdi_life <- lm(gdppc ~ lifeexp, data = wdi)
summary(wdi_life)

Call:
lm(formula = gdppc ~ lifeexp, data = wdi)

Residuals:
   Min     1Q Median     3Q    Max 
-18457  -9877  -4187   4963  78242 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -94306.8    10406.7  -9.062 3.33e-16 ***
lifeexp       1503.2      144.4  10.412  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14690 on 167 degrees of freedom
  (26 observations deleted due to missingness)
Multiple R-squared:  0.3936,    Adjusted R-squared:   0.39 
F-statistic: 108.4 on 1 and 167 DF,  p-value: < 2.2e-16
  • Interpret the coefficients.

Interpreting a coefficient

The order in which to interpret a coefficient is as follows:

  1. Is it significant? If not, you cannot reject the null hypothesis: the data provide no evidence of an influence, and the alternative hypothesis finds no support.
  2. If it is significant, you can interpret its size and direction according to the statistical model (for example, slope coefficient vs. partial slope coefficient).
  3. What does the coefficient mean for the hypothesis? Look at the direction. Is the direction as predicted by the hypothesis? Then you have evidence to support the hypothesis. If the direction is inverse, then you have falsified the hypothesis, even though you have a significant coefficient.
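
The three steps above can be sketched in R as follows. This is an illustration on simulated toy data, not the WDI data; the 5% significance threshold and the assumption that the hypothesis predicts a positive slope are choices made for this example only.

```r
# Toy data, NOT the WDI data: y depends positively on x by construction
set.seed(42)
toy <- data.frame(x = rnorm(100))
toy$y <- 2 + 0.5 * toy$x + rnorm(100)

fit <- lm(y ~ x, data = toy)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients
estimate <- coefs["x", "Estimate"]
p_value  <- coefs["x", "Pr(>|t|)"]

# Step 1: significance (assumed threshold: 5%)
significant <- p_value < 0.05

# Step 2: size and direction of the slope
direction <- if (estimate > 0) "positive" else "negative"

# Step 3: compare with the direction the hypothesis predicts
# (assumed here: a positive relationship)
supports_hypothesis <- significant && direction == "positive"

cat("Estimate:", round(estimate, 3),
    "| p-value:", signif(p_value, 3),
    "| significant:", significant,
    "| supports hypothesis:", supports_hypothesis, "\n")
```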
  • Now we use a different regressor, say “Urban population (% of total)”.
  • Specify the null and the alternative hypotheses.
wdi_urban <- lm(gdppc ~ urban, data = wdi)
summary(wdi_urban)

Call:
lm(formula = gdppc ~ urban, data = wdi)

Residuals:
   Min     1Q Median     3Q    Max 
-28588 -10681  -3110   6863 152303 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -16983.43    3872.34  -4.386 1.98e-05 ***
urban          542.03      61.74   8.779 1.42e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19120 on 176 degrees of freedom
  (17 observations deleted due to missingness)
Multiple R-squared:  0.3045,    Adjusted R-squared:  0.3006 
F-statistic: 77.07 on 1 and 176 DF,  p-value: 1.417e-15
  • Interpret the coefficients.
  • What does this mean for the hypotheses?
  • If we want to assess the influence of both independent variables together, we type:
wdi_joint <- lm(gdppc ~ lifeexp + urban, data = wdi)
summary(wdi_joint)

Call:
lm(formula = gdppc ~ lifeexp + urban, data = wdi)

Residuals:
   Min     1Q Median     3Q    Max 
-22570  -8825  -2143   5594  74445 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -74786.14   10690.13  -6.996 6.17e-11 ***
lifeexp       1012.17     172.70   5.861 2.41e-08 ***
urban          273.74      59.13   4.629 7.37e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13860 on 166 degrees of freedom
  (26 observations deleted due to missingness)
Multiple R-squared:  0.463, Adjusted R-squared:  0.4565 
F-statistic: 71.55 on 2 and 166 DF,  p-value: < 2.2e-16
  • Interpret the coefficients.

Bear in mind for your interpretation that these are partial slope coefficients!
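
One way to see what "partial" means: the partial slope of a regressor equals the slope you get after removing from both the dependent variable and that regressor everything the other regressor can account for (the Frisch-Waugh-Lovell result). The sketch below demonstrates this on simulated toy data, not the WDI data; all variable names are placeholders.

```r
# Toy data, NOT the WDI data: two correlated regressors
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)            # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)
toy <- data.frame(y, x1, x2)

# Joint model: the coefficient on x1 is a partial slope
joint <- lm(y ~ x1 + x2, data = toy)

# Remove the part of y and of x1 that x2 can account for
y_res  <- resid(lm(y ~ x2, data = toy))
x1_res <- resid(lm(x1 ~ x2, data = toy))
partial <- lm(y_res ~ x1_res)

coef(joint)["x1"]                    # partial slope from the joint model
coef(partial)["x1_res"]              # identical up to rounding error
coef(lm(y ~ x1, data = toy))["x1"]   # bivariate slope: also picks up x2's influence
```

The bivariate slope differs from the partial slope because x1 and x2 are correlated; in the joint model, each coefficient describes the change in y for a one-unit change in that regressor while the other is held constant.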

  • Because I am nice, I am producing an overview of all three regressions in the following table:
   
Table 9: Regression Models 1
                 Dependent variable: per capita GDP
                 (1)              (2)              (3)
Life Expectancy  1,503.211***                      1,012.168***
                 (144.371)                         (172.696)
Urbanisation                      542.033***       273.735***
                                  (61.743)         (59.134)
Constant         -94,306.810***   -16,983.430***   -74,786.150***
                 (10,406.730)     (3,872.336)      (10,690.130)
Observations     169              178              169
R2               0.394            0.305            0.463
Adjusted R2      0.390            0.301            0.456
Note: *p<0.1; **p<0.05; ***p<0.01
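
The adjusted R2 reported here can be recovered from R2, the number of observations n, and the number of regressors k, using the standard formula 1 - (1 - R2)(n - 1)/(n - k - 1). A minimal sketch; adj_r2 is a hypothetical helper name, not a base R function, and the inputs are the rounded values from the summary() output above.

```r
# Adjusted R-squared from R-squared, sample size n, and number of regressors k
adj_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

# Joint model: R-squared 0.463, 169 observations, 2 regressors
adj_r2(0.463, n = 169, k = 2)

# Life expectancy only: R-squared 0.3936, 169 observations, 1 regressor
adj_r2(0.3936, n = 169, k = 1)
```

Because the penalty grows with k, adding a regressor raises adjusted R2 only if it improves the fit by more than the loss of a degree of freedom costs.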

   

Drawing on what you have learned about the stargazer package in the additional exercises last week, replicate this table.
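
As a reminder of the general pattern, here is a hedged sketch of a three-model stargazer call on simulated toy data, not the WDI data. The labels and object names are placeholders; you will need to swap in your own models and labels to replicate the table.

```r
# Toy data, NOT the WDI data
set.seed(7)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(50)

m1 <- lm(y ~ x1, data = toy)
m2 <- lm(y ~ x2, data = toy)
m3 <- lm(y ~ x1 + x2, data = toy)

# stargazer is an add-on package: install.packages("stargazer") if needed
if (requireNamespace("stargazer", quietly = TRUE)) {
  stargazer::stargazer(
    m1, m2, m3,
    type = "text",                              # plain-text table in the console
    dep.var.labels = "Outcome y",               # placeholder label
    covariate.labels = c("Regressor 1", "Regressor 2"),
    keep.stat = c("n", "rsq", "adj.rsq")        # observations, R2, adjusted R2
  )
}
```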

  • How have the slope coefficients changed? Why?
  • Which model explains the level of GDP best? Why?
  • Specify the SRF for Model 3, paying special attention to notation.
  • Now assume we want to know whether education has a bearing on the level of GDP. We call:
wdi_lit <- lm(gdppc ~ literacy, data = wdi)

and also add it to the joint model:

wdi_joint1 <- lm(gdppc ~ lifeexp + urban + literacy, data = wdi)

This should lead to these results:

   
Table 10: Regression Models 2
                 Dependent variable: per capita GDP
                 (1)              (2)              (3)              (4)              (5)
Life Expectancy  1,503.211***                      1,012.168***                      973.985**
                 (144.371)                         (172.696)                         (411.375)
Urbanisation                      542.033***       273.735***                        227.126**
                                  (61.743)         (59.134)                          (83.427)
Literacy                                                            258.407**        -221.731
                                                                    (97.381)         (142.455)
Constant         -94,306.810***   -16,983.430***   -74,786.150***   -12,967.000      -55,246.840***
                 (10,406.730)     (3,872.336)      (10,690.130)     (8,529.810)      (19,311.130)
Observations     169              178              169              40               39
R2               0.394            0.305            0.463            0.156            0.497
Adjusted R2      0.390            0.301            0.456            0.134            0.454
Note: *p<0.1; **p<0.05; ***p<0.01

   

  • In Model 5 the coefficient for literacy is no longer significant. Reproduce the results in Table 11 to find out which variable takes away the significance.
   
Table 11: Regression Models 3
                 Dependent variable: per capita GDP
                 (1)              (2)              (3)
Literacy         258.407**        -223.305         28.390
                 (97.381)         (154.620)        (99.706)
Life Expectancy                   1,483.785***
                                  (397.567)
Urbanisation                                       315.375***
                                                   (77.633)
Constant         -12,967.000      -77,851.340***   -12,399.610*
                 (8,529.810)      (18,924.060)     (7,189.937)
Observations     40               39               40
R2               0.156            0.390            0.417
Adjusted R2      0.134            0.357            0.385
Note: *p<0.1; **p<0.05; ***p<0.01

   

  • What can we conclude from this investigation?

  • Does the variable infant have the same effect? What do you conclude from this?

  • Which measure explains GDP better, lifeexp or infant?

   
Table 12: Regression Models 4
                  Dependent variable: per capita GDP
                  (1)              (2)
Life Expectancy   1,503.211***
                  (144.371)
Infant Mortality                   -524.111***
                                   (73.885)
Constant          -94,306.810***   26,467.130***
                  (10,406.730)     (2,257.385)
Observations      169              178
R2                0.394            0.222
Adjusted R2       0.390            0.218
Note: *p<0.1; **p<0.05; ***p<0.01

   


Multiple Regression in R – Independent Analysis

Before you start with these, please pause and let Flo/Luis know that you are done. We will compare notes on your answers up to this point, to make sure that you are on the right track for the independent exercises.

  1. Use the wdi data frame. Set polity5 as the dependent variable, and choose three sensible variables which you believe could influence democracy. Note that the Polity V Score codes regimes from -10 (indicating perfect autocracy) to +10 (indicating perfect democracy).
  2. State the null and alternative hypotheses for each of the independent variables chosen.
  3. Plot two of the bivariate models in a scatter plot (black points) with fitted regression line (in red). Use base R, or ggplot. Does the direction of influence agree with your hypotheses?
  4. Run all possible regression models. Put the results into a suitable table, noting the p-values of each coefficient in the brackets underneath. Try all different combinations of your variables to see which of the three is the best to explain democracy.
  5. Specify the Sample Regression Function (SRF) for models 1, 4, and 7.
  6. Interpret the intercept and one of the slope coefficients in models 1, 4, and 7.
  7. Interpret the model fit measure for models 1, 4, and 7.
  8. What do we conclude with respect to the hypotheses stated in Exercise 2 from this analysis?
  9. Collate a PowerPoint (Keynote) presentation with one slide for each of the preceding eight points. We will discuss this in Week 9.
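
With three regressors there are seven possible models: three bivariate, three with two regressors, and one with all three. The sketch below shows one way to generate them all without typing each formula by hand. It uses simulated toy data with placeholder names (y, a, b, c), not the wdi data frame, and the ordering it produces (single regressors first, then pairs, then the full model) is an assumption about how the models are numbered, not something this worksheet specifies.

```r
# Toy data, NOT the WDI data; y, a, b, c are placeholder names
set.seed(3)
toy <- data.frame(a = rnorm(80), b = rnorm(80), c = rnorm(80))
toy$y <- 1 + toy$a - 0.5 * toy$b + rnorm(80)

regressors <- c("a", "b", "c")

# All non-empty subsets of the regressors: 3 singles, 3 pairs, 1 triple
subsets <- unlist(
  lapply(seq_along(regressors),
         function(k) combn(regressors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one model per subset; reformulate() builds e.g. y ~ a + b
models <- lapply(subsets, function(vars) {
  lm(reformulate(vars, response = "y"), data = toy)
})
names(models) <- sapply(subsets, paste, collapse = " + ")

# Compare model fit across all seven models
sapply(models, function(m) summary(m)$adj.r.squared)
```

The resulting list can be inspected model by model, for example with summary(models[["a + b"]]).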

Homework for Week 9

  • There is no separate reading for the Week 9 seminar.
  • Work through this week’s flashcards to familiarise yourself with the relevant R functions.
  • Find an example for each NEW function and apply it in R to ensure it works.
  • Complete the Week 8 Moodle Quiz.
  • Work through the Week 9 “Methods, Methods, Methods” Section.
  • Revise the material of Weeks 1-5 and note questions you have whilst doing this.

Solutions

You can find the Solutions in the Downloads Section.



  1. Some of the content of this worksheet is taken from Reiche (forthcoming).