Worksheet Week 10
Self-Assessment Questions
- How do the mathematical properties of OLS (Week 5) differ from the Classical Linear Assumptions?
- Why is observed heteroscedasticity a problem that needs addressing?
- What happens in a model with (multi-)collinearity?
- Does the assumption of linearity in the Gauss-Markov Theorem preclude non-linear transformations? Why / why not?
Please stop here and don’t go beyond this point until we have compared notes on your answers.
Testing the Classical Linear Assumptions
For this worksheet we will be using the `london` data set from the lectures, this time with the following variables:
Data are taken from GOV.UK (2013), London Data Store (2010), and House of Commons Library (n.d.).
Homoscedasticity
- Load the data set `london_11.csv` as an object called `london`.
- Subset the data to a random sample as follows:
- Regress GCSE scores on `income` and interpret the results.
- Breusch-Pagan Test by Hand
  - Extract the residuals from model 1 and store their squared values in a new object called `ressq`.
  - Regress `ressq` on `income` and store the number of observations in a new object called `N`.
  - Compute the Breusch-Pagan test statistic from `N` and R\(^2\), and interpret the result.
- Breusch-Pagan Test with the `lmtest` package
  - Install and load the `lmtest` package.
  - Conduct the Breusch-Pagan test with the package and compare the result to your calculation by hand (a sketch of both approaches follows this list).
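A minimal sketch of both approaches in R is given below, assuming the random subsample is stored as `london_sam` (the name used later in this worksheet) and using the `gcse` and `income` variables; `model1`, `aux` and `bp` are placeholder object names rather than names prescribed by the exercises:

```r
library(lmtest)

# Model 1: regress GCSE scores on income (placeholder object name model1)
model1 <- lm(gcse ~ income, data = london_sam)
summary(model1)

# Breusch-Pagan test by hand: squared residuals, secondary regression, N * R^2
ressq <- residuals(model1)^2
aux   <- lm(ressq ~ income, data = london_sam)  # secondary regression
N     <- nobs(model1)                           # number of observations
bp    <- N * summary(aux)$r.squared             # test statistic
bp
pchisq(bp, df = 1, lower.tail = FALSE)          # p-value; df = regressors in aux

# Breusch-Pagan test with the lmtest package
bptest(model1)
```

The two statistics should coincide (up to rounding), since `bptest()` reports the studentised version of the test, which equals N times the R-squared of the secondary regression.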
Heteroscedasticity
- Now let's introduce heteroscedasticity:
- Conduct the Breusch-Pagan test on the new model with the `lmtest` package.
- Install and load the `sandwich` package. The `sandwich` package replaces \(\Omega\) with a new diagonal which takes the observed heteroscedasticity into account. We will use HC3 – if you ever work with somebody using Stata, then choose HC1 to obtain the same results. Conduct a significance test with robust standard errors, using the `coeftest` function from the `lmtest` package together with `sandwich` and `gcse` as your dependent variable (see the sketch after this list).
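A minimal sketch of this step, where `model2` stands in for whatever name you gave the model estimated on the heteroscedastic data:

```r
library(lmtest)
library(sandwich)

# Breusch-Pagan test on the new model
bptest(model2)

# Significance tests with heteroscedasticity-robust (HC3) standard errors;
# type = "HC1" reproduces Stata's default robust standard errors.
coeftest(model2, vcov. = vcovHC(model2, type = "HC3"))
```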
(Multi-)Collinearity
In this Section, we will be testing for collinearity. We can do so with correlation analysis, which assesses to what degree two variables vary together. As such, this method is limited to pairwise comparisons between variables. It is conceivable, and indeed often the case, however, that one variable is functionally related to two or more other independent variables, such as
\[\begin{equation*} x_{1,i} = 0.2 x_{2,i }- 1.7 x_{3,i} \end{equation*}\]
We refer to this situation as multicollinearity. In order to detect it, we need to test for the functional dependence of each independent variable on the remaining independent variables in the model. We can do so by employing secondary regression models, not unlike in the Breusch-Pagan Test (see lecture slides). We specify K secondary regression models, each adopting one of our K independent variables as its dependent variable. Let us specify such a function for \(x_1\):
\[\begin{equation*} x_{1,i} = \alpha_0 + \alpha_1 x_{2,i} + \alpha_2 x_{3,i} + ... + \alpha_{K-1} x_{K,i} + v_i \end{equation*}\]
where \(\alpha\) denotes our new regression coefficients, and \(v_i\) represents a random error term. For the remaining independent variables, the functions would look as follows:
\[\begin{align*} x_{2,i} &= \alpha_0 + \alpha_1 x_{1,i} + \alpha_2 x_{3,i} + \alpha_3 x_{4,i} + ... + \alpha_{K-1} x_{K,i} + v_i \\ x_{3,i} &= \alpha_0 + \alpha_1 x_{1,i} + \alpha_2 x_{2,i} + \alpha_3 x_{4,i} + ... + \alpha_{K-1} x_{K,i} + v_i \\ \vdots \\ x_{K,i} &= \alpha_0 + \alpha_1 x_{1,i} + \alpha_2 x_{2,i} + \alpha_3 x_{3,i} + ... + \alpha_{K-1} x_{K-1,i} + v_i \end{align*}\]
In order to quantify how well each of the K regression functions explains the respective independent variable, we are going to use Adjusted R-Squared. We will denote the \(R^2\) of each individual regression as \(R_j^2\), where j runs from 1 to K, or \(j=1,...,K\).
\(R_j^2\) in itself already provides useful information, but we can do better than this. We can use \(R_j^2\) to develop a measure which tells us how much larger the observed variance of a coefficient is compared to a scenario in which the variable was totally functionally independent from the other independent variables in the model. We call this measure variance inflation factor and define it as follows:
\[\begin{equation} \text{VIF}_j = \frac{1}{1-R_j^2} \end{equation}\]
Its size is solely determined by \(R_j^2\). The better a model explains the respective independent variable, the closer \(R_j^2\) will be to one, increasing the size of VIF\(_j\). A commonly adopted threshold for classing a variable as (multi-)collinear is VIF=10. This threshold is completely arbitrary, but has nonetheless managed to establish itself in regression analysis.
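As an illustration of the idea, a VIF can be computed by hand from a single secondary regression. The sketch below uses `income` and `houseprice` purely as example regressors; substitute the independent variables of your own model:

```r
# Secondary regression for x_j = income on the remaining regressor(s)
aux_income <- lm(income ~ houseprice, data = london_sam)

r2_j       <- summary(aux_income)$r.squared   # R_j^2 (or $adj.r.squared for the adjusted version)
vif_income <- 1 / (1 - r2_j)                  # VIF_j = 1 / (1 - R_j^2)
vif_income
```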
We will now use the `london_sam` data frame to put theory into practice.
- Regress GCSE scores on and in a multiple regression model. Interpret the results.
- Install and load the package.
- Calculate the Variance Inflation Factor (VIF) for this model as follows (a sketch is given after this list):
- Interpret the results.
- Regress GCSE scores on `houseprice`. Store the results in an object called `model4`. Note the significance of the slope coefficient and interpret the results.
- Let's introduce some collinearity by estimating the following model:
- Calculate the VIF for model 5. Interpret the results.
- Drawing on the results of Exercise 7, explain the significance of the slope coefficients of Model 1 (Exercise 3 in the previous section) and Model 4 (Exercise 6 of this section).
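The exercises above do not name the package; one common choice, assumed in the sketch below, is the `car` package, whose `vif()` function returns the VIF of every regressor at once. `model3` is a placeholder name for the multiple regression model from the first exercise of this section, and `model5` for the collinear model 5:

```r
# install.packages("car")   # the package name is an assumption; install once
library(car)

vif(model3)   # VIFs for the multiple regression model (placeholder name)
vif(model5)   # VIFs for the collinear model 5
```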
The Gauss-Markov Theorem
- Explain, in your own words, what BLUE means.
- According to Malinvaud, the assumption that \(E(\epsilon_{i}|X_{i})=0\) is quite important. To see this, consider the PRF: \(Y_{i}=\beta_{0}+\beta_{1}X_{i}+\epsilon_{i}\). Now consider two situations: (1) \(\beta_{0}=0\), \(\beta_{1}=1\) and \(E(\epsilon_{i})=0\); and (2) \(\beta_{0}=1\), \(\beta_{1}=0\) and \(E(\epsilon_{i})=X_{i}-1\). Now take the expectation of the PRF conditional upon \(X\) in those two cases, and see if you agree with Malinvaud about the significance of the assumption \(E(\epsilon_{i}|X_{i})=0\).
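As an arithmetic check for the second question (treating the stated error expectations as conditional on \(X_{i}\)), the conditional mean of the PRF in the two scenarios is:

\[\begin{align*} \text{(1)} \quad E(Y_{i}|X_{i}) &= 0 + 1 \cdot X_{i} + E(\epsilon_{i}|X_{i}) = X_{i} \\ \text{(2)} \quad E(Y_{i}|X_{i}) &= 1 + 0 \cdot X_{i} + E(\epsilon_{i}|X_{i}) = 1 + (X_{i} - 1) = X_{i} \end{align*}\]

Compare the two results before deciding whether you agree with Malinvaud.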