Worksheet Week 10
Self-Assessment Questions
- How do the mathematical properties of OLS (Week 5) differ from the Classical Linear Assumptions?
- Why is observed heteroscedasticity a problem that needs addressing?
- What happens in a model with (multi-)collinearity?
- Does the assumption of linearity in the Gauss-Markov Theorem preclude non-linear transformations? Why / why not?
Please stop here and don’t go beyond this point until we have compared notes on your answers.
Testing the Classical Linear Assumptions
For this worksheet we will be using the `london` data set from the lectures, this time with the following variables:
Data are taken from GOV.UK (2013), London Data Store (2010), and House of Commons Library (n.d.).
Homoscedasticity
- Load the data set `london_11.csv` as an object called `london`.
- Subset the data to a random sample as follows:
- Regress GCSE scores on `income` and interpret the results.
- Breusch-Pagan Test by Hand
  - Extract the residuals from model 1 and store their squared values in a new object called `ressq`.
  - Regress `ressq` on `income` and store the number of observations in a new object called `N`.
  - Compute the Breusch-Pagan test statistic from `N` and R\(^2\), and interpret the result.
- Breusch-Pagan Test with the `lmtest` package
  - Install and load the `lmtest` package.
  - Conduct the Breusch-Pagan test with the package and compare the result to your calculation by hand (a sketch of both approaches follows this list).
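A minimal sketch of both approaches in R is given below, assuming the random subsample is stored as `london_sam` (the name used later in this worksheet) and using the `gcse` and `income` variables; `model1`, `aux` and `bp` are placeholder object names rather than names prescribed by the exercises:

```r
library(lmtest)

# Model 1: regress GCSE scores on income (placeholder object name model1)
model1 <- lm(gcse ~ income, data = london_sam)
summary(model1)

# Breusch-Pagan test by hand: squared residuals, secondary regression, N * R^2
ressq <- residuals(model1)^2
aux   <- lm(ressq ~ income, data = london_sam)  # secondary regression
N     <- nobs(model1)                           # number of observations
bp    <- N * summary(aux)$r.squared             # test statistic
bp
pchisq(bp, df = 1, lower.tail = FALSE)          # p-value; df = regressors in aux

# Breusch-Pagan test with the lmtest package
bptest(model1)
```

The two statistics should coincide (up to rounding), since `bptest()` reports the studentised version of the test, which equals N times the R-squared of the secondary regression.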
Heteroscedasticity
- Now let's introduce heteroscedasticity:
- Conduct the Breusch-Pagan test on the new model with the `lmtest` package.
- Install and load the `sandwich` package. The `sandwich` package replaces \(\Omega\) with a new diagonal which takes the observed heteroscedasticity into account. We will use HC3 – if you ever work with somebody using Stata, then choose HC1 to obtain the same results. Conduct a significance test with robust standard errors, using the `coeftest` function from the `lmtest` package together with `sandwich` and `gcse` as your dependent variable (see the sketch after this list).
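A minimal sketch of this step, where `model2` stands in for whatever name you gave the model estimated on the heteroscedastic data:

```r
library(lmtest)
library(sandwich)

# Breusch-Pagan test on the new model
bptest(model2)

# Significance tests with heteroscedasticity-robust (HC3) standard errors;
# type = "HC1" reproduces Stata's default robust standard errors.
coeftest(model2, vcov. = vcovHC(model2, type = "HC3"))
```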
(Multi-)Collinearity
In this Section, we will be testing for collinearity. We can do so with correlation analysis, which assesses to what degree two variables vary together. As such, this method is limited to pairwise comparisons between variables. It is conceivable, and indeed often the case, however, that one variable is functionally related to two or more other independent variables, such as
\[\begin{equation*} x_{1,i} = 0.2 x_{2,i }- 1.7 x_{3,i} \end{equation*}\]
We refer to this situation as multicollinearity. In order to detect it, we need to test for the functional dependence of each independent variable on the remaining independent variables in the model. We can do so by employing secondary regression models, not unlike in the Breusch-Pagan Test (see lecture slides). We specify K secondary regression models, each adopting one of our K independent variables as its dependent variable. Let us specify such a function for \(x_1\):
\[\begin{equation*} x_{1,i} = \alpha_0 + \alpha_1 x_{2,i} + \alpha_2 x_{3,i} + ... + \alpha_{K-1} x_{K,i} + v_i \end{equation*}\]
where \(\alpha\) denotes our new regression coefficients, and \(v_i\) represents a random error term. For the remaining independent variables, the functions would look as follows:
\[\begin{align*} x_{2,i} &= \alpha_0 + \alpha_1 x_{1,i} + \alpha_2 x_{3,i} + \alpha_3 x_{4,i} + ... + \alpha_{K-1} x_{K,i} + v_i \\ x_{3,i} &= \alpha_0 + \alpha_1 x_{1,i} + \alpha_2 x_{2,i} + \alpha_3 x_{4,i} + ... + \alpha_{K-1} x_{K,i} + v_i \\ \vdots \\ x_{K,i} &= \alpha_0 + \alpha_1 x_{1,i} + \alpha_2 x_{2,i} + \alpha_3 x_{3,i} + ... + \alpha_{K-1} x_{K-1,i} + v_i \end{align*}\]
In order to quantify how well each of the K regression functions explains the respective independent variable, we are going to use Adjusted R-Squared. We will denote the \(R^2\) of each individual regression as \(R_j^2\), where j runs from 1 to K, or \(j=1,...,K\).
\(R_j^2\) in itself already provides useful information, but we can do better than this. We can use \(R_j^2\) to develop a measure which tells us how much larger the observed variance of a coefficient is compared to a scenario in which the variable was totally functionally independent from the other independent variables in the model. We call this measure variance inflation factor and define it as follows:
\[\begin{equation} \text{VIF}_j = \frac{1}{1-R_j^2} \end{equation}\]
Its size is solely determined by \(R_j^2\). The better a model explains the respective independent variable, the closer \(R_j^2\) will be to one, increasing the size of VIF\(_j\). A commonly adopted threshold for classing a variable as (multi-)collinear is VIF=10. This threshold is completely arbitrary, but has nonetheless managed to establish itself in regression analysis.
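As an illustration of the idea, a VIF can be computed by hand from a single secondary regression. The sketch below uses `income` and `houseprice` purely as example regressors; substitute the independent variables of your own model:

```r
# Secondary regression for x_j = income on the remaining regressor(s)
aux_income <- lm(income ~ houseprice, data = london_sam)

r2_j       <- summary(aux_income)$r.squared   # R_j^2 (or $adj.r.squared for the adjusted version)
vif_income <- 1 / (1 - r2_j)                  # VIF_j = 1 / (1 - R_j^2)
vif_income
```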
We will now use the `london_sam` data frame to put theory into practice.
- Regress GCSE scores on and in a multiple regression model. Interpret the results.
- Install and load the package.
- Calculate the Variance Inflation Factor (VIF) for this model as follows (a sketch is given after this list):
- Interpret the results.
- Regress GCSE scores on `houseprice`. Store the results in an object called `model4`. Note the significance of the slope coefficient and interpret the results.
- Let's introduce some collinearity by estimating the following model:
- Calculate the VIF for model 5. Interpret the results.
- Drawing on the results of Exercise 7, explain the significance of the slope coefficients of Model 1 (Exercise 3 in the previous section) and Model 4 (Exercise 6 of this section).
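The exercises above do not name the package; one common choice, assumed in the sketch below, is the `car` package, whose `vif()` function returns the VIF of every regressor at once. `model3` is a placeholder name for the multiple regression model from the first exercise of this section, and `model5` for the collinear model 5:

```r
# install.packages("car")   # the package name is an assumption; install once
library(car)

vif(model3)   # VIFs for the multiple regression model (placeholder name)
vif(model5)   # VIFs for the collinear model 5
```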
The Gauss-Markov Theorem
- Explain, in your own words, what BLUE means.
- According to Malinvaud, the assumption that \(E(\epsilon_{i}|X_{i})=0\) is quite important. To see this, consider the PRF: \(Y_{i}=\beta_{0}+\beta_{1}X_{i}+\epsilon_{i}\). Now consider two situations: (1) \(\beta_{0}=0\), \(\beta_{1}=1\) and \(E(\epsilon_{i})=0\); and (2) \(\beta_{0}=1\), \(\beta_{1}=0\) and \(E(\epsilon_{i})=X_{i}-1\). Now take the expectation of the PRF conditional upon \(X\) in those two cases, and see if you agree with Malinvaud about the significance of the assumption \(E(\epsilon_{i}|X_{i})=0\).
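As an arithmetic check for the second question (treating the stated error expectations as conditional on \(X_{i}\)), the conditional mean of the PRF in the two scenarios is:

\[\begin{align*} \text{(1)} \quad E(Y_{i}|X_{i}) &= 0 + 1 \cdot X_{i} + E(\epsilon_{i}|X_{i}) = X_{i} \\ \text{(2)} \quad E(Y_{i}|X_{i}) &= 1 + 0 \cdot X_{i} + E(\epsilon_{i}|X_{i}) = 1 + (X_{i} - 1) = X_{i} \end{align*}\]

Compare the two results before deciding whether you agree with Malinvaud.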