In the previous article, we took a detailed look at Simple Linear Regression. In this post, we will take a quick look at Multiple Linear Regression, in which there is more than one independent variable to predict the value of the dependent variable (y).
Let's introduce two more independent variables, qsec and drat, and analyse the results. In simple regression, the coefficient estimate (or slope) of wt was -5.3445; when we introduce the additional variables, the estimate changes to -4.3978 because of their impact.
Each estimate tells us how much the dependent variable is expected to change when that independent variable increases by one unit, holding all the other independent variables constant.
Looking at the summary() of the model, we can make the following observations:
- The residuals are more or less normally distributed
- wt and qsec are statistically significant as they have very low p-values.
- The p-value of drat is higher than our cut-off of 0.05 and its 95% confidence interval contains 0, both indicating that it is not statistically significant.
- The multiple R-squared indicates that the model accounts for about 83.7% of the variance in mpg.
- The F-statistic tests the overall fit of the model. The value of 47.93 is greater than the critical value of 2.94 (computed with qf() below) and the very low p-value indicates that the model as a whole is statistically significant.
- The absolute values of the standardised betas indicate that wt has the strongest effect (0.71), as shown in the lm.beta() output below.
# Fit the multiple regression of mpg on wt, qsec and drat
lm.fit <- lm(mpg ~ wt + qsec + drat, data = mtcars)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ wt + qsec + drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1152 -1.8273 -0.2696 1.0502 5.5010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3945 8.0689 1.412 0.16892
## wt -4.3978 0.6781 -6.485 5.01e-07 ***
## qsec 0.9462 0.2616 3.616 0.00116 **
## drat 1.6561 1.2269 1.350 0.18789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.56 on 28 degrees of freedom
## Multiple R-squared: 0.837, Adjusted R-squared: 0.8196
## F-statistic: 47.93 on 3 and 28 DF, p-value: 3.723e-11
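To see what "holding all the other independent variables constant" means in practice, here is a minimal sketch: two hypothetical cars that differ only in wt (by one unit), with qsec and drat fixed at arbitrary illustrative values, differ in predicted mpg by exactly the wt coefficient.

# Two hypothetical cars differing only in wt; the qsec and drat values are arbitrary
new.cars <- data.frame(wt = c(3, 4), qsec = 18, drat = 3.6)
diff(predict(lm.fit, newdata = new.cars))  # equals the wt coefficient, about -4.40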
# Critical value of the F distribution at alpha = 0.05 with 3 and 28 degrees of freedom
qf(p = 0.05, df1 = 3, df2 = 28, lower.tail = FALSE)
## [1] 2.946685
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) -5.1339149 27.922817
## wt -5.7868161 -3.008776
## qsec 0.4102527 1.482155
## drat -0.8571278 4.169418
QuantPsyc::lm.beta(lm.fit)
## wt qsec drat
## -0.7139693 0.2805421 0.1469244
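Each standardised beta is simply the raw coefficient rescaled by the ratio of the predictor's standard deviation to that of the outcome. A minimal sanity check for wt:

# Standardised beta for wt: raw coefficient * sd(wt) / sd(mpg)
coef(lm.fit)["wt"] * sd(mtcars$wt) / sd(mtcars$mpg)  # about -0.71, matching lm.beta() above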
There are a couple of additional concepts in multiple regression that we need to look at.
Variance Inflation Factor (VIF)
In a multiple regression model, it is possible that one of the independent variables can be linearly predicted from the other independent variables. This phenomenon is called multicollinearity, and we can use the Variance Inflation Factor (VIF) to measure it.
The definition as per Wikipedia is
“In statistics, the variance inflation factor (VIF) is the ratio of variance in a model with multiple terms, divided by the variance of a model with one term alone. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis. It provides an index that measures how much the variance (the square of the estimate’s standard deviation) of an estimated regression coefficient is increased because of collinearity.”
Let's find the VIF for the independent variables in our model using the vif() function from the car package.
car::vif(lm.fit)
## wt qsec drat
## 2.082097 1.033884 2.035472
To explain this further, we will create another model with qsec as the dependent variable and wt and drat as the independent variables (see the sketch below). This model gives us a multiple R-squared of 0.033, which is the variance explained by this model. We can conclude that wt and drat are able to explain only about 3% of the variance in qsec, which in turn implies that there is little or no correlation between qsec and the other two variables. Hence qsec doesn't display collinearity with the remaining variables.
But how is this related to VIF? If we take the reciprocal of the VIF of qsec we get 0.967, which is the same as the variance left unexplained by our new model (1 - 0.033 = 0.967). This quantity is also known as the tolerance; in general, VIF = 1 / (1 - R-squared) of the auxiliary model, and tolerance = 1 - R-squared.
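A minimal sketch of that auxiliary model, tying together its R-squared, the tolerance and the VIF (the model formula follows the description above):

# Auxiliary regression: how well do the other predictors explain qsec?
qsec.fit <- lm(qsec ~ wt + drat, data = mtcars)
r2 <- summary(qsec.fit)$r.squared  # about 0.033
1 - r2                             # tolerance, about 0.967
1 / (1 - r2)                       # VIF, about 1.034, matching car::vif() above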
In other words, tolerance is the proportion of variance in a given independent variable that is left unexplained once we know all of the other independent variables in the model. A very low tolerance means that the variable is highly redundant with everything else in the model.
The general rule of thumb is that any variable with a VIF of five or more (equivalently, a tolerance of 0.2 or less) is largely redundant: only about 20% of its variance is left over once we account for all of the other variables in the model, and it can usually be omitted from the model.
Partial Correlation
Partial correlation is the correlation between two variables while holding the other variables constant. Its square gives the proportion of the otherwise unexplained variance that is accounted for uniquely by each variable.
# Partial correlations among mpg (column 1), wt (column 6) and qsec (column 7)
ppcor::pcor(mtcars[, c(1, 6, 7)])
## $estimate
## mpg wt qsec
## mpg 1.0000000 -0.8885492 0.5456251
## wt -0.8885492 1.0000000 0.4176413
## qsec 0.5456251 0.4176413 1.0000000
##
## $p.value
## mpg wt qsec
## mpg 0.000000e+00 2.518948e-11 0.001499883
## wt 2.518948e-11 0.000000e+00 0.019400101
## qsec 1.499883e-03 1.940010e-02 0.000000000
##
## $statistic
## mpg wt qsec
## mpg 0.000000 -10.429771 3.506179
## wt -10.429771 0.000000 2.475278
## qsec 3.506179 2.475278 0.000000
##
## $n
## [1] 32
##
## $gp
## [1] 1
##
## $method
## [1] "pearson"
This post is a work in progress.