In the previous article, we took a detailed look at Simple Linear Regression. In this post, we will take a quick look at Multiple Linear Regression, in which there is more than one independent variable to predict the value of the dependent variable (y).
Let's introduce two more independent variables, qsec and drat, and analyse the results. In simple regression, the coefficient estimate (or slope) of wt was -5.3445; when we introduce the additional variables, the estimate changes to -4.3978 because of their impact.
Each estimate tells us how much the dependent variable is expected to change when that independent variable increases by one unit, holding all the other independent variables constant.
Looking at the summary() of the model, we can make the following observations:
- The residuals are more or less normally distributed
- wt and qsec are statistically significant as they have very low p-values.
- The p-value of drat is higher than our cut-off of 0.05 and its 95% confidence interval contains 0, both indicating that it is not statistically significant.
- The multiple R-squared indicates that the model accounts for about 83.7% of the variance in mpg.
- The F-statistic tests the overall fit of the model. The value of 47.93 is greater than the critical value of 2.94 (computed with qf() below) and the very low p-value indicates that the model as a whole is statistically significant.
- The absolute values of the standardised betas indicate that wt has the strongest effect (0.71), as shown in the lm.beta() output below.
# Fit the multiple regression of mpg on wt, qsec and drat
lm.fit <- lm(mpg ~ wt + qsec + drat, data = mtcars)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ wt + qsec + drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1152 -1.8273 -0.2696 1.0502 5.5010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3945 8.0689 1.412 0.16892
## wt -4.3978 0.6781 -6.485 5.01e-07 ***
## qsec 0.9462 0.2616 3.616 0.00116 **
## drat 1.6561 1.2269 1.350 0.18789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.56 on 28 degrees of freedom
## Multiple R-squared: 0.837, Adjusted R-squared: 0.8196
## F-statistic: 47.93 on 3 and 28 DF, p-value: 3.723e-11
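To see what "holding all the other independent variables constant" means in practice, here is a minimal sketch: two hypothetical cars that differ only in wt (by one unit), with qsec and drat fixed at arbitrary illustrative values, differ in predicted mpg by exactly the wt coefficient.

# Two hypothetical cars differing only in wt; the qsec and drat values are arbitrary
new.cars <- data.frame(wt = c(3, 4), qsec = 18, drat = 3.6)
diff(predict(lm.fit, newdata = new.cars))  # equals the wt coefficient, about -4.40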
# Critical value of the F distribution at alpha = 0.05 with 3 and 28 degrees of freedom
qf(p = 0.05, df1 = 3, df2 = 28, lower.tail = FALSE)
## [1] 2.946685
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) -5.1339149 27.922817
## wt -5.7868161 -3.008776
## qsec 0.4102527 1.482155
## drat -0.8571278 4.169418
QuantPsyc::lm.beta(lm.fit)
## wt qsec drat
## -0.7139693 0.2805421 0.1469244
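Each standardised beta is simply the raw coefficient rescaled by the ratio of the predictor's standard deviation to that of the outcome. A minimal sanity check for wt:

# Standardised beta for wt: raw coefficient * sd(wt) / sd(mpg)
coef(lm.fit)["wt"] * sd(mtcars$wt) / sd(mtcars$mpg)  # about -0.71, matching lm.beta() above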
There are a couple of additional concepts in multiple regression that we need to look at.
Variance Inflation Factor (VIF)
In a multiple regression model, it is possible that one of the independent variables can be linearly predicted from the other independent variables. This phenomenon is called multicollinearity, and we can use the Variance Inflation Factor (VIF) to measure it.
The definition as per Wikipedia is
“In statistics, the variance inflation factor (VIF) is the ratio of variance in a model with multiple terms, divided by the variance of a model with one term alone. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis. It provides an index that measures how much the variance (the square of the estimate’s standard deviation) of an estimated regression coefficient is increased because of collinearity.”
Let's find the VIF for the independent variables in our model using the vif() function from the car package.
car::vif(lm.fit)
## wt qsec drat
## 2.082097 1.033884 2.035472
To explain this further, we will create another model with qsec as the dependent variable and wt and drat as the independent variables (see the sketch below). This model gives us a multiple R-squared of 0.033, which is the variance explained by this model. We can conclude that wt and drat are able to explain only about 3% of the variance in qsec, which in turn implies that there is little or no correlation between qsec and the other two variables. Hence qsec doesn't display collinearity with the remaining variables.
But how is this related to VIF? If we take the reciprocal of the VIF of qsec we get 0.967, which is the same as the variance left unexplained by our new model (1 - 0.033 = 0.967). This quantity is also known as the tolerance; in general, VIF = 1 / (1 - R-squared) of the auxiliary model, and tolerance = 1 - R-squared.
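A minimal sketch of that auxiliary model, tying together its R-squared, the tolerance and the VIF (the model formula follows the description above):

# Auxiliary regression: how well do the other predictors explain qsec?
qsec.fit <- lm(qsec ~ wt + drat, data = mtcars)
r2 <- summary(qsec.fit)$r.squared  # about 0.033
1 - r2                             # tolerance, about 0.967
1 / (1 - r2)                       # VIF, about 1.034, matching car::vif() above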
In other words, tolerance is the proportion of variance in a given independent variable that is left unexplained once we know all of the other independent variables in the model. A very low tolerance means that the variable is highly redundant with everything else in the model.
The general rule of thumb is that any variable with a VIF of five or more (equivalently, a tolerance of 0.2 or less) is largely redundant: only about 20% of its variance is left over once we account for all of the other variables in the model, and it can usually be omitted from the model.
Partial Correlation
Partial correlation is the correlation between two variables while holding the other variables constant. Its square gives the proportion of the otherwise unexplained variance that is accounted for uniquely by each variable.
# Partial correlations among mpg (column 1), wt (column 6) and qsec (column 7)
ppcor::pcor(mtcars[, c(1, 6, 7)])
## $estimate
## mpg wt qsec
## mpg 1.0000000 -0.8885492 0.5456251
## wt -0.8885492 1.0000000 0.4176413
## qsec 0.5456251 0.4176413 1.0000000
##
## $p.value
## mpg wt qsec
## mpg 0.000000e+00 2.518948e-11 0.001499883
## wt 2.518948e-11 0.000000e+00 0.019400101
## qsec 1.499883e-03 1.940010e-02 0.000000000
##
## $statistic
## mpg wt qsec
## mpg 0.000000 -10.429771 3.506179
## wt -10.429771 0.000000 2.475278
## qsec 3.506179 2.475278 0.000000
##
## $n
## [1] 32
##
## $gp
## [1] 1
##
## $method
## [1] "pearson"
This post is a work in progress.