Multiple regression

From InterSciWiki
Jump to: navigation, search

SCCS R packageWikipedia:Variance_inflation_factor#Interpretation vif > 5 multicollinearity

Multiple regression. Instead of seeing if variables measure the same thing, we may want to see if different variables combine to predict others, possibly through their causal effects. It may be that the combination of the mother's role and the father's role in the nuclear family combine to reduce rates of crime. Here two separate variables may have an outcome for the third. Spss commands /Analyze/Regression ... etc (to be continued: Wikipedia:Regression).

Contents

Texts and Examples for Robust Regression

New software: Quick R Multiple (Linear) Regression

http://www.statmethods.net/stats/regression.html Multiple (Linear) Regression

R provides comprehensive support for multiple linear regression. The topics below are provided in order of increasing complexity.

FITTING THE MODEL

  • Multiple Linear Regression Example

fit <- lm(y ~ x1 + x2 + x3, data=mydata) summary(fit) *show results

  • Other useful functions

coefficients(fit)

  • model coefficients

confint(fit, level=0.95)

  • CIs for model parameters

fitted(fit)

  • predicted values

residuals(fit) residuals anova(fit)

  • anova table

vcov(fit)

  • covariance matrix for model parameters

influence(fit)

  • regression diagnostics

DIAGNOSTIC PLOTS

Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.

  • diagnostic plots

layout(matrix(c(1,2,3,4),2,2))

  • optional 4 graphs/page

plot(fit)

click to view

For a more comprehensive evaluation of model fit see regression diagnostics.

COMPARING MODELS

You can compare nested models with the anova( ) function. The following code provides a simultaneous test that x3 and x4 add to linear prediction above and beyond x1 and x2.

  • compare models

fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata) fit2 <- lm(y ~ x1 + x2) anova(fit1, fit2)

CROSS VALIDATION

You can do K-Fold cross-validation using the cv.lm( ) function in the DAAG package.

  • K-fold cross-validation

library(DAAG) cv.lm(df=mydata, fit, m=3) *3 fold cross-validation

Sum the MSE for each fold, divide by the number of observations, and take the square root to get the cross-validated standard error of estimate.

You can assess R2 shrinkage via K-fold cross-validation. Using the crossval() function from the bootstrap package, do the following:

  • Assessing R2 shrinkage using 10-Fold Cross-Validation

fit <- lm(y~x1+x2+x3,data=mydata)

library(bootstrap)

  • define functions

theta.fit <- function(x,y){lsfit(x,y)} theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}

  • matrix of predictors

X <- as.matrix(mydata[c("x1","x2","x3")])

  • vector of predicted values

y <- as.matrix(mydata[c("y")])

results <- crossval(X,y,theta.fit,theta.predict,ngroup=10) cor(y, fit$fitted.values)**2 *raw R2 cor(y,results$cv.fit)**2 *cross-validated R2

VARIABLE SELECTION

Selecting a subset of predictor variables from a larger set (e.g., stepwise selection) is a controversial topic. You can perform stepwise selection (forward, backward, both) using the stepAIC( ) function from the MASS package. stepAIC( ) performs stepwise model selection by exact AIC.

  • Stepwise Regression

library(MASS) fit <- lm(y~x1+x2+x3,data=mydata) step <- stepAIC(fit, direction="both") step$anova *display results

Alternatively, you can perform all-subsets regression using the leaps( ) function from the leaps package. In the following code nbest indicates the number of subsets of each size to report. Here, the ten best models will be reported for each subset size (1 predictor, 2 predictors, etc.).

  • All Subsets Regression

library(leaps) attach(mydata) leaps<-regsubsets(y~x1+x2+x3+x4,data=mydata,nbest=10)

  • view results

summary(leaps)

  • plot a table of models showing variables in each model.
  • models are ordered by the selection statistic.

plot(leaps,scale="r2")

  • plot statistic by subset size

library(car) subsets(leaps, statistic="rsq")

 click to view

Other options for plot( ) are bic, Cp, and adjr2. Other options for plotting with subset( ) are bic, cp, adjr2, and rss.

RELATIVE IMPORTANCE

The relaimpo package provides measures of relative importance for each of the predictors in the model. See help(calc.relimp) for details on the four measures of relative importance provided.

  • Calculate Relative Importance for Each Predictor

library(relaimpo) calc.relimp(fit,type=c("lmg","last","first","pratt"),

  rela=TRUE)
  • Bootstrap Measures of Relative Importance (1000 samples)

boot <- boot.relimp(fit, b = 1000, type = c("lmg",

 "last", "first", "pratt"), rank = TRUE, 
 diff = TRUE, rela = TRUE)

booteval.relimp(boot) *print result plot(booteval.relimp(boot,sort=TRUE)) *plot result

click to view

GRAPHIC ENHANCEMENTS

The car package offers a wide variety of plots for regression, including added variable plots, and enhanced diagnostic and scatter plots.

GOING FURTHER

NONLINEAR REGRESSION

The nls package provides functions for nonlinear regression. See John Fox's Nonlinear Regression and Nonlinear Least Squares for an overview. Huet and colleagues' Statistical Tools for Nonlinear Regression: A Practical Guide with S-PLUS and R Examples is a valuable reference book.

ROBUST REGRESSION

There are many functions in R to aid with robust regression. For example, you can perform robust regression with the rlm( ) function in the MASS package. John Fox's (who else?) Robust Regression provides a good starting overview. The UCLA Statistical Computing website has Robust Regression Examples.

The robust package provides a comprehensive library of robust methods, including regression. The robustbase package also provides basic robust statistics including model selection methods. And David Olive has provided an detailed online review of Applied Robust Statistics with sample R code.

New software: Quick R Regression Diagnostics

Regression Diagnostics An excellent review of regression diagnostics is provided in John Fox's aptly named Overview of Regression Diagnostics. Dr. Fox's car package provides advanced utilities for regression modeling.

  • Assume that we are fitting a multiple linear regression on the MTCARS data

library(car) fit <- lm(mpg~disp+hp+wt+drat, data=mtcars) This example is for exposition only. We will ignore the fact that this may not be a great way of modeling the this particular set of data! OUTLIERS

  • Assessing Outliers

outlierTest(fit)

  • Bonferroni p-value for most extreme obs

qqPlot(fit, main="QQ Plot") #qq plot for studentized resid leveragePlots(fit) *leverage plots

click to view leverage - outlier test

Bonferroni: The sequential Bonferroni (Holm 1979, Rice 1989) evaluates the increasing probability discounts in a series of ordered significance values. E.g., for Brown and Eff (2009) the only variables that pass the test are negative PCsize.2 (squared) (p = 0.011) and PCAP (p=0.012 ), two variables created by combining variables through principal components and then taking the first principal component. Food scarcity (p=.016) is a close third but even with a p<.05 threshold the probability of at least one of these three (.95*.95=.90, .95*.95*.95=.86 (.95*.95*.95=.86 occurring at random, even with a p<.05 threshold, the third variable does not pass the group significance test. With a p<.10 threshold and excluding the control variable for Date of observation (logdate) the “hidden variables” model also passes the Bonferroni test, p=.008 for No_rain_Dry and p=.034 for AnimXbridewealth, with Missions a close third (p=.043).

INFLUENTIAL OBSERVATIONS

  • Influential Observations
  • added variable plots

av.Plots(fit)

  • Cook's D plot
  • identify D values > 4/(n-k-1)

cutoff <- 4/((nrow(mtcars)-length(fit$coefficients)-2)) plot(fit, which=4, cook.levels=cutoff)

  • Influence Plot

influencePlot(fit, id.method="identify", main="Influence Plot", sub="Circle size is proportial to Cook's Distance" )

  click to view Average variable plot
  click to view influenceplot
  click to view CooksD 

NON-NORMALITY

  • Normality of Residuals
  • qq plot for studentized resid

qqPlot(fit, main="QQ Plot")

  • distribution of studentized residuals

library(MASS) sresid <- studres(fit) hist(sresid, freq=FALSE,

  main="Distribution of Studentized Residuals")

xfit<-seq(min(sresid),max(sresid),length=40) yfit<-dnorm(xfit) lines(xfit, yfit)

 click to view qqplot2
 click to view Studentized residuals

NON-CONSTANT ERROR VARIANCE

  • Evaluate homoscedasticity
  • non-constant error variance test

ncvTest(fit)

  • plot studentized residuals vs. fitted values

spreadLevelPlot(fit)

 click to view spreadlevel plot

MULTI-COLLINEARITY

  • Evaluate Collinearity

vif(fit) *variance inflation factors sqrt(vif(fit)) > 2 *problem? NONLINEARITY

  • Evaluate Nonlinearity
  • component + residual plot

crPlots(fit)

  • Ceres plots

ceresPlots(fit)

 click to view component residual plots
 click to view ceres plot

NON-INDEPENDENCE OF ERRORS

  • Test for Autocorrelated Errors

durbinWatsonTest(fit) ADDITIONAL DIAGNOSTIC HELP The gvlma( ) function in the gvlma package, performs a global validation of linear model assumptions as well separate evaluations of skewness, kurtosis, and heteroscedasticity.

  • Global test of model assumptions

library(gvlma) gvmodel <- gvlma(fit) summary(gvmodel)

GOING FURTHER If you would like to delve deeper into regression diagnostics, two books written by John Fox can help: Applied regression analyses, linear models, and related methods and An R and S-Plus companion to applied regression.

Links

Testing the assumptions of linear regression by Robert Nau

Common Correlation and Reliability Analysis with SPSS for Windows by Robert A. Yaffee

Panda

P.A.N.D.A. Practical Analysis of Nutrition Data. Krishna Belbase UNICEF and Tulane University

Panda Ch5 pg3 Multi-way Analysis When and Why
Panda Ch5 pg3 Multi-way Analysis Confounding
Panda Ch5 pg3 Multi-way Analysis Interactions
Personal tools