In this case study, we consider eBay auctions of a video game called Mario Kart for the Nintendo Wii. The outcome variable of interest is the total price of an auction, which is the highest bid plus the shipping cost. We will try to determine how total price is related to each characteristic in an auction while simultaneously controlling for other variables. For instance, all other characteristics held constant, are longer auctions associated with higher or lower prices? And, on average, how much more do buyers tend to pay for additional Wii wheels (plastic steering wheels that attach to the Wii controller) in auctions? Multiple regression will help us answer these and other questions.
When fitting a multiple linear regression model, we state a hypothesis for each predictor variable of interest.
Hypotheses: There is no linear relationship between price and condition of the game given that duration, wheels, and stock photo are included in the model.
\(H_0: \beta_{\text{cond}} = 0 \mid \text{duration, wheels, stock\_photo}\)
There is no linear relationship between price and stock photo given that duration, wheels, and condition are included in the model.
\(H_0: \beta_{\text{stock\_photo}} = 0 \mid \text{duration, wheels, cond}\)
There is no linear relationship between price and duration given that condition, wheels, and stock photo are included in the model.
\(H_0: \beta_{\text{duration}} = 0 \mid \text{cond, wheels, stock\_photo}\)
There is no linear relationship between price and wheels given that duration, condition, and stock photo are included in the model.
\(H_0: \beta_{\text{wheels}} = 0 \mid \text{duration, cond, stock\_photo}\)
The mariokart dataset can be found in the openintro package. Today we will use a package we have not used before. This package is called caret and is used for assessing model accuracy. To install the caret package, run install.packages("caret") in the console. Once you are done, load the following packages:
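The package list itself did not survive formatting. Based on the functions used later in this lab (data("mariokart"), tidy(), train(), and trainControl()), a reasonable set of library calls would be:

library(openintro)  # contains the mariokart data
library(broom)      # tidy() for readable model summaries
library(caret)      # trainControl() and train() for cross-validation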
After installing caret and loading all libraries, import the dataset using:

data("mariokart")

We begin by running a model with cond as the only predictor variable. See below:
Model_1 <- lm(total_pr ~ cond, data = mariokart)
tidy(Model_1)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    53.8       3.33     16.2  7.04e-34
2 condused       -6.62      4.34     -1.52 1.30e- 1
Write down the equation for the model, note whether the slope is statistically different from zero, and interpret the coefficient.
Sometimes there are underlying structures or relationships between predictor variables. For instance, new games sold on eBay tend to come with more Wii wheels, which may have led to higher prices for those auctions. We would like to fit a model that includes all potentially important variables simultaneously, which would help us evaluate the relationship between a predictor variable and the outcome while controlling for the potential influence of other variables.
We want to construct a model that accounts not only for the game condition but simultaneously for three other variables. Run the following code:
Model_2 <- lm(total_pr ~ cond
                       + stock_photo
                       + duration
                       + wheels,
              data = mariokart)
tidy(Model_2)

# A tibble: 5 × 5
  term           estimate std.error statistic     p.value
  <chr>             <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)      43.5       8.37      5.20  0.000000705
2 condused         -2.58      5.23     -0.494 0.622
3 stock_photoyes   -6.75      5.17     -1.31  0.194
4 duration          0.379     0.939     0.403 0.687
5 wheels            9.95      2.72      3.66  0.000359
Write the equation of the new model.
What does the slope for wheels represent?
Use your model to compute the residual for the first observation.
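One way to check your computation in R (a quick sketch using base R's predict() and resid(); this snippet is not part of the original lab) is:

# residual = observed total price minus the model's prediction
mariokart$total_pr[1] - predict(Model_2)[1]
# equivalently, pull the first residual straight from the fitted model
resid(Model_2)[1]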
As we saw earlier in the semester, a scatterplot matrix is a nice way to assess associations between different pairs of variables. Below is a scatterplot matrix for the variables above:
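The plot itself is not reproduced here, but a base-R sketch that draws a comparable matrix (pairs() plots factor variables such as cond using their numeric codes) would be:

# scatterplot matrix of the outcome and the four predictors
pairs(~ total_pr + cond + stock_photo + duration + wheels, data = mariokart)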
In the previous chapters, we assessed model accuracy using the adjusted R-squared in the context of multiple linear regression and the confusion matrix in the context of logistic regression. We now introduce a new tool known as cross-validation (CV) that is commonly used to assess regression models. In cross-validation, we start by randomly splitting a dataset into k subsets (known as folds). Next, we use k−1 of the folds to fit a regression model, and then we assess its accuracy using the remaining fold. Note that the data used to fit the model and the data used to check its accuracy are independent. We then repeat this process, holding out a different fold each time, until every fold has been used. If you split your data into 5 folds, each fold will be one-fifth of the data: each time you fit a model, you will use four-fifths of the data for fitting and one-fifth for testing. In the case of 5 folds, the process is repeated 5 times.
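The caret package will automate this procedure for us below, but a hand-rolled sketch (written here only to make the mechanics concrete; it is not part of the original lab) might look like:

set.seed(123)
# randomly assign each of the auctions to one of 3 folds
folds <- sample(rep(1:3, length.out = nrow(mariokart)))
rmse <- numeric(3)
for (k in 1:3) {
  fit  <- lm(total_pr ~ cond, data = mariokart[folds != k, ])  # fit on 2 folds
  pred <- predict(fit, newdata = mariokart[folds == k, ])      # predict the held-out fold
  rmse[k] <- sqrt(mean((mariokart$total_pr[folds == k] - pred)^2))
}
mean(rmse)  # average test RMSE across the 3 folds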
Let us start by running CV with cond as the only predictor variable. We are going to use \(k=3\).
# specify the cross-validation method
set.seed(123)
ctrl <- trainControl(method = "cv", number = 3)

# fit a regression model and use k-fold CV to evaluate performance
cv_model_1 <- train(total_pr ~ cond,
                    data = mariokart,
                    method = "lm",
                    trControl = ctrl)

# view summary of k-fold CV
print(cv_model_1)

Linear Regression

143 samples
  1 predictor

No pre-processing
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 95, 96, 95
Resampling results:

  RMSE      Rsquared   MAE
  21.37937  0.1347325  8.898653

Tuning parameter 'intercept' was held constant at a value of TRUE
Here is how to interpret the output:
No pre-processing occurred. That is, we didn't scale the data in any way before fitting the models.
The resampling method we used to evaluate the model was cross-validation with 3 folds. The sample sizes for the training sets were 95, 96, and 95.
RMSE: The root mean squared error. This measures the average difference between the predictions made by the model and the actual observations. The lower the RMSE, the more closely a model can predict the actual observations.
Rsquared: This is the squared correlation between the predictions made by the model and the actual observations. The higher the R-squared, the more closely a model can predict the actual observations.
MAE: The mean absolute error. This is the average absolute difference between the predictions made by the model and the actual observations. The lower the MAE, the more closely a model can predict the actual observations.
Each of the three metrics provided in the output (RMSE, R-squared, and MAE) gives us an idea of how well the model performed on previously unseen data.
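For reference, with \(y_i\) denoting an observed total price, \(\hat{y}_i\) the model's prediction for it, and \(n\) the number of held-out observations, the two error metrics are
\[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|.\]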
We can use the following code to view the performance metrics computed on each fold:
cv_model_1$resample

       RMSE     Rsquared       MAE Resample
1  9.301291 2.653776e-01  7.457067    Fold1
2 41.872020 7.529513e-05 11.340381    Fold2
3 12.964785 1.387445e-01  7.898510    Fold3
Notice that, as expected, the average of the three RMSEs is the same as the RMSE in cv_model_1.
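You can verify this directly (a one-line check, not part of the original lab):

mean(cv_model_1$resample$RMSE)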
In practice, we typically fit several different models and compare these three metrics to decide which model produces the lowest test error and is therefore the best model to use.
Let us run a second CV model that excludes stock_photo and see what we get.
set.seed(132)
# specify the cross-validation method
ctrl_1 <- trainControl(method = "cv", number = 3)

# fit a regression model and use k-fold CV to evaluate performance
cv_model_2 <- train(total_pr ~ cond
                             + duration
                             + wheels,
                    data = mariokart,
                    method = "lm",
                    trControl = ctrl_1)

# view summary of k-fold CV
print(cv_model_2)

Linear Regression

143 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 95, 95, 96
Resampling results:

  RMSE      Rsquared   MAE
  19.56315  0.3528253  8.116431

Tuning parameter 'intercept' was held constant at a value of TRUE
Here, we see that the model without stock_photo is better because it has a lower RMSE and a higher R-squared. The idea is to run several different models and settle on the one that fits best.
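To put the two candidates side by side, one convenient option (a small snippet of our own, relying on the $results element where caret stores these metrics) is:

# stack the headline CV metrics from both models for easy comparison
rbind(model_1 = cv_model_1$results[, c("RMSE", "Rsquared", "MAE")],
      model_2 = cv_model_2$results[, c("RMSE", "Rsquared", "MAE")])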