Thetazero Pubs

Homework 4 Solutions

(a) GPA is the response variable. FirstGen is a categorical variable: although it is coded numerically, it is not quantitative (for example, it makes no sense to add the values). Our study is an observational study; for it to be a controlled experiment we would need to define treatment groups and randomly allocate students to them. For example, had we randomly assigned students to dorm rooms or off-campus housing, we could infer whether type of housing causes a higher or lower first-year GPA.


(b) Get the data and break into training and testing subsets:

setwd("/Users/traves/Dropbox/SM339/stat2 data files")
data = read.csv("FirstYearGPA.csv")
set.seed(1001)
trainingrows = sample(1:nrow(data), 150, replace = FALSE)
training = data[trainingrows, ]
testing = data[-trainingrows, ]
attach(training)
mean(GPA)
## [1] 3.072

The mean GPA of the training set is 3.07.

(c) Fit full model:

full = lm(GPA ~ HSGPA + SATV + SATM + Male + HU + SS + FirstGen + White + CollegeBound)
summary(full)
## 
## Call:
## lm(formula = GPA ~ HSGPA + SATV + SATM + Male + HU + SS + FirstGen + 
##     White + CollegeBound)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0437 -0.2469  0.0324  0.2172  0.9452 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.505389   0.416735    1.21   0.2273    
## HSGPA         0.517050   0.087315    5.92  2.3e-08 ***
## SATV          0.001280   0.000478    2.68   0.0083 ** 
## SATM         -0.000675   0.000573   -1.18   0.2409    
## Male          0.041667   0.070393    0.59   0.5549    
## HU            0.020430   0.004926    4.15  5.8e-05 ***
## SS            0.009360   0.007015    1.33   0.1843    
## FirstGen     -0.012299   0.095552   -0.13   0.8978    
## White         0.167020   0.087638    1.91   0.0587 .  
## CollegeBound -0.045537   0.109518   -0.42   0.6782    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.385 on 140 degrees of freedom
## Multiple R-squared:  0.397,  Adjusted R-squared:  0.359 
## F-statistic: 10.3 on 9 and 140 DF,  p-value: 4.69e-12

The fitted equation is:

GPA = 0.5053888 + 0.5170505HSGPA + 0.0012802SATV - 0.0006748SATM + 0.0416671Male + 0.0204302HU + 0.0093600SS - 0.0122991FirstGen + 0.1670201White - 0.0455375CollegeBound


The \( R^2 \) value is 0.3974. I'd drop the FirstGen variable from the model first since it has the highest p-value (0.89777); its coefficient cannot be reliably distinguished from zero and so it does not affect the model very much.

(d) Drop variables one at a time until all variables in the model have p-value less than 0.05:

mod1 = lm(GPA ~ HSGPA + SATV + SATM + Male + HU + SS + White + CollegeBound)
summary(mod1)
## 
## Call:
## lm(formula = GPA ~ HSGPA + SATV + SATM + Male + HU + SS + White + 
##     CollegeBound)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0427 -0.2468  0.0328  0.2178  0.9472 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.498143   0.411473    1.21   0.2281    
## HSGPA         0.515832   0.086497    5.96  1.9e-08 ***
## SATV          0.001288   0.000473    2.73   0.0072 ** 
## SATM         -0.000674   0.000571   -1.18   0.2400    
## Male          0.041370   0.070110    0.59   0.5561    
## HU            0.020545   0.004827    4.26  3.8e-05 ***
## SS            0.009378   0.006989    1.34   0.1818    
## White         0.168977   0.086007    1.96   0.0514 .  
## CollegeBound -0.044326   0.108731   -0.41   0.6841    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.383 on 141 degrees of freedom
## Multiple R-squared:  0.397,  Adjusted R-squared:  0.363 
## F-statistic: 11.6 on 8 and 141 DF,  p-value: 1.33e-12
mod2 = lm(GPA ~ HSGPA + SATV + SATM + Male + HU + SS + White)
summary(mod2)
## 
## Call:
## lm(formula = GPA ~ HSGPA + SATV + SATM + Male + HU + SS + White)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.049 -0.248  0.029  0.221  0.944 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.451612   0.394163    1.15   0.2538    
## HSGPA        0.523473   0.084194    6.22  5.3e-09 ***
## SATV         0.001267   0.000468    2.71   0.0077 ** 
## SATM        -0.000688   0.000568   -1.21   0.2282    
## Male         0.040427   0.069865    0.58   0.5637    
## HU           0.020433   0.004805    4.25  3.8e-05 ***
## SS           0.009378   0.006969    1.35   0.1805    
## White        0.173414   0.085065    2.04   0.0433 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.382 on 142 degrees of freedom
## Multiple R-squared:  0.397,  Adjusted R-squared:  0.367 
## F-statistic: 13.3 on 7 and 142 DF,  p-value: 3.74e-13
mod3 = lm(GPA ~ HSGPA + SATV + SATM + HU + SS + White)
summary(mod3)
## 
## Call:
## lm(formula = GPA ~ HSGPA + SATV + SATM + HU + SS + White)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0652 -0.2506  0.0288  0.2183  0.9622 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.427608   0.391061    1.09   0.2760    
## HSGPA        0.515156   0.082764    6.22  5.1e-09 ***
## SATV         0.001272   0.000467    2.72   0.0073 ** 
## SATM        -0.000580   0.000536   -1.08   0.2806    
## HU           0.020358   0.004792    4.25  3.9e-05 ***
## SS           0.009329   0.006952    1.34   0.1818    
## White        0.174144   0.084858    2.05   0.0420 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.381 on 143 degrees of freedom
## Multiple R-squared:  0.395,  Adjusted R-squared:  0.37 
## F-statistic: 15.6 on 6 and 143 DF,  p-value: 1.05e-13
mod4 = lm(GPA ~ HSGPA + SATV + HU + SS + White)
summary(mod4)
## 
## Call:
## lm(formula = GPA ~ HSGPA + SATV + HU + SS + White)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0755 -0.2668  0.0372  0.2315  0.9498 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.232778   0.347432    0.67    0.504    
## HSGPA       0.506545   0.082431    6.15  7.4e-09 ***
## SATV        0.001022   0.000406    2.51    0.013 *  
## HU          0.021110   0.004745    4.45  1.7e-05 ***
## SS          0.011111   0.006759    1.64    0.102    
## White       0.159404   0.083809    1.90    0.059 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.382 on 144 degrees of freedom
## Multiple R-squared:  0.39,   Adjusted R-squared:  0.369 
## F-statistic: 18.4 on 5 and 144 DF,  p-value: 4.03e-14
mod5 = lm(GPA ~ HSGPA + SATV + HU + White)
summary(mod5)
## 
## Call:
## lm(formula = GPA ~ HSGPA + SATV + HU + White)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0392 -0.2735  0.0338  0.2578  0.9414 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.274812   0.348518    0.79    0.432    
## HSGPA       0.520937   0.082445    6.32  3.1e-09 ***
## SATV        0.001032   0.000409    2.53    0.013 *  
## HU          0.018508   0.004499    4.11  6.5e-05 ***
## White       0.175115   0.083750    2.09    0.038 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.384 on 145 degrees of freedom
## Multiple R-squared:  0.379,  Adjusted R-squared:  0.362 
## F-statistic: 22.1 on 4 and 145 DF,  p-value: 2.9e-14

Starting from the full model, we drop FirstGen first (as in part (c)) and then, refitting after each step, drop CollegeBound, Male, SATM, and SS in that order. The resulting fitted equation is:

GPA = 0.2748118 + 0.5209371HSGPA + 0.0010324SATV + 0.0185075HU + 0.1751148White.
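
The one-at-a-time dropping above can also be written as a loop. The following is only a sketch on simulated stand-in data (not the FirstYearGPA file); the helper name `backward_by_pvalue` and the columns `x1` to `x4` are made up for illustration, and the helper assumes all predictors are numeric, so that coefficient names match predictor names.

```r
# Simulated stand-in data (hypothetical; NOT the FirstYearGPA data).
set.seed(1)
n <- 150
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 1 + 0.8 * sim$x1 + 0.5 * sim$x2 + rnorm(n)  # x3 and x4 are pure noise

# Hypothetical helper: refit, drop the predictor with the largest p-value,
# and repeat until every remaining predictor has p-value below alpha.
backward_by_pvalue <- function(response, predictors, data, alpha = 0.05) {
  dropped <- character(0)
  while (length(predictors) > 0) {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients[-1, , drop = FALSE]  # skip the intercept row
    pvals <- coefs[, "Pr(>|t|)"]
    if (max(pvals) < alpha) break
    worst <- rownames(coefs)[which.max(pvals)]
    dropped <- c(dropped, worst)
    predictors <- setdiff(predictors, worst)
  }
  list(kept = predictors, dropped = dropped)
}

result <- backward_by_pvalue("y", c("x1", "x2", "x3", "x4"), sim)
print(result)
```

Refitting after every single drop matters: p-values change as variables leave the model, as White's does between mod4 and mod5 above.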

(e) Plot residuals:

plot(rstandard(mod5) ~ mod5$fitted)
abline(0, 0, col = "blue", lwd = 2)

[Figure: standardized residuals versus fitted values for mod5, with a horizontal reference line at 0]

There is no clear pattern in the residuals that should concern us greatly. The spread at the edges is a little smaller than in the center, but that is more likely the result of having so few observations at the edges than a real pattern.

require(lattice)
## Loading required package: lattice
require(latticeExtra)
## Loading required package: latticeExtra
## Loading required package: RColorBrewer
densityplot(mod5$residuals, main = "Density plot of residuals of reduced model", 
    xlab = "Residual values")

[Figure: density plot of residuals of reduced model]

qqnorm(mod5$residuals, main = "QQ-plot of residuals of reduced model")
qqline(mod5$residuals, col = "blue", lwd = 2)

[Figure: QQ-plot of residuals of reduced model]

The density plot and the QQ plot suggest that the residuals from our model are approximately normally distributed. We can be reasonably confident that statistical inference based on this model is appropriate.

(f) 95% CI:

beta = 0.0185075
se = 0.0044988
tstar = qt(0.975, df = 150 - 5)
low = beta - tstar * se
high = beta + tstar * se
print(c(low, high))
## [1] 0.009616 0.027399

A 95% CI for the effect of taking an extra humanities credit is (0.0096, 0.0274). We are 95% confident that each extra humanities credit is associated with an increase in first-year GPA of between 0.0096 and 0.0274 points, holding the other predictors in the model fixed.
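
As a cross-check on the hand computation, base R's `confint()` returns the same t-based interval directly from a fitted `lm` object. The sketch below uses simulated stand-in data (the columns `x1`, `x2`, `y` are made up, not the GPA data) and compares the two approaches.

```r
# Simulated stand-in data (hypothetical; NOT the FirstYearGPA data).
set.seed(2)
sim <- data.frame(x1 = rnorm(150), x2 = rnorm(150))
sim$y <- 2 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(150)
fit <- lm(y ~ x1 + x2, data = sim)

# By hand: estimate +/- t* times standard error, on the residual df.
est <- coef(summary(fit))["x1", "Estimate"]
se <- coef(summary(fit))["x1", "Std. Error"]
tstar <- qt(0.975, df = df.residual(fit))
manual <- c(est - tstar * se, est + tstar * se)

# Built-in equivalent.
builtin <- confint(fit, "x1", level = 0.95)

print(manual)
print(builtin)
```

Using `df.residual(fit)` avoids hard-coding the degrees of freedom (150 - 5 above), which must be updated by hand if the model changes.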

(g) Prediction interval:

new = data.frame(HSGPA = 3.41, SATV = 700, SATM = 690, CollegeBound = 1, Male = 1, 
    White = 1, FirstGen = 0, HU = 10, SS = 19)
predict(mod5, newdata = new, interval = "prediction", level = 0.95)
##     fit   lwr   upr
## 1 3.134 2.369 3.899

We are 95% confident that his first year GPA will lie between 2.37 and 3.90.
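
The prediction interval is wider than a confidence interval for the mean response because it adds the residual variance to the uncertainty in the fitted value: the half-width is \( t^* \sqrt{SE_{fit}^2 + \hat{\sigma}^2} \). A minimal sketch on simulated stand-in data (the columns `x1`, `x2`, `y` and the object `new_obs` are made up) that rebuilds the interval by hand and compares it with `predict()`:

```r
# Simulated stand-in data (hypothetical; NOT the FirstYearGPA data).
set.seed(3)
sim <- data.frame(x1 = rnorm(150), x2 = rnorm(150))
sim$y <- 2 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(150)
fit <- lm(y ~ x1 + x2, data = sim)
new_obs <- data.frame(x1 = 1, x2 = -0.5)

# se.fit = TRUE returns the fitted value, its standard error,
# the residual df, and the residual standard error.
p <- predict(fit, newdata = new_obs, se.fit = TRUE)
tstar <- qt(0.975, df = p$df)
half_width <- tstar * sqrt(p$se.fit^2 + p$residual.scale^2)
manual <- unname(c(p$fit, p$fit - half_width, p$fit + half_width))

# Built-in prediction interval: columns fit, lwr, upr.
builtin <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)

print(manual)
print(builtin)
```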

(h) Compare models:

mod6 = lm(GPA ~ SATV + SATM + CollegeBound + HSGPA)
summary(mod6)
## 
## Call:
## lm(formula = GPA ~ SATV + SATM + CollegeBound + HSGPA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9362 -0.2706  0.0062  0.2869  0.9160 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.594066   0.430062    1.38  0.16930    
## SATV          0.001744   0.000483    3.61  0.00042 ***
## SATM         -0.000700   0.000556   -1.26  0.21018    
## CollegeBound -0.051327   0.115406   -0.44  0.65717    
## HSGPA         0.552583   0.090155    6.13  7.9e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.411 on 145 degrees of freedom
## Multiple R-squared:  0.288,  Adjusted R-squared:  0.268 
## F-statistic: 14.7 on 4 and 145 DF,  p-value: 4.4e-10
summary(mod5)
## 
## Call:
## lm(formula = GPA ~ HSGPA + SATV + HU + White)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0392 -0.2735  0.0338  0.2578  0.9414 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.274812   0.348518    0.79    0.432    
## HSGPA       0.520937   0.082445    6.32  3.1e-09 ***
## SATV        0.001032   0.000409    2.53    0.013 *  
## HU          0.018508   0.004499    4.11  6.5e-05 ***
## White       0.175115   0.083750    2.09    0.038 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.384 on 145 degrees of freedom
## Multiple R-squared:  0.379,  Adjusted R-squared:  0.362 
## F-statistic: 22.1 on 4 and 145 DF,  p-value: 2.9e-14

The adjusted \( R^2 \) is 0.2684 for the new model compared to 0.3617 for our model from part (d). In general, we prefer models with higher adjusted \( R^2 \) so we prefer the model from part (d).
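
For reference, adjusted \( R^2 \) is computed from \( R^2 \), the sample size \( n \), and the number of predictors \( k \) as \( 1 - (1 - R^2)\frac{n-1}{n-k-1} \). The short check below plugs in the rounded \( R^2 \) values printed in the two summaries (so the results agree with the printed adjusted values only to about three decimals); the function name `adj_r2` is made up for illustration.

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors k.
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Rounded R^2 values taken from the summaries above; n = 150, k = 4 for both.
adj_mod5 <- adj_r2(0.379, n = 150, k = 4)
adj_mod6 <- adj_r2(0.288, n = 150, k = 4)

print(round(c(mod5 = adj_mod5, mod6 = adj_mod6), 3))
```

Both models here have four predictors, so the penalty factor is the same and the comparison comes down to which model has the larger \( R^2 \).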

(i) We should test this claim by checking the coefficient of Male in our model. Here we use the full model from part (c), since Male does not appear in the model from part (d).

\( H_0: \) true coeff of Male is \( \leq \) 0

\( H_1: \) true coeff of Male is \( >0 \)

The test statistic is 0.041667/0.070393 = 0.592, which follows a t-distribution with 140 degrees of freedom under the null hypothesis. The p-value is 0.2774295. Since this is not less than 0.05, we fail to reject the null hypothesis: the data do not provide evidence that male students have a statistically significantly higher GPA than female students.

1 - pt(0.041667/0.070393, df = 140)
## [1] 0.2774
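
An equivalent way to get the upper-tail probability is `pt(..., lower.tail = FALSE)`. The sketch below reuses the estimate and standard error from the full-model summary and also shows how the one-sided p-value relates to the two-sided p-value for Male printed there (0.5549): because the t-statistic is positive, the one-sided value is half of the two-sided one.

```r
# Estimate and standard error for Male, copied from summary(full).
tstat <- 0.041667 / 0.070393

# One-sided (upper-tail) p-value; same as 1 - pt(tstat, df = 140).
p_upper <- pt(tstat, df = 140, lower.tail = FALSE)

# Two-sided p-value, as reported in the Pr(>|t|) column.
p_two <- 2 * pt(abs(tstat), df = 140, lower.tail = FALSE)

print(c(one_sided = p_upper, two_sided = p_two))
```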

(j) The correlation coefficient between White and SATV is 0.37, a moderate positive correlation. Both variables may be correlated with income, which is one of the things that the redesign of the SAT is intended to address.

cor(White, SATV)
## [1] 0.3701

(k) The model only accounts for 25.2% of the variability in GPA for the testing data, as compared to 37.9% of the variability in GPA for the training data. This drop is common and is due to overfitting: the training data is used to fit the model and so the model predicts the response variable very well for that data, but for other data it may not achieve the same level of predictive power.

attach(testing)
## The following objects are masked from training:
## 
##     CollegeBound, FirstGen, GPA, HSGPA, HU, Male, SATM, SATV, SS,
##     White
predGPA = 0.2748118 + 0.5209371 * HSGPA + 0.0010324 * SATV + 0.0185075 * HU + 
    0.1751148 * White
cor(predGPA, GPA)^2
## [1] 0.2519
summary(mod5)
## 
## Call:
## lm(formula = GPA ~ HSGPA + SATV + HU + White)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0392 -0.2735  0.0338  0.2578  0.9414 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.274812   0.348518    0.79    0.432    
## HSGPA       0.520937   0.082445    6.32  3.1e-09 ***
## SATV        0.001032   0.000409    2.53    0.013 *  
## HU          0.018508   0.004499    4.11  6.5e-05 ***
## White       0.175115   0.083750    2.09    0.038 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.384 on 145 degrees of freedom
## Multiple R-squared:  0.379,  Adjusted R-squared:  0.362 
## F-statistic: 22.1 on 4 and 145 DF,  p-value: 2.9e-14
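
The test-set calculation can also be done without retyping coefficients or attaching data frames, by passing the held-out rows to `predict()`. This is a sketch on simulated stand-in data; the sizes (200 rows, 150 for training) and the columns `x1`, `x2`, `y` are assumptions for illustration, not the GPA data.

```r
# Simulated stand-in data (hypothetical; NOT the FirstYearGPA data).
set.seed(4)
sim <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
sim$y <- 1 + 0.6 * sim$x1 + 0.4 * sim$x2 + rnorm(200)

# Split into training and testing rows, as in part (b).
rows <- sample(seq_len(nrow(sim)), 150, replace = FALSE)
train <- sim[rows, ]
test <- sim[-rows, ]

# Fit on the training rows only, then predict the held-out rows.
fit <- lm(y ~ x1 + x2, data = train)
pred <- predict(fit, newdata = test)

# Squared correlation between predicted and actual responses on each set.
r2_train <- summary(fit)$r.squared
r2_test <- cor(pred, test$y)^2
print(c(train = r2_train, test = r2_test, difference = r2_train - r2_test))
```

Passing `data =` and `newdata =` explicitly avoids the masking messages that `attach()` produces above and keeps the predictions tied to the fitted model rather than to hand-copied coefficients.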