Learning Objectives

  • Choosing what variables to include in a model.
  • Chapter 12 of Statistical Sleuth.
  • Chapter 9 of KNNL.

Motivation

  • You often have many predictor variables.

  • If you include too few, you are potentially not controlling for key variables.

  • If you include too many, your regression coefficient estimates might be unstable.

  • Model selection: Choosing which variables (and transformations of these variables) to include.

Steps of Selecting a Model

From the Statistical Sleuth:

  1. Identify the key objectives.
  2. Screen the available variables, deciding on a list that is sensitive to the objectives and excludes obvious redundancies.
  • Steps 3, 4, and 5: Repeat the following until satisfied:
    1. Perform exploratory analysis, examining graphical displays and correlation coefficients.
    2. Perform transformations as necessary.
    3. Examine a residual plot after fitting a rich model, performing further transformations and considering outliers.
  6. Use a computer-assisted technique for finding a suitable subset of explanatory variables, exerting enough control over the process to be sensitive to the questions of interest.
  7. Proceed with the analysis, using the selected explanatory variables.

Step 1: Identify Objectives and Questions of Interest

Objective: Association between X and Y controlling for Z

  • Goal is to determine the association between a response and some interesting predictors after adjusting for other nuisance predictors.

  • Example: Is there evidence of an association between salary and sex after adjusting for other, legitimate, determinants of salary?

    • Association of interest: sex versus salary
    • Nuisance variables to adjust for: Seniority, age, education, experience
  • Generally, a good strategy is to (during steps 6 and 7 from above)

    1. Perform automatic variable selection techniques with everything except the explanatory variables of interest, then
    2. Include the explanatory variables of interest to test for associations.
  • Automatic variable selection techniques destroy the interpretation of \(p\)-values. So

    • Do not interpret what set of nuisance variables were chosen,
    • Do not interpret the p-values of the nuisance variables,
    • Do not interpret the coefficients of the nuisance variables.
    • Only interpret the coefficients and p-values corresponding to the variables of interest.
  • Example: Using automatic selection procedures, it was determined to include experience, seniority, and education, but not age in the final model. After adding in sex, the resulting fit to the final model was

    ## # A tibble: 5 × 5
    ##   term         estimate std.error statistic  p.value
    ##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
    ## 1 (Intercept)  8.67      0.0955       90.8  9.51e-89
    ## 2 Senior      -0.00435   0.000944     -4.61 1.38e- 5
    ## 3 Educ         0.0164    0.00448       3.66 4.26e- 4
    ## 4 Exper        0.000261  0.000107      2.45 1.62e- 2
    ## 5 SexMale      0.130     0.0214        6.06 3.30e- 8
    • It is tempting to say that Age is not associated with salary, adjusting for other variables, while the other variables are associated with salary, adjusting for other variables. However, this is wrong.

    • Since we used an automatic variable selection procedure for Age, Seniority, Education, and Experience, we cannot interpret those coefficients or \(p\)-values.

    • We can only interpret the coefficient and \(p\)-value for sex, on which we did not use an automatic variable selection technique.

    • We estimate that females make about 88% the salary of males (\(e^{-0.130} \approx 0.88\)), adjusting for other, legitimate, predictors of base salary. The evidence is strong that this association is not due to chance alone (\(p = 3.3\times 10^{-8}\)).

    • Technically, we are also adjusting for age in this statement, even though it was not in the model, because it had the chance to be in the model.
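  • A sketch of this two-step strategy in R, assuming the data are Sleuth3’s case1202 with log base salary (Bsal) as the response (the selection in step 1 is illustrative, not necessarily the exact procedure used above):

    library(Sleuth3)
    library(broom)
    salary <- case1202
    # Step 1: automatic selection among the nuisance variables only (Sex excluded).
    lm_nuisance <- lm(log(Bsal) ~ Senior + Age + Educ + Exper, data = salary)
    lm_selected <- step(lm_nuisance, trace = 0)
    # Step 2: add the variable of interest; interpret only its row.
    lm_final <- update(lm_selected, . ~ . + Sex)
    tidy(lm_final)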

Objective: Fishing

  • You have a response variable and many explanatory variables, and you want to know what variables are possibly associated with your response.

  • Then iterate through adding/removing variables, making transformations, checking residuals, until you develop a model with significant terms and no major issues.

  • \(p\)-values/confidence intervals don’t have proper interpretation.

    • Same problems as with multiple comparisons: we ran many tests and looked at the data a lot to arrive at the final model.
  • You generally build a model and tell stories with it.

  • Issues:

    1. The selected explanatory variables are not necessarily special. Inclusion/exclusion is strongly affected by multicollinearity.
      • I.e., you are just generating hypotheses, not conclusions.
    2. Causal interpretations are not allowed, as usual, with observational studies.
    3. \(p\)-values are meaningless.
  • This is a strict interpretation, but folks in the real world typically fudge a little bit. How bad this is depends on how much fudging you do. My opinion:

    • If you have many variables, a small sample size, and use automatic variable selection to come up with a model with a few predictors, then you are out of luck.
    • If you have a large sample size, a small number of predictors, and you use some automatic procedures to come up with a couple models that you are going to use your expertise to choose between, then you are only fudging a little bit.
  • If your dataset is large, then you can split your data into a training set and a test set. You can do whatever variable selection techniques you want on your training set, then fit your final model on your test set to obtain \(p\)-values that have the proper interpretation.

    • Test set could be about 25% and the training set about 75%.
    • Splitting data is feasible if you have more than 6 observations per predictor variable in the training set.
    • As soon as you look at how well your model works on your test set, you can no longer change your analysis.
    • I.e., if you cheat and redo your analysis to better fit the test set, then you are back to the problems of fishing and there was no point in splitting the data in the first place.
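  • A minimal sketch of this split (the data frame dat and the response y are hypothetical):

    # Randomly split the data: about 75% training, 25% test.
    set.seed(1)
    n <- nrow(dat)
    train_idx <- sample(n, size = round(0.75 * n))
    dat_train <- dat[train_idx, ]
    dat_test  <- dat[-train_idx, ]
    # Fish as much as you like on the training set, e.g. with step() ...
    lm_train <- step(lm(y ~ ., data = dat_train), trace = 0)
    # ... then fit the selected formula exactly once on the test set.
    lm_test <- lm(formula(lm_train), data = dat_test)
    summary(lm_test)  # these p-values retain their usual interpretation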

Objective: Prediction

  • Include variables to maximize predictive power; don’t worry about interpretation.

  • This lecture does not cover the case where prediction is the goal.

  • The automatic variable selection procedures described in this lecture are not often used when the goal is prediction.

  • Take the Machine Learning course for details when the goal is prediction.

    • Typical methods include regularization and cross validation.
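  • For flavor only, a lasso fit with a cross-validated penalty via the {glmnet} package might look like this (x is a hypothetical numeric predictor matrix and y a hypothetical response vector):

    library(glmnet)
    cv_fit <- cv.glmnet(x, y, alpha = 1)  # alpha = 1 gives the lasso penalty
    coef(cv_fit, s = "lambda.min")        # coefficients at the CV-chosen penalty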

Step 2: Screen Available Variables

  • Choose a list of explanatory variables that are important to the objective.

  • Screen out redundant variables.

  • Use your domain expertise for screening variables.

  • Note: What variables are important will depend on the question being asked.

  • Example: Researchers were interested in what variables were associated with state average SAT scores. Possible predictors include

    • Takers: Percentage of high school seniors in the state who took the exam.
    • Income: Median income of families of test-takers (hundreds of dollars).
    • Years: Average number of years test-takers had formal studies.
    • Public: Percentage of test-takers who attended public schools.
    • Expend: Total state expenditure on secondary schools, in hundreds of dollars per student.
    • Rank: Median percentile ranking of test-takers within their secondary school classes.
    library(Sleuth3)
    library(tidyverse)  # provides glimpse(), mutate(), ggplot(), etc.
    library(broom)      # provides tidy(), augment(), glance()
    data("case1201")
    sat <- case1201
    glimpse(sat)
    ## Rows: 50
    ## Columns: 8
    ## $ State  <fct> Iowa, SouthDakota, NorthDakota, Kansas, Nebraska, Montana, Minn…
    ## $ SAT    <int> 1088, 1075, 1068, 1045, 1045, 1033, 1028, 1022, 1017, 1011, 100…
    ## $ Takers <int> 3, 2, 3, 5, 5, 8, 7, 4, 5, 10, 5, 4, 9, 8, 7, 3, 6, 16, 19, 11,…
    ## $ Income <int> 326, 264, 317, 338, 293, 263, 343, 333, 328, 304, 358, 295, 330…
    ## $ Years  <dbl> 16.79, 16.07, 16.57, 16.30, 17.25, 15.91, 17.41, 16.57, 16.01, …
    ## $ Public <dbl> 87.8, 86.2, 88.3, 83.9, 83.6, 93.7, 78.3, 75.2, 97.0, 77.3, 74.…
    ## $ Expend <dbl> 25.60, 19.95, 20.62, 27.14, 21.05, 29.48, 24.84, 17.42, 25.96, …
    ## $ Rank   <dbl> 89.7, 90.6, 89.8, 86.3, 88.5, 86.4, 83.4, 85.9, 87.5, 84.2, 85.…
    • Goal 1: Business firm looking for a place to build a new facility. They want to know if SAT scores accurately reflect educational training of the labor market in that state.
      • Only care if SAT score is associated with Rank (the educational training variable) after accounting for selection bias (Takers).
    • Goal 2: Government wants to determine the impact of state expenditures on SAT scores. Then include all variables as possible predictors so we can see what effect expenditure has that cannot be accounted for by other variables.
  • Problems with Including Too Few Variables

    • You are only picking up marginal associations.

    • E.g., we already know that men make more money than women. We want to see if men still make more money than women when we control for other variables.

    • Predictions are less accurate.

    • If you fit a model with only \(X\), then the model only sees the marginal association between \(X\) and \(Y\), not the association after adjusting for the omitted variables.

  • Problems with Including Too Many Variables

    • Harder to estimate more parameters.

    • Formally, the variances of the sampling distributions of the coefficients in the model will get much larger.

    • Including highly correlated explanatory variables will really increase the variance of the sampling distributions of the coefficient estimates.

    • Intuitively, we are less sure whether the association of \(Y\) and \(X_1\) is an actual association or whether it is mediated through \(X_2\).

    • Predictions are less accurate.

  • Demonstration of when we have too many variables

    • True model: \(E(Y|X_1) = X_1\)

    • Fit Model: \(E(Y|X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2X_2\)

    • Correlation between \(X_1\) and \(X_2\) is 0.9994.

    • We will simulate \(Y\) and plot the resulting OLS estimates.

    • Black is truth
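    • A minimal version of this simulation (the sample size, noise level, and number of replications are assumptions):

      # X2 is nearly collinear with X1; the true model is E(Y|X1) = X1, but we
      # fit Y ~ X1 + X2.
      set.seed(1)
      n <- 100
      x1 <- runif(n)
      x2 <- x1 + rnorm(n, sd = 0.005)  # correlation with x1 near 0.999
      beta_mat <- t(replicate(500, {
        y <- x1 + rnorm(n, sd = 0.5)   # simulate Y from the true model
        coef(lm(y ~ x1 + x2))
      }))
      # The estimates are unbiased but wildly variable around the truth
      # (intercept 0, slopes 1 and 0), and strongly negatively dependent.
      apply(beta_mat, 2, sd)
      plot(beta_mat[, "x1"], beta_mat[, "x2"])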

Steps 3 through 5

  1. Exploratory data analysis.

    • Tons of scatterplots.

    • Look at correlation coefficients.

  2. Transformations based on EDA.

  3. Fit a rich model and look at residuals.

    • Look for curvature, non-constant variance, and outliers.
  • Iterate the above steps until you don’t see any issues.

SAT Example

  • Matrix scatterplot of SAT data

    library(GGally)
    ggpairs(sat, columns = 2:8)

  • There is curvature between SAT scores and the percentage of takers, but the variance looks roughly constant, so taking the log of the percentage of takers might remove the curvature.

    sat <- mutate(sat, l_takers = log(Takers))
    ggplot(data = sat, mapping = aes(x = l_takers, y = SAT)) +
      geom_point()

  • There is a huge outlier in expenditures caused by Alaska.

    arrange(sat, desc(Expend)) %>%
      select(State, Expend) %>%
      head()
    ##           State Expend
    ## 1        Alaska  50.10
    ## 2       NewYork  33.58
    ## 3 Massachusetts  31.74
    ## 4        Oregon  30.49
    ## 5       Montana  29.48
    ## 6  Pennsylvania  27.98
  • Let’s fit a rich model to identify any other transformations and other possible issues.

    lm_rich <- lm(SAT ~ l_takers + Income + Years + Public + Expend + Rank, data = sat)
    a_rich <- augment(lm_rich)
    ggplot(data = a_rich, mapping = aes(x = .fitted, y = .resid)) +
      geom_point() +
      geom_hline(yintercept = 0)

    ggplot(data = a_rich, mapping = aes(x = l_takers, y = .resid)) +
      geom_point() +
      geom_hline(yintercept = 0)

    ggplot(data = a_rich, mapping = aes(x = Income, y = .resid)) +
      geom_point() +
      geom_hline(yintercept = 0)

    ggplot(data = a_rich, mapping = aes(x = Years, y = .resid)) +
      geom_point() +
      geom_hline(yintercept = 0)

    ggplot(data = a_rich, mapping = aes(x = Public, y = .resid)) +
      geom_point() +
      geom_hline(yintercept = 0)

    ggplot(data = a_rich, mapping = aes(x = Expend, y = .resid)) +
      geom_point() +
      geom_hline(yintercept = 0)

    ggplot(data = a_rich, mapping = aes(x = Rank, y = .resid)) +
      geom_point() +
      geom_hline(yintercept = 0)

    The only worry is the expenditures residual plot where, again, we see Alaska.

  • In Chapter 10, we will learn about leverage values, and we see that Alaska has a very high Cook’s distance (.cooksd) and a high leverage (.hat).

    a_rich %>%
      bind_cols(select(sat, State)) %>%
      select(State, .cooksd, .hat) %>%
      arrange(desc(.cooksd))
    ## # A tibble: 50 × 3
    ##    State         .cooksd   .hat
    ##    <fct>           <dbl>  <dbl>
    ##  1 Alaska         1.60   0.572 
    ##  2 Louisiana      0.207  0.345 
    ##  3 Mississippi    0.152  0.147 
    ##  4 SouthCarolina  0.146  0.129 
    ##  5 Tennessee      0.143  0.207 
    ##  6 WestVirginia   0.0880 0.169 
    ##  7 NewHampshire   0.0728 0.0913
    ##  8 NorthCarolina  0.0437 0.102 
    ##  9 Indiana        0.0345 0.198 
    ## 10 Illinois       0.0339 0.106 
    ## # … with 40 more rows
    • A high leverage is typically above \(2p/n\), or in this case \(2 \times 7 / 50 = 0.28\).

    • A high Cook’s distance is around or above 1.

  • Let’s remove Alaska. This is probably OK since Alaska uses expenditures on things like long distance travel for teachers, and higher heating bills, which do not have a direct effect on educational attainment. Thus, this expenditure value is likely not representative of our objective.

    sat <- filter(sat, State != "Alaska")
  • We would go through the above steps again to see if everything looks good in our rich model.

  • This is also the stage where we would include quadratic terms, do other variable transformations, etc.

Step 6: Automated Variable Selection

  • If appropriate, use an automatic variable selection technique to choose a suitable subset of explanatory variables.

  • An automatic variable selection technique uses a criterion where either a higher or a lower value (depending on the criterion) indicates a better fit.

  • The computer adds/deletes variables from the model until it reaches a point where it cannot improve the criterion any further.

  • The selected model is considered the “best” one.

  • You need to make sure your “rich” model has no major issues before implementing these approaches.

  • Most statisticians dislike automatic variable selection approaches, since they have bad sampling properties, but they can be useful for

    1. Controlling for many predictors when your objective is the relationship between a couple of key predictors and a response variable, and
    2. Fishing for hypotheses, as long as you don’t take your results too seriously and always hedge when you present results.

Criteria

  • AIC: Akaike’s Information Criterion \[ AIC = n\log\left(\frac{SSE}{n}\right) + 2p \]
    • Lower AIC is better.
    • Fit (in terms of SSE) + penalty on the number of parameters.
    • As \(p\) increases, SSE decreases, but this is possibly offset by the penalty.
    • This is the criterion base R includes by default for stepwise procedures.
  • BIC (SBC): The Bayesian Information Criterion (aka Schwarz’s Bayesian Criterion) penalizes the parameters a little more than AIC. \[ BIC = n\log\left(\frac{SSE}{n}\right) + \log(n)p \]
    • Lower BIC is better.
    • Again, fit + penalty.
    • As \(p\) increases, SSE decreases, but this is possibly offset by the penalty.
    • Better than AIC when you have not been as careful about deleting redundant predictors.
  • Mallows \(C_p\) \[ C_p = p + (n-p)\frac{\hat{\sigma}^2 - \hat{\sigma}^2_{full}}{\hat{\sigma}^2_{full}}, \]
    • \(\hat{\sigma}^2\) is the MSE under the model under consideration.
    • \(\hat{\sigma}^2_{full}\) is the MSE under the model that uses all possible explanatory variables.
    • Lower \(C_p\) is better.
    • Mallows’ \(C_p\) is an estimate of the “total mean squared error,” which is the sum of the squared bias and the variance. Small values of Mallows’ \(C_p\) indicate that the model has both low bias and low variance.
    • If there is no bias, then \(C_p \approx p\) (since \(\hat{\sigma}^2 \approx \hat{\sigma}^2_{full}\)). So all models where \(C_p\) is near \(p\) are considered “good” models.
  • \(R^2\): \[ R^2 = 1 - \frac{SSE}{SSTO} \]
    • Higher \(R^2\) is better (explains more variation).
    • Do not use this one for variable selection.
    • Recall that \(R^2\) will never decrease as we add more predictors, so we cannot use it to compare models with different numbers of predictors.
  • Adjusted \(R_a^2\): Look at whether there is a “plateau” when you add predictors. \[ R^2_a = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)} \]
    • Higher \(R^2_a\) is better (explains more variation).

Automated Procedures

  • “Best” subsets: Look at all possible models given the set of predictors and choose the one with the best criterion (see the sketch after this list).

  • Forward/Backward Selection: Start with a model with no predictors \(Y_i = \beta_0 + \epsilon_i\).

    • Fit all models where you add a predictor. Choose the one that improves the criterion the most.
    • Fit all models where you remove a predictor. Choose the one that improves the criterion the most.
    • Iterate until you cannot add/remove any more predictors.
  • The forward/backward approach is not guaranteed to find the model with the “best” criterion.

  • When you add/subtract a categorical variable, you should add/subtract all of the indicators associated with that categorical variable.

  • Make sure you also include first-order terms if second-order terms are kept in the model.
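  • A sketch of best subsets in R, via regsubsets() from the {leaps} package (package assumed installed), using the SAT data from above:

    library(leaps)
    bs_out <- regsubsets(SAT ~ l_takers + Income + Years + Public + Expend + Rank,
                         data = sat, nvmax = 6)
    bs_sum <- summary(bs_out)
    bs_sum$outmat  # best model of each size; asterisks mark included variables
    # Compare sizes by a criterion (p here counts the intercept as a parameter).
    data.frame(p = 2:7, cp = bs_sum$cp, bic = bs_sum$bic)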

Step 7: Proceed with Caution

  • Proceed with the analysis using the chosen explanatory variables.

  • Evaluate residual plots for your final model. Perform other model checks.

  • Tell stories with the data using \(p\)-values, coefficient estimates, confidence intervals, coefficients of determination, etc.

  • Step 7 is what we’ve been discussing this whole semester.

Implementation in R

  • You can get a version of AIC, a version of BIC, \(R^2\), and \(R_a^2\) via glance() from {broom}:

    lm_rich <- lm(SAT ~ l_takers + Income + Years + Public + Expend + Rank, data = sat)
    glance(lm_rich)  %>%
      select(AIC, BIC, r.squared, adj.r.squared)
    ## # A tibble: 1 × 4
    ##     AIC   BIC r.squared adj.r.squared
    ##   <dbl> <dbl>     <dbl>         <dbl>
    ## 1  453.  468.     0.913         0.900
  • AIC Note:

    • {broom} uses the base AIC() function which returns \[ AIC = n\log\left(\frac{SSE}{n}\right) + n + n\log(2\pi) + 2(p+1) \] (the “\(p+1\)” comes from including estimation of \(\sigma^2\) as a parameter).

    • Use extractAIC() to get the book’s definition of AIC \[ AIC = n\log\left(\frac{SSE}{n}\right) + 2p \]

    • We can verify the equivalence manually

      n <- nrow(sat)
      AIC(lm_rich) - n - n * log(2 * pi) - 2
      ## [1] 311.9
      extractAIC(lm_rich)[[2]]
      ## [1] 311.9
  • BIC Note:

    • Similarly, {broom} uses the AIC() function to get a different version of BIC. \[ BIC = n\log\left(\frac{SSE}{n}\right) + n + n\log(2\pi) + \log(n)(p+1) \]

    • Use extractAIC() to get the book’s definition of BIC \[ BIC = n\log\left(\frac{SSE}{n}\right) + \log(n)p \]

    • For both functions, to get BIC, set the k argument to be \(\log(n)\).

    • We can verify the equivalence manually

      n <- nrow(sat)
      AIC(lm_rich, k = log(n)) - n - n * log(2 * pi) - log(n)
      ## [1] 325.1
      extractAIC(lm_rich, k = log(n))[[2]]
      ## [1] 325.1
  • You can get Mallows’ \(C_p\) by using extractAIC() and specifying the scale argument to be the estimated residual variance under the biggest model.

    sigma_full <- sigma(lm_rich)
    lm_smaller <- lm(SAT ~ l_takers + Years + Expend + Rank, data = sat)
    extractAIC(fit = lm_smaller, scale = sigma_full^2)[[2]]
    ## [1] 4.031

    We can verify this manually

    p <- ncol(model.matrix(lm_smaller))
    n <- nrow(model.matrix(lm_smaller))
    sigma_full <- sigma(lm_rich)
    sigma_reduced <- sigma(lm_smaller)
    p + (n - p) * (sigma_reduced^2 - sigma_full^2) / sigma_full^2
    ## [1] 4.031
  • Use the step() function to choose a model by AIC/BIC via automated search.

    • The object argument is the initial model that it will search from.
    • The scope argument specifies the simplest and the most complicated models possible.
      • If you give it a single formula, this is the most complicated model possible.
      • If you give it a list of formulas, these are the most complicated and simplest models possible.
    • You can use the output of step() like the output of lm() (via tidy(), augment(), etc).
  • To use both a lower and upper bound, do:

    lm_init <- lm(SAT ~ Income + Years, data = sat)
    sout <- step(object = lm_init, scope = list(upper = SAT ~ l_takers + Income + Years + Public + Expend + Rank, lower = SAT ~ Income))
    ## Start:  AIC=393.9
    ## SAT ~ Income + Years
    ## 
    ##            Df Sum of Sq    RSS AIC
    ## + Rank      1     99984  34294 329
    ## + l_takers  1     94604  39674 336
    ## + Public    1     23088 111190 387
    ## <none>                  134278 394
    ## - Years     1      9072 143350 395
    ## + Expend    1        14 134264 396
    ## 
    ## Step:  AIC=329
    ## SAT ~ Income + Years + Rank
    ## 
    ##            Df Sum of Sq    RSS AIC
    ## + Expend    1     10326  23968 313
    ## <none>                   34294 329
    ## + Public    1       759  33535 330
    ## + l_takers  1       412  33882 330
    ## - Years     1     13581  47876 343
    ## - Rank      1     99984 134278 394
    ## 
    ## Step:  AIC=313.4
    ## SAT ~ Income + Years + Rank + Expend
    ## 
    ##            Df Sum of Sq    RSS AIC
    ## + l_takers  1      2552  21417 310
    ## <none>                   23968 313
    ## + Public    1       422  23547 315
    ## - Years     1      7353  31322 325
    ## - Expend    1     10326  34294 329
    ## - Rank      1    110295 134264 396
    ## 
    ## Step:  AIC=309.9
    ## SAT ~ Income + Years + Rank + Expend + l_takers
    ## 
    ##            Df Sum of Sq   RSS AIC
    ## <none>                  21417 310
    ## + Public    1        20 21397 312
    ## - l_takers  1      2552 23968 313
    ## - Years     1      3011 24428 314
    ## - Rank      1      3162 24578 315
    ## - Expend    1     12465 33882 330
    tidy(sout)
    ## # A tibble: 6 × 5
    ##   term        estimate std.error statistic   p.value
    ##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
    ## 1 (Intercept)  291.      256.         1.14 0.261    
    ## 2 Income         0.113     0.113      1.01 0.319    
    ## 3 Years         13.5       5.49       2.46 0.0180   
    ## 4 Rank           5.06      2.01       2.52 0.0155   
    ## 5 Expend         3.87      0.774      5.00 0.0000100
    ## 6 l_takers     -31.2      13.8       -2.26 0.0287
  • To use just an upper bound, do:

    lm_init <- lm(SAT ~ Income + Years, data = sat)
    sout <- step(object = lm_init, scope = SAT ~ l_takers + Income + Years + Public + Expend + Rank)
    ## Start:  AIC=393.9
    ## SAT ~ Income + Years
    ## 
    ##            Df Sum of Sq    RSS AIC
    ## + Rank      1     99984  34294 329
    ## + l_takers  1     94604  39674 336
    ## + Public    1     23088 111190 387
    ## <none>                  134278 394
    ## - Years     1      9072 143350 395
    ## + Expend    1        14 134264 396
    ## - Income    1     84760 219038 416
    ## 
    ## Step:  AIC=329
    ## SAT ~ Income + Years + Rank
    ## 
    ##            Df Sum of Sq    RSS AIC
    ## + Expend    1     10326  23968 313
    ## <none>                   34294 329
    ## + Public    1       759  33535 330
    ## + l_takers  1       412  33882 330
    ## - Income    1      3322  37616 332
    ## - Years     1     13581  47876 343
    ## - Rank      1     99984 134278 394
    ## 
    ## Step:  AIC=313.4
    ## SAT ~ Income + Years + Rank + Expend
    ## 
    ##            Df Sum of Sq    RSS AIC
    ## + l_takers  1      2552  21417 310
    ## <none>                   23968 313
    ## + Public    1       422  23547 315
    ## - Income    1      3048  27016 317
    ## - Years     1      7353  31322 325
    ## - Expend    1     10326  34294 329
    ## - Rank      1    110295 134264 396
    ## 
    ## Step:  AIC=309.9
    ## SAT ~ Income + Years + Rank + Expend + l_takers
    ## 
    ##            Df Sum of Sq   RSS AIC
    ## - Income    1       505 21922 309
    ## <none>                  21417 310
    ## + Public    1        20 21397 312
    ## - l_takers  1      2552 23968 313
    ## - Years     1      3011 24428 314
    ## - Rank      1      3162 24578 315
    ## - Expend    1     12465 33882 330
    ## 
    ## Step:  AIC=309.1
    ## SAT ~ Years + Rank + Expend + l_takers
    ## 
    ##            Df Sum of Sq   RSS AIC
    ## <none>                  21922 309
    ## + Income    1       505 21417 310
    ## + Public    1       185 21737 311
    ## - Rank      1      2676 24598 313
    ## - Years     1      2870 24792 313
    ## - l_takers  1      5094 27016 317
    ## - Expend    1     13620 35542 331
    tidy(sout)
    ## # A tibble: 5 × 5
    ##   term        estimate std.error statistic    p.value
    ##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
    ## 1 (Intercept)   399.     232.         1.72 0.0929    
    ## 2 Years          13.1      5.48       2.40 0.0207    
    ## 3 Rank            4.40     1.90       2.32 0.0252    
    ## 4 Expend          4.00     0.764      5.23 0.00000452
    ## 5 l_takers      -38.1     11.9       -3.20 0.00257
  • Modify the k argument to be \(\log(n)\) to select by BIC (note that step() still labels the criterion “AIC” in its printed output).

    sout <- step(object = lm_init, scope = SAT ~ l_takers + Income + Years + Public + Expend + Rank, k = log(nrow(sat)))
    ## Start:  AIC=399.6
    ## SAT ~ Income + Years
    ## 
    ##            Df Sum of Sq    RSS AIC
    ## + Rank      1     99984  34294 337
    ## + l_takers  1     94604  39674 344
    ## + Public    1     23088 111190 394
    ## - Years     1      9072 143350 399
    ## <none>                  134278 400
    ## + Expend    1        14 134264 403
    ## - Income    1     84760 219038 420
    ## 
    ## Step:  AIC=336.6
    ## SAT ~ Income + Years + Rank
    ## 
    ##            Df Sum of Sq    RSS AIC
    ## + Expend    1     10326  23968 323
    ## <none>                   34294 337
    ## - Income    1      3322  37616 337
    ## + Public    1       759  33535 339
    ## + l_takers  1       412  33882 340
    ## - Years     1     13581  47876 349
    ## - Rank      1     99984 134278 400
    ## 
    ## Step:  AIC=322.9
    ## SAT ~ Income + Years + Rank + Expend
    ## 
    ##            Df Sum of Sq    RSS AIC
    ## + l_takers  1      2552  21417 321
    ## <none>                   23968 323
    ## - Income    1      3048  27016 325
    ## + Public    1       422  23547 326
    ## - Years     1      7353  31322 332
    ## - Expend    1     10326  34294 337
    ## - Rank      1    110295 134264 403
    ## 
    ## Step:  AIC=321.3
    ## SAT ~ Income + Years + Rank + Expend + l_takers
    ## 
    ##            Df Sum of Sq   RSS AIC
    ## - Income    1       505 21922 319
    ## <none>                  21417 321
    ## - l_takers  1      2552 23968 323
    ## - Years     1      3011 24428 324
    ## - Rank      1      3162 24578 324
    ## + Public    1        20 21397 325
    ## - Expend    1     12465 33882 340
    ## 
    ## Step:  AIC=318.5
    ## SAT ~ Years + Rank + Expend + l_takers
    ## 
    ##            Df Sum of Sq   RSS AIC
    ## <none>                  21922 319
    ## - Rank      1      2676 24598 320
    ## - Years     1      2870 24792 321
    ## + Income    1       505 21417 321
    ## + Public    1       185 21737 322
    ## - l_takers  1      5094 27016 325
    ## - Expend    1     13620 35542 338
    tidy(sout)
    ## # A tibble: 5 × 5
    ##   term        estimate std.error statistic    p.value
    ##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
    ## 1 (Intercept)   399.     232.         1.72 0.0929    
    ## 2 Years          13.1      5.48       2.40 0.0207    
    ## 3 Rank            4.40     1.90       2.32 0.0252    
    ## 4 Expend          4.00     0.764      5.23 0.00000452
    ## 5 l_takers      -38.1     11.9       -3.20 0.00257
  • Modify the scale argument to select by \(C_p\).

    lm_rich <- lm(SAT ~ l_takers + Income + Years + Public + Expend + Rank, data = sat)
    sigma_full <- sigma(lm_rich)
    sout <- step(object = lm_init, scale = sigma_full^2, scope = SAT ~ l_takers + Income + Years + Public + Expend + Rank)
    ## Start:  AIC=220.6
    ## SAT ~ Income + Years
    ## 
    ##            Df Sum of Sq    RSS    Cp
    ## + Rank      1     99984  34294  26.3
    ## + l_takers  1     94604  39674  36.9
    ## + Public    1     23088 111190 177.3
    ## <none>                  134278 220.6
    ## + Expend    1        14 134264 222.6
    ## - Years     1      9072 143350 236.4
    ## - Income    1     84760 219038 384.9
    ## 
    ## Step:  AIC=26.32
    ## SAT ~ Income + Years + Rank
    ## 
    ##            Df Sum of Sq    RSS     Cp
    ## + Expend    1     10326  23968   8.05
    ## <none>                   34294  26.32
    ## + Public    1       759  33535  26.83
    ## + l_takers  1       412  33882  27.51
    ## - Income    1      3322  37616  30.84
    ## - Years     1     13581  47876  50.98
    ## - Rank      1     99984 134278 220.58
    ## 
    ## Step:  AIC=8.05
    ## SAT ~ Income + Years + Rank + Expend
    ## 
    ##            Df Sum of Sq    RSS     Cp
    ## + l_takers  1      2552  21417   5.04
    ## <none>                   23968   8.05
    ## + Public    1       422  23547   9.22
    ## - Income    1      3048  27016  12.03
    ## - Years     1      7353  31322  20.48
    ## - Expend    1     10326  34294  26.32
    ## - Rank      1    110295 134264 222.55
    ## 
    ## Step:  AIC=5.04
    ## SAT ~ Income + Years + Rank + Expend + l_takers
    ## 
    ##            Df Sum of Sq   RSS    Cp
    ## - Income    1       505 21922  4.03
    ## <none>                  21417  5.04
    ## + Public    1        20 21397  7.00
    ## - l_takers  1      2552 23968  8.05
    ## - Years     1      3011 24428  8.95
    ## - Rank      1      3162 24578  9.25
    ## - Expend    1     12465 33882 27.51
    ## 
    ## Step:  AIC=4.03
    ## SAT ~ Years + Rank + Expend + l_takers
    ## 
    ##            Df Sum of Sq   RSS    Cp
    ## <none>                  21922  4.03
    ## + Income    1       505 21417  5.04
    ## + Public    1       185 21737  5.67
    ## - Rank      1      2676 24598  7.28
    ## - Years     1      2870 24792  7.67
    ## - l_takers  1      5094 27016 12.03
    ## - Expend    1     13620 35542 28.77
    tidy(sout)
    ## # A tibble: 5 × 5
    ##   term        estimate std.error statistic    p.value
    ##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
    ## 1 (Intercept)   399.     232.         1.72 0.0929    
    ## 2 Years          13.1      5.48       2.40 0.0207    
    ## 3 Rank            4.40     1.90       2.32 0.0252    
    ## 4 Expend          4.00     0.764      5.23 0.00000452
    ## 5 l_takers      -38.1     11.9       -3.20 0.00257

Exercise

The following is from the Statistical Sleuth.

Southwestern U.S. desert wildflower enthusiasts know that a triggering rainfall between late September and early December and regular rains through March often lead to a good wildflower show in the spring. The dataset for this exercise might help with the prediction. It includes monthly rainfalls from September to March and the subjectively rated quality of the following spring wildflower display for each of a number of years at each of four desert locations in the southwestern United States (Upland Sonoran Desert near Tucson, the lower Colorado River Valley section of the Sonoran Desert, the Baja California region of the Sonoran Desert, and the Mojave Desert). The quality of the display was judged subjectively with ordered rating categories of poor, fair, good, great, and spectacular. The column labeled Score is a numerical representation of the rating, with 0 for poor, 1 for fair, 2 for good, 3 for great, and 4 for spectacular. Although this is a made-up number and the suitability of regression for such a discrete response is questionable, an informal regression analysis might nevertheless be helpful for casual prediction. Variables include

  • Year: year of observed wildflower season
  • Region: a factor variable with 4 levels: “baja”, “colorado”, “mojave”, and “upland”
  • Sep: the September rainfall, in inches
  • Oct: the October rainfall, in inches
  • Nov: the November rainfall, in inches
  • Dec: the December rainfall, in inches
  • Jan: the January rainfall, in inches
  • Feb: the February rainfall, in inches
  • Mar: the March rainfall, in inches
  • Total: the total rainfall from September through March, in inches
  • Rating: a factor with a subjective assessment of the quality of wildflower bloom with levels “FAIR”, “GOOD”, “GREAT”, “POOR”, and “SPECTACULAR”
  • Score: a numerical variable corresponding to the order of rating categories, with Poor=0, Fair=1, Good=2, Great=3, and Spectacular=4

You can load these data into R via

library(Sleuth3)
data("ex1221")
flower <- ex1221
glimpse(ex1221)
## Rows: 122
## Columns: 12
## $ Year   <int> 1970, 1971, 1972, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 198…
## $ Region <fct> baja, baja, baja, baja, baja, baja, baja, baja, baja, baja, baj…
## $ Sep    <dbl> 0.27, 1.35, 0.00, 0.00, 0.00, 0.00, 0.55, 0.00, 0.04, 0.02, 0.0…
## $ Oct    <dbl> 0.00, 0.00, 0.01, 0.00, 0.04, 0.00, 0.00, 2.55, 0.40, 0.00, 0.0…
## $ Nov    <dbl> 0.00, 0.00, 0.01, 0.00, 0.00, 0.00, 0.40, 0.00, 1.02, 0.00, 0.0…
## $ Dec    <dbl> 0.00, 0.12, 0.04, 0.00, 0.00, 0.00, 0.00, 0.49, 0.95, 0.04, 0.0…
## $ Jan    <dbl> 0.00, 0.00, 0.00, 0.04, 0.00, 0.00, 0.26, 0.27, 0.86, 0.12, 0.9…
## $ Feb    <dbl> 0.04, 0.04, 0.00, 0.00, 0.00, 0.04, 0.26, 0.87, 0.20, 0.40, 0.0…
## $ Mar    <dbl> 0.12, 0.00, 0.00, 0.00, 0.00, 0.00, 0.37, 0.05, 0.08, 0.00, 0.4…
## $ Total  <dbl> 0.43, 1.51, 0.06, 0.04, 0.04, 0.04, 1.84, 4.23, 3.55, 0.58, 1.4…
## $ Rating <fct> POOR, POOR, POOR, POOR, POOR, POOR, POOR, GOOD, GOOD, POOR, POO…
## $ Score  <int> 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 2, 3, 3, 0, 1, 0, 3, 2, 0, …
  1. Perform an exploratory data analysis on these data. What do you see?
  2. Fit a rich model to these data, determine what variable transformations might be necessary.
  3. Apply these variable transformations until you obtain a rich model that appears to meet the assumptions of the linear model.
  4. Use an automatic variable selection technique to come up with a (or a couple) model(s) from which to choose a final predictive model.
  5. State your final predictive model, and state any conclusions.
  6. What is the predicted quality score for a season with these rainfall amounts in the Upland region: Sep: 0.45, Oct: 0.02, Nov: 0.80, Dec: 0.76, Jan: 0.17, Feb: 1.22, and Mar: 0.37? Appropriately quantify your uncertainty in these predictions.

Although the conditions are far from the ideal normality assumptions for the justification of the prediction interval, the rough prediction interval might be a useful way to clarify the precision with which the quality score can actually be predicted. (Data from the Arizona-Sonora Desert Museum, “Wildflower Flourishes and Flops—a 50-Year History.”)
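As a sketch of how you might compute the prediction interval for part 6, assuming your final fitted model is an lm object called lm_final (hypothetical), and adding any derived predictors (e.g., Total) that your model uses:

new_obs <- data.frame(Region = "upland", Sep = 0.45, Oct = 0.02, Nov = 0.80,
                      Dec = 0.76, Jan = 0.17, Feb = 1.22, Mar = 0.37,
                      Total = 3.79)  # 3.79 is the sum of the monthly rainfalls
predict(lm_final, newdata = new_obs, interval = "prediction")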