You often have many predictor variables.
Include too few and you may fail to control for key variables.
Include too many and your regression coefficient estimates may become unstable.
Model selection: Choosing which variables (and transformations of these variables) to include.
From the Statistical Sleuth:
Goal is to determine the association between a response and some interesting predictors after adjusting for other nuisance predictors.
Example: Is there evidence of an association between salary and sex after adjusting for other, legitimate, determinants of salary?
Generally, a good strategy is to (during steps 6 and 7 from above)
Automatic variable selection techniques destroy the interpretation of \(p\)-values.
Example: Using automatic selection procedures, it was determined to include experience, seniority, and education, but not age in the final model. After adding in sex, the resulting fit to the final model was
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 8.67 0.0955 90.8 9.51e-89
## 2 Senior -0.00435 0.000944 -4.61 1.38e- 5
## 3 Educ 0.0164 0.00448 3.66 4.26e- 4
## 4 Exper 0.000261 0.000107 2.45 1.62e- 2
## 5 SexMale 0.130 0.0214 6.06 3.30e- 8
It is tempting to say that Age is not associated with salary, adjusting for other variables, while the other variables are associated with salary, adjusting for other variables. However, this is wrong.
Since we used an automatic variable selection procedure for Age, Seniority, Education, and Experience, we cannot interpret those coefficients or \(p\)-values.
We can only interpret the coefficient and \(p\)-value for sex, which we did not subject to an automatic variable selection technique.
We estimate that females make about 89% of the salary of males, adjusting for other, legitimate, predictors of base salary. The evidence is strong that this association is not due to chance alone (\(p = 3.3\times 10^{-8}\)).
Technically, we are also adjusting for age in this statement, even though it was not in the model, because it had the chance to be in the model.
You have a response variable and many explanatory variables, and you want to know what variables are possibly associated with your response.
Then iterate through adding/removing variables, making transformations, checking residuals, until you develop a model with significant terms and no major issues.
\(p\)-values/confidence intervals don’t have proper interpretation.
You generally build a model and tell stories with it.
Issues:
This is a strict interpretation, but folks in the real world typically fudge a little bit. How bad this is depends on how much fudging you do. My opinion:
If your dataset is large, then you can split your data into a training set and a test set. You can do whatever variable selection techniques you want on your training set, then fit your final model on your test set to obtain \(p\)-values that have the proper interpretation.
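As a sketch of this split-sample idea (the data frame and variable names here are hypothetical, not from the lecture):

```r
# Hypothetical data frame `dat`; split half/half at random.
set.seed(1)
train_idx <- sample(seq_len(nrow(dat)), size = floor(nrow(dat) / 2))
dat_train <- dat[train_idx, ]   # do all variable selection here
dat_test  <- dat[-train_idx, ]  # refit the chosen model here;
# p-values from the test-set fit retain their usual interpretation
```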
Include variables to maximize predictive power, don’t worry about interpretation.
This lecture does not cover the case where prediction is the goal.
The automatic variable selection procedures described in this lecture are not often used when the goal is prediction.
Take the Machine Learning course for details on model selection when the goal is prediction.
Choose a list of explanatory variables that are important to the objective.
Screen out redundant variables.
Use your domain expertise for screening variables.
Note: What variables are important will depend on the question being asked.
Example: Researchers were interested in what variables were associated with state average SAT scores. Possible predictors include
Takers: Percentage of high school seniors in the state who took the exam.
Income: Median income of families of test-takers (hundreds of dollars).
Years: Average number of years test-takers had formal studies.
Public: Percentage of test-takers who attended public schools.
Expend: Total state expenditure on secondary schools, in hundreds of dollars per student.
Rank: Median percentile ranking of test-takers within their secondary school classes.
library(Sleuth3)
data("case1201")
sat <- case1201
glimpse(sat)
## Rows: 50
## Columns: 8
## $ State <fct> Iowa, SouthDakota, NorthDakota, Kansas, Nebraska, Montana, Minn…
## $ SAT <int> 1088, 1075, 1068, 1045, 1045, 1033, 1028, 1022, 1017, 1011, 100…
## $ Takers <int> 3, 2, 3, 5, 5, 8, 7, 4, 5, 10, 5, 4, 9, 8, 7, 3, 6, 16, 19, 11,…
## $ Income <int> 326, 264, 317, 338, 293, 263, 343, 333, 328, 304, 358, 295, 330…
## $ Years <dbl> 16.79, 16.07, 16.57, 16.30, 17.25, 15.91, 17.41, 16.57, 16.01, …
## $ Public <dbl> 87.8, 86.2, 88.3, 83.9, 83.6, 93.7, 78.3, 75.2, 97.0, 77.3, 74.…
## $ Expend <dbl> 25.60, 19.95, 20.62, 27.14, 21.05, 29.48, 24.84, 17.42, 25.96, …
## $ Rank <dbl> 89.7, 90.6, 89.8, 86.3, 88.5, 86.4, 83.4, 85.9, 87.5, 84.2, 85.…
Of particular interest: Rank (the educational training variable) after accounting for selection bias (Takers).
Problems with Including Too Few Variables
You are only picking up marginal associations.
E.g., we already know that men make more money than women. We want to see if men still make more money than women when we control for other variables.
Predictions are less accurate.
If you fit a model with \(X\), then this is what the model is seeing:
Problems with too many variables
More parameters are harder to estimate.
Formally, the variances of the sampling distributions of the coefficients in the model will get much larger.
Including highly correlated explanatory variables will greatly increase the variance of the sampling distributions of the coefficient estimates.
Intuitively, we are less sure whether the association between \(Y\) and \(X_1\) is due to an actual association or whether it is mediated through \(X_2\).
Predictions are less accurate.
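One common way to quantify this inflation (not covered in the lecture itself) is the variance inflation factor, e.g. via vif() from the {car} package:

```r
# Assumes the SAT data loaded as `sat` (as above); vif() reports how much
# each coefficient's variance is inflated by correlation among predictors.
library(car)
lm_all <- lm(SAT ~ Takers + Income + Years + Public + Expend + Rank, data = sat)
vif(lm_all)  # values much larger than ~5-10 suggest problematic collinearity
```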
Demonstration of what happens when you have too many variables
True model: \(E(Y|X_1) = X_1\)
Fit Model: \(E(Y|X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2X_2\)
Correlation between \(X_1\) and \(X_2\) is 0.9994.
We will simulate \(Y\) and plot the resulting OLS estimates.
Black is truth
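A minimal sketch of such a simulation (the exact setup used for the plot is not shown, so the sample size and noise level below are assumptions):

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.035)  # gives a correlation near 0.9994
y  <- x1 + rnorm(n)              # true model: E(Y | X1) = X1
coef(lm(y ~ x1 + x2))            # estimates of beta1 and beta2 vary wildly
                                 # across simulated data sets
```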
Exploratory data analysis.
Tons of scatterplots.
Look at correlation coefficients.
Transformations based on EDA.
Fit a rich model and look at residuals.
Matrix scatterplot of SAT data
library(GGally)
ggpairs(sat, columns = 2:8)
There is curvature between SAT scores and the percentage of takers, but the variance looks roughly constant, so taking the log of the percentage of takers might help.
sat <- mutate(sat, l_takers = log(Takers))
ggplot(data = sat, mapping = aes(x = l_takers, y = SAT)) +
geom_point()
There is a huge outlier in expenditures caused by Alaska.
arrange(sat, desc(Expend)) %>%
select(State, Expend) %>%
head()
## State Expend
## 1 Alaska 50.10
## 2 NewYork 33.58
## 3 Massachusetts 31.74
## 4 Oregon 30.49
## 5 Montana 29.48
## 6 Pennsylvania 27.98
Let’s fit a rich model to identify any other transformations and other possible issues.
lm_rich <- lm(SAT ~ l_takers + Income + Years + Public + Expend + Rank, data = sat)
a_rich <- augment(lm_rich)
ggplot(data = a_rich, mapping = aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
ggplot(data = a_rich, mapping = aes(x = l_takers, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
ggplot(data = a_rich, mapping = aes(x = Income, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
ggplot(data = a_rich, mapping = aes(x = Years, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
ggplot(data = a_rich, mapping = aes(x = Public, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
ggplot(data = a_rich, mapping = aes(x = Expend, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
ggplot(data = a_rich, mapping = aes(x = Rank, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
The only worry is the expenditures residual plot where, again, we see Alaska.
In Chapter 10, we will learn about leverage values, and we see that Alaska has a very high Cook’s distance (.cooksd) and a high leverage (.hat).
a_rich %>%
bind_cols(select(sat, State)) %>%
select(State, .cooksd, .hat) %>%
arrange(desc(.cooksd))
## # A tibble: 50 × 3
## State .cooksd .hat
## <fct> <dbl> <dbl>
## 1 Alaska 1.60 0.572
## 2 Louisiana 0.207 0.345
## 3 Mississippi 0.152 0.147
## 4 SouthCarolina 0.146 0.129
## 5 Tennessee 0.143 0.207
## 6 WestVirginia 0.0880 0.169
## 7 NewHampshire 0.0728 0.0913
## 8 NorthCarolina 0.0437 0.102
## 9 Indiana 0.0345 0.198
## 10 Illinois 0.0339 0.106
## # … with 40 more rows
A high leverage is typically above \(2p/n\), or in this case \(2 \times 7 / 50 = 0.28\).
A high Cook’s distance is around or above 1.
Let’s remove Alaska. This is probably OK since Alaska uses expenditures on things like long distance travel for teachers, and higher heating bills, which do not have a direct effect on educational attainment. Thus, this expenditure value is likely not representative of our objective.
sat <- filter(sat, State != "Alaska")
We would go through the above steps again to see if everything looks good in our rich model.
This is also the stage where we would include quadratic terms, do other variable transformations, etc.
If appropriate, use an automatic variable selection technique to choose a suitable subset of explanatory variables.
An automatic variable selection technique optimizes a criterion that measures model fit (for some criteria, like \(R_a^2\), higher is better; for others, like AIC/BIC, lower is better).
The computer adds/deletes variables from the model until it can no longer improve the criterion.
The selected model is considered the “best” one.
You need to make sure your “rich” model has no major issues before implementing these approaches.
Most statisticians dislike automatic variable selection approaches, since they have bad sampling properties, but they can be useful to
“Best” subsets: Look at all possible models given the set of predictors, choose the one with the best criterion.
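For example, best subsets can be run with regsubsets() from the {leaps} package (not used elsewhere in this lecture; shown only as a sketch):

```r
library(leaps)
# Search all subsets of the six predictors, up to the full model.
best_sub <- regsubsets(SAT ~ l_takers + Income + Years + Public + Expend + Rank,
                       data = sat, nvmax = 6)
summary(best_sub)$cp  # Mallows' Cp for the best model of each size
```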
Forward selection: Start with a model with no predictors, \(Y_i = \beta_0 + \epsilon_i\), and add variables one at a time. Backward selection starts with the full model and removes variables one at a time.
The forward/backward approach is not guaranteed to find the model with the “best” criterion.
When you add/subtract a categorical variable, you should add/subtract all of the indicators associated with that categorical variable.
Make sure you also include first-order terms if second-order terms are kept in the model.
Proceed with analysis with chosen explanatory variables.
Evaluate residual plots for your final model. Perform other model checks.
Tell stories with the data using \(p\)-values, coefficient estimates, confidence intervals, coefficients of determination, etc…
Step 7 is what we’ve been discussing this whole semester.
You can get a version of AIC, a version of BIC, \(R^2\), and \(R_a^2\) via glance() from {broom}.
lm_rich <- lm(SAT ~ l_takers + Income + Years + Public + Expend + Rank, data = sat)
glance(lm_rich) %>%
select(AIC, BIC, r.squared, adj.r.squared)
## # A tibble: 1 × 4
## AIC BIC r.squared adj.r.squared
## <dbl> <dbl> <dbl> <dbl>
## 1 453. 468. 0.913 0.900
AIC Note: {broom} uses the base AIC() function, which returns \[
AIC = n\log\left(\frac{SSE}{n}\right) + n + n\log(2\pi) + 2(p+1)
\] (the “\(p+1\)” comes from including estimation of \(\sigma^2\) as a parameter).
Use extractAIC() to get the book’s definition of AIC:
\[
AIC = n\log\left(\frac{SSE}{n}\right) + 2p
\]
We can verify the equivalence manually
n <- nrow(sat)
AIC(lm_rich) - n - n * log(2 * pi) - 2
## [1] 311.9
extractAIC(lm_rich)[[2]]
## [1] 311.9
BIC Note: Similarly, {broom} uses the AIC() function to get a different version of BIC: \[
BIC = n\log\left(\frac{SSE}{n}\right) + n + n\log(2\pi) + \log(n)(p+1)
\]
Use extractAIC() to get the book’s definition of BIC:
\[
BIC = n\log\left(\frac{SSE}{n}\right) + \log(n)p
\]
For both functions, to get BIC, set the k argument to be \(\log(n)\).
We can verify the equivalence manually
n <- nrow(sat)
AIC(lm_rich, k = log(n)) - n - n * log(2 * pi) - log(n)
## [1] 325.1
extractAIC(lm_rich, k = log(n))[[2]]
## [1] 325.1
You can get Mallows’ \(C_p\) by using extractAIC() and specifying the scale argument to be the estimated residual variance under the biggest model.
sigma_full <- sigma(lm_rich)
lm_smaller <- lm(SAT ~ l_takers + Years + Expend + Rank, data = sat)
extractAIC(fit = lm_smaller, scale = sigma_full^2)[[2]]
## [1] 4.031
We can verify this manually
p <- ncol(model.matrix(lm_smaller))
n <- nrow(model.matrix(lm_smaller))
sigma_full <- sigma(lm_rich)
sigma_reduced <- sigma(lm_smaller)
p + (n - p) * (sigma_reduced^2 - sigma_full^2) / sigma_full^2
## [1] 4.031
Use the step() function to choose a model by AIC/BIC via automated search.
The object argument is the initial model that it will search from.
The scope argument specifies the simplest and the most complicated models possible.
You can use the output of step() like the output of lm() (via tidy(), augment(), etc.).
To use both a lower and upper bound, do:
lm_init <- lm(SAT ~ Income + Years, data = sat)
sout <- step(object = lm_init, scope = list(upper = SAT ~ l_takers + Income + Years + Public + Expend + Rank, lower = SAT ~ Income))
## Start: AIC=393.9
## SAT ~ Income + Years
##
## Df Sum of Sq RSS AIC
## + Rank 1 99984 34294 329
## + l_takers 1 94604 39674 336
## + Public 1 23088 111190 387
## <none> 134278 394
## - Years 1 9072 143350 395
## + Expend 1 14 134264 396
##
## Step: AIC=329
## SAT ~ Income + Years + Rank
##
## Df Sum of Sq RSS AIC
## + Expend 1 10326 23968 313
## <none> 34294 329
## + Public 1 759 33535 330
## + l_takers 1 412 33882 330
## - Years 1 13581 47876 343
## - Rank 1 99984 134278 394
##
## Step: AIC=313.4
## SAT ~ Income + Years + Rank + Expend
##
## Df Sum of Sq RSS AIC
## + l_takers 1 2552 21417 310
## <none> 23968 313
## + Public 1 422 23547 315
## - Years 1 7353 31322 325
## - Expend 1 10326 34294 329
## - Rank 1 110295 134264 396
##
## Step: AIC=309.9
## SAT ~ Income + Years + Rank + Expend + l_takers
##
## Df Sum of Sq RSS AIC
## <none> 21417 310
## + Public 1 20 21397 312
## - l_takers 1 2552 23968 313
## - Years 1 3011 24428 314
## - Rank 1 3162 24578 315
## - Expend 1 12465 33882 330
tidy(sout)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 291. 256. 1.14 0.261
## 2 Income 0.113 0.113 1.01 0.319
## 3 Years 13.5 5.49 2.46 0.0180
## 4 Rank 5.06 2.01 2.52 0.0155
## 5 Expend 3.87 0.774 5.00 0.0000100
## 6 l_takers -31.2 13.8 -2.26 0.0287
To use just an upper bound, do:
lm_init <- lm(SAT ~ Income + Years, data = sat)
sout <- step(object = lm_init, scope = SAT ~ l_takers + Income + Years + Public + Expend + Rank)
## Start: AIC=393.9
## SAT ~ Income + Years
##
## Df Sum of Sq RSS AIC
## + Rank 1 99984 34294 329
## + l_takers 1 94604 39674 336
## + Public 1 23088 111190 387
## <none> 134278 394
## - Years 1 9072 143350 395
## + Expend 1 14 134264 396
## - Income 1 84760 219038 416
##
## Step: AIC=329
## SAT ~ Income + Years + Rank
##
## Df Sum of Sq RSS AIC
## + Expend 1 10326 23968 313
## <none> 34294 329
## + Public 1 759 33535 330
## + l_takers 1 412 33882 330
## - Income 1 3322 37616 332
## - Years 1 13581 47876 343
## - Rank 1 99984 134278 394
##
## Step: AIC=313.4
## SAT ~ Income + Years + Rank + Expend
##
## Df Sum of Sq RSS AIC
## + l_takers 1 2552 21417 310
## <none> 23968 313
## + Public 1 422 23547 315
## - Income 1 3048 27016 317
## - Years 1 7353 31322 325
## - Expend 1 10326 34294 329
## - Rank 1 110295 134264 396
##
## Step: AIC=309.9
## SAT ~ Income + Years + Rank + Expend + l_takers
##
## Df Sum of Sq RSS AIC
## - Income 1 505 21922 309
## <none> 21417 310
## + Public 1 20 21397 312
## - l_takers 1 2552 23968 313
## - Years 1 3011 24428 314
## - Rank 1 3162 24578 315
## - Expend 1 12465 33882 330
##
## Step: AIC=309.1
## SAT ~ Years + Rank + Expend + l_takers
##
## Df Sum of Sq RSS AIC
## <none> 21922 309
## + Income 1 505 21417 310
## + Public 1 185 21737 311
## - Rank 1 2676 24598 313
## - Years 1 2870 24792 313
## - l_takers 1 5094 27016 317
## - Expend 1 13620 35542 331
tidy(sout)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 399. 232. 1.72 0.0929
## 2 Years 13.1 5.48 2.40 0.0207
## 3 Rank 4.40 1.90 2.32 0.0252
## 4 Expend 4.00 0.764 5.23 0.00000452
## 5 l_takers -38.1 11.9 -3.20 0.00257
Modify the k argument to be \(\log(n)\) to fit by BIC:
sout <- step(object = lm_init, scope = SAT ~ l_takers + Income + Years + Public + Expend + Rank, k = log(nrow(sat)))
## Start: AIC=399.6
## SAT ~ Income + Years
##
## Df Sum of Sq RSS AIC
## + Rank 1 99984 34294 337
## + l_takers 1 94604 39674 344
## + Public 1 23088 111190 394
## - Years 1 9072 143350 399
## <none> 134278 400
## + Expend 1 14 134264 403
## - Income 1 84760 219038 420
##
## Step: AIC=336.6
## SAT ~ Income + Years + Rank
##
## Df Sum of Sq RSS AIC
## + Expend 1 10326 23968 323
## <none> 34294 337
## - Income 1 3322 37616 337
## + Public 1 759 33535 339
## + l_takers 1 412 33882 340
## - Years 1 13581 47876 349
## - Rank 1 99984 134278 400
##
## Step: AIC=322.9
## SAT ~ Income + Years + Rank + Expend
##
## Df Sum of Sq RSS AIC
## + l_takers 1 2552 21417 321
## <none> 23968 323
## - Income 1 3048 27016 325
## + Public 1 422 23547 326
## - Years 1 7353 31322 332
## - Expend 1 10326 34294 337
## - Rank 1 110295 134264 403
##
## Step: AIC=321.3
## SAT ~ Income + Years + Rank + Expend + l_takers
##
## Df Sum of Sq RSS AIC
## - Income 1 505 21922 319
## <none> 21417 321
## - l_takers 1 2552 23968 323
## - Years 1 3011 24428 324
## - Rank 1 3162 24578 324
## + Public 1 20 21397 325
## - Expend 1 12465 33882 340
##
## Step: AIC=318.5
## SAT ~ Years + Rank + Expend + l_takers
##
## Df Sum of Sq RSS AIC
## <none> 21922 319
## - Rank 1 2676 24598 320
## - Years 1 2870 24792 321
## + Income 1 505 21417 321
## + Public 1 185 21737 322
## - l_takers 1 5094 27016 325
## - Expend 1 13620 35542 338
tidy(sout)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 399. 232. 1.72 0.0929
## 2 Years 13.1 5.48 2.40 0.0207
## 3 Rank 4.40 1.90 2.32 0.0252
## 4 Expend 4.00 0.764 5.23 0.00000452
## 5 l_takers -38.1 11.9 -3.20 0.00257
Modify the scale argument to fit by \(C_p\):
lm_rich <- lm(SAT ~ l_takers + Income + Years + Public + Expend + Rank, data = sat)
sigma_full <- sigma(lm_rich)
sout <- step(object = lm_init, scale = sigma_full^2, scope = SAT ~ l_takers + Income + Years + Public + Expend + Rank)
## Start: AIC=220.6
## SAT ~ Income + Years
##
## Df Sum of Sq RSS Cp
## + Rank 1 99984 34294 26.3
## + l_takers 1 94604 39674 36.9
## + Public 1 23088 111190 177.3
## <none> 134278 220.6
## + Expend 1 14 134264 222.6
## - Years 1 9072 143350 236.4
## - Income 1 84760 219038 384.9
##
## Step: AIC=26.32
## SAT ~ Income + Years + Rank
##
## Df Sum of Sq RSS Cp
## + Expend 1 10326 23968 8.05
## <none> 34294 26.32
## + Public 1 759 33535 26.83
## + l_takers 1 412 33882 27.51
## - Income 1 3322 37616 30.84
## - Years 1 13581 47876 50.98
## - Rank 1 99984 134278 220.58
##
## Step: AIC=8.05
## SAT ~ Income + Years + Rank + Expend
##
## Df Sum of Sq RSS Cp
## + l_takers 1 2552 21417 5.04
## <none> 23968 8.05
## + Public 1 422 23547 9.22
## - Income 1 3048 27016 12.03
## - Years 1 7353 31322 20.48
## - Expend 1 10326 34294 26.32
## - Rank 1 110295 134264 222.55
##
## Step: AIC=5.04
## SAT ~ Income + Years + Rank + Expend + l_takers
##
## Df Sum of Sq RSS Cp
## - Income 1 505 21922 4.03
## <none> 21417 5.04
## + Public 1 20 21397 7.00
## - l_takers 1 2552 23968 8.05
## - Years 1 3011 24428 8.95
## - Rank 1 3162 24578 9.25
## - Expend 1 12465 33882 27.51
##
## Step: AIC=4.03
## SAT ~ Years + Rank + Expend + l_takers
##
## Df Sum of Sq RSS Cp
## <none> 21922 4.03
## + Income 1 505 21417 5.04
## + Public 1 185 21737 5.67
## - Rank 1 2676 24598 7.28
## - Years 1 2870 24792 7.67
## - l_takers 1 5094 27016 12.03
## - Expend 1 13620 35542 28.77
tidy(sout)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 399. 232. 1.72 0.0929
## 2 Years 13.1 5.48 2.40 0.0207
## 3 Rank 4.40 1.90 2.32 0.0252
## 4 Expend 4.00 0.764 5.23 0.00000452
## 5 l_takers -38.1 11.9 -3.20 0.00257
The following is from the Statistical Sleuth.
Southwestern U.S. desert wildflower enthusiasts know that a
triggering rainfall between late September and early December and
regular rains through March often lead to a good wildflower show in the
spring. The dataset for this exercise might help with the prediction. It
includes monthly rainfalls from September to March and the subjectively
rated quality of the following spring wildflower display for each of a
number of years at each of four desert locations in the southwestern
United States (Upland Sonoran Desert near Tucson, the lower Colorado
River Valley section of the Sonoran Desert, the Baja California region
of the Sonoran Desert, and the Mojave Desert). The quality of the
display was judged subjectively with ordered rating categories of poor,
fair, good, great, and spectacular. The column labeled
Score
is a numerical representation of the rating, with
0
for poor, 1
for fair, 2
for
good, 3
for great, and 4
for spectacular.
Although this is a made-up number and the suitability of regression for
such a discrete response is questionable, an informal regression
analysis might nevertheless be helpful for casual prediction. Variables
include
Year: year of observed wildflower season
Region: a factor variable with 4 levels: “baja”, “colorado”, “mojave”, and “upland”
Sep: the September rainfall, in inches
Oct: the October rainfall, in inches
Nov: the November rainfall, in inches
Dec: the December rainfall, in inches
Jan: the January rainfall, in inches
Feb: the February rainfall, in inches
Mar: the March rainfall, in inches
Total: the total rainfall from September through March, in inches
Rating: a factor with a subjective assessment of the quality of wildflower bloom with levels “FAIR”, “GOOD”, “GREAT”, “POOR”, and “SPECTACULAR”
Score: a numerical variable corresponding to the order of rating categories, with Poor=0, Fair=1, Good=2, Great=3, and Spectacular=4
You can load these data into R via
library(Sleuth3)
data("ex1221")
flower <- ex1221
glimpse(ex1221)
## Rows: 122
## Columns: 12
## $ Year <int> 1970, 1971, 1972, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 198…
## $ Region <fct> baja, baja, baja, baja, baja, baja, baja, baja, baja, baja, baj…
## $ Sep <dbl> 0.27, 1.35, 0.00, 0.00, 0.00, 0.00, 0.55, 0.00, 0.04, 0.02, 0.0…
## $ Oct <dbl> 0.00, 0.00, 0.01, 0.00, 0.04, 0.00, 0.00, 2.55, 0.40, 0.00, 0.0…
## $ Nov <dbl> 0.00, 0.00, 0.01, 0.00, 0.00, 0.00, 0.40, 0.00, 1.02, 0.00, 0.0…
## $ Dec <dbl> 0.00, 0.12, 0.04, 0.00, 0.00, 0.00, 0.00, 0.49, 0.95, 0.04, 0.0…
## $ Jan <dbl> 0.00, 0.00, 0.00, 0.04, 0.00, 0.00, 0.26, 0.27, 0.86, 0.12, 0.9…
## $ Feb <dbl> 0.04, 0.04, 0.00, 0.00, 0.00, 0.04, 0.26, 0.87, 0.20, 0.40, 0.0…
## $ Mar <dbl> 0.12, 0.00, 0.00, 0.00, 0.00, 0.00, 0.37, 0.05, 0.08, 0.00, 0.4…
## $ Total <dbl> 0.43, 1.51, 0.06, 0.04, 0.04, 0.04, 1.84, 4.23, 3.55, 0.58, 1.4…
## $ Rating <fct> POOR, POOR, POOR, POOR, POOR, POOR, POOR, GOOD, GOOD, POOR, POO…
## $ Score <int> 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 2, 3, 3, 0, 1, 0, 3, 2, 0, …
Although the conditions are far from the ideal normality assumptions for the justification of the prediction interval, a rough prediction interval might be a useful way to clarify the precision with which the quality score can actually be predicted. (Data from Arizona-Sonora Desert Museum, “Wildflower Flourishes and Flops—a 50-Year History.”)