Recall our model for simple linear regression: \[\begin{align} Y_i &= \beta_0 + \beta_1 X_i + \epsilon_i\\ \epsilon_i &\overset{i.i.d.}{\sim} N(0, \sigma^2) \end{align}\]
A common task is to test the following hypotheses: \[\begin{align} H_0&: \beta_1 = 0\\ H_A&: \beta_1 \neq 0 \end{align}\]
Overview of the strategy of hypothesis testing:
Exercise: What is wrong with the statement: “We test against the null of \(H_0: \hat{\beta}_1 = 0\).”
Exercise: What is wrong with the following interpretation of the \(p\)-value: “The \(p\)-value is the probability that the null hypothesis is true, so small \(p\)-values suggest we reject the null hypothesis.”
Exercise: What is wrong with the following: “The \(p\)-value is large, so we accept the null hypothesis”.
The sampling distribution of \(\hat{\beta}_1\) is \[ \hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}\right). \]
We will prove the result for the mean. If you get a PhD in Statistics, you go through these types of calculations all the time.
Note that \(\hat{\beta}_1 = \sum_{i=1}^nk_iY_i\) where \(k_i = \frac{X_i - \bar{X}}{\sum_{i=1}^n(X_i - \bar{X})^2}\).
Property from theory: Any linear combination of independent normal random variables is also normal.
So we just need the mean and variance of \(\hat{\beta}_1\) to know its distribution.
\[\begin{align} E[\hat{\beta}_1] &= E\left[\sum_{i=1}^nk_iY_i\right]\\ &=\sum_{i=1}^nk_iE[Y_i]\\ &=\sum_{i=1}^nk_i(\beta_0 + \beta_1 X_i)\\ &= \beta_0\sum_{i=1}^nk_i + \beta_1\sum_{i=1}^nk_iX_i. \end{align}\]
We will now prove that \(\sum_{i=1}^nk_i = 0\) and \(\sum_{i=1}^nk_iX_i = 1\).
\[\begin{align} \sum_{i=1}^nk_i &= \frac{\sum_{i=1}^n(X_i - \bar{X})}{\sum_{i=1}^n(X_i - \bar{X})^2}\\ &= \frac{\sum_{i=1}^nX_i - \sum_{i=1}^n\bar{X}}{\sum_{i=1}^n(X_i - \bar{X})^2}\\ &= \frac{n\bar{X} - n\bar{X}}{\sum_{i=1}^n(X_i - \bar{X})^2}\\ &= 0. \end{align}\]
\[\begin{align} \sum_{i=1}^nk_iX_i &= \frac{\sum_{i=1}^nX_i(X_i - \bar{X})}{\sum_{i=1}^n(X_i - \bar{X})^2}\\ &= \frac{\sum_{i=1}^nX_i^2 - \bar{X}\sum_{i=1}^nX_i}{\sum_{i=1}^n(X_i - \bar{X})^2}\\ &= \frac{\sum_{i=1}^nX_i^2 - n\bar{X}^2}{\sum_{i=1}^n(X_i - \bar{X})^2}\\ &= \frac{\sum_{i=1}^nX_i^2 - 2n\bar{X}^2 + n\bar{X}^2}{\sum_{i=1}^n(X_i - \bar{X})^2}\\ &= \frac{\sum_{i=1}^nX_i^2 - 2\bar{X}\sum_{i=1}^nX_i + n\bar{X}^2}{\sum_{i=1}^n(X_i - \bar{X})^2}\\ &= \frac{\sum_{i=1}^n(X_i^2 - 2\bar{X}X_i + \bar{X}^2)}{\sum_{i=1}^n(X_i - \bar{X})^2}\\ &= \frac{\sum_{i=1}^n(X_i - \bar{X})^2}{\sum_{i=1}^n(X_i - \bar{X})^2}\\ &= 1 \end{align}\]
Putting this together, we have \[ E[\hat{\beta}_1] = \beta_0\sum_{i=1}^nk_i + \beta_1\sum_{i=1}^nk_iX_i = \beta_0 \times 0 + \beta_1 \times 1 = \beta_1. \]
The proof that \(var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}\) is similar.
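If you want a numerical sanity check of these results, here is a small simulation sketch (the values of \(\beta_0\), \(\beta_1\), \(\sigma\), and the \(X_i\)'s are made up, and the \(X_i\)'s are held fixed across repeated samples):
set.seed(1)
n <- 30
x <- runif(n, min = 0, max = 10) # predictors, held fixed across samples
beta0 <- 2
beta1 <- 0.5
sigma <- 3
b1hat <- replicate(5000, {
  y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)
  coef(lm(y ~ x))[[2]]
})
mean(b1hat) # should be close to beta1 = 0.5
var(b1hat) # should be close to the theoretical variance below
sigma^2 / sum((x - mean(x))^2)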
Exercise: What happens to the variance of \(\hat{\beta}_1\) when the \(X_i\)’s are more spread out?
The standard deviation of the sampling distribution of an estimator is called the standard error.
This quantity is important for describing how certain we are about an estimate.
The standard error of \(\hat{\beta}_1\) is \(\sqrt{\frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}}\).
We don’t know \(\sigma^2\), but we have an estimator for it, the mean squared error (MSE).
\[ MSE = \frac{1}{n-2} \sum_{i=1}^n(Y_i - \hat{Y}_i)^2 \]
So the estimated standard error of \(\hat{\beta}_1\) is \[ s(\hat{\beta}_1) = \sqrt{\frac{MSE}{\sum_{i=1}^n (X_i - \bar{X})^2}} \]
We usually just say “standard error”, even when we are talking about the estimated standard error.
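As a sanity check of these formulas, here is a sketch that computes the MSE and \(s(\hat{\beta}_1)\) by hand for a regression of mpg on drat from the built-in mtcars data (the same regression used in exercises below), and compares them to the standard errors that R reports:
lm_mt <- lm(mpg ~ drat, data = mtcars)
mse <- sum(residuals(lm_mt)^2) / (nrow(mtcars) - 2)
se_b1 <- sqrt(mse / sum((mtcars$drat - mean(mtcars$drat))^2))
se_b1
summary(lm_mt)$coefficients # compare to the "Std. Error" entry for drat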
Our goal when testing against \(H_0: \beta_1 = 0\) is to find a statistic that (i) can provide evidence against \(H_0\) based on its value and (ii) we know the distribution of under \(H_0\).
From statistical theory, we have that \[ \frac{\hat{\beta}_1 - \beta_1}{s(\hat{\beta}_1)} \sim t_{n-2}. \]
Thus, if \(H_0: \beta_1 = 0\) were true, we would have that \[ t^* = \frac{\hat{\beta}_1}{s(\hat{\beta}_1)} \sim t_{n-2} \] But if \(H_A\) were true, then \(t^*\) would not follow a \(t_{n-2}\) distribution.
We can compare \(t^*\) to a \(t_{n-2}\) distribution to see how extreme it is.
\[ p\text{-value} = 2 \times \text{pt}(-|t^*|, n-2). \]
R will automatically calculate \(t\)-values and \(p\)-values, so we rarely need to make these calculations by hand.
We first fit the linear model using lm() as before, saving the output to a variable.
library(tidyverse)
library(broom)
hibbs <- read_csv("https://dcgerard.github.io/stat_415_615/data/hibbs.csv")
lmhibbs <- lm(vote ~ growth, data = hibbs)
We then again use the tidy() function from the {broom} package.
t_hibbs <- tidy(lmhibbs)
The statistic variable contains the \(t\)-statistics, while the p.value variable contains the \(p\)-value for the test that the parameter equals 0.
t_hibbs
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 46.2 1.62 28.5 8.41e-14
## 2 growth 3.06 0.696 4.40 6.10e- 4
In the above output, the \(t\)-statistic against the null of \(\beta_1 = 0\) is 4.4.
So in the above output, the \(p\)-value against the null of \(\beta_1 = 0\) is 6.1e-04 \(= 6.1 \times 10^{-4} = 0.00061\).
We can verify that the \(p\)-value can be derived directly from the \(t\)-statistic.
n <- nrow(hibbs)
2 * pt(q = -abs(4.396), df = n - 2)
## [1] 0.0006095
Exercise: Fill in the missing values (the NA's) from the following R output:
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -7.52 NA -1.37 0.180
## 2 drat 7.68 1.51 NA 0.0000178
Exercise: Verify your answer to the previous exercise by fitting a regression of mpg on drat from the mtcars dataset.
Remember how to interpret confidence intervals: the interval (not the parameter) is random, and in repeated samples about \(100(1-\alpha)\%\) of such intervals will contain the true parameter.
Let’s construct a \((1-\alpha)\) confidence interval (typically \(\alpha = 0.05\) so we would construct a 95% confidence interval).
Let \[\begin{align} \ell &= \text{qt}(\alpha/2, n-2)\\ u &= \text{qt}(1 - \alpha/2, n-2) \end{align}\] for some \(0 < \alpha < 1\).
Note that \(\ell = -u\) by the symmetry of the \(t\) distribution.
alpha <- 0.05
n <- 10
qt(alpha / 2, n - 2)
## [1] -2.306
qt(1 - alpha / 2, n - 2)
## [1] 2.306
Since \(t^* = \frac{\hat{\beta}_1 - \beta_1}{s(\hat{\beta}_1)} \sim t_{n-2}\), we have that \[ Pr\left(-u \leq \frac{\hat{\beta}_1 - \beta_1}{s(\hat{\beta}_1)} \leq u \right) = 1 - \alpha \]
Thus, we have \[\begin{align} &Pr\left(-u \leq \frac{\hat{\beta}_1 - \beta_1}{s(\hat{\beta}_1)} \leq u \right) = 1 - \alpha\\ &\Leftrightarrow Pr\left(-u s(\hat{\beta}_1) \leq \hat{\beta}_1 - \beta_1 \leq u s(\hat{\beta}_1) \right) = 1 - \alpha\\ &\Leftrightarrow Pr\left(-u s(\hat{\beta}_1) - \hat{\beta}_1\leq -\beta_1 \leq u s(\hat{\beta}_1) - \hat{\beta}_1 \right) = 1 - \alpha\\ &\Leftrightarrow Pr\left(\hat{\beta}_1 + u s(\hat{\beta}_1)\geq \beta_1 \geq \hat{\beta}_1 - u s(\hat{\beta}_1) \right) = 1 - \alpha\\ \end{align}\]
Thus \(\hat{\beta}_1 \pm u s(\hat{\beta}_1)\), where \(u = \text{qt}(1 - \alpha/2, n-2)\) is a \((1-\alpha)\) confidence interval for \(\beta_1\).
Intervals that we calculate in this class will always be of the form \[ \text{estimate} \pm \text{multiplier} \times \text{standard error} \]
The multiplier here is qt(p = 1 - alpha / 2, df = n - 2).
Again, it is the interval that is random (and varies across repeated samples), not the parameter.
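For example, here is a sketch that builds the 95% confidence interval for \(\beta_1\) by hand for the Hibbs fit from above and compares it to base R's confint():
t_hibbs <- tidy(lmhibbs)
est <- t_hibbs$estimate[[2]]
se <- t_hibbs$std.error[[2]]
mult <- qt(p = 0.975, df = nrow(hibbs) - 2)
c(est - mult * se, est + mult * se) # estimate +/- multiplier * standard error
confint(lmhibbs, parm = "growth", level = 0.95) # should match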
R will return confidence intervals if you ask, so you typically do not need to do these calculations in practice.
First, use lm() to fit the linear model.
hibbs <- read_csv("https://dcgerard.github.io/stat_415_615/data/hibbs.csv")
lmhibbs <- lm(vote ~ growth, data = hibbs)
Use the tidy() function from the {broom} package, but use the conf.int = TRUE argument. The confidence interval will be in the conf.low and conf.high variables.
t_hibbs <- tidy(lmhibbs, conf.int = TRUE)
select(t_hibbs, term, estimate, p.value, conf.low, conf.high)
## # A tibble: 2 × 5
## term estimate p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 46.2 8.41e-14 42.8 49.7
## 2 growth 3.06 6.10e- 4 1.57 4.55
You can change the level using the conf.level argument. E.g., to calculate a 99% confidence interval you would do
t_hibbs <- tidy(lmhibbs, conf.int = TRUE, conf.level = 0.99)
select(t_hibbs, term, estimate, p.value, conf.low, conf.high)
## # A tibble: 2 × 5
## term estimate p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 46.2 8.41e-14 41.4 51.1
## 2 growth 3.06 6.10e- 4 0.988 5.13
A full interpretation might go something like this:
We estimate that incumbents have, on average, 3.1% higher vote share in years that experience 1% more growth (95% confidence interval: 1.6% to 4.6% higher). We have strong evidence that this association is not due to chance alone (\(p = 0.00061\), \(n = 16\)).
Exercise: Make a plot of qt(1 - alpha, 10) from alpha = 0.01 to alpha = 0.1. What does the plot tell you about the width of confidence intervals as alpha increases?
Exercise: From the mtcars dataset, obtain the confidence interval for the slope parameter from a regression of mpg on drat.
Always provide \(p\)-values. Never state that results are “significant” or “not significant” without reporting a \(p\)-value.
Always report effect size (i.e. slope) estimates with their corresponding 95% confidence intervals.
Always report the sample size.
Almost never use 1-sided tests (folks will look at you suspiciously).
Don’t interpret \(p\)-values too discretely. E.g., \(p = 0.049\) is not meaningfully stronger evidence than \(p = 0.051\).
Rules of thumb:
We derived the sampling distribution of \(\hat{\beta}_1\) using normality. But this is not that important as long as (i) you have a large sample size or (ii) the data aren’t too skewed.
The sampling distribution assumes the \(X_i\)’s don’t change in repeated samples, but this only shows up as important if you are literally trying to reproduce a study.
Inference on \(\beta_0\) works the same way (its row in the R output is labeled “(Intercept)”).
Beyond the coefficients, there are three other intervals we will want:
A confidence interval for the mean response at a given predictor level \(E[Y_i|X_i]\).
A prediction interval for a single observation at a given predictor level.
A confidence band to capture the entire regression line.
We will go through these three intervals in turn.
Recall \(E[Y_i|X_i] = \beta_0 + \beta_1 X_i\)
Let \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\).
Since \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are random (and having sampling distributions), that means that \(\hat{Y}_i\) is random (and has a sampling distribution).
\[ \hat{Y}_i \sim N\left(\beta_0 + \beta_1 X_i, \sigma^2\left[\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^n(X_i - \bar{X})^2}\right]\right) \]
\(\hat{Y}_i\) is an unbiased estimator of \(\beta_0 + \beta_1 X_i\).
The estimated standard error is \(s(\hat{Y}_i)\) where \[ s^2(\hat{Y}_i) = \text{MSE}\left[\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^n(X_i - \bar{X})^2}\right] \]
Exercise: What value of \(X_i\) results in the smallest standard error?
From stat theory, we have that \[ \frac{\hat{Y}_i - E[Y_i|X_i]}{s(\hat{Y}_i)} \sim t_{n-2} \]
We can use this to obtain a confidence interval for \(E[Y_i|X_i]\). \[ \hat{Y}_i \pm \text{qt}(1 - \alpha / 2, n-2)s(\hat{Y}_i) \]
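As a sketch, here is that interval computed by hand for the Hibbs fit at a growth value of 1.2 (the same value used in the R demonstration below):
x_new <- 1.2
n <- nrow(hibbs)
mse <- sum(residuals(lmhibbs)^2) / (n - 2)
yhat <- coef(lmhibbs)[[1]] + coef(lmhibbs)[[2]] * x_new
se_yhat <- sqrt(mse * (1 / n + (x_new - mean(hibbs$growth))^2 / sum((hibbs$growth - mean(hibbs$growth))^2)))
yhat + c(-1, 1) * qt(0.975, df = n - 2) * se_yhat # lower and upper bounds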
Graphic of interpretation:
This confidence interval only applies when a single mean response is to be estimated.
These are called point-wise confidence intervals because they provide confidence intervals for the mean at a single \(X_i\).
If you use geom_smooth() from {ggplot2}, the resulting confidence intervals are all point-wise.
Exercise: Write a paragraph on the strategy you would use to test \(H_0: E[Y_i|X_i] = \mu\) for some prespecified \(\mu\).
Take the output of lm()
lmhibbs <- lm(vote ~ growth, data = hibbs)
Create a new data frame that contains the desired level(s) of \(X_i\).
newdf <- data.frame(growth = c(1.2, 2.4))
Use predict() with the interval = "confidence" argument to obtain confidence intervals. They are in the lwr and upr columns.
predict(object = lmhibbs, newdata = newdf, interval = "confidence") %>%
cbind(newdf)
## fit lwr upr growth
## 1 49.92 47.65 52.19 1.2
## 2 53.59 51.44 55.75 2.4
Change the level using the level argument.
predict(object = lmhibbs, newdata = newdf, interval = "confidence", level = 0.99) %>%
cbind(newdf)
## fit lwr upr growth
## 1 49.92 46.77 53.07 1.2
## 2 53.59 50.60 56.58 2.4
What if, instead of estimating the mean at a given \(X_i\), we want to estimate the value of a single observational/experimental unit at \(X_i\)?
Prediction: Estimating the value of a single experimental/observational unit.
So “estimation” is reserved for parameters, and “prediction” is reserved for unit values.
In linear regression, the predicted value at \(X_i\) is the same as the estimated mean at \(X_i\).
\[ \hat{Y}_{i(new)} = \hat{\beta}_0 + \hat{\beta}_1X_{i(new)} \]
The difference between prediction and estimation for linear regression, then, is in intervals.
A \((1-\alpha)\) prediction interval captures \((1-\alpha)\) new values, over both repeated samples and repeated values.
It's easier to begin thinking about prediction intervals as if \(\beta_0\), \(\beta_1\), and \(\sigma^2\) were known. If we knew these values, then \[ Y_{i(new)} \sim N(\beta_0 + \beta_1X_{i(new)}, \sigma^2), \] so we could capture \((1-\alpha)\) of new values with \(\beta_0 + \beta_1X_{i(new)} \pm \text{qnorm}(1 - \alpha/2)\,\sigma\).
But \(\beta_0\), \(\beta_1\), and \(\sigma^2\) are not known. So if we just added and subtracted, say, \(2\sqrt{MSE}\) from the estimated regression line, we could make some big mistakes if our estimate of the regression line was way off.
Use the following as the prediction variance: \[\begin{align} \sigma^2(\text{pred}) &= \text{var}(\hat{\beta}_0 + \hat{\beta}_1X_i) + \text{var}(\epsilon_i)\\ &= \text{var}(\hat{\beta}_0 + \hat{\beta}_1X_i) + \sigma^2\\ \end{align}\]
We can estimate \(\sigma^2\) with the MSE, and we can estimate \(\text{var}(\hat{\beta}_0 + \hat{\beta}_1X_i)\) with the standard error of the estimated mean response at \(X_{i(new)}\), \(s^2(\hat{Y}_i)\).
Doing this, we obtain the following estimated prediction variance: \[ s^2(\text{pred}) = s^2(\hat{Y}_i) + MSE. \]
KEY IDEA: Variance of prediction = variance of estimated mean + error variance.
The corresponding prediction interval is \[ \hat{\beta}_0 + \hat{\beta}_1X_{i(new)} \pm \text{qt}(1 - \alpha/2, n-2)s(\text{pred}) \]
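As a sketch, we can build this interval by hand at growth = 1.2, using the se.fit output of predict() for \(s(\hat{Y}_i)\); it should match the interval = "prediction" output shown below:
newdf <- data.frame(growth = 1.2)
pout <- predict(lmhibbs, newdata = newdf, se.fit = TRUE)
mse <- sum(residuals(lmhibbs)^2) / (nrow(hibbs) - 2)
s_pred <- sqrt(pout$se.fit^2 + mse) # s(pred): sqrt of (variance of estimated mean + error variance)
pout$fit + c(-1, 1) * qt(0.975, df = nrow(hibbs) - 2) * s_pred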
Graphic for interpretation:
Exercise: Which is bigger, the prediction interval or the confidence interval for the mean, at a given \(X_i\)? Why?
The steps are exactly the same as for the confidence intervals for the mean, except we use the interval = "prediction" argument in predict().
lmhibbs <- lm(vote ~ growth, data = hibbs)
newdf <- data.frame(growth = c(1.2, 2.4))
predict(object = lmhibbs, newdata = newdf, interval = "prediction") %>%
cbind(newdf)
## fit lwr upr growth
## 1 49.92 41.54 58.31 1.2
## 2 53.59 45.24 61.95 2.4
Change the level using the level argument.
predict(object = lmhibbs, newdata = newdf, interval = "prediction", level = 0.99) %>%
cbind(newdf)
## fit lwr upr growth
## 1 49.92 38.28 61.56 1.2
## 2 53.59 42.00 65.19 2.4
A \((1 - \alpha)\) confidence band is a procedure that produces a region that captures the entire regression line in \((1 - \alpha)\) of repeated samples.
We won't go into how this works, but the formula for such a band is \[ \hat{\beta}_0 + \hat{\beta}_1X_i \pm Ws(\hat{Y}_i), \text{ where } W^2 = 2F(1 - \alpha; 2, n-2). \]
Graphic for interpretation:
There is no base R function that I can find that does this, but here is one if you want to use it:
#' Working-Hotelling bands for simple linear regression.
#'
#' Intervals of the form "fit +/- w * standard-error", where w^2 is
#' found by \code{p * qf(level, p, n - p)}.
#'
#' @param object An object of class "lm".
#' @param newdata A data frame containing the new data.
#' @param level The confidence level of the band.
#'
#' @author David Gerard
whbands <- function(object, newdata, level = 0.95) {
stopifnot(inherits(object, "lm"))
stopifnot(inherits(newdata, "data.frame"))
stopifnot(is.numeric(level), length(level) == 1)
pout <- stats::predict(object = object,
newdata = newdata,
se.fit = TRUE,
interval = "none")
n <- nrow(stats::model.matrix(object))
p <- ncol(stats::model.matrix(object))
w <- sqrt(p * stats::qf(p = level, df1 = p, df2 = n - p))
lwr <- pout$fit - w * pout$se.fit
upr <- pout$fit + w * pout$se.fit
pout$fit <- cbind(fit = pout$fit, lwr = lwr, upr = upr)
return(pout)
}
Applying this function to the Hibbs data
lmhibbs <- lm(vote ~ growth, data = hibbs)
newdf <- data.frame(growth = seq(from = min(hibbs$growth),
to = max(hibbs$growth),
length.out = 100))
whfit <- whbands(object = lmhibbs, newdata = newdf)
whfit$fit %>%
cbind(newdf) ->
newdf
ggplot() +
geom_point(data = hibbs, mapping = aes(x = growth, y = vote)) +
geom_line(data = newdf, mapping = aes(x = growth, y = lwr)) +
geom_line(data = newdf, mapping = aes(x = growth, y = upr))
“wh” stands for “Working-Hotelling”.
In this section, we will talk about a different strategy for testing \(H_0: \beta_1 = 0\).
In simple linear regression, this results in the exact same \(p\)-value as the test that uses the \(t\)-statistic.
However, this strategy is more applicable to general linear hypotheses that we’ll discuss in multiple linear regression.
Testing \(H_0: \beta_1 = 0\) is really a comparison between the two models: \[\begin{align} H_0: Y_i &= \beta_0 + \epsilon_i\\ H_A: Y_i &= \beta_0 + \beta_1 X_i + \epsilon_i \end{align}\]
We will call the first model the reduced model and the second model the full model. This is because the reduced model is a subset of the full model (you get the reduced from the full by setting \(\beta_1 = 0\)).
Our strategy will be to compare the residuals under \(H_0\) and \(H_A\). If \(H_A\) were true, we would expect the residuals under \(H_A\) to be much smaller than the residuals under \(H_0\) (because the line fits a lot better).
If \(H_0\) were true, then we would expect the residuals under \(H_A\) to only be a little bit smaller than those under \(H_0\).
We fit \(H_A\) by the method of least squares, obtaining the OLS estimates and the corresponding residuals.
We fit \(H_0\) also by least squares. It turns out that under \(H_0\), the OLS estimate is just \(\bar{Y}\).
We measure how small the residuals are by the sum of squared residuals.
Sum of squares of reduced model
\[ SSE(R) = \sum_{i=1}^n(Y_i - \bar{Y})^2 \]
Sum of squares of full model
\[ SSE(F) = \sum_{i=1}^n(Y_i - \hat{Y}_i)^2 \]
NOTE: \(SSE(F)\) will always be less than or equal to \(SSE(R)\).
Under the null that the reduced model is true, we have that the following statistic has a sampling distribution that is \(F(1, n-2)\) \[ F^* = \frac{[SSE(R) - SSE(F)]/(df_R - df_F)}{SSE(F)/df_F} \]
\(df_R = n - 1\): sample size of \(n\) minus 1 for the single parameter we estimate in the reduced model.
\(df_F = n - 2\): sample size of \(n\) minus 2 for the two parameters we estimate in the full model.
If \(H_A\) were true, then we would expect \(F^*\) to be large, because this would increase the size of \(SSE(R) - SSE(F)\). So we only compare \(F^*\) to the right tail of \(F(1, n-2)\) distribution.
Let’s verify this result manually:
lmhibbs <- lm(vote ~ growth, data = hibbs)
aout <- augment(lmhibbs)
sse_full <- sum(aout$.resid^2)
ybar <- mean(hibbs$vote)
sse_reduced <- sum((aout$vote - ybar)^2)
n <- nrow(hibbs)
df_r <- n - 1
df_f <- n - 2
f_star <- ((sse_reduced - sse_full) / (df_r - df_f)) / (sse_full / df_f)
f_star
## [1] 19.32
pf(q = f_star, df1 = 1, df2 = n - 2, lower.tail = FALSE)
## [1] 0.00061
This is the same as the \(p\)-value from the output of tidy().
tidy(lmhibbs)$p.value[[2]]
## [1] 0.00061
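Equivalently, we can fit the reduced model explicitly and compare the two fits with the two-model form of anova(), which carries out the same \(F\)-test (a sketch; the \(F\)-statistic and \(p\)-value should match f_star from above):
lm_reduced <- lm(vote ~ 1, data = hibbs) # intercept-only (reduced) model
anova(lm_reduced, lmhibbs)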
Confusingly, statisticians have alternative terms for these sums of squares that you might see.
The \(SSE(F)\) is sometimes called the \(SSE\) (sum of squares error).
\[ SSE = \sum_{i=1}^n(Y_i - \hat{Y}_i)^2 \]
The \(SSE(R)\) is sometimes called the \(SSTO\) (sum of squares total).
\[ SSTO = \sum_{i=1}^n(Y_i - \bar{Y})^2 \]
The difference \(SSE(R) - SSE(F)\) is sometimes called \(SSR\) (sum of squares regression).
\[ SSR = \sum_{i=1}^n(\hat{Y}_i - \bar{Y})^2 \]
We have the result that \[ SSTO = SSR + SSE \]
Degrees of freedom are how many units of independent information we have.
\(SSTO\) has \(n - 1\) degrees of freedom, called \(df_{TO}\) (\(n\) observations minus 1 parameter that we estimate).
\(SSE\) has \(n-2\) degrees of freedom, called \(df_{E}\) (\(n\) observations minus 2 parameters that we estimate).
\(SSR\) has \(1\) degree of freedom, called \(df_{R}\) (\(\hat{Y}\) used 2 parameters, while \(\bar{Y}\) only used 1 parameter, so the difference is 1).
It is always the case that \(df_{TO} = df_E + df_R\).
A sum of squares divided by its degrees of freedom is called a mean square.
The mean squared error is \[ MSE = SSE / df_E = SSE / (n-2) \]
The mean square regression is \[ MSR = SSR / df_R = SSR / 1 \]
We can show that this is an equivalent definition of \(F^*\). \[ F^* = \frac{MSR}{MSE} = \frac{SSR / df_R}{SSE / df_E} = \frac{(SSTO - SSE) / (df_{TO} - df_{E})}{SSE / df_E} \]
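Continuing the manual Hibbs calculations from above, here is a sketch checking that the sums of squares partition and that \(MSR/MSE\) reproduces f_star:
ssr <- sum((aout$.fitted - ybar)^2)
ssr + sse_full # should equal sse_reduced (i.e. SSTO = SSR + SSE)
sse_reduced
(ssr / 1) / (sse_full / (n - 2)) # MSR / MSE, same as f_star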
Because you can (i) “partition” the total sum of squares into the regression sum of squares and the error sum of squares, and (ii) “partition” the total degrees of freedom into the error degrees of freedom and the regression degrees of freedom, some folks represent regression fits in terms of an ANOVA (analysis of variance) table.
It’s called “analysis of variance” because we look at the sum of squares and how they partition between different models.
We often present the results of an ANOVA in a table, which can contain some or all of the following elements (usually some components are missing):
| Source | SS | df | MS | \(p\)-value |
|---|---|---|---|---|
| Regression | \(SSR = \sum_{i=1}^n(\hat{Y}_i - \bar{Y})^2\) | \(df_R = p - 1\) | \(MSR\) | pf(\(F^*\), \(df_R\), \(df_E\), lower.tail = FALSE) |
| Error | \(SSE = \sum_{i=1}^n(Y_i - \hat{Y}_i)^2\) | \(df_E = n - p\) | \(MSE\) | |
| Total | \(SSTO = \sum_{i=1}^n(Y_i - \bar{Y})^2\) | \(df_{TO} = n - 1\) | | |
Here \(p\) is the number of regression parameters (so \(p = 2\) in simple linear regression, since we estimate \(\beta_0\) and \(\beta_1\), giving \(df_R = 1\) and \(df_E = n - 2\)).
To calculate the ANOVA table in R, use the Anova() function from the {car} package. You just pass the lm object to the Anova() function.
library(car)
Anova(mod = lmhibbs)
## Anova Table (Type II tests)
##
## Response: vote
## Sum Sq Df F value Pr(>F)
## growth 274 1 19.3 0.00061
## Residuals 198 14
Exercise: From the output of Anova(), label each of the following (which can be found in the above table): \(df_R\), \(df_E\), \(SSE\), \(SSR\).
Exercise: Using the above ANOVA table, derive the MSE (the estimated variance).
There is also an anova() function in R, and it produces the same results for simple linear regression, but for multiple linear regression it computes sequential (Type I) sums of squares, which are usually not what you want.
We can use the linearHypothesis() function from the {car} package to test a general linear hypothesis.
linearHypothesis(model = lmhibbs, "growth = 0")
## Linear hypothesis test
##
## Hypothesis:
## growth = 0
##
## Model 1: restricted model
## Model 2: vote ~ growth
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 15 472
## 2 14 198 1 274 19.3 0.00061
Exercise: From the output of linearHypothesis(), label each of the following (which can be found in the above table): \(df_R\), \(df_E\), \(df_{TO}\), \(SSE\), \(SSR\), \(SSTO\).
You can use linearHypothesis() to test for other values of \(\beta_1\). E.g., \(H_0: \beta_1 = 2\) versus \(H_A: \beta_1 \neq 2\).
linearHypothesis(model = lmhibbs, hypothesis.matrix = "growth = 2")
## Linear hypothesis test
##
## Hypothesis:
## growth = 2
##
## Model 1: restricted model
## Model 2: vote ~ growth
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 15 231
## 2 14 198 1 32.9 2.32 0.15
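Because this hypothesis has a single degree of freedom, the \(F\)-statistic above is just the square of the corresponding \(t\)-statistic, \((\hat{\beta}_1 - 2)/s(\hat{\beta}_1)\). Here is a sketch verifying that from the tidy() output:
t_hibbs <- tidy(lmhibbs)
tstar <- (t_hibbs$estimate[[2]] - 2) / t_hibbs$std.error[[2]]
tstar^2 # should match the F value above
2 * pt(-abs(tstar), df = nrow(hibbs) - 2) # should match Pr(>F) above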
\(\frac{1}{n-2}SSE\) is the estimated variance under the linear regression model.
\(\frac{1}{n-1}SSTO\) is the estimated variance under the model \(Y_i = \beta_0 + \epsilon_i\) (no linear association between \(X\) and \(Y\)).
A common measure of the strength of the linear association between \(X\) and \(Y\) is the coefficient of determination \[ R^2 = 1 - \frac{SSE}{SSTO} \]
If \(SSE\) is much smaller than \(SSTO\), then the residuals in the linear model are much smaller than the residuals in the reduced model, and \(R^2\) is close to 1.
If \(SSE\) is about the same as \(SSTO\), then the residuals in the linear model are about the same as the residuals in the reduced model, and \(R^2\) is close to 0.
You can interpret \(R^2\) as the proportionate reduction of total variation associated with the use of the predictor variable \(X\).
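As a sketch, we can compute \(R^2\) for the Hibbs fit directly from the sums of squares and compare it to the glance() output below:
sse <- sum(residuals(lmhibbs)^2)
ssto <- sum((hibbs$vote - mean(hibbs$vote))^2)
1 - sse / ssto # should match r.squared from glance()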
Properties:
Important notes:
You obtain the \(R^2\) from the glance() function from the {broom} package.
glance(lmhibbs)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.580 0.550 3.76 19.3 0.000610 1 -42.8 91.7 94.0
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Different \(R^2\) values