This function will perform a variant of Removing Unwanted Variation 4-step (RUV4) (Gagnon-Bartsch et al, 2013), where the control genes are used not only to estimate the hidden confounders but also to estimate a variance inflation parameter. This variance inflation step is akin to the "empirical null" approach of Efron (2004).
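
For orientation, a minimal call might look like the sketch below (Y, X, and ctl are placeholders as described under Arguments; a complete simulated workflow is given in Examples).

## Minimal sketch of a typical call (Y, X, and ctl assumed to exist already).
fit <- vruv4(Y = Y, X = X, ctl = ctl, k = 2, cov_of_interest = 2)
fit$multiplier     # estimated variance inflation parameter
head(fit$pvalues)  # calibrated p-values for the covariate of interest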

vruv4(
  Y,
  X,
  ctl,
  k = NULL,
  cov_of_interest = ncol(X),
  likelihood = c("t", "normal"),
  limmashrink = TRUE,
  degrees_freedom = NULL,
  include_intercept = TRUE,
  gls = TRUE,
  fa_func = pca_naive,
  fa_args = list(),
  adjust_bias = FALSE
)

Arguments

Y

A matrix of numerics. These are the response variables where each column has its own variance. In a gene expression study, the rows are the individuals and the columns are the genes.

X

A matrix of numerics. The covariates of interest.

ctl

A vector of logicals of length ncol(Y). If position i is TRUE then position i is considered a negative control.

k

A non-negative integer. The number of unobserved confounders. If not specified and the R package sva is installed, then this function will estimate the number of hidden confounders using the parallel analysis method of Buja and Eyuboglu (1992).
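
When k = NULL, a rough stand-alone equivalent of that default is the sketch below (assuming the sva package is installed; sva::num.sv() expects a genes-by-samples matrix, hence the transpose).

## Hand-rolled sketch of the default: parallel analysis ("be" = Buja and
## Eyuboglu, 1992) applied to the transposed expression matrix.
k_guess <- sva::num.sv(dat = t(Y), mod = X, method = "be")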

cov_of_interest

A vector of positive integers. The column numbers of the covariates in X whose coefficients you are interested in. The rest are considered nuisance parameters and are regressed out by OLS.

likelihood

Either "normal" or "t". If likelihood = "t", then the user may provide the degrees of freedom via degrees_freedom.

limmashrink

A logical. Should we apply hierarchical shrinkage to the variances (TRUE) or not (FALSE)? If degrees_freedom = NULL, limmashrink = TRUE, and likelihood = "t", then we'll also use the limma-returned degrees of freedom.
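
The shrinkage is in the spirit of limma's empirical Bayes variance moderation; a rough stand-alone sketch (not the exact internal call, and sigma2_raw is a hypothetical vector of per-gene residual variances) is:

## Hypothetical sketch of limma-style variance shrinkage.
sq <- limma::squeezeVar(var = sigma2_raw, df = nrow(Y) - ncol(X) - k)
sq$var.post   # shrunken per-gene variances
sq$df.prior   # prior degrees of freedom contributed by the shrinkage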

degrees_freedom

If likelihood = "t", then this is the user-defined degrees of freedom for that distribution. If degrees_freedom is NULL, then the degrees of freedom will be the sample size minus the number of covariates minus k.

include_intercept

A logical. If TRUE, then it will check X to see if it has an intercept term. If not, then it will add an intercept term. If FALSE, then X will be unchanged.

gls

A logical. Should we use generalized least squares (TRUE) or ordinary least squares (FALSE) for estimating the confounders? The OLS version is equivalent to using RUV4 to estimate the confounders.
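
For intuition, the two options differ only in whether the control-gene columns are weighted by their inverse variances when regressing on the estimated factor loadings. A rough sketch (alpha_hat and sig_diag are hypothetical factor analysis outputs, not objects returned by this function):

## Hypothetical GLS step: regress the control-gene columns of Y on the
## control-gene columns of alpha_hat, weighting by the inverse variances.
W        <- diag(1 / sig_diag[ctl])
a_ctl    <- alpha_hat[, ctl, drop = FALSE]
Zhat_gls <- Y[, ctl] %*% W %*% t(a_ctl) %*% solve(a_ctl %*% W %*% t(a_ctl))
## Setting gls = FALSE drops the weights W, recovering the RUV4-style estimate.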

fa_func

A factor analysis function. The function must have as inputs a numeric matrix Y and a rank (numeric scalar) r. It must output numeric matrices alpha and Z and a numeric vector sig_diag. alpha is the estimate of the coefficients of the unobserved confounders, so it must be an r by ncol(Y) matrix. Z must be an r by nrow(Y) matrix. sig_diag is the estimate of the column-wise variances so it must be of length ncol(Y). The default is the function pca_naive that just uses the first r singular vectors as the estimate of alpha. The estimated variances are just the column-wise mean square.
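
A minimal custom factor analysis following the interface described above might look like the sketch below (a truncated-SVD toy, not the package's pca_naive; see the customFA vignette referenced in Details for the authoritative interface).

## Toy factor analysis via a truncated SVD, returning components with the
## dimensions documented above (alpha: r x ncol(Y), Z: r x nrow(Y),
## sig_diag: length ncol(Y)).
fa_svd <- function(Y, r) {
  s        <- svd(Y, nu = r, nv = r)
  Z        <- t(s$u %*% diag(s$d[1:r], nrow = r))  # r x nrow(Y)
  alpha    <- t(s$v)                               # r x ncol(Y)
  sig_diag <- colMeans((Y - t(Z) %*% alpha)^2)     # per-column residual variance
  list(alpha = alpha, Z = Z, sig_diag = sig_diag)
}
## It could then be supplied via fa_func = fa_svd, fa_args = list().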

fa_args

A list. Additional arguments you want to pass to fa_func.

adjust_bias

A logical. Should we also use the control genes to adjust for bias (TRUE) or not (FALSE)?

Value

A list whose elements are:

multiplier A numeric. The estimated variance inflation parameter.

betahat_ols A vector of numerics. The ordinary least squares estimates of the coefficients of the covariate of interest, computed without including the estimated confounding variables.

sebetahat_ols A vector of positive numerics. The pre-inflation standard errors of ruv$betahat (NOT ruv$betahat_ols).

betahat A matrix of numerics. The ordinary least squares estimates of the coefficients of the covariate of interest WHEN YOU ALSO INCLUDE THE ESTIMATES OF THE UNOBSERVED CONFOUNDERS.

sebetahat A matrix of positive numerics. This is equal to sebetahat_ols * sqrt(multiplier). These are the post-inflation, adjusted standard errors for ruv$betahat.

tstats A vector of numerics. The t-statistics for testing the null hypothesis that the coefficient of the covariate of interest is zero. These are computed after estimating the variance inflation parameter.

pvalues A vector of numerics. The p-values of said test above.

alphahat A matrix of numerics. The estimates of the coefficients of the hidden confounders. Only identified up to a rotation on the rowspace.

Zhat A matrix of numerics. The estimates of the confounding variables. Only identified up to a rotation on the columnspace.

sigma2 A vector of positive numerics. The estimates of the variances PRIOR to inflation.

sigma2_adjusted A vector of positive numerics. The estimates of the variances AFTER inflation. This is equal to sigma2 * multiplier.

mult_matrix A matrix of numerics. Equal to solve(t(R22) %*% R22). One multiplies sigma2 or sigma2_adjusted by the diagonal elements of mult_matrix to get the variances (squared standard errors) of betahat; see the consistency-check sketch after this list.

degrees_freedom The degrees of freedom. If likelihood = "t", then this was the degrees of freedom used.

R22 A matrix of numerics. This is the submatrix of the R in the QR decomposition of X that corresponds to the covariates of interest. This is mostly returned for debugging purposes and may be removed in the future.

Z2 A matrix of numerics. This is the estimated confounders (after a rotation). Not useful on its own; it is mostly returned for debugging purposes and may be removed in the future.
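
As a rough consistency check of how these pieces fit together (a sketch using the ruvout object created in Examples below, which has a single covariate of interest so mult_matrix is 1 x 1):

## sebetahat should be reproducible from sigma2_adjusted and mult_matrix,
## and the t-statistics from betahat and sebetahat.
se_manual <- sqrt(ruvout$sigma2_adjusted * diag(ruvout$mult_matrix))
all.equal(c(ruvout$sebetahat), c(se_manual))
all.equal(c(ruvout$tstats), c(ruvout$betahat) / c(ruvout$sebetahat))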

Details

The model is $$Y = XB + ZA + E,$$ where \(Y\) is a matrix of responses (e.g. log-transformed gene expression levels), \(X\) is a matrix of covariates, \(B\) is a matrix of coefficients, \(Z\) is a matrix of unobserved confounders, \(A\) is a matrix of unobserved coefficients of the unobserved confounders, and \(E\) is the noise matrix where the elements are independent Gaussian and each column shares a common variance. The rows of \(Y\) are the observations (e.g. individuals) and the columns of \(Y\) are the response variables (e.g. genes).

This model is fit using a two-step approach proposed in Gagnon-Bartsch et al (2013) and described in Wang et al (2017), modified to include estimating a variance inflation parameter. An additional modification is the option to use a t-likelihood in the second step of the procedure, improving robustness to model misspecification.
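
Conceptually, the inflation step asks how overdispersed the control genes' pre-inflation test statistics are relative to their nominal null distribution. A very rough empirical-null style check (not the estimator actually used, which accounts for the design and, optionally, the t-likelihood) is:

## Rough empirical-null check (ruvout and ctl as in Examples below): squared
## z-scores of the control genes should average somewhere near the multiplier.
z_ctl <- c(ruvout$betahat)[ctl] / c(ruvout$sebetahat_ols)[ctl]
mean(z_ctl^2)        # crude inflation estimate from the controls
ruvout$multiplier    # estimate returned by vruv4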

For instructions and examples on how to specify your own factor analysis, run the following code in R: utils::vignette("customFA", package = "vicar"). If it doesn't work, then you probably haven't built the vignettes. To do so, see https://github.com/dcgerard/vicar#vignettes.

References

  • Buja, A. and Eyuboglu, N., 1992. "Remarks on parallel analysis." Multivariate behavioral research, 27(4), pp.509-540. doi: 10.1207/s15327906mbr2704_2

  • Efron, B., 2004. "Large-scale simultaneous hypothesis testing: the choice of a null hypothesis." Journal of the American Statistical Association, 99(465), pp.96-104. doi: 10.1198/016214504000000089

  • Gagnon-Bartsch, J., Laurent Jacob, and Terence P. Speed, 2013. "Removing unwanted variation from high dimensional data with negative controls." Berkeley: Department of Statistics. University of California. https://statistics.berkeley.edu/tech-reports/820

  • Gerard, David, and Matthew Stephens. 2021. "Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls." Statistica Sinica, 31(3), 1145-1166. doi: 10.5705/ss.202018.0345

  • Wang, Jingshu, Qingyuan Zhao, Trevor Hastie, and Art B. Owen. 2017. "Confounder adjustment in multiple hypothesis testing." The Annals of Statistics 45, no. 5: 1863-1894. doi: 10.1214/16-AOS1511

See also

ruv3 for a version of RUV4 that can also be considered a version of RUV2.

RUV4 For the version of RUV4 in the ruv package.

cate For the version of RUV4 in the cate package.

Author

David Gerard

Examples

library(vicar)

## Generate data and controls ---------------------------------------------
set.seed(1327)
n <- 13
p <- 201
k <- 2
q <- 3
is_null <- rep(FALSE, length = p)
is_null[1:101] <- TRUE
ctl <- rep(FALSE, length = p)
ctl[1:73] <- TRUE
X <- matrix(stats::rnorm(n * q), nrow = n)
B <- matrix(stats::rnorm(q * p), nrow = q)
B[2, is_null] <- 0
Z <- X %*% matrix(stats::rnorm(q * k), nrow = q) + matrix(rnorm(n * k), nrow = n)
A <- matrix(stats::rnorm(k * p), nrow = k)
E <- matrix(stats::rnorm(n * p, sd = 1 / 2), nrow = n)
Y <- X %*% B + Z %*% A + E

## Fit RUV4 assuming only second covariate is of interest -----------------
ruvout <- vruv4(Y = Y, X = X, ctl = ctl, k = k,
                include_intercept = FALSE, likelihood = "normal",
                cov_of_interest = 2)
graphics::plot(ruvout$pvalue, col = is_null + 3, ylab = "pvalues")
graphics::legend("topleft", col = 3:4, legend = c("non-null", "null"), pch = 1)
## should look uniform
graphics::hist(ruvout$pvalue[is_null], xlab = "pvalues", main = NULL)

## Compare to linear model ------------------------------------------------
lmout <- coefficients(summary(stats::lm(Y ~ X)))
lmp <- sapply(lmout, function(x) { x[3, 4] })
graphics::plot(lmp, col = is_null + 3, ylab = "pvalues")
graphics::legend("topleft", col = 3:4, legend = c("non-null", "null"), pch = 1)
## should look uniform
graphics::hist(lmp[is_null], xlab = "pvalues", main = NULL)

## Other ways to fit RUV4 from various packages ----------------------------
ruv4out <- cate::cate.fit(Y = Y, X.primary = X[, 2, drop = FALSE],
                          X.nuis = X[, -2, drop = FALSE], r = k,
                          fa.method = "pc", adj.method = "nc", nc = ctl,
                          calibrate = FALSE, nc.var.correction = FALSE)
vruv4out <- vruv4(Y = Y, X = X, k = k, cov_of_interest = 2,
                  include_intercept = FALSE, ctl = ctl,
                  limmashrink = FALSE, likelihood = "normal")
vruv4ols <- vruv4(Y = Y, X = X, k = k, cov_of_interest = 2,
                  include_intercept = FALSE, ctl = ctl,
                  limmashrink = FALSE, likelihood = "normal", gls = FALSE)
ruv4ols <- ruv::RUV4(Y = Y, X = X[, 2, drop = FALSE],
                     Z = X[, -2, drop = FALSE], ctl = ctl, k = k)

ruv4p <- ruv4out$beta.p.value
vruv4p <- vruv4out$pvalues
ruv4olsp <- ruv4ols$p
vruv4olsp <- vruv4ols$pvalues

## These should be monotonically related
## They often aren't because CATE often returns p-values that are exactly 0.
graphics::plot(ruv4p, vruv4p)
cor(c(ruv4p), c(vruv4p), method = "kendall")
#> [1] 0.9883652

## These should be monotonically related
graphics::plot(ruv4olsp, vruv4olsp)
cor(c(ruv4olsp), c(vruv4olsp), method = "kendall")
#> [1] 0.738408