This function will perform a variant of Removing Unwanted Variation 4-step (RUV4) (Gagnon-Bartsch et al, 2013), where the control genes are used not only to estimate the hidden confounders but also to estimate a variance inflation parameter. This variance inflation step is akin to the "empirical null" approach of Efron (2004).
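
For orientation, a minimal call might look like the sketch below (Y, X, and ctl are placeholders as described under Arguments; a complete simulated workflow is given in Examples).

## Minimal sketch of a typical call (Y, X, and ctl assumed to exist already).
fit <- vruv4(Y = Y, X = X, ctl = ctl, k = 2, cov_of_interest = 2)
fit$multiplier     # estimated variance inflation parameter
head(fit$pvalues)  # calibrated p-values for the covariate of interest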

vruv4(
  Y,
  X,
  ctl,
  k = NULL,
  cov_of_interest = ncol(X),
  likelihood = c("t", "normal"),
  limmashrink = TRUE,
  degrees_freedom = NULL,
  include_intercept = TRUE,
  gls = TRUE,
  fa_func = pca_naive,
  fa_args = list(),
  adjust_bias = FALSE
)

Arguments

Y

A matrix of numerics. These are the response variables where each column has its own variance. In a gene expression study, the rows are the individuals and the columns are the genes.

X

A matrix of numerics. The covariates of interest.

ctl

A vector of logicals of length ncol(Y). If position i is TRUE then position i is considered a negative control.

k

A non-negative integer. The number of unobserved confounders. If not specified and the R package sva is installed, then this function will estimate the number of hidden confounders using the parallel analysis method of Buja and Eyuboglu (1992).
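
When k = NULL, a rough stand-alone equivalent of that default is the sketch below (assuming the sva package is installed; sva::num.sv() expects a genes-by-samples matrix, hence the transpose).

## Hand-rolled sketch of the default: parallel analysis ("be" = Buja and
## Eyuboglu, 1992) applied to the transposed expression matrix.
k_guess <- sva::num.sv(dat = t(Y), mod = X, method = "be")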

cov_of_interest

A vector of positive integers. The column numbers of the covariates in X whose coefficients you are interested in. The rest are considered nuisance parameters and are regressed out by OLS.

likelihood

Either "normal" or "t". If likelihood = "t", then the user may provide the degrees of freedom via degrees_freedom.

limmashrink

A logical. Should we apply hierarchical shrinkage to the variances (TRUE) or not (FALSE)? If degrees_freedom = NULL, limmashrink = TRUE, and likelihood = "t", then we'll also use the limma-returned degrees of freedom.
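
The shrinkage is in the spirit of limma's empirical Bayes variance moderation; a rough stand-alone sketch (not the exact internal call, and sigma2_raw is a hypothetical vector of per-gene residual variances) is:

## Hypothetical sketch of limma-style variance shrinkage.
sq <- limma::squeezeVar(var = sigma2_raw, df = nrow(Y) - ncol(X) - k)
sq$var.post   # shrunken per-gene variances
sq$df.prior   # prior degrees of freedom contributed by the shrinkage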

degrees_freedom

If likelihood = "t", then this is the user-defined degrees of freedom for that distribution. If degrees_freedom is NULL, then the degrees of freedom will be the sample size minus the number of covariates minus k.

include_intercept

A logical. If TRUE, then it will check X to see if it has an intercept term. If not, then it will add an intercept term. If FALSE, then X will be unchanged.

gls

A logical. Should we use generalized least squares (TRUE) or ordinary least squares (FALSE) for estimating the confounders? The OLS version is equivalent to using RUV4 to estimate the confounders.
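
For intuition, the two options differ only in whether the control-gene columns are weighted by their inverse variances when regressing on the estimated factor loadings. A rough sketch (alpha_hat and sig_diag are hypothetical factor analysis outputs, not objects returned by this function):

## Hypothetical GLS step: regress the control-gene columns of Y on the
## control-gene columns of alpha_hat, weighting by the inverse variances.
W        <- diag(1 / sig_diag[ctl])
a_ctl    <- alpha_hat[, ctl, drop = FALSE]
Zhat_gls <- Y[, ctl] %*% W %*% t(a_ctl) %*% solve(a_ctl %*% W %*% t(a_ctl))
## Setting gls = FALSE drops the weights W, recovering the RUV4-style estimate.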

fa_func

A factor analysis function. The function must have as inputs a numeric matrix Y and a rank (numeric scalar) r. It must output numeric matrices alpha and Z and a numeric vector sig_diag. alpha is the estimate of the coefficients of the unobserved confounders, so it must be an r by ncol(Y) matrix. Z must be an r by nrow(Y) matrix. sig_diag is the estimate of the column-wise variances so it must be of length ncol(Y). The default is the function pca_naive that just uses the first r singular vectors as the estimate of alpha. The estimated variances are just the column-wise mean square.
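
A minimal custom factor analysis following the interface described above might look like the sketch below (a truncated-SVD toy, not the package's pca_naive; see the customFA vignette referenced in Details for the authoritative interface).

## Toy factor analysis via a truncated SVD, returning components with the
## dimensions documented above (alpha: r x ncol(Y), Z: r x nrow(Y),
## sig_diag: length ncol(Y)).
fa_svd <- function(Y, r) {
  s        <- svd(Y, nu = r, nv = r)
  Z        <- t(s$u %*% diag(s$d[1:r], nrow = r))  # r x nrow(Y)
  alpha    <- t(s$v)                               # r x ncol(Y)
  sig_diag <- colMeans((Y - t(Z) %*% alpha)^2)     # per-column residual variance
  list(alpha = alpha, Z = Z, sig_diag = sig_diag)
}
## It could then be supplied via fa_func = fa_svd, fa_args = list().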

fa_args

A list. Additional arguments you want to pass to fa_func.

adjust_bias

A logical. Should we also use the control genes to adjust for bias (TRUE) or not (FALSE)?

Value

A list whose elements are:

multiplier A numeric. The estimated variance inflation parameter.

betahat_ols A vector of numerics. The ordinary least squares estimates of the coefficients of the covariate of interest, computed without including the estimated confounding variables.

sebetahat_ols A vector of positive numerics. The pre-inflation standard errors of ruv$betahat (NOT ruv$betahat_ols).

betahat A matrix of numerics. The ordinary least squares estimates of the coefficients of the covariate of interest WHEN YOU ALSO INCLUDE THE ESTIMATES OF THE UNOBSERVED CONFOUNDERS.

sebetahat A matrix of positive numerics. This is equal to sebetahat_ols * sqrt(multiplier). These are the post-inflation, adjusted standard errors for ruv$betahat.

tstats A vector of numerics. The t-statistics for testing the null hypothesis that the coefficient of the covariate of interest is zero. These are computed after estimating the variance inflation parameter.

pvalues A vector of numerics. The p-values of said test above.

alphahat A matrix of numerics. The estimates of the coefficients of the hidden confounders. Only identified up to a rotation on the rowspace.

Zhat A matrix of numerics. The estimates of the confounding variables. Only identified up to a rotation on the columnspace.

sigma2 A vector of positive numerics. The estimates of the variances PRIOR to inflation.

sigma2_adjusted A vector of positive numerics. The estimates of the variances AFTER inflation. This is equal to sigma2 * multiplier.

mult_matrix A matrix of numerics. Equal to solve(t(R22) %*% R22). One multiplies sigma2 or sigma2_adjusted by the diagonal elements of mult_matrix to get the variances (squared standard errors) of betahat; see the consistency-check sketch after this list.

degrees_freedom The degrees of freedom. If likelihood = "t", then this was the degrees of freedom used.

R22 A matrix of numerics. This is the submatrix of the R in the QR decomposition of X that corresponds to the covariates of interest. This is mostly returned for debugging purposes and may be removed in the future.

Z2 A matrix of numerics. This is the estimated confounders (after a rotation). Not useful on its own; it is mostly returned for debugging purposes and may be removed in the future.
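
As a rough consistency check of how these pieces fit together (a sketch using the ruvout object created in Examples below, which has a single covariate of interest so mult_matrix is 1 x 1):

## sebetahat should be reproducible from sigma2_adjusted and mult_matrix,
## and the t-statistics from betahat and sebetahat.
se_manual <- sqrt(ruvout$sigma2_adjusted * diag(ruvout$mult_matrix))
all.equal(c(ruvout$sebetahat), c(se_manual))
all.equal(c(ruvout$tstats), c(ruvout$betahat) / c(ruvout$sebetahat))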

Details

The model is $$Y = XB + ZA + E,$$ where \(Y\) is a matrix of responses (e.g. log-transformed gene expression levels), \(X\) is a matrix of covariates, \(B\) is a matrix of coefficients, \(Z\) is a matrix of unobserved confounders, \(A\) is a matrix of unobserved coefficients of the unobserved confounders, and \(E\) is the noise matrix where the elements are independent Gaussian and each column shares a common variance. The rows of \(Y\) are the observations (e.g. individuals) and the columns of \(Y\) are the response variables (e.g. genes).

This model is fit using a two-step approach proposed in Gagnon-Bartsch et al (2013) and described in Wang et al (2017), modified to include estimating a variance inflation parameter. An additional modification is the option to use a t-likelihood in the second step of the procedure, improving robustness to model misspecification.
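
Conceptually, the inflation step asks how overdispersed the control genes' pre-inflation test statistics are relative to their nominal null distribution. A very rough empirical-null style check (not the estimator actually used, which accounts for the design and, optionally, the t-likelihood) is:

## Rough empirical-null check (ruvout and ctl as in Examples below): squared
## z-scores of the control genes should average somewhere near the multiplier.
z_ctl <- c(ruvout$betahat)[ctl] / c(ruvout$sebetahat_ols)[ctl]
mean(z_ctl^2)        # crude inflation estimate from the controls
ruvout$multiplier    # estimate returned by vruv4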

For instructions and examples on how to specify your own factor analysis, run the following code in R: utils::vignette("customFA", package = "vicar"). If it doesn't work, then you probably haven't built the vignettes. To do so, see https://github.com/dcgerard/vicar#vignettes.

References

  • Buja, A. and Eyuboglu, N., 1992. "Remarks on parallel analysis." Multivariate behavioral research, 27(4), pp.509-540. doi: 10.1207/s15327906mbr2704_2

  • Efron, B., 2004. "Large-scale simultaneous hypothesis testing: the choice of a null hypothesis." Journal of the American Statistical Association, 99(465), pp.96-104. doi: 10.1198/016214504000000089

  • Gagnon-Bartsch, J., Laurent Jacob, and Terence P. Speed, 2013. "Removing unwanted variation from high dimensional data with negative controls." Berkeley: Department of Statistics. University of California. https://statistics.berkeley.edu/tech-reports/820

  • Gerard, David, and Matthew Stephens. 2021. "Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls." Statistica Sinica, 31(3), 1145-1166. doi: 10.5705/ss.202018.0345

  • Wang, Jingshu, Qingyuan Zhao, Trevor Hastie, and Art B. Owen. 2017. "Confounder adjustment in multiple hypothesis testing." The Annals of Statistics 45, no. 5: 1863-1894. doi: 10.1214/16-AOS1511

See also

ruv3 for a version of RUV4 that can also be considered a version of RUV2.

RUV4 For the version of RUV4 in the ruv package.

cate For the version of RUV4 in the cate package.

Author

David Gerard

Examples

library(vicar)

## Generate data and controls ---------------------------------------------
set.seed(1327)
n <- 13
p <- 201
k <- 2
q <- 3
is_null <- rep(FALSE, length = p)
is_null[1:101] <- TRUE
ctl <- rep(FALSE, length = p)
ctl[1:73] <- TRUE
X <- matrix(stats::rnorm(n * q), nrow = n)
B <- matrix(stats::rnorm(q * p), nrow = q)
B[2, is_null] <- 0
Z <- X %*% matrix(stats::rnorm(q * k), nrow = q) + matrix(rnorm(n * k), nrow = n)
A <- matrix(stats::rnorm(k * p), nrow = k)
E <- matrix(stats::rnorm(n * p, sd = 1 / 2), nrow = n)
Y <- X %*% B + Z %*% A + E

## Fit RUV4 assuming only second covariate is of interest -----------------
ruvout <- vruv4(Y = Y, X = X, ctl = ctl, k = k,
                include_intercept = FALSE, likelihood = "normal",
                cov_of_interest = 2)
graphics::plot(ruvout$pvalue, col = is_null + 3, ylab = "pvalues")
graphics::legend("topleft", col = 3:4, legend = c("non-null", "null"), pch = 1)
## should look uniform
graphics::hist(ruvout$pvalue[is_null], xlab = "pvalues", main = NULL)

## Compare to linear model ------------------------------------------------
lmout <- coefficients(summary(stats::lm(Y ~ X)))
lmp <- sapply(lmout, function(x) { x[3, 4] })
graphics::plot(lmp, col = is_null + 3, ylab = "pvalues")
graphics::legend("topleft", col = 3:4, legend = c("non-null", "null"), pch = 1)
## should look uniform
graphics::hist(lmp[is_null], xlab = "pvalues", main = NULL)

## Other ways to fit RUV4 from various packages ----------------------------
ruv4out <- cate::cate.fit(Y = Y, X.primary = X[, 2, drop = FALSE],
                          X.nuis = X[, -2, drop = FALSE], r = k,
                          fa.method = "pc", adj.method = "nc", nc = ctl,
                          calibrate = FALSE, nc.var.correction = FALSE)
vruv4out <- vruv4(Y = Y, X = X, k = k, cov_of_interest = 2,
                  include_intercept = FALSE, ctl = ctl,
                  limmashrink = FALSE, likelihood = "normal")
vruv4ols <- vruv4(Y = Y, X = X, k = k, cov_of_interest = 2,
                  include_intercept = FALSE, ctl = ctl,
                  limmashrink = FALSE, likelihood = "normal", gls = FALSE)
ruv4ols <- ruv::RUV4(Y = Y, X = X[, 2, drop = FALSE],
                     Z = X[, -2, drop = FALSE], ctl = ctl, k = k)

ruv4p <- ruv4out$beta.p.value
vruv4p <- vruv4out$pvalues
ruv4olsp <- ruv4ols$p
vruv4olsp <- vruv4ols$pvalues

## These should be monotonically related
## They often aren't because CATE often returns p-values that are exactly 0.
graphics::plot(ruv4p, vruv4p)
cor(c(ruv4p), c(vruv4p), method = "kendall")
#> [1] 0.9883652

## These should be monotonically related
graphics::plot(ruv4olsp, vruv4olsp)
cor(c(ruv4olsp), c(vruv4olsp), method = "kendall")
#> [1] 0.738408