Removing Unwanted Variation 3.

RUV2 (RUV2) and RUV4 (cate or RUV4) are actually classes of methods indexed by the factor analysis used. RUV3 is the intersection of RUV2 and RUV4. That is, it is the class of methods that can be considered both RUV2 and RUV4.

ruv3(
  Y,
  X,
  ctl,
  k = NULL,
  cov_of_interest = ncol(X),
  include_intercept = TRUE,
  limmashrink = TRUE,
  gls = TRUE,
  fa_func = pca_naive,
  fa_args = list()
)

Arguments

Y	A matrix of numerics. These are the response variables where each column has its own variance. In a gene expression study, the rows are the individuals and the columns are the genes.
X	A matrix of numerics. The covariates of interest.
ctl	A vector of logicals of length `ncol(Y)`. If position i is `TRUE` then position i is considered a negative control.
k	A non-negative integer.The number of unobserved confounders. If not specified and the R package sva is installed, then this function will estimate the number of hidden confounders using the methods of Buja and Eyuboglu (1992).
cov_of_interest	A vector of positive integers. The column numbers of the covariates in X whose coefficients you are interested in. The rest are considered nuisance parameters and are regressed out by OLS.
include_intercept	A logical. If `TRUE`, then it will check `X` to see if it has an intercept term. If not, then it will add an intercept term. If `FALSE`, then `X` will be unchanged.
limmashrink	A logical. Should we apply hierarchical shrinkage to the variances (`TRUE`) or not (`FALSE`)? If `degrees_freedom = NULL` and `limmashrink = TRUE` and `likelihood = "t"`, then we'll also use the limma returned degrees of freedom.
gls	A logical. Should we use generalized least squares (`TRUE`) or ordinary least squares (`FALSE`) for estimating the confounders? The OLS version is equivalent to using RUV4 to estimate the confounders.
fa_func	A factor analysis function. The function must have as inputs a numeric matrix `Y` and a rank (numeric scalar) `r`. It must output numeric matrices `alpha` and `Z` and a numeric vector `sig_diag`. `alpha` is the estimate of the coefficients of the unobserved confounders, so it must be an `r` by `ncol(Y)` matrix. `Z` must be an `r` by `nrow(Y)` matrix. `sig_diag` is the estimate of the column-wise variances so it must be of length `ncol(Y)`. The default is the function `pca_naive` that just uses the first `r` singular vectors as the estimate of `alpha`. The estimated variances are just the column-wise mean square.
fa_args	A list. Additional arguments you want to pass to fa_func.

Value

betahat The estimates of the coefficients of interest. The values corresponding to control genes are 0.

sebetahat_unadjusted The unadjusted standard errors of betahat. The values corresponding to control genes are NA.

tstats_unadjusted The t-statistics corresponding to the coefficients of interest. These use sebetahat_unadjusted as the standard errors. The values corresponding to control genes are NA.

pvalues_unadjusted The p-values using said statistics above.

sebetahat_adjusted The unadjusted standard errors of betahat. This equals sebetahat_unadjusted * multiplier. The values corresponding to control genes are NA.

tstats_adjusted The t-statistics corresponding to the coefficients of interest. These use sebetahat_adjusted as the standard errors. The values corresponding to control genes are NA.

pvalues_unadjusted The p-values using said statistics above.

betahat_ols The ordinary least squares (OLS) estimates for all of the coefficients.

sebetahat_ols The standard errors from OLS regression.

tstats_ols The t-statistics from OLS regression.

pvalues_ols The p-values from OLS regression.

sigma2_unadjusted The unadjusted variance estimates.

sigma2_adjusted The adjusted variance estimates. This is equal to sigma2_unadjusted * multiplier.

Zhat The estimates of the confounders.

alphahat The estimates of the coefficients of the confounders.

multiplier The estimate of the variance inflation parameter.

mult_matrix The unscaled covariance of betahat after including the confounders.

mult_matrix_ols The OLS version of mult_matrix.

degrees_freedom The degrees of freedom used when calculating the p-values.

debuglist A list of elements that aren't really useful except for unit testing and debugging.

resid_mat Y21 - Z2

Details

The model is $$Y = XB + ZA + E,$$ where $Y$ is a matrix of responses (e.g. log-transformed gene expression levels), $X$ is a matrix of covariates, $B$ is a matrix of coefficients, $Z$ is a matrix of unobserved confounders, $A$ is a matrix of unobserved coefficients of the unobserved confounders, and $E$ is the noise matrix where the elements are independent Gaussian and each column shares a common variance. The rows of $Y$ are the observations (e.g. individuals) and the columns of $Y$ are the response variables (e.g. genes).

For instructions and examples on how to specify your own factor analysis, run the following code in R: utils::vignette("customFA", package = "vicar"). If it doesn't work, then you probably haven't built the vignettes. To do so, see https://github.com/dcgerard/vicar#vignettes.

References

Gerard, David, and Matthew Stephens. 2021. "Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls." Statistica Sinica, 31(3), 1145-1166. doi: 10.5705/ss.202018.0345

Author

David Gerard

Examples

library(vicar)

## Generate data and controls ---------------------------------------------
set.seed(127)
n <- 13
p <- 201
k <- 2
q <- 3
is_null       <- rep(FALSE, length = p)
is_null[1:101] <- TRUE
ctl           <- rep(FALSE, length = p)
ctl[1:73]     <- TRUE

X <- matrix(stats::rnorm(n * q), nrow = n)
B <- matrix(stats::rnorm(q * p), nrow = q)
B[2, is_null] <- 0
Z <- X %*% matrix(stats::rnorm(q * k), nrow = q) +
     matrix(rnorm(n * k), nrow = n)
A <- matrix(stats::rnorm(k * p), nrow = k)
E <- matrix(stats::rnorm(n * p, sd = 1 / 2), nrow = n)
Y <- X %*% B + Z %*% A + E

## Fit RUV3, CATE (RUV4), and RUV2 ----------------------------------------
## The parameters chosen in CATE are to make the comparisons as close as possible.
ruv3out <- ruv3(Y = Y, X = X, k = k, cov_of_interest = 2,
                include_intercept = FALSE, ctl = ctl,
                limmashrink = FALSE)
ruv2out <- ruv::RUV2(Y = Y, X = X[, 2, drop = FALSE],
                     Z = X[, -2, drop = FALSE], ctl = ctl, k = k)
ruv4out <- cate::cate.fit(Y = Y, X.primary = X[, 2, drop = FALSE],
                     X.nuis = X[, -2, drop = FALSE], r = k, fa.method = "pc",
                     adj.method = "nc", nc = ctl, calibrate = FALSE,
                     nc.var.correction = FALSE)

ruv3p <- ruv3out$pvalues_unadjusted
ruv2p <- ruv2out$p
ruv4p <- ruv4out$beta.p.value

## Control genes are known to be 0! ---------------------------------------
ruv2p[ctl] <- NA
ruv4p[ctl] <- NA

## Plot ROC curves are very similar in this dataset------------------------
order_ruv3 <- order(ruv3p, na.last = NA)
order_ruv2 <- order(ruv2p, na.last = NA)
order_ruv4 <- order(ruv4p, na.last = NA)

nnull <- sum(is_null[!ctl])
nsig  <- sum(!is_null[!ctl])
fpr3 <- cumsum(is_null[order_ruv3]) / nnull
tpr3 <- cumsum(!is_null[order_ruv3]) / nsig

fpr2 <- cumsum(is_null[order_ruv2]) / nnull
tpr2 <- cumsum(!is_null[order_ruv2]) / nsig

fpr4 <- cumsum(is_null[order_ruv4]) / nnull
tpr4 <- cumsum(!is_null[order_ruv4]) / nsig

graphics::plot(fpr3, tpr3, type = "l", xlab = "False Positive Rate",
               ylab = "True Positive Rate", main = "ROC Curves")
graphics::lines(fpr2, tpr2, col = 3)
graphics::lines(fpr4, tpr4, col = 4)
graphics::legend("bottomright", legend = c("RUV2", "RUV3", "RUV4"),
                 col = c(3, 1, 4), lty = 1)