We extract latent factors from the log of mat
using an SVD, then
generate an underlying group-assignment variable from a conditional
normal distribution (conditional on the latent factors). This underlying
group-assignment variable is used to assign groups.
corassign(mat, nfac = NULL, corvec = NULL, return = c("group", "full"))
A matrix of count data. The rows index the individuals and the columns index the genes.
The number of latent factors. If NULL
, then we will
use est.factor.num
from the cate
package to choose the number of latent factors.
The vector of correlations. corvec[i]
is the correlation
between latent factor i
and the underlying group-assignment variable.
You can think of the correlations in corvec
as a kind of "tetrachoric
correlation." If NULL
, then it assumes independence between
factors and group assignment. Note that the correlations of the
latent factors with the observed group-assignment vector (instead of the
latent group-assignment vector) will be corvec * sqrt(2 / pi)
.
What should we return? Just the group assignment
("group"
) or a list of a bunch of things ("full"
).
A list with some or all of the following elements:
x
The vector of group assignments. 0L
indicates
membership to one group and 1L
indicates membership to
the other group.
nfac
The number of assumed latent factors.
facmat
A matrix, whose columns contain the latent factors.
groupfac
The underlying group-assignment factor.
corvec
The correlation vector. Note that this is the
correlation between random variables observed in groupfac
and facmat
,
If return = "group"
, then the list only contains x
.
If nfac
is provided, then corvec
must be the same length as nfac
.
If nfac
is not provided, then it is assumed that the first nfac
elements of corvec
are the underlying correlations, if nfac
turns out to be
smaller than the length of corvec
. If nfac
turns
out to be larger than the length of corvec
, then the factors without
defined correlations are assumed to have correlation 0.
Jingshu Wang and Qingyuan Zhao (2015). cate: High Dimensional Factor Analysis and Confounder Adjusted Testing and Estimation. R package version 1.0.4. https://cran.r-project.org/package=cate
## Simulate data from given matrix of counts
## In practice, you would obtain Y from a real dataset, not simulate it.
set.seed(1)
nsamp <- 1000
ngene <- 10
Y <- matrix(stats::rpois(nsamp * ngene, lambda = 50), nrow = ngene)
## Set target correlation to be 0.9 and nfac to be 1
corvec <- 0.9
nfac <- 1
## Group assignment
cout <- corassign(mat = t(Y),
nfac = nfac,
corvec = corvec,
return = "full")
## Correlation between facmat and groupfac should be about 0.9
cor(cout$facmat, cout$groupfac)
#> [,1]
#> [1,] 0.9064812
## Correlation between facmat and x should be about 0.9 * sqrt(2 / pi)
cor(cout$facmat, cout$x)
#> [,1]
#> [1,] 0.7341039
corvec * sqrt(2 / pi)
#> [1] 0.7180961