It is a good idea to subsample (each iteration) the genes and samples from a real RNA-seq dataset prior to applying thin_diff (and related functions) so that your conclusions are not dependent on the specific structure of your dataset. This function is designed to efficiently do this for you.

select_counts(
  mat,
  nsamp = ncol(mat),
  ngene = nrow(mat),
  gselect = c("random", "max", "mean_max", "custom"),
  gvec = NULL,
  filter_first = FALSE,
  nskip = 0L
)

Arguments

mat

A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples.

nsamp

The number of samples (columns) to select from mat.

ngene

The number of genes (rows) to select from mat.

gselect

How should we select the subset of genes? Options include:

random

Randomly select the genes, with each gene having an equal probability of being included in the subsampled matrix.

max

Choose the ngene most median-expressed genes. Ties are broken by mean-expression.

mean_max

Choose the ngene most mean-expressed genes.

custom

A user-specified list of genes. If gselect = "custom" then gvec needs to be non-NULL.

gvec

A logical vector of length nrow(mat). A TRUE in position \(i\) indicates inclusion into the smaller dataset. Hence, sum(gvec) should equal ngene.

filter_first

Should we first filter genes by the method of Chen et al. (2016) (TRUE) or not (FALSE)? If TRUE then the edgeR package should be installed.

nskip

The number of median-maximally expressed genes to skip. Not used if gselect = "custom".

Value

A numeric matrix, which is a ngene by nsamp sub-matrix of mat. If rownames(mat) is NULL, then the row names of the returned matrix are the indices in mat of the selected genes. If colnames(mat) is NULL, then the column names of the returned matrix are the indices in mat of the selected samples.

Details

The samples (columns) are chosen randomly, with each sample having an equal probability of being in the sub-matrix. The genes are selected according to one of four schemes (see the description of the gselect argument).

If you have edgeR installed, then some functionality is provided for filtering out the lowest expressed genes prior to applying subsampling (see the filter_first argument). This filtering scheme is described in Chen et al. (2016). If you want more control over this filtering, you should use the filterByExpr function from edgeR directly. You can install edgeR by following instructions at doi: 10.18129/B9.bioc.edgeR .

References

  • Chen, Yunshun, Aaron TL Lun, and Gordon K. Smyth. "From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline." F1000Research 5 (2016). doi: 10.12688/f1000research.8987.2 .

Author

David Gerard

Examples

## Simulate data from given matrix of counts
## In practice, you would obtain mat from a real dataset, not simulate it.
set.seed(1)
n   <- 100
p   <- 1000
mat <- matrix(stats::rpois(n * p, lambda = 50), nrow = p)

## Subsample the matrix, then feed it into a thinning function
submat <- select_counts(mat = mat, nsamp = 10, ngene = 100)
thout  <- thin_2group(mat = submat, prop_null = 0.5)

## The rownames and colnames (if NULL in mat) tell you which genes/samples
## were selected.
rownames(submat)
#>   [1] "12"  "22"  "40"  "57"  "58"  "70"  "79"  "110" "113" "124" "125" "137"
#>  [13] "141" "142" "153" "186" "200" "236" "237" "239" "251" "254" "257" "259"
#>  [25] "260" "285" "290" "293" "321" "329" "333" "342" "343" "356" "359" "369"
#>  [37] "383" "394" "401" "420" "421" "435" "437" "467" "468" "476" "481" "490"
#>  [49] "492" "496" "507" "519" "520" "549" "588" "597" "616" "620" "641" "647"
#>  [61] "648" "652" "655" "657" "659" "683" "704" "714" "717" "728" "729" "732"
#>  [73] "734" "740" "745" "764" "766" "773" "778" "779" "782" "789" "801" "816"
#>  [85] "836" "845" "848" "849" "857" "863" "877" "904" "913" "919" "930" "947"
#>  [97] "952" "971" "992" "999"
colnames(submat)
#>  [1] "3"  "9"  "13" "14" "40" "80" "84" "86" "94" "99"