Fit flexdog to multiple SNPs. — multidog • updog

This is a convenience function that will run flexdog over many SNPs. Support is provided for parallel computing through the doParallel package. This function has not been extensively tested. Please report any bugs to https://github.com/dcgerard/updog/issues.

Usage

multidog(
  refmat,
  sizemat,
  ploidy,
  model = c("norm", "hw", "bb", "s1", "s1pp", "f1", "f1pp", "flex", "uniform", "custom"),
  nc = 1,
  p1_id = NULL,
  p2_id = NULL,
  bias_init = exp(c(-1, -0.5, 0, 0.5, 1)),
  prior_vec = NULL,
  ...
)

Arguments

refmat

A matrix of reference read counts. The columns index the individuals and the rows index the markers (SNPs). This matrix must have rownames (for the names of the markers) and column names (for the names of the individuals). These names must match the names in sizemat.

sizemat

A matrix of total read counts. The columns index the individuals and the rows index the markers (SNPs). This matrix must have rownames (for the names of the markers) and column names (for the names of the individuals). These names must match the names in refmat.

ploidy

The ploidy of the species. Assumed to be the same for each individual.

model

What form should the prior (genotype distribution) take? See Details for possible values.

nc

The number of computing cores to use when doing parallelization on your local machine. See the section "Parallel Computation" for how to implement more complicated evaluation strategies using the future package.

When you are specifying other evaluation strategies using the future package, you should also set nc = NA.

The value of nc should never be more than the number of cores available in your computing environment. You can determine the maximum number of available cores by running future::availableCores() in R.

p1_id

The ID of the first parent. This should be a character of length 1. This should correspond to a single column name in refmat and sizemat.

p2_id

The ID of the second parent. This should be a character of length 1. This should correspond to a single column name in refmat and sizemat.

bias_init

A vector of initial values for the bias parameter over the multiple runs of flexdog_full().

prior_vec

The pre-specified genotype distribution. Only used if model = "custom" and must otherwise be NULL. If specified, then it should be a vector of length ploidy + 1 with non-negative elements that sum to 1.

...

Additional parameters to pass to flexdog_full().

Value

A list-like object of two data frames.

snpdf

A data frame containing properties of the SNPs (markers). The rows index the SNPs. The variables include:

snp: The name of the SNP (marker).
bias: The estimated allele bias of the SNP.
seq: The estimated sequencing error rate of the SNP.
od: The estimated overdispersion parameter of the SNP.
prop_mis: The estimated proportion of individuals misclassified in the SNP.
num_iter: The number of iterations performed during the EM algorithm for that SNP.
llike: The maximum marginal likelihood of the SNP.
ploidy: The provided ploidy of the species.
model: The provided model for the prior genotype distribution.
p1ref: The user-provided reference read counts of parent 1.
p1size: The user-provided total read counts of parent 1.
p2ref: The user-provided reference read counts of parent 2.
p2size: The user-provided total read counts of parent 2.
Pr_k: The estimated frequency of individuals with genotype k, where k can be any integer between 0 and the ploidy level.
Model specific parameter estimates: See the return value of par in the help page of flexdog.

inddf

A data frame containing the properties of the individuals at each SNP. The variables include:

snp: The name of the SNP (marker).
ind: The name of the individual.
ref: The provided reference counts for that individual at that SNP.
size: The provided total counts for that individual at that SNP.
geno: The posterior mode genotype for that individual at that SNP. This is the estimated reference allele dosage for a given individual at a given SNP.
postmean: The posterior mean genotype for that individual at that SNP. This is a continuous genotype estimate of the reference allele dosage for a given individual at a given SNP.
maxpostprob: The maximum posterior probability. This is the posterior probability that the individual was genotyped correctly.
Pr_k: The posterior probability that a given individual at a given SNP has genotype k, where k can vary from 0 to the ploidy level of the species.
logL_k: The genotype log-likelihoods for dosage k for a given individual at a given SNP, where k can vary f rom 0 to the ploidy level of the species.

Details

You should format your reference counts and total read counts in two separate matrices. The rows should index the markers (SNPs) and the columns should index the individuals. Row names are how we ID the SNPs and column names are how we ID the individuals, and so they are required attributes.

If your data are in VCF files, I would recommend importing them using the VariantAnnotation package from Bioconductor https://bioconductor.org/packages/VariantAnnotation/. It's a great VCF parser.

See the details of flexdog for the possible values of model.

If model = "f1", model = "s1", model = "f1pp" or model = "s1pp" then the user may provide the individual ID for parent(s) via the p1_id and p2_id arguments.

The output is a list containing two data frames. The first data frame, called snpdf, contains information on each SNP, such as the allele bias and the sequencing error rate. The second data frame, called inddf, contains information on each individual at each SNP, such as the estimated genotype and the posterior probability of being classified correctly.

SNPs that contain 0 reads (or all missing data) are entirely removed.

Parallel Computation

The multidog() function supports parallel computing. It does so through the future package.

If you are just running multidog() on a local machine, then you can use the nc argument to specify the parallelization. Any value of nc greater than 1 will result in multiple background R sessions to genotype all of the SNPs. The maximum value of nc you should try can be found by running future::availableCores(). Running multidog() using nc is equivalent to setting the future plan with future::plan(future::multisession, workers = nc).

Using the future package means that different evaluation strategies are possible. In particular, if you are using a high performance machine, you can explore using the future.batchtools package to evaluate multidog() using schedulers like Slurm or TORQUE/PBS.

To use a different strategy, set nc = NA and then run future::plan() prior to running multidog(). For example, to set up forked R processes on your current machine (instead of using background R sessions), you would run (will not work on Windows): future::plan(future::multicore), followed by running multidog() with nc = NA. See the examples below.

Author

David Gerard

Examples

if (FALSE) {
data("uitdewilligen")

## Run multiple R sessions using the `nc` variable.
mout <- multidog(refmat = t(uitdewilligen$refmat),
                 sizemat = t(uitdewilligen$sizemat),
                 ploidy = uitdewilligen$ploidy,
                 nc = 2)
mout$inddf
mout$snpdf

## Run multiple external R sessions on the local machine.
## Note that we set `nc = NA`.
cl <- parallel::makeCluster(2, timeout = 60)
future::plan(future::cluster, workers = cl)
mout <- multidog(refmat = t(uitdewilligen$refmat),
                 sizemat = t(uitdewilligen$sizemat),
                 ploidy = uitdewilligen$ploidy,
                 nc = NA)
mout$inddf
mout$snpdf

## Close cluster and reset future to current R process
parallel::stopCluster(cl)
future::plan(future::sequential)
}

Fit `flexdog` to multiple SNPs.