{doFuture} Vignette
{foreach} Vignette
{furrr} Website

Parallel computing is where you run many processes at the same time.
E.g., suppose I run this code to calculate the mean of a sample of 10 uniformly distributed observations, and repeat this 1000 times.
nsamp <- 1000
nind <- 10
outvec <- rep(NA_real_, length.out = nsamp)
for (i in seq_len(nsamp)) {
  outvec[[i]] <- mean(runif(nind))
}
Each iteration does not depend on any other iteration, so we could potentially run each iteration independently and simultaneously.
This could potentially speed things up.
Parallel processing can be beneficial if all of the following are true:
1. The iterations are independent of each other.
2. Each iteration takes a non-trivial amount of time.
3. There are many iterations.
Sometimes, if you have a quick for-loop, the overhead of parallelization can actually slow things down. This is why conditions 2 and 3 are important.
E.g., we would never parallelize the for-loop example above.
We will talk about a relatively new approach to parallel processing in R through the use of “futures”.
A future is a value that may be available at some point in the future. This value is the result of an expression (either evaluated or unevaluated).
The {future}
package implements a nice future data structure built on top of environments.
library(future)
When you create a future, you create something that may eventually have a value.
When you use a future, you evaluate the expression, obtain that value, and use that value.
Futures can be evaluated sequentially or in parallel, on the same machine or on a distributed cluster of machines.
The idea of a future is that you can write the same R code for sequential single computer jobs, as well as for large parallel jobs on the supercomputer, and the R code will work in both scenarios. This is pretty awesome.
This is also great because different types of parallel processing sometimes require different types of code. Using a future means that you don’t need to worry about the parallel processing environment.
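As a quick sketch of that portability claim (assuming {future} and {furrr} are installed; plan() is covered later in these notes), the very same call works under both a sequential and a parallel plan:

```r
library(future)

slow_square <- function(i) {
  Sys.sleep(0.2)  # stand-in for a slow computation
  i^2
}

plan("sequential")                 # runs in this R session
furrr::future_map_dbl(1:4, slow_square)
## [1]  1  4  9 16

plan("multisession", workers = 2)  # same call, now in parallel
furrr::future_map_dbl(1:4, slow_square)
## [1]  1  4  9 16
```

Only the plan() call changes; the mapping code itself is untouched.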
Example: Below, we do normal evaluation, binding x
to 10. This is evaluated right away, and so "Hello World"
is printed.
x <- {
  cat("Hello World\n")
  10
}
## Hello World

x + 1
## [1] 11
Below, we create a future that is not evaluated right away, so "Hello World"
is not printed. It just can be evaluated at some point.
x %<-% {
  cat("Hello World\n")
  10
}
When we use x
, then the future is evaluated, the expression runs, and “Hello World” is printed.
x + 1
## Hello World
## [1] 11
Alternatively, we can explicitly create a future with future::future()
f <- future({
  cat("Hello World\n")
  10
})
and evaluate it with future::value().
value(f) + 1
## Hello World
## [1] 11
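A future also does not have to block until you ask for its value. A small sketch using resolved() from the standard {future} API (assuming a parallel plan is set, so the expression runs on a worker):

```r
library(future)
plan("multisession", workers = 2)

f <- future({
  Sys.sleep(2)  # simulate a long computation
  42
})

resolved(f)  # typically FALSE right away: the worker is still computing
value(f)     # blocks until the result is ready
## [1] 42
```

This lets you kick off work, do something else, and collect results later.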
You can create futures using %<-%
or future::future()
, like I did above. But it is more common to use the {future}
package as a backend to a more familiar API.
{furrr}: Uses the {purrr} API.
{foreach}: Uses a for-loop API.
{future.apply}: Uses the base R vectorization API (like apply(), sapply(), vapply(), etc.).

You use future::plan() to determine if a future will be evaluated sequentially or in parallel using a particular strategy.
You basically just run future::plan()
once, and the {future}
package will automatically run processes according to your plan.
I am going to demonstrate plans by using the Sys.getpid()
function, which returns the process ID of the R session. So different IDs mean that the evaluation occurs in different processes.
Using this, the following will run Sys.getpid()
three times:

purrr::map_int(c(1, 1, 1), ~ Sys.getpid())
## [1] 14799 14799 14799
We will use furrr::future_map_int()
instead to run these as futures.
plan("sequential")
is the default and basically just runs R like you normally would. So all evaluation occurs in the same R process/session.
plan("sequential")
furrr::future_map_int(c(1, 1, 1), ~ Sys.getpid())
## [1] 14799 14799 14799
If your computer has multiprocessing or multithreading capabilities, then you can run your code in parallel.
See how many cores you have available by
future::availableCores()
## system
## 8
Whatever number you get is the maximum that you should try. However, if you are on a supercomputer, then you typically want to use far fewer than that number, otherwise you will use up too many resources and the admins will frown on that (always be nice to your admins). Always read the documentation your admins give you to determine your resource allowances.
plan("multisession", workers = n)
: Creates n
new R sessions, and evaluates futures on these new R sessions. This allows for parallel computing.
plan("multisession", workers = 3)
furrr::future_map_int(c(1, 1, 1), ~ Sys.getpid())
## [1] 14870 14871 14872
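Given the advice above about core counts, one common pattern (my own habit, not something these packages prescribe) is to leave a core free for the operating system and your interactive session:

```r
library(future)

# Use all but one of the available cores, but never fewer than one worker.
n_workers <- max(1, availableCores() - 1)
plan("multisession", workers = n_workers)
```

max(1, ...) guards against the degenerate single-core case, where subtracting one would request zero workers.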
plan("multicore", workers = n)
: Creates n
copies of your current R process (called “forking”), and evaluates futures on these new R processes. This allows for parallel computing.
plan("multicore", workers = 3)
furrr::future_map_int(c(1, 1, 1), ~ Sys.getpid())
## [1] 15002 15003 15004
My understanding is that forking (using "multicore") means that the child processes share the same memory addresses for objects (these objects are read-only in the child processes), while creating new R sessions means that everything needs to be copied over to each new R session. So forking can be faster. But forking can also be dangerous because it can cause the process to "lock" (stall) in some rare cases (for technical reasons). My understanding here is tenuous, so if you know better, let me know.
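A rough way to see the difference yourself (a sketch, not a benchmark from these notes; note that "multicore" is unavailable on Windows and inside RStudio, and timings will vary by machine):

```r
library(future)

big <- rnorm(1e7)  # a large (~80 MB) vector

# multisession: 'big' is serialized and copied to each new R session
plan("multisession", workers = 2)
system.time(furrr::future_map_dbl(1:2, ~ mean(big)))

# multicore: forked children see 'big' through shared, read-only memory
plan("multicore", workers = 2)
system.time(furrr::future_map_dbl(1:2, ~ mean(big)))
```

The gap in elapsed time mostly reflects the cost of shipping big to the multisession workers.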
So you should typically use plan("multisession", workers = n)
if you are doing parallelization on one computer.
{furrr}
We used {furrr}
for the examples above. If you know {purrr}
then switching to parallel computing should be pretty easy.
Here are the steps:
1. Write your code as you would with {purrr}, but use the {furrr} drop-in replacements.
2. Specify your evaluation strategy with a plan() call.

Implemented functions:
future_map()
future_map2()
future_pmap()
future_walk()
future_imap()
future_modify()
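For instance, a quick sketch of future_map2(), which iterates over two inputs in lockstep (assuming {furrr} is installed and a plan has been set as above):

```r
library(future)
plan("multisession", workers = 2)

# Add corresponding elements of two vectors, one pair per iteration.
furrr::future_map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x + .y)
## [1] 11 22 33
```

The `_dbl` suffix works just like in {purrr}: it promises each iteration returns a single double.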
You usually create a function that you want to call repeatedly
f <- function(x) {
  Sys.sleep(1)
  return(x)
}
Then you call {furrr}
functions on it.
Let’s compare sequential and parallel processing times:
plan("sequential")
system.time(
  x <- furrr::future_map_dbl(1:3, f)
)
## user system elapsed
## 0.014 0.000 3.017
plan("multisession", workers = 3)
system.time(
  x <- furrr::future_map_dbl(1:3, f)
)
## user system elapsed
## 0.075 0.003 1.580
{foreach}
{foreach}
is a package that gives another API for doing a for-loop.
library(foreach)
The biggest difference is that in a {foreach}
for-loop, you save the output to a final destination. In a regular for-loop, each iteration produces side effects.
The syntax for a {foreach}
for-loop is:
x <- foreach(i = 1:n) %do% {
  ## code that returns some element each iteration.
  ## These will be combined in some way and assigned to x
}
The {foreach}
for-loop by default combines the outputs in a list.
x <- foreach(i = 1:3) %do% {
  sqrt(i)
}
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414
##
## [[3]]
## [1] 1.732
You can change how the outputs are combined by the .combine
argument. Typically, you combine either using c
, rbind
, or cbind
.
Using c
returns a vector
x <- foreach(i = 1:3, .combine = c) %do% {
  sqrt(i)
}
x
## [1] 1.000 1.414 1.732
Using rbind
will make each returned vector a row.
x <- foreach(i = 1:3, .combine = rbind) %do% {
  c(sqrt = sqrt(i), square = i^2)
}
x
## sqrt square
## result.1 1.000 1
## result.2 1.414 4
## result.3 1.732 9
To do parallel processing with foreach():
1. Load the {doFuture} package.
2. Run doFuture::registerDoFuture(). This is how {foreach} knows to use the parallelization backend from {future}. (If you instead ran doParallel::registerDoParallel(), then {foreach} would use the {parallel} backend.)
3. Choose your evaluation strategy with future::plan().
4. Replace %do% with %dopar%.

library(doFuture)
registerDoFuture()
plan("multisession", workers = 3)
x <- foreach(i = 1:3, .combine = c) %dopar% {
  Sys.getpid()
}
x
## [1] 15140 15139 15141
If you are simulating anything inside a {foreach} for-loop (or using the random number generator in any way), then you need to use the {doRNG} package:
1. Load the {doRNG} package.
2. Run doFuture::registerDoFuture().
3. Run doRNG::registerDoRNG().
4. Choose your evaluation strategy with future::plan().
5. Replace %do% with %dopar%.

Doing this makes reproducibility possible.
library(doRNG)
registerDoFuture()
registerDoRNG()
plan("multisession", workers = 3)
set.seed(2)
x <- foreach(i = 1:3, .combine = c) %dopar% {
  runif(1)
}

set.seed(2)
y <- foreach(i = 1:3, .combine = c) %dopar% {
  runif(1)
}

x == y
## [1] TRUE TRUE TRUE
If you don’t do this, then reproducibility is impossible, even if you set the seed to the same value each time:
doFuture::registerDoFuture()
plan("multisession", workers = 3)

set.seed(2)
x <- foreach(i = 1:3, .combine = c) %dopar% {
  runif(1)
}

set.seed(2)
y <- foreach(i = 1:3, .combine = c) %dopar% {
  runif(1)
}
x
## [1] 0.1439 0.2100 0.6021
y
## [1] 0.1606 0.6253 0.9324
The {HardyWeinberg}
package contains four tests for Hardy-Weinberg Equilibrium: a \(\chi^2\)-test, a likelihood ratio test, an exact test, and a permutation test.
The following will simulate genotype frequencies under different sample sizes n
, allele frequencies af
(between 0 and 1), and fixation indices fx
(between 0 and 1, but typically less than 0.15). HWE is fulfilled when fx = 0
.
#' @param n Sample size
#' @param af Allele frequency
#' @param fx Fixation index. Null hypothesis of HWE is true if fx = 0.
hwesim <- function(n, af, fx) {
  probvec <- c(
    af ^ 2 + af * (1 - af) * fx,
    2 * af * (1 - af) * (1 - fx),
    (1 - af) ^ 2 + af * (1 - af) * fx
  )
  retvec <- c(rmultinom(n = 1, size = n, prob = probvec))
  names(retvec) <- c("AA", "AB", "BB")
  return(retvec)
}
We can then run the HWE tests from the package:
x <- hwesim(n = 100, af = 0.5, fx = 0.1)
HardyWeinberg::HWChisq(X = x)
## Chi-square test with continuity correction for Hardy-Weinberg equilibrium (autosomal)
## Chi2 = 0.35 DF = 1 p-value = 0.5541 D = -1.84 f = 0.07407
HardyWeinberg::HWLratio(X = x)
## Likelihood ratio test for Hardy-Weinberg equilibrium
## G2 = 0.5489 DF = 1 p-value = 0.4588
HardyWeinberg::HWExact(X = x)
## Haldane Exact test for Hardy-Weinberg equilibrium (autosomal)
## using SELOME p-value
## sample counts: nAA = 23 nAB = 46 nBB = 31
## H0: HWE (D==0), H1: D <> 0
## D = -1.84 p-value = 0.5452
If we want to explore the performance of these three methods (I'll discuss the permutation approach soon), we would probably vary
n \(\in \{10, 100, 1000\}\)
af \(\in \{0.01, 0.1, 0.25, 0.5\}\)
fx \(\in \{0, 0.05, 0.1, 0.15\}\)
This is \(3 \times 4^2 = 48\) unique combinations, and in each combination we would probably want to run 1000 replications. So that's 48,000 replications.
Let’s first create a data frame containing the possible parameter values:
nrep <- 1000
nvec <- c(10, 100, 1000)
afvec <- c(0.01, 0.1, 0.25, 0.5)
fxvec <- c(0, 0.05, 0.1, 0.15)
paramdf <- expand.grid(seed = 1:nrep,
                       n = nvec,
                       af = afvec,
                       fx = fxvec)
dim(paramdf)
## [1] 48000 4
head(paramdf)
## seed n af fx
## 1 1 10 0.01 0
## 2 2 10 0.01 0
## 3 3 10 0.01 0
## 4 4 10 0.01 0
## 5 5 10 0.01 0
## 6 6 10 0.01 0
If we were doing a for-loop, we would probably first populate the data frame with all of the output we want. Here, let’s say we want the \(p\)-values for each test:
paramdf$p_chisq <- NA_real_
paramdf$p_lratio <- NA_real_
paramdf$p_exact <- NA_real_
We would then iterate through this data frame, populating the output
pb <- progress::progress_bar$new(total = nrow(paramdf))
for (i in seq_len(nrow(paramdf))) {
  pb$tick()
  set.seed(paramdf$seed[[i]])
  n <- paramdf$n[[i]]
  af <- paramdf$af[[i]]
  fx <- paramdf$fx[[i]]
  x <- hwesim(n = n, af = af, fx = fx)

  paramdf$p_chisq[[i]] <- HardyWeinberg::HWChisq(X = x, verbose = FALSE)$pval
  paramdf$p_lratio[[i]] <- HardyWeinberg::HWLratio(X = x, verbose = FALSE)$pval
  paramdf$p_exact[[i]] <- HardyWeinberg::HWExact(X = x, verbose = FALSE)$pval
}
When I ran this on my laptop, it took about a minute, which isn’t that bad, so I wouldn’t bother parallelizing here.
But if I also wanted to compare HardyWeinberg::HWPerm()
, this one function takes about 5 seconds per call:
x <- hwesim(n = 1000, af = 0.5, fx = 0.15)
system.time(
  HardyWeinberg::HWPerm(x = x, verbose = FALSE)
)
## user system elapsed
## 5.186 0.002 5.201
So if I had to include that in the simulation study, it could take up to \(5 \times 48000 / 60 / 60 = 66.7\) hours, and I would totally parallelize in this scenario.
The code for parallelization looks like this:
paramdf <- expand.grid(seed = 1:nrep,
                       n = nvec,
                       af = afvec,
                       fx = fxvec)
registerDoFuture()
registerDoRNG()
plan("multisession", workers = availableCores())
pmat <- foreach(i = seq_len(nrow(paramdf)), .combine = rbind) %dopar% {
  set.seed(paramdf$seed[[i]])
  n <- paramdf$n[[i]]
  af <- paramdf$af[[i]]
  fx <- paramdf$fx[[i]]
  x <- hwesim(n = n, af = af, fx = fx)

  pvec <- rep(NA_real_, length.out = 3)
  names(pvec) <- c("p_chisq", "p_lratio", "p_exact")
  pvec[[1]] <- HardyWeinberg::HWChisq(X = x, verbose = FALSE)$pval
  pvec[[2]] <- HardyWeinberg::HWLratio(X = x, verbose = FALSE)$pval
  pvec[[3]] <- HardyWeinberg::HWExact(X = x, verbose = FALSE)$pval

  pvec
}
paramdf <- cbind(paramdf, pmat)
Using 8 cores on my laptop, the above took about 20 seconds, which is about 3 times faster than sequential processing.
There are some small optimizations I could do to improve performance even more. E.g., in the above I am passing that large data frame paramdf to each new R session. One solution is to iterate over its rows with {iterators}, as described in the {foreach} vignette.

Side note: The simulation results indicate that the chi-square test is not appropriate for small counts and small allele frequencies. The likelihood ratio and exact tests perform similarly.
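A sketch of that {iterators} idea (assuming the {iterators} package is installed; smalldf and prow are illustrative names, and the hwesim() details are dropped for brevity). iter(df, by = "row") hands each worker a single row rather than the whole data frame:

```r
library(foreach)
library(iterators)
doFuture::registerDoFuture()
doRNG::registerDoRNG()
future::plan("multisession", workers = 2)

smalldf <- data.frame(n = c(10, 100), af = c(0.1, 0.5))
simout <- foreach(prow = iter(smalldf, by = "row"), .combine = rbind) %dopar% {
  # each worker receives only this one-row data frame, not all of smalldf
  c(n = prow$n, af = prow$af, mean_draw = mean(runif(prow$n)))
}
```

Only one row crosses the process boundary per iteration, which matters once the parameter data frame gets large.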
library(tidyverse)
paramdf %>%
  gather(p_chisq, p_lratio, p_exact, key = "method", value = "pvalue") %>%
  group_by(n, af, fx, method) %>%
  summarize(power = mean(pvalue < 0.05), .groups = "drop") %>%
  mutate(method = str_remove(method, "p_")) %>%
  mutate(af = factor(af), fx = factor(fx)) %>%
  ggplot(aes(x = n, y = power, color = method)) +
  facet_grid(af ~ fx) +
  geom_line() +
  geom_point() +
  scale_x_log10() +
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73"))
You can see if your R processes are running by using top
in Bash (not for Windows).
top
It will open a running list of processes, and if your parallelization is successful you should see multiple processes called R
.
You can exit the top
list by typing q
.
If you want to be fancy, you can try downloading htop
.
Recall the normal simple linear regression model: \[ y_i = \beta_0 + \beta_1x_i + \epsilon_i\\ \epsilon_i \overset{iid}{\sim} N(0, \sigma^2) \]
Write a simulation study where you fix \(x_i \sim N(0, 1)\) and vary
Each unique combination of parameters should have 200 replicates.
You are exploring the estimation of \(\beta_1\) by ordinary least squares.
Use {foreach}
to implement this simulation study in parallel.
What factors seem to affect estimates of \(\beta_1\)?
future::availableCores(): See how many cores are available.
future::plan(): Set the evaluation plan ("sequential"/"multisession"/"multicore").
furrr::future_map() and variants: Drop-in parallel replacements for the {purrr} maps.
doFuture::registerDoFuture(): Register a parallel backend so that {foreach} knows how to use %dopar%.
doRNG::registerDoRNG(): Register a parallel backend that is reproducible when using %dopar%.
foreach() used with %do% or %dopar%.