You can run an R script file in batch mode (i.e. non-interactively) from bash by using the following command:
R CMD BATCH --no-save --no-restore input_file.R output_file.Rout
Make sure to change "input_file.R" and "output_file.Rout" to your own file names. The --no-save --no-restore options make sure that you are working with a clean environment and that you don't save this environment after the command is executed. This is a good thing for reproducibility.

The newer way is to use Rscript. The following command will do mostly what R CMD BATCH does, except that "output_file.Rout" will contain only the script's printed output (e.g. from print()), not an echo of the commands themselves.
Rscript input_file.R > output_file.Rout
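To see the stream behavior concretely, here is a small sketch (the file names demo.R and errors.txt are made up for illustration): print() and cat() write to stdout, which > captures, while message() writes to stderr, which needs its own 2> redirect.

```shell
# Create a toy R script (hypothetical file name).
cat > demo.R <<'EOF'
print("captured by >")       # goes to stdout
message("captured by 2>")    # goes to stderr
EOF
# Requires R to be installed; stdout and stderr land in different files:
# Rscript demo.R > output_file.Rout 2> errors.txt
```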
Even though R CMD BATCH is the older way to run R scripts in batch mode, I like it better because:

Rscript forces the --no-echo option, so the commands in the script are not echoed alongside their output.
I had issues with Rscript when using it on the supercomputer. But that was a few years ago and they might have fixed things by now.

One nice thing about Rscript is that you can run one-liners of R code from bash using -e:
Rscript -e 'print("hello world")'
## [1] "hello world"
This allows you to knit an R markdown file from bash by using the following command:
Rscript -e "library(rmarkdown);render('rmarkdown_file.Rmd')"
On Windows machines, the above commands should also work in PowerShell, but not in Git Bash for Windows.
Exercise: Write an R script that loads the mtcars dataset and regresses mpg on wt. Then run this script in batch mode.
On Unix-like machines (e.g. Ubuntu and Mac), you can make an R file executable using a shebang line.
At the very top of your R script, place the following:

#!/path/to/Rscript --vanilla

where "/path/to/Rscript" is replaced with where Rscript is installed on your machine. The --vanilla option makes sure that your script is reproducible in other users' computing environments.

You can find where Rscript
is located by running the following in bash:
which Rscript
## /usr/bin/Rscript
It is common for "/usr/bin/Rscript" to be your Rscript location, unless you have changed the defaults.
Then change the mode to executable using chmod
chmod +x filename.R
Then you can execute it by running the following in bash
./filename.R
You can make it so that you can execute this file from anywhere on your computer by adding its directory to the $PATH.
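As a sketch of that $PATH step (the directory and script names here are made up; a plain shell script stands in for filename.R), you put the executable file in some directory and prepend that directory to PATH, typically in your ".bashrc":

```shell
# Make a directory for executables and put an executable script there.
mkdir -p "$HOME/bin"
printf '#!/bin/sh\necho hello from anywhere\n' > "$HOME/bin/myscript"
chmod +x "$HOME/bin/myscript"

# Prepend that directory to PATH; persist this line in ~/.bashrc.
export PATH="$HOME/bin:$PATH"

# Now the script runs from any working directory, no "./" prefix needed.
myscript   # prints "hello from anywhere"
```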
This makes creating command line interfaces for your R code relatively simple.
Suppose I have the following R script file called “ex1.R”
#!/usr/bin/Rscript --vanilla
cat("Hello World\n", file = "hello.txt")
Then I run
chmod +x ex1.R
Then, whenever I run the following, that script will run:
./ex1.R
American University has a supercomputer called Zorro.
To use it, you have to submit jobs via a scheduler. The scheduler Zorro uses is LSF, so that's what we'll learn. There are many other popular schedulers. Their syntax differs, but they all operate on similar ideas.
You need faculty sponsorship to get access to Zorro. Once you have a sponsorship, fill out a request form from the Zorro Website.
After you have permission to access Zorro, you interface with it using SSH.
You transfer files between the supercomputer and your computer through FTP.
You are required to use the AU VPN while accessing Zorro. All users should follow the AU instructions here to connect to the AU VPN.
Connect to the AU VPN.
Windows Users: Using an SSH client, connect to the host zorro.american.edu. Press Enter.

Mac Users:
In the terminal, type:
ssh -l username zorro.american.edu
where “username” is your AU username.
Type in your AU password and hit enter.
Once you are connected to Zorro, you use bash to navigate, run programs, submit jobs, etc.
You specify the properties of a job in an LSF file.
Example LSF File:
#BSUB -J minimal_example
#BSUB -q normal
#BSUB -o minimal_out.txt
#BSUB -e minimal_err.txt
#BSUB -u "youremail@american.edu"
#BSUB -B
#BSUB -N
#BSUB -n 2
/path/to/R CMD BATCH --no-save --no-restore '--args nc=2' minimal.R minimal.Rout
You have a list of LSF options; each option begins with #BSUB.
You then have a bash command. In this case we are running R in batch mode via

/path/to/R CMD BATCH --no-save --no-restore '--args nc=2' minimal.R minimal.Rout

The "/path/to/R" can be found by typing the following in bash:
which R
This might not be the version of R that you want. Zorro has a few copies of R installed. You can see these versions by typing in bash:
ls -d /app/R-*
As of the writing of this document, they have versions 3.6.0, 3.6.1, 4.0.2, and 4.1.0
You can use a specific version of R (e.g. 4.0.2) by using
/app/R-4.0.2/bin/R CMD BATCH --no-save --no-restore '--args nc=2' minimal.R minimal.Rout
That '--args nc=2' trick is a way to pass arbitrary arguments to an R script; inside the script, they are retrieved with commandArgs().
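As a sketch of that mechanism (the file name show_args.R is made up): everything after --args appears in commandArgs(trailingOnly = TRUE) as character strings, which is why the template below runs eval(parse(text = ...)) on "nc=2" to create the nc variable.

```shell
# Toy R script (hypothetical name) that prints its trailing arguments.
cat > show_args.R <<'EOF'
args <- commandArgs(trailingOnly = TRUE)
print(args)
EOF
# Requires R; the .Rout file would then contain: [1] "nc=2"
# R CMD BATCH --no-save --no-restore '--args nc=2' show_args.R show_args.Rout
```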
Typical options are:

#BSUB -J job_name: Name of the job, so that you can see that name when you check the status of your job.
#BSUB -q normal: The queue to submit your job to. You shouldn't change this for Zorro.
#BSUB -o job_out.txt: Where to send the job's output.
#BSUB -e job_err.txt: Where to send the job's error messages.
#BSUB -u "youremail@american.edu": Send mail to the specified user.
#BSUB -B: Emails you when the job begins.
#BSUB -N: Emails you when the job ends.
#BSUB -n 2: Submits a parallel job and specifies the number of tasks in the job (in this case, 2).

A full list of options can be found in the LSF documentation.
You need to format your R script to be able to run in parallel.
Example R Script
## Set library for R packages ----
.libPaths(c("/home/dgerard/R/4.0.2/", .libPaths()))

## Attach packages for parallel computing ----
library(foreach)
library(doFuture)

## Determine number of cores ----
args <- commandArgs(trailingOnly = TRUE)
if (length(args) == 0) {
  nc <- 1
} else {
  eval(parse(text = args[[1]]))
}
cat(nc, "\n")

## Register workers ----
if (nc == 1) {
  registerDoSEQ()
  plan(sequential)
} else {
  registerDoFuture()
  plan(multisession, workers = nc)
  if (getDoParWorkers() == 1) {
    stop("nc > 1, but only one core registered")
  }
}

## Run R script ----
x <- foreach(i = 1:2, .combine = c) %dopar% {
  Sys.sleep(1)
  i
}
x

## Unregister workers ----
if (nc > 1) {
  plan(sequential)
}
The above is the template I use.
The above template only runs on multiple cores of a single node; it does not allow for multi-node processing.
You should only need to modify two things in the above code:

The ## Run R script ---- section, where you implement your computations.
The .libPaths() call. This is where your R packages are locally stored (see below).

bsub commands

You submit and control jobs with the bsub command in bash.
The following will submit the job in "minimal.lsf":

bsub < minimal.lsf

This redirects the "minimal.lsf" file to the bsub command. Alternatively, you can pass options directly to bsub, like bsub -q normal -n 2 ...
Use bjobs to display information on the jobs that you have submitted.
Display all of your jobs:
bjobs -a
Display all of the jobs of a user:
bjobs -u user_name
Display information about a particular job
bjobs job_id
Different job states that you will see:

PEND: Waiting in a queue.
RUN: Currently running.
DONE: Finished successfully with no errors.
EXIT: Errored; did not finish successfully.

Use bkill to kill a job.
bkill job_id
You can see what packages are already installed on Zorro by typing
ls /app/R-4.0.2/lib64/R/library
You need to install R packages in a local directory, since global install is not supported (because not everyone wants your R packages).
You should create an R directory where you put all things R:
mkdir R
Then inside this directory, create a directory where you can place your packages for a specific version of R
mkdir 4.0.2
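The two mkdir calls above assume you cd into "R" in between; equivalently, you can create both levels in one step:

```shell
# -p creates missing parent directories and is a no-op if they already exist.
mkdir -p "$HOME/R/4.0.2"
ls -d "$HOME/R/4.0.2"   # confirm the directory exists
```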
Then create the following R file, called “install.R”, which can be used to install R packages into that directory (but change my username to yours):
.libPaths(c("/home/dgerard/R/4.0.2/", .libPaths()))
install.packages(c("tidyverse",
                   "future",
                   "doFuture",
                   "foreach"),
                 lib = "/home/dgerard/R/4.0.2/",
                 repos = "http://cran.us.r-project.org")
This installs the {tidyverse}, {future}, {doFuture}, and {foreach} packages. But you can add more.

If you want to use Bioconductor packages in R Version 4.0.2, run the following:
.libPaths(c("/home/dgerard/R/4.0.2/", .libPaths()))
install.packages("BiocManager",
                 lib = "/home/dgerard/R/4.0.2/",
                 repos = "http://cran.us.r-project.org")
BiocManager::install(version = "3.12", lib = "/home/dgerard/R/4.0.2/", ask = FALSE)
BiocManager::install(c("tidyverse",
                       "future",
                       "doFuture",
                       "foreach"),
                     lib = "/home/dgerard/R/4.0.2/",
                     ask = FALSE)
Set up an LSF file to run this script, called “install.lsf”:
#BSUB -J install_r_pkgs
#BSUB -q normal
#BSUB -o install_out.txt
#BSUB -e install_err.txt
#BSUB -u "youremail@american.edu"
#BSUB -B
#BSUB -N
#BSUB -n 1
/app/R-4.0.2/bin/R CMD BATCH --no-save --no-restore install.R install.Rout
Then run this job
bsub < install.lsf
At the top of every R file from now on, put the following (but change my username to yours):
.libPaths(c("/home/dgerard/R/4.0.2/", .libPaths()))
Now you have access to those packages that you installed in the “R/4.0.2” directory.
For some R packages, you need to have additional software installed. But global installs are not allowed on the supercomputer.
Here are the general steps:
Download a tar file using wget <software_url>
. You should download this into a common directory for all of your local installs, like “apps”.
Decompress the tar file using tar -zxvf <tar_file>
Move into the newly decompressed directory using cd <new_directory>
Look at the README for further steps. Usually there is a makefile, and it's as easy as running make and/or make install. But read the README first.
Add the location to the PATH by using something like export PATH=$HOME/path/to/software:$PATH. You should put this in your ".bashrc".
Source your ".bashrc" file with . ~/.bashrc
Confirm that the software is installed with which <software>
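Here is a runnable simulation of the steps above, using a tiny fake tarball in place of the wget download (all names, such as mytool and apps, are made up; a real package would come with a README and makefile):

```shell
mkdir -p "$HOME/apps" && cd "$HOME/apps"

# Fake the downloaded release (stands in for: wget <software_url>).
mkdir -p mytool-1.0/bin
printf '#!/bin/sh\necho mytool ok\n' > mytool-1.0/bin/mytool
chmod +x mytool-1.0/bin/mytool
tar -zcf mytool-1.0.tar.gz mytool-1.0
rm -r mytool-1.0

tar -zxf mytool-1.0.tar.gz       # decompress the tar file
cd mytool-1.0                    # move into the new directory
# (here you would read the README and run make / make install)

export PATH="$PWD/bin:$PATH"     # put this line in your ~/.bashrc
which mytool                     # confirm: prints the path to mytool
```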
R CMD BATCH: Run R in batch mode (classic way).
Rscript: Run R in batch mode (newer way).
.libPaths(): Set and get the paths where R will search for installed packages.
bsub: Submit an LSF job.
bjobs: See the status of LSF jobs.
bkill: Kill an LSF job.