Intro

Oftentimes I receive inquiries on how to deploy R packages or conduct simulation studies on the Illinois Campus Cluster (ICC). After writing a few responses, I realized that it would probably benefit not only the Illinois R community but also the larger R community if this information were more widely available. The information is primarily a pointed discussion on using R non-interactively (e.g. from the command line, shell, or terminal) that follows from "Invoking R from the command line" and "Scripting with R" in Appendix B (Invoking R) of An Introduction to R. Below is a collation of previous discussions I've had with various personnel on campus regarding clustered use of R.

Command Line R

To begin, let's start with the options that are available when launching R via the command line, shell, or terminal. In particular, we have R, Rscript, and R CMD BATCH. There is a fourth option written by Dirk Eddelbuettel called littler; however, I have yet to experiment with it, as it is not part of a standard Base R installation.

The differences between the Base R options are relatively large.

For starters, using R by itself indicates that an interactive session should be spawned and that commands should be executed from within it. This is particularly bad for cluster use, as the script could stall if it requires user input to advance (e.g. code behind if(interactive())). Furthermore, the output is directed straight back to the R session in the terminal window; no output file is generated. Thus, this method is not preferred when deploying to a cluster. However, for personal use, this provides a GUI-free interaction with R that focuses on computational and not graphical results (e.g. no plotting).
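
For instance, a script can guard any interactive-only prompts so that a batch run does not stall. A minimal sketch:

# Only prompt for input when a human is at the console
if (interactive()) {
  ans = readline("Continue? (y/n) ")
} else {
  ans = "y" # sensible default for non-interactive runs
}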

With this being said, there are really only two options for cluster-based use: R CMD BATCH and Rscript.

The difference between the two can be stated succinctly as:

R CMD BATCH:

  • Requires an input file (e.g. helloworld.R)
  • Saves to an output file (e.g. running helloworld.R yields helloworld.Rout)
  • By default, echoes both input and output statements inline (i.e. as if you were actually typing them into the console).
  • Does not write output to stdout.

Rscript:

  • Similar to bash scripts
  • Requires the use of a shebang (#!/usr/bin/Rscript) to run the script directly as an executable
  • Requires execute permission before it can be run directly (chmod +x script.r)
  • Output from print() and cat() is sent directly to stdout.
  • No additional file is made.
  • Able to issue one-line commands (e.g. Rscript -e "print('hi!')"), as shown below.
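
For example, a one-line command prints straight to the terminal:

$ Rscript -e "print('hi!')"
[1] "hi!"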

To further emphasize the differences between the two, let's create two short examples.

For R CMD BATCH, create a file called hellobatch.R with contents:

print("Hello Batch World!")

To running this using R CMD BATCH use:

$ R CMD BATCH hellobatch.R

This yields a file called hellobatch.Rout in the same directory as hellobatch.R with contents:

print("Hello Batch World!")
## [1] "hello world"

proc.time()
## user  system elapsed
## 0.401 0.021  0.422
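
Note that R CMD BATCH also accepts an explicit output file name as a second argument:

$ R CMD BATCH hellobatch.R custom_name.Rout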

Under Rscript, we’ll use hellorscript.R with contents:

#!/usr/bin/Rscript
print("Hello Batch World!")

To run the script directly as an executable (the second form below), we must first make the file executable:

$ chmod +x hellorscript.R

Then, we can run the file with either:

$ Rscript hellorscript.R
$ ./hellorscript.R

Doing so will produce output directly in the terminal, e.g.

$ Rscript hellorscript.R
[1] "Hello Batch World!"

Personally, I opt for R CMD BATCH over Rscript. Though, there is a considerable number of folks who prefer the latter, as its input and output (I/O) handling is better. I may flip flop on this later. Stay tuned for an update.

Each of these commands responds to many different options or flags. The options presented below are truncated from the full list, as these are the ones I've found most relevant when working with R non-interactively. To access the full list of options, type:

$ R --help

Options:

Option                  Description
--save                  Do save workspace at the end of the session
--no-save               Don't save the workspace
--no-environ            Don't read the site and user environment files
--no-site-file          Don't read the site-wide Rprofile
--no-init-file          Don't read the user R profile
--vanilla               Combine --no-save, --no-restore, --no-site-file, --no-init-file and --no-environ
--no-readline           Don't use readline for command-line editing
-q, --quiet             Don't print startup message
--silent                Same as --quiet
--slave                 Make R run as quietly as possible
--interactive           Force an interactive session
--verbose               Print more information about progress
--args                  Skip the rest of the command line (the remaining arguments stay available via commandArgs())
-f FILE, --file=FILE    Take input from 'FILE'
-e EXPR                 Execute R expression 'EXPR' (e.g. 'print("hello")') and exit
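
As a quick illustration, a few of these flags can be combined for a throwaway computation that leaves no workspace behind (the expression itself is just a placeholder):

$ R --vanilla --slave -e 'print(1 + 1)'
[1] 2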

The main options that I end up using when executing scripts on the cluster are:

$ R CMD BATCH --no-save --quiet --slave $HOME/folder/script.R

Thus, the R session is not saved on close, there are no startup messages, and there is NO command-line echo, respectively. Only the results are saved within script.Rout.

Note: Under this approach, you may further need to suppress package startup messages using:

suppressPackageStartupMessages(library(gmwm))

Set up a local R library for installing and loading R packages

The R package library directory is traditionally managed in a system-wide manner. However, on a cluster, more than one user is using the system at any given moment, and each user has unique needs. That is to say, one user may want to stay on an earlier version of a package while another wants to be as up-to-date as possible. As a result, there is no arrangement on a shared resource that satisfies everyone other than having users maintain their own installations of R packages. Thus, each user must create and maintain their own library. Consequently, packages must be installed BEFORE being used.

To create your own library, you will need to do the following:

# Create a directory for your R packages 
# Note: This counts against your 2 GB home dir limit on ICC
mkdir ~/Rlibs

#   Load the R modulefile 
# You may want to specify version e.g. R/3.2.2
module load R

# Set the R library environment variable (R_LIBS) to include your R package directory   
export R_LIBS=~/Rlibs

To ensure that the R_LIBS variable remains set even after logging out, run the following command to permanently add it to the environment (this appends to your ~/.bashrc file, which is loaded on startup).

cat <<EOF >> ~/.bashrc
  if [ -n $R_LIBS ]; then
      export R_LIBS=~/Rlibs:$R_LIBS
  else
      export R_LIBS=~/Rlibs
  fi
EOF

(The above code snippets are based upon the ICC R Help Docs. Note that the heredoc delimiter is quoted ('EOF') so that the if-statement is written to .bashrc literally rather than being expanded immediately.)
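
To verify that the library path is picked up in a fresh shell, print R's library search path; your ~/Rlibs directory should appear first:

$ Rscript -e ".libPaths()"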

To add packages in the future to your private library, use:

# Use the install.packages function to install your R package.  
$ Rscript -e "install.packages('devtools', '~/Rlibs', 'http://ftp.ussg.iu.edu/CRAN/')"

Note: You will need to install packages prior to queuing the script.

Another nice feature of this approach is the ability to use devtools to install packages from external repositories (e.g. GitHub, Bitbucket):

$ Rscript -e "devtools::install_github('SMAC-Group/gmwm')"

Passing arguments

Whoa boy, this one is a doozy. In addition to Base R, there are many different options on CRAN…

It seems as if the consensus is really around optparse.
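
I have not adopted it myself, but a minimal optparse sketch looks roughly like the following (the flag name and default are made up for illustration):

#!/usr/bin/Rscript
# Define a single integer option with a default value
library(optparse)

option_list = list(
  make_option(c("-n", "--num"), type = "integer", default = 10,
              help = "number of observations to draw")
)

# Parse the command line (e.g. ./script.R --num 5)
opt = parse_args(OptionParser(option_list = option_list))

print(rnorm(opt$num))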

I'm more of a simpleton when it comes to batch arguments and just use the commands that ship with base R. My normal file construction with passed args is:

sampler.R

# Expect command line args at the end
args = commandArgs(trailingOnly = TRUE)

# Cast the args from character to numeric
rnorm(n = as.numeric(args[1]), mean = as.numeric(args[2]))

Then call the file with the arguments wrapped in a quoted --args string (a bare trailing argument would be interpreted by R CMD BATCH as the output file name):

$ R CMD BATCH "--args 5 100" sampler.R

You should receive 5 observations from a normal distribution centered at 100 in sampler.Rout.
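
The same script also works under Rscript, where trailing arguments are passed through without any special quoting:

$ Rscript sampler.R 5 100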

Simulation Study

One of the recent simulation studies that I have done was an estimation study using the gmwm package. The details behind what exactly is happening are located over at SMAC. As a result, this section talks more or less about the structure of the code and not the meaning behind it.

In order to submit to the Illinois Campus Cluster, one must create a PBS file to be used with qsub. Alternatively, one can just define a cluster job within a one-line statement.
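
As a hypothetical example of the latter (the resource values here are placeholders), a small job can be piped straight into qsub without writing a separate PBS file:

$ echo 'cd $PBS_O_WORKDIR; module load R; R CMD BATCH --no-save sampler.R' | qsub -l walltime=00:10:00,nodes=1:ppn=1 -N sampler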

In this case, the batch file here sets up various PBS qualities: the duration the script can run (walltime), the number of machines (nodes) and cores to request on each machine, the name of the job, the queue the job should run on, and the I/O options. Within the R CMD BATCH call, I also opt to avoid saving the R session, attempt to make R run as quietly as possible (no startup messages nor echoes of the statements being run), and pass along the output directory where results should be saved.

#!/bin/bash
#
## Set the maximum amount of runtime to 4 Hours (queue limit) 
## Note: The simulation finishes in an hour and 22 minutes
#PBS -l walltime=04:00:00
##
## Request one node with one core
#PBS -l nodes=1:ppn=1
#PBS -l naccesspolicy=singleuser
## Name the job, queue in the secondary queue, and merge standard output into error output
#PBS -N gmwm_est
#PBS -q secondary
#PBS -j oe
#####################################

## Grab the job id from an environment variable and create a directory for the
## data output
export JOBID=`echo "$PBS_JOBID" | cut -d"[" -f1`
mkdir $PBS_O_WORKDIR/"$JOBID"

cd $PBS_O_WORKDIR/"$JOBID"

# Load R
module load R

## Run R script in batch mode, passing the output directory via --args
R CMD BATCH --no-save --quiet --slave "--args $HOME/gmwm/" $HOME/gmwm/gmwm_comm.R
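
Assuming the PBS file above is saved as, say, gmwm_est.pbs (a file name of my own choosing), it can be submitted and then monitored with:

$ qsub gmwm_est.pbs
$ qstat -u $USER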

The next script is gmwm_comm.R. This script is responsible for computing and saving the results.

# Attempt to install (if necessary) and automatically load the needed packages
inst_pkgs = load_pkgs = c("gmwm")
inst_pkgs = inst_pkgs[!(inst_pkgs %in% installed.packages()[,"Package"])]
# A repository must be given explicitly since the session is non-interactive
if(length(inst_pkgs)) install.packages(inst_pkgs, repos = "http://ftp.ussg.iu.edu/CRAN/")

pkgs_loaded = lapply(load_pkgs, require, character.only = TRUE)

# Grab trailing command line args
args = commandArgs(trailingOnly = TRUE)

# Store output directory
output.dir = args[1]

# Control how many replications
B = 100

# Store results
results = matrix(NA, B, 4)

# Values for Simulation
freq = 400
delta.t = 1/freq

tau = 5.308*10^2
sigma_b = 1.148*10^(-6)
sigma_w = 1.064*10^(-4)

# Conversion to GMWM Model Parameters
phi = exp(-1/tau * delta.t)
sigma_ar = -sigma_b^2*tau/2 * (exp(-2*delta.t/tau) - 1)
sigma_wn = 1/delta.t * sigma_w^2

# Model statement
mod = AR1(phi = phi, sigma2 = sigma_ar) + WN(sigma2 = sigma_wn)

for(i in 1:B){
  
  # Set a seed to reproduce error components
  set.seed(5567 + i)
  
  # Generate Data
  d = gen.gts(mod, 17280000)
  
  # Estimate
  o = gmwm.imu(mod, d)
  
  # Store results
  results[i,] = c(i, o$estimate[1,], o$estimate[2,], o$estimate[3,])
  
}

# Save the result matrix as an R object
save(results, file=paste0(output.dir,"res_gmwm_corr.rda"))

# Export the result matrix as a CSV file.
write.csv(results, file=paste0(output.dir,"res_gmwm_corr.csv"), row.names = F)
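
Once the job finishes, the saved objects can be copied back to your own machine and loaded for analysis:

# Restores the `results` matrix saved above
load("res_gmwm_corr.rda")

# Alternatively, read in the CSV export
results_csv = read.csv("res_gmwm_corr.csv")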

Parallelized Stream coming soon (TM)