R to Armadillo using RcppArmadillo for Speed and Portability

Intro

A lot of the research that I do at the University of Illinois at Urbana-Champaign (UIUC) is computationally intensive. As a result, I'm always looking into ways to speed up the computations. One of the ways I've found to be very helpful is to become knowledgeable about High Performance Computing (HPC). In a nutshell, HPC techniques enable a typical desktop computer to solve computationally intense problems through parallelization, compiled code, GPU programming, and memory management for large data sets. These techniques can be scaled up via cloud computing or your friendly supercomputer. R offers several packages that greatly help in this endeavor.

In particular, I've grown very fond of the RcppArmadillo package produced by Romain Francois, Dirk Eddelbuettel, and Doug Bates. The package makes available within R the templated C++ linear algebra library known as Armadillo, by Conrad Sanderson. There are two reasons why I like writing within C++: 1. the speed gains are crazy and, more importantly, 2. the statistical method(s) port easily to any framework that has a C++ interface (e.g. Mathematica, MATLAB, et cetera). This drastically reduces the amount of recoding necessary to make a method functional on a different framework.

With this being said, a problem arises when writing on one particular framework while trying to keep a method framework independent. The problem of interest is that R's base functions are very useful, but they are not exportable and their internals are hard to read thanks to R's unique SEXP structures. Thus, to ensure a clean port of the code, we should avoid calling an R function from C++ with Rcpp.

It is important to note that by calling an R function from within C++ we should expect worse performance. That is, if you were to benchmark calling the function directly in R vs. calling it from C++, the pure R benchmark would easily win. This is because the data has to be protected when piped into R and then again when it is sent back to C++. Furthermore, we have to wait for R to finish the calculations before the C++ code can proceed.

Example Problem – Differencing Time Series

The following example is simplistic. However, it captures why one would want the functionality of R's base functions completely within C++.

Problem: We want to take the mean of the first difference in a time series.
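To make the target concrete: for a given lag and a single difference, diff produces d[t] = x[t + lag] - x[t] for t = 1, ..., n - lag, and we return mean(d) = sum(d) / (n - lag). A larger differences value simply repeats the differencing on d.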

In R, we would need to use the diff and mean functions:

#' @title Taking the Mean of a Differenced Time Series
#' @description Use R's diff function to find the nth difference of a time series at the ith lag.
#' @usage diff_mean_r(x, lag = 1, differences = 1)
#' @param x A \code{vec} that is the time series
#' @param lag An \code{unsigned int} that indicates the lag
#' @param differences An \code{unsigned int} that indicates how many differences should be taken
#' @return The mean of the differenced time series.
#' @author JJB
#' @examples
#' x = rnorm(10000, 0, 1)
#' diff_mean_r(x, 1, 1)
diff_mean_r = function(x, lag = 1, differences = 1){
  differenced_vec = diff(x, lag = lag, differences = differences)

  mean(differenced_vec)
}

To write similar functionality within C++, while still using R's base diff function, we would need to do:

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

//' @title Accessing R's Diff function from Rcpp
// [[Rcpp::export]]
double diff_mean_cpp(const arma::vec& x, unsigned int lag = 1, unsigned int differences = 1){

   // Obtain environment containing function
   Rcpp::Environment base("package:base"); 

   // Make function callable from C++
   Rcpp::Function diff = base["diff"];    

   // Call the function and receive its output
   NumericVector diff_out = diff(_["x"] = x,
                                 _["lag"]  = lag,
                                 _["differences"] = differences);
   // Convert output to arma
   arma::vec diffed_vec = as<arma::vec>(diff_out);

   // Find mean of differenced vector
   double mu = arma::mean(diffed_vec);

   // Return mean difference
   return mu;
}

To make sure all is well with the implementations, always check to see if the objects are the same:

x = rnorm(10000, 0, 1)
all.equal(diff_mean_r(x,1,1), diff_mean_cpp(x,1,1), check.attributes=FALSE)
## [1] TRUE

Setting check.attributes = FALSE avoids having to worry about the objects sharing the same naming structure. The comparison that needs to occur is whether the numerical output is the same.

Since the comparison returned TRUE, on to performing a microbenchmark. The results should come as no surprise given what was indicated earlier:

out = microbenchmark(rcpp_diff = diff_mean_cpp(x,1,1), r_diff = diff_mean_r(x,1,1))
      expr     min       lq     mean   median      uq      max neval
 rcpp_diff 648.719 668.5320 718.8894 679.4885 720.015 1099.910   100
    r_diff 491.118 510.3295 560.9274 524.7390 554.008  954.917   100

The times listed are in microseconds.

In short, this approach is slower than writing the computation completely within R, and it remains dependent on R's diff function.

Note: I could have decreased the benchmark time by using only Rcpp objects and, thus, avoiding the copy into the Armadillo vector. However, the rest of the code is written with Armadillo, so a copy would be inevitable. A sketch of that variant follows.
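For the curious, here is a minimal sketch of that Rcpp-only variant. The name diff_mean_rcpp_only is mine, and this version was not part of the benchmark above:

#include <Rcpp.h>
using namespace Rcpp;

//' @title Hypothetical Rcpp-only variant (sketch)
// [[Rcpp::export]]
double diff_mean_rcpp_only(const NumericVector& x, unsigned int lag = 1, unsigned int differences = 1){

   // Obtain environment containing function
   Environment base("package:base");

   // Make function callable from C++
   Function diff = base["diff"];

   // Call the function and keep its output as an Rcpp object
   NumericVector diff_out = diff(_["x"] = x,
                                 _["lag"] = lag,
                                 _["differences"] = differences);

   // Take the mean via Rcpp sugar -- no copy into an arma::vec required
   return mean(diff_out);
}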

Implementing diff() in Armadillo

So, to speed up the function, we will need to implement it within Armadillo. To do so, it's helpful to view the R function's source.

To obtain the function's source, type the function's name:

diff
## function (x, ...) 
## UseMethod("diff")
## <bytecode: 0x00000000091c7778>
## <environment: namespace:base>

The result indicates that diff is an S3 generic.

So, to get at the crux of the function, we will need to access the default diff function using:

diff.default
## function (x, lag = 1L, differences = 1L, ...) 
## {
##     ismat <- is.matrix(x)
##     xlen <- if (ismat) 
##         dim(x)[1L]
##     else length(x)
##     if (length(lag) != 1L || length(differences) > 1L || lag < 
##         1L || differences < 1L) 
##         stop("'lag' and 'differences' must be integers >= 1")
##     if (lag * differences >= xlen) 
##         return(x[0L])
##     r <- unclass(x)
##     i1 <- -seq_len(lag)
##     if (ismat) 
##         for (i in seq_len(differences)) r <- r[i1, , drop = FALSE] - 
##             r[-nrow(r):-(nrow(r) - lag + 1L), , drop = FALSE]
##     else for (i in seq_len(differences)) r <- r[i1] - r[-length(r):-(length(r) - 
##         lag + 1L)]
##     class(r) <- oldClass(x)
##     r
## }
## <bytecode: 0x00000000075b9ae8>
## <environment: namespace:base>

With the R implementation in hand, we can design an Armadillo function that operates in a similar manner. Note, I'm a fan of the "just say no to exceptions" school of thought found in Google's C++ style guide. So, the function will not contain exception handling beyond Armadillo's own errors for bad parameters and out-of-range row access.

When I write arma, it is shorthand for the Armadillo C++ library. Do not construe arma as being the ARMA(p,q) time series model.

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

//' @title Lagged Differences in Armadillo
// [[Rcpp::export]]
arma::vec diff_arma(arma::vec x, unsigned int lag = 1, unsigned int differences = 1){
  // Length of Time Series
  unsigned int n;

  // Difference the series `differences` times
  for(unsigned int i=0; i < differences; i++){
    // Each difference will shorten series length, so update length.
    n = x.n_elem;
    // Take the difference based on lags
    x = (x.rows(lag,n-1) - x.rows(0,n-lag-1));
  }

  // Return differenced series:
  return x;
}

//' @title Mean of the Lagged Differences in Armadillo
// [[Rcpp::export]]
double diff_mean_arma(arma::vec x, unsigned int lag = 1, unsigned int differences = 1){  

   // Retrieve differenced vector
   arma::vec diffed_vec = diff_arma(x,lag,differences);

   // Find mean of differenced vector
   double mu = arma::mean(diffed_vec);

   // Return mean difference
   return mu;
}

Be advised, this implementation is meant only for vectors; it will not handle a matrix being sent to it. The function can be modified to support matrices via overloading, as sketched below, though Rcpp will only allow one unique function name to be exported into R.
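For illustration, a C++-level matrix overload might look like the following sketch, which mirrors diff.default's column-wise behavior. This overload is my own addition and would need its own exported name to be reachable from R:

// Hypothetical matrix overload (C++ level only): difference down the rows
// of each column, mirroring diff.default's handling of matrices.
arma::mat diff_arma(arma::mat x, unsigned int lag = 1, unsigned int differences = 1){
  for(unsigned int i = 0; i < differences; i++){
    unsigned int n = x.n_rows;
    // Subtract the first n - lag rows from the last n - lag rows
    x = x.rows(lag, n - 1) - x.rows(0, n - lag - 1);
  }
  return x;
}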

Now, let's test the implementation to see if everything works:

# Test diff() functions
all.equal(as.matrix(diff(x,1,1)), diff_arma(x,1,1), check.attributes=FALSE)
## [1] TRUE
# Test mean(diff()) functions
all.equal(diff_mean_r(x,1,1), diff_mean_arma(x,1,1), check.attributes=FALSE)
## [1] TRUE

Note: The as.matrix() call is needed because R's implementation of the diff function returns a numeric vector and not a matrix, whereas the Armadillo function returns a vec, which is interpreted as a one-column matrix within R.

With both implementations matching up, let's see which one is faster! To do so, we return to the trusty microbenchmark and receive:

out = microbenchmark(arma_fun = diff_mean_arma(x,1,1), r_fun = diff_mean_r(x,1,1), rcpp_fun = diff_mean_cpp(x,1,1))
     expr     min       lq      mean   median      uq     max neval
 arma_fun  26.417  27.9180  30.54189  28.8190  30.620  62.741   100
    r_fun 454.794 468.0020 481.41501 474.0065 491.718 706.356   100
 rcpp_fun 582.676 600.3875 625.85025 616.7480 638.062 783.507   100

The times listed are in microseconds.

For relative timings, we use the rbenchmark library:

out = benchmark(arma_fun = diff_mean_arma(x,1,1), r_fun = diff_mean_r(x,1,1), rcpp_fun = diff_mean_cpp(x,1,1))
     test replications elapsed relative user.self sys.self user.child sys.child
 arma_fun          100    0.00       NA      0.00        0         NA        NA
    r_fun          100    0.05       NA      0.05        0         NA        NA
 rcpp_fun          100    0.07       NA      0.07        0         NA        NA

The benchmarks reveal roughly a 15-16x gain over the native implementation in R (about 29 vs. 474 microseconds at the median). Further improvements to the structure can yield even lower times; one such refinement is sketched below.
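As one illustration of such an improvement (my own aside, not part of the benchmarks above): when differences = 1, the interior terms of sum(x[t + lag] - x[t]) telescope away, so the mean only ever needs to touch the first and last lag elements. This sketch drops into the same file as the Armadillo functions above:

// Hypothetical shortcut for the differences = 1 case: the interior terms of
// the sum cancel, leaving only the first and last `lag` elements of x.
// [[Rcpp::export]]
double diff_mean_telescope(const arma::vec& x, unsigned int lag = 1){
  unsigned int n = x.n_elem;

  // Sum of the last `lag` values minus the sum of the first `lag` values
  double total = arma::accu(x.subvec(n - lag, n - 1)) - arma::accu(x.subvec(0, lag - 1));

  // Divide by the length of the differenced series
  return total / (n - lag);
}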

R to Armadillo

Whew, that was a lot of information.

To summarize:

  1. R is slow.
  2. Calling R base functions via Rcpp in C++ code is a very bad idea (i.e. slow).
  3. R base functions are amazing but are not exportable.
  4. Implementations of the R base functions can be made faster by writing them with Armadillo.
  5. Writing methods using only Armadillo allows statistical methods to be ported from one framework to another.

For these reasons, I'm opting to start a code repository on GitHub that encourages the translation of R code to Armadillo. The repository can be found at https://github.com/coatless/r-to-armadillo. Feel free to use the functions that are there, and to submit translations or new code via pull request.

STATS@UIUC Big Data Image for VirtualBox

Intro

This guide is meant to provide helpful information on working with the STATS@UIUC Big Data Image. As a result, some of the material covered here will not be available on a different image or in a production environment. However, the concepts will most certainly be relevant.

Background Information on Image Software

For starters, the STATS@UIUC Big Data Image is a modification of the Hortonworks Data Platform v2.2 VirtualBox image. The modifications that have been made to the image are as follows:

  1. The latest version of the R programming language (v3.1.2) has been installed.
  2. RStudio Server has been installed to allow students to use the department standard RStudio IDE.
  3. Parts of Revolution Analytics' RHadoop ecosystem, such as rmr2, rhdfs, plyrmr, and memoise, have been installed.
  4. Many key High Performance Computing (HPC) packages for R have been installed, such as: Rcpp, bigmemory, ff, ffbase, foreach, iterators, doMC, doSNOW, and itertools.
  5. Python v2.7.9, pip v6.0.8, and easy_install v12.1 have been installed as the default python, pip, and easy_install software for CentOS 6.6. The image preserves yum's dependency on python v2.6.6.
  6. Several environment variables have been added that provide ease of use for certain features within the image.
  7. A non-root account (rstudio) with SSH and sudo access has been created.

Table of Contents

Due to the size of some of the image files, the guide has been split up into the following sections:

  1. Installing and Using the STATS@UIUC Big Data Image
  2. Working within shell and SSHing into the STATS@UIUC Big Data Image
  3. Web Interfaces and Saving within the STATS@UIUC Big Data Image
  4. Environment Variables, Compiling a MapReduce job via Java, Known & Resolved Issues within the STATS@UIUC Big Data Image

Questions? E-mail me

Environment Variables, Compiling a MapReduce job via Java, Known & Resolved Issues within the STATS@UIUC Big Data Image

Intro

In the previous entry, we discussed the web components and how to save the state of the STATS@UIUC Big Data Image. This post covers the remaining information: the environment variables available, compilation instructions for hadoop jobs, and known issues with the image.

Environment Variables

The image has been modified so that there are variables containing important path information for various hadoop components. These variables are not created by a default hadoop installation.

Note: The results are based on the file configuration of Hortonworks Data Platform 2.2 Image.

  1. $HADOOP_CMD
    • Points to where hadoop is installed.
    • Result: /usr/bin/hadoop
  2. $HADOOP_STREAMING
    • Points to the streaming jar location.
    • Result: /usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-streaming.jar
  3. $HADOOP_CONF
    • Points to the configuration file for hadoop.
    • Result: /etc/hadoop/conf
  4. $HADOOP_EXAMPLES
    • Points to the hadoop-mapreduce-examples.jar location.
    • Result: /usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-mapreduce-examples.jar
  5. $RLIB_HOME
    • Points to the system wide library for R.
    • Result: $(R RHOME)
  6. $JAVAC_HADOOP_PATH
    • Points to all hadoop libraries necessary for standalone compilation.
    • Result: $(hadoop classpath)
  7. $JAVA_TOOLS
    • Points to the java compilation tools jar.
    • Result: $JAVA_HOME/lib/tools.jar
  8. $UIUC_IMAGE_VERSION
    • Provides a version string of the present image. Helpful if we need to patch the image.
    • Result: STAT490 Image Version: 1.1

If you modify one of the above variables, the value will be reset to its default after you log back in.

One other variable of interest is:

  • $JAVA_HOME
    • Points to the Java JDK.

These variables will come in handy if you are trying to compile java code for MapReduce, use RHadoop, or explore the image.

Compiling a MapReduce job via Java

Below, we present generalized code for each option.

For an actual application of the compile options, please see the following bash script, java_example.sh, which shows both options using the hadoop example wordcount.

Shared setup code

Both options rely on the following underlying code:

# Create where the input data should go
hdfs dfs -mkdir -p /user/rstudio/example/input

# Place data on hdfs
hdfs dfs -put ~/workspace/example/input /user/rstudio/example/input

# Note: We do not create an output directory. Hadoop does that for us!

# Create a directory to put the compiled java files in.
mkdir class_files

Option 1: Standalone compile

  1. Set the javac path to include the hadoop libraries
# $JAVAC_HADOOP_PATH is provided as an environment variable
export HADOOP_CLASSPATH=$JAVAC_HADOOP_PATH
  2. Compile the Java file using javac
javac -classpath ${HADOOP_CLASSPATH} -d class_files/ JavaProgramName.java
# -classpath provides the libraries required by Hadoop
# -d directs compiled results to a folder

Option 2: Using Hadoop

  1. Set the classpath to include Java's tools.jar
# $JAVA_TOOLS is provided as an environment variable
export HADOOP_CLASSPATH=$JAVA_TOOLS
  2. Compile the Java file using hadoop
hadoop com.sun.tools.javac.Main JavaProgramName.java -d class_files/
# com.sun.tools.javac.Main accesses the compiler's main method
# -d specify output of class files

Shared job code

  1. Create a jar file (i.e. a Java-specific zip file) using the .class files generated from compiling the Java file
jar -cf jar_archive_name.jar -C class_files/ .
# -c create a new archive
# -f specify archive name
# -C get the files from given directory
# . specifies all the files from the directory
  2. Run the job
hadoop jar jar_archive_name.jar JavaProgramName /user/rstudio/example/input /user/rstudio/example/output
  3. Display the results
hdfs dfs -cat /user/rstudio/example/output/part-r-00000 
# Hadoop normally outputs the reduce job results in part-r-#####

Known Issues

RStudio Server respawn issue

RStudio Server has been known to have respawn issues if the appropriate shutdown sequence discussed previously is not followed.

The typical error is:

init: rstudio-server respawning too fast, stopped

If this error happens, you will be unable to access RStudio within your browser.
[screenshot: rstudio not available]

To fix this, we need to manually start rstudio-server by running the following command in the HDP shell:

sudo rstudio-server start

RStudio session hadabend error

Another consequence of the appropriate shutdown sequence not being used is a hadabend (abnormal end) error.

[screenshot: rstudio hadabend error]

As a result of the improper shutdown sequence being used, the R session data, such as variables and user functions, were lost.

To resolve the error, just press OK and make sure you use the proper shutdown sequence.

Connection Refused Error

During the start up process, you may see Zookeeper indicate that a connection has been refused.

Call from sandbox.hortonworks.com/10.0.2.15 to sandbox.hortonworks.com:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

If your hadoop jobs are failing, we suggest restarting the image with only Chrome and VirtualBox open.
In particular, make sure there are no messaging applications such as Skype or Lync open, as these applications have been known to interfere with ports that are needed.

Otherwise, there is no cause for concern.

More?

Have you experienced an issue with the image that is not listed above? Or did I make a grammatical / spelling error?

If so, please let me know.

Resolved issues

Please see the STAT 490 GitHub repository for update notes.

Web Interfaces and Saving within the STATS@UIUC Big Data Image

Intro

In the previous post, we discussed how to interact with the STATS@UIUC Big Data Image. Within this post, we detail the websites available on different ports and how to create snapshots, or saves, of the virtual image state.

In the Browser

Within this section, we detail various web components of the image. These web components are accessed through different ports on the localhost domain from your preferred web browser. You can think of each port as a different URL. In production, you might have different subdomains pointing to these locations.

Accessing RStudio

On the modified UIUC HDP image, we have installed RStudio Server. This allows you to use R via your browser with the traditional desktop RStudio interface.

To access RStudio, make sure the image is running in Virtual Box.

Open a web browser and enter in the url field:

localhost:8787

To log into RStudio, use the following account information:

Username: rstudio

Password: rstudio

[screenshot: rstudio login page]

After logging in, you should be presented with:

[screenshot: RStudio after login]

Accessing Hue

Later in the course, we will focus on using Pig and Hive. Hortonworks provides a web interface, Hue, that simplifies this greatly.

To access the web interface use:

localhost:8000

Unlike RStudio, there is no login required. Upon entering the address, you should see: