Intro

A lot of the research that I do at the University of Illinois at Urbana-Champaign (UIUC) is computationally intensive. As a result, I’m always looking into ways to speed up the computations. One of the ways I’ve found to be very helpful is to become knowledgeable about High Performance Computing (HPC). In a nut shell, HPC techniques enable the use of a typical desktop computer to solve computationally intense problems through the use of parallelization, compiled code, gpu programming, and memory management for large data sets. These techniques can be scaled up via cloud computing or using your friendly supercomputer. There are [several packages that R offers] (http://cran.r-project.org/web/views/HighPerformanceComputing.html) that greatly help in this endeavor.

In particular, I’ve grown very fond of the RcppArmadillo package produced by Romain Francois, Dirk Eddelbuettel and Doug Bates. The package makes available within the R space the templated C++ linear algebra library known as Armadillo by Conrad Sanderson. There are two reasons why I like writing within c: 1. the speed gains are crazy and, more importantly, 2. easy ports of the statistical method(s) to any framework that has a C++ interface (e.g. Mathematica, MATLAB, et cetera). This drastically reduces the amount of recoding necessary to make the method functional on a different framework.

With this being said, there is a problem that arises when writing on one particular framework while trying to make a method framework independent. The problem of interest relates to that of R’s base functions being very useful, non-exportable, and hard to read thanks to R’s unique SEXP structures. Thus, to ensure a clean port of the code, we cannot call an R function from C++ with Rcpp.

It is important to note that by calling an R function from within C++ we should receive a worse performance. That is, if you were to benchmark calling the function directly in R vs. C++, the R benchmark would easily win. This is because the data has to be protected when pipped into R and then again when it is sent back to C++. Furthermore, we have to wait for R to finish the calculations before the script can proceed.

Example Problem - Differencing Time Series

The following example is simplistic. However, it captures the need to be able to use R’s base functions completely within C++.

Problem: We want to take the mean of the first difference in a time series.

In R, we would need to use the diff and mean function.:

#' @title Taking the mean of a Differenced Time Series
#' @description Use R's diff function to find the nth difference of a time series of ith lag.
#' @usage diff_mean_r(x, lag = 1, differences = 1)
#' @param x A \code{vec} that is the time series
#' @param lag A \code{unsigned int} that indicates the lag
#' @param differences A \code{dif} that indicates how many differences should be taken
#' @return mean of the differenced time series.
#' @author JJB
#' @example 
#' x=rnorm(10000, 0, 1)
#' diff_mean_r(x,1,1)
diff_mean_r = function(x, lag = 1, difference = 1){
  differenced_vec = diff(x, lag = lag, difference = difference)
  
  mean(differenced_vec)
}

Writing similar functionality, while using R’s base difference function, from within C++ we would need to do:

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

//' @title Accessing R's Diff function from Rcpp
// [[Rcpp::export]]
double diff_mean_cpp(const arma::vec& x, unsigned int lag = 1, unsigned int differences = 1){
   
   // Obtain environment containing function
   Rcpp::Environment base("package:base"); 
   
   // Make function callable from C++
   Rcpp::Function diff = base["diff"];    

   // Call the function and receive its list outputu
   NumericVector diff_out = diff(_["x"] = x,
                                 _["lag"]  = lag,
                                 _["differences"] = differences);
   // Convert output to arma
   arma::vec diffed_vec = as<arma::vec>(diff_out);
   
   // Find mean of differenced vector
   double mu = arma::mean(diffed_vec);
      
   // Return mean difference
   return mu;
}

To make sure all is well with the implementations, always check to see if the objects are the same:

x = rnorm(10000, 0, 1)
all.equal(diff_mean_r(x,1,1), diff_mean_cpp(x,1,1), check.attributes=FALSE)
## [1] TRUE

The check.attributes = false is meant to avoid having to worry about the objects having the same naming structure. The comparison that needs to occur is whether the numerical output is the same.

Since, the function returned true, onto performing a a microbenchmark. The microbenchmark results should come as no surprised due to what was indicated in the first paragraph:

out = microbenchmark(rcpp_diff = diff_mean_cpp(x,1,1), r_diff = diff_mean_r(x,1,1))
expr min lq mean median uq max neval
rcpp_diff 253.664 272.1265 368.2545 278.2805 303.046 1639.962 100
r_diff 126.682 134.7870 210.2623 138.0900 146.495 1214.887 100

The times listed are in microseconds.

Specifically, this approach is slow, when compared to writing it completely contained within R, as well as being dependent on R’s diff function.

Note: I could have decreased the benchmark by using only Rcpp objects and, thus, avoiding the copy into the Armadillo vector. However, the rest of code is written within Armadillo so a copy would be imminent.

Implementing diff() in Armadillo

So, to speed up the function, we will need to implement it within Armadillo. To do so, it’s helpful to view the R function’s source.

To obtain the function’s source type the functions name

diff
## function (x, ...) 
## UseMethod("diff")
## <bytecode: 0x000000000845c148>
## <environment: namespace:base>

The results indicate a generic S3 implementation.

So, to get at the crux of the function, we will need to access the default diff function using:

diff.default
## function (x, lag = 1L, differences = 1L, ...) 
## {
##     ismat <- is.matrix(x)
##     xlen <- if (ismat) 
##         dim(x)[1L]
##     else length(x)
##     if (length(lag) != 1L || length(differences) > 1L || lag < 
##         1L || differences < 1L) 
##         stop("'lag' and 'differences' must be integers >= 1")
##     if (lag * differences >= xlen) 
##         return(x[0L])
##     r <- unclass(x)
##     i1 <- -seq_len(lag)
##     if (ismat) 
##         for (i in seq_len(differences)) r <- r[i1, , drop = FALSE] - 
##             r[-nrow(r):-(nrow(r) - lag + 1L), , drop = FALSE]
##     else for (i in seq_len(differences)) r <- r[i1] - r[-length(r):-(length(r) - 
##         lag + 1L)]
##     class(r) <- oldClass(x)
##     r
## }
## <bytecode: 0x000000000b791f10>
## <environment: namespace:base>

With the R code implementation in hand, we can design an Armadillo function that is able to operate in a similar manner. Note, I’m a fan of the just say no to exceptions school of thought found in Google’s C++ style guide. So, the function will not contain exception handling outside of the parameter errors and row access errors.

When I write arma, I am writing in shorthand armadillo for the C++ library. Do not construe arma as being the ARMA(p,q) time series model.

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

//' @title Lagged Differences in Armadillo
// [[Rcpp::export]]
arma::vec diff_arma(arma::vec x, unsigned int lag = 1, unsigned int differences = 1){
  // Length of Time Series
  unsigned int n;
  
  // Difference the series i times
  for(unsigned int i=0; i < differences; i++){
    // Each difference will shorten series length, so update length.
    n = x.n_elem;
    // Take the difference based on lags
    x = (x.rows(lag,n-1) - x.rows(0,n-lag-1));
  }
  
  // Return differenced series:
  return x;
}

//' @title Mean of the Lagged Differences in Armadillo
// [[Rcpp::export]]
double diff_mean_arma(arma::vec x, unsigned int lag = 1, unsigned int differences = 1){  
   
   // Retrieve differenced vector
   arma::vec diffed_vec = diff_arma(x,lag,differences);

   // Find mean of differenced vector
   double mu = arma::mean(diffed_vec);
      
   // Return mean difference
   return mu;
}

Be advised, this implementation is meant only for vectors. It will not handle a matrix of vectors being sent to it. The function can be modified to support this via overloading. Though, Rcpp will only allow one unique function to exported into R.

Now, let’s test the implementation to see if everything works:

# Test diff() functions
all.equal(as.matrix(diff(x,1,1)), diff_arma(x,1,1), check.attributes=FALSE)
## [1] TRUE
# Test mean(diff()) functions
all.equal(diff_mean_r(x,1,1), diff_mean_arma(x,1,1), check.attributes=FALSE)
## [1] TRUE

Note: The as.matrix() call is due to R’s implementation of the diff function returning a numerical vector and not a matrix. The Armadillo function returns vec, which is interpreted as a one column matrix within R.

With both implementations matching up, let’s see which one is faster! To do so, we revert back to the trusty microbenchmark and receive:

out = microbenchmark(arma_fun = diff_mean_arma(x,1,1), r_fun = diff_mean_r(x,1,1), rcpp_fun = diff_mean_cpp(x,1,1))
expr min lq mean median uq max neval
arma_fun 26.117 27.318 37.54248 28.218 29.869 751.087 100
r_fun 127.883 134.187 212.81091 138.390 151.148 1012.856 100
rcpp_fun 250.663 265.972 356.10870 274.228 293.590 1430.426 100

The times listed are in microseconds.

For relativity we use the rbenchmark library:

out = benchmark(arma_fun = diff_mean_arma(x,1,1), r_fun = diff_mean_r(x,1,1), rcpp_fun = diff_mean_cpp(x,1,1))
test replications elapsed relative user.self sys.self user.child sys.child
arma_fun 100 0.00 NA 0.00 0 NA NA
r_fun 100 0.03 NA 0.03 0 NA NA
rcpp_fun 100 0.03 NA 0.04 0 NA NA

The benchmark reveals about 5-6x gain when compared to the native implementation in R. Further improvements to the structure can yield even lower times!

R to Armadillo

Whew, that was a lot of information.

To summarize:

  1. R is slow.
  2. Using R base functions via Rcpp in C++ code is a very bad idea. (e.g. slow)
  3. R base functions are amazing but need to be exported.
  4. Implementations of the R base functions can be made faster by writing them with Armadillo.
  5. Writing methods using only Armadillo allows statistical methods to be ported from one framework to another.

Based on these reasons, I’m opting to start a code repository on github that encourages the translation of R to Armadillo code. The repository can be found at https://github.com/coatless/r-to-armadillo. Feel free to use the functions that are there and feel free to submit translations or new code via pull request.