Porting R's table function to C++

Intro

Have you ever wanted to see if you can write the part of the table() function’s frequency summation for one vector in C++?! Well, that’s exactly what I did by accident. As some of the best results come from pure accidents, I felt it only right to add a post describing how to obtain such a result.

The Plan

First of all, we need to find the unique values for the vector of numbers. To do this, we opt to store the number being counted as a key within a std::map and increment the value each time we observe that number.

Next, we need to export the data from the std::map to a std::vector<std::pair<double,int> > so that we can sort by the number of occurrences.

With a sorted vector in hand by std::pair<double,int>, we need to then write a list structure to export into R. This is the case for because Rcpp does not support std::set<T>.

The Implementation

#include <Rcpp.h>

// Save on the typing... 
typedef std::pair<double, int>  ptype;


// A comparison function to rank values in descending order
bool compare_values(const ptype &p1, const ptype &p2)
{
  return p1.second > p2.second;
}

// Get the top number of observations
// [[Rcpp::export]]
Rcpp::List table_cpp(const Rcpp::NumericVector & v, bool sort_data = false)
{

  // Create a map
  std::map<double, int> Elt;
  
  Elt.clear();
  
  // Fill the map with occurrences per number.
  for (int i = 0; i != v.size(); ++i) {
  Elt[ v[i] ] += 1;
  }
  
  // Get how many unique elements exist... 
  unsigned int n_obs = Elt.size();
  
  // Switch map to a vector so that we can sort by value
  std::vector< ptype > sorted_Elt(Elt.begin(), Elt.end());
  
  if(sort_data){
    // Perform the sort with a custom sort function.
    std::sort(sorted_Elt.begin(), sorted_Elt.end(), compare_values);
  }
  // Else, return. 
  
  // Stop here if you do not need to import into R.
  // Why? There is no ability to export a set w/ a pair into R. *cries*
  
  // Recast for R using Rcpp::*Vectors to avoid another copy)      
  Rcpp::NumericVector result_keys(n_obs);
  Rcpp::IntegerVector result_vals(n_obs);
  
  unsigned int count = 0;
  
  // Need to use iterators to access objects
  for( std::vector< ptype >::iterator it = sorted_Elt.begin(); it != sorted_Elt.end(); ++it )
  {
    // Move them into split vectors
    result_keys(count) = it->first;
    result_vals(count) = it->second;
    
    count++;
  }
  
  return Rcpp::List::create(Rcpp::Named("lengths") = result_vals,
  Rcpp::Named("values") = result_keys);
}

Short Test

Let’s verify that it works by running over some data:

# Set seed for reproducibility
set.seed(223)
(a = sample(seq(0, 1, by = .2), 10, replace = T))
##  [1] 0.0 0.0 0.2 1.0 0.4 1.0 0.8 0.6 0.2 0.2

Generates:

     [1] 0.8 0.2 0.0 0.6 0.4 1.0 0.6 0.2 0.8 0.8

And now we seek to obtain the occurrence information:

# Call our function
table_cpp(a)
## $lengths
## [1] 2 3 1 1 1 2
## 
## $values
## [1] 0.0 0.2 0.4 0.6 0.8 1.0

Gives us:

$lengths
[1] 3 2 2 1 1 1

$values
[1] 0.8 0.2 0.6 0.0 0.4 1.0

Let’s check by looking at table()’s output very quickly.

table(a)
  0 0.2 0.4 0.6 0.8   1 
  1   2   1   2   3   1 

And now, we go to the winchester to grab a pint and wait for this to …

comments powered by Disqus