Intro
One of the downsides of having multiple programming languages is that each has its own niche in both academia and industry, which results in data not flowing as easily as spice between them. For example, engineers have an affinity for Python while statisticians are in love with R. Thus, processing data in one language and then using an algorithm in another is a headache in itself. Recently, support has begun to emerge for a standard data format for data.frames between Python and R via the Feather initiative, a joint work between Wes McKinney of pandas fame and Hadley Wickham, the author of the majority of user-oriented R developments in the last half decade (ggplot2, dplyr, tidyr, rvest, and so on…), using an underlying columnar memory specification provided by Apache Arrow. Unfortunately, this does not target NumPy arrays, which is where a lot of the data seems to be contained in some engineering applications. To that end, Dirk Eddelbuettel of Rcpp fame wrote a nice package called RcppCNPy that enables the loading and writing of 1D and 2D NumPy arrays within R, e.g.
numpy_r_ex.R
# install.packages("RcppCNPy")
library("RcppCNPy")
# Set seed for reproducibility
set.seed(1337)
# Generate data in R
vec = rnorm(100)
mat = matrix(vec, nrow = 25, ncol = 4)
# Write to file
npySave("vec.npy", vec)
npySave("mat.npy", mat)
# Load
vec2 = npyLoad("vec.npy")
mat2 = npyLoad("mat.npy")
# Check equality
all.equal(vec,vec2)
all.equal(mat,mat2)
However, when 3-D arrays are used, the common error is:
“Unsupported dimension in npyLoad”
The fault lies primarily with the Rcpp data types, which do not scale to $N$-D arrays with $N \geq 4$. Moreover, there is no object export in place for an Rcpp object with 3 dimensions. Hence, no $N$-D array with $N > 2$ can be loaded into R or written to a .npy binary using RcppCNPy.
Thus, saves of $N$-D arrays with $N > 2$ made using numpy.save seemed to exist only in the Python environment. Until now…
Generating NumPy data to use
Before we can begin transferring data into R, we must first have some data in NumPy binary form (.npy) written with numpy.save. In this case, I’ve opted to generate a 4D array with dimensions of $3 \times 4 \times 5 \times 2$ that contains values in $[0,1)$ via numpy.random.random:
gen_numpy.py
import numpy as np
# Generate two 4D arrays (3x4x5x2) of random values in [0, 1)
a = np.random.random((3,4,5,2))
b = np.random.random((3,4,5,2))
# Save
np.save('a_patches_z1.npy', a)
np.save('b_patches_z1.npy', b)
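As a quick sanity check, the saved files can be read back with numpy.load to confirm that the shape and values survive the trip to disk. The following is only a sketch: it writes to a temporary directory rather than the working directory, and the helper name roundtrip_check is hypothetical, not part of the original scripts.

```python
import os
import tempfile

import numpy as np


def roundtrip_check(shape=(3, 4, 5, 2)):
    """Save a random N-D array to .npy and load it back (hypothetical helper)."""
    arr = np.random.random(shape)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "a_patches_z1.npy")
        np.save(path, arr)
        # np.load reads the whole array into memory, so the file can be
        # removed once this call returns.
        loaded = np.load(path)
    return arr, loaded


arr, loaded = roundtrip_check()
print(loaded.shape)                 # dimensions of the array read from disk
print(np.array_equal(arr, loaded))  # whether the values match exactly
```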
Convert the Data to an R readable object
With this data in hand, let’s view the NumPy 2 R Object script (n2r.py). The script itself has two sections. The first section enables the user to feed in parameters via the command line. The second section uses the rpy2 package within Python to convert NumPy arrays to R objects.
Command Line Interface to the Script
The command line options are defined as follows:
n2r.py -i <inputdirectory> -f <matchfname> -e <exportdirectory>
With actual values we have:
n2r.py -i /Users/James/Desktop/lidar -f _patches_ -e rout
Note: The export directory is placed within the input directory and, thus, we have R objects in /Users/James/Desktop/lidar/rout.
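For comparison, the same three options could also be declared with argparse from the standard library. This is a sketch of an alternative, not what n2r.py does (the script uses getopt); the function name parse_args is hypothetical, and the defaults are copied from the script.

```python
import argparse


def parse_args(argv):
    """Parse the -i/-f/-e options of n2r.py (argparse variant, an assumption)."""
    parser = argparse.ArgumentParser(prog="n2r.py")
    parser.add_argument("-i", "--indir", default="/Users/James/Desktop/lidar_data",
                        help="directory containing the .npy files")
    parser.add_argument("-f", "--fname", default="_patches_",
                        help="partial file name to match")
    parser.add_argument("-e", "--expdir", default="R_data",
                        help="export directory created inside the input directory")
    return parser.parse_args(argv)


# Same values as the example invocation above
args = parse_args(["-i", "/Users/James/Desktop/lidar", "-f", "_patches_", "-e", "rout"])
print(args.indir, args.fname, args.expdir)
```

With argparse, the -h usage message is generated automatically, so the usage string does not need to be repeated by hand.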
The NumPy binary to R object script n2r.py
The first order of business is to have the function set default parameter values. The second order is to then process all files within the directory whose names contain a specific sequence (e.g. _patches_). The third order is to export these objects under the .gzip extension so that R is able to read them via load(). The reason for using .gzip instead of .rda is mainly that when we tried to export using .rda, there were a lot of unexpected complications that caused the writing of the file to be prolonged and then fail.
With that being said, here’s the script:
n2r.py
import os, sys, getopt
import numpy as np
import re
from rpy2.robjects import r
from rpy2.robjects.numpy2ri import numpy2ri
"""
Conversion function for .npy files
@author : Avinash Balakrishnan
Commandline argumentation
@author: JJB
"""
def main(argv):
    # Declare some default values
    dirname = '/Users/James/Desktop/lidar_data'
    fname = '_patches_'
    export_dir = 'R_data'
    # Try to parse the arguments
    try:
        opts, args = getopt.getopt(argv, "hi:f:e:", ["indir=", "fname=", "expdir="])
    except getopt.GetoptError as err:
        print(str(err))
        print('n2r.py -i <inputdirectory> -f <matchfname> -e <exportdirectory>')
        sys.exit(2)
    # Set the correct values
    for opt, arg in opts:
        if opt == '-h':
            print('n2r.py -i <inputdirectory> -f <matchfname> -e <exportdirectory>')
            sys.exit()
        elif opt in ("-i", "--indir"):
            dirname = arg
        elif opt in ("-f", "--fname"):
            fname = arg
        elif opt in ("-e", "--expdir"):
            export_dir = arg
    # Call function
    convert_numpy(dirname, fname, export_dir)
def convert_numpy(path_to_data, fname, export_dir):
    """Convert NumPy N-D array to R object
    Keyword arguments:
    path_to_data -- full dir path to data
    fname -- partial file name to match
    export_dir -- Name of export dir added to data dir
    """
    # Create a directory path
    if not os.path.exists("%s/%s" % (path_to_data, export_dir)):
        os.makedirs("%s/%s" % (path_to_data, export_dir))
    # Get list of files in the directory
    files = os.listdir(path_to_data)
    # Sort out which files are of each type
    numpy_files = sorted([f for f in files if fname in f])
    # Begin process conversion
    for numpy_fname in numpy_files:
        # Load in 4D NumPy array
        d = np.load("%s/%s" % (path_to_data, numpy_fname))
        # Remove the file extension of .npy binary
        file_name = re.sub(r'\.npy$', '', numpy_fname)
        # Convert the loaded NumPy array to an R object
        ro = numpy2ri(d)
        # Assign the name
        r.assign("%s" % file_name, ro)
        # Export to .gzip readable by R's load()
        r("save(%s, file='%s/%s/%s.gzip', compress=TRUE)" % (file_name, path_to_data, export_dir, file_name))
if __name__ == "__main__":
    main(sys.argv[1:])
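The file selection and naming steps of convert_numpy can be tried out without R or rpy2 installed. The sketch below mirrors only the listing, sorting, and renaming logic and performs no conversion; the helper name plan_exports and the extra file notes.txt are hypothetical and exist purely for illustration.

```python
import os
import re
import tempfile


def plan_exports(path_to_data, fname, export_dir):
    """Return (object_name, output_path) pairs for the matching files.

    Hypothetical helper: mirrors the matching and renaming steps of
    convert_numpy but does not load or convert anything.
    """
    files = sorted(f for f in os.listdir(path_to_data) if fname in f)
    plan = []
    for numpy_fname in files:
        # Strip the .npy extension to obtain the R object name
        object_name = re.sub(r"\.npy$", "", numpy_fname)
        out_path = os.path.join(path_to_data, export_dir, object_name + ".gzip")
        plan.append((object_name, out_path))
    return plan


with tempfile.TemporaryDirectory() as tmp:
    # Create empty placeholder files; only the names matter here
    for name in ("b_patches_z1.npy", "a_patches_z1.npy", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    plan = plan_exports(tmp, "_patches_", "rout")

print([name for name, _ in plan])  # → ['a_patches_z1', 'b_patches_z1']
```

Because the match is a plain substring test, any file whose name contains the sequence is picked up regardless of extension, so it helps to keep only .npy files with that sequence in the input directory.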
After running this script, we now have objects R would recognize via the load() command.
load("a_patches_z1.gzip")
Convert from an R object to another R object for better storage
Now, I’m not necessarily a huge fan of the .gzip extension. I would prefer if the file was identified as an R object just by its extension. So, I’ve added another function to be run within R that changes the format once more from .gzip to R’s .rda format. Again, this is mainly because using the .rda extension within rpy2 does not work.
gzip_to_rda.R
#' Change file format from GZIP to RDA
#'
#' Modifies file format from .gzip to R's binary format .rda
#' @param indir A \code{string} with the location of the data directory
#' @param fname A \code{string} that contains commonalities between files.
#' @param outdir A \code{string} representing the out directory to save to.
#' @examples
#' gzip_to_rda("/Users/James/Desktop/lidar/rout", "_patches_","/Users/James/Desktop/lidar/rda")
gzip_to_rda = function(indir, fname, outdir){
  # Grab a list of files within the directory
  m = list.files(path = indir, pattern = fname)
  # Make output dir
  if(!dir.exists(outdir)) {
    if(outdir != "."){
      dir.create(outdir, recursive = TRUE)
    }
  }
  # Load in each file
  for(i in seq_along(m)){
    # Obtain the object name (file name without extension)
    f = tools::file_path_sans_ext(m[i])
    # Build the full path to the file and load it
    load(file.path(indir, m[i]))
    # Save file
    save(list = f, file = file.path(outdir, paste0(f, ".rda")))
    # Remove the loaded object from memory
    rm(list = f)
  }
}
Credit
This post has a code contribution by Avinash Balakrishnan, a Masters student in the Department of Statistics and a Graduate Research Assistant (GRA) working with me during Spring 2016 at the University of Illinois at Urbana-Champaign.