Saving and Reading R Objects

Intro

By default, analysts tend to export results as a CSV file and share it with their colleagues. CSV files are a sensible choice here because version control can show “diffs” between versions and nothing more than a text editor is needed to open and edit them. However, CSV files are problematic when it comes to sparse matrices, as one of my fellow PhD students discovered recently after generating more than 500 GB of data. For a refresher on sparse matrices, see the prior post on the benefits of using sparse matrices. In cases like this, it is better to save the object itself in one of R’s binary file formats. The same logic extends to large data sets and simulation results.

The goal of this post is to highlight the different binary file formats offered by R, version compatibility, and compression differences.

Formats

We begin with an overview of the different kinds of R binary files that are available.

  • .rda/.RData is “R Data”
    • Description: Save and restore one or more named objects into an environment.
    • Notes: Useful for storing workspaces and multiple R objects as-is. As an example, see the save.image() function, which is what runs when you choose to save the workspace on quitting an R session.
  • .rds is an “R Data Single”
    • Description: Save and load a single R object to a binary file.
    • Notes: Great for exporting a single result and loading it into a new variable.
  • .rdx and .rdb
    • Description: .rdx contains the index while .rdb stores the objects for an R database used in lazy loading.
    • Notes: Primarily for R’s internal usage. That said, lazy loading has a benefit for large data: objects are bound as promises and are only read from disk when first accessed.

Creation of the binary files and the ability to read them in are given next.

.rda/.RData

# Define values
fruit = "apple"
toad = "ribbit"

# Save R objects
save(fruit, toad, file = "all_objects.rda")

# Remove objects in environment
rm(list = ls())

# Load objects from disk
load("all_objects.rda")
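
Because load() restores objects under their original names, it can silently overwrite variables that already exist. A minimal sketch of one way around this, assuming a throwaway file in tempdir() (the names rda_path, loaded_env, and loaded_names are made up for illustration), is to load into a dedicated environment and inspect the names that load() returns:

```r
# Define and save two objects to a temporary file (hypothetical path)
fruit = "apple"
toad = "ribbit"
rda_path = file.path(tempdir(), "all_objects.rda")
save(fruit, toad, file = rda_path)

# Load into a separate environment instead of the calling one
loaded_env = new.env()
loaded_names = load(rda_path, envir = loaded_env)

# load() returns (invisibly) the names of the restored objects
loaded_names

# Access a restored object through the environment
loaded_env$fruit
```

This keeps the restored objects apart from the current workspace until you decide which ones to copy over.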

.rds

# Define a value
life = 42L

# Save a single R object
saveRDS(life, file = "myobj.rds")

# Remove objects in environment
rm(list = ls())

# Read in the object from disk and
# assign it to a new variable
my_age = readRDS(file = "myobj.rds")
my_age
# [1] 42

life
# Error: object 'life' not found
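
To make the contrast with CSV concrete, here is a small sketch, using a made-up data frame called scores and temporary files, of how an .rds round trip keeps column types and attributes that a CSV round trip does not:

```r
# A data frame with an integer, a factor with an unused level, and a Date column
scores = data.frame(
  id = c(1L, 2L, 3L),
  grade = factor(c("B", "A", "B"), levels = c("A", "B", "C")),
  taken = as.Date(c("2020-01-05", "2020-01-06", "2020-01-07"))
)

# Hypothetical temporary file locations
rds_path = tempfile(fileext = ".rds")
csv_path = tempfile(fileext = ".csv")

saveRDS(scores, file = rds_path)
write.csv(scores, file = csv_path, row.names = FALSE)

from_rds = readRDS(rds_path)
from_csv = read.csv(csv_path)

# The .rds copy matches the original object exactly
identical(scores, from_rds)

# The CSV copy loses the Date class and the unused factor level
class(from_csv$taken)
levels(from_csv$grade)
```

The exact classes read.csv() produces depend on the R version and its arguments, so treat the CSV side as illustrative rather than definitive.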

.rdx and .rdb

# Save R objects into an environment
my_lazy_env = new.env(parent = emptyenv())
my_lazy_env$my_df = data.frame(x = 1, y = 2)
my_lazy_env$grades = data.frame(pct = 95, letter = "A")

# Store database in folder
dir.create("data-db")

# Save objects inside a LazyLoadDB
# Requires an environment and the name of a file.
tools:::makeLazyLoadDB(my_lazy_env, "data-db/my_lazyload_db")

# Remove objects in environment
rm(list = ls())

# Load objects from disk
lazyLoad("data-db/my_lazyload_db")
# NULL

Note: Using .rdx and .rdb requires first placing the objects in an environment and then passing that environment to makeLazyLoadDB() to construct the lazy-load database. Moreover, note the use of three colons, :::, to access makeLazyLoadDB in tools. This means that the function is not exported from the tools package and should be considered internal.
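
As with load(), the lazy-load database can be attached to an environment of your choosing through the envir argument of lazyLoad(). The following is a minimal sketch that rebuilds the database in tempdir(); src_env, db_base, and lazy_env are illustrative names, and makeLazyLoadDB remains an internal function whose behavior may differ between R versions:

```r
# Build a small lazy-load database (internal API, see the note above)
src_env = new.env(parent = emptyenv())
src_env$my_df = data.frame(x = 1, y = 2)
src_env$grades = data.frame(pct = 95, letter = "A")

db_dir = file.path(tempdir(), "data-db")
dir.create(db_dir, showWarnings = FALSE)
db_base = file.path(db_dir, "my_lazyload_db")
tools:::makeLazyLoadDB(src_env, db_base)

# Bind the stored objects as promises in a dedicated environment
lazy_env = new.env()
lazyLoad(db_base, envir = lazy_env)

# The names are available immediately
ls(lazy_env)

# The data is only read from disk on first access
lazy_env$grades$pct
```

Two files, my_lazyload_db.rdx and my_lazyload_db.rdb, are written next to each other, and lazyLoad() takes their shared base name without an extension.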

R Binary File Versions and Compatibilities

That said, saving into R’s binary format introduces compatibility issues. Newer versions of R write a newer variant of the binary format, which older versions of R cannot read. To control the format version that an object is saved in, pass version = 2 or version = 3 when writing the object via save() or saveRDS().

The following table shows when each format version came into service and when it was the default.

Binary Version   Supported From   Default In
3                R 3.5.0          R 3.6.0 - Present
2                R 1.4.0          R 1.4.0 - R 3.5.3
1                R 0.99.0         R 0.99.0 - R 1.3.1

More information about version differences can be found in Section 1.8: Serialization Formats of the R Internals manual.

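When looking at an R binary file, note that format information is stored at the very start of the file once it has been decompressed. Serialized data, such as an .rds file, begins with X for binary serialization or A for ASCII serialization, followed by the format version. Files written by save() begin with a short magic string, such as RDX2 or RDA2, that combines the same letter with the version number.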
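
One way to check this yourself is to read the first few bytes through a decompressing connection. This is a rough sketch, assuming the default gzip compression and pinning version = 2 so the header is predictable; the file and variable names are invented for the example:

```r
# Write one file of each kind with an explicit format version
fruit = "apple"
tmp_rda = tempfile(fileext = ".rda")
tmp_rds = tempfile(fileext = ".rds")
save(fruit, file = tmp_rda, version = 2)
saveRDS(fruit, file = tmp_rds, version = 2)

# gzfile() decompresses the default gzip compression on the fly
con = gzfile(tmp_rda, "rb")
rda_header = rawToChar(readBin(con, "raw", n = 5))
close(con)

# For .rds: two bytes of format marker, then the version as a big-endian integer
con = gzfile(tmp_rds, "rb")
rds_format = rawToChar(readBin(con, "raw", n = 2))
rds_version = readBin(con, "integer", n = 1, endian = "big")
close(con)

rda_header   # magic string followed by a newline
rds_format   # format letter followed by a newline
rds_version  # format version number
```

Files saved with another compress setting or with ascii = TRUE have a different layout, so this check would need adjusting for them.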

Fin

In short, give R binary files a shot if you are looking for reduced file size and don’t mind being unable to inspect the data without opening it in R.
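
If you want to see how file size plays out on your own data, a quick comparison along these lines may help. It is only a sketch on simulated data (sim, out_dir, and settings are made-up names), and the actual sizes depend heavily on what is being stored:

```r
# Simulate a small data set
set.seed(42)
n = 1e4
sim = data.frame(
  id = seq_len(n),
  value = rnorm(n),
  group = sample(letters, n, replace = TRUE)
)

# Hypothetical scratch directory for the output files
out_dir = tempfile("size-check-")
dir.create(out_dir)

csv_path = file.path(out_dir, "sim.csv")
write.csv(sim, file = csv_path, row.names = FALSE)

# One .rds file per compression setting accepted by saveRDS()
settings = list(none = FALSE, gzip = "gzip", bzip2 = "bzip2", xz = "xz")
rds_paths = vapply(names(settings), function(nm) {
  path = file.path(out_dir, paste0("sim-", nm, ".rds"))
  saveRDS(sim, file = path, compress = settings[[nm]])
  path
}, character(1))

# File sizes in bytes, named by format
sizes = c(
  csv = file.size(csv_path),
  setNames(file.size(rds_paths), names(settings))
)
sizes
```

Stronger compression usually costs more time when writing, so the smallest file is not automatically the best choice.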
