The default analyst behavior is to export results as a CSV file and share it with colleagues. CSV files are preferred here because they show “diffs” between versions and need only minimal software, such as a text editor, to open and edit. However, CSV files are problematic when it comes to sparse matrices, as one of my fellow PhD students discovered recently (generating upwards of roughly 500 GB of data). For a refresher on sparse matrices, see the prior post on the benefits of using sparse matrices. In cases like this, it would be better to save the object itself through one of R’s binary file formats. This logic extends to large data sets and simulation results as well.
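To make the size gap concrete, here is a minimal sketch that writes the same mostly-zero matrix as a CSV file and as an `.rds` file. It assumes the Matrix package is installed (it ships with standard R distributions); the dimensions, density, and variable names are made up for illustration and are far smaller than the case described above.

```r
library(Matrix)

set.seed(42)

# Hypothetical example: 1000 x 1000 matrix with about 0.1% non-zero entries
sparse_mat = rsparsematrix(nrow = 1000, ncol = 1000, density = 0.001)

csv_file = tempfile(fileext = ".csv")
rds_file = tempfile(fileext = ".rds")

# CSV has no sparse representation, so every zero is written out
write.csv(as.matrix(sparse_mat), file = csv_file, row.names = FALSE)

# The binary format stores the sparse object as-is
saveRDS(sparse_mat, file = rds_file)

# Compare sizes on disk (in bytes)
file.size(csv_file)
file.size(rds_file)
```

The CSV export has to densify the matrix first, which is exactly the step that blows up for large sparse objects.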
The goal of this post is to highlight the different binary file formats offered by R, version compatibility, and compression differences.
First, we begin with an overview of the different kinds of R binary files that are available.
.RData is “R Data”
- Description: Save and restore one or more named objects into an environment.
- Notes: Useful for storing workspaces and multiple R objects as-is.
  As an example, see the save.image() function, which R calls when you choose to save the workspace on closing a session.
.rds is “R Data Single”
- Description: Save and load a single R object to a binary file.
- Notes: Great for exporting a single result and loading it into a new variable.
.rdx contains the index, while .rdb stores the objects, for an R database used in lazy loading
- Notes: Primarily for R’s internal usage. That said, there are benefits for large data, since promises delay the assignment until an object is first used.
The code to create each kind of binary file and to read it back in is given next.
```r
# Define values
fruit = "apple"
toad = "ribbit"

# Save R objects
save(fruit, toad, file = "all_objects.rda")

# Remove objects in environment
rm(list = ls())

# Load objects from disk
load("all_objects.rda")
```
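One thing to keep in mind with load() is that it restores objects under their original names and silently overwrites anything in the workspace with the same name. A minimal sketch of a safer pattern, loading into a separate environment, is below. The temporary file and the `restored` environment are illustrative names only.

```r
fruit = "apple"
toad = "ribbit"

rda_file = tempfile(fileext = ".rda")
save(fruit, toad, file = rda_file)

# A different object with the same name now lives in the workspace
fruit = "banana"

# Load into a fresh environment instead of the current workspace
restored = new.env()
loaded_names = load(rda_file, envir = restored)

loaded_names      # names of the objects that were restored
restored$fruit    # value stored in the file
fruit             # value in the workspace, left untouched
```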
```r
# Define a value
life = 42L

# Save a single R object
saveRDS(life, file = "myobj.rds")

# Remove objects in environment
rm(list = ls())

# Read in the object from disk and
# assign it to a new variable
my_age = readRDS(file = "myobj.rds")

my_age
# [1] 42

life
# Error: object 'life' not found
```
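Since compression is part of the appeal of the binary formats, here is a rough sketch of how the `compress` argument of saveRDS() changes the size on disk. The data are deliberately repetitive and hypothetical, so the differences will be larger than for typical real data, and the `"xz"` option assumes an R build with xz support (the usual case).

```r
# Hypothetical, highly repetitive data that compresses well
values = rep(c(1.5, 2.5, 3.5), times = 100000)

none_file = tempfile(fileext = ".rds")
gzip_file = tempfile(fileext = ".rds")
xz_file = tempfile(fileext = ".rds")

saveRDS(values, file = none_file, compress = FALSE)
saveRDS(values, file = gzip_file, compress = "gzip")
saveRDS(values, file = xz_file, compress = "xz")

# Sizes on disk in bytes
sizes = file.size(c(none_file, gzip_file, xz_file))
names(sizes) = c("none", "gzip", "xz")
sizes

# readRDS() detects the compression type on its own
roundtrip_ok = identical(readRDS(xz_file), values)
roundtrip_ok
```

Stronger compression generally trades smaller files for slower writes, so the right setting depends on how often the file is written versus read.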
```r
# Save R objects into an environment
my_lazy_env = new.env(parent = emptyenv())
my_lazy_env$my_df = data.frame(x = 1, y = 2)
my_lazy_env$grades = data.frame(pct = 95, letter = "A")

# Store database in folder
dir.create("data-db")

# Save objects inside a LazyLoadDB
# Requires an environment and the name of a file.
tools:::makeLazyLoadDB(my_lazy_env, "data-db/my_lazyload_db")

# Remove objects in environment
rm(list = ls())

# Load objects from disk
lazyLoad("data-db/my_lazyload_db")
# NULL
```
Creating an .rdb file requires first placing the objects to be saved into an
environment and then supplying that environment as the argument used to construct the Lazy DB. Moreover,
note the use of three colons, i.e. :::, to access makeLazyLoadDB().
This means that the function is not exported from the tools package and
should be considered internal.
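The promise mechanism behind lazy loading can be seen without touching any internal function. The sketch below uses base R’s delayedAssign() as a stand-in; it is not how the Lazy DB itself is built, only an illustration of the delayed evaluation idea, and the object names are hypothetical.

```r
was_evaluated = FALSE

# Create a promise: the expression runs only when big_result is first used
delayedAssign("big_result", {
  was_evaluated = TRUE
  seq_len(10)
})

# Nothing has been computed yet
before_use = was_evaluated

# First access forces the promise
total = sum(big_result)
after_use = was_evaluated

before_use
after_use
total
```

A lazy-load database applies the same idea to objects on disk: each object is read from the .rdb file only at the moment it is first accessed.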
R Binary File Versions and Compatibilities
With that said, it is worth emphasizing that saving into R’s binary
format introduces compatibility issues. That is, some versions of R use
a newer variant of the binary format and others do not. To control the version
that an object is saved in, use the version = 2 or version = 3 parameter when
writing the object via save() or saveRDS(). Files written with version = 3
cannot be read by releases older than R 3.5.0.
The following table provides information as to when the different versions came into service.
| R Version | Binary Version |
|---|---|
| R 3.5.0 - Present | 3 (default since R 3.6.0) |
| R 1.4.0 - R 3.5.3 | 2 (default) |
| R 0.99.0 - R 1.3.1 | 1 (default) |
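As a small sketch of pinning the format version, the code below writes the same object in both formats and then checks what was written. It assumes R 3.5.0 or later, since that is where both `version = 3` and the infoRDS() helper became available; the object and file names are illustrative.

```r
scores = c(a = 1, b = 2)

v2_file = tempfile(fileext = ".rds")
v3_file = tempfile(fileext = ".rds")

# Version 2 keeps the file readable by older R releases
saveRDS(scores, file = v2_file, version = 2)
saveRDS(scores, file = v3_file, version = 3)

# infoRDS() reports serialization details without loading the object
v2_info = infoRDS(v2_file)
v3_info = infoRDS(v3_file)

v2_info$version
v3_info$version

# Both files restore the same object
same_object = identical(readRDS(v2_file), readRDS(v3_file))
same_object
```

If the files need to be shared with colleagues on older installations, writing with `version = 2` is the conservative choice.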
When looking at an R binary file, note that version information is stored
within the first line of the written file, where
X denotes binary serialization and
A denotes ASCII serialization.
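To peek at that first line directly, the sketch below writes two small files with save() and reads back their leading bytes. Compression is switched off here as an assumption of the example, because a compressed file would show the compressor’s header instead; the file and object names are hypothetical.

```r
answer = 42L

bin_file = tempfile(fileext = ".rda")
ascii_file = tempfile(fileext = ".rda")

# Turn compression off so the header can be read straight from disk
save(answer, file = bin_file, version = 2, compress = FALSE)
save(answer, file = ascii_file, version = 2, ascii = TRUE, compress = FALSE)

# First four bytes of each file, converted to text
bin_header = rawToChar(readBin(bin_file, what = "raw", n = 4))
ascii_header = rawToChar(readBin(ascii_file, what = "raw", n = 4))

bin_header
ascii_header
```

With `version = 2`, the binary file should start with RDX2 and the ASCII file with RDA2, so the third character gives the serialization type and the trailing digit gives the format version.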
In short, give R binary files a shot if you are looking for a reduced file size and don’t mind being unable to inspect the data without opening it in R.