This is the first of three posts on data packages within the R ecosystem. In this post, we’ll cover R package guidelines, package distribution, and how much data a package is able to ship. The next entry focuses on the best ways to create an R data package. The third and final entry covers creating an additional repository outside of CRAN for distributing large data packages.
The Comprehensive R Archive Network (CRAN) is a repository of R packages that extend the functionality of R. Getting a package completed, much less listed on CRAN, has spawned countless guides outside of the official documentation. The benefits of being listed on CRAN are ease of distribution, publicity, and version control (it is an archive, after all). As a result, nearly anyone who writes an R package tries to submit it to CRAN, which is why CRAN has a policy on package submissions.
R Package Size Limitations
For the most part, the submission rules are straightforward until you reach the package size limitation in the source packages section. I’ve taken the opportunity to quote the particularly troubling text and emphasize specific parts.
Packages should be of the minimum necessary size. Reasonable compression should be used for data (not just .rda files) and PDF documentation: CRAN will if necessary pass the latter through qpdf. As a general rule, neither data nor documentation should exceed 5MB (which covers several books). A CRAN package is not an appropriate way to distribute course notes, and authors will be asked to trim their documentation to a maximum of 5MB.
If your data is larger than 5 MB, you can apply for an exemption. Do not be surprised, though, if CRAN turns down your request with something along the lines of:
We do not accept such huge package anymore. We have < 10 larger packages on CRAN (historically caused, we would not accept these as new package today any more).
We would really appreciate if you could halve the size, for example. Or perhaps host the data only package in another repository. Then method package using this data package could then, for exampole, shp a function that gets the data package from the external repository.
Size of R Packages on CRAN
With this being said, let’s take a look at the varying sizes of packages on CRAN. Unfortunately, this information is not available via available.packages(). Instead, one must download all 7,752 available packages, which takes up approximately 3.67 GB and about an hour or two depending on your internet connection. To obtain this information, use:
# Save directory
save.dir = "F:/CRANMirror"
# Create a directory to store package .tar.gz
dir.create(save.dir)
# Obtain a list of packages
pkgs = available.packages()[, "Package"]
# Download those packages
download.packages(pkgs = pkgs, destdir = save.dir)
After all the packages are downloaded, we can obtain the total size of each package.
# List the downloaded package files
pkg.files = list.files(save.dir)
# Convert to MB from bytes
pkg.sizes = round(file.size(file.path(save.dir, pkg.files)) / 1024^2, 2)
With this information now available, we can view how the size of R packages on CRAN is distributed. Overall, R packages listed on CRAN are small: out of 7,711 packages, 6,738 (87.38%) were smaller than 1 MB.
Note that the graph’s scale runs from 0 MB to 10 MB. This is because only 72 packages were larger than 5 MB, and only 23 of those were larger than 10 MB.
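The share and counts quoted above can be computed directly from the vector of sizes. The following is a minimal sketch that uses a small made-up pkg.sizes vector (hypothetical values, in MB) in place of the real download; the variable names pct.under.1mb, n.over.5mb, and n.over.10mb are illustrative only.

```r
# Hypothetical package sizes in MB; substitute the real pkg.sizes vector
pkg.sizes = c(0.02, 0.15, 0.48, 0.97, 1.30, 3.75, 5.20, 12.40)

# Share of packages under 1 MB, as a percentage
pct.under.1mb = round(mean(pkg.sizes < 1) * 100, 2)

# Number of packages above the 5 MB and 10 MB marks
n.over.5mb = sum(pkg.sizes > 5)
n.over.10mb = sum(pkg.sizes > 10)

c(pct.under.1mb = pct.under.1mb,
  n.over.5mb = n.over.5mb,
  n.over.10mb = n.over.10mb)
```

Taking mean() of a logical vector gives the proportion of TRUE values, which avoids counting and dividing by hand.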
The top 10 R packages available on CRAN by size are:
Out of these 10 packages, note that at least 6 of them are data specific packages.
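A ranking of this kind can be produced by sorting the sizes in decreasing order. The sketch below assumes file names of the form name_version.tar.gz and uses made-up names and sizes; the data frame sizes and the result top are hypothetical names introduced for illustration.

```r
# Hypothetical file names and sizes in MB; substitute pkg.files and pkg.sizes
pkg.files = c("alpha_1.0.tar.gz", "beta_0.2.tar.gz", "gamma_2.1.tar.gz")
pkg.sizes = c(0.40, 7.25, 1.10)

# Strip the version and extension to recover the package name
sizes = data.frame(
  package = sub("_.*$", "", pkg.files),
  size.mb = pkg.sizes
)

# Keep (at most) the ten largest packages
top = head(sizes[order(sizes$size.mb, decreasing = TRUE), ], 10)
top
```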
How much data could a data package chuck if a data package could chuck data?
As the age of Big Data (R Gods just killed a kitten) is upon us, the 5 MB limitation is very steep considering most big data sets are in the terabyte-plus range. To illustrate just how much data can be crammed into 5 MB, let’s look at the storage capacity of a numeric matrix.
# install.packages("pryr")
# Pretty object.size output
library("pryr")
# For reproducibility
set.seed(1337)
# Generate a random matrix
a = matrix(rnorm(625000), nrow = 62500, ncol = 10)
# Matrix memory size
object_size(a)
## 5 MB
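The matrix dimensions used above follow from simple arithmetic: each double takes 8 bytes, so the element count is the byte budget divided by 8. This sketch assumes the 5 MB figure is read as 5 × 1000^2 bytes (an assumption; the earlier size conversion in this post uses 1024^2) and ignores the small fixed header R attaches to every object. The variable names are illustrative.

```r
# Assumption: 5 MB read as 5 * 1000^2 bytes
limit.bytes = 5 * 1000^2

# A double-precision number occupies 8 bytes
bytes.per.double = 8

# Number of numeric elements that fit in the budget
max.elements = limit.bytes / bytes.per.double

# Rows available when the matrix has 10 columns
max.rows = max.elements / 10

c(max.elements = max.elements, max.rows = max.rows)
```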
So, within memory, the largest data set has a total of 625,000 elements. However, this is not the largest data set we can include in an R data package. In fact, when saving the data as an .rda file via save(a, file="a.rda"), we gain an additional 0.42 MB to play with.
# For reproducibility
set.seed(1337)
# Generate a random matrix
a = matrix(rnorm(683400), nrow = 68340, ncol = 10)
# Matrix memory size
object_size(a)
## 5.47 MB
Therefore, when the data is saved as .rda, we gain an additional 58,400 elements (683,400 versus 625,000). However, we should probably leave off a few observations to account for the additional cost of documentation and R package structure files. The important aspect of this exercise, though, is finding the size cap of the data set.
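To check the on-disk size of a saved object rather than its in-memory size, one can write it to a temporary file and measure the result. This is a minimal sketch with a much smaller matrix so it runs quickly; rda.path, mem.bytes, and rda.bytes are hypothetical names, and the exact byte counts will depend on the R version and compression settings.

```r
# For reproducibility
set.seed(1337)

# A small random matrix (1,000 rows by 10 columns)
a = matrix(rnorm(10000), nrow = 1000, ncol = 10)

# Save to a temporary .rda file; save() compresses with gzip by default
rda.path = tempfile(fileext = ".rda")
save(a, file = rda.path)

# Compare the in-memory size with the size of the file on disk (in bytes)
mem.bytes = as.numeric(object.size(a))
rda.bytes = file.size(rda.path)

c(in.memory = mem.bytes, on.disk = rda.bytes)
```

Passing compress = "xz" or compress = "bzip2" to save() is another option worth trying when a data set sits close to the limit.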
If we were to ship these data sets as .csv files, they would be 10.88 MB and 11.9 MB, or 5.88 MB and 6.9 MB over the limit!
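The overhead of a text format can be checked the same way. The sketch below writes a small random matrix with write.csv() and compares the file size to the in-memory size; csv.path, csv.bytes, and mem.bytes are hypothetical names, and the exact ratio will vary with the data.

```r
# For reproducibility
set.seed(1337)

# A small random matrix (1,000 rows by 10 columns)
a = matrix(rnorm(10000), nrow = 1000, ncol = 10)

# Write the matrix out as plain-text .csv
csv.path = tempfile(fileext = ".csv")
write.csv(a, file = csv.path)

# Size on disk versus size in memory (in bytes)
csv.bytes = file.size(csv.path)
mem.bytes = as.numeric(object.size(a))

# Ratio of .csv size to in-memory size
csv.bytes / mem.bytes
```

Each number is written out as a string of up to 15 significant digits plus a separator, which is why the text file ends up larger than the 8 bytes per value used in memory.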