Today, I had an epiphany about generating a report inside of the Illinois Campus Cluster (ICC) with R through rmarkdown. For a long time, I’ve wanted to avoid downloading data from the cluster to my personal computer and then generating the report there. Alas, my priorities never aligned with solving this problem, as downloading data was quick inside of a University building; but, with COVID-19, my internet is no longer Mazda’s “zoom zoom” fast. I would wager that downloading the same amount of data now takes about as long as placing it on a flash drive and shipping it overnight via FedEx.
The epiphany was simple:
What if we had pandoc on the cluster?
For those who aren’t familiar with pandoc, the software serves as a universal document converter or, in more relatable terms, it is the “Swiss Army knife” for moving between different document formats. Alas, pandoc is written in Haskell, which wasn’t available on the cluster. So, under the usual operating principles on the cluster, I would have to build pandoc from source, which would require setting up a Haskell environment, or so I thought…
But, wait… There’s a binary! From pandoc’s Linux section, there is a binary package for the amd64 architecture that is standalone with both pandoc and pandoc-citeproc. Both binaries are statically linked and have no dynamic dependencies or dependencies on external data files. Huzzah! Let’s try out the binary…
Dynamic retrieval script
First, we want to always get the latest version of a software release from GitHub. With a quick Google, we land at Hanwen Wu’s One Liner to Download the Latest Release from Github Repo gist. We removed the wget from that one-liner, though, to allow for a more targeted pipe into tar.
# Determine latest version from GitHub (strip the quotes and spaces around the URL)
LATEST_RELEASE_URL=$(curl -s https://api.github.com/repos/jgm/pandoc/releases/latest | grep "browser_download_url.*amd64.tar.gz" | cut -d : -f 2,3 | tr -d '" ')
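To see what each stage of that pipeline contributes, here is a minimal sketch that runs the same grep, cut, and tr steps on a single sample line instead of a live API response. The sample line and its version number are illustrative assumptions, as are the variable names; the tr step in this sketch deletes spaces as well as quotes so that no leading blank is left on the URL.

```shell
# Hypothetical sample of one line from the GitHub API response
# (the version number is only illustrative)
SAMPLE_LINE='      "browser_download_url": "https://github.com/jgm/pandoc/releases/download/2.9.2.1/pandoc-2.9.2.1-linux-amd64.tar.gz"'

# Stage 1: keep only lines that name the amd64 tarball
MATCHED=$(printf '%s\n' "${SAMPLE_LINE}" | grep "browser_download_url.*amd64.tar.gz")

# Stage 2: split on ':' and keep fields 2 and 3, which rejoins "https" and the rest of the URL
AFTER_CUT=$(printf '%s\n' "${MATCHED}" | cut -d : -f 2,3)

# Stage 3: delete the double quotes and the spaces left around the URL
SAMPLE_URL=$(printf '%s\n' "${AFTER_CUT}" | tr -d '" ')

printf '%s\n' "${SAMPLE_URL}"
```

Splitting on the colon works here only because the URL itself contains exactly one colon, the one after "https".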
Next, we’ll need to retrieve the binary name, download the file, and unpack to our local binary location.
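# Destination directory (bin will be created inside)
DESTDIR=~/project-stat/
# Make sure the destination exists before tar writes into it
mkdir -p "${DESTDIR}"
# Retrieve filename
PANDOC_FILENAME="${LATEST_RELEASE_URL##*/}"
# Download the latest pandoc version
# (left unquoted so any stray whitespace around the URL is dropped)
wget -q ${LATEST_RELEASE_URL}
# Unpack into $DESTDIR/bin
tar xvzf "${PANDOC_FILENAME}" --strip-components 1 -C "${DESTDIR}"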
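The filename extraction and the --strip-components step can be tried without touching the network. The following is a rough sketch that builds a stand-in tarball in a temporary directory and unpacks it the same way; the URL, the version "0.0", the assumed archive layout (one top-level directory containing bin/), and all DEMO_ names are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched
WORKDIR=$(mktemp -d)

# Build a stand-in tarball with the assumed layout: <top-level dir>/bin/<binary>
mkdir -p "${WORKDIR}/pandoc-0.0/bin"
printf '#!/bin/sh\necho "stand-in pandoc"\n' > "${WORKDIR}/pandoc-0.0/bin/pandoc"
tar -czf "${WORKDIR}/pandoc-0.0-linux-amd64.tar.gz" -C "${WORKDIR}" pandoc-0.0

# Hypothetical URL, used only to show how the filename is derived
DEMO_URL="https://example.com/downloads/pandoc-0.0-linux-amd64.tar.gz"
# "##*/" removes the longest prefix ending in '/', leaving the filename
DEMO_FILENAME="${DEMO_URL##*/}"

# Extract, dropping the top-level directory so bin/ lands directly in the destination
DEMO_DESTDIR="${WORKDIR}/dest"
mkdir -p "${DEMO_DESTDIR}"
tar -xzf "${WORKDIR}/${DEMO_FILENAME}" --strip-components 1 -C "${DEMO_DESTDIR}"

# Run the stand-in script through sh to confirm it landed in dest/bin
DEMO_OUTPUT=$(sh "${DEMO_DESTDIR}/bin/pandoc")
printf '%s\n' "${DEMO_FILENAME}" "${DEMO_OUTPUT}"
```

Without --strip-components 1, the files would land in dest/pandoc-0.0/bin instead, and the directory placed on PATH would have to include the version number.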
From there, we need to append the location of the binary onto the PATH variable. To do so, place in ~/.bashrc:
export PATH="${HOME}/project-stat/bin:${PATH}"
Replace the project-stat path with the appropriate directory.
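One detail worth checking when writing the PATH entry: the shell does not expand a tilde that sits inside double quotes, so programs that read PATH themselves may not resolve a literal ~. Spelling the directory with ${HOME} avoids the question. The sketch below illustrates the difference and includes a small hypothetical helper, path_contains, for confirming that a directory made it onto PATH; the directory name is the same placeholder used above.

```shell
# A tilde inside double quotes is kept as a literal character
QUOTED_TILDE="~/project-stat/bin"
# ${HOME} is expanded inside double quotes
WITH_HOME="${HOME}/project-stat/bin"

# Hypothetical helper: succeed if the given directory is an entry in PATH
path_contains() {
  case ":${PATH}:" in
    *":$1:"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Prepend the expanded directory for this shell session only
PATH="${WITH_HOME}:${PATH}"

printf 'quoted tilde: %s\n' "${QUOTED_TILDE}"
printf 'with HOME:    %s\n' "${WITH_HOME}"
if path_contains "${WITH_HOME}"; then
  echo "bin directory is on PATH"
fi
```

After sourcing ~/.bashrc, running command -v pandoc is a quick way to confirm that the shell resolves the new binary.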
Then, open up R and trigger the render of the report using rmarkdown:
rmarkdown::render("path/to/RmarkdownFile.Rmd")
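On a cluster, the render may need to run from a job script rather than an interactive R session. As a sketch of one way to do that, assuming Rscript is available on the compute node, the same call can be passed to Rscript -e. The path is the placeholder from above, the RMD_FILE and RENDER_EXPR names are made up here, and the guard simply skips the render when a prerequisite is missing.

```shell
# Hypothetical report path; replace with the real .Rmd file
RMD_FILE="path/to/RmarkdownFile.Rmd"

# Build the R expression that Rscript will evaluate
# (assumes the path contains no double quotes)
RENDER_EXPR=$(printf 'rmarkdown::render("%s")' "${RMD_FILE}")

# Only attempt the render when pandoc, Rscript, and the report are all present
if command -v pandoc >/dev/null 2>&1 \
   && command -v Rscript >/dev/null 2>&1 \
   && [ -f "${RMD_FILE}" ]; then
  Rscript -e "${RENDER_EXPR}"
else
  echo "Skipping render: pandoc, Rscript, or ${RMD_FILE} not found" >&2
fi
```

Inside R, rmarkdown::pandoc_available() reports whether rmarkdown can see a pandoc binary at all, which is a useful first check if the render fails.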
Once the report is generated, the next step would be to send an e-mail with the report attached at the end of the simulation.
Fin
In short, this post showed how to dynamically retrieve the latest version of pandoc from GitHub, extract it into a local bin directory on the cluster, and then add that bin directory to PATH so that R can find pandoc and generate documents using rmarkdown.
Acknowledgements
Special thanks to NCSA’s Weddie Jackson, who has been my go-to person for getting the R version on the cluster upgraded every so often and who triggered the epiphany with an e-mail. (He bumped the cluster version to R 4.0.0 as well.)