Intro

This guide demonstrates how to install RStudio Server on the Hortonworks VirtualBox image (HDP) and set up the R package known as RHadoop. You should already have installed Oracle’s VirtualBox and loaded the Hortonworks image into it.

The commands below use sudo liberally, even though by default you will access the shell as root. sudo is included in case your enterprise has restricted your access to root.

Port Forwarding in VirtualBox

Before we start the VirtualBox image, we must first forward a port so that we can access RStudio Server via our browser, in a similar vein to accessing the Hortonworks web interface.

Access the Settings options via:

Virtual Box Image Access Settings

Open port forwarding by going to Network and then selecting Port Forwarding:

Virtual Box Network Menu for Port Forwarding

Creating a new port forwarding rule:

Virtual Box create new port forwarding rule

Enter the following for the new rule:

  • Name: rstudioserver
  • Protocol: TCP
  • Host IP: 127.0.0.1
  • Host Port: 8787
  • Guest Port: 8787

Virtual Box forwarding information for new rule

Accept the changes:

Virtual Box accept changes
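For reference, the same rule can also be created from the host’s command line with VBoxManage while the VM is powered off (a sketch: “Hortonworks Sandbox” is a placeholder name, and VBoxManage list vms shows what your image is actually called):

# Create the same NAT port-forwarding rule from the host shell
# "Hortonworks Sandbox" is a placeholder; substitute your VM's actual name
VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "rstudioserver,tcp,127.0.0.1,8787,,8787"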

Start the Hortonworks Virtual Box image:

Virtual Box start image

Installing RStudio Server on the Hortonworks image (based on CentOS 6)

Now, within the VirtualBox image, you will need to access the terminal to run the installation commands below.

# Step 1: Install necessary programs
# R, wget (download files), openssl098e (SSL), vim (text editor),
# git (version control), curl (command-line transfers over http)
# sudo allows the execution to happen as the root user (lots of power)
# yum is the package manager for CentOS (Red Hat)
# The -y means yes to all.
sudo yum -y install R git wget openssl098e vim curl

# Step 2: Download RStudio Server
wget -O /tmp/rstudio-server-0.98.1091-x86_64.rpm http://download2.rstudio.org/rstudio-server-0.98.1091-x86_64.rpm

# Step 3: Install RStudio Server
# --nogpgcheck temporarily disables the signature check of the package.
sudo yum -y install --nogpgcheck /tmp/rstudio-server-0.98.1091-x86_64.rpm

# Step 4: Any issues with the install?
sudo rstudio-server verify-installation
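
# Optional sanity check (assumes netstat, which is present on the stock CentOS 6 image):
# confirm rstudio-server is listening on port 8787
sudo netstat -tlnp | grep 8787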

# Step 5: Add a user to log in to RStudio
sudo adduser rstudio
sudo passwd rstudio
# New password: <your password here> (I chose rstudio)

Now, you should be able to log in to RStudio Server by entering 127.0.0.1:8787 or localhost:8787 in your browser’s URL bar.

Login page

Access RStudio via your browser by entering 127.0.0.1:8787

What RStudio looks like in your web browser (can you tell any difference vs. the desktop client?):

After logging in with rstudio user

When you are done using RStudio, make sure you use q() to exit the R session. This will prevent the following error message from being displayed on subsequent starts:

21 Nov 2014 14:24:38 [rsession-rstudio] ERROR session hadabend; LOGGED FROM: core::Error<unnamed>::rInit(const r::session::RInitInfo&) /root/rstudio/src/cpp/session/SessionMain.cpp:1692

In the event RStudio Server is not able to start on a subsequent run of the image, log into the shell and use the following two commands:

# Stop any rstudio-server process
sudo rstudio-server stop

# Start a new rstudio-server process
sudo rstudio-server start
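
Alternatively, the rstudio-server utility supports doing both in one step:

# Stop and start the rstudio-server process in one command
sudo rstudio-server restart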

Setting up rmr2 (RHadoop)

In order to install and use RHadoop, we must first set two bash variables, install some packages in R, and then install rmr2.

Bash Variable Initialization

There are two ways to let R know where certain Hadoop components are located, and which one you need depends on whether you access the R session through RStudio Server or from the command line. By initializing the variables in a file, we avoid having to set them each time we start R. Therefore, we will have to edit R_HOME/etc/Renviron and /etc/profile. Note: R_HOME is the location where R is installed and is given by the R command R.home().

For the first variable, HADOOP_STREAMING, obtaining the path is a bit more complicated and depends on the HDP version you are using. Specifically, there are two cases I’ve identified: HDP 2.0 - HDP 2.1 vs. HDP 2.2 (which should also cover future versions). Before we begin, figure out the version number of the hadoop streaming file by using:

# Search for where it is located via:
find / -name 'hadoop-streaming*.jar'

The file path should begin with /usr/lib/hadoop-mapreduce/... or /usr/hdp/<current version>/hadoop-mapreduce/...

Write down the version number at the end of this file name. You will need it for the next steps.

Bash variable initialization for RStudio-Server

We set the variables such that RStudio Server is able to recognize them by modifying the Renviron file that is loaded during startup.

This is located at:

R_HOME/etc/Renviron

To obtain R_HOME, open R via the terminal and use: R.home()

The file should be at either:

/usr/lib/R/etc/Renviron

Or:

/usr/lib64/R/etc/Renviron

Access the file:

sudo vim R_HOME/etc/Renviron

To be able to type into the file, press the “insert” key on your keyboard. Use the “down arrow” or “page down” key to get to the end of the file. Then, add the following lines at the end, choosing the HADOOP_STREAMING line that matches your HDP version:

# All versions
HADOOP_CMD='/usr/bin/hadoop'

# Version 2.0-2.1 has a symbolic link
HADOOP_STREAMING='/usr/lib/hadoop-mapreduce/hadoop-streaming.jar'

# Version HDP 2.2 use (the build number, here 2.2.0.0-1084, may differ on your image):
HADOOP_STREAMING='/usr/hdp/2.2.0.0-1084/hadoop-mapreduce/hadoop-streaming.jar'

Press escape and type :wq to save the file.

Open R Studio Server web interface and execute:

Sys.getenv("HADOOP_CMD")
Sys.getenv("HADOOP_STREAMING")

This should confirm that the file paths have been set.
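
As an extra check (a minimal sketch using only base R), you can make R stop with an error if either variable came back empty:

# Fail fast if either environment variable is empty
stopifnot(nzchar(Sys.getenv("HADOOP_CMD")),
          nzchar(Sys.getenv("HADOOP_STREAMING")))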

Bash variable initialization for Command Line R

Now, we will open /etc/profile to write the information to the file:

sudo vim /etc/profile

Before you start writing file paths…

NOTE: Do not fight Linux! Use tab to autocomplete words to decrease the amount you need to type.

Pro tip: Try to use tab when the path is not ambiguous (e.g. not on hadoop-)

For HDP 2.0 - HDP 2.1, use:

# Set the HADOOP_STREAMING variable for HDP 2.0 and HDP 2.1: 
export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<YOUR VERSION>.jar  

For HDP 2.2 Preview, use:

# Set the HADOOP_STREAMING variable for HDP 2.2 Preview
# (the build number may differ; use the path you found above):
export HADOOP_STREAMING=/usr/hdp/2.2.0.0-913/hadoop-mapreduce/hadoop-streaming-2.6.0.2.2.0.0-913.jar  

The second variable, HADOOP_CMD, is straightforward to set on all HDP versions.

# Set the HADOOP_CMD bash variable
export HADOOP_CMD=/usr/bin/hadoop

Save the file via :wq

To check whether the variables were set, run the following in the terminal:

echo $HADOOP_CMD
echo $HADOOP_STREAMING
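
Note that /etc/profile is only read by login shells, so if the variables come back empty in your current session, reload the file first:

# Reload /etc/profile in the current shell so the exports take effect
source /etc/profile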

RStudio-Server start and end process change

RStudio Server has been known to have respawn issues if the appropriate shutdown sequence is not followed. Note that after making the change below, subsequent boots of the image may require a manual start of the server.

The following tries to decrease the likelihood of the respawning issue within the VM.

We want to modify the runlevels at which the rstudio-server process starts and stops. Specifically, we want to delay the procedures. We are able to do so by editing /etc/init/rstudio-server.conf so that it uses [345] instead of [2345]. In essence, this delays spawning the rstudio-server instance.

Open /etc/init/rstudio-server.conf:

# Command line
sudo vim /etc/init/rstudio-server.conf

Within the file:

# Find and modify “start on runlevel [2345]” to:
start on runlevel [345]

# Start the server when init codes 3-5 are hit

# Find and modify “stop on runlevel [!2345]” to:
stop on runlevel [!345]

# Stop the server when init codes 0, 1, or 6 are given
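
To confirm the edits took effect, you can grep the file afterwards:

# Both lines should now show [345] and [!345]
grep "runlevel" /etc/init/rstudio-server.conf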

RStudio Server Error

If the appropriate shutdown procedure is not followed, the typical error is:

init: rstudio-server respawning too fast, stopped

If this error occurs, you will be unable to access RStudio in your browser until you manually start RStudio Server by running the following command in the HDP shell:

sudo rstudio-server start

Installing R Packages

First, open R using:

sudo R

Using the sudo prefix here is VERY important since this will place the packages in the system-wide library (e.g. /usr/lib64/R/library) instead of a user-specific library.

Then within R type:

install.packages(c("Rcpp","RJSONIO","bitops","digest","functional","itertools","reshape2","stringr","plyr","caTools"),
                 repos='http://cran.revolutionanalytics.com')
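
# Optional: confirm the packages are now visible to R (should print TRUE)
pkgs = c("Rcpp","RJSONIO","bitops","digest","functional","itertools",
         "reshape2","stringr","plyr","caTools")
all(pkgs %in% rownames(installed.packages()))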

# Exit R
q()

Installing rmr2 and other Revolution Analytics Hadoop-to-R technology

If you are interested in obtaining the latest version of rmr2 or other Hadoop-to-R technology, then check out the official download page on GitHub for Revolution Analytics software packages.

To complete the guide, we will install rmr2 in shell (e.g. not in R):

# Download the latest version
wget -O /tmp/rmr2_3.3.0.tar.gz https://github.com/RevolutionAnalytics/rmr2/raw/master/build/rmr2_3.3.0.tar.gz

# Trigger install via shell R command. 
# Make sure to use sudo to place in system library!
sudo R CMD INSTALL /tmp/rmr2_3.3.0.tar.gz
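
A quick check (using Rscript, which ships with R) that the package is now visible system-wide:

# Should print the installed rmr2 version, e.g. '3.3.0'
Rscript -e 'packageVersion("rmr2")'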

We will also need to set up a folder to store logs and ensure we have read and write privileges to it.

# Create the log file directory recursively
# (e.g. if any directory along the path is missing, create it)
sudo mkdir -p /var/log/hadoop/rstudio/

# !!THIS NEXT COMMAND IS VERY DANGEROUS FOR A PRODUCTION ENVIRONMENT!!
# A better solution is to remap where the hadoop logs are sent.
# e.g. modify hadoop-env.sh by adding to the end of the file
# the line: export HADOOP_LOG_DIR=<Your Location>
  
# Allow ANYONE to read, write, and execute files within the directory
sudo chmod -R 777 /var/log/hadoop/rstudio
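
For reference, the safer remapping mentioned above might look like this (a sketch; /etc/hadoop/conf/hadoop-env.sh is the usual HDP location, but the path may differ on your image):

# Point Hadoop logs at the new directory instead of opening up permissions
echo 'export HADOOP_LOG_DIR=/var/log/hadoop/rstudio' | sudo tee -a /etc/hadoop/conf/hadoop-env.sh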

Quick Check

Here is a quick way to check whether the package has been set up correctly.

Note: The following is a modification of the first example on the rmr2 tutorial page on GitHub.

Here is some basic R code:

# Create an R vector with values ranging from 1 to 1000
small.ints = 1:1000
# Apply the function to each element of small.ints
sapply(small.ints, function(x) x^2)

Here is the same computation written in R, but run as a MapReduce job via Hadoop:

# Load the rmr2 package
library(rmr2)

# Write an R object to the HDFS backend
small.ints = to.dfs(1:1000)

# Create a map reduce job using data on the hdfs backend
small.ints.job =  mapreduce(
                  input = small.ints, 
                  map = function(k, v) cbind(v, v^2))

# Retrieve the results from hdfs
small.ints.df = from.dfs(small.ints.job)

# Results come back as a list with two components:
# $key (not supplied here, so it will be NULL)
# $val (the values)

# Display the top 6 observations from results
head(small.ints.df$val)
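
As a final sanity check (a sketch; Hadoop does not guarantee the order of the returned rows, hence the sort), you can compare against the local computation:

# Sort by the original value before comparing, then check against sapply's answer
v = small.ints.df$val
v = v[order(v[, 1]), ]
all.equal(unname(v[, 2]), (1:1000)^2)  # should be TRUE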

For some fun, check out these tutorials on using Hadoop within R!

Thanks for reading.