This guide demonstrates how to install RStudio Server on Hortonworks’ VirtualBox image (HDP) and set up the R package known as RHadoop. You should already have installed Oracle’s VirtualBox and loaded the Hortonworks image into it.
The commands below use sudo liberally, even though by default you access the shell as root. sudo is listed in case your enterprise has limited your access to root.
Port Forwarding on VirtualBox
Before we start the VirtualBox image, we must first forward a port so that we can access RStudio via our browser, much like accessing Hortonworks’ web interface.
Open the Settings dialog for the image.
Open port forwarding by going to Network and then selecting Port Forwarding.
Create a new port forwarding rule.
Enter the following for the new rule:
- Name: rstudioserver
- Protocol: TCP
- Host IP: 127.0.0.1
- Host Port: 8787
- Guest Port: 8787
Accept the changes.
Start the Hortonworks VirtualBox image.
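The same forwarding rule can also be created from the host's command line instead of the Settings dialog. The following is a minimal sketch, assuming VirtualBox's VBoxManage tool is on the host's PATH, the VM's first network adapter is in NAT mode, and the VM is powered off. The function names and the VM name "Hortonworks Sandbox" are placeholders, not something taken from the image itself.

```shell
#!/bin/sh
# Build a NAT port-forwarding rule in the form VBoxManage expects:
#   <name>,<protocol>,<host ip>,<host port>,<guest ip>,<guest port>
# The guest IP field is left empty.
natpf_rule() {
    printf '%s,%s,%s,%s,,%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Apply the rule to the first network adapter of a powered-off VM.
# $1 is the VM name (hypothetical here; list yours with: VBoxManage list vms)
forward_rstudio_port() {
    VBoxManage modifyvm "$1" \
        --natpf1 "$(natpf_rule rstudioserver tcp 127.0.0.1 8787 8787)"
}

# Example (run on the host, not inside the VM):
#   forward_rstudio_port "Hortonworks Sandbox"
```

Defining the rule string in its own function makes it easy to inspect what will be passed to VBoxManage before anything is changed.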
Installing RStudio Server on Hortonworks’ image (based on CentOS 6)
Now, within the VirtualBox image, open a terminal and run the following:

    # Step 1: Install necessary programs
    # R, wget (download files), openssl098e (SSL), vim (text editor),
    # git (version control), curl (command-line tool for HTTP transfers)
    # sudo runs the command as the root user (lots of power)
    # yum is the package manager for CentOS (Red Hat)
    # The -y means yes to all.
    sudo yum -y install R git wget openssl098e vim curl

    # Step 2: Download RStudio Server
    wget -O /tmp/rstudio-server-0.98.1091-x86_64.rpm http://download2.rstudio.org/rstudio-server-0.98.1091-x86_64.rpm

    # Step 3: Install RStudio Server
    # --nogpgcheck temporarily disables the signature check of the package.
    sudo yum -y install --nogpgcheck /tmp/rstudio-server-0.98.1091-x86_64.rpm

    # Step 4: Any issues with the install?
    sudo rstudio-server verify-installation

    # Step 5: Add a user to log in to RStudio
    sudo adduser rstudio
    sudo passwd rstudio
    # New password: <your password here> (I chose rstudio)
Now you should be able to log in to RStudio Server by entering 127.0.0.1:8787 or localhost:8787 in your browser’s URL bar.
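If the page does not load, it can help to check from a terminal on the host whether anything answers on the forwarded port. This is only a sketch: the function names are made up for illustration, and it assumes curl is installed on the host and that the port forwarding rule uses 127.0.0.1 and 8787 as configured earlier.

```shell
#!/bin/sh
# Compose the URL RStudio Server is expected to answer on.
# Defaults match the forwarding rule: host 127.0.0.1, port 8787.
rstudio_url() {
    printf 'http://%s:%s/\n' "${1:-127.0.0.1}" "${2:-8787}"
}

# Return 0 if something answers HTTP on that URL within 5 seconds.
rstudio_reachable() {
    curl --silent --output /dev/null --max-time 5 "$(rstudio_url "$@")"
}

# Example:
#   if rstudio_reachable; then
#       echo "RStudio Server answers on $(rstudio_url)"
#   else
#       echo "No answer on $(rstudio_url)"
#   fi
```

A failure here points at the port forwarding rule or at the server not running, rather than at the browser.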
What RStudio looks like in your web browser (can you tell any difference vs. the client?):
When you are done using RStudio, make sure you run q() to exit the session. This will prevent the following error message from being displayed on subsequent starts:
21 Nov 2014 14:24:38 [rsession-rstudio] ERROR session hadabend;
LOGGED FROM: core::Error<unnamed>::rInit(const r::session::RInitInfo&)
In the event RStudio Server is not able to start on a subsequent run of the image, log into the shell and use the following two commands:

    # Stop any rstudio-server process
    sudo rstudio-server stop

    # Start a new rstudio-server process
    sudo rstudio-server start
Setting up rmr2 (RHadoop)
In order to install and use RHadoop, we must first set two environment variables, install some packages in R, and then install rmr2.
Bash Variable Initialization
There are two ways to let R know where certain Hadoop components are, depending on whether you access the R session through RStudio Server or from the command line. By initializing the variables in a file, we avoid having to set them each time we start R. Therefore, we will have to edit R_HOME/etc/Renviron (for RStudio Server) and /etc/profile (for command line R).
R_HOME is the location where R is installed and is given by the R command R.home().
For the first variable, HADOOP_STREAMING, obtaining the path is a bit more complicated and depends on the HDP version you are using. I’ve identified two cases: HDP 2.0 - HDP 2.1 and HDP 2.2. Searching for the file, as shown here, should also work on future versions. Before we begin, figure out the version number of the Hadoop streaming file by using:

    # Search for where it is located via:
    find / -name 'hadoop-streaming*.jar'
The file path should begin with
Write down the version number at the end of this file’s name. You will need it for the next steps.
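Instead of copying the version number by hand, it can be pulled out of the file name with plain shell parameter expansion. The helper below is a sketch for illustration (streaming_version is a made-up name), and it assumes the jar follows the hadoop-streaming-<version>.jar naming pattern.

```shell
#!/bin/sh
# Print the version embedded in a streaming jar's file name:
#   .../hadoop-streaming-<version>.jar  ->  <version>
# Prints nothing for an unversioned name such as hadoop-streaming.jar.
streaming_version() {
    name=${1##*/}        # drop the directory part
    name=${name%.jar}    # drop the .jar suffix
    case $name in
        hadoop-streaming-*) printf '%s\n' "${name#hadoop-streaming-}" ;;
    esac
}

# Example: print each jar that find reports alongside its version.
#   find / -name 'hadoop-streaming*.jar' 2>/dev/null | while read -r jar; do
#       printf '%s\t%s\n' "$jar" "$(streaming_version "$jar")"
#   done
```

An empty version in the output means the path is a plain hadoop-streaming.jar, which is usually a symbolic link to a versioned file.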
Bash variable initialization for RStudio-Server
We set the variables so that RStudio Server is able to recognize them by modifying the Renviron file that is loaded during startup.
This file is located at R_HOME/etc/Renviron. To find R_HOME, open R via the terminal and use R.home(); the file sits in the etc folder under the directory it reports.
Access the file:
sudo vim R_HOME/etc/Renviron
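To be able to type into the file, press the “insert” key on your keyboard. Use the “down arrow” or “page down” to get to the end of the file. Then, add the following two lines at the end (HADOOP_CMD plus the one HADOOP_STREAMING line that matches your HDP version):

    # All versions
    HADOOP_CMD='/usr/bin/hadoop'

    # Use only ONE of the following HADOOP_STREAMING lines.
    # Version 2.0-2.1 has a symbolic link
    HADOOP_STREAMING='/usr/lib/hadoop-mapreduce/hadoop-streaming.jar'

    # Version HDP 2.2: replace <YOUR HDP VERSION> with the directory
    # name reported by the find command above
    HADOOP_STREAMING='/usr/hdp/<YOUR HDP VERSION>/hadoop-mapreduce/hadoop-streaming.jar'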
Press escape and type :wq to save the file.
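If you would rather not edit the file interactively, the same lines can be appended from the shell. This is a sketch rather than the method used in this guide: add_renviron_var is a hypothetical helper, and R_HOME in the example stands for your actual R home directory.

```shell
#!/bin/sh
# Append NAME='value' to an Renviron-style file unless NAME is already
# assigned there. $1 = file, $2 = variable name, $3 = value
add_renviron_var() {
    if grep -q "^$2=" "$1" 2>/dev/null; then
        echo "$2 already set in $1, leaving it alone" >&2
    else
        printf "%s='%s'\n" "$2" "$3" >> "$1"
    fi
}

# Example (run as root, with R_HOME replaced by your R home directory):
#   add_renviron_var R_HOME/etc/Renviron HADOOP_CMD /usr/bin/hadoop
#   add_renviron_var R_HOME/etc/Renviron HADOOP_STREAMING \
#       /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
```

Skipping variables that are already assigned means running the commands twice does not leave duplicate entries in the file.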
Restart the R session, then open the RStudio Server web interface and execute Sys.getenv("HADOOP_CMD") and Sys.getenv("HADOOP_STREAMING").
This should verify that the file paths have been set.
Bash variable initialization for Command Line R
Now, we will open /etc/profile to write the information to the file:
sudo vim /etc/profile
Before you start writing file paths…
NOTE: Do not fight Linux! Use tab to autocomplete words to decrease the amount you need to type in.
Pro tip: Try to use tab when the path is not ambiguous (e.g. not on hadoop-)
For HDP 2.0 - HDP 2.1, use:

    # Set the HADOOP_STREAMING variable for HDP 2.0 and HDP 2.1:
    export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<YOUR VERSION>.jar

For HDP 2.2 Preview, use the path reported by the find command above, which has the form:

    # Set the HADOOP_STREAMING variable for HDP 2.2 Preview:
    export HADOOP_STREAMING=/usr/hdp/<YOUR HDP VERSION>/hadoop-mapreduce/hadoop-streaming-<YOUR VERSION>.jar
The second variable, HADOOP_CMD, is straightforward to set on all HDP versions.

    # Set the HADOOP_CMD bash variable
    export HADOOP_CMD=/usr/bin/hadoop

Save the file via :wq (press escape first).
To check whether the variables were set, log in again (or run source /etc/profile) and write the following in the terminal:

    echo $HADOOP_CMD
    echo $HADOOP_STREAMING
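Echoing the variables shows that they are set, but not that the paths exist. A small check along the following lines covers both; check_path_var is a name invented for this sketch.

```shell
#!/bin/sh
# Report whether a variable is set and names an existing file.
# $1 = variable name (for the message), $2 = its current value
check_path_var() {
    if [ -z "$2" ]; then
        echo "$1 is not set"
        return 1
    elif [ ! -e "$2" ]; then
        echo "$1 points to a missing file: $2"
        return 1
    fi
    echo "$1 OK: $2"
}

# Example:
#   check_path_var HADOOP_CMD "$HADOOP_CMD"
#   check_path_var HADOOP_STREAMING "$HADOOP_STREAMING"
```

A "missing file" message for HADOOP_STREAMING usually means the version number typed into /etc/profile does not match the jar on disk.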
RStudio-Server start and end process change
RStudio Server has been known to have respawn issues if the appropriate shutdown sequence is not followed. Note: running the image after the first import will require a manual start of RStudio Server.
The following tries to decrease the likelihood of the respawning issue within the VM.
We want to modify the runlevels at which the server starts and stops. Specifically, we want to delay the procedures. We are able to do so by editing /etc/init/rstudio-server.conf so that the runlevel list is [345] instead of [2345]. In essence, this delays spawning the rstudio-server instance.

    # Command line
    sudo vim /etc/init/rstudio-server.conf

    # Find and modify “start on runlevel [2345]” to:
    start on runlevel [345]
    # Start the server when runlevels 3-5 are reached

    # Find and modify “stop on runlevel [!2345]” to:
    stop on runlevel [!345]
    # Stop the server on any other runlevel (0, 1, 2, or 6)
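The same change can be scripted with sed instead of made by hand in vim. This is a sketch under the assumption that the shipped file contains exactly the lines "start on runlevel [2345]" and "stop on runlevel [!2345]"; delay_rstudio_runlevels is a hypothetical helper name. If the lines differ on your image, the file is left unchanged.

```shell
#!/bin/sh
# Narrow the upstart runlevels in an rstudio-server.conf-style file
# from 2-5 to 3-5. The original is kept next to it with a .bak suffix.
# $1 = path to the conf file
delay_rstudio_runlevels() {
    sed -i.bak \
        -e 's/^start on runlevel \[2345\]/start on runlevel [345]/' \
        -e 's/^stop on runlevel \[!2345\]/stop on runlevel [!345]/' \
        "$1"
}

# Example (run as root on the VM):
#   delay_rstudio_runlevels /etc/init/rstudio-server.conf
```

Keeping the .bak copy makes it easy to restore the original file if the server no longer starts at boot.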
RStudio Server Error
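Specifically, the error is tied to skipping the shutdown procedure of running q() before leaving RStudio.
The typical error is the “session hadabend” message shown earlier in this guide.
If this error happens, you will be unable to access RStudio within your browser until you manually start RStudio Server by running the following command in the HDP shell: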
sudo rstudio-server start
Installing R Packages
First, open R using sudo R.
Using the sudo prefix here is VERY important, since this places the packages in the system-wide library instead of a user-specific library (e.g. /usr/lib64/R/).
Then within R type:

    install.packages(
      c("Rcpp", "RJSONIO", "bitops", "digest", "functional", "itertools",
        "reshape2", "stringr", "plyr", "caTools"),
      repos = 'http://cran.revolutionanalytics.com')

    # Exit R
    q()
Installing rmr2 and other RevolutionAnalytics Hadoop to R technology
If you are interested in obtaining the latest version of rmr2 or other Hadoop to R technology, then check out the official download page on GitHub for Revolution Analytics software packages.
To complete the guide, we will install rmr2 in the shell (i.e. not in R):

    # Download the latest version
    wget -O /tmp/rmr2_3.3.0.tar.gz https://github.com/RevolutionAnalytics/rmr2/raw/master/build/rmr2_3.3.0.tar.gz

    # Trigger install via shell R command.
    # Make sure to use sudo to place in system library!
    sudo R CMD INSTALL /tmp/rmr2_3.3.0.tar.gz
We will also need to set up a folder to store logs and ensure we have read and write privileges to it.

    # Create the log file directory recursively
    # (i.e. if any directory is missing, then create it)
    mkdir -p /var/log/hadoop/rstudio/

    # !!THIS NEXT COMMAND IS VERY DANGEROUS FOR A PRODUCTION ENVIRONMENT!!
    # A better solution is to remap where the hadoop logs are sent,
    # e.g. modify hadoop-env.sh by adding to the end of the file
    # the line: export HADOOP_LOG_DIR=<Your Location>
    # Allow ANYONE to write to any files within the directory
    chmod -R 777 /var/log/hadoop/rstudio
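As a rough illustration of the safer remapping approach mentioned in the comments, the sketch below creates a directory that only its owner can access and prints the export line that would go at the end of hadoop-env.sh. The helper name and the example path are assumptions. Whether owner-only permissions are enough depends on which user actually runs the Hadoop jobs on your system.

```shell
#!/bin/sh
# Create a private log directory and print the line that would point
# Hadoop's logs at it. $1 = directory to create
make_hadoop_log_dir() {
    mkdir -p "$1" && chmod 700 "$1" &&
        printf 'export HADOOP_LOG_DIR=%s\n' "$1"
}

# Example (the path is only an illustration):
#   make_hadoop_log_dir /home/rstudio/hadoop-logs
# The printed line is what would be appended to hadoop-env.sh.
```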
Here is a quick way to check whether the package has been set up correctly.
Note: The following is a modification of the first example on rmr2 tutorial page on GitHub.
Here is some basic R code:

    # Create an R vector with values ranging from 1 to 1000
    small.ints = 1:1000

    # Apply the function to each element of small.ints
    sapply(small.ints, function(x) x^2)

Here is new code written in R, but it uses the MapReduce algorithm via Hadoop:

    # Load rmr2 (HADOOP_CMD and HADOOP_STREAMING must already be set)
    library(rmr2)

    # Write an R object to the hdfs backend
    small.ints = to.dfs(1:1000)

    # Create a map reduce job using data on the hdfs backend
    small.ints.job = mapreduce(
      input = small.ints,
      map = function(k, v) cbind(v, v^2))

    # Retrieve the results from hdfs
    small.ints.df = from.dfs(small.ints.job)

    # Results will be in a list form with the list structured:
    # the $key [not supplied, so it'll be null]
    # the $val (values)

    # Display the top 6 observations from results
    head(small.ints.df$val)
For some fun, check out these tutorials on using Hadoop within R!
Thanks for reading.