Intro

In the previous entry, we discussed the web components of the STATS@UIUC Big Data Image and how to save its state. This post covers the remaining information: the environment variables available, compilation instructions for Hadoop jobs, and known issues with the image.

Environment Variables

The image has been modified to define several variables that hold path information for key Hadoop component locations. These variables are not created by a standard Hadoop installation.

Note: The results below are based on the file layout of the Hortonworks Data Platform 2.2 image.

  1. $HADOOP_CMD
    • Points to where Hadoop is installed.
    • Result: /usr/bin/hadoop
  2. $HADOOP_STREAMING
    • Points to the Hadoop streaming jar location.
    • Result: /usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-streaming.jar
  3. $HADOOP_CONF
    • Points to the Hadoop configuration directory.
    • Result: /etc/hadoop/conf
  4. $HADOOP_EXAMPLES
    • Points to the Hadoop examples jar location.
    • Result: /usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-mapreduce-examples.jar
  5. $RLIB_HOME
    • Points to the system-wide library for R.
    • Result: $(R RHOME)
  6. $JAVAC_HADOOP_PATH
    • Points to all Hadoop libraries necessary for standalone compilation.
    • Result: $(hadoop classpath)
  7. $JAVA_TOOLS
    • Points to the Java compilation tools jar.
    • Result: $JAVA_HOME/lib/tools.jar
  8. $UIUC_IMAGE_VERSION
    • Provides a version string for the current image. Helpful if we need to patch the image.
    • Result: STAT490 Image Version: 1.1
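To see what these variables resolve to on your copy of the image, you can simply echo them from the HDP shell. A minimal sketch:

# Print a few of the image-provided variables to confirm they are set
echo "Hadoop binary:  $HADOOP_CMD"
echo "Streaming jar:  $HADOOP_STREAMING"
echo "Examples jar:   $HADOOP_EXAMPLES"
echo "Image version:  $UIUC_IMAGE_VERSION"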

If you modify one of the above variables, it will be reset to its default value the next time you log in.

One other variable of interest is:

  • $JAVA_HOME
    • Points to the Java JDK installation.

These variables will come in handy if you are compiling Java code for MapReduce, using RHadoop, or exploring the image.
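For example, $HADOOP_EXAMPLES lets you run the bundled wordcount demo without compiling anything yourself. A minimal sketch, assuming the input directory created in the setup code below already exists on HDFS and the output directory does not:

# Run the prebuilt wordcount example straight from the examples jar
hadoop jar $HADOOP_EXAMPLES wordcount /user/rstudio/example/input /user/rstudio/example/output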

Compiling a MapReduce job via Java

Below, we present generalized code for each compile option.

For an actual application of the compile options, please see the following bash script, java_example.sh, which demonstrates both options using the Hadoop wordcount example.

Shared Setup code

Both options rely on this underlying setup code:

# Create where the input data should go
hdfs dfs -mkdir -p /user/rstudio/example/input

# Place data on HDFS (copy the directory contents so the files land
# directly under /user/rstudio/example/input rather than in a nested folder)
hdfs dfs -put ~/workspace/example/input/* /user/rstudio/example/input

# Note: We do not create an output directory. Hadoop does that for us!

# Create a directory to put the compiled java files in.
mkdir class_files
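Before compiling anything, it is worth confirming that the data actually landed on HDFS. A quick check:

# List the input directory on HDFS; the files from ~/workspace/example/input should appear
hdfs dfs -ls /user/rstudio/example/input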

Option 1: Standalone compile

  1. Set the javac path to include the Hadoop libraries
# $JAVAC_HADOOP_PATH is provided as an environment variable
export HADOOP_CLASSPATH=$JAVAC_HADOOP_PATH
  2. Compile the Java file using javac
javac -classpath ${HADOOP_CLASSPATH} -d class_files/ JavaProgramName.java
# -classpath provides the libraries required by Hadoop
# -d directs compiled results to a folder

Option 2: Using Hadoop

  1. Set the classpath to include Java's tools.jar
# $JAVA_TOOLS is provided as an environment variable
export HADOOP_CLASSPATH=$JAVA_TOOLS
  2. Compile the Java file using hadoop
hadoop com.sun.tools.javac.Main JavaProgramName.java -d class_files/
# com.sun.tools.javac.Main invokes the Java compiler's entry point
# -d specifies the output directory for class files

Shared job code

  1. Create a jar file (i.e., a Java-specific zip archive) from the .class files generated during compilation
jar -cf jar_archive_name.jar -C class_files/ .
# -c create a new archive
# -f specify the archive name
# -C change to the given directory before adding files
# . adds all files from that directory
  2. Run the job
hadoop jar jar_archive_name.jar JavaProgramName /user/rstudio/example/input /user/rstudio/example/output
  3. Display the results
hdfs dfs -cat /user/rstudio/example/output/part-r-00000
# Hadoop normally writes reduce output to files named part-r-#####
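Note that a job configured with more than one reducer produces several part-r-##### files. To be safe, you can list the output directory first, or cat every part file at once. A sketch:

# See every file the job wrote, including _SUCCESS and all part files
hdfs dfs -ls /user/rstudio/example/output

# Concatenate all reduce outputs regardless of how many there are
hdfs dfs -cat /user/rstudio/example/output/part-r-*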

Known Issues

RStudio Server respawn issue

RStudio Server has been known to have respawn issues if the appropriate shutdown sequence from the previous post is not followed.

The typical error is:

init: rstudio-server respawning too fast, stopped

If this error occurs, you will be unable to access RStudio from your browser.

[Screenshot: RStudio not available]

To fix this, we need to manually start rstudio-server by running the following command in the HDP shell:

sudo rstudio-server start
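If the server still refuses to come up, rstudio-server also ships a self-check command that can point at configuration problems. As far as I know, this is available in the version bundled with the image:

sudo rstudio-server verify-installation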

RStudio session hadabend error

Another consequence of skipping the appropriate shutdown sequence is a hadabend error.

[Screenshot: RStudio hadabend error dialog]

When the improper shutdown sequence is used, R session data, such as variables and user-defined functions, is lost.

To resolve the error, just press OK and make sure you use the proper shutdown sequence.

Connection Refused Error

During the start up process, you may see Zookeeper indicate that a connection has been refused.

Call From sandbox.hortonworks.com/10.0.2.15 to sandbox.hortonworks.com:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

If your Hadoop jobs are failing, we suggest restarting the image with only Chrome and VirtualBox open. In particular, make sure no messaging applications such as Skype or Lync are running, as these applications have been known to claim ports that the image needs.

Otherwise, there is no cause for concern.
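If you do suspect a port conflict, one quick check inside the HDP shell is whether the NameNode is actually listening on port 8020, the port named in the error above. A sketch using standard Linux tools:

# Show whether any process is listening on port 8020 (the HDFS NameNode port)
sudo netstat -tlnp | grep 8020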

More?

Have you experienced an issue with the image that is not listed above? Or did I make a grammatical / spelling error?

If so, please let me know.

Resolved issues

Please see the STAT 490 GitHub repository for update notes.