Intro

This guide provides helpful information on working with the STATS@UIUC Big Data Image. Because it is specific to this image, some of the material covered here will not apply to a different image or to a production environment. However, the underlying concepts remain relevant.

Background Information on Image Software

The STATS@UIUC Big Data Image is a modification of the Hortonworks Data Platform v2.2 VirtualBox image. The following modifications have been made to the image:

  1. The latest version of the R programming language (v3.1.2) has been installed.
  2. RStudio Server has been installed so that students can use RStudio, the department's standard IDE.
  3. Parts of Revolution Analytics’ RHadoop ecosystem, such as rmr2, rhdfs, plyrmr, and memoise, have been installed.
  4. Many key High Performance Computing (HPC) packages for R have been installed, such as Rcpp, bigmemory, ff, ffbase, foreach, iterators, doMC, doSNOW, and itertools.
  5. Python v2.7.9, pip v6.0.8, and easy_install v12.1 have been installed as the default python, pip, and easy_install on CentOS 6.6. The image preserves yum’s dependency on Python v2.6.6.
  6. Several environment variables have been set to make certain features within the image easier to use.
  7. A non-root account (rstudio) with SSH and sudo access has been created.
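
One quick way to confirm that the software listed above is present is to ask each tool for its version from a shell inside the image. The snippet below is a minimal sketch, not part of the image itself: the helper name `report_version` is hypothetical, and it assumes that `R`, `python`, `pip`, and `easy_install` are on the `PATH` of the account being used and that each accepts `--version`.

```shell
#!/bin/sh
# Hypothetical helper: print the first line of a tool's version output,
# or report that the tool is not on the PATH.
report_version() {
    tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        # Python 2 writes its version to stderr, so merge both streams.
        printf '%s: %s\n' "$tool" "$("$tool" --version 2>&1 | head -n 1)"
    else
        printf '%s: not found\n' "$tool"
    fi
}

# Tools named in the modification list above.
for tool in R python pip easy_install; do
    report_version "$tool"
done
```

If a tool is reported as not found, or its version differs from the one listed above, the shell may be picking up a different installation than the image's defaults.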

Table of Contents

Due to the size of some of the image files, the guide has been split up into the following sections:

  1. Installing and Using the STATS@UIUC Big Data Image
  2. Working within shell and SSHing into the STATS@UIUC Big Data Image
  3. Web Interfaces and Saving within the STATS@UIUC Big Data Image
  4. Environment Variables, Compiling a MapReduce Job via Java, and Known & Resolved Issues within the STATS@UIUC Big Data Image

Questions? E-mail me