In the last post, we set up the AWS CLI, authorized a user account, generated a keypair, and assigned default roles for the cluster to take on. We now continue toward an AWS EMR cluster that conforms as closely as possible to the UIUC Big Data Image. We do this by first creating an S3 bucket and uploading the install files. Then, we launch the AWS EMR cluster via the AWS CLI.
Create a Bucket and Upload files to S3
Now, we need to create a bucket on S3. Amazon’s S3 service can be thought of as a hard drive. That is, everything that you wish to keep forever and ever, you will want to save to your bucket. Any files that are not in the bucket when the cluster is terminated will be lost.
To create a bucket go to the S3 Console and press
Name the bucket something short and reasonable. You will have to type it out many times.
If the bucket was successfully created, the page will update to show its presence:
Click on the name of the new bucket that you created to enter the bucket. Inside the bucket, press the
If all is well, your bucket should now look like:
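If you prefer the command line to the console, the same bucket setup can be sketched with the AWS CLI. This is only a sketch under a few assumptions: the install files are the two scripts referenced later in the cluster command (hdp_setup.sh and emr_permissions.sh), they sit in your current directory, and the function name upload_install_files is made up for illustration.

```shell
# Sketch: create the bucket and upload the install files from the CLI.
# Assumption: hdp_setup.sh and emr_permissions.sh are in the current directory.
# upload_install_files BUCKET REGION  (hypothetical helper name)
upload_install_files() {
  # Make the bucket in the chosen region.
  aws s3 mb "s3://$1" --region "$2" || return 1
  # Copy both scripts into the bucket.
  aws s3 cp hdp_setup.sh "s3://$1/hdp_setup.sh" || return 1
  aws s3 cp emr_permissions.sh "s3://$1/emr_permissions.sh" || return 1
  # List the bucket so we can confirm both files arrived.
  aws s3 ls "s3://$1/"
}
```

Calling `upload_install_files "<YOUR_BUCKET>" "<YOUR_REGION>"` (with your own values filled in) should finish by listing the two uploaded files.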
Create the cluster via the bootstrap file
Before we launch the cluster, we need to talk about pricing on AWS, specifically EMR pricing. For our purposes, we only need to use the m1 instance type. Anything larger is overkill and costly. To put the cost in perspective, the assignments every month in the course should require no more than the price of 3 Starbucks single shot grande skim 2 pumps Mocha with whip, no lid, a little cinnamon, and a little nutmeg (note: last 2 ingredients…self serve)!!
For each instance we spawn of the m1 instance type we must pay the following per hour:
General Purpose - Previous Generation
|Instance Type||EC2 Cost per Hour||EMR Cost per Hour||Total Cost per Hour|
(Rates as of 2/6/15)
The configuration script given below launches three m1.large instances (one master and two core nodes), which yields a cost of $0.657 per hour (0.219*3).
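To see how that hourly figure grows with running time, here is a small sketch of the arithmetic. The function name cluster_cost is made up for illustration, and the 0.219 rate is simply the per-instance total quoted above, so substitute current rates before relying on the numbers.

```shell
# cluster_cost RATE INSTANCES HOURS -> total cost in dollars (hypothetical helper)
# RATE is the combined EC2 + EMR price per instance per hour.
cluster_cost() {
  awk -v rate="$1" -v n="$2" -v hours="$3" \
    'BEGIN { printf "%.3f\n", rate * n * hours }'
}

cluster_cost 0.219 3 1    # prints 0.657 (one hour of the 3-node cluster)
cluster_cost 0.219 3 10   # prints 6.570 (ten hours)
```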
Note: if you issue the command below, you will be charged for as long as the cluster is active.
Prior to running the command below, replace each <YOUR_*> placeholder (<YOUR_BUCKET>, <YOUR_REGION>, and <YOUR_KEYPAIR>) with your information. Upon running this command, a Hadoop cluster will be created.
bucket="<YOUR_BUCKET>"
region="<YOUR_REGION>"
keypair="<YOUR_KEYPAIR>"
master_instance="m1.large"
slave_instance="m1.large"
num_slaves=2

aws emr create-cluster --name emr_cluster \
  --ami-version=3.3.0 \
  --applications Name=Hue Name=Hive Name=Pig \
  --region $region \
  --use-default-roles \
  --ec2-attributes KeyName=$keypair \
  --no-auto-terminate \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=$master_instance \
    InstanceGroupType=CORE,InstanceCount=$num_slaves,InstanceType=$slave_instance \
  --bootstrap-actions \
    Name=emR_bootstrap,\
Path="s3://$bucket/hdp_setup.sh",\
Args=[--emrinstall,--rstudio,--hpaths,--rhadoop,--createuser,--sudouser,--sshuser] \
  --steps \
    Name=HDFS_tmp_permission,\
Jar="s3://elasticmapreduce/libs/script-runner/script-runner.jar",\
Args="s3://$bucket/emr_permissions.sh"
The above command has been modified slightly from the AWS Lab post on EMR to match the Big Data image implementation.
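Since a forgotten placeholder means a failed (or misdirected) launch, it can be worth guarding the create-cluster call with a quick check. The following is a minimal sketch, not part of the original setup; check_placeholders is a hypothetical helper name.

```shell
# check_placeholders VALUE... (hypothetical helper)
# Returns non-zero if any value is empty or still looks like <YOUR_...>.
check_placeholders() {
  for value in "$@"; do
    case "$value" in
      ""|"<YOUR"*)
        echo "unset value: '$value'" >&2
        return 1
        ;;
    esac
  done
}
```

One way to use it would be `check_placeholders "$bucket" "$region" "$keypair" && aws emr create-cluster ...`, so the cluster is only requested once all three values have been filled in.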
If the command issued above is successful, you should receive a string back that identifies the cluster ID.
You can monitor the status of the cluster (e.g. from provisioning, to installing, to using, and to terminating) at the EMR Console.
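The same status can also be read from the AWS CLI. As a rough sketch (cluster_state is a made-up helper name), the function below asks for only the state field of a cluster:

```shell
# cluster_state CLUSTER_ID (hypothetical helper)
# Prints the current state of the cluster, e.g. STARTING, BOOTSTRAPPING,
# RUNNING, WAITING, TERMINATING, or TERMINATED.
cluster_state() {
  aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.Status.State' --output text
}
```

For example, `cluster_state j-STATSatUIUCRocks1` would print the state of that cluster. If you would rather block until the cluster is up, `aws emr wait cluster-running --cluster-id j-STATSatUIUCRocks1` returns once it is running.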
When you are done using the cluster, it is very important that you terminate it. Leaving the cluster active will rack up service fees on your AWS account.
To terminate the cluster via AWS CLI, we first need to get the list of clusters:
aws emr list-clusters
Say that in our case the cluster ID returned was j-STATSatUIUCRocks1.
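Then, depending on whether termination protection is enabled on the cluster, we need to issue the matching command(s) for that cluster ID:
# Not protected
aws emr terminate-clusters --cluster-ids j-STATSatUIUCRocks1
# Protected: turn off termination protection first, then terminate
aws emr modify-cluster-attributes --cluster-id j-STATSatUIUCRocks1 --no-termination-protected
aws emr terminate-clusters --cluster-ids j-STATSatUIUCRocks1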
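Because a cluster that fails to terminate keeps billing, it is worth confirming that the shutdown actually completed. Here is a minimal sketch for an unprotected cluster; terminate_and_wait is a hypothetical helper name.

```shell
# terminate_and_wait CLUSTER_ID (hypothetical helper)
# Requests termination, blocks until the cluster is gone, then prints its state.
terminate_and_wait() {
  aws emr terminate-clusters --cluster-ids "$1" || return 1
  # Polls until the cluster reaches the terminated state.
  aws emr wait cluster-terminated --cluster-id "$1" || return 1
  aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.Status.State' --output text
}
```

Running `terminate_and_wait j-STATSatUIUCRocks1` should end by printing the cluster's final state once the termination has gone through.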