Introduction

Of the three major cloud providers, GCP, AWS, and Azure, AWS is the oldest player in this game and the most trusted one. It has well-equipped infrastructure to support Big Data, application, Machine Learning, and other workloads. Now let’s explore how to implement a Big Data framework on AWS using an EMR (Elastic MapReduce) cluster.

Learning Objectives

  1. What is EMR?
  2. Advantages of EMR
  3. Architecture of EMR
  4. Deployment options in EMR
  5. Creating a Cluster in EMR

What is EMR?

Amazon EMR (Elastic MapReduce) is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. Amazon EMR makes it easy to set up, operate, and scale your big data environments by automating time-consuming tasks like provisioning clusters. With EMR you can run petabyte-scale analysis at less than half of the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark.

Advantages of EMR

  1. Easy to use: One can use EMR Studio, an integrated development environment (IDE), to easily develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. EMR Studio uses AWS Single Sign-On and allows you to log in directly with your corporate credentials. It provides fully managed Jupyter Notebooks and collaboration with peers using code repositories such as GitHub and Bitbucket.
  2. Low Cost: EMR pricing is simple and predictable: You pay a per-instance rate for every second used, with a one-minute minimum charge. You can launch a 10-node EMR cluster for as little as $0.15 per hour.
  3. Elastic: Unlike the rigid infrastructure of on-premises clusters, EMR decouples compute and storage, giving you the ability to scale each independently and take advantage of the tiered storage of Amazon S3. With EMR, you can provision one, hundreds, or thousands of compute instances or containers to process data at any scale.
  4. Reliable: Spend less time tuning and monitoring your cluster. EMR is tuned for the cloud and constantly monitors your cluster — retrying failed tasks and automatically replacing poorly performing instances. Clusters are highly available and automatically failover in the event of a node failure.
  5. Secure: EMR automatically configures EC2 firewall settings, controlling network access to instances, and launches clusters in an Amazon Virtual Private Cloud (VPC). Server-side encryption or client-side encryption can be used with the AWS Key Management Service or your own customer-managed keys.
  6. Flexible: You have complete control over your EMR clusters and your individual EMR jobs. You can launch EMR clusters with custom Amazon Linux AMIs and easily configure the clusters using scripts to install additional third party software packages. EMR enables you to reconfigure applications on running clusters on the fly without the need to relaunch clusters.

Architecture of EMR

Amazon EMR is an AWS service that allows users to launch and use resizable Hadoop clusters inside of Amazon’s infrastructure. Like Hadoop and Spark, Amazon EMR can be used to analyze large data sets, and it greatly simplifies the setup and management of the Hadoop and MapReduce components in the cluster. EMR uses Amazon’s prebuilt and customized EC2 instances, which can take full advantage of Amazon’s infrastructure and other AWS services. These EC2 instances are invoked when we start a new Job Flow to form an EMR cluster. A Job Flow is Amazon’s term for the complete data processing that occurs through a number of compute steps in Amazon EMR. A Job Flow is specified by the MapReduce or Spark application and its input and output parameters.
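To make the Job Flow idea concrete, here is a minimal sketch using the boto3 Python SDK that submits a Spark step to an existing cluster. The region, cluster ID, and S3 script path are placeholder values, not part of the original walkthrough.

import boto3

# Assumes AWS credentials are already configured; region, cluster ID, and S3 path are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

# Submit a single Spark step to the Job Flow.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "example-spark-step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/scripts/job.py"],
            },
        }
    ],
)
print(response["StepIds"])  # IDs of the newly added steps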

The master, core, and task cluster groups perform the following key functions in the Amazon EMR cluster:

Master group instance

The master group instance manages the Job Flow and allocates all the needed executables, JARs, scripts, and data shards to the core and task instances. The master node monitors the health and status of the core and task instances and also collects the data from these instances and writes it back to Amazon S3. The master group instances serve a critical function in our Amazon EMR cluster.

Core group instance

Core group instance members run the map and reduce portions of our Job Flow, and store intermediate data to the Hadoop Distributed File System (HDFS) storage in our Amazon EMR cluster. The master node manages the tasks and data delegated to the core and task nodes. Due to the HDFS storage aspects of core nodes, a loss of a core node will result in data loss and possible failure of the complete Job Flow.

Task group instance

The task group is optional. It can take on some of the computational work of the map and reduce jobs, but it does not have HDFS storage for the data and intermediate results. Because these instances have no HDFS storage, the data must be transferred to them by the master for the task group to do its part of the Job Flow.
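To show how these three roles are expressed in practice, below is an illustrative instance-group layout in the format accepted by the boto3 run_job_flow call (shown later in this article). The instance types and counts are placeholders, not a recommendation.

# Illustrative master/core/task layout; instance types and counts are placeholders.
instance_groups = [
    {
        "Name": "Master group",
        "InstanceRole": "MASTER",    # manages the Job Flow and monitors the other nodes
        "InstanceType": "m5.xlarge",
        "InstanceCount": 1,
    },
    {
        "Name": "Core group",
        "InstanceRole": "CORE",      # runs tasks and hosts HDFS storage
        "InstanceType": "m5.xlarge",
        "InstanceCount": 2,
    },
    {
        "Name": "Task group",        # optional: compute only, no HDFS storage
        "InstanceRole": "TASK",
        "InstanceType": "m5.xlarge",
        "InstanceCount": 2,
    },
]

This list would then be passed as the InstanceGroups value inside the Instances parameter of run_job_flow.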

Deployment options in EMR

Three different deployment options are available in EMR:

  1. Amazon EMR on Amazon EC2: You can deploy EMR on Amazon EC2 and take advantage of On-Demand, Reserved, and Spot Instances. EMR handles the provisioning, management, and scaling of the EC2 instances. AWS offers more instance options than any other cloud provider, allowing you to choose the instance that gives you the best performance or cost for your workload.
  2. Amazon EMR on Amazon EKS: You can use EMR to run Apache Spark jobs on-demand on Amazon Elastic Kubernetes Service (EKS), without needing to provision EMR clusters, to improve resource utilization and simplify infrastructure management. Amazon EKS gives you the flexibility to start, run, and scale Kubernetes applications in the AWS cloud or on-premises (see the sketch after this list).
  3. Amazon EMR on AWS Outposts: Amazon EMR is available on AWS Outposts, allowing you to set up, deploy, manage, and scale EMR in your on-premises environments, just as you would in the cloud. AWS Outposts brings AWS services, infrastructure, and operating models to virtually any data center, co-location space, or on-premises facility.
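As a rough illustration of the EMR on EKS option, the boto3 sketch below submits a Spark job through the emr-containers API. It assumes an EMR on EKS virtual cluster and a job execution role already exist; the virtual cluster ID, role ARN, and S3 path are placeholders.

import boto3

# Assumes an existing EMR on EKS virtual cluster and job execution role (all values are placeholders).
emr_containers = boto3.client("emr-containers", region_name="us-east-1")

response = emr_containers.start_job_run(
    name="example-spark-job",
    virtualClusterId="<virtual-cluster-id>",
    executionRoleArn="arn:aws:iam::111122223333:role/EMRContainersJobExecutionRole",
    releaseLabel="emr-6.2.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://my-bucket/scripts/job.py",
            "sparkSubmitParameters": "--conf spark.executor.instances=2",
        }
    },
)
print(response["id"])  # ID of the submitted job run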

Creating a Cluster in EMR

Create your key

Click on Services, then click on EC2.

Scroll down and click on Key Pairs. Inside Key Pairs, click on “Create key pair”.

Enter a key pair name such as mykeypair, choose ppk as the file format (the format PuTTY uses on Windows), then click on Create key pair.
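If you prefer to script this step, a key pair can also be created with boto3. The sketch below saves the private key as a .pem file (Windows users would then convert it to .ppk with PuTTYgen); the key name and region are just examples.

import boto3

# Create an EC2 key pair and save the private key locally; key name and region are examples.
ec2 = boto3.client("ec2", region_name="us-east-1")
key_pair = ec2.create_key_pair(KeyName="mykeypair")

# The private key material is returned only once, so write it to disk immediately.
with open("mykeypair.pem", "w") as f:
    f.write(key_pair["KeyMaterial"])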

Open the AWS Management Console and search for EMR Service.

Now click on the Create button to create a new EMR cluster.

Provide a name for the cluster, such as MyCluster, then select the applications according to your requirements. You can also go to the advanced options and select the specific Big Data components you need, such as JupyterHub.

In our case, choose release emr-6.2.0 and select Hadoop 3.2.1, JupyterHub 1.1.0, Zeppelin 0.9.0, and Spark 3.0.1.

Keep the other settings at their defaults on this page, click on Next, and set the cluster nodes and instances. Set the instance count to 1 for Master and 0 for Core.

Scroll down a bit, increase the EBS volume size to 100 GB, and click on the Next button.

Click on Next. If you want to change the cluster name, you can do so here. Untick Termination protection for this demo.

Now click on Create Cluster.

It will take 2-5 minutes to create the cluster.
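The same cluster can also be created programmatically. The boto3 sketch below is a rough equivalent of the console settings used above (release emr-6.2.0, the four applications, a single master node, 100 GB EBS); the instance type and subnet ID are placeholders, and it assumes the default EMR service roles already exist in your account.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Rough boto3 equivalent of the console settings above; instance type and subnet ID are placeholders.
response = emr.run_job_flow(
    Name="MyCluster",
    ReleaseLabel="emr-6.2.0",
    Applications=[
        {"Name": "Hadoop"},
        {"Name": "JupyterHub"},
        {"Name": "Zeppelin"},
        {"Name": "Spark"},
    ],
    Instances={
        "InstanceGroups": [
            {
                "Name": "Master group",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
                "EbsConfiguration": {
                    "EbsBlockDeviceConfigs": [
                        {
                            "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 100},
                            "VolumesPerInstance": 1,
                        }
                    ]
                },
            }
        ],
        "Ec2KeyName": "mykeypair",
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
        "Ec2SubnetId": "subnet-XXXXXXXX",
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])  # ID of the new cluster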

Once the cluster is created, let’s connect to it using SSH. There are two options here: connecting from Windows or connecting from a Mac. First, let’s connect to the cluster from Windows.

For that, copy the Master Public DNS of your cluster.
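If you are scripting the setup instead, the master public DNS can be fetched with boto3 once the cluster is ready; the cluster ID below is a placeholder for the ID returned when the cluster was created.

import boto3

emr = boto3.client("emr", region_name="us-east-1")
cluster_id = "j-XXXXXXXXXXXXX"  # placeholder: the ID of the cluster created above

# Wait until the cluster is up, then read its master public DNS name.
emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
cluster = emr.describe_cluster(ClusterId=cluster_id)
print(cluster["Cluster"]["MasterPublicDnsName"])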

Set Security Group Configuration

This configuration determines which IP addresses are allowed to connect to your EMR cluster, whether over plain TCP or the SSH connections we normally use to log in to the EC2 instances.

Add an inbound SSH rule for your IP to the security group for the Master node.

For the rule’s Source, click on My IP so that only your own IP address is allowed.
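The same inbound rule can be added with boto3. In the sketch below, the security group ID stands in for the group attached to the master node, and the IP address is a placeholder for your own public IP.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Allow SSH (TCP port 22) from your own public IP only; both values are placeholders.
ec2.authorize_security_group_ingress(
    GroupId="sg-XXXXXXXX",  # security group attached to the master node
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "IpRanges": [{"CidrIp": "203.0.113.25/32", "Description": "My IP"}],
        }
    ],
)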

Now open PuTTY, go to Session, and under Host Name (or IP address) enter hadoop@<Master Public DNS>.

Now go to Connection > SSH > Auth, click on Browse, select the mykeypair.ppk file, then click on Open.

Accept the security alert about the server’s host key by clicking Yes.

We have now connected to the EMR cluster using PuTTY on Windows.

Now let’s explore how to connect to the EMR cluster from a Mac.

Note: If you created a .ppk key pair for PuTTY, you will need a .pem version of the key (you can export one using PuTTYgen) to connect with ssh from a Mac or Linux machine.

Open a terminal window. On Mac OS X, choose Applications > Utilities > Terminal. On other Linux distributions, terminal is typically found at Applications > Accessories > Terminal.

To establish a connection to the master node, type the following command. Replace ec2-###-##-##-###.compute-1.amazonaws.com with the master public DNS name of your cluster and replace ~/mykeypair.pem with the location and file name of your .pem file.

ssh hadoop@ec2-###-##-##-###.compute-1.amazonaws.com -i ~/mykeypair.pem

A warning states that the authenticity of the host you are connecting to cannot be verified. Type yes to continue.

When you are done working on the master node, type the following command to close the SSH connection.

exit

Copy the Public IPv4 address, Public IPv4 DNS, and Private IPv4 DNS.

Before launching JupyterHub, you need to add an entry to the hosts file on your machine: C:\Windows\System32\drivers\etc\hosts on Windows or /private/etc/hosts on a Mac.

The entry is a single line that maps the master node’s public IP address to both of its DNS names, in the form:

<Public IPv4 address>  <Public IPv4 DNS>  <Private IPv4 DNS>

Click on Application User Interfaces, then click on the JupyterHub user interface link.

The user name is jovyan and the password is jupyter.

Conclusion

Amazon EMR (Elastic MapReduce) is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. It is an easy-to-set-up, cost-effective option to work with.