AWS EMR

Joshua Callis
3 min read · Apr 24, 2023

If you’d like a quick, to-the-point blog on EMR, continue :) If you’d like a very detailed blog on EMR, see: https://www.zuar.com/blog/what-is-amazon-emr/

Overview

  • Elastic MapReduce.
  • Managed Hadoop framework running on EC2.
  • Includes Spark, Presto, Hive, etc.
  • EMR Notebooks — Develop scripts/code.
  • Many integrations with AWS resources.
  • CloudWatch to monitor cluster performance.
  • AWS IAM to configure permissions.
  • Data Pipeline to schedule and start the cluster(s).
  • CloudTrail to audit requests.
  • Charges by the hour.
  • Can add and remove task nodes on the fly (increases processing capacity).
  • Can resize cluster core nodes (increases processing and HDFS capacity).
  • Core nodes can be added/removed, but risk of data loss.
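As a sketch of the “add and remove task nodes on the fly” point above, this is roughly the shape of the payload EMR’s ModifyInstanceGroups API expects (the cluster and instance-group IDs below are hypothetical placeholders):

```python
# Shape of a request to resize an EMR instance group on the fly.
# In practice you would send this via boto3:
#   boto3.client("emr").modify_instance_groups(**resize_request)
# The IDs below are hypothetical placeholders.

def build_resize_request(cluster_id, instance_group_id, new_count):
    """Build the parameters for an EMR instance-group resize."""
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {
                "InstanceGroupId": instance_group_id,
                "InstanceCount": new_count,  # target number of nodes
            }
        ],
    }

resize_request = build_resize_request("j-XXXXXXXXXXXXX", "ig-XXXXXXXXXXXXX", 10)
print(resize_request)
```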

EMR Cluster

  • Collection of EC2 instances that are running Hadoop.
  • Each instance is called a node; each node has a role in the cluster, called the node type. There are 3 types:
  • Master node
    • Tracks the status of tasks and monitors cluster health.
  • Core node
    • Hosts HDFS data and runs tasks; can be scaled up & down.
    • Core nodes are doing the actual work.
  • Task node
    • Does not store data.
    • If you need to add more processing capacity but not storage, a task node is a good option.
    • No risk of data loss (as it doesn’t store data), so it’s a great use of spot instances.
  • Specify storage when creating a cluster: either S3 or HDFS (the underlying file system of Hadoop).
  • Specify output to S3 or other location.
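The three node types above map onto the InstanceGroups section of an EMR RunJobFlow (create-cluster) request. A minimal sketch of that structure, with illustrative instance types and counts (not recommendations):

```python
# Sketch of the InstanceGroups portion of an EMR RunJobFlow request,
# showing the three node types. Instance types/counts are assumptions.

instance_groups = [
    {
        "Name": "Master",
        "InstanceRole": "MASTER",   # tracks tasks, monitors cluster health
        "InstanceType": "m5.xlarge",
        "InstanceCount": 1,
    },
    {
        "Name": "Core",
        "InstanceRole": "CORE",     # hosts HDFS data and runs tasks
        "InstanceType": "m5.xlarge",
        "InstanceCount": 2,
    },
    {
        "Name": "Task",
        "InstanceRole": "TASK",     # processing only, no HDFS storage
        "InstanceType": "m5.xlarge",
        "InstanceCount": 2,
        "Market": "SPOT",           # no data-loss risk, so spot is safe
    },
]

roles = [g["InstanceRole"] for g in instance_groups]
print(roles)
```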

EMR Usage

Transient cluster

  • Will automatically terminate when all the processing is done; great for saving money.
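A transient cluster boils down to two settings in the RunJobFlow request; a minimal sketch (values shown are the ones that produce the auto-terminate behaviour):

```python
# Sketch: the two RunJobFlow settings that make a cluster transient.
# With KeepJobFlowAliveWhenNoSteps=False, the cluster terminates itself
# once all submitted steps finish.

transient_settings = {
    "Instances": {
        "KeepJobFlowAliveWhenNoSteps": False,  # shut down when steps are done
        "TerminationProtected": False,         # allow the shutdown to happen
    },
}
print(transient_settings["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```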

Long running clusters

  • Manually terminated.
  • Data warehouse, periodically processing large data sets.
  • Great when you process data often and don’t want to keep creating new transient clusters.
  • Can spin up task nodes using spot instances to increase capacity.
  • Able to use reserved EC2 instances to save cost.
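Spinning up spot task nodes on a long-running cluster uses the AddInstanceGroups API; a sketch of that request shape (the cluster ID, counts and bid price are hypothetical):

```python
# Sketch of an AddInstanceGroups request that bolts spot task nodes onto
# a long-running cluster for extra capacity. IDs/values are hypothetical.
# In practice: boto3.client("emr").add_instance_groups(**spot_task_request)

spot_task_request = {
    "JobFlowId": "j-XXXXXXXXXXXXX",    # the long-running cluster's ID
    "InstanceGroups": [
        {
            "Name": "ExtraTaskCapacity",
            "InstanceRole": "TASK",    # no HDFS data, so spot is low-risk
            "InstanceType": "m5.xlarge",
            "InstanceCount": 4,
            "Market": "SPOT",
            "BidPrice": "0.10",        # max hourly spot price in USD (assumed)
        }
    ],
}
print(spot_task_request["InstanceGroups"][0]["Market"])
```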

EMR Storage

HDFS

  • Stores multiple copies of files across multiple instances, protecting from data loss. If one instance fails, another node can replace it and simply continue processing data.
  • Data is stored in blocks; by default the block size is 128 MB.
  • Ephemeral — Data is lost when a cluster is terminated.
  • Processing of data is done on the same node where the block of data is stored, great optimisation.
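A quick worked example of the block-size point: with the default 128 MB block size, a file is split into ceil(size / 128 MB) blocks.

```python
import math

# Worked example of HDFS block maths: with the default 128 MB block
# size, how many blocks does a file of a given size occupy?

BLOCK_SIZE_MB = 128  # HDFS default

def blocks_for(file_size_mb):
    """Number of HDFS blocks a file of this size occupies."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(blocks_for(1000))  # a 1000 MB file -> 8 blocks (7 full + 1 partial)
```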

EMRFS

  • Provides persistent storage, i.e. it is still available after a cluster terminates.
  • EMRFS, allows us to access S3 as if it were HDFS.
  • S3 itself is now strongly consistent. Previously, EMRFS had a “consistent view” option, which used DynamoDB to track file consistency, e.g. when multiple nodes in the cluster hit the same file simultaneously. It was a pain, though, as we had to increase read/write capacity on DynamoDB to ensure latency wasn’t affected.
  • Not as optimised as HDFS (but still very quick), since data in S3 is not local to the node doing the processing. HDFS, by contrast, can process data on the same node where it is stored, as previously mentioned.
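The “access S3 as if it were HDFS” point in practice just means swapping the URI scheme in your Spark/Hadoop code. A sketch (bucket and paths are hypothetical; the actual read call is shown as a comment since it needs a running Spark cluster):

```python
# Sketch of the EMRFS idea: on EMR, the same Spark/Hadoop code can point
# at HDFS or S3 just by changing the URI scheme. Paths are hypothetical.

hdfs_path = "hdfs:///data/events/"           # ephemeral, dies with the cluster
emrfs_path = "s3://my-bucket/data/events/"   # persistent, survives the cluster

# On the cluster you would use either path interchangeably, e.g.:
#   df = spark.read.parquet(emrfs_path)

for path in (hdfs_path, emrfs_path):
    scheme = path.split("://")[0]
    print(scheme)
```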

EBS for HDFS

  • Allows HDFS on EBS-only instance types (e.g. M4, C4), which have no local instance storage.
  • Ephemeral: data is lost when the cluster is deleted.
  • Storage capacity cannot be increased; EBS volumes can only be attached when the cluster/nodes are created.
  • If we manually detach an EBS volume, EMR treats that as a failure and replaces it.
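Because the volumes can only be attached at creation time, the EBS setup lives inside the instance-group definition in the RunJobFlow request. A sketch, with illustrative volume type and size:

```python
# Sketch of the EbsConfiguration block attached to an instance group at
# cluster creation time (the only point these volumes can be attached).
# Volume type/size are illustrative assumptions.

core_group_with_ebs = {
    "Name": "Core",
    "InstanceRole": "CORE",
    "InstanceType": "m4.xlarge",   # an EBS-only instance type
    "InstanceCount": 2,
    "EbsConfiguration": {
        "EbsBlockDeviceConfigs": [
            {
                "VolumeSpecification": {
                    "VolumeType": "gp2",
                    "SizeInGB": 100,
                },
                "VolumesPerInstance": 1,
            }
        ],
    },
}

spec = core_group_with_ebs["EbsConfiguration"]["EbsBlockDeviceConfigs"][0]
print(spec["VolumeSpecification"]["SizeInGB"])
```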

EMR Managed Scaling

  • Supports Spark, Hive, and YARN workloads.
  • Scaling up adds core nodes first, then task nodes, up to the specified maximum.
  • Scaling down removes task nodes first, then core nodes, down to the specified minimum.
  • Can be applied across a fleet.
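Those min/max limits are set via the PutManagedScalingPolicy API; a sketch of that request (the cluster ID and capacity numbers are illustrative):

```python
# Sketch of a PutManagedScalingPolicy request: managed scaling operates
# between the limits below, adding core nodes first and removing task
# nodes first, as described above. Values are illustrative assumptions.
# In practice: boto3.client("emr").put_managed_scaling_policy(**scaling_request)

scaling_request = {
    "ClusterId": "j-XXXXXXXXXXXXX",   # hypothetical cluster ID
    "ManagedScalingPolicy": {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,      # scale-down floor
            "MaximumCapacityUnits": 10,     # scale-up ceiling
            "MaximumCoreCapacityUnits": 4,  # cap on core (HDFS) nodes
        },
    },
}

limits = scaling_request["ManagedScalingPolicy"]["ComputeLimits"]
print(limits["MaximumCapacityUnits"])
```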


Joshua Callis

Converted DevOps Engineer at oso.sh, previously a Senior Software Engineer.