AWS EMR
3 min readApr 24, 2023
If you’d like a quick, to-the-point blog on EMR, continue :). If you’d like a very detailed blog on EMR, see: https://www.zuar.com/blog/what-is-amazon-emr/
Overview
- Elastic MapReduce.
- Managed Hadoop framework running on EC2.
- Includes Spark, Presto, Hive, etc.
- EMR Notebooks — Develop scripts/code.
- Many integrations with AWS resources.
- CloudWatch to monitor cluster performance.
- AWS IAM to configure permissions.
- Data Pipeline to schedule and start the cluster(s).
- CloudTrail to audit requests.
- Charged per second of cluster runtime, with a one-minute minimum.
- Can add and remove task nodes on the fly (adjusts processing capacity).
- Can resize the cluster’s core nodes (adjusts processing and HDFS capacity).
- Core nodes can also be added or removed, but removing them risks data loss since they host HDFS.
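Resizing on the fly is done by changing an instance group’s target count. A minimal sketch of the request shape boto3’s `modify_instance_groups` expects follows; the instance group ID is hypothetical, and the actual API call is left commented out so the snippet runs without AWS credentials:

```python
def build_resize_request(instance_group_id: str, new_count: int) -> dict:
    """Build the parameters for emr.modify_instance_groups()."""
    return {
        "InstanceGroups": [
            {
                "InstanceGroupId": instance_group_id,  # e.g. "ig-TASKGROUP123" (hypothetical ID)
                "InstanceCount": new_count,            # desired number of nodes in this group
            }
        ]
    }

params = build_resize_request("ig-TASKGROUP123", 10)
# import boto3
# boto3.client("emr").modify_instance_groups(**params)
print(params["InstanceGroups"][0]["InstanceCount"])  # 10
```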
EMR Cluster
- Collection of EC2 instances that are running Hadoop.
- Each instance is called a node, and each node has a role in the cluster, called the node type. There are 3 types:
- Master node
- Track status of tasks and monitor cluster health.
- Core node
- Hosts HDFS data and runs tasks; can be scaled up & down.
- Core nodes do the actual work.
- Task node
- Does not store data.
- If you need to add more processing capacity but not storage, task node is a good option.
- No risk of data loss (as it doesn’t store data), great use of spot instances.
- Specify storage when creating a cluster: either S3 (via EMRFS) or HDFS (the underlying file system of Hadoop).
- Specify output to S3 or another location.
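The three node types map directly onto the instance groups you define when launching a cluster. A sketch of the `InstanceGroups` portion of a boto3 `run_job_flow()` call is below; instance types and counts are illustrative, not recommendations:

```python
# Each group's InstanceRole is one of MASTER, CORE, or TASK.
instance_groups = [
    {"Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "Core",   "InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2},
    # Task nodes store no HDFS data, so running them on spot is low-risk.
    {"Name": "Task",   "InstanceRole": "TASK",   "InstanceType": "m5.xlarge",
     "InstanceCount": 4, "Market": "SPOT"},
]

roles = [g["InstanceRole"] for g in instance_groups]
print(roles)  # ['MASTER', 'CORE', 'TASK']
```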
EMR Usage
Transient cluster
- Automatically terminates when all the processing is done; great for saving money.
Long running clusters
- Manually terminated.
- Data warehouse, periodically processing large data sets.
- Great when you process data often and don’t want to keep creating new transient clusters.
- Can spin up task nodes using spot instances to increase capacity.
- Able to use reserved EC2 instances, to save cost.
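The transient vs long-running choice comes down to one flag in the `Instances` config of `run_job_flow()`: `KeepJobFlowAliveWhenNoSteps`. A small sketch:

```python
def cluster_lifecycle_config(transient: bool) -> dict:
    """Return the lifecycle-related Instances settings for run_job_flow().

    transient=True  -> cluster terminates itself once its steps finish
    transient=False -> long-running cluster; must be terminated manually
    """
    return {
        "KeepJobFlowAliveWhenNoSteps": not transient,
        "TerminationProtected": False,  # leave off so termination isn't blocked
    }

print(cluster_lifecycle_config(transient=True))   # auto-terminating
print(cluster_lifecycle_config(transient=False))  # long-running
```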
EMR Storage
HDFS
- Stores multiple copies of files across multiple instances, which protects against data loss. If one instance fails, another node can replace it and simply continue processing the data.
- Data is stored in blocks; the default block size is 128 MB.
- Ephemeral — Data is lost when a cluster is terminated.
- Processing is done on the same node where the block of data is stored (data locality), a great optimisation.
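To make the 128 MB block size concrete, here is the arithmetic for how many HDFS blocks a file occupies (the last block may be only partially filled):

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size

def hdfs_blocks(file_size_mb: int) -> int:
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(hdfs_blocks(1024))  # a 1 GB file -> 8 blocks
print(hdfs_blocks(130))   # 130 MB -> 2 blocks (one full, one mostly empty)
```

Note the block size only sets an upper bound per block; a 130 MB file still consumes two block slots, which is why HDFS handles a few large files better than many tiny ones.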
EMRFS
- Provides persistent storage, i.e. storage that remains available after a cluster terminates.
- EMRFS allows us to access S3 as if it were HDFS.
- S3 itself is now strongly consistent. Previously, EMRFS offered a “consistent view” option, which used DynamoDB to track file consistency, e.g. when multiple nodes in the cluster hit the same file simultaneously. It was a pain, as we had to increase DynamoDB read/write capacity to ensure latency wasn’t affected.
- Not as optimised as HDFS (but still very quick), because the data lives in a separate service rather than on the compute nodes. HDFS, by contrast, processes data on the same node that stores it, as previously mentioned.
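Because EMRFS exposes S3 like a file system, jobs simply use `s3://` URIs where they would otherwise use HDFS paths. A sketch of an EMR step definition for a Spark job reading from and writing to S3 (bucket names and script path are hypothetical):

```python
# Step shape accepted by boto3's add_job_flow_steps() / run_job_flow(Steps=[...]).
step = {
    "Name": "Process logs from S3",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",  # EMR's generic command runner
        "Args": [
            "spark-submit",
            "s3://my-bucket/scripts/job.py",       # hypothetical script location
            "--input",  "s3://my-bucket/input/",   # EMRFS: s3:// paths used like HDFS paths
            "--output", "s3://my-bucket/output/",  # output persists after the cluster dies
        ],
    },
}

print(step["HadoopJarStep"]["Args"][0])  # spark-submit
```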
EBS for HDFS
- Allows EMR to run HDFS on EBS volumes; required for EBS-only instance types (e.g. M4, C4).
- Ephemeral: data is lost when the cluster is deleted.
- Storage capacity cannot be increased later; EBS volumes can only be attached when the cluster/nodes are created.
- If we manually detach an EBS volume, EMR treats that as a failure and replaces it.
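EBS volumes for HDFS are declared per instance group at creation time via `EbsConfiguration`. A sketch with illustrative volume sizes (which is why capacity can’t be grown later, the volumes are fixed when the group is created):

```python
core_group = {
    "Name": "Core",
    "InstanceRole": "CORE",
    "InstanceType": "m4.xlarge",  # EBS-only instance family
    "InstanceCount": 2,
    "EbsConfiguration": {
        "EbsBlockDeviceConfigs": [
            {
                "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 100},
                "VolumesPerInstance": 2,  # two 100 GB volumes per core node
            }
        ],
    },
}

total_gb_per_instance = sum(
    c["VolumeSpecification"]["SizeInGB"] * c["VolumesPerInstance"]
    for c in core_group["EbsConfiguration"]["EbsBlockDeviceConfigs"]
)
print(total_gb_per_instance)  # 200
```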
EMR Managed Scaling
- Supports Spark, Hive, and YARN workloads.
- Scales up by adding core nodes first, then task nodes, up to the maximum nodes specified.
- Scales down by removing task nodes first, then core nodes, until the minimum nodes specified.
- Can be applied across a fleet.