AWS EMR
3 min readApr 24, 2023
If you’d like a quick, to-the-point blog on EMR, continue :). If you’d like a very detailed blog on EMR, see: https://www.zuar.com/blog/what-is-amazon-emr/
Overview
- Elastic MapReduce.
- Managed Hadoop framework running on EC2.
- Includes Spark, Presto, Hive, etc.
- EMR Notebooks — Develop scripts/code.
- Many integrations with AWS resources.
- CloudWatch to monitor cluster performance.
- AWS IAM to configure permissions.
- Data Pipeline to schedule and start the cluster(s).
- CloudTrail to audit requests.
- Charged per second of cluster runtime, with a one-minute minimum.
- Can add and remove task nodes on the fly (adjusts processing capacity).
- Can resize the cluster’s core nodes (adjusts processing and HDFS capacity).
- Core nodes can also be added or removed, but removing them risks data loss since they host HDFS.
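Resizing on the fly is done by changing an instance group’s target count. A minimal sketch of the request shape boto3’s `modify_instance_groups` expects follows; the instance group ID is hypothetical, and the actual API call is left commented out so the snippet runs without AWS credentials:

```python
def build_resize_request(instance_group_id: str, new_count: int) -> dict:
    """Build the parameters for emr.modify_instance_groups()."""
    return {
        "InstanceGroups": [
            {
                "InstanceGroupId": instance_group_id,  # e.g. "ig-TASKGROUP123" (hypothetical ID)
                "InstanceCount": new_count,            # desired number of nodes in this group
            }
        ]
    }

params = build_resize_request("ig-TASKGROUP123", 10)
# import boto3
# boto3.client("emr").modify_instance_groups(**params)
print(params["InstanceGroups"][0]["InstanceCount"])  # 10
```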
EMR Cluster
- Collection of EC2 instances that are running Hadoop.
- Each instance is called a node, and each node has a role in the cluster, called the node type. There are 3 types:
- Master node
- Track status of tasks and monitor cluster health.
- Core node
- Hosts HDFS data and runs tasks; can be scaled up & down.
- Core nodes do the actual work.
- Task node
- Does not store data.
- If you need to add more processing capacity but not storage, task node is a good option.
- No risk of data loss (as it doesn’t store data), great use of spot instances.
- Specify storage when creating a cluster: either S3 (via EMRFS) or HDFS (the underlying file system of Hadoop).
- Specify output to S3 or another location.
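The three node types map directly onto the instance groups you define when launching a cluster. A sketch of the `InstanceGroups` portion of a boto3 `run_job_flow()` call is below; instance types and counts are illustrative, not recommendations:

```python
# Each group's InstanceRole is one of MASTER, CORE, or TASK.
instance_groups = [
    {"Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "Core",   "InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2},
    # Task nodes store no HDFS data, so running them on spot is low-risk.
    {"Name": "Task",   "InstanceRole": "TASK",   "InstanceType": "m5.xlarge",
     "InstanceCount": 4, "Market": "SPOT"},
]

roles = [g["InstanceRole"] for g in instance_groups]
print(roles)  # ['MASTER', 'CORE', 'TASK']
```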
EMR Usage
Transient cluster
- Automatically terminates when all the processing is done; great for saving money.
Long running clusters
- Manually terminated.
- Data warehouse, periodically processing large data sets.
- Great when you process data often and don’t want to keep creating new transient clusters.
- Can spin up task nodes using spot instances to increase capacity.
- Able to use reserved EC2 instances, to save cost.
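The transient vs long-running choice comes down to one flag in the `Instances` config of `run_job_flow()`: `KeepJobFlowAliveWhenNoSteps`. A small sketch:

```python
def cluster_lifecycle_config(transient: bool) -> dict:
    """Return the lifecycle-related Instances settings for run_job_flow().

    transient=True  -> cluster terminates itself once its steps finish
    transient=False -> long-running cluster; must be terminated manually
    """
    return {
        "KeepJobFlowAliveWhenNoSteps": not transient,
        "TerminationProtected": False,  # leave off so termination isn't blocked
    }

print(cluster_lifecycle_config(transient=True))   # auto-terminating
print(cluster_lifecycle_config(transient=False))  # long-running
```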
EMR Storage
HDFS
- Stores multiple copies of files across multiple instances, which protects against data loss. If one instance fails, another node can replace it and simply continue processing the data.
- Data is stored in blocks; the default block size is 128 MB.
- Ephemeral — Data is lost when a cluster is terminated.
- Processing is done on the same node where the block of data is stored (data locality), a great optimisation.
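To make the 128 MB block size concrete, here is the arithmetic for how many HDFS blocks a file occupies (the last block may be only partially filled):

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size

def hdfs_blocks(file_size_mb: int) -> int:
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(hdfs_blocks(1024))  # a 1 GB file -> 8 blocks
print(hdfs_blocks(130))   # 130 MB -> 2 blocks (one full, one mostly empty)
```

Note the block size only sets an upper bound per block; a 130 MB file still consumes two block slots, which is why HDFS handles a few large files better than many tiny ones.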
EMRFS
- Provides persistent storage, i.e. storage that remains available after a cluster terminates.
- EMRFS allows us to access S3 as if it were HDFS.
- S3 itself is now strongly consistent. Previously, EMRFS offered a “consistent view” option, which used DynamoDB to track file consistency, e.g. when multiple nodes in the cluster hit the same file simultaneously. It was a pain, as we had to increase DynamoDB read/write capacity to ensure latency wasn’t affected.
- Not as optimised as HDFS (but still very quick), because the data lives in a separate service rather than on the compute nodes. HDFS, by contrast, processes data on the same node that stores it, as previously mentioned.
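Because EMRFS exposes S3 like a file system, jobs simply use `s3://` URIs where they would otherwise use HDFS paths. A sketch of an EMR step definition for a Spark job reading from and writing to S3 (bucket names and script path are hypothetical):

```python
# Step shape accepted by boto3's add_job_flow_steps() / run_job_flow(Steps=[...]).
step = {
    "Name": "Process logs from S3",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",  # EMR's generic command runner
        "Args": [
            "spark-submit",
            "s3://my-bucket/scripts/job.py",       # hypothetical script location
            "--input",  "s3://my-bucket/input/",   # EMRFS: s3:// paths used like HDFS paths
            "--output", "s3://my-bucket/output/",  # output persists after the cluster dies
        ],
    },
}

print(step["HadoopJarStep"]["Args"][0])  # spark-submit
```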
EBS for HDFS
- Allows EMR to run HDFS on EBS volumes; required for EBS-only instance types (e.g. M4, C4).
- Ephemeral: data is lost when the cluster is deleted.
- Storage capacity cannot be increased later; EBS volumes can only be attached when the cluster/nodes are created.
- If we manually detach an EBS volume, EMR treats that as a failure and replaces it.
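EBS volumes for HDFS are declared per instance group at creation time via `EbsConfiguration`. A sketch with illustrative volume sizes (which is why capacity can’t be grown later, the volumes are fixed when the group is created):

```python
core_group = {
    "Name": "Core",
    "InstanceRole": "CORE",
    "InstanceType": "m4.xlarge",  # EBS-only instance family
    "InstanceCount": 2,
    "EbsConfiguration": {
        "EbsBlockDeviceConfigs": [
            {
                "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 100},
                "VolumesPerInstance": 2,  # two 100 GB volumes per core node
            }
        ],
    },
}

total_gb_per_instance = sum(
    c["VolumeSpecification"]["SizeInGB"] * c["VolumesPerInstance"]
    for c in core_group["EbsConfiguration"]["EbsBlockDeviceConfigs"]
)
print(total_gb_per_instance)  # 200
```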
EMR Managed Scaling
- Supports Spark, Hive, and YARN workloads.
- Scales up by adding core nodes first, then task nodes, up to the maximum nodes specified.
- Scales down by removing task nodes first, then core nodes, until the minimum nodes specified.
- Can be applied across a fleet.