Architect fault-tolerant applications with instance fleets on Amazon EMR on EC2

March 15, 2025


Organizations rely on Amazon EMR on EC2 clusters to process large-scale data workloads using frameworks like Apache Spark, Apache Hive, and Trino. Events such as TV commercials or unplanned promotions might lead to an increase in demand for compute capacity, making effective capacity planning necessary to make sure your workloads don't hit capacity limits or job failures.

A common scenario is to run daily Spark jobs on Amazon EMR using consistent Amazon Elastic Compute Cloud (Amazon EC2) instance types (for example, a single instance size and family for the cluster). Although this might work well to sustain the baseline, spikes can trigger auto scaling, which narrows the chances of capacity availability when trying to stop and relaunch a larger EMR cluster, because the specific On-Demand Instance pool might lack capacity to meet the demand.

In this post, we show how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. We walk through assessing the historical compute usage of a workload and use a combination of strategies to reduce the likelihood of InsufficientCapacityExceptions (ICE) when Amazon EMR launches specific EC2 instance types. We implement flexible instance fleet strategies to reduce dependency on specific instance types and use Amazon EC2 On-Demand Capacity Reservations (ODCRs) for predictable, steady-state workloads. Following this approach can help prevent job failures due to capacity limits while optimizing your cluster for cost and performance.

Solution overview

Instance fleets in Amazon EMR offer a flexible and robust way to manage EC2 instances within your cluster. This feature lets you specify target capacities for On-Demand and Spot Instances, select up to five EC2 instance types per fleet (or 30 when using the AWS Command Line Interface (AWS CLI) or API with an allocation strategy), and use multiple subnets across different Availability Zones. Importantly, instance fleets support the use of ODCRs, enabling you to align your EMR clusters with pre-purchased EC2 capacity. You can configure your instance fleet to prefer or require capacity reservations, making sure that your EMR clusters use your reserved capacity efficiently.

EMR workload patterns typically fall into two categories: stable and variable (spiky). In the following sections, we explore how to optimize for each pattern using various options available with instance fleets, starting with stable workloads and then addressing variable workloads.

Stable workloads are workloads with a predictable pattern of resource usage over time; for example, a pharmaceutical provider needs to process 21 TB of research data, patient records, and other information daily. The workload is consistent and needs to run reliably every day on long-running persistent clusters. For critical business operations requiring high reliability and guaranteed capacity, we recommend reserving the baseline capacity as part of your capacity planning. We demonstrate the following steps:

Use AWS Cost and Usage Reports (AWS CUR) to estimate the baseline of existing workloads.
Reserve the baseline capacity using ODCR.
Configure Amazon EMR to use the targeted ODCR.

Spiky workloads are defined by unpredictable and often significant fluctuations in processing demands. These surges can be triggered by various factors (such as batch processing, real-time data streaming, or seasonal business fluctuations) that cause Amazon EMR to request more capacity to match the demand. We handle the resource allocation by using instance and Availability Zone flexibility, with the following steps:

Introduce EC2 instance flexibility with EMR instance fleets.
Achieve resiliency through intelligent subnet selection with EMR instance fleets.
Use managed scaling to automatically manage scaling out and in.

Stable workloads

In this section, we demonstrate how to define your baseline, configure AWS Identity and Access Management (IAM) permissions, create an ODCR, associate your reservations with a capacity group, and configure Amazon EMR to use targeted ODCRs. You can opt for a mixed ODCR strategy: for example, one ODCR with a short duration that supports the launch of your EMR cluster, and another ODCR with a longer duration that supports your task nodes based on the baseline capacity reservation.

Estimate the baseline

Make sure to activate the AWS generated cost allocation tag aws:elasticmapreduce:job-flow-id. This enables the field resource_tags_aws_elasticmapreduce_job_flow_id in the AWS CUR to be populated with the EMR cluster ID, which the SQL queries in this solution rely on. To activate the cost allocation tag from the AWS Billing console, complete the following steps:

On the AWS Billing and Cost Management console, choose Cost allocation tags in the navigation pane.
Under AWS generated cost allocation tags, choose the aws:elasticmapreduce:job-flow-id tag.
Choose Activate.

It can take up to 24 hours for tags to activate. For more information, see the AWS Billing documentation on activating AWS generated cost allocation tags.

After the tags are activated, you can use AWS CUR and run the following query on Amazon Athena to find the compute resources used by the EMR cluster ID over time. For more details, see Querying Cost and Usage Reports using Amazon Athena. Update the following query with your CUR table name, EMR cluster ID, desired timestamps, and AWS account ID, and run the query on Athena:

SELECT bill_payer_account_id AS Payer,
    product_product_family AS PFamily,
    product_product_name AS PName,
    resource_tags_aws_elasticmapreduce_job_flow_id,
    line_item_usage_account_id AS LinkedAccount,
    line_item_usage_start_date AS UsageDate,
    bill_billing_period_start_date AS BillingDate,
    SPLIT_PART(line_item_usage_type, ':', 2) AS InstanceType,
    line_item_availability_zone AS AvailabilityZone,
    COUNT(line_item_resource_id) AS ResourceIDCount
FROM YOUR_CUR_TABLE_NAME
WHERE (
        line_item_usage_start_date BETWEEN TIMESTAMP 'YYYY-MM-DD 00:00:00'
        AND TIMESTAMP 'YYYY-MM-DD 23:59:59'
    )
    AND line_item_operation LIKE '%RunInstance%'
    AND line_item_line_item_type LIKE '%Usage%'
    AND product_product_family NOT IN ('Data Transfer')
    AND resource_tags_aws_elasticmapreduce_job_flow_id LIKE '%YOUR_EMR_CLUSTER_ID%'
    AND line_item_usage_account_id IN (
        'YOUR_AWS_ACCOUNT_ID'
    )
GROUP BY 1,2,3,4,5,6,7,8,9

For example, the preceding query filters instance usage per hour for a given account and EMR cluster over a period of 6 months, to generate the following figure. You can export the results in CSV format and analyze the data. Now that you have a visual representation of your workload's baseline and bursts, you can define the strategy and configuration of your EMR cluster.
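One way to turn the exported CSV into a baseline number is sketched below in Python. This is an illustration under assumptions, not part of the original solution: the function names are hypothetical, the column names are taken from the query aliases (matched case-insensitively, because the export may lowercase them), and using a low quantile of the hourly instance counts as the baseline is one possible heuristic among several.

```python
import csv
import io
from collections import defaultdict


def hourly_instance_counts(csv_text):
    """Sum ResourceIDCount per usage hour across instance types and AZs."""
    totals = defaultdict(int)
    for raw in csv.DictReader(io.StringIO(csv_text)):
        # Match column names case-insensitively (assumption about the export).
        row = {key.lower(): value for key, value in raw.items()}
        totals[row["usagedate"]] += int(row["resourceidcount"])
    return dict(totals)


def baseline(counts, quantile=0.1):
    """Pick a low quantile of the hourly counts as a candidate baseline.

    Heuristic only: the index is truncated, so the result is always one of
    the observed hourly values.
    """
    ordered = sorted(counts.values())
    if not ordered:
        return 0
    index = int(quantile * (len(ordered) - 1))
    return ordered[index]
```

Calling baseline(hourly_instance_counts(text)) on the exported file contents gives a starting point for the reservation size; a higher quantile reserves more capacity and leaves less to On-Demand fallback.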

Create an ODCR to reserve the baseline capacity

ODCRs can be either open or targeted:

With an open ODCR, new instances and existing instances that have matching attributes (such as operating system or instance type) are placed in the capacity reservation first.
With a targeted ODCR, instances must match the attributes of the ODCR specification, and the ODCR must be explicitly targeted at launch. This approach is recommended if you have multiple concurrent EMR clusters consuming capacity from the shared On-Demand pool of EC2 instances. EMR clusters larger than the targeted ODCR quantity will fall back to On-Demand Instances that are in the same Availability Zone.

In this example, we use a targeted ODCR with an EMR instance fleet in the us-east-1a Availability Zone. The following diagram illustrates the workflow.

Complete the following steps:

Use the create-capacity-reservation AWS CLI command to create the ODCR and make a note of the CapacityReservationArn value in the output:
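As a rough illustration of the request this step makes, the following Python sketch assembles the parameters for the EC2 CreateCapacityReservation API. The helper name, the instance count of 10, and the choice of an unlimited end date are assumptions made for the example; the instance type, platform, and Availability Zone mirror the sample output in this post.

```python
def capacity_reservation_request(instance_type, availability_zone,
                                 instance_count, platform="Linux/UNIX"):
    """Build keyword arguments for the EC2 CreateCapacityReservation API."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "InstanceType": instance_type,
        "InstancePlatform": platform,
        "AvailabilityZone": availability_zone,
        "InstanceCount": instance_count,
        # "targeted": only launches that explicitly target the reservation use it.
        "InstanceMatchCriteria": "targeted",
        # Assumption: no end date, so the reservation lasts until it is cancelled.
        "EndDateType": "unlimited",
    }


# Hypothetical baseline of 10 instances.
params = capacity_reservation_request("r8g.2xlarge", "us-east-1a", instance_count=10)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   response = boto3.client("ec2").create_capacity_reservation(**params)
#   arn = response["CapacityReservation"]["CapacityReservationArn"]
```

Keeping the request as a plain dictionary makes it easy to review or store alongside the cluster configuration before anything is provisioned.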

We get the following output:

{
    "CapacityReservation": {
        "CapacityReservationId": "cr-0123456f9907xxxxx",
        "OwnerId": "XXXX",
        "CapacityReservationArn": "arn:aws:ec2:us-east-1:XXXX:capacity-reservation/cr-0123456f9907xxxxx",
        "InstanceType": "r8g.2xlarge",
        "InstancePlatform": "Linux/UNIX",
        "AvailabilityZone": "us-east-1a",
        ...
    }
}

You can use Amazon CloudWatch to monitor ODCR usage and trigger an alert for unused capacity. For more details, see Monitor Capacity Reservations usage with CloudWatch metrics.
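A minimal sketch of such an alert follows, written as a Python helper that builds the arguments for the CloudWatch PutMetricAlarm API. The helper name, alarm name, threshold, and evaluation window are assumptions; the namespace AWS/EC2CapacityReservations, the AvailableInstanceCount metric, and the CapacityReservationId dimension are the ones EC2 publishes for capacity reservations, to the best of our knowledge.

```python
def unused_capacity_alarm(reservation_id, threshold=1,
                          period_seconds=3600, periods=3):
    """Build PutMetricAlarm arguments that fire when reserved capacity sits idle."""
    return {
        "AlarmName": f"odcr-unused-{reservation_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2CapacityReservations",
        "MetricName": "AvailableInstanceCount",
        "Dimensions": [{"Name": "CapacityReservationId", "Value": reservation_id}],
        # Minimum over each period: alarm only if capacity stayed unused throughout.
        "Statistic": "Minimum",
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


alarm = unused_capacity_alarm("cr-0123456f9907xxxxx")

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

With the defaults, the alarm fires only after at least one reserved instance has gone unused for three consecutive hours, which avoids alerting on short gaps between jobs.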

Create a resource group named EMRSparkSteadyStateGroup and make a note of the GroupArn value in the output:

aws resource-groups create-group --name EMRSparkSteadyStateGroup \
    --configuration '{"Type":"AWS::EC2::CapacityReservationPool"}' '{"Type":"AWS::ResourceGroups::Generic", "Parameters":[{"Name":"allowed-resource-types","Values":["AWS::EC2::CapacityReservation"]}]}'

We get the following output:

"Group": {
    "GroupArn": "arn:aws:resource-groups:us-east-1:XXXX:group/EMRSparkSteadyStateGroup",
    "Name": "EMRSparkSteadyStateGroup"
}, ...

Use the following code to associate the capacity reservation with the resource group. You can have multiple capacity reservations associated with a resource group.

aws resource-groups group-resources --group EMRSparkSteadyStateGroup \
    --resource-arns arn:aws:ec2:us-east-1:XXXX:capacity-reservation/cr-0123456f9907xxxxx

As a best practice for effective management and cleanup, create a tag Purpose=EMR-Spark-Steady-State for the newly created ODCR and the resource group.

# Tag your Capacity Reservation
aws ec2 create-tags \
    --resources cr-0123456f9907xxxxx \
    --tags Key=Purpose,Value=EMR-Spark-Steady-State
# Tag your Resource Group
aws resource-groups tag \
    --arn "arn:aws:resource-groups:us-east-1:XXXX:group/EMRSparkSteadyStateGroup" --tags Purpose=EMR-Spark-Steady-State

Implement Amazon EMR with ODCR

Complete the following steps to create an EMR cluster tagged with the specific targeted ODCR:

Add the required permissions to the EMR service role before using capacity reservations. With these permissions, you can lock down the resource with the specific Amazon Resource Name (ARN) of the resource group to be created, using the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": "*",
            "Action": [
                "ec2:CreateFleet",
                "ec2:RunInstances",
                "ec2:CreateLaunchTemplate",
                "ec2:CreateLaunchTemplateVersion",
                "ec2:DeleteLaunchTemplateVersions",
                "ec2:DescribeCapacityReservations",
                "ec2:DescribeLaunchTemplateVersions",
                "resource-groups:ListGroupResources"
            ]
        }
    ]
}
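One possible reading of locking down the resource is to scope the resource group action to the group ARN while leaving the EC2 actions unrestricted. The Python sketch below generates such a policy document. The two-statement split and the helper name are assumptions for illustration; check which actions support resource-level restrictions in the IAM documentation before relying on it.

```python
import json

# EC2 actions copied from the policy in this post.
EC2_ACTIONS = [
    "ec2:CreateFleet",
    "ec2:RunInstances",
    "ec2:CreateLaunchTemplate",
    "ec2:CreateLaunchTemplateVersion",
    "ec2:DeleteLaunchTemplateVersions",
    "ec2:DescribeCapacityReservations",
    "ec2:DescribeLaunchTemplateVersions",
]


def emr_odcr_policy(group_arn):
    """Return a policy document that scopes the group lookup to one resource group."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": EC2_ACTIONS, "Resource": "*"},
            {
                "Effect": "Allow",
                "Action": ["resource-groups:ListGroupResources"],
                # Assumption: restrict this action to the capacity reservation group.
                "Resource": group_arn,
            },
        ],
    }


policy = emr_odcr_policy(
    "arn:aws:resource-groups:us-east-1:XXXX:group/EMRSparkSteadyStateGroup"
)
policy_json = json.dumps(policy, indent=4)
```

The resulting policy_json string can be attached to the EMR service role in place of the single-statement version.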

Configure the EMR cluster to use ODCR with instance fleets. We use the CapacityReservationOptions parameter to configure the EMR cluster, as shown in the following example:

{
    ...
    "LaunchSpecifications": {
        "OnDemandSpecification": {
            "AllocationStrategy": "LOWEST_PRICE",
            "CapacityReservationOptions": {
                "UsageStrategy": "USE_CAPACITY_RESERVATIONS_FIRST",
                "CapacityReservationResourceGroupArn": "arn:aws:resource-groups:us-east-1:xxxxxx:group/EMRSparkSteadyStateGroup"
            }
        }
    }
}
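To show where this fragment sits in a full fleet definition, here is a hedged Python sketch that builds a task fleet configuration in the shape the EMR RunJobFlow API accepts under Instances.InstanceFleets. The helper name, fleet name, and capacity numbers are hypothetical. The lowercase, hyphenated enum spellings follow our understanding of the EMR API and differ from the uppercase form in the preceding snippet, so verify them against the API reference for the tool you use.

```python
def task_fleet_with_odcr(group_arn, instance_types, target_on_demand):
    """Build a TASK instance fleet that consumes targeted ODCRs first."""
    if not instance_types:
        raise ValueError("at least one instance type is required")
    return {
        "Name": "task-fleet",  # hypothetical fleet name
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": target_on_demand,
        "InstanceTypeConfigs": [
            {"InstanceType": itype, "WeightedCapacity": 1}
            for itype in instance_types
        ],
        "LaunchSpecifications": {
            "OnDemandSpecification": {
                # Assumed API spelling of LOWEST_PRICE.
                "AllocationStrategy": "lowest-price",
                "CapacityReservationOptions": {
                    # Assumed API spelling of USE_CAPACITY_RESERVATIONS_FIRST.
                    "UsageStrategy": "use-capacity-reservations-first",
                    "CapacityReservationResourceGroupArn": group_arn,
                },
            }
        },
    }


fleet = task_fleet_with_odcr(
    "arn:aws:resource-groups:us-east-1:xxxxxx:group/EMRSparkSteadyStateGroup",
    ["r8g.2xlarge"],
    target_on_demand=10,  # hypothetical baseline size
)
```

Because the reservation in this post is for r8g.2xlarge in us-east-1a, the fleet needs to include that instance type, and the cluster needs a subnet in that Availability Zone, for the reservation to be consumed.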

The following step-by-step breakdown illustrates the Amazon EMR decision-making process when prioritizing targeted capacity reservations, from core node provisioning through task node allocation:

Cluster provisioning initiation:

The user chooses to override the lowest-price allocation strategy.
The user specifies targeted capacity reservations in the launch request.

Core node provisioning:

Amazon EMR evaluates all EC2 instance capacity pools with targeted capacity reservations, and selects the pool with the lowest price that has sufficient capacity for all requested core nodes.
If no pool with targeted reservations has sufficient capacity, Amazon EMR reevaluates all specified EC2 instance capacity pools and selects the lowest-priced pool with sufficient capacity for core nodes. Available open capacity reservations are applied automatically.

Availability Zone selection:

After the core capacity is acquired, Amazon EMR locks in the Availability Zone for your cluster.

Primary and task node provisioning:

Amazon EMR evaluates EC2 instance capacity pools within that Availability Zone for primary and task fleets. First, Amazon EMR evaluates all the pools with targeted ODCRs specified in the request, ordered by lowest price by default.
From the ordered list, Amazon EMR launches as much capacity as possible from the unused targeted ODCRs of each instance pool until the request is fulfilled.
If the unused targeted ODCRs don't fulfill the request yet, Amazon EMR continues to launch the remaining capacity into On-Demand pools, in lowest-price order by default.

For more details about the allocation strategy, refer to Allocation strategy for instance fleets or Amazon EMR Support for Targeted ODCR.
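The primary and task node steps can be pictured with a small simulation. The Python sketch below is a simplified model of the ordering described in this section, not the actual Amazon EMR implementation: the Pool class, the prices, and the capacity numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    instance_type: str
    price: float              # hypothetical hourly On-Demand price
    unused_odcr: int          # unused targeted reservation slots
    on_demand_available: int  # hypothetical spare On-Demand capacity


def allocate(pools, requested):
    """Fill a request from targeted ODCRs first, then On-Demand, cheapest first."""
    plan = []
    remaining = requested
    ordered = sorted(pools, key=lambda pool: pool.price)
    # Pass 1: unused targeted reservations, in lowest-price order.
    for pool in ordered:
        take = min(pool.unused_odcr, remaining)
        if take:
            plan.append((pool.instance_type, "odcr", take))
            remaining -= take
    # Pass 2: regular On-Demand capacity, in lowest-price order.
    for pool in ordered:
        take = min(pool.on_demand_available, remaining)
        if take:
            plan.append((pool.instance_type, "on_demand", take))
            remaining -= take
    return plan, remaining


pools = [
    Pool("r8g.8xlarge", price=2.0, unused_odcr=3, on_demand_available=2),
    Pool("r7g.8xlarge", price=1.8, unused_odcr=2, on_demand_available=10),
]
plan, unfilled = allocate(pools, requested=8)
```

In this toy run, five of the eight requested instances come from reservations and the remaining three fall back to the cheaper On-Demand pool; a nonzero unfilled value would correspond to the insufficient capacity case.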

Spiky workloads

Spiky workloads are defined by unpredictable and often significant fluctuations in processing demands, triggered by factors such as infrequent but resource-intensive periodic batch processing jobs. For example, a geographic information system processes location data from millions of users in real time to provide up-to-date traffic information, calculate routes, and suggest points of interest. User location data is constantly being generated, but the volume can spike dramatically during rush hour or special events, as illustrated in the following figure. This graph shows the number of used resources (Amazon EC2) by hour; it varies from 1 when the cluster scales in, waiting for jobs, to spikes of 1,000 nodes.

If you're running spiky workloads with limited flexibility in instance type, family, and Availability Zone, you might face ICE errors when the available capacity can't meet the cluster's scaling requirements. To address this, we explore a set of best practices for EMR cluster creation to maximize availability and balance price-performance. Although spiky workloads present a unique challenge in resource management, configuring EMR instance fleets provides a powerful solution. By using diverse instance types, prioritized allocation strategies, Availability Zone flexibility, and managed scaling, organizations can create a robust, cost-effective infrastructure capable of handling unpredictable workload patterns. This configuration provides the following benefits:

Improved availability – By diversifying instance types and using multiple Availability Zones, the cluster mitigates insufficient capacity issues
Cost savings – Allocation strategies reduce costs while minimizing interruptions
Resilience for spiky workloads – Prioritizing instance generations provides seamless scaling under varying demands
Optimized performance – Managed scaling dynamically adjusts resources to meet workload demands efficiently

Introduce EC2 instance flexibility and instance fleets with a prioritized allocation strategy

Amazon EMR supports instance flexibility with instance fleet deployment. Instance fleets give you a wider variety of options and intelligence around instance provisioning. You can provide a list of up to 30 instance types with corresponding weighted capacities and Spot bid prices (including Spot blocks) using the AWS CLI or AWS CloudFormation. Amazon EMR will automatically provision On-Demand and Spot capacity across these instance types when creating your cluster. This can make it more straightforward and cost-effective to quickly obtain and maintain your desired capacity for your clusters.

In August 2024, Amazon EMR introduced the prioritized allocation strategy to enhance instance flexibility with instance fleets. This feature lets you specify priority levels for your instance types, enabling Amazon EMR to allocate capacity to the highest-priority instances first. This strategy helps improve cost savings and reduces the time required to launch clusters, even in scenarios with limited capacity. For more details, see Amazon EMR support prioritized and capacity-optimized-prioritized allocation strategies for EC2 instances.

To maximize cost-efficiency and availability for spiky workloads, combine the price-performance advantages of new-generation instances with the broader availability of previous-generation instances. For workloads with strict latency requirements, fix the instance size to maintain consistent performance. This approach takes advantage of the strengths of both instance generations, providing flexibility and reliability while decreasing the likelihood of capacity constraints. For On-Demand nodes, choose the prioritized allocation strategy, so the cluster tries to use newer-generation instances first. While configuring the instance fleet, arrange instances in a prioritized order reflecting price-performance and availability trade-offs, for example:

Primary node – m8g.12xlarge > m8g.16xlarge > m7g.12xlarge > m7g.16xlarge
Core node – r8g.8xlarge > r8g.12xlarge > r7g.8xlarge > r6g.16xlarge > r5.16xlarge
Task node – r8g.8xlarge > r8g.12xlarge > r7g.8xlarge > r6g.16xlarge > r5.16xlarge
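An ordering like the ones above can be turned into instance type configurations mechanically. The following Python sketch assigns a Priority value by position in the list. The helper name and the weighted capacity of 1 are assumptions, and the sketch relies on our understanding that, in the EMR API, priorities start at 0 and a lower number means a higher priority.

```python
def prioritized_instance_type_configs(ordered_types, weighted_capacity=1):
    """Map a most-preferred-first list of instance types to EMR type configs."""
    if len(set(ordered_types)) != len(ordered_types):
        raise ValueError("instance types must be unique")
    return [
        {
            "InstanceType": instance_type,
            "WeightedCapacity": weighted_capacity,
            # Assumption: 0 is the highest priority, larger numbers come later.
            "Priority": float(rank),
        }
        for rank, instance_type in enumerate(ordered_types)
    ]


core_order = ["r8g.8xlarge", "r8g.12xlarge", "r7g.8xlarge", "r6g.16xlarge", "r5.16xlarge"]
core_configs = prioritized_instance_type_configs(core_order)
```

The resulting list can serve as the InstanceTypeConfigs of a fleet that uses the prioritized allocation strategy. When the list mixes instance sizes, the weighted capacities would normally be adjusted to reflect the relative size of each type.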

For Spot Instances, make sure the capacity-optimized-prioritized allocation strategy is selected to reduce interruptions. See the following CloudFormation template snippet for an example:

...
"Properties": {
    "Instances": {
        "MasterInstanceFleet": {
            "Name": "cfnMaster",
            "InstanceTypeConfigs": [
                {
                    "BidPrice": "10.50",
                    "InstanceType": "m5.xlarge",
                    "Priority": "1",
                    ...
            "LaunchSpecifications": {
                "SpotSpecification": {
                    "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                    "TimeoutDurationMinutes": 20,
                    "AllocationStrategy": "CAPACITY_OPTIMIZED_PRIORITIZED"
                },
                "OnDemandSpecification": {
                    "AllocationStrategy": "PRIORITIZED"
                }
                ...

Select subnets with EMR instance fleets

When creating a cluster, specify multiple EC2 subnets within a virtual private cloud (VPC), each corresponding to a different Availability Zone. Amazon EMR provides multiple subnet (Availability Zone) options by using subnet filtering at cluster launch, and selects one of the subnets that has enough available IP addresses to successfully launch all instance fleets. If Amazon EMR can't find a subnet with sufficient IP addresses to launch the whole cluster, it will prioritize the subnet that can at least launch the core and primary instance fleets.
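The filtering logic in this paragraph can be sketched as a small Python function. This is a simplified model, not the Amazon EMR implementation: the function name and subnet IDs are hypothetical, and breaking ties by choosing the subnet with the most free addresses is an assumption, because the selection among several qualifying subnets is not described here.

```python
def choose_subnet(free_ips, all_fleets_ips, core_primary_ips):
    """Pick a subnet that fits the whole cluster, else one that fits core and primary.

    free_ips maps subnet ID to the number of available IP addresses.
    Returns None when no subnet can host even the core and primary fleets.
    """
    def best(required):
        fits = {subnet: count for subnet, count in free_ips.items() if count >= required}
        # Assumption: prefer the subnet with the most free addresses.
        return max(fits, key=fits.get) if fits else None

    return best(all_fleets_ips) or best(core_primary_ips)


# Hypothetical subnets in three Availability Zones.
subnets = {"subnet-aaa": 12, "subnet-bbb": 60, "subnet-ccc": 30}
whole_cluster = choose_subnet(subnets, all_fleets_ips=40, core_primary_ips=8)
fallback = choose_subnet(subnets, all_fleets_ips=500, core_primary_ips=8)
```

Checking free address counts ahead of time in this way can help you decide which subnets are worth listing in the cluster configuration.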

Use managed scaling

Managed scaling is another powerful feature of Amazon EMR that automatically adjusts the number of instances in your cluster based on workload demands. This makes sure that your cluster scales up during periods of high demand to meet processing requirements and scales down during idle times to save costs. With managed scaling, you can set minimum and maximum scaling limits, giving you control over costs while benefiting from optimized and efficient cluster performance.
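As an illustration of setting those limits, the Python sketch below builds a managed scaling policy in the shape accepted by the EMR PutManagedScalingPolicy API. The helper name is hypothetical, and the limits of 1 and 1,000 units simply echo the range of the spiky workload example in this post; they are not recommendations.

```python
def managed_scaling_policy(min_units, max_units,
                           max_on_demand_units=None, max_core_units=None):
    """Build a ManagedScalingPolicy for a cluster that uses instance fleets."""
    if not 0 < min_units <= max_units:
        raise ValueError("limits must satisfy 0 < min_units <= max_units")
    limits = {
        "UnitType": "InstanceFleetUnits",
        "MinimumCapacityUnits": min_units,
        "MaximumCapacityUnits": max_units,
    }
    # Optional caps: On-Demand share of the cluster and size of the core fleet.
    if max_on_demand_units is not None:
        limits["MaximumOnDemandCapacityUnits"] = max_on_demand_units
    if max_core_units is not None:
        limits["MaximumCoreCapacityUnits"] = max_core_units
    return {"ComputeLimits": limits}


scaling = managed_scaling_policy(min_units=1, max_units=1000, max_core_units=10)

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("emr").put_managed_scaling_policy(
#       ClusterId="j-XXXXXXXX",  # placeholder cluster ID
#       ManagedScalingPolicy=scaling,
#   )
```

Capping the core fleet while leaving the overall maximum high lets task nodes absorb the spikes, which keeps scale-in fast once the burst has passed.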

The following workflow illustrates Amazon EMR configured with instance fleets and managed scaling.

The workflow consists of the following steps:

The user defines the EMR instance configurations and instance types, including their launch priority.
The user selects subnets for the Amazon EMR configuration to provide Availability Zone flexibility.
Amazon EMR calls the Amazon EC2 Fleet API to provision instances based on the allocation strategy.
The EMR instance fleet is launched.
The cycle is repeated for scaling operations within the launched Availability Zone, providing optimized performance and scalability.

Conclusion

In this post, we demonstrated how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. As you implement any of the preceding strategies, remember to continuously monitor your cluster's performance and adjust configurations based on your specific workload patterns and business needs. With the right approach, the challenges of spiky workloads can be transformed into opportunities for optimized performance and cost savings.

To effectively manage workloads with both baseline demands and unexpected spikes, consider implementing a hybrid approach in Amazon EMR. Use ODCRs for consistent baseline capacity and configure instance fleets with a strategic mix of ODCR, On-Demand, and Spot Instances, prioritizing ODCR usage.

Try these strategies with your own use case, and leave your questions in the comments.

About the Authors

Deepmala Agarwal works as an AWS Data Specialist Solutions Architect. She is passionate about helping customers build out scalable, distributed, and data-driven solutions on AWS. When not at work, Deepmala likes spending time with family, walking, listening to music, watching movies, and cooking!

Suba Palanisamy is a Senior Technical Account Manager, helping customers achieve operational excellence on AWS. Suba is passionate about all things data and analytics. She enjoys traveling with her family and playing board games.

Flavio Torres is a Principal Technical Account Manager at AWS. Flavio helps Enterprise Support customers design, deploy, and scale resilient cloud applications. Outside of work, he enjoys mountain climbing and barbecuing.
