Introducing job queuing to scale your AWS Glue workloads

Data is a key driver for your business. Data volume can increase significantly over time, and it often requires concurrent consumption of large compute resources. Data integration workloads can become increasingly concurrent as more and more applications demand access to data at the same time. In AWS, hundreds of thousands of customers use AWS Glue, a serverless data integration service, for integrating data across multiple data sources at scale. AWS Glue jobs can be triggered asynchronously via a schedule or event, or started synchronously, on-demand.

Your AWS account has quotas, also referred to as limits, which are the maximum number of service resources for your AWS account. AWS Glue quotas helps guarantee the availability of AWS Glue resources and prevents accidental over provisioning of resources. However, with large or spiky workloads, it can be challenging to manage job run concurrency or Data Processing Units (DPU) to stay under the service quotas.
Traditionally, when you hit the quota of concurrent Glue job runs, your jobs fail immediately.

Today, we are pleased to announce the general availability of AWS Glue job queuing. Job queuing increases scalability and improves the customer experience of managing AWS Glue jobs. With this new capability, you no longer need to manage concurrency of your AWS Glue job runs and attempt retries just to avoid job failures due to high concurrency. You can simply start your jobs, and when the job runs are in Waiting state, the AWS Glue job queuing feature staggers jobs automatically whenever possible. This increases your job success rates and the experience for large concurrency workloads.

This post demonstrates how job queuing helps you scale your Glue workloads and how job queuing works.

Use cases and benefits for job queuing

The following are common data integration use cases where many concurrent job runs are needed:

Many different data sources need to be read in parallel
Multiple large datasets need to be processed concurrently
Data is processed in an event-driven fashion, and many events occur at the same time

AWS Glue has the following service quotas per Region and account related to concurrent usage:

Max concurrent job runs per account
Max concurrent job runs per job
Max task DPUs per account

You can also configure maximum concurrency for individual jobs.

In the aforementioned typical use cases, when you run a job through the StartJobRun API or AWS Glue console, you may hit the upper limit defined at any of the discussed places. If this happens, your job fails immediately due to errors like ConcurrentRunsExceededException returned by the AWS Glue API endpoint.

Job queuing helps those typical use cases without forcing you to manage concurrency between all your job runs. You no longer need to make manual retries when you get ConcurrentRunsExceededException. Job queuing enqueues job runs when you hit the limit and automatically reattempts job runs when resources free up. It simplifies your daily operation and reduces latency for the retries. It also allows you to scale more with AWS Glue jobs.

In the next section, we describe how job queuing is configured.

Configure job queuing for Glue jobs

To enable job queuing on the AWS Glue Studio console, complete the following steps:

Open AWS Glue console.
Choose Jobs.
Choose your job.
Choose the Job details tab.
For Job Run Queuing, select Enable job runs to be queued to run later when they cannot run immediately due to service quotas
Choose Save.

In the next section, we describe how job queuing works.

How AWS Glue jobs work with job queuing

In the current job run lifecycle, the job-level and account-level limits are checked when a job starts, and the job moves to a Failed state when these limits are reached. With job queuing, your job run state goes into a Waiting state to be reattempted instead of Failed. The Waiting state means that job run is queued for retry after the limits have been exceeded or resources were not unavailable. Job queueing is another retry mechanism in addition to the customer-specified max retry.

AWS Glue job queuing will improve the success rates of job runs and reduce failures due to limits, but it doesn’t guarantee job run success. Limits and resources could still be unavailable by the time the reattempt run starts.

The following screenshot shows that two job runs are in the Waiting state:

The following limits are covered by job queuing:

Max concurrent job runs per account exceeded
Max concurrent job runs per job exceeded (which includes the account-level service quota as well as the configured parameter on the job)
Max concurrent DPUs exceeded
Resource unavailable due to IP address exhaustion in VPCs

The retry mechanism is configured to retry for a maximum of 15 minutes or 10 attempts, whichever comes first.

Here’s the state transition diagram for job runs when job queuing is enabled.

Considerations

Keep in mind the following considerations:

AWS Glue Flex jobs are not supported
With job queuing enabled, the parameter MaxRetries is not configurable for the same job

Conclusion

In this post, we described how the new job queuing capability helps you scale your AWS Glue job workload. You can start leveraging job queuing for your new jobs or existing jobs today. We are looking forward to hearing your feedback.

About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Gyan Radhakrishnan is a Software Development Engineer on the AWS Glue team. He is working on designing and building end-to-end solutions for data intensive applications.

Simon Kern is a Software Development Engineer on the AWS Glue team. He is enthusiastic about serverless technologies, data engineering and building great services.

Dana Adylova is a Software Development Engineer on the AWS Glue team. She is working on building software for supporting data intensive applications. In her spare time, she enjoys knitting and reading sci-fi.

Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytic services. In his spare time, he enjoys skiing and gardening.

AWS Big Data Blog