Posted On: Nov 24, 2020
Errors in Spark applications commonly arise from inefficient Spark scripts, distributed in-memory execution of large-scale transformations, and dataset abnormalities. AWS Glue workload partitioning is the newest offering from AWS Glue to address these issues and improve the reliability of Spark applications and the consistency of run times. Workload partitioning enables you to specify how much data to process in each job run and, using AWS Glue job bookmarks, track how much of the data AWS Glue has processed.
With workload partitioning enabled, each ETL job run processes data up to an upper bound on the unprocessed dataset size or on the number of files per job run. For example, if you need to process 1,000 files, you can set the number of files per run to 500 and split the work into two job runs, which can be executed sequentially or in parallel.
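As a minimal sketch of how this looks in a Glue PySpark ETL script: bounded execution is enabled by passing a bound such as "boundedFiles" (or "boundedSize", in bytes) in the additional options of a Data Catalog read, with a transformation context so job bookmarks can track what has been processed. The database and table names below are placeholders for illustration.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Workload partitioning (bounded execution): cap this run at 500 unprocessed
# files. "boundedSize" could be used instead to cap the data volume per run.
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="example_db",          # placeholder database name
    table_name="example_table",     # placeholder table name
    additional_options={"boundedFiles": "500"},
    transformation_ctx="datasource0",  # lets job bookmarks record processed files
)

# ... apply transformations and write the output here ...

# Committing the job advances the bookmark, so the next run picks up the
# remaining unprocessed files.
job.commit()
```

Running the same job again after a successful commit would then process the next batch of up to 500 unprocessed files, continuing until the dataset is exhausted.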
This expanded capability is available in every Region where AWS Glue is available. To learn more, visit the AWS Glue User Guide or read our blog. Access the AWS Glue console to get started.