
Run Monte Carlo simulations at scale with AWS Step Functions and AWS Fargate

Introduction

Organizations across financial services and other industries have business processes that require executing the same business logic across billions of records, for example to meet machine learning and compliance needs. Many rely on custom internal orchestration systems or big data frameworks to coordinate this work across many parallel compute nodes. Maintaining and operating those orchestration systems can consume significant development effort, or even require additional dedicated internal teams. Organizations also often manage large clusters of compute resources to run the business logic at scale, which demands further operational and infrastructure investment.

In this post, we demonstrate how to build an application capable of handling datasets containing billions of records, operating on a compute environment with minimal configuration, and running tasks resiliently at scale. It combines the power and scale of the AWS Step Functions distributed map feature and Amazon Simple Storage Service (Amazon S3) with the simplicity of AWS Fargate to build a robust and resilient solution.

Solution overview

The solution generates and processes a fictitious dataset with a default of 500,000 personal and commercial loan files. The solution stores the loan files in Amazon S3 as objects, creates an Amazon S3 Inventory list, and runs a simple algorithm that predicts the likelihood of each loan defaulting in the event of a Federal Reserve rate increase to 8%.
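
The actual model lives in the sample repository; the following is a minimal, hypothetical sketch of what a per-loan Monte Carlo default estimate could look like. The function name, the income-noise model, and the 40% affordability threshold are all illustrative assumptions, not the sample's actual algorithm.

import random

def default_probability(balance, annual_income, trials=10_000,
                        stressed_rate=0.08):
    """Estimate the likelihood of default if rates rise to 8% (illustrative)."""
    defaults = 0
    for _ in range(trials):
        # Draw a noisy annual income to model borrower uncertainty.
        income = annual_income * random.gauss(1.0, 0.15)
        # Interest-only monthly payment at the stressed rate.
        payment = balance * stressed_rate / 12
        # Count a default when the payment exceeds 40% of monthly income.
        if payment > 0.4 * income / 12:
            defaults += 1
    return defaults / trials

print(default_probability(balance=250_000, annual_income=90_000))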

The AWS Step Functions workflow orchestrates the process of running the Monte Carlo algorithm over billions of loan files using distributed map. Distributed map is an AWS Step Functions state that can iterate over large-scale data in Amazon S3 and invoke AWS Fargate, AWS Lambda, AWS SDK service integrations, third-party APIs, or any combination of those as child workflows at a maximum concurrency of 10,000. This sample solution uses the Amazon S3 Inventory list as the input for distributed map. It uses an Amazon Elastic Container Service (Amazon ECS) environment powered by AWS Fargate and an AWS Step Functions activity to process the loan files. Distributed map reads the records from the Amazon S3 Inventory list, batches the records, and distributes them to a highly resilient Step Functions activity. The Amazon ECS tasks consume the data from the activity to run the Monte Carlo simulation. Note that this integration of Amazon ECS with an AWS Step Functions activity is comparable to consuming messages in Amazon ECS from Amazon SQS; it is different from the direct integration of Amazon ECS with AWS Step Functions.

Prerequisites

  • An AWS Account
  • Access to AWS Services
    • AWS IAM with access to create policies and roles
    • Amazon ECS with access to create Clusters, Services, and Task Definitions
    • AWS Step Functions with access to create State Machines and Activities
    • Amazon VPC with access to create Subnets, Internet Gateways, NAT Gateways, VPC Endpoints, Route Tables, and Flow Logs
    • Amazon CloudWatch with access to create Log Groups

Getting started

  1. Clone the solution from GitHub to your workstation
  2. From the AWS CloudFormation Console, choose Create Stack
  3. In the Specify template section, choose Upload a template file
  4. Choose Choose file and navigate to the directory where you cloned the repository
  5. Navigate into the cloudformation directory for the stack you want to deploy, and choose the main.yml file
  6. Provide a stack name (ex: sfn-sample)
  7. On the next page, review the parameters. Here you can adjust settings such as concurrency or the number of records to generate
  8. Choose Next
  9. Review the checkboxes at the bottom of the page. If you consent, check the boxes and choose Submit
  10. Once deployed, the stacks create two AWS Step Functions state machines. First run the *-datagen-* workflow to generate the source data, then run the *-dataproc-* workflow to process the data

Walkthrough

The AWS Step Functions workflow starts by adding the desired number of Amazon ECS tasks to an existing Amazon ECS cluster. The workflow then proceeds to iterate over the Amazon S3 object metadata from the Amazon S3 Inventory list. When you configure the location of the Amazon S3 object or S3 Inventory list, distributed map reads the individual items in the S3 object, optionally batches them, and distributes them to the steps defined inside the distributed map. In our sample project, the step inside the distributed map is an activity task. The activity task is an AWS Step Functions feature that enables the task to be hosted on almost any compute platform, including Amazon Elastic Compute Cloud (Amazon EC2), Amazon ECS, and even mobile devices. AWS Step Functions manages an internal queue for each activity you create at no additional cost.

The AWS Step Functions activity task queues the batch received from distributed map to the activity and waits for an acknowledgement from the worker, the Amazon ECS task. As soon as the Amazon ECS task completes initialization, it starts to consume a batch of records from the activity. After processing the batch, the Amazon ECS task sends either a success or a failure response back to the activity. The activity removes the batch from the queue upon receiving a success response, and the Amazon ECS task consumes the next batch from the activity. The workflow repeats this process until all of the batches of Amazon S3 objects have been processed through the activity by the Amazon ECS tasks, and then terminates all of the Amazon ECS tasks.
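
To make this protocol concrete, here is a minimal sketch, assuming a Python worker, of the poll-process-acknowledge loop an Amazon ECS task might run. The process_batch helper and the error name are hypothetical; the sample repository's worker differs in its details.

import boto3
from botocore.config import Config

# Socket reads must outlast get_activity_task's 60-second long poll.
sfn = boto3.client('stepfunctions', config=Config(read_timeout=70))

def worker_loop(activity_arn, worker_name):
    while True:
        # Block until distributed map queues a batch to this activity.
        task = sfn.get_activity_task(activityArn=activity_arn,
                                     workerName=worker_name)
        token = task.get('taskToken')
        if not token:
            continue  # The poll timed out with no work; poll again.
        try:
            # process_batch is a hypothetical helper that runs the
            # simulation and returns its results as a JSON string.
            output = process_batch(task['input'])
            sfn.send_task_success(taskToken=token, output=output)
        except Exception as err:
            sfn.send_task_failure(taskToken=token, error='BatchError',
                                  cause=str(err))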

Design choices and considerations

Understanding scaling and performance

  1. AWS Step Functions distributed map executes steps as separate child workflows. You can run anywhere from one child workflow at a time up to 10,000 in parallel: you specify the level of parallelism required, and AWS Step Functions creates an equivalent number of parallel workflows. Additionally, you can batch records instead of sending them one by one. The example in this post initializes 1000 Amazon ECS tasks as workers and processes 500,000 records in batches of 100 records with a parallelism of 1000.
           "ItemReader": {
              "Resource": "arn:aws:states:::s3:getObject",
              "ReaderConfig": {
                "InputType": "MANIFEST",
                "MaxItems": 0
              },
              "Parameters": {
                "Bucket.$": "$.key.bucket",
                "Key.$": "$.key.key"
              }
            },
      "MaxConcurrency": 1000,
      "Label": "S3objectkeys",
      "ItemBatcher": {
        "MaxItemsPerBatch": 100
      },

With this configuration, each wave of 1000 parallel child workflows processes 100,000 records (1000 workflows × 100 records per batch), so distributed map completes the 500,000-record dataset in just five waves.

  2. The example uses a task definition of .25 vCPU and .5 GB of RAM, which is adequate for running the simple example algorithm on batches of 100 records from the dataset. When you increase the vCPU/RAM, you can process more records in parallel or decrease the time required to process the records. However, increasing vCPU/RAM raises the cost of each time-based billing increment. Consequently, your choice of vCPU/RAM should align with the specific performance characteristics and budget requirements of your workload.
  3. The example uses the Amazon S3 Inventory list to iterate over the Amazon S3 objects, which improves performance and reduces cost. You can instead point distributed map directly at a prefix of Amazon S3 objects, in which case it collects all of the object metadata using the Amazon S3 ListObjectsV2 API before starting the iteration. When processing millions of objects, the S3 Inventory list is faster because ListObjectsV2 returns only 1000 objects per call, and you also pay for each ListObjectsV2 API call, as the sketch after this list illustrates.
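
As a rough illustration, and assuming a hypothetical bucket and prefix, paging through object metadata with boto3 looks like the following. A 10-million-object prefix would require 10,000 sequential calls before the iteration could even begin.

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

keys = []
# Each page returns at most 1000 keys, and every page is a billed API
# call, so large prefixes require many sequential round trips.
for page in paginator.paginate(Bucket='example-bucket', Prefix='loans/'):
    keys.extend(obj['Key'] for obj in page.get('Contents', []))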

Resiliency through AWS Step Functions’ activity

The AWS Step Functions activity task allows us to cleanly decouple AWS Step Functions distributed map from Amazon ECS task scaling. This follows a messaging pattern: the activity task sends the message to the internally managed activity (that is, a queue), and the Amazon ECS tasks consume the messages from the activity.

import boto3
from botocore.config import Config

# Allow socket reads longer than get_activity_task's 60-second long poll.
config = Config(read_timeout=70)
client = boto3.client('stepfunctions', region_name=region, config=config)
response = client.get_activity_task(
    activityArn=activity_arn,
    workerName=worker_name
)

After the batch is processed, the Amazon ECS task notifies AWS Step Functions by sending either a success or a failure response.

# The task token from the get_activity_task response identifies the batch.
token = response['taskToken']

if success:
    client.send_task_success(
        taskToken=token,
        output=stringoutput  # results serialized as a JSON string
    )
else:
    client.send_task_failure(
        taskToken=token,
        cause=strCause,
        error=strError
    )

The message remains in the activity until it is processed, which makes the solution tolerant of interruption. If your use case has flexible completion requirements, you may be able to achieve significant cost savings by running the Amazon ECS tasks on the Fargate Spot capacity provider.
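
Assuming the cluster already has the FARGATE_SPOT capacity provider enabled, a hedged sketch of launching workers on Spot capacity might look like this. The cluster name, task definition, and subnet are placeholders, not resources from the sample.

import boto3

ecs = boto3.client('ecs')

# Launch workers on Fargate Spot. The cluster, task definition, and
# subnet below are placeholders for your own resources.
ecs.run_task(
    cluster='monte-carlo-cluster',
    taskDefinition='monte-carlo-worker',
    count=10,
    capacityProviderStrategy=[
        {'capacityProvider': 'FARGATE_SPOT', 'weight': 1}
    ],
    networkConfiguration={
        'awsvpcConfiguration': {
            'subnets': ['subnet-0123456789abcdef0'],
            'assignPublicIp': 'DISABLED'
        }
    }
)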

Comparison with direct Amazon ECS integration

The solution can also be designed with direct Amazon ECS integration using Amazon ECS run task API invocation inside the distributed map. We chose AWS Step Functions activity for the sample for the following reasons:

  • The Amazon ECS run task integration starts a new task, runs the batch processing, and terminates the task. This means that for a 500,000-record dataset we would end up creating 5000 ephemeral tasks. By using an activity, we are able to reuse the same 1000 tasks we created at the beginning of the workflow.
  • Amazon ECS run task accepts its request payload through the ContainerOverrides parameter, which has a limit of 8192 characters, whereas an activity can accept up to 256 KB of data per request.
  • While the Amazon ECS run task API integration can be retried automatically when there are failures, the activity feature provides additional resiliency. The activity's asynchronous processing architecture provides resiliency to hard failures, such as service unavailability or unexpected errors in Amazon ECS tasks. Because messages are kept in the activity, they can be processed as workers become available, without requiring retry logic within the workflow.

Key considerations when running at scale

  • We chose to run the workflows inside the distributed map as EXPRESS because the processing completes within 5 minutes, the maximum duration of an Express workflow. If processing an individual batch of data takes more than 5 minutes, consider reducing the batch size or changing the workflow type to STANDARD.
  • API calls for distributed map child workflows, such as StartExecution, are subject to your account limits. Standard workflows have a lower StartExecution limit than Express workflows. If you set the concurrency above this limit, you'll experience delays in starting the workflows.
  • While distributed map can process 10,000 workflows in parallel, other services used in the child workflow may have different scaling characteristics. You can use the MaxConcurrency parameter to limit the parallelism to the desired number, in addition to implementing backoff and retry strategies within your workflow.
  • Optimizing the batch size for your use case can lead to cost savings.

Cleaning up

To avoid incurring future costs, delete any example resources you no longer need. You can destroy the entire stack of resources by deleting the stack from AWS CloudFormation.

  1. Navigate to the AWS CloudFormation Console and choose the deployed stack
  2. Choose Delete and follow the prompts

Conclusion

In this post, we showed you an example solution for simplifying Monte Carlo simulations and machine learning data processing implementations using AWS Step Functions Distributed Map and Amazon ECS. AWS Step Functions’ distributed map is a powerful feature to process massive amounts of data at large scale without managing servers. The workflows inside the distributed map can use AWS Fargate, AWS Lambda, AWS SDK service integrations, third-party APIs, or any combination.

To learn more about how to use distributed map, visit the distributed map paved path.

To get hands-on experience with building workflows, try the Serverless Land workflows and the workshop.

Michael Haught

Michael Haught is a Senior Technical Account Manager with AWS Enterprise Support. Michael has more than 10 years of experience in designing and implementing AWS solutions and more than 20 years in Infrastructure. He specializes in DevOps and Container Technologies.

Josh Ragsdale

Josh Ragsdale is a Senior Enterprise Solutions Architect at AWS. He focuses on helping his customers with their business transformations as they adapt to a cloud operating model at very large scale. He enjoys cycling and spending time with his family outdoors.

Uma Ramadoss

Uma Ramadoss is a specialist Solutions Architect at Amazon Web Services, focused on the Serverless platform. She is responsible for helping customers design and operate event-driven cloud-native applications and modern business workflows using services like Lambda, EventBridge, Step Functions, and Amazon MWAA.