AWS Compute Blog

Orchestrating an application process with AWS Batch using AWS CloudFormation

This post is written by Sivasubramanian Ramani

In many real work applications, you can use custom Docker images with AWS Batch and AWS CloudFormation to execute complex jobs efficiently.

 

This post provides a file processing implementation using Docker images and Amazon S3, AWS Lambda, Amazon DynamoDB, and AWS Batch. In this scenario, the user uploads a CSV file into an Amazon S3 bucket, which is processed by AWS Batch as a job. These jobs can be packaged as Docker containers and are executed using Amazon EC2 and Amazon ECS.

 

The following steps provide an overview of this implementation:

  1. AWS CloudFormation template launches the S3 bucket that stores the CSV files.
  2. The Amazon S3 file event notification executes an AWS Lambda function that starts an AWS Batch job.
  3. AWS Batch executes the job as a Docker container.
  4. A Python-based program reads the contents of the S3 bucket, parses each row, and updates an Amazon DynamoDB table.
  5. Amazon DynamoDB stores each processed row from the CSV.

 Prerequisites 

 

Walkthrough

The following steps outline this walkthrough. Detailed steps are given through the course of the material.

  1. Run the CloudFormation template (command provided) to create the necessary infrastructure.
  2. Set up the Docker image for the job:
    1. Build a Docker image.
    2. Tag the build and push the image to the repository.
  3. Drop the CSV into the S3 bucket (copy paste the contents and create them as a [sample file csv]).
  4. Confirm that the job runs and performs the operation based on the pushed container image. The job parses the CSV file and adds each row into DynamoDB.

 

Points to consider

  • The provided AWS CloudFormation template has all the services (refer to upcoming diagram) needed for this walkthrough in one single template. In an ideal production scenario, you might split them into different templates for easier maintenance.

 

  • As part of this walkthrough, you use the Optimal Instances for the batch. The a1.medium instance is a less expensive instance type introduced for batch operations, but you can use any AWS Batch capable instance type according to your needs.

 

  • To handle a higher volume of CSV file contents, you can do multithreaded or multiprocessing programming to complement the AWS Batch performance scale.

 

Deploying the AWS CloudFormation template

 

When deployed, the AWS CloudFormation template creates the following infrastructure.

 application using AWS BatchAn application process using AWS Batch

 

You can download the source from the github location. Below steps will detail using the downloaded code. This has the CloudFormation template that spins up the infrastructure, a Python application (.py file) and a sample CSV file. You can optionally use the below git command to clone the repository as below. This becomes your SOUCE_REPOSITORY

$ git clone https://github.com/aws-samples/aws-batch-processing-job-repo

$ cd aws-batch-processing-job-repo

$ aws cloudformation create-stack --stack-name batch-processing-job --template-body file://template/template.yaml --capabilities CAPABILITY_NAMED_IAM

 

When the preceding CloudFormation stack is created successfully, take a moment to identify the major components.

 

The CloudFormation stack spins up the following resources, which can be viewed in the AWS Management Console.

  1. CloudFormation Stack Name – batch-processing-job
  2. S3 Bucket Name – batch-processing-job-<YourAccountNumber>
    1. After the sample CSV file is dropped into this bucket, the process should kick start.
  3. JobDefinition – BatchJobDefinition
  4. JobQueue – BatchProcessingJobQueue
  5. Lambda – LambdaInvokeFunction
  6. DynamoDB – batch-processing-job
  7. Amazon CloudWatch Log – This is created when the first execution is made.
    1. /aws/batch/job
    2. /aws/lambda/LambdaInvokeFunction
  8. CodeCommit – batch-processing-job-repo
  9. CodeBuild – batch-processing-job-build

 

Once the above CloudFormation stack is complete in your personal account, we need to containerize the sample python application and push it to the ECR. This can be done in two ways as below:

 

Option A: CI/CD implementation.

As you notice, a CodeCommit and CodeBuild were created as part of above stack creation. CodeCommit URL can be found from your AWS Console > CodeCommit.  With this option, we can copy the contents from the downloaded source git repo and trigger deployment into your repository as soon as the code is checked in into your CodeCommit repository. Your CodeCommit repository will be similar to “https://git-codecommit.us-east-1.amazonaws.com/v1/repos/batch-processing-job-repo”

 

Below steps details to clone your code commit & push the changes to the repo

  1.     – $ git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/batch-processing-job-repo
  2.     – cd batch-processing-job-repo
  3.     – copy all the contents from SOURCE_REPOSITORY (from step 1) and paste inside this folder
  4.     – $ git add .
  5.     – $ git commit -m “commit from source”
  6.     – $ git push

 

You would notice as soon as the code is checked in into your CodeCommit repo, a build is triggered and Docker image built based on the Python source will be pushed to the ECR!

 

Option B: Pushing the Docker image to your repository manually in your local desktop

 

Optionally you can build the docker image and push it to the repository. The following commands build the Docker image from the provided Python code file and push the image to your Amazon ECR repository. Make sure to replace <YourAcccountNumber> with your information. The following sample CLI command uses the us-west-2 Region. If you change the Region, make sure to replace the Region values in the get-login, docker tag, and push commands, also.

 

#Get login credentials by copying and pasting the following into the command line

$ aws ecr get-login --region us-west-2 --no-include-email

# Build the Docker image.

$ docker build -t batch_processor .

# Tag the image to your repository.

$ docker tag batch_processor <YourAccountNumber>.dkr.ecr.us-west-2.amazonaws.com/batch-processing-job-repository

# Push your image to the repository.

$ docker push <YourAccountNumber>.dkr.ecr.us-west-2.amazonaws.com/batch-processing-job-repository

 

Navigate to the AWS Management Console, and verify that you can see the image in the Amazon ECR section (on the AWS Management Console screen).

 

Testing

  • In AWS Console, select “CloudFormation”. Select the S3 bucket that was created as part of the stack. This will be something like – batch-processing-job-<YourAccountNumber>
  • Drop the sample CSV file provided as part of the SOUCE_REPOSITORY

 

Code cleanup

To clean up, delete the contents of the Amazon S3 bucket and Amazon ECR repository.

 

In the AWS Management Console, navigate to your CloudFormation stack “batch-processing-job” and delete it.

 

Alternatively, run this command in AWS CLI to delete the job:

$ aws cloudformation delete-stack --stack-name batch-processing-job

 

Conclusion 

You were able to launch an application process involving AWS Batch to integrate with various AWS services. Depending on the scalability of the application needs, AWS Batch is able to process both faster and cost efficiently. I also provided a Python script, CloudFormation template, and a sample CSV file in the corresponding GitHub repo that takes care of all the preceding CLI arguments for you to build out the job definitions.

 

I encourage you to test this example and see for yourself how this overall orchestration works with AWS Batch. Then, it is just a matter of replacing your Python (or any other programming language framework) code, packaging it as a Docker container, and letting the AWS Batch handle the process efficiently.

 

If you decide to give it a try, have any doubt, or want to let me know what you think about the post, please leave a comment!

 


About the Author

Sivasubramanian Ramani (Siva Ramani) is a Sr Cloud Application Architect at AWS. His expertise is in application optimization, serverless solutions and using Microsoft application workloads with AWS.