AWS HPC Blog

Using the Terraform AWS Cloud Control provider for managing AWS Batch resources

As you might know, there are two Terraform providers for AWS. The original Terraform AWS provider is an open-source project with community-driven pull requests. The provider is a hand-coded infrastructure-as-code library that makes calls directly using the AWS SDK, which in turn calls AWS APIs. While this approach provides a great developer experience, we know that it can take some time to review and incorporate pull requests that support new AWS service features and new AWS services.

In the middle of 2024, the Terraform AWS Cloud Control (AWSCC) provider was made generally available by HashiCorp. This provider works with the AWS Cloud Control API, which is a set of common APIs that make it easy for developers and partners to manage the lifecycle of AWS and third-party services. In contrast to the original AWS Provider, the AWSCC provider is automatically generated based on the Cloud Control API published by AWS. That means the latest features and services from AWS can be supported right away.

Until recently, AWS Batch job definitions weren't supported by the AWS Cloud Control API as a managed resource. For this reason, the AWS Batch team focused on the original AWS provider when building a blueprint for working with AWS Batch on Amazon EKS. Now that AWS Batch job definitions are supported in the Cloud Control API, you can use the AWSCC provider to manage all of your Batch resources. Better yet, you can use both providers in the same stack, retaining your existing resources managed by the original AWS provider.
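Declaring both providers in one configuration is straightforward. Here's a minimal sketch of a terraform block that pulls in both; the version constraints and region are illustrative, so pin whatever versions your stack has tested against.

terraform {
  required_providers {
    # The original, hand-coded AWS provider
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0"
    }
    # The auto-generated AWS Cloud Control provider
    awscc = {
      source  = "hashicorp/awscc"
      version = ">= 1.0"
    }
  }
}

# Both providers use the same credential chain; the region is illustrative.
provider "aws" {
  region = "us-east-1"
}

provider "awscc" {
  region = "us-east-1"
}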

Let’s take a look at an example of managing Batch resources with these two providers.

Managing a Batch compute environment resource with Terraform

The following snippet of code showcases how to create and manage an AWS Batch compute environment using the original AWS provider.

resource "aws_batch_compute_environment" "sample" {
  compute_environment_name_prefix = "mySampleComputeEnv"

  compute_resources {
    type = "EC2"
    allocation_strategy = "BEST_FIT_PROGRESS"
    min_vcpus = 0
    max_vcpus = 256
    instance_type = [
      "c5",
      "m5",
      "r5"
    ]

    instance_role = aws_iam_instance_profile.ecs_instance_role.arn
    security_group_ids = [
      aws_security_group.sample.id
    ]
    subnets = [
      aws_subnet.sample.id
    ]
  }
  type         = "MANAGED"
}

Keep in mind that this example references other resources in the Terraform stack, such as the VPC security group and subnet. You can find a full deployment example in the Data on Amazon EKS blueprint for AWS Batch that was mentioned previously.
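For orientation, here's a rough sketch of what those referenced resources might look like. These are illustrative stand-ins, not part of the blueprint: the aws_vpc.sample and aws_iam_role.ecs_instance_role resources they depend on are assumed to be defined elsewhere in the stack.

# Illustrative stand-ins for the resources referenced by the CE above.
resource "aws_security_group" "sample" {
  name   = "mySampleSecurityGroup"
  vpc_id = aws_vpc.sample.id # assumes a VPC defined elsewhere in the stack
}

resource "aws_subnet" "sample" {
  vpc_id     = aws_vpc.sample.id
  cidr_block = "10.1.1.0/24"
}

resource "aws_iam_instance_profile" "ecs_instance_role" {
  name = "ecs_instance_role"
  role = aws_iam_role.ecs_instance_role.name # assumes a role with the ECS instance policy attached
}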

One detail worth noting is the use of a compute_environment_name_prefix instead of the compute_environment_name to set the name of the compute environment. By using a prefix, the AWS provider is able to handle deployments that would require a complete replacement of the compute environment (CE), much like a blue/green deployment for a microservice. With the prefix, the provider creates a new CE, moves the job queue association(s) to it, then deletes the old CE. Using the name (not the prefix) would mean that you take ownership of the order of operations for replacing the old CE with the new.
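To make the job queue association concrete, here's a minimal sketch of a job queue tied to that CE. It assumes a recent (v5+) AWS provider, where the compute_environment_order block supersedes the deprecated compute_environments list.

resource "aws_batch_job_queue" "sample" {
  name     = "mySampleJobQueue"
  state    = "ENABLED"
  priority = 1

  # When the CE is replaced, Terraform re-points this association to the
  # newly created CE before the old one is deleted.
  compute_environment_order {
    order               = 1
    compute_environment = aws_batch_compute_environment.sample.arn
  }
}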

Now let’s take a look at what this same resource would look like if it was managed by the AWSCC provider.

resource "awscc_batch_compute_environment" "sample" {
  compute_environment_name = "mySampleComputeEnv"
  compute_resources {
    type = "EC2"
    allocation_strategy = "BEST_FIT_PROGRESS"
    min_vcpus = 0
    max_vcpus = 256
    instance_types = [
      "c5",
      "m5",
      "r5"
    ]
    instance_role = aws_iam_instance_profile.ecs_instance_role.arn
    security_group_ids = [
      aws_security_group.sample.id
    ]
    subnets = [
         aws_subnet.sample.id
    ]
  type                        = "MANAGED"
  replace_compute_environment = false
  }
}

As you can see, the AWSCC-managed resource looks almost the same, and you can still reference resources created with the original AWS provider, like the IAM instance profile, security group, and subnet, in its definition.

The differences between the two snippets are:

  1. We changed the resource type to reflect the AWSCC provider resource for Batch compute environments. Note also that compute_resources is assigned as an attribute (compute_resources = { … }) rather than a nested block, since the AWSCC provider models nested structures as attributes.
  2. The AWSCC provider follows the Batch API and does not have an argument to define a prefix for the compute environment name. If a deployment requires you to replace an existing compute environment, you will also need to account for how to update any associated job queue(s) in the Terraform stack. For example, you would create a new compute environment with a different name, associate the job queue(s) with it, then deactivate and delete the old compute environment (see the sketch after this list).
  3. instance_types is now plural, and min_vcpus/max_vcpus become minv_cpus/maxv_cpus, matching the Batch API's instanceTypes, minvCpus, and maxvCpus fields. I'm not sure why the original AWS provider deviates from the Batch API naming here.
  4. There is an additional argument, replace_compute_environment, which we set to false. The reason is that CEs that use a service-linked role can update a larger set of attributes without having to replace the CE, in what is known as an infrastructure update. If this argument were true, you would be restricted to a much smaller set of updatable attributes in the CE. For more information on which settings trigger an infrastructure update, refer to the Batch user guide on updating compute environments.
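To illustrate the manual replacement flow from point 2, here's a sketch of an AWSCC-managed job queue that you would re-point at a replacement CE. The attribute names follow the Cloud Control schema for AWS::Batch::JobQueue, and the sequencing described in the comment is something you drive yourself, not provider-managed behavior.

resource "awscc_batch_job_queue" "sample" {
  job_queue_name = "mySampleJobQueue"
  state          = "ENABLED"
  priority       = 1

  # To replace the CE: add a second CE with a new name, point this list
  # at it, apply, then deactivate and remove the old CE from the stack.
  compute_environment_order = [{
    order               = 1
    compute_environment = awscc_batch_compute_environment.sample.compute_environment_arn
  }]
}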

The example shows that there can be differences in the arguments and behavior of the two providers that require close attention. For this reason, I suggest treading carefully when refactoring your existing Terraform-managed resources.

For managing new resources, or resources that see frequent updates (like AWS Batch job definitions), I definitely recommend the AWSCC provider. A case in point: configurable namespaces, persistent volume claims, container mount sub-path support, and pod annotations were recently added to job definitions for Batch on EKS. These features are available today in the auto-generated AWSCC provider, but they won't land in the original AWS provider until a community member submits a pull request and the maintainers review and accept it.
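As a sketch of what that looks like, here's an EKS job definition using a couple of those newer fields. The attribute shapes are my reading of the Cloud Control schema for AWS::Batch::JobDefinition (eks_properties, pod_properties, metadata), so double-check them against the generated AWSCC provider documentation before relying on this.

resource "awscc_batch_job_definition" "sample" {
  job_definition_name = "mySampleEksJobDef"
  type                = "container"

  eks_properties = {
    pod_properties = {
      # Configurable namespace and pod annotations: two of the newer
      # Batch on EKS job definition features mentioned above.
      metadata = {
        namespace = "batch-jobs"
        annotations = {
          "example.com/team" = "hpc"
        }
      }
      containers = [{
        name    = "main"
        image   = "public.ecr.aws/amazonlinux/amazonlinux:2023"
        command = ["echo", "hello from Batch on EKS"]
        resources = {
          requests = {
            cpu    = "1"
            memory = "1024Mi"
          }
        }
      }]
    }
  }
}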

A downside of leveraging the AWSCC provider – in my opinion – is that the provider's documentation of the resources is minimal, consisting of little more than the resource arguments' types. You will need to refer to the AWS Cloud Control resource type documentation to learn what the arguments represent, and any caveats with how you'd use them in practice. This creates a small amount of developer friction when using the AWSCC provider.

Conclusion

With the addition of AWS Batch job definition support to the AWS Cloud Control API, it's now possible to manage all AWS Batch resources using the Terraform AWS Cloud Control provider. You can manage new or fast-changing resources with the AWSCC provider alongside the original AWS provider.

Try it out on your own resources or start out with the Data on Amazon EKS AWS Batch blueprint. If you do start using the AWSCC provider with Batch, drop us a line at ask-hpc@amazon.com and let us know how we can improve the experience.

Angel Pizarro

Angel is a Principal Developer Advocate for HPC and scientific computing. His background is in bioinformatics application development and building system architectures for scalable computing in genomics and other high throughput life science domains.