Microsoft Workloads on AWS

Batch processing for Microsoft Windows Server workloads on AWS

Computing at scale is a requirement in many industries and domains. Cloud computing provides elastic, on-demand access to large amounts of computing resources and enables economically efficient and technically flexible solutions that are naturally suited for computing at scale.

Batch processing is a requirement for many scale-out computing solutions. Customers use batch processing as a non-interactive way of computing outputs. These outputs can be used to produce simulation results, analyze large datasets, train AI/ML models, or render digital media content.

Traditionally, batch processing has been a domain of the Linux operating system, which is natively supported by AWS services such as AWS Batch and AWS ParallelCluster. Customers are now starting to leverage the cloud to modernize and automate batch workflows involving Microsoft Windows Server. These users cannot easily switch between operating systems because of software dependencies. As a result, they resort to custom-built solutions, which require significant upfront implementation and ongoing maintenance effort.

This blog post provides a reusable and general framework for running Windows Server batch processing workloads on AWS. We will show you how to leverage Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate batch jobs with Amazon Elastic Container Service (Amazon ECS), which provides the compute runtime for Windows containers. We will provide step-by-step instructions and AWS CloudFormation templates, allowing you to get hands-on, experiment with the setup, and customize it to your needs.

Solution overview

The following diagram shows a high-level overview of the solution with the individual layers described in more detail.

Figure 1: Architectural high-level design showing the orchestration layer (Amazon MWAA), compute layer (Amazon ECS), and storage layer (Amazon S3)

Orchestration layer

Amazon MWAA is used to orchestrate compute tasks and model dependencies between them. For example, for a particular workflow, you may first need to pre-process the source data, then run simulations and aggregate the results.

In the context of Airflow, workflows are defined as Directed Acyclic Graphs (DAGs) written in Python, which allows for flexible modeling of complex dependencies between tasks. In addition to running the actual batch compute jobs, tasks in a workflow can perform actions on the underlying infrastructure. Examples include scaling up a cluster for subsequent parallel processing or scaling it down once an upstream batch job has completed.
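Purely as an illustration, here is a minimal DAG sketch for the pre-process, simulate, aggregate pattern described above. The task names and callables are placeholders, not the actual DAG used in this post:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; a real workflow would invoke the actual
# pre-processing, simulation, and aggregation logic here.
def preprocess():
    print("pre-processing source data")

def simulate():
    print("running one simulation")

def aggregate():
    print("aggregating results")

with DAG(
    dag_id="example-batch-dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # triggered manually, e.g. from the Airflow UI
    catchup=False,
) as dag:
    pre = PythonOperator(task_id="preprocess", python_callable=preprocess)
    # Fan out into two parallel simulation tasks for illustration.
    sims = [
        PythonOperator(task_id=f"simulate_{i}", python_callable=simulate)
        for i in range(2)
    ]
    agg = PythonOperator(task_id="aggregate", python_callable=aggregate)

    # Model the dependencies: pre-process first, then simulate, then aggregate.
    pre >> sims >> agg
```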

Once a DAG is initialized and registered with Amazon MWAA, you can take a no-code approach to running your Windows batch jobs. Amazon MWAA comes with the web-based Airflow UI, which you can use to run workflows, configure inputs, and monitor progress. The web UI user database can also be integrated with external identity providers and supports common enterprise authentication protocols.

Compute layer

We leverage Amazon ECS to provide the compute capacity for Windows batch jobs using Amazon Elastic Compute Cloud (Amazon EC2). Individual tasks run independently, isolated from one another as Docker containers. In this solution, we rely on the Amazon ECS optimized Windows AMIs provided by AWS.

For more information on Windows AMIs, refer to the Microsoft Software Supplemental License for Windows Container Base Image. To learn more about Microsoft licensing options for Amazon ECS optimized AMIs, refer to the Licensing – Windows Server section in the AWS and Microsoft FAQ.

Amazon EC2-based clusters are well suited for batch workloads because they allow more direct access to the underlying infrastructure, for example, when GPU access, instance storage, or Amazon Elastic Block Store (Amazon EBS) is required. Windows container images can be large, so downloading and extracting a container can take longer than the compute itself. For jobs that consist of many short subsequent tasks, caching the container image on the Amazon EC2 instances greatly reduces the total runtime, since the download and extraction time is incurred only once.
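To make this more concrete, here is a minimal sketch of how a Windows batch task definition could be registered with boto3. The family name, image URI, and sizing values are illustrative assumptions, not the values used by the sample stack:

```python
import boto3

ecs = boto3.client("ecs")

# Illustrative values only; the sample stack defines its own task definition.
response = ecs.register_task_definition(
    family="windows-batch-sim",
    requiresCompatibilities=["EC2"],
    containerDefinitions=[{
        "name": "simulation",
        # Hypothetical image URI; Windows base images can be several GB in size.
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/win-sim:latest",
        "cpu": 1024,     # 1 vCPU
        "memory": 2048,  # 2 GB
        "essential": True,
    }],
)
print(response["taskDefinition"]["taskDefinitionArn"])
```

To benefit from image caching, the ECS container agent's ECS_IMAGE_PULL_BEHAVIOR option can be set to prefer-cached on the container instances, so that an image already present on an instance is reused rather than pulled again; see the ECS container agent configuration documentation for details.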

Storage layer

You will use Amazon Simple Storage Service (Amazon S3) to store inputs and outputs of batch jobs. The solution can be customized to provide jobs with access to Amazon EC2 instance storage or leverage Amazon EBS volumes for jobs that have high data throughput and can take advantage of local storage. If shared storage is needed, Amazon FSx for Windows File Server can easily be added to the solution as a high performance and robust shared storage service.
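For example, a compute task could persist its output to Amazon S3 with a few lines of boto3; the bucket name and key below are placeholders that mirror the naming used later in this walkthrough:

```python
import boto3

s3 = boto3.client("s3")

# Upload a local result file produced by the simulation (names are placeholders).
s3.upload_file(
    Filename="result_042.json",
    Bucket="ecs-bucket-your-uuid",
    Key="sim/task_results/result_042.json",
)
```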

Walkthrough

Going through this example will take approximately two hours (most of which is waiting for the infrastructure to deploy and the batch simulation to finish computing) and will incur about $5 in charges to the AWS account used. No special prerequisites are required other than access to an AWS account via the AWS Management Console.

Creating the infrastructure

Perform the following steps in the console to deploy the infrastructure:

  1. In the Amazon S3 console, create a bucket for storing the Amazon MWAA DAG. Choose a unique name for the bucket such as mwaa-bucket-your-uuid (where your-uuid can be any unique identifier).
  2. In this bucket, create a folder named dags, download the sample DAG file, and upload it to the dags folder:

     Figure 2: Uploaded DAG file inside the MWAA bucket
  3. Click to deploy the Amazon MWAA infrastructure. Choose a stack name (like mwaa-stack), and for the AirflowBucketName parameter, enter the bucket name from the previous step (mwaa-bucket-your-uuid in the previous example). Click Next and Submit to create the stack.
  4. Create another Amazon S3 bucket for storing the output files of your batch jobs. Again, select a unique name, such as ecs-bucket-your-uuid.
  5. Click to deploy the Amazon ECS infrastructure. Choose a stack name (such as ecs-stack) and enter ecs-bucket-your-uuid for the EcsBucketName parameter.

Alternatively, if you prefer a CLI-based workflow over the console method described previously, you can clone the GitHub repository and run the deploy.sh script.
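If you would rather script the deployment in Python, a rough boto3 sketch could look like the following. The local file names, template parameters, and capabilities are assumptions; check the repository's deploy.sh for the actual values:

```python
import boto3

s3 = boto3.client("s3")
cfn = boto3.client("cloudformation")

# Bucket names must be globally unique; replace your-uuid accordingly.
# Outside us-east-1, also pass a CreateBucketConfiguration with your region.
s3.create_bucket(Bucket="mwaa-bucket-your-uuid")
s3.create_bucket(Bucket="ecs-bucket-your-uuid")

# Upload the sample DAG file into the dags/ folder of the MWAA bucket
# (the local file name is a placeholder).
s3.upload_file(Filename="dag.py", Bucket="mwaa-bucket-your-uuid", Key="dags/dag.py")

# The template file name is hypothetical; see the repository for the real one.
with open("mwaa-stack.yaml") as template:
    cfn.create_stack(
        StackName="mwaa-stack",
        TemplateBody=template.read(),
        Parameters=[{
            "ParameterKey": "AirflowBucketName",
            "ParameterValue": "mwaa-bucket-your-uuid",
        }],
        Capabilities=["CAPABILITY_IAM"],
    )

# Block until the stack has finished creating (can take ~35 minutes for MWAA).
cfn.get_waiter("stack_create_complete").wait(StackName="mwaa-stack")
```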

Creation of the CloudFormation stacks (in particular, the MWAA part) can take up to 35 minutes. Check the CloudFormation console to monitor the progress. Once both stacks are in the CREATE_COMPLETE state, navigate to the MWAA console and click Open Airflow UI for the MWAA-Batch-Compute-Environment.

Running a simulation batch job

Overview of the batch job structure

In the Airflow UI, click windows-batch-compute-dag and select the Graph tab to inspect the DAG we use in the context of this example. The following diagram shows its structure:

Figure 3: DAG used for the simulation

The first task scales up the Amazon ECS cluster to achieve the desired degree of parallelism for the subsequent compute tasks. Each compute task runs a simulation within a Windows container on the Amazon ECS cluster, which uploads a results file to Amazon S3 when finished. Once all compute tasks finish, another Amazon ECS task aggregates the results to provide an aggregated output. Finally, the last task scales down the Amazon ECS cluster. More details on the actual simulation workload are provided at the end of this section.
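The following sketch shows how this structure could be expressed in a DAG. The cluster, task definition, and Auto Scaling group names are placeholders, and depending on your version of the Amazon provider package, the ECS operator may be named ECSOperator instead of EcsRunTaskOperator; the actual DAG in the repository differs in its details:

```python
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

ASG_NAME = "windows-batch-asg"  # placeholder Auto Scaling group name

def set_cluster_size(desired: int):
    # The ECS cluster capacity is backed by an EC2 Auto Scaling group,
    # so scaling the cluster means setting the group's desired capacity.
    boto3.client("autoscaling").set_desired_capacity(
        AutoScalingGroupName=ASG_NAME, DesiredCapacity=desired
    )

with DAG(
    dag_id="windows-batch-compute-sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    scale_up = PythonOperator(
        task_id="scale_up", python_callable=set_cluster_size, op_kwargs={"desired": 6}
    )
    simulations = [
        EcsRunTaskOperator(
            task_id=f"simulate_{i}",
            cluster="windows-batch-cluster",      # placeholder cluster name
            task_definition="windows-batch-sim",  # placeholder task definition
            launch_type="EC2",
            # Pass per-task parameters, e.g. via environment variables.
            overrides={"containerOverrides": [{
                "name": "simulation",
                "environment": [{"name": "TASK_INDEX", "value": str(i)}],
            }]},
        )
        for i in range(200)  # 200 simulation tasks, as in this example
    ]
    aggregate = EcsRunTaskOperator(
        task_id="aggregate_results",
        cluster="windows-batch-cluster",
        task_definition="windows-batch-aggregate",  # placeholder
        launch_type="EC2",
        overrides={},
    )
    scale_down = PythonOperator(
        task_id="scale_down", python_callable=set_cluster_size, op_kwargs={"desired": 0}
    )

    scale_up >> simulations >> aggregate >> scale_down
```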

Create and monitor the simulation

Return to the DAGs overview and kick off a batch simulation by clicking Trigger DAG w/ config as shown in the following screenshot:

Figure 4: Triggering the workflow from the MWAA console

You will then be prompted to configure simulation parameters, as follows:

Figure 5: Parameter setup for the DAG run in the MWAA console

Set the name of the Amazon S3 bucket to the one created earlier for storing Amazon ECS task outputs (for example, ecs-bucket-your-uuid) and click Trigger. The following sequence of events will then happen in the context of the batch job you just submitted, taking about one hour to complete:

  1. The Amazon ECS cluster will be scaled up to six instances (of type c5.xlarge, unless you changed the defaults when deploying the Amazon ECS stack). With the default configuration, the entire cluster has 24 vCPUs and 48 GB of memory available to run simulation tasks in parallel.
  2. Once the scale-up task completes, the Amazon ECS cluster will start to run simulation tasks. The MWAA environment will automatically scale, increasing the number of workers until it can handle 24 concurrent tasks and load the Amazon ECS cluster fully. Each task runs a simulation inside a Windows container on the Amazon ECS cluster, takes about 5 minutes to complete, and uploads its result to the Amazon S3 bucket. Overall, 200 simulation tasks will be run, with at most 24 in parallel due to the cluster size and the resource requirements of each task. You can check the Container Insights console to monitor the state of your cluster, as shown in the following figure:

     Figure 6: Monitoring CPU, memory, and batch job task metrics in the ECS console

  3. Once all simulation tasks have completed, an aggregate task reads all outputs of the preceding tasks from Amazon S3 and combines them into a single file to generate a plot (see the sketch after this list).
  4. Finally, a scale-in task sets the cluster size to zero.
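To illustrate step 3 of this sequence, an aggregation task could collect the per-task outputs roughly as follows. The bucket name, prefixes, and plain-text result format are assumptions for illustration:

```python
import boto3

s3 = boto3.client("s3")
bucket = "ecs-bucket-your-uuid"

# Collect every per-task result stored under the sim/task_results prefix.
results = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="sim/task_results/"):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        results.append(body.decode("utf-8").strip())

# Combine everything into a single results file under sim/aggregate_results.
s3.put_object(
    Bucket=bucket,
    Key="sim/aggregate_results/results.txt",
    Body="\n".join(results).encode("utf-8"),
)
```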

You can monitor the progress in the Airflow UI. Once the DAG run finishes, you can check the results files generated in the Amazon S3 bucket. Specifically, the prefix sim/task_results should contain 200 files, one per completed task, while the sim/aggregate_results prefix stores the combined results and a corresponding visualization plot in the form of a PNG file:

Figure 7: Per-task and aggregated result outputs in the S3 bucket

Diving deeper and customizing the sample

Now let's dive deeper into the actual simulation workload. For illustration purposes, we have created a simple compute simulation that calculates the Bit Error Rate (BER) of a telecommunication channel depending on its Signal-to-Noise Ratio (SNR), using the free GNU Octave software running inside the Windows containers.
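The repository implements the simulation in GNU Octave. Purely for illustration, a minimal Monte Carlo equivalent in Python, assuming BPSK modulation over an AWGN channel (the post does not specify the exact channel model), could look like this:

```python
import numpy as np

def ber_for_snr(snr_db: float, n_bits: int = 1_000_000) -> float:
    """Estimate the bit error rate of BPSK over an AWGN channel."""
    rng = np.random.default_rng(seed=42)
    bits = rng.integers(0, 2, n_bits)
    symbols = 2 * bits - 1                     # map bits {0,1} to symbols {-1,+1}
    snr_linear = 10 ** (snr_db / 10)
    noise_std = np.sqrt(1 / (2 * snr_linear))  # noise power per real dimension
    received = symbols + noise_std * rng.standard_normal(n_bits)
    decisions = (received > 0).astype(int)     # hard-decision demodulation
    return float(np.mean(decisions != bits))

# Sweep the SNR to trace out a BER curve like the one in Figure 8.
for snr_db in range(0, 10, 2):
    print(f"SNR {snr_db} dB -> BER {ber_for_snr(snr_db):.2e}")
```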

The results.png file contained in the sim/aggregate_results prefix is shown in the following image:

Figure 8: The computed BER curve, plotted from the simulation's numerical output

Individual points on this curve are the combined result of the 200 simulation tasks that we just ran across six Windows instances on the Amazon ECS cluster. The final aggregation job then examined all the individual task results and counted the overall number of erroneous bits in order to generate this plot.

The GitHub repository holds all of the details. In particular, the Docker section contains the simulation scripts and instructions to build the container and run the simulation. The DAG file contains the overall orchestration logic to run this sample end-to-end. If you intend to build customized containers with specialized software and workflows, dive deep here to understand how these components tie together so that you can apply them successfully to your workloads.

Cleaning up

Empty and delete both Amazon S3 buckets. Delete both CloudFormation stacks created earlier.
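If you prefer to clean up programmatically, a boto3 sketch such as the following would work, assuming the bucket and stack names chosen during setup (a versioned bucket would additionally require deleting all object versions):

```python
import boto3

# Empty and delete both S3 buckets.
s3 = boto3.resource("s3")
for name in ["mwaa-bucket-your-uuid", "ecs-bucket-your-uuid"]:
    bucket = s3.Bucket(name)
    bucket.objects.all().delete()  # buckets must be empty before deletion
    bucket.delete()

# Delete both CloudFormation stacks.
cfn = boto3.client("cloudformation")
for stack_name in ["mwaa-stack", "ecs-stack"]:
    cfn.delete_stack(StackName=stack_name)
```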

Conclusion

In this blog post, we showed how Amazon Managed Workflows for Apache Airflow can be used to simplify and modernize Windows batch computing workflows on AWS. Amazon MWAA provides an interactive UI to create, monitor, and manage batch workflows. The UI enhances usability and makes activities such as reviewing the state of jobs and gathering runtime information (logs, duration, and status) accessible to non-technical users.

Using Amazon EC2 Windows containers, customers can focus on their applications and use Windows AMIs without worrying about potential interference between the operating system and their application. In conjunction with Amazon ECS, Amazon MWAA enables customers to take advantage of elastic compute resources on AWS, enabling scalable batch computing workloads and control over parallelism and runtime.

You can start tailoring our solution to your needs with this GitHub repository.


AWS can help you assess how your company can get the most out of cloud. Join the millions of AWS customers that trust us to migrate and modernize their most important applications in the cloud. To learn more about modernizing Windows Server or SQL Server, visit Windows on AWS. Contact us to start your modernization journey today.

Petar Forai

Petar Forai is a Senior Cloud Infrastructure Architect at AWS Professional Services. Petar has a background in high performance computing and has built on-premises and cloud-based HPC solutions. At AWS, Petar specializes in cloud foundations and computing-at-scale topics.

Michael Meidlinger

Michael Meidlinger is a Sr. Solutions Architect at AWS, where he works with enterprise customers in regulated industries. He holds a PhD in telecommunications engineering, with a background in signal processing, statistics, networking, embedded systems development, and Linux system administration. Before joining AWS, he worked on mission-critical compute systems and safe cloud computing concepts.