Batch processing for Microsoft Windows Server workloads on AWS
At-scale computing is required in many industries and domains. Cloud computing provides elastic, on-demand access to large amounts of computing resources and enables economically efficient and technically flexible solutions naturally suited for computing at scale.
Batch processing is a requirement for many scale-out computing solutions. Customers use batch processing as a non-interactive way to compute outputs. These outputs can be used to produce simulation results, analyze large datasets, train AI/ML models, or render digital media content.
Traditionally, batch processing has been the domain of the Linux operating system, which is natively supported by AWS services such as AWS Batch and AWS ParallelCluster. Customers are now starting to leverage the cloud to modernize and automate batch workflows involving Microsoft Windows Server. These customers often cannot switch operating systems easily because of software dependencies. As a result, they resort to custom-built solutions, which require significant upfront implementation and ongoing maintenance effort.
This blog post provides a reusable and general framework for running Windows Server batch processing workloads on AWS. We will show you how to leverage Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate batch jobs with Amazon Elastic Container Service (Amazon ECS), which provides the compute runtime for Windows containers. We will provide step-by-step instructions and AWS CloudFormation templates, allowing you to go hands-on, experiment with the setup, and customize it to your needs.
Solution overview
The following diagram shows a high-level overview of the solution with the individual layers described in more detail.
Figure 1: High-level architecture showing the orchestration (MWAA), compute (ECS), and storage (S3) layers
Orchestration layer
Amazon MWAA is used to orchestrate compute tasks and model dependencies between them. For example, for a particular workflow, you may first need to pre-process the source data, then run simulations and aggregate the results.
In the context of Airflow, workflows are defined using Directed Acyclic Graphs (DAGs) written in Python, which allows for flexible modeling of complex dependencies between tasks. Besides running the actual batch compute jobs, tasks in a workflow can be used to perform actions on the underlying infrastructure. Examples include scaling up a cluster for subsequent parallel processing or scaling it down once an upstream batch job has completed.
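As an illustration of this dependency modeling, here is a minimal DAG sketch (not the sample's actual DAG). It assumes the Amazon provider package that ships `EcsRunTaskOperator` is installed, and the cluster and task definition names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="windows-batch-sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # run only when triggered from the UI or API
    catchup=False,
) as dag:
    def ecs_task(task_id: str, task_definition: str) -> EcsRunTaskOperator:
        """Helper that runs one Windows container on the ECS cluster."""
        return EcsRunTaskOperator(
            task_id=task_id,
            cluster="windows-batch-cluster",   # placeholder cluster name
            task_definition=task_definition,   # placeholder task definition
            launch_type="EC2",
            overrides={"containerOverrides": []},
        )

    preprocess = ecs_task("preprocess", "windows-preprocess")
    simulations = [ecs_task(f"simulate_{i}", "windows-simulation") for i in range(4)]
    aggregate = ecs_task("aggregate", "windows-aggregate")

    # Pre-process first, then fan out the simulations, then aggregate.
    preprocess >> simulations >> aggregate
```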
Once a DAG is initialized and registered with Amazon MWAA, you can take a no-code approach towards running your Windows batch jobs. MWAA comes with a web-based Airflow UI. You can use this to run workflows, configure inputs, and monitor progress. The web UI user database can also be integrated with external identity providers and supports common enterprise authentication protocols.
Compute layer
We leverage Amazon ECS to provide the compute capacity for Windows batch jobs using Amazon Elastic Compute Cloud (Amazon EC2). Individual tasks run as Docker containers, independently and isolated from one another. In this solution, we rely on the Amazon ECS optimized Windows AMIs provided by AWS.
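For illustration, here is a minimal boto3 sketch of registering a Windows task definition for such a cluster. The family, container image, and resource sizes are placeholder assumptions rather than values from the sample's templates:

```python
import boto3

ecs = boto3.client("ecs")

# Register a task definition targeting Windows containers on EC2.
ecs.register_task_definition(
    family="windows-simulation",                   # placeholder family name
    requiresCompatibilities=["EC2"],
    runtimePlatform={
        "operatingSystemFamily": "WINDOWS_SERVER_2019_CORE",  # assumed base OS
        "cpuArchitecture": "X86_64",
    },
    containerDefinitions=[{
        "name": "simulation",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/win-sim:latest",  # placeholder
        "cpu": 1024,       # 1 vCPU
        "memory": 2048,    # MiB
        "essential": True,
    }],
)
```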
For more information on Windows AMIs, refer to the Microsoft Software Supplemental License for Windows Container Base Image. To learn more about Microsoft licensing options for Amazon ECS optimized AMIs, refer to the Licensing – Windows Server section in the AWS and Microsoft FAQ.
Amazon EC2-based clusters are well suited for batch workloads, allowing more direct access to the underlying infrastructure – for example, when GPUs, instance storage, or Amazon Elastic Block Store (Amazon EBS) volumes are required. Windows container images can be large, so container download and extraction can take longer than the compute itself. For jobs that consist of many short subsequent tasks, caching the container image on the Amazon EC2 instances greatly reduces the total runtime, since the download and extraction time is incurred only once.
Storage layer
You will use Amazon Simple Storage Service (Amazon S3) to store the inputs and outputs of batch jobs. The solution can be customized to give jobs access to Amazon EC2 instance storage or Amazon EBS volumes for jobs that have high data throughput and can take advantage of local storage. If shared storage is needed, Amazon FSx for Windows File Server can easily be added to the solution as a high-performance, robust shared file storage service.
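As a sketch of how a batch task can persist its output, the following snippet uploads a result file to the output bucket with boto3. The bucket name matches the walkthrough below, while the key layout and file name are illustrative assumptions:

```python
import boto3

s3 = boto3.client("s3")

def upload_result(task_id: str, local_path: str) -> None:
    """Upload a finished task's result file to the shared output bucket."""
    s3.upload_file(
        Filename=local_path,
        Bucket="ecs-bucket-your-uuid",           # output bucket from the walkthrough
        Key=f"sim/task_results/{task_id}.json",  # assumed key layout
    )

upload_result("task-001", "C:\\work\\result-001.json")  # Windows container path
```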
Walkthrough
Going through this example will take approximately two hours (most of which is waiting for the infrastructure to deploy and the batch simulation to finish) and will generate about $5 of charges to the AWS account used. No special prerequisites are required other than access to an AWS account via the AWS Management Console.
Creating the infrastructure
Perform the following steps in the console to deploy the infrastructure:
1. In the Amazon S3 console, create a bucket for storing the Amazon MWAA DAG. Choose a unique name for the bucket, such as `mwaa-bucket-your-uuid` (where `your-uuid` can be any unique identifier).
2. In this bucket, create a folder named `dags`, download the sample DAG file, and upload it to the `dags` folder:

Figure 2: Uploaded DAG file inside the MWAA bucket

3. Click to deploy the Amazon MWAA infrastructure. Choose a stack name (like `mwaa-stack`), and for the `AirflowBucketName` parameter, enter the bucket from the previous step (`mwaa-bucket-your-uuid` in the previous example). Click Next and Submit to create the stack.
4. Create another Amazon S3 bucket for storing the output files of your batch jobs. Again, select a unique name, such as `ecs-bucket-your-uuid`.
5. Click to deploy the Amazon ECS infrastructure. Choose a stack name (such as `ecs-stack`) and enter `ecs-bucket-your-uuid` for the `EcsBucketName` parameter.
Alternatively, if you prefer a CLI-based workflow over the console method described previously, you can clone the GitHub repository and run the `deploy.sh` script.
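For reference, the same deployment can also be scripted with boto3. The sketch below assumes the two CloudFormation templates have been downloaded locally; the template file names are placeholders, and `deploy.sh` in the repository remains the supported path:

```python
import boto3

cfn = boto3.client("cloudformation")

# Deploy the MWAA stack, pointing it at the DAG bucket created earlier.
cfn.create_stack(
    StackName="mwaa-stack",
    TemplateBody=open("mwaa-template.yaml").read(),  # placeholder file name
    Parameters=[{
        "ParameterKey": "AirflowBucketName",
        "ParameterValue": "mwaa-bucket-your-uuid",
    }],
    Capabilities=["CAPABILITY_IAM"],  # the stacks create IAM roles
)

# Deploy the ECS stack with the output bucket parameter.
cfn.create_stack(
    StackName="ecs-stack",
    TemplateBody=open("ecs-template.yaml").read(),  # placeholder file name
    Parameters=[{
        "ParameterKey": "EcsBucketName",
        "ParameterValue": "ecs-bucket-your-uuid",
    }],
    Capabilities=["CAPABILITY_IAM"],
)

# Block until both stacks reach CREATE_COMPLETE.
waiter = cfn.get_waiter("stack_create_complete")
for name in ("mwaa-stack", "ecs-stack"):
    waiter.wait(StackName=name)
```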
Creation of the CloudFormation stacks (in particular the MWAA part) can take up to 35 minutes. Check the CloudFormation console to monitor the progress. Once both stacks are in the `CREATE_COMPLETE` state, navigate to the MWAA console and click Open Airflow UI for the `MWAA-Batch-Compute-Environment`.
Running a simulation batch job
Overview of the batch job structure
In the Airflow UI, click on `windows-batch-compute-dag` and select the Graph tab to inspect the DAG we use in the context of this example. The following diagram shows its structure:
Figure 3: DAG used for the simulation
The first task scales up the Amazon ECS cluster to achieve the desired degree of parallelism for the subsequent compute tasks. Each compute task runs a simulation within a Windows container on the Amazon ECS cluster and uploads a results file to Amazon S3 when finished. Once all compute tasks finish, another Amazon ECS task aggregates the results to provide an aggregated output. Finally, the last task scales down the Amazon ECS cluster. More details on the actual simulation workload are provided at the end of this section.
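Under the hood, resizing the cluster amounts to adjusting the desired EC2 capacity behind it. Here is a minimal boto3 sketch of what the scale-up and scale-down tasks can call, assuming the instances are managed by an Auto Scaling group (the group name is a placeholder):

```python
import boto3

def set_cluster_size(desired_capacity: int) -> None:
    """Resize the EC2 Auto Scaling group backing the ECS cluster."""
    boto3.client("autoscaling").set_desired_capacity(
        AutoScalingGroupName="ecs-windows-asg",  # placeholder group name
        DesiredCapacity=desired_capacity,
    )

set_cluster_size(6)  # scale out before the simulation tasks start
# ... simulation and aggregation tasks run here ...
set_cluster_size(0)  # scale in once the workflow completes
```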
Create and monitor the simulation
Return to the DAGs overview and kick off a batch simulation by clicking Trigger DAG w/ config as shown in the following screenshot:
Figure 4: Triggering the workflow from the MWAA Airflow UI
You will then be prompted to configure simulation parameters, as follows:
Figure 5: Primary parameter setup for the DAG run in MWAA console
Set the name of the Amazon S3 bucket to the one created earlier for storing Amazon ECS task outputs (for example, `ecs-bucket-your-uuid`) and click Trigger. The following sequence of events will then happen in the context of the batch job you just submitted, taking about one hour to complete:
- The Amazon ECS cluster will be scaled up to six instances (of type c5.xlarge, unless you changed the defaults when deploying the Amazon ECS stack). The entire cluster with the default configuration has 24 vCPUs and 48 GB of memory available to run simulation tasks in parallel.
- Once the scale-up task completes, the Amazon ECS cluster will start to run simulation tasks. The MWAA environment will automatically scale, increasing the number of workers until it can handle 24 concurrent tasks and load the Amazon ECS cluster fully. Each task runs a simulation inside a Windows container on the Amazon ECS cluster, takes about five minutes to complete, and uploads its result to the Amazon S3 bucket. Overall, 200 simulation tasks will run, with at most 24 in parallel due to the cluster size and the resource requirements of each task. You can check the Container Insights console to monitor the state of your cluster, as shown in the following figure:
Figure 6: Monitoring the execution in the ECS console
- Once all simulation tasks have completed, an aggregate task reads the outputs of all preceding tasks from Amazon S3 and combines them into a single file to generate a plot.
- Finally, a scale-in task sets the cluster size to zero.
You can monitor the progress in the Airflow UI. Once the DAG run finishes, you can check the results files generated in the Amazon S3 bucket. Specifically, the `sim/task_results` prefix should contain 200 files, one per completed task, while the `sim/aggregate_results` prefix stores the combined results and a corresponding visualization plot in the form of a PNG file:
Figure 7: Result outputs in S3 bucket
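If you prefer to verify the run programmatically rather than in the console, here is a short boto3 sketch that counts the per-task result objects (the bucket name matches the one from the walkthrough):

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

count = 0
for page in paginator.paginate(
    Bucket="ecs-bucket-your-uuid",  # output bucket from the walkthrough
    Prefix="sim/task_results/",
):
    count += len(page.get("Contents", []))

print(f"{count} result files found")  # expect 200 after a full run
```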
Diving deeper and customizing the sample
Now let’s dive deeper into the actual simulation workload. For illustration purposes, we have created a simple compute simulation that calculates the Bit Error Rate (BER) of a telecommunication channel depending on its Signal-to-Noise Ratio (SNR), using the free GNU Octave software running inside the Windows containers.
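The simulation itself is implemented as Octave scripts in the repository. As a rough, self-contained illustration of the kind of Monte Carlo computation each task performs, the following Python sketch estimates the BER of BPSK modulation over an AWGN channel (the modulation and channel model here are our simplifying assumptions, not necessarily those of the sample):

```python
import numpy as np

rng = np.random.default_rng(42)

def ber_bpsk(snr_db: float, n_bits: int = 1_000_000) -> float:
    """Estimate the bit error rate of BPSK over an AWGN channel."""
    bits = rng.integers(0, 2, n_bits)          # random payload bits
    symbols = 2 * bits - 1                     # map {0, 1} -> {-1, +1}
    snr_linear = 10 ** (snr_db / 10)           # Eb/N0 in linear scale
    noise_std = np.sqrt(1 / (2 * snr_linear))  # per-sample noise sigma
    received = symbols + noise_std * rng.normal(size=n_bits)
    decoded = (received > 0).astype(int)       # hard-decision detector
    return float(np.mean(decoded != bits))

# One point on the BER-vs-SNR curve per call; the DAG fans these out.
for snr in range(0, 9, 2):
    print(f"SNR {snr} dB -> BER {ber_bpsk(snr):.2e}")
```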
The `results.png` file contained in the `aggregate_results` prefix is shown in the following image:
Figure 8: Plot of the simulation’s numerical output
Individual points on this curve are the combined result of the 200 simulation tasks that we just ran across six Windows instances on the Amazon ECS cluster. The final aggregation job then examined all the individual task results and counted the overall number of erroneous bits in order to generate this plot.
The GitHub repository holds all of the details. In particular, the Docker section contains the simulation scripts and instructions to build the container and run the simulation. The DAG file contains the overall orchestration logic to run this sample end-to-end. If you intend to build customized containers with specialized software and workflows, dive deep here to understand how these components tie together so that you can apply them successfully to your workloads.
Cleaning up
Empty and delete both Amazon S3 buckets. Delete both CloudFormation stacks created earlier.
Conclusion
In this blog post, we showed how Amazon Managed Workflows for Apache Airflow can be used to simplify and modernize Windows batch computing workflows on AWS. Amazon MWAA provides an interactive UI to create, monitor, and manage batch workflows. The UI enhances usability and makes activities such as reviewing the state of jobs and gathering runtime information (logs, duration, and status) accessible to non-technical users.
Using Amazon EC2 Windows containers, customers can focus on their applications and use Windows AMIs without worrying about potential interference between the operating system and their application. In conjunction with Amazon ECS, Amazon MWAA enables customers to take advantage of elastic compute resources on AWS, enabling scalable batch computing workloads and control over parallelism and runtime.
You can start tailoring our solution to your needs with this GitHub repository.
AWS can help you assess how your company can get the most out of the cloud. Join the millions of AWS customers that trust us to migrate and modernize their most important applications in the cloud. To learn more about modernizing Windows Server or SQL Server, visit Windows on AWS. Contact us to start your modernization journey today.