AWS Cloud Operations & Migrations Blog

Automating Amazon EC2 Auto Scaling with Amazon CloudWatch custom metrics and AWS CDK

Introduction

As customers migrate legacy workloads to the AWS Cloud, they often need to rehost or replatform applications onto Amazon EC2 servers. To benefit from the scalability of the cloud, customers need to be able to scale these EC2 servers up or down, both on demand and on a schedule.

Amazon EC2 Auto Scaling groups provide on-demand scaling by automatically adding or removing servers when defined conditions on standard infrastructure metrics, such as CPU or memory usage, are met. They can also scale up or down on pre-defined schedules. However, many legacy applications need to scale based on custom metrics that the application itself emits, rather than on standard infrastructure metrics.

Amazon CloudWatch collects monitoring and operational data from AWS services and resources, and provides actionable insights such as alarms that respond to changes in an application or its infrastructure. CloudWatch custom metrics can be integrated with Amazon EC2 Auto Scaling groups so that scaling driven by an application custom metric is maintained throughout the lifecycle of every EC2 server in the group.

This blog demonstrates how to use the AWS Cloud Development Kit (AWS CDK) to automatically provision infrastructure for customer use cases such as trading platforms, where servers holding long-lived TCP connections must scale up based on how many orders each server is processing and scale down on weekends.

Solution architecture

The following architecture can be automated and implemented on the AWS Cloud for trading platform order processing. Authorized orders arrive from front office applications and are passed to the order processing gateway servers for validation and formatting, then routed to the matching engines and order post processing for the corresponding trading markets.

In this use case, the EC2 servers rehost the legacy applications as the first step of the migration to the AWS Cloud; the applications can be refactored and modernized with AWS native services later in the cloud journey.

Architecture diagram demonstrating how to automate Amazon EC2 Auto Scaling with CloudWatch custom metric and CDK.

Figure 1: Architecture Diagram.

Automating Amazon EC2 Auto Scaling with CloudWatch custom metric and CDK

  1. Traders and brokers place trade orders through the front office servers, and the orders reach the order processing servers through a private Network Load Balancer (NLB). An NLB is used because the connectivity is TCP based: it functions at the fourth layer of the Open Systems Interconnection (OSI) model and can handle millions of requests per second.
  2. The NLB in the private subnets is resolved using Amazon Route 53, a highly available and scalable Domain Name System (DNS) service.
  3. A private certificate issued by AWS Certificate Manager (ACM) Private Certificate Authority (PCA) protects the NLB. ACM provisions, manages, and deploys public and private SSL/TLS certificates for use with AWS services and your internal connected resources. The ACM certificate protecting the NLB is tied to a Route 53 private hosted zone because the NLB is internal.
  4. The NLB routes trade orders to the available EC2 servers in an Auto Scaling group in a round-robin fashion. Because the trading applications require long-lived TCP connections, new EC2 servers launch with termination protection enabled so that they are not scaled down during the week. The EC2 servers are hosted in private subnets.
  5. The Auto Scaling group keeps the EC2 servers highly available. Failover to another Availability Zone (AZ) and the availability of servers within an AZ are managed entirely by the Auto Scaling group.
  6. On weekends, a scheduled Amazon CloudWatch Events rule triggers an AWS Lambda function that removes termination protection from the EC2 instances and terminates them, reducing the Auto Scaling group to its minimum size; a sketch of this handler follows the list. At the start of the week, another scheduled rule instructs the Auto Scaling group to scale the number of servers back up to the desired value. AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers.
  7. During the week, servers can only be scaled up by Auto Scaling, driven by an application custom metric captured from the EC2 servers by Amazon CloudWatch. The logic that captures and records the custom metric is baked into the launch template of the EC2 AMI through EC2 user data, programmatically by AWS CDK, so that any newly launched instance in the Auto Scaling group is automatically configured to capture and record the emitted metric (see the CDK sketch after this list).
  8. An Amazon Elastic File System (Amazon EFS) shared mount is used by the EC2 servers for state maintenance. This is necessary because the Auto Scaling group scales servers up and down, and without a shared file system or database, application state cannot be maintained across the fleet of EC2 servers. Amazon EFS is a shared file system that stores data in multiple Availability Zones within an AWS Region for data durability and high availability.
  9. Formatted and validated orders are sent from the EC2 servers to another private NLB for downstream processing.
  10. The NLB sends trade orders to the matching engines for order matching; once an order is executed, it is sent on for post-order processing.
  11. We used AWS CDK for Python to develop the infrastructure code for the solution architecture and stored the code in an AWS CodeCommit repository. AWS CodePipeline builds and deploys the AWS CDK stacks automatically.
  12. The build scripts use an AWS Key Management Service (AWS KMS) key to encrypt the EFS volumes and to create the secrets and parameters stored in AWS Secrets Manager.
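
The following is a minimal sketch of the weekend scale-down Lambda handler from step 6, modeled here with the Auto Scaling scale-in protection APIs. The group name, event shape, and weekend minimum are illustrative assumptions, not the exact code from the repository.

import boto3

autoscaling = boto3.client("autoscaling")

def handler(event, context):
    # Hypothetical event field; the repository may pass the name differently.
    asg_name = event.get("AsgName", "OrderProcessingAsg")
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"][0]
    instance_ids = [i["InstanceId"] for i in group["Instances"]]

    # Remove protection so the group is allowed to terminate instances.
    if instance_ids:
        autoscaling.set_instance_protection(
            AutoScalingGroupName=asg_name,
            InstanceIds=instance_ids,
            ProtectedFromScaleIn=False,
        )

    # Shrink the group to its weekend minimum; the ASG terminates the surplus.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=asg_name, DesiredCapacity=1, HonorCooldown=False
    )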
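
The user data and scaling configuration from steps 4, 6, and 7 can be expressed in AWS CDK for Python roughly as follows. This is a sketch under assumed names: self is the stack, vpc and the AMI ID already exist, and the namespace, metric name, thresholds, and schedules are illustrative.

from aws_cdk import aws_autoscaling as autoscaling
from aws_cdk import aws_cloudwatch as cloudwatch
from aws_cdk import aws_ec2 as ec2

# User data bakes the custom metric publisher into every new instance, so
# instances launched by scale-out events are configured automatically.
user_data = ec2.UserData.for_linux()
user_data.add_commands(
    "mkdir -p /opt/aws",
    # custom_metric.py uploads the application metric to CloudWatch; here it
    # is scheduled with cron (hypothetical path and schedule).
    "echo '* * * * * root python3 /opt/aws/custom_metric.py"
    " >> /opt/aws/cw_custom_metric.log 2>&1' > /etc/cron.d/custom_metric",
)

asg = autoscaling.AutoScalingGroup(
    self, "OrderProcessingAsg",
    vpc=vpc,
    vpc_subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS),
    instance_type=ec2.InstanceType("m5.large"),
    machine_image=ec2.MachineImage.generic_linux({"us-east-1": "ami-0123456789abcdef0"}),
    min_capacity=2,
    max_capacity=10,
    user_data=user_data,
    # Step 4: keep long-lived TCP connections safe from weekday scale-in.
    new_instances_protected_from_scale_in=True,
)

# Step 7: scale out on the application custom metric. The scale-in step gives
# the step-scaling policy its required shape; protected instances are never
# terminated by it.
asg.scale_on_metric(
    "ScaleOnOrders",
    metric=cloudwatch.Metric(namespace="TradingPlatform", metric_name="OrdersInFlight"),
    scaling_steps=[
        autoscaling.ScalingInterval(upper=50, change=-1),
        autoscaling.ScalingInterval(lower=100, change=+1),
    ],
    adjustment_type=autoscaling.AdjustmentType.CHANGE_IN_CAPACITY,
)

# Step 6: the start-of-week scale-up runs on a schedule; the weekend
# scale-down goes through the Lambda handler above so that termination
# protection can be removed first.
asg.scale_on_schedule(
    "WeekdayScaleUp",
    schedule=autoscaling.Schedule.cron(week_day="MON", hour="6", minute="0"),
    desired_capacity=4,
)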

Key benefits

High availability

An Auto Scaling group (ASG) provides high availability for the EC2 instances and automated failover to another AZ within a Region. The NLB distributes traffic to the EC2 instances and performs health checks on the targets; when an instance fails its health checks, the ASG replaces it with a fresh EC2 instance.

Scalability based on custom metric

The ASG automatically scales up the instances based on an application metric captured by CloudWatch. You can configure any metric of your choice: a script or command outputs a number, which is captured and recorded on the EC2 instance and triggers an auto scaling event when the metric rises above its threshold value. Because the EC2 instances are created from a launch template that contains the metric-capturing logic, the logic persists throughout the lifecycle of the Auto Scaling group.
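
The publishing step can be sketched as follows: a Python routine shells out to a command and uploads the result as a high-resolution custom metric with boto3. It assumes the instance role allows cloudwatch:PutMetricData; the namespace, metric name, and port are hypothetical.

import subprocess as sp

import boto3

cloudwatch = boto3.client("cloudwatch")

# Any command that prints a number can drive scaling; here we count
# established TCP sessions on a hypothetical application port.
value = float(sp.getoutput("/usr/sbin/ss -tnH src :8080 | wc -l"))

cloudwatch.put_metric_data(
    Namespace="TradingPlatform",  # hypothetical namespace
    MetricData=[{
        "MetricName": "OrdersInFlight",  # hypothetical metric name
        "Value": value,
        "Unit": "Count",
        "StorageResolution": 1,  # high-resolution (sub-minute) metric
    }],
)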

A common concern for customers using EC2 and CloudWatch metrics is that, even with metrics collected at one-minute (infrastructure) or sub-minute (custom) granularity, auto scaling behind a load balancer (ALB/ELB) takes four to five minutes to complete, which can trigger repeated auto scaling events before a scaling event finishes.

Hence, customers usually tune metric thresholds to balance end-user experience against efficiency. To make auto scaling work at high (sub-minute) granularity with an application custom metric, our solution ensures that in-flight auto scaling activities are complete before auto scaling can be triggered again.

If the metric threshold is breached, the Python routine (bundled in the EC2 launch template) stops uploading the metric until the auto scaling activities are complete, and resumes uploading once the metric falls back below the threshold. This means repeated threshold breaches cannot disrupt auto scaling activities that are already in progress.
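
A sketch of that suppression check, assuming the instance knows its Auto Scaling group name (the function and names are illustrative):

import boto3

autoscaling = boto3.client("autoscaling")

def scaling_in_progress(asg_name: str) -> bool:
    """Return True while any recent scaling activity has not finished."""
    activities = autoscaling.describe_scaling_activities(
        AutoScalingGroupName=asg_name, MaxRecords=10
    )["Activities"]
    terminal = {"Successful", "Failed", "Cancelled"}
    return any(a["StatusCode"] not in terminal for a in activities)

# In the publishing loop: skip the upload while scaling is in progress, and
# resume once activities finish and the value falls back below the threshold.
# if not scaling_in_progress("OrderProcessingAsg"):
#     publish_metric(value)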

Security

All services are deployed in private subnets, which restrict outside access, and the EC2 instances are configured with security groups to further restrict ingress. You can use AWS PrivateLink to configure VPC endpoints for AWS services such as Amazon CloudWatch, AWS CloudTrail, and AWS KMS.

AWS PrivateLink ensures that traffic to AWS service endpoints is not exposed to the internet. You can also deploy Amazon GuardDuty to detect threats in the deployment account. Secrets are accessed using AWS Secrets Manager, and ACM protects the private NLB with a certificate.
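
A minimal CDK (Python) sketch of such endpoints, assuming an existing vpc; the service list is illustrative:

from aws_cdk import aws_ec2 as ec2

# Interface endpoints keep calls to these services on the AWS network.
endpoints = {
    "CloudWatchEndpoint": ec2.InterfaceVpcEndpointAwsService.CLOUDWATCH_MONITORING,
    "CloudTrailEndpoint": ec2.InterfaceVpcEndpointAwsService.CLOUDTRAIL,
    "KmsEndpoint": ec2.InterfaceVpcEndpointAwsService.KMS,
}
for endpoint_id, service in endpoints.items():
    vpc.add_interface_endpoint(endpoint_id, service=service)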

Vulnerability scan

Keep EC2 AMIs secure by implementing AMI patch management, and use only approved third-party libraries and software in the AMI. You can use Amazon Inspector for EC2 vulnerability scanning; address vulnerability findings before deploying the AMI.

Persistent shared storage

Amazon Elastic File System (Amazon EFS) provides persistent storage for shared state maintenance across the EC2 instances. Data on EFS is encrypted at rest using AWS KMS.
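
In CDK for Python, the encrypted shared file system can be sketched as follows, assuming an existing vpc and a customer-managed kms_key; names are illustrative:

from aws_cdk import aws_efs as efs

shared_state = efs.FileSystem(
    self, "SharedStateFs",
    vpc=vpc,
    encrypted=True,                  # encrypt data at rest
    kms_key=kms_key,                 # customer-managed AWS KMS key
    enable_automatic_backups=True,   # automatic backups via AWS Backup
)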

Logging and monitoring

Logging and monitoring infrastructure is provided by AWS CloudTrail and Amazon CloudWatch. You can also use AWS Config for monitoring and alerting on security events. All logs can be aggregated into a central audit account in an AWS Control Tower landing zone environment.

Backup

EFS file systems are backed up automatically through EFS automatic backups, which use AWS Backup. EC2 instances can be backed up using AWS Backup as well.
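
A minimal CDK (Python) sketch of an AWS Backup plan for the EC2 instances, assuming they carry a hypothetical backup=true tag; names and retention are illustrative:

from aws_cdk import aws_backup as backup

# A managed plan with daily backups and 35-day retention.
plan = backup.BackupPlan.daily35_day_retention(self, "Ec2BackupPlan")
plan.add_selection(
    "Ec2Selection",
    resources=[backup.BackupResource.from_tag("backup", "true")],
)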

Automation

The infrastructure and automation code provided in this blog greatly reduces the time to build, deploy and maintain the solution infrastructure with a CI/CD pipeline.

Deployment

The source code for deploying the solution architecture is in the aws-samples GitHub repository.

Prerequisites

To deploy the AWS CDK stacks from the source code, you need to review and perform the prerequisites described in the accompanying GitHub repository readme to make sure you have the necessary resources to proceed.

  1. An Amazon Route 53 private hosted zone. For detailed instructions, see Creating a private hosted zone.
  2. An AWS Certificate Manager private certificate. For detailed instructions, see Creating a private certificate.
  3. An Amazon Linux AMI with Python 3 installed. For detailed instructions, see Creating an AMI.
  4. A script or command that returns a numeric value for the application custom metric. This script or command must be included in the file custom_metric.py, for example:

metric_data = sp.getoutput("/usr/sbin/ss -tnH src :##PORT## | wc -l")

Launch the solution

1. Clone the repository and check out the main branch.

git clone -b main <URL of the aws-samples repository>

2. Create a CodeCommit repository to hold the source code for installation with the following command:

aws codecommit create-repository --repository-name <name of repository> --profile <AWS CLI profile of AWS account>

3. Pass the required parameters in parameters.json following Step 4 in the Deployment section of the readme file.

4. Install the package requirements for the AWS CDK application:

python3 -m pip install -r requirements.txt

5. Before committing the code into the CodeCommit repository, synthesize the AWS CDK stacks. This ensures all the necessary context values are populated into the cdk.context.json file and prevents dummy values from being used.

cdk synth --profile <AWS CLI profile of the account>

6. Commit the changes into the CodeCommit repository you created. Follow Step 6 in the Deployment section of the readme file if you need help with the Git commands.

7. Deploy the AWS CDK stacks to install the CDK application containing the Infrastructure-as-Code (IaC) for the described architecture using AWS CodePipeline. This step takes around 30 minutes.

cdk deploy --profile <AWS CLI profile of the account>

8. Navigate to the CodePipeline console (the link takes you to the us-east-1 Region). Monitor the pipeline and confirm that the services are built successfully.

Figure 2: Pipeline Completion.

9. To verify that custom metrics are being uploaded from the EC2 servers to Amazon CloudWatch, so that Auto Scaling activities trigger when thresholds are breached, log in to any EC2 instance using Session Manager.

Figure 3: EC2 instance login with SSM.

On the EC2 instance, run:

cd /opt/aws
tail -f cw_custom_metric.log

You should see an output like the following:

Figure 4: Automated Custom Metric Upload from EC2 Instance to CloudWatch.

10. Finally, log in to the Amazon CloudWatch console (the link takes you to the us-east-1 Region) to see the custom application metric status:

Figure 5. Custom Metric in Amazon CloudWatch.

Cleaning up

Please follow the readme in the repository to delete the stacks created.

Next Steps

As a next step, you can refactor this solution architecture using AWS native services such as AWS Fargate, Amazon ECS/Amazon EKS, Amazon API Gateway, AWS Lambda, Amazon EventBridge, Amazon SQS and Amazon MemoryDB to create a serverless solution.

Conclusion

In this post, we discussed and deployed an automated solution architecture that scales Amazon EC2 based workloads using Amazon CloudWatch custom metrics emitted by applications, rather than traditional infrastructure metrics. The architecture gives businesses a secure, scalable, and highly available environment when migrating legacy applications, such as trading platforms, to the AWS Cloud.

About the authors

Chayan Panda

Chayan Panda is a Senior Cloud Infrastructure Architect with Amazon Web Services in the UK with expertise in the technology advisory and enterprise architecture domains. He enables AWS enterprise customers to build secure, scalable and reliable digital transformation solutions. In his spare time, he enjoys a short run, music, a book or travel with his family.

Amit Gaur

Amit Gaur is a Cloud Infrastructure Architect at AWS. He focuses on helping customers build and develop networking architectures for highly scalable and resilient AWS environments. In his spare time, he likes to spend time with his family and enjoys outdoor activities.

Zarif Samar

Zarif Samar is a DevOps consultant at AWS who focuses on designing and implementing automated solutions for managing and deploying applications on AWS. He is passionate about helping customers on their journey to the cloud by providing technical guidance and best practices.

Olu Odularu

Olu Odularu is a Senior Cloud Infrastructure Architect at AWS. He thrives on learning new technologies and solving complex customer challenges. He enjoys interacting with customers and strives to help accelerate their cloud adoption journey by offering technical guidance on how they architect and build resilient, scalable systems in the cloud. Olu is an avid MotoGP fan and enjoys quality time with his family.