AWS Compute Blog

Amazon ECS Service Auto Scaling Enables Rent-A-Center SAP Hybris Solution

This is a guest post from Troy Washburn, Sr. DevOps Manager @ Rent-A-Center, Inc., and Ashay Chitnis, Flux7 architect.

—–

Rent-A-Center in their own words: Rent-A-Center owns and operates more than 3,000 rent-to-own retail stores for name-brand furniture, electronics, appliances and computers across the US, Canada, and Puerto Rico.

Rent-A-Center (RAC) wanted to roll out an ecommerce platform that would support the entire online shopping workflow using SAP’s Hybris platform. The goal was to implement a cloud-based solution with a cluster of Hybris servers which would cater to online web-based demand.

The challenge: to run the Hybris clusters in a microservices architecture. A microservices approach has several advantages including the ability for each service to scale up and down to meet fluctuating changes in demand independently. RAC also wanted to use Docker containers to package the application in a format that is easily portable and immutable. There were four types of containers necessary for the architecture. Each corresponded to a particular service:

1. Apache: Received requests from the external Elastic Load Balancing load balancer. Apache was used to set certain rewrite and proxy http rules.
2. Hybris: An external Tomcat was the frontend for the Hybris platform.
3. Solr Master: A product indexing service for quick lookup.
4. Solr Slave: Replication of master cache to directly serve product searches from Hybris.

To deploy the containers in a microservices architecture, RAC and AWS consultants at Flux7 started by launching Amazon ECS resources with AWS CloudFormation templates. Running containers on ECS requires the use of three primary resources: clusters, services, and task definitions. Each container refers to its task definition for the container properties, such as CPU and memory. And, each of the above services stored its container images in Amazon ECR repositories.

This post describes the architecture that we created and implemented.

Auto Scaling

At first glance, scaling on ECS can seem confusing. But the Flux7 philosophy is that complex systems only work when they are a combination of well-designed simple systems that break the problem down into smaller pieces. The key insight that helped us design our solution was understanding that there are two very different scaling operations happening. The first is the scaling up of individual tasks in each service and the second is the scaling up of the cluster of Amazon EC2 instances.

During implementation, Service Auto Scaling was released by the AWS team and so we researched how to implement task scaling into the existing solution. As we were implementing the solution through AWS CloudFormation, task scaling needed to be done the same way. However, the new scaling feature was not available for implementation through CloudFormation and so the natural course was to implement it using AWS Lambda–backed custom resources.

A corresponding Lambda function is implemented in Node.js 4.3, while automatic scaling happens by monitoring the CPUUtilization Amazon CloudWatch metric. The ECS policies below are registered with CloudWatch alarms that are triggered when specific thresholds are crossed. Similarly, by using the MemoryUtilization CloudWatch metric, ECS scaling can be made to scale in and out as well.

The Lambda function and CloudFormation custom resource JSON are available in the Flux7 GitHub repository: https://github.com/Flux7Labs/blog-code-samples/tree/master/2016-10-ecs-enables-rac-sap-hybris

Scaling ECS services and EC2 instances automatically

The key to understanding cluster scaling is to start by understanding the problem. We are no longer running a homogeneous workload in a simple environment. We have a cluster hosting a heterogeneous workload with different requirements and different demands on the system.

This clicked for us after we phrased the problem as, “Make sure the cluster has enough capacity to launch ‘x’ more instances of a task.” This led us to realize that we were no longer looking at an overall average resource utilization problem, but rather a discrete bin packing problem.

The problem is inherently more complex. (Anyone remember from algorithms class how the discrete Knapsack problem is NP-hard, but the continuous knapsack problem can easily be solved in polynomial time? Same thing.) So we have to check on each individual instance if a particular task can be scheduled on it, and if for any task we don’t cross the required capacity threshold, then we need to allocate more instance capacity.

To ensure that ECS scaling always has enough resources to scale out and has just enough resources after scaling in, it was necessary that the Auto Scaling group scales according to three criteria:

1. ECS task count in relation to the host EC2 instance count in a cluster
2. Memory reservation
3. CPU reservation

We implemented the first criteria for the Auto Scaling group. Instead of using the default scaling abilities, we set group scaling in and out using Lambda functions that were triggered periodically by a combination of the AWS::Lambda::Permission and an AWS::Events::Rule resources, as we wanted specific criteria for scaling.

The Lambda function is available in the Flux7 GitHub repository: https://github.com/Flux7Labs/blog-code-samples/tree/master/2016-10-ecs-enables-rac-sap-hybris

Future versions of this piece of code will incorporate the other two criteria along with the ability to use CloudWatch alarms to trigger scaling.

Conclusion

Using advanced ECS features like Service Auto Scaling in conjunction with Lambda to meet RAC’s business requirements, RAC and Flux7 were able to Dockerize SAP Hybris in production for the first time ever.

Further, ECS and CloudFormation give users the ability to implement robust solutions while still providing the ability to roll back in case of failures. With ECS as a backbone technology, RAC has been able to deploy a Hybris setup with automatic scaling, self-healing, one-click deployment, CI/CD, and PCI compliance consistent with the company’s latest technology guidelines and meeting the requirements of their newly-formed culture of DevOps and extreme agility.

If you have any questions or suggestions, please comment below.