Infrastructure & Automation

AWS Quick Start best practices for broad customer impact

This blog post is a case study in which we go behind the scenes of developing and maintaining one of the AWS Quick Starts. We share some of the challenges faced, ways that we addressed them, results that we’ve seen, and best practices that can benefit our customers. If you’re working on an automated deployment, consider these best practices for broad customer impact.

Background

First published in September 2017, the OpenShift on AWS Quick Start got its start from collaboration between the AWS Quick Start team and Red Hat. OpenShift on AWS was, in fact, one of the first Quick Starts that wasn’t built internally from the get-go. It was our team’s first foray into creating a Quick Start with a systems integrator (SI) partner.

The OpenShift on AWS Quick Start has active involvement from the community. It has evolved to become one of the most used Quick Starts in our catalog, with some of the highest numbers of launches and deployment guide downloads, ranking in the top 10 for 2018, for example. The landing page was also in the top 10 in page views for 2018.

This Quick Start also has certain partner/customer wins associated with it, but that’s a subject for a future blog post. The case study in this blog post, meanwhile, focuses on the success of the OpenShift on AWS Quick Start from a development perspective, despite roadblocks. What have some of the challenges been from a development standpoint, and how did the Quick Start team, Support team, and others address them? Are there best practices we can apply to Quick Starts in general?

OpenShift Container Platform on AWS

First, let’s look more closely at the technology that’s core to this Quick Start. Red Hat OpenShift is a popular distribution of Kubernetes, an open-source system for deploying, scaling, and managing containerized applications. Red Hat OpenShift Container Platform enables application development and IT operations teams to accelerate application delivery. By extension, the OpenShift on AWS Quick Start provides a container application platform with Kubernetes orchestration on the AWS Cloud. The platform is based on Docker-formatted Linux containers, Kubernetes orchestration (a project that originated at Google), and Red Hat Enterprise Linux (RHEL).

The Quick Start includes AWS CloudFormation templates that build infrastructure using AWS best practices. The templates then pass that environment to Ansible playbooks to build out the OpenShift environment.
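The hand-off between the CloudFormation layer and the Ansible layer can be sketched as follows. This is a minimal illustration of the pattern, not the Quick Start’s actual code; the output keys, values, and file name are hypothetical.

```python
import json

def outputs_to_inventory_vars(stack_outputs):
    """Flatten CloudFormation stack outputs (as returned by
    describe_stacks) into a dict of Ansible variables."""
    return {o["OutputKey"]: o["OutputValue"] for o in stack_outputs}

def write_extra_vars(stack_outputs, path="extra_vars.json"):
    """Write the variables to a JSON file that a playbook run can
    consume with `ansible-playbook -e @extra_vars.json`."""
    with open(path, "w") as f:
        json.dump(outputs_to_inventory_vars(stack_outputs), f, indent=2)

# Example outputs, in the shape CloudFormation reports them:
outputs = [
    {"OutputKey": "MasterElbDns", "OutputValue": "internal-abc.elb.amazonaws.com"},
    {"OutputKey": "ClusterName", "OutputValue": "openshift-demo"},
]
print(outputs_to_inventory_vars(outputs)["ClusterName"])  # openshift-demo
```

The key point is the separation of concerns: CloudFormation owns the infrastructure, and the configuration layer receives only the resulting endpoints and identifiers as variables.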

Initial challenges faced

Although the OpenShift on AWS Quick Start has become one of our most widely used Quick Starts, there were a few challenges prior to its first launch.

Limited development time

From a timeline perspective, the plan was to finish developing the OpenShift Quick Start and to unveil it to customers in time for the Red Hat Summit. To align with the timing of the conference, the Quick Start team launched the Quick Start before we might have otherwise.

Limitations in features and functionality

Partially due to these time constraints, some features and functionality weren’t available in the first release of this Quick Start. For example:

  • Certificates for the web console (for development and operations) were self-signed rather than issued by a trusted certificate authority (CA).
  • The cluster metrics service—for collecting application and performance metrics—wasn’t enabled.
  • There were no persistent storage options configured for applications that run in the cluster.
  • Accessing the control plane with the OpenShift Container Platform command line interface (CLI) tooling—for creating applications and managing OpenShift projects—was difficult and required additional manual steps.

If an Amazon Elastic Compute Cloud (Amazon EC2) instance failed, some degree of manual intervention was needed to restore it. Because the deployment spans multiple Availability Zones, there was no downtime; however, some capacity was lost until an administrator made another node available.

Not enough resources for maintenance

After the initial launch, it was at times hard to obtain resources to maintain the Quick Start. After about six months, the Quick Start no longer worked as expected. At that point, the Quick Start team decided to invest in rebuilding it, partly to reduce future maintenance needs. Our aim was also to add the missing functionality and increase the number of deployment options, so that we could meet a broader set of customer use cases.

Pain points that were addressed

To rework the Quick Start for broader customer use, the team allocated an AWS engineer, Jay McConnell. This revamp, in tandem with frequent maintenance, has included:

  • Making the Quick Start more flexible and pluggable. This included exposing previously hardcoded options to support different use cases, and adding functionality that enables customers to provide custom scripts and Amazon Machine Images (AMIs) for use in the cluster. For example, providing a custom domain name wasn’t possible at first, because AWS CloudFormation didn’t provide a fully automated mechanism to verify domain ownership; Jay added AWS Lambda custom resources to enable this functionality.
  • Addressing the certificate issue—where certificates were initially self-signed—by integrating AWS Certificate Manager.
  • Adding Hawkular for cluster metrics, which provides scalable storage of metrics data, and enables analysis of performance data.
  • Adding Amazon Elastic Block Store (Amazon EBS) for persistent storage, which enables customers to easily implement stateful applications.
  • Adding certificates to Elastic Load Balancing, which enables secure access to the OpenShift API, and provides Secure Sockets Layer (SSL) offloading for applications that run in the cluster.
  • Adding support for dynamic creation of public and internal load balancers.
  • Adding GlusterFS support, an open-source storage solution that provides read-write-many (RWX) persistent storage to applications that are running in the cluster.
  • Enabling customers to use Amazon Route 53 Domain Name System (DNS) names, or to provide details of an externally hosted domain.
  • Adding AWS Service Broker. This enables customers to manage AWS services directly from OpenShift, without needing to provide individual users with direct access to the AWS account.
  • Improving error messages and troubleshooting guidance for failed Quick Start launches. Jay made it a point to provide meaningful, rather than generic, messages whenever a customer’s launch failed. Also, the troubleshooting guidance initially didn’t specify which logs to examine, so he added those details.
  • Backing up the OpenShift registry by using Amazon Simple Storage Service (Amazon S3).
  • Improving behavior of scaling for control plane (master) and database (etcd) nodes. OpenShift has no built-in mechanism for exploiting the elasticity provided by AWS Auto Scaling. Andrew Glenn (from the Quick Start team) built Python scripts to implement support. As of OpenShift 3.11, OpenShift now supports automatic scaling for worker nodes, but Andrew’s library is still the only implementation that handles automatic scaling for the master and etcd nodes. Without his implementation, failure of the critical master or etcd instances would at best require manual intervention to set up a replacement host, or at worst create a severe outage of the platform, which would result in data loss. This feature is key to providing the required high availability (HA) functionality in the Quick Start.
  • Rewriting many of the components to ease ongoing maintenance.
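The Lambda custom resources mentioned above all follow the same contract: CloudFormation invokes the function with a Create, Update, or Delete event, and the function must signal success or failure back to a pre-signed S3 URL. A minimal sketch of that pattern is shown below; the resource-specific work and the `Validated` data key are hypothetical placeholders, not the Quick Start’s code.

```python
import json
import urllib.request

def build_response(event, status, data=None, reason=""):
    """Build the JSON body CloudFormation expects back from a
    custom resource invocation."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason,
        "PhysicalResourceId": event.get("PhysicalResourceId", "custom-resource"),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }

def handler(event, context):
    """Lambda entry point: do the work (e.g., create a DNS validation
    record), then signal CloudFormation via the pre-signed URL."""
    try:
        # ... resource-specific work would go here ...
        body = build_response(event, "SUCCESS", data={"Validated": "true"})
    except Exception as exc:
        body = build_response(event, "FAILED", reason=str(exc))
    req = urllib.request.Request(
        event["ResponseURL"],
        data=json.dumps(body).encode(),
        headers={"Content-Type": ""},
        method="PUT",
    )
    urllib.request.urlopen(req)
```

Because the function always reports back, even on failure, a broken custom resource surfaces as a clear stack error instead of a stack that hangs until timeout.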

The Quick Start team also reached out to the Amazon EC2 and Support teams. Quick Start team member Andrew Glenn (who was on the Support team at that time) and Tony Vattathil (on the Quick Start team) were instrumental, for example, in writing a custom tool that’s triggered when Auto Scaling detects a change in the instances in a group. After it’s triggered, the tool automatically configures the OpenShift cluster to add or remove hosts to reflect the change, using Ansible behind the scenes. The custom tool is used when a node dies or a customer scales in or out. Automation handles scaling and high availability, without the manual intervention needed before.
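The decision logic behind such a tool can be sketched as follows: map the Auto Scaling notification to a cluster action, then pick the playbook that performs it. This is an illustrative sketch only; the playbook names are hypothetical, and the real tool does considerably more.

```python
def classify_scaling_event(message):
    """Map an EC2 Auto Scaling SNS notification to the cluster
    action the automation should take: add or remove a host."""
    actions = {
        "autoscaling:EC2_INSTANCE_LAUNCH": "add",
        "autoscaling:EC2_INSTANCE_TERMINATE": "remove",
    }
    action = actions.get(message.get("Event"))
    if action is None:
        return None  # ignore test notifications and other events
    return {"action": action, "instance_id": message["EC2InstanceId"]}

def playbook_for(change):
    """Choose the (hypothetical) Ansible playbook that applies the change."""
    return "scaleup.yml" if change["action"] == "add" else "remove-host.yml"

msg = {"Event": "autoscaling:EC2_INSTANCE_LAUNCH", "EC2InstanceId": "i-0abc123"}
change = classify_scaling_event(msg)
print(playbook_for(change))  # scaleup.yml
```

The same two code paths cover both a node dying and a deliberate scale-in or scale-out, which is what removes the manual intervention described earlier.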

Community contribution

Today, there is frequent engagement in the form of community activity, pull requests, and forks in GitHub. For example, customers have contributed patches to add support for custom AMIs, configurable bootstrapping hooks, proxy support, and the de-registration of hosts with Red Hat licensing servers when they are decommissioned or scaled down. Community contribution, in tandem with the updates mentioned previously, helps keep the Quick Start relevant and current, with ongoing enhancements.

The effect we’ve seen on launches and PDF downloads

The initial launch was in September 2017. Jay began fixing the Quick Start on January 16, 2018, and he finished most of the major changes by March 4, 2018. A major update to the deployment guide was also released at that time. What are some results we’ve seen?

Launches

The increased number of stack launches in March, April, and May of 2018 compared to the month of the initial launch appears to correlate with the updates that were published at the beginning of March.

Launches went up by 309 percent in the initial months that followed the major update, compared to the initial launch in September 2017.

This is a possible indication that the bug fixes and feature additions affected adoption during the initial months after this effort. In part, this could be because potential adopters were tracking issues in the GitHub repo.

PDF downloads

PDF downloads increased by 31 percent a month after the updates, and by as much as 60 percent later in the year.

Additionally, regular maintenance since the major update in March 2018 appears to be paying off in the form of sustained high launch numbers, download numbers, and marketing page traffic. Throughout 2018, monthly page views for the landing page increased by 150 percent.

Best practices, based on lessons learned

From an engineering standpoint, there has been significant effort involved in adding these features and in regularly maintaining the Quick Start. It’s worth it, however, in terms of benefiting our customers. This effort has also contributed toward making this Quick Start one of our most used, in terms of launches and downloads, and community contributions help to provide ongoing enhancements.

From this experience, we gleaned some best practices that you can use for automated deployments in general:

  • Allow ample development time.
  • Ensure that you have resources available to regularly maintain the Quick Start.
  • Expand your reach to the appropriate service teams, and explore ways to add functionality via AWS services.
  • Test automated deployments against real-world use cases to be sure that required features are available and operational.
  • Lock down upstream-dependency versions. Although this increases the maintenance required to manually update versions, it results in a lower likelihood that version changes will cause unexpected failures.
  • Encourage community involvement by being active on GitHub issues and pull requests.

Resources for Quick Start development

Although this blog post offers a case study on a specific Quick Start, its general principles can apply to other Quick Starts and automated deployments. We hope that sharing some of our experiences and best practices can be useful in your automated deployments.

For information about ways to participate in Quick Start development, including how to build a Quick Start, see the Quick Start Contributor’s Guide. Also, check out the AWS Quick Start Workshop for tools, tips, and tricks for writing CloudFormation templates that are easy to read, maintain, and test, and that help expedite high-quality development.