Flight Controller by Contino – A Solution built on AWS Control Tower
Today, AWS customers are adopting the cloud rapidly and at massive scale. To support this demand, customers must build a strong foundation based on AWS Well-Architected best practices. A well-architected landing zone is a key construct that lets you vend accounts, provision access, set up security guardrails, and build CI/CD pipelines. However, at scale, implicit expectations begin to develop between the business units who need features fast, the product teams who produce feature requests, and the platform team who has to deliver these requests.
As an example, development/product teams expect accounts to be provisioned immediately. But platform teams may need to consider these requests in conjunction with the rest of their product backlog, which may have other important features that may be required to realize the organization’s long-term IT strategy. This could mean that the account is vended a week later or longer. Often, business requirements for speed compete with the need to build rich and enhanced features. In turn, this leads to delays, reprioritization without proper consideration, and – in the worst case – delivery defects. Ultimately, all of this results in IT slowing down the business rather than enabling quicker time to market, which is the key outcome for everyone.
This post explains how Flight Controller for Landing Zones, a joint solution released by Contino and AWS, uses Lean and Site Reliability Engineering (SRE) principles along with AWS Control Tower and Customizations for Control Tower to measure, track, and communicate the business impact of your landing zone continuously and automatically. You will learn how to convert implicit expectations into concrete outcomes, continuously measure and improve these outcomes, facilitate decision making based on data instead of opinions, and ultimately enable the business to innovate and transform quickly and safely.
Flight Controller leverages SRE, which is a metrics-driven approach to building and maintaining systems. SRE uses Service Level Indicators (SLIs) to identify which metrics are important to building and maintaining quality, Service Level Objectives (SLOs) to define targets for the SLIs, and Service Level Agreements (SLAs) to define the consequences of SLOs being achieved or breached.
An SRE approach to building systems lets quality be measured and tracked over time. In turn, this lets teams focus their efforts on the things that really improve overall quality. It lets teams have a shared understanding of their performance, track this over time, and drive shared goals.
The Key Performance Indicator (KPI) definition in Flight Controller is underpinned by SLOs. SLOs help convert implicit expectations into explicit requirements using the data ingested from AWS Control Tower and workloads deployed in the AWS accounts that it manages.
Flight Controller takes a data-driven approach to landing zone health. It helps product teams make informed decisions about where to focus efforts, helps business leaders understand how their cloud adoption is progressing, and makes sure that teams are enabled to deliver value.
AWS customers can now collect and view real-time AWS Control Tower landing zone metrics at near-zero total cost of ownership (TCO).
Using Flight Controller, AWS customers can measure KPIs, such as:
• Account vending SLO
• User creation and access SLOs
• Patching status across the entire estate
• Compliance framework adoption
• AMI usage
• Infrastructure as Code adherence
• The four State of DevOps Report metrics:
o Deployment Frequency
o Lead Time for Changes
o Change Failure Rate
o Mean Time to Recovery
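As a rough illustration of how the four State of DevOps Report metrics in the list above could be derived from deployment records, here is a minimal Python sketch. The `Deployment` record and its field names are assumptions made for this example, not the schema that Flight Controller uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical deployment record; field names are illustrative only."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when the change reached production
    failed: bool = False                    # whether the change caused a failure
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def four_key_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Compute the four metrics over a reporting period of period_days."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [(d.deployed_at - d.committed_at).total_seconds() for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [
        (d.restored_at - d.deployed_at).total_seconds()
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "lead_time_hours": mean(lead_times) / 3600,
        "change_failure_rate": len(failures) / len(deployments),
        # None means no recoveries were observed in the period.
        "mean_time_to_recovery_hours": mean(recoveries) / 3600 if recoveries else None,
    }
```

In practice, the timestamps would come from the events that the solution collects rather than from hand-built records.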
Flight Controller is built on top of, and is compliant with, AWS Control Tower. It leverages AWS best practices and Customizations for Control Tower. This lets you automate the creation of resources and service control policies (SCPs), as well as vend accounts using Account Factory, so that you can track the value of your AWS landing zone infrastructure.
Data collection, processing, and visualization
Flight Controller for Landing Zones uses lifecycle and other cloud-native events generated by AWS Control Tower. Furthermore, a combination of AWS Lambda and Amazon EventBridge is used to generate custom events that record the start or completion of key activities. AWS customers can select which events are important to measure. The diagram in the following figure shows how a custom event bus receives events from multiple AWS accounts within the landing zone, and then reports the events to a central account where the Flight Controller for Landing Zones solution resides.
In this architecture, the solution ingests events from the workload accounts, AWS Control Tower management account, and any third-party sources (for example, GitHub, Zendesk, ServiceNow), which are pre-integrated with EventBridge or use the EventBridge SDK to raise events. These events are filtered to provide relevant information to calculate the required SLO.
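To make the "raise events through the SDK" path more concrete, the following Python sketch builds an EventBridge entry and sends it with the `put_events` call of a boto3 `events` client. The source name, the event bus name, and the fields inside `detail` are assumptions for illustration. The client is passed in so that the function can be exercised without AWS credentials.

```python
import json
from datetime import datetime, timezone


def build_entry(detail_type: str, detail: dict, bus_name: str,
                source: str = "custom.flightcontroller") -> dict:
    """Build one EventBridge entry. Source and bus name are hypothetical."""
    return {
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),      # Detail must be a JSON string
        "EventBusName": bus_name,          # bus name or ARN of the target bus
        "Time": datetime.now(timezone.utc),
    }


def publish(events_client, entry: dict) -> dict:
    """Send the entry using a boto3 EventBridge client, e.g. boto3.client("events")."""
    response = events_client.put_events(Entries=[entry])
    # put_events reports per-entry failures instead of raising an exception.
    if response.get("FailedEntryCount", 0) > 0:
        raise RuntimeError(f"event was not accepted: {response.get('Entries')}")
    return response
```

A workload account could, for example, call `publish(boto3.client("events"), build_entry("Account Requested", {"request_id": "req-1"}, "flight-controller-bus"))` at the start of an activity, where the bus name and request identifier are placeholders.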
Amazon DynamoDB provides a low-cost, low-latency database to store and catalogue historical events for analysis.
Amazon Timestream is a purpose-built managed database for analyzing time series data. In particular, it helps measure elapsed times between events, as well as helps with the easy calculation of metrics where time is the main dimension.
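As an illustration of what an elapsed-time measurement might look like when written to Timestream, the sketch below builds a record in the shape accepted by the `write_records` call of the boto3 `timestream-write` client. The dimension name, measure name, and the choice of seconds as the unit are assumptions, not the schema that Flight Controller ships with.

```python
from datetime import datetime, timezone


def elapsed_record(event_type: str, started_at: datetime, finished_at: datetime) -> dict:
    """Build a Timestream record for the time elapsed between two events."""
    elapsed_seconds = (finished_at - started_at).total_seconds()
    return {
        # Hypothetical dimension used to group measurements per activity.
        "Dimensions": [{"Name": "event_type", "Value": event_type}],
        "MeasureName": "elapsed_seconds",
        "MeasureValue": str(elapsed_seconds),   # values are passed as strings
        "MeasureValueType": "DOUBLE",
        # Timestamp of the measurement, as epoch milliseconds.
        "Time": str(int(finished_at.timestamp() * 1000)),
        "TimeUnit": "MILLISECONDS",
    }
```

The record could then be written with `client.write_records(DatabaseName=..., TableName=..., Records=[record])`, where the database and table names depend on your deployment.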
The visualization layer utilizes Amazon Managed Grafana, which is a fully managed service that provides an easy-to-configure framework to build dashboards and reports. It’s pre-integrated with Amazon Timestream, which makes access to underlying data secure and easy to implement.
The screenshot in the following figure provides an example of SLOs that can be monitored using Flight Controller via Amazon Managed Grafana.
Figure 2 Flight Controller dashboard powered by Amazon Managed Grafana.
For account vending, it shows the expected success percentage for the SLO identified, the deployment cycle time, the success percentage, and the total number of accounts vended within the reporting period.
Let’s look at a sample customer use case where we track the SLO for account vending. The platform team and the product teams agreed to fulfill account vending requests within one business day 80% of the time over a 30-day window. The use case has a defined and measurable outcome: an account is vended within one business day. Furthermore, the 80% target sets an explicit error budget: up to 20% of requests in the window can miss the one-business-day goal, which lets the platform team account for defects or outages associated with account vending.
In this case, accounts vend successfully 66% of the time, which is below the agreed 80% target. Therefore, the SLO isn’t being met. Product or platform teams can dig deeper into the root cause of these SLO breaches by evaluating the recorded time taken between all of the events from start to finish. Then, they can make adjustments to the process by automating steps, removing redundant steps, or making a decision to augment the platform team. Product or platform teams can also observe when activities that were being executed serially could be parallelized.
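The SLO check in this use case comes down to simple arithmetic. The following Python sketch shows one way to compute it, assuming that each request's duration is expressed in working hours and that one business day is eight working hours. Both are assumptions for this example.

```python
from typing import Sequence


def evaluate_slo(durations_hours: Sequence[float], threshold_hours: float,
                 target: float) -> dict:
    """Compare the share of requests completed within the threshold to the target."""
    if not durations_hours:
        raise ValueError("no requests in the reporting window")
    good = sum(1 for d in durations_hours if d <= threshold_hours)
    attainment = good / len(durations_hours)
    error_budget = 1.0 - target      # share of requests allowed to miss
    budget_spent = 1.0 - attainment  # share of requests that did miss
    return {
        "attainment": attainment,
        "met": attainment >= target,
        "error_budget": error_budget,
        "budget_remaining": error_budget - budget_spent,  # negative when overspent
    }


# Three requests in the window, two of them vended within one business day.
report = evaluate_slo([3, 6, 20], threshold_hours=8, target=0.80)
print(f"attainment={report['attainment']:.0%}, met={report['met']}")
```

With two out of three requests inside the threshold, attainment is roughly two thirds, so the 80% target is missed and the error budget is overspent.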
Let’s look at how data is ingested and processed using Flight Controller’s architecture to facilitate the preceding SLO.
Figure 3 Account vending SLO – Deep dive architecture
The preceding figure shows the architecture used to capture this data. It shows two processes. The first starts when an engineer submits a pull request to create a new account through AWS Control Tower. At this point, an ‘Account Requested’ event is raised. It targets the event bus in the Flight Controller management account, which stores the raw event in the DynamoDB event store. The second process occurs when the pull request is merged, signifying that the account should be created. At this point, AWS CodePipeline helps create a new instance of an AWS Service Catalog Provisioned Product, which is what sits behind the AWS Control Tower account vending process. Once this completes, a second event is raised and sent to the Flight Controller management account. Then, the Lambda Event Processor function retrieves the original event from the DynamoDB event store, calculates the time elapsed between the two events, and creates a record in Timestream. Timestream integrates directly with Amazon Managed Grafana, which lets the data be queried to determine whether the SLO is being met.
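The correlation step performed by the event processor can be sketched in a few lines of Python. This is a simplified illustration, not the solution's actual Lambda code: a plain dictionary stands in for the DynamoDB event store, and the `request_id` field and the ‘Account Vended’ event name are hypothetical.

```python
from datetime import datetime
from typing import Optional


def _parse_time(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp such as the EventBridge 'time' field."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def handle_event(event: dict, store: dict) -> Optional[float]:
    """Store start events; on completion, return the elapsed seconds."""
    request_id = event["detail"]["request_id"]  # hypothetical correlation key
    if event["detail-type"] == "Account Requested":
        store[request_id] = event               # stand-in for a DynamoDB put
        return None
    if event["detail-type"] == "Account Vended":  # hypothetical completion event
        original = store.get(request_id)          # stand-in for a DynamoDB get
        if original is None:
            return None                           # start event was never recorded
        elapsed = _parse_time(event["time"]) - _parse_time(original["time"])
        return elapsed.total_seconds()
    return None


store = {}
handle_event({"detail-type": "Account Requested", "time": "2024-01-01T09:00:00Z",
              "detail": {"request_id": "req-1"}}, store)
elapsed = handle_event({"detail-type": "Account Vended", "time": "2024-01-01T09:45:00Z",
                        "detail": {"request_id": "req-1"}}, store)
print(elapsed)  # seconds between the request and the completed vend
```

The returned value is what would be written to Timestream as the measurement for this account vending request.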
Equipped with this data, the product and platform teams parallelized certain post-account-creation activities and converted hard dependencies to soft dependencies. This reduced the time that it takes to vend accounts to under 30 minutes under certain conditions. Furthermore, the account vending success rate went up significantly, to 90%, after the adjustments were made. Then, after a few weeks of consistent delivery, the account vending success rate KPI target was also updated to 90%. These SLO improvements directly improved the pace of innovation at this customer, and eventually resulted in many more business teams adopting the AWS Cloud to enable their outcomes.
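The effect of parallelizing post-creation activities can be illustrated with a small calculation. The step names and durations below are invented for the example and are not measurements from this customer.

```python
from typing import Dict


def serial_minutes(create_minutes: float, post_steps: Dict[str, float]) -> float:
    """Total time when every post-creation step waits for the previous one."""
    return create_minutes + sum(post_steps.values())


def parallel_minutes(create_minutes: float, post_steps: Dict[str, float]) -> float:
    """Total time when post-creation steps run side by side after creation."""
    return create_minutes + max(post_steps.values(), default=0)


# Hypothetical durations in minutes.
steps = {"baseline_networking": 15, "security_tooling": 10, "access_setup": 15}
print(serial_minutes(20, steps), parallel_minutes(20, steps))
```

When steps no longer depend on each other, the total is bounded by the slowest step rather than by the sum of all steps.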
Automation using Customizations for Control Tower
Customizations for Control Tower (CFCT) is a framework to automate change to your landing zone, with configuration changes represented as infrastructure as code. In general, changes to a landing zone can be made through Python, manually on the console, or through Terraform (optionally with pipelines). Out of the box, CFCT v2 supports AWS CloudFormation to make changes across multiple accounts. The types of changes that you can make are:
- Deployment of Service Control Policies
- Deployment of resources that you can provision through CloudFormation
To learn more, see the Customizations for Control Tower solution documentation.
The following diagram shows how CFCT v2 is used to deploy the resources required to run Flight Controller, as well as the resources needed in each workload account to record and transport events to the Flight Controller account.
Figure 4 Flight Controller deployment using CFCTv2
CFCT v2 is deployed using a CloudFormation template after you deploy your landing zone using AWS Control Tower. The next step is to deploy the Flight Controller resources using a manifest file. When the Flight Controller manifest is checked in, the following occurs:
- The Flight Controller Account is vended and the resources within it are created. Out-of-the-box (OOB), Flight Controller comes with an event catalog, which is designed to be easily extended.
- A custom event bus and event rules for the OOB event types are deployed in all of the configured workload accounts.
Subsequently, when a new workload account is provisioned, a custom event bus and event rules for the OOB event types are deployed in the workload account.
New event types can be configured in two steps: add new event rules to the accounts from which the events need to be sourced, and add a Lambda event processor function for the event in the Flight Controller management account to parse and store the new event type.
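One way to keep event processors easy to extend is a small registry that maps each event type to its parsing function. The sketch below is an assumption about how such processors could be organized inside a single function. It is not a description of how Flight Controller structures its Lambda functions, and the ‘Patch Completed’ event type is hypothetical.

```python
from typing import Callable, Dict, Optional

# Maps an EventBridge detail-type to the function that parses it.
PROCESSORS: Dict[str, Callable[[dict], dict]] = {}


def processor(detail_type: str):
    """Decorator that registers a parsing function for one event type."""
    def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        PROCESSORS[detail_type] = fn
        return fn
    return register


@processor("Account Requested")
def on_account_requested(detail: dict) -> dict:
    return {"kind": "start", "key": detail["request_id"]}


@processor("Patch Completed")  # hypothetical new event type
def on_patch_completed(detail: dict) -> dict:
    return {"kind": "end", "key": detail["instance_id"]}


def dispatch(event: dict) -> Optional[dict]:
    """Route an event to its processor; unknown event types are ignored."""
    handler = PROCESSORS.get(event["detail-type"])
    return handler(event["detail"]) if handler else None
```

With this layout, supporting a new event type means adding one decorated function alongside the new event rule in the source accounts.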
When a new feature is launched in the landing zone, the corresponding SLOs are defined, configured, and added to the dashboard so that you have full observability of the delivery quality from the start.
In this post, we discussed how Flight Controller for Landing Zones brings clarity to what your landing zone offers its users, and empowers the team to make informed priority decisions based on where they are, or aren’t, meeting their SLO targets. We also discussed how Flight Controller embeds a data-driven culture at the heart of your cloud adoption journey, and maximizes the impact of your landing zone by enabling the right conversations across your business.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.