AWS Cloud Operations Blog
Reinventing automated operations (Part I)
This is the first in a two-part series that covers lessons learned at AWS Managed Services (AMS) as we help customers and partners achieve operational excellence on AWS. To create a secure and consistent cloud operating model, you need both operational experience and AWS skills. In my conversations with customers, it is common for experienced IT teams to be in the early stages of learning AWS skills. I also often find that while the application teams have the necessary AWS skills, they don’t always have experience in network and infrastructure operations. Automated operations can help overcome these experience and skill gap challenges.
In the second post, I’ll cover how we use automation as a flywheel of continuous improvement to achieve operational outcomes at scale that would be challenging using traditional approaches. No matter where you are in the migration process, it’s never too late to consider operations, and every investment will help regardless of your approach.
The importance of operations in the cloud
The cloud has many of the same operational roles and requirements found in an on-premises environment. Critical business applications must conform to Payment Card Industry (PCI), International Organization for Standardization (ISO), System and Organization Controls (SOC), Health Insurance Portability and Accountability Act (HIPAA), General Data Protection Regulation (GDPR), and other standards. What’s different is that the required skill set has shifted from that of system admins to that of software developers and system engineers. In the world of infrastructure as code, it’s easy to forget that the resources you are deploying need ongoing operational care and feeding. Even among customers who are using a DevOps approach, there is no shared definition of what the Ops portion entails in the cloud. In the traditional data center, DevOps teams were responsible for application operations. They left patching, networking, hardware, storage, shared services (AV/malware), and operating system (OS) hardening to their central IT group. Because most of the focus during a migration is on applications, it is understandable that operations doesn’t always come to the forefront until 6-12 months in. Operational deficiencies take time to manifest, and when they do, they can negatively impact your migration and business.
Operations is still important when customers transition from workloads hosted on Amazon Elastic Compute Cloud (Amazon EC2) to containers and serverless. The operations work doesn’t go away; it just changes. Your OS-based operational headaches (patching, backup, and host security) give way to ensuring that your next-gen workloads operate within compliance, organizational, and security controls, and to triaging alarms and events to determine which will result in an incident. Even managed AWS services like Amazon Relational Database Service (Amazon RDS) and Amazon Elastic Kubernetes Service (Amazon EKS) must be monitored 24/7 for Amazon CloudWatch events that indicate performance issues and resource failures. Next-gen applications must also conform to the same security and compliance controls as traditional workloads.
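To make the idea of deciding which alarms result in an incident a little more concrete, here is a minimal sketch of a triage rule in Python. It is an illustration only, not how AMS or any AWS service implements triage. The `AlarmEvent` fields, the `needs_incident` function, and the 10-minute threshold are all assumptions invented for this example; only the alarm state names (`ALARM`, `OK`, `INSUFFICIENT_DATA`) come from Amazon CloudWatch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmEvent:
    """Hypothetical, simplified view of an alarm notification."""
    resource_id: str     # e.g. an RDS instance or EKS cluster name
    metric: str          # e.g. "CPUUtilization"
    state: str           # CloudWatch alarm state: "ALARM", "OK", "INSUFFICIENT_DATA"
    production: bool     # assumed tag: does the resource serve production traffic?
    breach_minutes: int  # assumed field: minutes spent in the current state


def needs_incident(event: AlarmEvent, min_breach_minutes: int = 10) -> bool:
    """Return True when an alarm should be escalated to an incident.

    The rule is an assumption for illustration: escalate only sustained
    alarms on production resources; everything else is just recorded.
    """
    if event.state != "ALARM":
        return False
    if not event.production:
        return False
    return event.breach_minutes >= min_breach_minutes


def triage(events):
    """Split events into (incidents, observations) using needs_incident."""
    incidents = [e for e in events if needs_incident(e)]
    observations = [e for e in events if not needs_incident(e)]
    return incidents, observations


if __name__ == "__main__":
    sample = [
        AlarmEvent("db-prod-1", "CPUUtilization", "ALARM", True, 15),
        AlarmEvent("db-dev-1", "CPUUtilization", "ALARM", False, 30),
        AlarmEvent("db-prod-2", "FreeStorageSpace", "ALARM", True, 3),
    ]
    incidents, observations = triage(sample)
    print("incidents:", [e.resource_id for e in incidents])
    print("observations:", [e.resource_id for e in observations])
```

Encoding the rule as a small, testable function is the point of the sketch: every alarm is evaluated the same way, regardless of which team owns the workload or who is on call.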
Using the right resources to perform operations is also important to your migration success. Instead of integrating your processes with dozens of vendor tools, in the cloud you must combine your business processes with dozens of individual AWS services to replicate an operating model that is similar to your data center. Although your AWS application teams can do this work, the results may be inconsistent across teams. This approach is also the most expensive solution to your operational problems. Cloud application developers are highly skilled and scarce, and they become frustrated when handling the low-value, high-risk operational work that requires them to carry a pager on evenings and weekends. For a better return on your investment, keep AWS developers focused on migrating, refactoring, and building applications that have an impact on your business. For example, by focusing your teams on the pillars and best practices of AWS Well-Architected, you can take full advantage of cloud benefits instead of chasing down a failed patch.
Prescription isn’t as important as we thought
For a developer, it’s always easier to automate something that is predictable. For those considering a migration to AWS, a prescriptive approach simplifies and accelerates deployment. It provides answers to thousands of frequently asked, migration-related questions. When AMS launched in December 2016, our service was heavily prescriptive. Not only did we help customers reduce onboarding time by more than 50%, we also achieved 91% automation across all operational activities. We essentially made those activities self-service to our customers. Because automation executes consistently and from a principle of least privilege, you can breeze through compliance audits. Today, AMS holds attestations for pretty much every major industry compliance standard. Customers such as Sallie Mae took full advantage of our approach, and they closed two data centers ahead of schedule and under budget.
There are drawbacks to a prescriptive approach, though. Each customer has different needs (and sometimes those needs differ even between business units). There really is no one right way of doing things. At AWS, there are several account-configuration designs (a.k.a. landing zones) that extend AWS Control Tower and AWS Organizations to meet the specific needs of the customer and their workloads. Customers also interact with AWS in different ways: through the AWS Management Console, AWS CLI, APIs, and integrations from ISVs and partners for every imaginable use case. Customers come to the cloud because it offers speed and agility compared to their data centers. They don’t want heavy restrictions, constrained interactions, and excessive operational controls and standards. It is also difficult to keep prescriptive systems up to date with the latest AWS service and feature innovations. A prescriptive approach also makes it harder to help with operations, because most customers have already made some investment in AWS before they realize they need operational assistance.
Learning from our experience, we recently released the AMS Accelerate operations plan that can overlay operational assistance on top of existing customer deployments. Using this approach, the burden of working with the unpredictable is on us instead of our customers. We meet them wherever they are in their migration and cloud lifecycle. Our main takeaway is that prescription isn’t nearly as important as dependency management (the agents, configurations, and permissions required to operate) and creating automation flywheels for scalable operations.
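One way to picture dependency management is as a comparison between what an operations function requires and what is actually present in an account. The Python sketch below is an illustration under stated assumptions, not AMS tooling: the `OperationalDependencies` type, the `missing_dependencies` and `is_operable` functions, and every agent, configuration, and permission name in it are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class OperationalDependencies:
    """Hypothetical inventory of what is needed to operate an account."""
    agents: set = field(default_factory=set)          # software installed on hosts
    configurations: set = field(default_factory=set)  # account or resource settings
    permissions: set = field(default_factory=set)     # access granted to operators


def missing_dependencies(required, observed):
    """Return the required items that are absent, grouped by category."""
    return {
        "agents": sorted(required.agents - observed.agents),
        "configurations": sorted(required.configurations - observed.configurations),
        "permissions": sorted(required.permissions - observed.permissions),
    }


def is_operable(required, observed) -> bool:
    """True when every required dependency is present."""
    return not any(missing_dependencies(required, observed).values())


if __name__ == "__main__":
    # All names below are illustrative labels, not real AMS requirements.
    required = OperationalDependencies(
        agents={"monitoring-agent", "patch-agent"},
        configurations={"log-retention"},
        permissions={"read-metrics"},
    )
    observed = OperationalDependencies(
        agents={"monitoring-agent"},
        configurations={"log-retention"},
    )
    print(missing_dependencies(required, observed))
    print("operable:", is_operable(required, observed))
```

The check says nothing about how the account was built. It reports only the gaps that stand between an existing deployment and being operable, which is the sense in which the burden shifts away from prescribing the customer's configuration.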
Conclusion
Operational excellence is more of a journey than a destination: there is no one right way. Any effort you make to improve operations will yield returns in the form of increased security, availability, and agility. Being less prescriptive about configurations and offloading operations work to automation will also free up your application teams to focus on migrating new workloads instead of taking care of those already moved. If your in-house capacity is constrained or your team doesn’t have the right skill sets, I encourage you to consider using one of the AWS Certified Managed Service Provider Partners or AMS Accelerate to simplify your migration to the cloud.
In the second part of this series, I will cover the AWS services we use, how we have converted our operations expertise into operational artifacts, and the nine high-level automation flywheels AMS created to convert our operational experience and learnings into repeatable and scalable operational capabilities.