The Critical Missing Piece of DevOps…And How to Find It

We’ve probably all heard the DevOps principle “you build it, you run it.” In theory, DevOps makes each team responsible for both the development and operation of its code, giving DevOps teams complete responsibility — and complete visibility and transparency — for the entire value stream, including not just coding, testing, securing, and complying, but even the business results of the code when it is running in production. But IT operations includes much more than the limited “ops” functions we typically fold into a DevOps team. I’m talking about things like ticket management, incident handling, user management and authorization, backups and recovery, network management, security operations, infrastructure procurement and cost optimization, compliance reporting, and much more. In today’s IT organization, where do these responsibilities fall? And how can we improve these operations and perhaps even apply DevOps and Agile principles to them?

This post, the first in a series on how to best think about operations in the cloud, will explore that set of operations functions that is not typically assigned to DevOps teams. We will also talk about how organizations not yet using DevOps can still benefit from streamlined operations when they migrate their applications as-is to the cloud.

Why aren’t these functions typically performed by the DevOps product teams? For one thing, fast feedback is critical to DevOps. You want DevOps teams to have a streamlined, low lead-time, lean pipeline to production. Devoting team capacity to this broader set of operational functions may slow down this pipeline. There are also efficiencies to be gained by sharing these practices across the work of all the DevOps teams. For example, it might not make sense for each team to have its own way of communicating about production incidents, and certain functions like user management and cost optimization require a view across all systems and IT capabilities.

All of this is to say that a portion of IT operations still exists independently of the DevOps teams, performing those “ops” functions that are not in “DevOps” while the DevOps teams focus on that subset of ops functions specifically related to deploying code and responding to code-related incidents (“wearing the pager”).

You might recognize the voice of hard-won experience here. In my role as CIO of USCIS I once made the mistake of not paying enough attention to that portion of ops that lies outside of DevOps. We had a large initiative going on with about 15 agile teams. When they released code into production, they found that they needed to set up a process for handling user problems and questions, production incidents, and monitoring alerts. As the system became more complex, this burden became heavier. In a few cases, business leaders as well as teams working on other systems downstream and upstream complained that they hadn’t been notified of outages that affected them.

One day, the head of our Network Operations Center (NOC) happened to visit the offices of that project. When he saw what they were doing, he was stunned. “Why aren’t you just using our normal incident handling process? We have a situation room at the NOC, escalation procedures, an incident response team, and a runbook for contacting the people who need to be informed or involved in diagnosing issues. We can show you our statistics on how good we are at this and how we’ve been getting better and better at it. Why would you re-invent the wheel?”

Why indeed? I had been so focused on combining ops and dev in each team that the DevOps teams, feeling responsible for managing their system in production, had responded by cobbling together their own incident response processes. They also authorized new users to use the system, tried to resolve network issues that affected their system, and lots more. That degree of ownership was admirable; the fault was mine for not clearly thinking through how the combined effort of the entire organization could be harnessed to provide the best results.

Part of what IT leadership needs to do then — ouch! — is to set up the environment or the context in which DevOps teams can be most successful. This certainly involves cultural change, changes to governance and investment management processes, and in some cases organizational changes, but it also involves integrating the DevOps teams with the rest of the IT organization and its processes.

Some organizations stand up a central Platform and Tools team to provide a common infrastructure on which DevOps teams build and operate. Sometimes this team provides test suites and monitoring capabilities that serve as guardrails to ensure security and compliance. A centralized team might also handle Network Operations Center (NOC) and Security Operations Center (SOC) functions. A Site Reliability Engineering (SRE) group might help DevOps teams optimize and oversee the performance of their code. There are usually teams that provision and support the devices that employees will use. There may be a Tier 1 and Tier 2 support help desk. Netflix even has an Insight Engineering team that tries to make monitoring and logging easily actionable by DevOps teams. Many variations are possible — what they have in common is they provide an ecosystem in which DevOps teams operate.

All of this raises some interesting questions, though. First, can these other operations functions borrow some of the ideas of DevOps to streamline what they do? Automation, for example. Or testing in production. Or building cross-functional teams with end-to-end responsibilities. Second, can we avoid a “handoff” between the DevOps teams and these other operational teams? Remember, DevOps was created to avoid handoffs between Dev and Ops. But how about that moment when the help desk starts handling Tier 1 and Tier 2 support calls? When end user devices must be provisioned to give employees access to the new system? When the SOC begins monitoring activity on the system?

On the first question, I have some good news. The idea of applying DevOps-influenced best practices to all of operations has taken a large step forward with the release of the latest version of AWS’s Well-Architected Framework, which describes best practices for architecting systems in the cloud. It includes a fleshed-out Operational Excellence Pillar, which adds details about best practices for operating workloads once they are in the cloud.

Most AWS services can be accessed via APIs, which makes it possible to script their activities. As a result, it is possible to automate a great number of operational processes. Many other processes lend themselves to AWS Lambda functions, which can be triggered by events that take place in the operational environment.
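To make the idea of event-triggered automation concrete, here is a minimal sketch of a Lambda-style handler in Python. It assumes the function is subscribed to EC2 instance state-change events delivered through Amazon EventBridge. The `notify` helper, the `WATCHED_STATES` set, and the message wording are hypothetical; they stand in for whatever the organization's shared incident process would actually call.

```python
"""Sketch of an event-driven operations hook. Names are illustrative."""
from typing import Any, Callable, Dict, List

# Assumption: the instance states the operations team wants to hear about.
WATCHED_STATES = {"stopped", "terminated"}

# Hypothetical stand-in for a real channel (ticketing system, pager, chat).
SENT_MESSAGES: List[str] = []


def notify(message: str) -> None:
    """Record the message; a real system would forward it to the NOC."""
    SENT_MESSAGES.append(message)


def handler(
    event: Dict[str, Any],
    context: Any,
    send: Callable[[str], None] = notify,
) -> Dict[str, Any]:
    """Entry point in the (event, context) shape Lambda uses for Python.

    The extra `send` parameter has a default, so Lambda can still call
    handler(event, context); it exists only to make the sketch testable.
    """
    detail = event.get("detail", {})
    state = detail.get("state")
    instance_id = detail.get("instance-id", "unknown")

    # Ignore events from other sources and states nobody needs to act on.
    if event.get("source") != "aws.ec2" or state not in WATCHED_STATES:
        return {"notified": False, "instance": instance_id, "state": state}

    send(f"Instance {instance_id} entered state '{state}'")
    return {"notified": True, "instance": instance_id, "state": state}
```

Passing the notification function in as a parameter keeps the routing decision separate from the delivery channel, so the same handler could feed a central incident process rather than one each team invents for itself.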

AWS Managed Services (AMS), which provides ongoing management of the infrastructure for customer applications, has developed these operational best practices from its own experience helping customers, and uses them in delivering its services. It has automated a great deal of the operations processes it uses to support its customers, and instituted processes that allow it to continuously learn and improve.

Although taking full advantage of the cloud often requires some refactoring or rewriting of applications, AWS customers have found that even a simple lift-and-shift of existing applications into the cloud can result in cost savings, higher availability, and increased security — simply because being in the cloud allows for better operational processes. AMS has created repeatable, highly automated processes to harness these improvements. By using these automated, best practice operations mechanisms, AMS can help customers lift-and-shift legacy workloads while helping them gain the advantages of best-in-class cloud operations. Customers can then begin refactoring their applications on their own schedule and move them to a DevOps underpinning.

On the second question, how to avoid handoffs, the answer is that enterprises simply must involve experts from across operations domains in the development process. At AWS, we like to say that systems must be designed with operations in mind. This is no different from other DevOps practices; it is already common to talk about designing with security in mind from the start. An interesting technique to consider is to make operations runbooks, for the things that cannot be automated, part of the system deliverable: artifacts that are developed along with the system and checked into version control along with code, tests, and deployment scripts.

Runbooks can help facilitate compliance. Changes to them can be audited and restricted to certain roles. The existence of the automated scripts and manual runbooks can serve as proof that compliance controls have been implemented.
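One way to make a manual runbook auditable alongside the code is to keep it as structured data in the repository and check it automatically on every change. The sketch below is only an assumption about how that might look, not a description of any AWS or AMS tooling. The `Runbook` fields, the `ALLOWED_OWNER_ROLES` set, and the rules in `validate` are all hypothetical.

```python
"""Sketch: a runbook kept as data and checked like any other artifact."""
from dataclasses import dataclass, field
from typing import List

# Assumption: the roles this organization lets own a runbook.
ALLOWED_OWNER_ROLES = {"noc", "soc", "help-desk", "product-team"}


@dataclass
class Runbook:
    title: str
    owner_role: str           # who is accountable for keeping it current
    compliance_control: str   # identifier of the control this procedure covers
    steps: List[str] = field(default_factory=list)


def validate(runbook: Runbook) -> List[str]:
    """Return a list of problems; an empty list means the runbook passes."""
    problems: List[str] = []
    if not runbook.title.strip():
        problems.append("title is empty")
    if runbook.owner_role not in ALLOWED_OWNER_ROLES:
        problems.append(f"unknown owner role: {runbook.owner_role}")
    if not runbook.compliance_control.strip():
        problems.append("no compliance control referenced")
    if not runbook.steps:
        problems.append("runbook has no steps")
    return problems
```

A check like this could run in the same pipeline as the tests, so a runbook that loses its owner or its control reference fails the build, and the version history records who changed what.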

The upshot is that enterprises need to create the context around DevOps that takes advantage of its short feedback cycles, risk reduction, and compliance and quality controls. As I have learned, IT operations, in its broadest sense, is an important part of this ecosystem. DevOps teams run what they build but do so in the context of other IT operational functions, just as they do in the context of governance processes, compliance audits, and organizational structures.

As I continue this blog series, we will talk about organizational structures for IT, cost optimization in the cloud, security operations, and what it looks like when an integrated IT organization is functioning optimally.

–Mark
@schwartz_cio
A Seat at the Table: IT Leadership in the Age of Agility
The Art of Business Value

AWS Cloud Enterprise Strategy Blog
