AWS Public Sector Blog

Reduce mean time to contain (MTTC) on incidents against digital citizen services

Attacks on digital citizen services can cause citizens to lose trust in their governments. Services such as real estate land title searches, emergency services, and more need to be operational in times of need. As an IT leader for digital citizen services, you need to automate incident management runbooks. In this post, learn how the automation of incident response starts with what you already have: your existing incident response runbook (the sequence of steps to resolve incidents).

Public sector organizations face a challenge over the next two to five years: they must do more with less and transform along the way. That challenge also brings opportunities to serve citizens with new services. One such opportunity is to reduce the incident mean time to contain (MTTC) with automated remediation. In this post, I describe three prescriptive steps to start orchestrating your own incident response strategy and, within it, evolve your runbook.

What we want and need are shorter incidents

Figure 1: Incident Response Flywheel

To start, look at an incident response flywheel (Figure 1)—a mental model that iterates and improves with each incident, with the goal at its center. What we want, and ultimately need, are shorter incidents. In our journey to achieve this, we need to be able to detect incidents faster, respond faster, and remediate faster. By decoupling each stage and diving deep into what happened and why, we can empower teams to learn and apply these new findings to prepare for the next iteration of our runbook. This iterative, yet transformative model makes sure that we progress in our journey toward our goal.
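To make the goal at the center of the flywheel measurable, MTTC can be tracked across incidents. The sketch below assumes MTTC is defined as the average time from detection to containment, which is one common definition but not one this post spells out. The `Incident` record, its field names, and the timestamps are hypothetical illustration data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass
class Incident:
    # Hypothetical record; the field names are assumptions for this sketch.
    detected_at: datetime
    contained_at: datetime


def mean_time_to_contain(incidents: Sequence[Incident]) -> timedelta:
    """Average of (contained_at - detected_at) across all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.contained_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


# Illustrative data only, not real measurements.
incidents = [
    Incident(datetime(2023, 1, 5, 9, 0), datetime(2023, 1, 5, 11, 0)),   # contained in 2 h
    Incident(datetime(2023, 2, 9, 14, 0), datetime(2023, 2, 9, 15, 0)),  # contained in 1 h
]
mttc = mean_time_to_contain(incidents)
print(mttc)
```

Recomputing this value after each post-incident analysis gives a simple way to check whether each turn of the flywheel is actually shortening incidents.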

With this mental model in mind, here are the three steps to implement it in practice, starting with an existing asset: your incident management runbook.

1) Place your current incident management runbook as documents in AWS Systems Manager

Your current incident management runbook may reside in various places that are accessible to your response team. To access the runbook when you need it, place the runbook as documents in the response plan of the AWS Systems Manager Incident Manager. As an example, Figure 2 shows a generic incident management runbook document with three steps: Step 1: Triage; Step 2: Mitigation; Step 3: Recovery. 

Figure 2: Your runbook as Automation Documents

Under each step of triage, mitigation, and recovery, there are smaller sequential or parallel steps that incident response teams execute to recover from the incident. For example, under Step 2: Mitigation, one of these steps could be to automatically quarantine a compromised Amazon Elastic Compute Cloud (Amazon EC2) instance. Another step could be to create an Amazon Elastic Block Store (Amazon EBS) snapshot for forensic investigation. With this Automation Document, your response team has prescriptive guidance to follow at times when phones are constantly chiming and alert fatigue takes over.
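As a rough illustration of such a document, the sketch below builds a Systems Manager Automation document (schema version 0.3) with the three stages from Figure 2. Triage and Recovery are left as manual `aws:pause` steps, and Mitigation uses `aws:executeAwsApi` to call the EC2 `ModifyInstanceAttribute` and `CreateSnapshot` APIs. The step names, parameter names, and the idea of a pre-created quarantine security group are assumptions for this example, not the author's actual runbook.

```python
import json


def build_runbook_document() -> dict:
    """Return a Systems Manager Automation document (schema 0.3) as a dict."""
    return {
        "schemaVersion": "0.3",
        "description": "Incident runbook sketch: triage, mitigation, recovery.",
        "parameters": {
            "InstanceId": {"type": "String", "description": "Compromised EC2 instance"},
            "VolumeId": {"type": "String", "description": "EBS volume to snapshot"},
            # Assumption: a restrictive security group created ahead of time.
            "QuarantineSecurityGroupId": {"type": "String", "description": "Quarantine security group"},
        },
        "mainSteps": [
            # Step 1: Triage stays manual; aws:pause waits until a responder resumes.
            {"name": "Triage", "action": "aws:pause", "inputs": {}},
            # Step 2a: Mitigation. Replace the instance's security groups
            # with the quarantine group.
            {
                "name": "QuarantineInstance",
                "action": "aws:executeAwsApi",
                "inputs": {
                    "Service": "ec2",
                    "Api": "ModifyInstanceAttribute",
                    "InstanceId": "{{ InstanceId }}",
                    "Groups": ["{{ QuarantineSecurityGroupId }}"],
                },
            },
            # Step 2b: Mitigation. Snapshot the volume for forensic investigation.
            {
                "name": "SnapshotVolume",
                "action": "aws:executeAwsApi",
                "inputs": {
                    "Service": "ec2",
                    "Api": "CreateSnapshot",
                    "VolumeId": "{{ VolumeId }}",
                },
            },
            # Step 3: Recovery is also manual in this sketch.
            {"name": "Recovery", "action": "aws:pause", "inputs": {}},
        ],
    }


document_json = json.dumps(build_runbook_document(), indent=2)
print(document_json)
```

The resulting JSON could then be registered through the Systems Manager `CreateDocument` API, for example with boto3's `ssm.create_document(Content=document_json, Name="IncidentRunbook", DocumentType="Automation", DocumentFormat="JSON")`, where the document name is hypothetical.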

Automation of incident responses starts with a single step. By placing your runbook here, all of the steps (manual or automatic) that your response teams follow are outlined. You may have no automation now, but you’ve fleshed out the manual steps that your response team follows so that, over time, you can build an automation out of them. With that new automation, you can gauge how well it’s working and then iterate upon it. Then, build the second automation out of these steps, iterate, then build the third, and so on. Now, learn how to set up the incident management service to use this Automation Document.

2) Set up AWS Systems Manager Incident Manager

At the heart of AWS Systems Manager Incident Manager is the response plan. Planning for incidents ahead of time saves teams crucial operational time during an incident. Some of the best practices to consider when designing a response plan include:

  • Streamline engagement – Identify the most appropriate team for an incident. This focused approach narrows down the right response teams rather than engaging wide distribution lists which hinder progress during critical incidents.
  • Create reliable escalation – Using escalation plans makes sure that responders are effectively and reliably engaged. Even with the best intentions, responders can be unreachable. Configuring backup responders in succession in an escalation plan covers these scenarios.
  • Iterate runbooks – Developing runbooks that provide repeatable and understandable steps helps reduce the stress that responders experience during incidents.
  • Collaborate with the right teams – Use chat channels to streamline communication during incidents. Chat channels help responders stay up-to-date with information and also share information.

With best practices in mind, follow the Incident Preparation steps. After this is completed, test how your preparation steps are working. Select the Start Incident button and observe the response plan in action as it engages the contacts in your Escalation Plan and starts the Automation Document containing the runbook steps.
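For teams that prefer to script this setup, the sketch below assembles the request for the Incident Manager `CreateResponsePlan` API as a plain dictionary. The plan name, incident title, document name, account ID, and ARNs are hypothetical placeholders; the impact value uses Incident Manager's scale of 1 (critical) to 5 (no impact).

```python
def build_response_plan_request(name, document_name, automation_role_arn, contact_arns):
    """Assemble keyword arguments for the Incident Manager CreateResponsePlan API."""
    return {
        "name": name,
        "incidentTemplate": {
            "title": "Compromised EC2 instance",  # hypothetical default incident title
            "impact": 2,  # 1 = critical, 2 = high, ..., 5 = no impact
        },
        # Contacts or escalation plans to engage when an incident starts.
        "engagements": list(contact_arns),
        # The runbook: an Automation Document started together with the incident.
        "actions": [
            {
                "ssmAutomation": {
                    "documentName": document_name,
                    "roleArn": automation_role_arn,
                }
            }
        ],
    }


# Placeholder identifiers for illustration only.
request = build_response_plan_request(
    name="citizen-services-response-plan",
    document_name="IncidentRunbook",
    automation_role_arn="arn:aws:iam::111122223333:role/IncidentAutomationRole",
    contact_arns=["arn:aws:ssm-contacts:us-east-1:111122223333:contact/oncall-escalation"],
)
print(request)
```

With boto3, this dictionary could be passed as `boto3.client("ssm-incidents").create_response_plan(**request)`, and the scripted equivalent of the Start Incident button is `start_incident(responsePlanArn=...)` with the ARN that call returns.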

After the incident is resolved, you can start a post-incident analysis. This guides your teams through identifying improvements to your incident response, including time to detection and mitigation, as well as diving deeper into the root cause. Incident Manager creates recommended action items to improve your incident response. Looking back at the Incident Response Flywheel in Figure 1, at this stage, we’re decoupling and empowering teams to ask the five levels of why this incident happened. After doing this post-incident analysis, teams can iterate on their next incident preparation with real data. This is the incident flywheel mechanism working well and ultimately benefiting the digital citizens’ experience of your service offerings. Now build out the rest of this end-to-end flow.

3) Create end-to-end flow with monitoring

With the containment portion of your system in place and tested, add the detection system for an end-to-end flow with monitoring. Turn on Amazon GuardDuty and AWS Security Hub, create a rule in Amazon EventBridge, monitor the rule triggers via Amazon CloudWatch, create a topic in Amazon Simple Notification Service (Amazon SNS), and link AWS Chatbot to that SNS topic. AWS Chatbot can integrate with an existing Slack channel that your response team uses. The architecture is depicted in Figure 3.

Figure 3: End-to-end flow

3.1 Enable GuardDuty

GuardDuty is an intelligent threat detection service that continuously monitors for malicious activity and delivers detailed security findings for visibility and remediation. Go to the GuardDuty console and choose ‘Get Started’ and ‘Enable GuardDuty’ to start your 30-day no-cost trial. GuardDuty analyzes event details from VPC flow logs, DNS logs, and AWS CloudTrail without any additional infrastructure to provision.

Compromised EC2 instances are picked up by GuardDuty from the VPC Flow Logs and findings are generated. To observe a few sample findings, select the ‘Generate Findings’ button on the GuardDuty console. These are sample findings that GuardDuty generates for testing purposes; they do not reflect activity on your own EC2 instances. This GitHub repo has scripts to generate different types of findings on your configured EC2 instances after your detection setup is complete.
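The same two actions, enabling GuardDuty and generating sample findings, can also be scripted. The sketch below is one possible shape, assuming a boto3 GuardDuty client is passed in and that no detector exists yet in the Region (`create_detector` fails if one already does). The helper name and the choice of finding types are illustrative.

```python
# Two EC2-related GuardDuty finding types, chosen only as examples.
SAMPLE_FINDING_TYPES = [
    "CryptoCurrency:EC2/BitcoinTool.B!DNS",
    "UnauthorizedAccess:EC2/RDPBruteForce",
]


def enable_guardduty_with_samples(guardduty_client, finding_types):
    """Enable GuardDuty in the client's Region and generate sample findings.

    guardduty_client is expected to behave like boto3.client("guardduty").
    Returns the ID of the newly created detector.
    """
    # Assumption: no detector exists yet in this Region.
    detector_id = guardduty_client.create_detector(Enable=True)["DetectorId"]
    # Sample findings are synthetic; they do not describe real resources.
    guardduty_client.create_sample_findings(
        DetectorId=detector_id,
        FindingTypes=list(finding_types),
    )
    return detector_id
```

A call such as `enable_guardduty_with_samples(boto3.client("guardduty"), SAMPLE_FINDING_TYPES)` would then produce a small, predictable set of findings to exercise the rest of the flow.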

3.2 Turn on Security Hub

Security Hub is a cloud security posture management service that performs security best practice checks, aggregates alerts, and enables automated remediation. You get a comprehensive view of your high-priority security alerts and security posture across all of your AWS accounts. To get started, a single click on the Security Hub console enables it. Once enabled, Security Hub immediately starts aggregating your security findings, including those from GuardDuty, such as a compromised EC2 instance.

3.3 Create a topic in Amazon SNS

Amazon SNS is a managed service that provides message delivery from publishers to subscribers (also known as producers and consumers). Publishers communicate asynchronously with subscribers by sending messages to a topic, which is a logical access point and communication channel. Clients can subscribe to the SNS topic and receive published messages using a supported endpoint type.

Create a topic of the Standard type.
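A scripted version of this step might look like the sketch below, assuming a boto3 SNS client is passed in. It also attaches a resource policy that allows EventBridge (`events.amazonaws.com`) to publish to the topic, which the rule created later needs in order to deliver events. The function names and the `Sid` are hypothetical, and the policy shown replaces the topic's default policy, so treat it as a starting point rather than a complete policy.

```python
import json


def build_topic_policy(topic_arn: str) -> dict:
    """Resource policy that lets EventBridge publish to the given SNS topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEventBridgePublish",
                "Effect": "Allow",
                "Principal": {"Service": "events.amazonaws.com"},
                "Action": "sns:Publish",
                "Resource": topic_arn,
            }
        ],
    }


def create_standard_topic(sns_client, name: str) -> str:
    """Create a Standard topic, attach the publish policy, and return the topic ARN.

    sns_client is expected to behave like boto3.client("sns").
    """
    # Not setting the FifoTopic attribute yields a Standard topic.
    topic_arn = sns_client.create_topic(Name=name)["TopicArn"]
    # Note: this overwrites the default topic policy.
    sns_client.set_topic_attributes(
        TopicArn=topic_arn,
        AttributeName="Policy",
        AttributeValue=json.dumps(build_topic_policy(topic_arn)),
    )
    return topic_arn
```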

3.4 Configure a Slack client to AWS Chatbot and subscribe to the SNS topic

The AWS Chatbot enables teams to use messaging program chat rooms to monitor and respond to operational events in their AWS Cloud. AWS Chatbot processes AWS service notifications from SNS and forwards them to chat rooms so that teams can analyze and act on them immediately, regardless of their location. Once you have the SNS topic created, configure a Slack client and subscribe to that SNS topic. GuardDuty findings forwarded to Security Hub show up in your Slack channel.

3.5 Create Amazon EventBridge Rule with Event Pattern and Target

EventBridge is a serverless event bus. Create a simple rule, as shown in Figure 4, with an Event Pattern from the Security Hub findings. Then create two targets:

  1. The response plan in AWS Systems Manager Incident Manager.
  2. The SNS topic so that it triggers a Slack notification via the AWS Chatbot.

For greater observability, use CloudWatch to monitor a summary of the individual invocations of your rules via the CloudWatch Events dashboard.
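A rule like this can be expressed as an event pattern plus a list of targets. The sketch below matches Security Hub's "Security Hub Findings - Imported" events and, as an assumption for illustration, narrows them to GuardDuty findings with a HIGH or CRITICAL severity label; the exact pattern in Figure 4 may differ. The target IDs, account ID, and ARNs are placeholders, and the IAM role on the response plan target is assumed to permit EventBridge to start incidents.

```python
import json


def build_event_pattern(severity_labels=("HIGH", "CRITICAL")):
    """Event pattern for Security Hub findings that originate from GuardDuty."""
    return {
        "source": ["aws.securityhub"],
        "detail-type": ["Security Hub Findings - Imported"],
        "detail": {
            "findings": {
                # Assumption: filter on product and severity to reduce noise.
                "ProductName": ["GuardDuty"],
                "Severity": {"Label": list(severity_labels)},
            }
        },
    }


def build_targets(response_plan_arn, events_role_arn, sns_topic_arn):
    """The two rule targets: the response plan and the SNS topic."""
    return [
        # Target 1: starts an incident from the Incident Manager response plan.
        {"Id": "incident-manager-response-plan", "Arn": response_plan_arn, "RoleArn": events_role_arn},
        # Target 2: publishes to SNS, which AWS Chatbot forwards to Slack.
        {"Id": "slack-notification-topic", "Arn": sns_topic_arn},
    ]


event_pattern_json = json.dumps(build_event_pattern())
targets = build_targets(
    "arn:aws:ssm-incidents::111122223333:response-plan/citizen-services-response-plan",
    "arn:aws:iam::111122223333:role/EventBridgeStartIncidentRole",
    "arn:aws:sns:us-east-1:111122223333:incident-alerts",
)
print(event_pattern_json)
```

With boto3, these values map onto `events.put_rule(Name=..., EventPattern=event_pattern_json, State="ENABLED")` followed by `events.put_targets(Rule=..., Targets=targets)`, using whatever rule name you choose.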

Figure 4: EventBridge Rule with Event Pattern from Security Hub Findings

3.6 How to test end-to-end with Security Findings

GuardDuty has an option to generate findings automatically. Once you generate the findings, each propagates to Security Hub. From there, the rule with the Event Pattern defined in EventBridge forwards them to two targets:

  1. The AWS Systems Manager Incident Manager Response Plan which contains your runbook as Automation Documents, and
  2. The SNS topic with which the AWS Chatbot is integrated and forwards the message to the Slack Channel.

Alternatively, for more complete simulated findings (cryptomining, Linux and RDP brute force, and so on) against an EC2 instance, visit this GitHub repo. The setup involves creating a bastion host and four EC2 instances. Don’t forget to turn them off after testing so that you don’t incur unnecessary charges.

Conclusion

In this post, I introduced the incident response flywheel as an iterative mechanism that helps teams improve metrics such as MTTC for incidents. You saw how you can take your existing incident runbook, place it as an Automation Document in AWS Systems Manager, and then configure a response plan with this runbook along with escalation plans in Incident Manager. Then, in the third step (with six sub-steps), you saw how you can build out the detection system with an end-to-end flow: enabling GuardDuty, turning on Security Hub, creating an SNS topic, configuring a Slack client and subscribing it to the SNS topic, and creating an EventBridge rule. That rule picks up an event pattern from Security Hub and forwards matching events to the response plan in AWS Systems Manager Incident Manager and to the SNS topic, which triggers a Slack notification via AWS Chatbot.

Get started today by creating your existing runbook as an Automation Document in AWS Systems Manager Incident Manager.

Kelvin Ting

Kelvin Ting is a senior solutions architect at Amazon Web Services (AWS) and collaborates with public sector customers to help them architect, build, scale, and monitor applications to achieve their goals.