AWS for Industries

Improve application resiliency using AWS Incident Detection and Response

Airline applications that handle critical functions such as flight booking, flight tracking, reward programs, baggage tracking, in-flight entertainment, and so on are transforming the way passengers experience air travel. Any disruption to these applications may inconvenience passengers, or worse, lead to loss of revenue and passenger trust. Disruptions that lead to major delays may also result in penalties for airlines.

While migrating airline applications to the cloud can reduce disruptions through improved scalability and enhanced disaster recovery capabilities, managing cloud operations can present challenges. These include a lack of cloud computing skills, integration with legacy systems, outdated incident management protocols, dependency on on-premises infrastructure, and outdated monitoring solutions.

In this blog post, we explore how a major airline onboarded a mission-critical application to AWS Incident Detection and Response (IDR) to improve their cloud operations.

What is AWS Incident Detection and Response?

AWS Incident Detection and Response offers you proactive engagement and incident management for critical workloads. With AWS Incident Detection and Response, AWS Incident Management Engineers (IMEs) monitor your workloads 24x7, detect incidents, and engage you with AWS Support experts who provide guidance toward mitigation and recovery.

AWS Incident Detection and Response achieves these objectives by focusing on four key outcomes:
Improved Observability: It ensures that you have the right observability between the application and infrastructure layers to detect disruptions on your workloads.
Faster Resolution: It ensures that you are engaged early with AWS Incident Managers within five minutes of an alarm trigger and that incidents are managed with pre-defined response plans to accelerate your recovery.
Incident Management for AWS Service Events: It provides you with updates on AWS service events, an assessment of their impact, and guidance on implementing your mitigation plans.
Reducing the Potential for Failure: Beyond accelerating recovery, it also provides a mechanism for continuous improvement by ensuring that lessons captured from previous incidents inform updates to the runbooks, observability, and response plans.

How did IDR improve application resiliency?

As part of their infrastructure modernization initiative, the airline embarked on a multi-year cloud migration journey. One of the applications migrated to the cloud as part of this initiative was their Field Condition Reporting (FICON) application. FICON provides pilots and flight planners with information related to runway conditions. Any impact to the availability of this application, or any delay in its recovery, results in flight delays that directly affect airline passengers and operations.

FICON is a ground-stop application with a near-zero Recovery Time Objective (RTO). As part of the migration, the airline needed support setting up observability for the application in the cloud, fast response to critical incidents, and access to experts who have context on their application to guide their teams through recovery.

To address these needs, the customer decided to onboard the application to AWS Incident Detection and Response. The onboarding process began with a review of the application for reliability and operational excellence. AWS specialists worked with the airline’s application team to identify key performance indicators to enhance observability across the application and infrastructure layers of the system and created Amazon CloudWatch alarms to alert them during an incident. A runbook was also created with a list of application contacts for escalation during critical incidents.

AWS Incident Detection and Response enhanced the FICON application's operational efficiency through improved observability and early incident detection. The five-minute response time was key for the airline's ground-stop applications, considering their strict Recovery Time Objective (RTO) and Recovery Point Objective (RPO). AWS Incident Detection and Response improved the Mean Time To Engage (MTTE) and Mean Time To Restore (MTTR) for critical incidents.
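To make the two metrics concrete, here is a minimal sketch of how MTTE and MTTR could be computed from incident timestamps. The incident records, field names, and times below are hypothetical and are not taken from the airline's data; the sketch also assumes both metrics are measured from the moment the alarm triggers.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed time, in minutes, between two timestamps across incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


# Hypothetical incident log: alarm trigger, responder engagement, service restored.
incidents = [
    {
        "alarm": datetime(2024, 1, 5, 10, 0),
        "engaged": datetime(2024, 1, 5, 10, 2),
        "restored": datetime(2024, 1, 5, 10, 30),
    },
    {
        "alarm": datetime(2024, 2, 9, 14, 0),
        "engaged": datetime(2024, 2, 9, 14, 4),
        "restored": datetime(2024, 2, 9, 14, 50),
    },
]

mtte = mean_minutes(incidents, "alarm", "engaged")   # mean time to engage
mttr = mean_minutes(incidents, "alarm", "restored")  # mean time to restore
print(f"MTTE: {mtte:.1f} min, MTTR: {mttr:.1f} min")
```

Tracking both numbers per workload over time is one way to check whether changes to alarms and runbooks are shortening engagement and recovery.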

In an incident that reflects the improvements in operational excellence, an Amazon CloudWatch alarm in the FICON application was triggered. The alarm tracks the Amazon API Gateway integration latency, which is the time between when API Gateway relays a request to the backend and when it receives a response from the backend. A Support case was automatically created in response to the alarm, and an Incident Manager was engaged within two minutes of the alarm trigger.
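An alarm of this kind could be defined roughly as follows. This is a sketch under stated assumptions, not the airline's actual configuration: the API name, stage, threshold, percentile, and evaluation settings are hypothetical. The dictionary uses the parameter names of the CloudWatch `PutMetricAlarm` API and the `IntegrationLatency` metric that API Gateway publishes in the `AWS/ApiGateway` namespace.

```python
def integration_latency_alarm(api_name, stage, threshold_ms):
    """Build keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": f"{api_name}-{stage}-integration-latency",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "IntegrationLatency",  # reported in milliseconds
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Stage", "Value": stage},
        ],
        "ExtendedStatistic": "p99",   # assumption: alarm on tail latency
        "Period": 60,                 # one-minute datapoints
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,       # three consecutive breaching minutes
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


# Hypothetical API name, stage, and threshold.
params = integration_latency_alarm("ficon-api", "prod", 2000)

# With boto3 installed and credentials configured, the alarm could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the alarm definition as plain data makes it easy to review thresholds during onboarding before anything is created in the account.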

The Incident Manager initiated a conference bridge and facilitated joint triage and incident resolution with the airline and AWS teams. The AWS Lambda support team joined the conference session, reviewed the logs, and identified that AWS Lambda had reached its concurrency limit. The engineer quickly increased the Lambda concurrency limit to resolve the issue. The integrated monitoring and automated response workflow enabled proactive engagement and swift mitigation of the issue. After the incident was resolved, the AWS Incident Manager shared a post-incident report that included the cause of the issue and recommendations to prevent a recurrence. The recommendations included enabling provisioned concurrency and creating a new CloudWatch alarm to monitor the Lambda concurrency limit. The team also recommended changes to the alarm thresholds to improve detection and updated the runbooks accordingly.
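The recommended concurrency alarm could look something like the sketch below, which alerts when a function's concurrent executions approach its limit. The function name, the limit of 1,000, and the 80 percent alert level are hypothetical values chosen for illustration; the metric name and namespace are the ones Lambda publishes to CloudWatch.

```python
def lambda_concurrency_alarm(function_name, concurrency_limit, alert_percent=80):
    """Build PutMetricAlarm arguments that fire before the concurrency limit is hit."""
    if not 0 < alert_percent <= 100:
        raise ValueError("alert_percent must be in the range (0, 100]")
    return {
        "AlarmName": f"{function_name}-concurrency-near-limit",
        "Namespace": "AWS/Lambda",
        "MetricName": "ConcurrentExecutions",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        # Alert at a fraction of the limit so responders have headroom to act.
        "Threshold": concurrency_limit * alert_percent / 100,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }


# Hypothetical function name and limit.
concurrency_alarm = lambda_concurrency_alarm("ficon-handler", concurrency_limit=1000)
print(concurrency_alarm["Threshold"])
```

Alarming below the limit, rather than at it, is the design choice here: the goal is to detect the condition before requests start being throttled.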

Incident Detection and Response under the hood

As shown below, setting up integration with AWS Incident Detection and Response requires no change to your existing architecture. You can set up the integration by provisioning access for the AWS Health service-linked role to ingest alarms from your Application Performance Monitoring (APM) tools, such as Amazon CloudWatch, Datadog, or New Relic.

In the event of an alarm, AWS Incident Detection and Response automation ingests the alarm via Amazon EventBridge and creates a Support case in your account for correspondence with an AWS Incident Manager. Updates are also made to the AWS Personal Health Dashboard in your accounts to notify you about AWS service events. AWS Incident Detection and Response supports event ingestion either directly from third-party APMs or via a webhook. Consult the Getting Started section of the AWS Incident Detection and Response User Guide for more details on setting up your workloads on AWS Incident Detection and Response.

Integrating with AWS Incident Detection and Response
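To illustrate the alarm-ingestion step, the sketch below shows the general shape of a CloudWatch alarm state change event as delivered to Amazon EventBridge, together with an event pattern that selects alarms entering the `ALARM` state. This is not the rule that AWS Incident Detection and Response uses internally, which the source does not describe. The `matches` helper is a simplified, hypothetical stand-in for EventBridge's matching that handles only exact values and nested objects, and the alarm name is made up.

```python
import json

# Event pattern selecting CloudWatch alarms that transition into ALARM.
ALARM_STATE_CHANGE_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def matches(pattern, event):
    """Simplified pattern matching: lists of allowed values and nested objects only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Trimmed, hypothetical example of an alarm state change event.
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "example-integration-latency",
        "state": {"value": "ALARM"},
    },
}

print(json.dumps(ALARM_STATE_CHANGE_PATTERN, indent=2))
print(matches(ALARM_STATE_CHANGE_PATTERN, sample_event))
```

Because the pattern filters on the alarm's new state, recoveries back to `OK` would not match it.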

Conclusion

AWS Incident Detection and Response can improve the incident management processes for most critical applications across domains such as ticketing, baggage handling, flight operations, airport operations, and crew management. Applications with strict RPO and RTO requirements will benefit from the proactive engagement. The timely identification and remediation of problems affecting mission-critical systems will help minimize disruptions to operations and impacts on customers.

To learn more, visit the AWS Incident Detection and Response User Guide or contact your AWS account representative.

Naseer Sayyad

Naseer Sayyad is a Senior Technical Account Manager at Amazon Web Services. Naseer partners with AWS enterprise customers and helps them to be successful in their cloud transformation journey. He is passionate about cloud computing and automation. Outside work, he enjoys travel and photography.

Neel Sendas

Neel Sendas is a Principal Technical Account Manager at Amazon Web Services. Neel works with enterprise customers to design, deploy, and scale cloud applications to achieve their business goals. He is also an ML enthusiast and has worked on various ML use cases for the manufacturing and logistics industries. When he is not helping customers, he dabbles in golf and salsa dancing.

Temitope Baiyewu

Temitope Baiyewu is a Senior Product Manager at Amazon Web Services. Temi leads the product development for AWS Incident Detection and Response and is passionate about helping customers operate their critical workloads more efficiently on AWS. Temi loves to read and is a diehard fan of Chelsea FC, a true blue at heart.