AWS Cloud Operations Blog
How Cigna Implemented a Multi-Region Centralized Alerting System on AWS
This post is co-written with Nicolas Trettel, Cloud Engineering Senior Advisor at Cigna.
Monitoring applications and alerting on issues is crucial for building resilient systems. Amazon CloudWatch is a service that monitors applications, responds to performance changes, optimizes resource use, and provides insights into operational health. By collecting data across AWS resources, CloudWatch gives visibility into system-wide performance and allows users to set alarms, automatically react to changes, and gain a unified view of operational health.
Large enterprises often implement a multi-account strategy to host their applications on AWS. With multiple applications running across these accounts, aggregating alarms for a centralized view becomes necessary. These alarms are typically funneled into a centralized monitoring and alerting system, hosted either on-premises or in the cloud. This setup requires configuring networking, authentication, and a standard message format to integrate with the target system.
Instead of having individual accounts establish their own protocols for this integration, a centralized alerting framework can provide a more scalable alternative. It’s also essential for the monitoring solution to be highly available, ensuring continuity of operations even in the event of disruptions to the primary AWS Region.
In this post, you’ll learn how the Cigna Cloud Center of Enablement (CCoE) team (2023 Overall Winner of the IDC Future Enterprise Best in Future of Digital Infrastructure North America Awards) implemented Alarm Funnel – a resilient, multi-Region centralized alerting framework for applications deployed on AWS. Alarm Funnel provides a scalable solution to the challenge of integrating with a target monitoring and observability system.
Cigna and AWS
As part of its larger strategy, Cigna leverages AWS among other cloud services to enhance its infrastructure. AWS is integrated into a broader stack, enabling Cigna to benefit from a wide range of tools and services that support its comprehensive monitoring and alerting system. This multi-faceted approach ensures robust and scalable solutions, aligning with Cigna’s commitment to operational excellence and resilience.
About Alarm Funnel
The Alarm Funnel provides a standard centralized alerting framework for applications deployed on AWS at Cigna. At its core, the Alarm Funnel is an Amazon Simple Notification Service (Amazon SNS) topic in a centralized AWS account that consumers use as a CloudWatch Alarm Action. An SNS topic in each supported region triggers a downstream process that forwards these alarms to Cigna’s centralized monitoring and observability system. Alerts are formatted according to a standard naming convention.
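To illustrate the consumer side, the sketch below builds a CloudWatch alarm whose alarm action points at the central Alarm Funnel SNS topic. The topic ARN, alarm names, and metric choice are hypothetical placeholders, not Cigna's actual convention:

```python
# Hypothetical sketch: a consumer account registers a CloudWatch alarm
# whose AlarmActions entry is the central Alarm Funnel SNS topic.
FUNNEL_TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:alarm-funnel"  # placeholder ARN


def alarm_definition(alarm_name: str, topic_arn: str = FUNNEL_TOPIC_ARN) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The metric and thresholds here are illustrative only.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Funnel the alarm into the central topic instead of a per-account target.
        "AlarmActions": [topic_arn],
    }


def create_alarm(alarm_name: str) -> None:
    import boto3  # imported lazily so the module loads without the AWS SDK

    boto3.client("cloudwatch").put_metric_alarm(**alarm_definition(alarm_name))
```

Because the alarm action is just an ARN, consuming accounts need no networking or authentication setup of their own beyond permission to publish to the central topic.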
Architecture
The Alarm Funnel architecture leverages several AWS services, including AWS Lambda, AWS Step Functions, and Amazon DynamoDB, as shown in the diagram below.
- CloudWatch Alarm Notification: A CloudWatch alarm in an application team’s AWS account is configured to notify an Amazon Simple Notification Service (SNS) topic in a central or management account.
- Alarm Ingestion: When the alarm is triggered, the alarm information is passed into an “Entrypoint” AWS Lambda function in each of the destination regions (currently us-east-1 and us-east-2).
- Regional Step Functions: The “Entrypoint” Lambdas each trigger an AWS Step Functions state machine in their respective regions.
- Distributed Locking Mechanism:
  - Each state machine attempts to obtain a lock from an Amazon DynamoDB table in the primary region (us-east-1) and must acquire it before processing the alarm message.
  - The secondary region (us-east-2) waits a few seconds before attempting to obtain the lock, ensuring the primary region has a chance to do so first.
  - The secondary region will only obtain the lock and process the message if the primary region is experiencing issues.
  - Both regions use the DynamoDB table in the primary region for strongly consistent reads.
  - If the DynamoDB table in the primary region is degraded, the secondary region’s table is used. The global nature of the DynamoDB table preserves the status and locks regardless of the region used.
  - The region that does not obtain the lock will take over and process the message if the region holding the lock fails.
- Monitoring System Integration: The Lambdas within the winning state machine then send the alarm message to the downstream monitoring and observability system.
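The ingestion step above can be sketched as an “Entrypoint” Lambda handler that unpacks the SNS envelope and starts the regional state machine. The state machine ARN, environment variable name, and payload shape are assumptions for illustration:

```python
# Hypothetical sketch of the "Entrypoint" Lambda: unpack the SNS event and
# start the regional Step Functions state machine with the alarm payload.
import json
import os

STATE_MACHINE_ARN = os.environ.get(
    "STATE_MACHINE_ARN",
    "arn:aws:states:us-east-1:111122223333:stateMachine:alarm-funnel",  # placeholder
)


def execution_input(event: dict) -> dict:
    """Extract the fields the state machine needs from an SNS-delivered event."""
    record = event["Records"][0]["Sns"]
    return {
        "messageId": record["MessageId"],  # later used as the lock's partition key
        "message": json.loads(record["Message"]),  # the CloudWatch alarm payload
    }


def lambda_handler(event: dict, context) -> None:
    import boto3  # imported lazily so the module loads without the AWS SDK

    boto3.client("stepfunctions").start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps(execution_input(event)),
    )
```

Because every destination region runs the same handler, both regional state machines start for each alarm; the locking mechanism described next decides which one actually processes it.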
Multi-Region Challenges
Since the Alarm Funnel is the primary alerting mechanism, it must remain functional even when a consuming application is affected by regional service disruptions. For example, if the Alarm Funnel runs solely in us-east-1 and that region experiences a large-scale service event, alerting for unaffected regions would still be degraded as their alarms are processed in the affected region. Relying on other resources in the impacted region, such as Lambda, for an active/passive failover is also not feasible, as those resources may be unavailable. Additionally, requiring all consumers to update their alarm actions to point to an SNS topic in a different region would be overly burdensome. Therefore, the Alarm Funnel must be designed as a multi-Region active/active solution. However, as a message delivery system, the Alarm Funnel must orchestrate the logic to ensure each alert is processed only once.
How it works
The Alarm Funnel’s multi-Region active/active design relies on a concurrency lock mechanism. When an alarm is triggered, all configured regions are invoked simultaneously, but only one region can obtain the lock and process the message.
The process begins with an SNS topic in each supported source region (us-east-1, us-east-2, eu-west-2, ap-southeast-1) invoking an “Entrypoint” Lambda in all destination regions (us-east-1 and us-east-2). The “Entrypoint” Lambdas then trigger a Step Functions state machine, which first attempts to obtain a lock from a DynamoDB table. If the lock is successfully obtained, the message is processed and forwarded to the downstream monitoring systems.
DynamoDB Locking Table
The lock is represented as a DynamoDB table item, with the message ID as the partition key. This allows for concurrency control at the message level, rather than the entire service. The lock item also stores metadata, such as the timestamp of when the lock was obtained, the region that processed the message, and the status of the processing.
Several considerations are made when using DynamoDB for this locking mechanism:
- The table must be a global table, as a regional table would be susceptible to failures in that specific region.
- DynamoDB global tables use a “last writer wins” method to reconcile between concurrent updates. This means that a successful PutItem operation does not guarantee the lock has been obtained. Each destination region must perform a Strongly Consistent Read to validate the lock item and ensure they are the processor.
- After processing the message, the processor updates the lock item’s status. The region that did not obtain the lock periodically checks the status and, if it indicates a failure or the lock has expired, overwrites the lock and processes the message itself.
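The conditional write plus strongly consistent read-back can be sketched as follows. The field names mirror the post; the helper shape, TTL, and exception handling are assumptions (a real implementation would catch botocore's ConditionalCheckFailedException specifically):

```python
# Sketch of the lock acquisition: a conditional PutItem followed by a
# strongly consistent read to confirm ownership, since "last writer wins"
# replication means a successful write does not prove the lock was won.
from datetime import datetime, timedelta, timezone


class LockTaken(Exception):
    """Raised when another region holds the lock for this message."""


def try_obtain_lock(table, message_id: str, region: str, ttl_seconds: int = 60) -> dict:
    expiry = (datetime.now(timezone.utc) + timedelta(seconds=ttl_seconds)).isoformat()
    item = {
        "MessageId": message_id,
        "Region": region,
        "Status": "processing",
        "Expiry": expiry,
    }
    try:
        # Create the lock only if no item exists yet for this message ID.
        table.put_item(Item=item, ConditionExpression="attribute_not_exists(MessageId)")
    except Exception:
        # In real use, re-raise anything other than a failed conditional
        # check; a failed check simply means another region wrote first.
        pass
    # Read the item back with a strongly consistent read to see who owns it.
    stored = table.get_item(Key={"MessageId": message_id}, ConsistentRead=True)["Item"]
    if stored["Region"] != region:
        raise LockTaken(f"lock held by {stored['Region']}")
    return stored
```

The read-back is the crucial step: both regions may believe their PutItem succeeded, but only the one whose Region value survives the strongly consistent read proceeds.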
Due to the lack of strongly consistent reads across regions in a DynamoDB global table, each destination region must use the table in the primary region (us-east-1) and fail over to the secondary region (us-east-2) only if the primary is unavailable.
The failover mechanism is straightforward and integrated into the logic. If the Lambdas are unable to reach the DynamoDB table in the primary us-east-1 region after several retries, they will attempt to obtain the lock from the secondary us-east-2 region’s DynamoDB table instead.
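That primary-then-secondary logic might look like the sketch below. The table handles are passed in as opaque objects and the retry count is an assumption:

```python
# Sketch of the DynamoDB region failover: retry the primary region's table
# a few times, then fall back to the secondary region's table.
def with_failover(operation, primary_table, secondary_table, attempts: int = 3):
    """Run operation(table) against the primary table, falling back to the
    secondary after repeated failures. Returns (result, failed_over); the
    Step Function can keep the flag in its payload so later DynamoDB calls
    skip the impaired primary region entirely."""
    for _ in range(attempts):
        try:
            return operation(primary_table), False
        except Exception:
            # In real use, catch botocore ClientError / connection errors.
            continue
    return operation(secondary_table), True
```

Returning the failover flag alongside the result is what lets subsequent steps avoid pointlessly retrying the degraded region.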
Step Function Architecture
1. The “Entrypoint” Lambda invokes the Step Functions state machine.
2. The state machine in the secondary region waits a few seconds before continuing, ensuring the primary region has a chance to obtain the lock and process the message first; the secondary region should only take over if the primary region is experiencing issues. The primary region’s wait step is set to zero seconds.
3. The state machine invokes a “Lock Obtainer” Lambda, which creates a lock item in the DynamoDB table with the following fields. A condition expression ensures the lock is created only if one does not already exist.
    - MessageId – the partition key (value is the SNS message’s ID).
    - Expiry – the time at which the lock is considered “expired” and the other region should take over. Note that the handler Lambdas will time out before this value.
    - Region – used to validate that the lock was successfully obtained by the state machine’s region.
    - Status – used to track the status of the message processing.
4. The “Lock Obtainer” Lambda then uses a strongly consistent read to fetch the lock item it just wrote and validates its Region, Status, and Expiry fields as follows:
    - The Region field equals the state machine’s region: the branch continues processing the message.
    - The Region field differs and the Status field is “failed”: the other region failed to process the message. The “Lock Obtainer” Lambda overwrites the lock item’s Region field with its own region, and the branch continues processing the message.
    - The Region field differs and the Expiry field is in the past: the other region timed out while processing the message. The “Lock Obtainer” Lambda overwrites the lock item’s Region field with its own region, and the branch continues processing the message.
    - The Region field differs, the Status field is not “failed,” and the Expiry field is not in the past: the other region is still processing. The branch enters a wait loop and periodically restarts the “Lock Obtainer” process (step 4).
    - The Region field differs and the Status field is “complete”: the other region successfully processed the message. The branch skips message processing.
5. If the message should be processed, the state machine invokes a “handler” Lambda to send the message to the monitoring and observability systems, updating the lock item’s status accordingly.
6. If the Lambdas are unable to reach DynamoDB in the primary region, they fail over to the secondary region’s DynamoDB table and return a “failover” value in their payload. This value persists throughout the state machine execution so that subsequent DynamoDB calls (e.g., to update the status or during the check/wait loop) do not attempt to reach the impacted region.
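The validation branches above reduce to a small pure function. The return values (“process”, “wait”, “skip”) are illustrative names rather than actual state machine states, and the overwrite of the lock item’s Region field in the takeover branches is omitted here:

```python
# Sketch of the lock-validation decision table: given the lock item read
# back from DynamoDB, decide whether this region processes, waits, or skips.
from datetime import datetime, timezone


def lock_decision(lock: dict, my_region: str, now=None) -> str:
    """Return 'process', 'wait', or 'skip' for the branches described above."""
    now = now or datetime.now(timezone.utc)
    if lock["Region"] == my_region:
        return "process"  # this region owns the lock
    if lock["Status"] == "complete":
        return "skip"  # the other region already succeeded
    if lock["Status"] == "failed":
        return "process"  # the other region failed; take over
    if datetime.fromisoformat(lock["Expiry"]) < now:
        return "process"  # the other region's lock expired; take over
    return "wait"  # the other region is still processing; re-check later
```

Checking the “complete” status before the expiry time matters: a message that finished processing should be skipped even if its lock item has since expired.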
This design ensures the Alarm Funnel remains functional even when a specific AWS Region is impacted, providing a reliable and resilient alerting mechanism.
Conclusion
The Alarm Funnel solution implemented by the Cigna CCoE team showcases how a centralized, multi-Region alerting framework can effectively address the challenges of monitoring and alerting for large enterprises with applications deployed across multiple AWS accounts. By providing a scalable alternative to individual account-level integrations and leveraging a multi-Region active/active design, the Alarm Funnel ensures continuity of operations even in the event of regional service disruptions. The innovative use of DynamoDB as a distributed locking mechanism, coupled with the orchestration logic in AWS Step Functions, enables the Alarm Funnel to process each alert exactly once across multiple regions. This resilient architecture provides Cigna with an efficient way to funnel CloudWatch alarms into their centralized monitoring and observability system, enhancing the overall reliability of their AWS-hosted applications.