AWS Partner Network (APN) Blog

Using AWS Health to Drive Operational Excellence with Ensono’s Health Event Centralization Platform

By Chris Mills, Sr. Partner Technical Account Manager – AWS
By Michael Hogg, Expert AWS DevOps Engineer – Ensono


If you’re using Amazon Web Services (AWS) to run your applications, you may be familiar with AWS Health Events, which are notifications published to the AWS Health Dashboard and sent via email to account contacts.

These indicate a range of health status changes affecting your resources, including service deprecations, maintenance events, degraded Amazon Elastic Compute Cloud (Amazon EC2) hardware, policy updates, misconfigurations, resiliency notices, and software updates.

These notifications are crucial when managing AWS environments. At enterprise scale or for managed service providers (MSPs), however, significant AWS service usage across thousands of accounts translates into large quantities of notices, making it difficult to keep track of the different deadlines and remediation workflows.

Ensono’s AWS Health Event Centralization Platform captures, filters, prioritizes, notifies, and tracks the remediation of AWS Health notifications across its customer base, ensuring critical managed infrastructure and applications remain secure and available.

In this post, we will explore some of the challenges customers face when using the AWS Health Dashboard at scale, and how Ensono, an AWS Specialization Partner and AWS Marketplace Seller, has solved them by developing its platform. Ensono is also a member of the AWS MSP and Well-Architected Partner Programs.

Customer Challenges

The AWS Health Dashboard provides a consolidated view of the availability and operational status of customers’ AWS resources, including resource issues, upcoming changes, and important notifications. The dashboard, accessed via the AWS Management Console, enables customers to be notified of events or incidents that can impact the secure and stable running of their resources.

The AWS Health Dashboard provides integration and automation capabilities via Amazon EventBridge and AWS Lambda, and the notices detail remediation and/or troubleshooting steps necessary to mitigate the risks.

AWS also makes available the AWS CloudFormation template for AWS Health Aware (AHA), which can be deployed to provide deduplication and more targeted notifications to technical stakeholders, via Slack, Teams, Chime, and email.

As it stands today, however, AWS Health does not help enterprise-scale businesses or MSPs visualize and manage notifications across many AWS Organizations and customers. There are limitations, particularly when considering the end-to-end operational workflow required from notification to completion, and the common need for ITIL compliance throughout.

The AWS Health Dashboard and AHA have the following limitations:

  • Single AWS Organization support.
  • AHA is only available to customers with AWS Business or Enterprise Support.
  • No stakeholder mapping.
  • No remediation workflow, tracking, or logging.
  • Filtering on event categories only.

For example, one of Ensono’s customers has over 250 AWS accounts across eight AWS Organizations, and is a heavy user of Amazon OpenSearch Service. This posed significant challenges for tracking mandatory software updates, including:

  • Manual process to audit notification events (logging in to multiple AWS consoles or trawling through emails in a group mailbox).
  • Manual ticket raising, which takes time to look up and correlate stakeholders, severity levels, service-level agreements (SLAs), and remediation documentation, and to tag configuration items (CIs) for requests for change (RFCs).
  • Insignificant events (low priority and known risks) added to the workload. In addition, service health (operational) events were already covered by a monitoring platform.

The manual, time-consuming, and repetitive nature of this process introduced delays, risk, and the potential for human error.

AWS Health Events Centralization Platform

Ensono developed the AWS Health Events Centralization Platform to ensure their customers don’t miss critical events impacting their AWS infrastructure.

A key benefit of hosting applications on AWS infrastructure is that AWS does not let your services get out of date, and lets you know when your configuration needs updating. But how do you keep track of the (potentially) hundreds of notices every month, ensure your technical teams are informed and able to take action, and track issues through to resolution on time? Ensono’s AWS Health Event Centralization Platform solves this challenge.

Ensono’s solution captures all events, for all services, across all AWS Organizations and accounts for their customers, filters based on type, prioritizes based on criticality and deadlines, and then ingests into their ITIL IT service management (ITSM) system.

Using their comprehensive configuration management database (CMDB), the ITSM system assesses the Ensono service level covering the customer, account, or service, enriches the notice with relevant stakeholder information, and triggers a remediation workflow.

The primary business benefits of Ensono’s solution are:

  • Ensures all health events are captured across all services, AWS Organizations, and linked accounts, consolidating them for ingestion to the ITSM system.
  • Filters notices to reduce noise from informational and known risks, and prioritizes based on criticality and key dates. This removes the maximum amount of risk in the shortest time period.
  • Leverages comprehensive AWS tagging, discovery process, and Ensono’s CMDB to identify technical and business stakeholders.
  • ITIL risk and change workflow management, ensuring health events (incidents) are logged, categorized, prioritized, owned, and audited to resolution.
  • Automated processes increase speed, reduce staffing costs, and remove potential for human error.

How it Works

Ensono’s AWS Health Event Centralization Platform is an event-driven serverless architecture which aggregates and filters AWS Health Events from multiple AWS Organizations and accounts, using native AWS services.

The AWS services used in this solution include Amazon EventBridge, AWS Lambda, Amazon Simple Queue Service (SQS), AWS Secrets Manager, Amazon DynamoDB, and Amazon CloudWatch.


Figure 1 – High-level solution architecture.

When AWS posts a health event, it’s sent to the default Amazon EventBridge bus (1). The first part of the architecture comprises an EventBridge rule (2), which subscribes to any events on the default bus that have a source criterion of “aws.health”. This rule is created on every AWS account that needs to be monitored and always targets the same custom EventBridge bus located in a single Ensono-owned AWS account (3).
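As a rough illustration of the rule described above, the following Python sketch builds the arguments such a forwarding rule might use. The bus ARN, role ARN, account IDs, and rule name are hypothetical placeholders, not Ensono’s actual values; the dictionaries follow the shape of the EventBridge PutRule and PutTargets API parameters.

```python
import json

# Hypothetical placeholders; the post does not publish the real values.
CENTRAL_BUS_ARN = "arn:aws:events:us-east-1:111122223333:event-bus/central-health-bus"
FORWARDER_ROLE_ARN = "arn:aws:iam::444455556666:role/health-event-forwarder"


def build_forwarding_rule(rule_name="forward-aws-health"):
    """Return PutRule-style arguments for a rule on the default bus
    that matches every event whose source is aws.health."""
    return {
        "Name": rule_name,
        "EventBusName": "default",
        "EventPattern": json.dumps({"source": ["aws.health"]}),
        "State": "ENABLED",
    }


def build_forwarding_target(rule_name="forward-aws-health"):
    """Return PutTargets-style arguments that send matched events to the
    central custom bus, using an IAM role in the contributing account."""
    return {
        "Rule": rule_name,
        "EventBusName": "default",
        "Targets": [
            {
                "Id": "central-bus",
                "Arn": CENTRAL_BUS_ARN,
                "RoleArn": FORWARDER_ROLE_ARN,
            }
        ],
    }
```

With boto3, these dictionaries could be passed as `events.put_rule(**build_forwarding_rule())` and `events.put_targets(**build_forwarding_target())`, where `events` is an EventBridge client in each contributing account.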

The EventBridge rule leverages an AWS Identity and Access Management (IAM) role that allows each contributing AWS account to authenticate to the centralized Ensono EventBridge bus.

To further enhance security, the centralized Ensono EventBridge bus is protected by an IAM resource policy (4) which only accepts events from AWS accounts that belong to known authorized AWS Organizations.
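A resource policy of this kind is commonly written with the `aws:PrincipalOrgID` condition key. The sketch below shows one possible shape, assuming that approach; the statement ID, function name, and the organization IDs in the usage note are hypothetical, and the post does not show Ensono’s actual policy.

```python
import json


def build_bus_policy(bus_arn, org_ids):
    """Return a resource policy that allows events:PutEvents on the bus
    only for principals belonging to one of the given AWS Organizations."""
    if not org_ids:
        raise ValueError("at least one organization ID is required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowPutEventsFromKnownOrgs",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "events:PutEvents",
                "Resource": bus_arn,
                # Multiple values under one condition key are OR-ed together.
                "Condition": {
                    "StringEquals": {"aws:PrincipalOrgID": sorted(org_ids)}
                },
            }
        ],
    }
```

For example, `json.dumps(build_bus_policy(bus_arn, ["o-aaaaaaaaaa", "o-bbbbbbbbbb"]))` produces a policy document that could be supplied as the `Policy` argument of the EventBridge PutPermission API. Onboarding a new AWS Organization then only requires adding its ID to the list.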

An EventBridge rule located on the custom Ensono EventBridge bus is used to further filter events by region, service, and event type category (issue, accountNotification, and scheduledChange). The target for this rule is an SQS queue (5), which acts as a shock absorber for a downstream, asynchronous Lambda function.
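To make the filtering step concrete, here is a minimal sketch of what such an event pattern might look like, together with a deliberately simplified local matcher for experimenting with it. The region and service values are illustrative assumptions, and `matches` only handles exact-value lists and nested objects; it is not a reimplementation of EventBridge’s full pattern syntax.

```python
def build_filter_pattern(regions, services):
    """Return an EventBridge-style event pattern that filters AWS Health
    events by region, service, and event type category."""
    return {
        "source": ["aws.health"],
        "region": list(regions),
        "detail": {
            "service": list(services),
            "eventTypeCategory": [
                "issue",
                "accountNotification",
                "scheduledChange",
            ],
        },
    }


def matches(pattern, event):
    """Simplified matcher: every key in the pattern must be present in the
    event, and its value must equal one of the listed values."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Illustrative values only.
pattern = build_filter_pattern(["eu-west-1", "us-east-1"], ["EC2", "RDS"])
sample_event = {
    "source": "aws.health",
    "region": "eu-west-1",
    "detail": {"service": "EC2", "eventTypeCategory": "scheduledChange"},
}
print(matches(pattern, sample_event))
```

Filtering at the bus keeps uninteresting events out of the queue entirely, so the Lambda function is only invoked for regions, services, and categories that might need a ticket.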

The Lambda function will process events taken from the SQS queue (6A) and perform the following actions:

  • Extract the event type from the event payload (for example, AWS_S3_MAINTENANCE_SCHEDULED).
  • Query a DynamoDB (7A) table that contains an “allowlist” of event types of interest.
  • If the event type is not present in the allowlist, the event is ignored and logged as such.
  • If the event type is present in the allowlist, the ticket severity associated with the event type is returned from the database.
  • The Lambda function subsequently converts the JSON payload of the AWS Health Event into a format that’s compatible with Ensono’s ITSM system.
  • The ITSM system is fronted by a secured webhook, so the Lambda function acquires credentials from AWS Secrets Manager (8).
  • The Lambda function then HTTP POSTs the JSON event payload to the ITSM webhook, which returns an HTTP 200 response code on success (9).
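The steps above can be sketched as a small, dependency-injected processing function. This is an illustration under stated assumptions rather than Ensono’s code: the ITSM payload field names, the function names, and the choice to raise on a non-200 response are hypothetical. The allowlist lookup and webhook call are passed in as plain callables so the flow can be exercised without AWS access; in a real deployment they would wrap a DynamoDB read and an authenticated HTTP POST.

```python
import json


def extract_health_events(sqs_event):
    """Yield the EventBridge event carried in the body of each SQS record."""
    for record in sqs_event.get("Records", []):
        yield json.loads(record["body"])


def to_itsm_payload(health_event, severity):
    """Convert an AWS Health event into a hypothetical ITSM ticket payload."""
    detail = health_event["detail"]
    descriptions = detail.get("eventDescription") or [{}]
    return {
        "account_id": health_event.get("account"),
        "region": health_event.get("region"),
        "event_type": detail["eventTypeCode"],
        "category": detail.get("eventTypeCategory"),
        "severity": severity,
        "description": descriptions[0].get("latestDescription", ""),
        "affected_entities": [
            entity.get("entityValue")
            for entity in detail.get("affectedEntities", [])
        ],
    }


def process(sqs_event, lookup_severity, post_ticket):
    """Process a batch of queued AWS Health events.

    lookup_severity(event_type) returns a severity, or None if the event
    type is not in the allowlist. post_ticket(payload) returns the HTTP
    status code from the ITSM webhook.
    """
    results = []
    for health_event in extract_health_events(sqs_event):
        event_type = health_event["detail"]["eventTypeCode"]
        severity = lookup_severity(event_type)
        if severity is None:
            print(f"Ignoring {event_type}: not in allowlist")
            results.append((event_type, "ignored"))
            continue
        status = post_ticket(to_itsm_payload(health_event, severity))
        if status != 200:
            # Assumption: failing the invocation lets SQS retry the message
            # and eventually move it to the dead letter queue.
            raise RuntimeError(f"ITSM webhook returned HTTP {status}")
        results.append((event_type, "posted"))
    return results
```

A Lambda handler would then reduce to a single call such as `process(event, lookup_severity, post_ticket)`, with the two callables constructed once outside the handler so that clients and credentials are reused across invocations.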

Once the allow-listed event has made its way into Ensono’s ITSM system, an incident is raised. Depending on the event severity code, the incident receives an appropriate response time SLA. The ITSM system links the incident to the correct customer by inspecting the AWS account ID and any affected entities (in the event payload) whilst cross-referencing the objects against Ensono’s CMDB.

Remediation process knowledge bases are automatically linked to the ticket by inspecting the event’s description field; this ensures the incident is triaged correctly.

Customers have the option to introduce custom resolution times (fix SLAs) depending on their contract. Customers can also observe tickets and scheduled changes via Ensono’s proprietary Envision portal.

It’s important to monitor the health of the filtering system independently; to do this a dead letter queue (DLQ) is used to store events that have not been picked up or processed successfully by the Lambda function. A CloudWatch alarm has been configured to fire if the queue depth of the DLQ is greater than zero.
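A DLQ alarm of this kind might be defined along the following lines. The sketch builds arguments in the shape of the CloudWatch PutMetricAlarm API; the alarm name, period, statistic, missing-data behavior, and notification topic are assumptions, since the post only states that the alarm fires when the queue depth is greater than zero.

```python
def build_dlq_alarm(queue_name, alarm_topic_arn):
    """Return PutMetricAlarm-style arguments for an alarm that fires when
    the dead letter queue holds at least one visible message."""
    return {
        "AlarmName": f"{queue_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",           # assumed: any message in the period counts
        "Period": 60,                     # assumed: evaluate every minute
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumed: no data means an empty queue
        "AlarmActions": [alarm_topic_arn],
    }
```

With boto3, this could be applied as `cloudwatch.put_metric_alarm(**build_dlq_alarm(queue_name, topic_arn))`. Alarming on any message at all is strict by design: a single event in the DLQ is one that was never turned into a ticket.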

Meanwhile, AWS Config Rules are used to ensure the IAM role and EventBridge rule remain in place on the contributing AWS accounts.

The primary technical benefits of Ensono’s solution are:

  • Event-driven rather than poll-based (no cron schedule), which makes it more efficient, with faster notifications and no API throttling.
  • Serverless and asynchronous.
  • Advanced filtering of event types.
  • Works across multiple AWS Organizations.
  • Works with any AWS Support level.
  • Resource-level tracking (CMDB configuration items for request for change scope).
  • Links affected resources with their technical owners.
  • Severity of each event type can be independently defined.
  • Decoupled, scalable, and extensible design that allows for future enhancements.

Results Achieved

As a managed service provider, Ensono oversees the management of an extensive portfolio of AWS accounts on behalf of its customers. These accounts play a pivotal role in daily operations, necessitating consistent monitoring and reviews to ensure the ongoing health and compliance of customers’ AWS environments.

The previous approach of reviewing each account individually presented formidable challenges for Ensono, principally the risk of human error. Now, as AWS events occur, Ensono’s integrated tool automatically detects each event, assesses its severity, and raises a corresponding ticket in the ticketing system.

The automated and integrated nature of this solution ensures that every alert, no matter how minor or critical, is properly captured, documented, and addressed. This integration has sped up Ensono’s response times and provides clear and efficient communication between teams and customers.

“It is akin to having a dedicated expert who can instantly identify and categorize events, eliminating manual intervention and guaranteeing that every alert is addressed promptly and according to its importance,” says Manan Kapadia, Ensono’s Director of Public Cloud.

The operations team at Ensono has reported the following results:

  • Efficiency: Ensono’s account review process has become more efficient, with lightened administrative workload.
  • Accuracy: Automation has significantly boosted the accuracy of account reviews; the system is designed to leave no room for critical details or alerts to slip through the cracks, and the precision has elevated the quality of account reviews.
  • Consistency: The solution engineers consistency into account reviews, using predefined rules and criteria established by Ensono, enabling alignment to customer-agreed standards.
  • Timeliness: Alerts and updates are managed with proportionate urgency, ensuring swift response to issues or anomalies in customer accounts.
  • Cost savings: Reducing administrative overhead, improving accuracy, and streamlining operations allows more time for proactive initiatives and improvements.

Summary

AWS Health Events are a crucial mechanism for the efficient management of AWS infrastructure, but at enterprise or MSP scale they can translate into a large volume of notices to manage. Enterprises also commonly need robust risk and change management workflows, which are not built into AWS Health today.

To ensure customers don’t miss critical notices or key dates, and that their applications remain available, secure, performant, and up to date, Ensono developed the AWS Health Events Centralization Platform, an event-driven and serverless architecture built on AWS. Ensono provides this solution exclusively to their AWS managed services customers.



Ensono – AWS Partner Spotlight

Ensono is an AWS Partner and MSP that supports large enterprises looking to transform their traditional environments with AWS.

Contact Ensono | Partner Overview | AWS Marketplace