Multi-tenant monitoring across accounts and regions using Amazon Managed Service for Prometheus

In this guest blog post, Nauman Noor (Managing Director), Fabio Dias (Cloud Developer), and Dylan Alibay (Cloud Developer) from the platform engineering team at State Street discuss their use of Amazon Managed Service for Prometheus and AWS Distro for OpenTelemetry to enable monitoring in a multi-tenant, multi-account, and multi-region environment.

In the ever-evolving financial services landscape, State Street Corporation is a world leader in investment services, investment management, and investment research and trading for institutional investors.

State Street operates in a complex AWS environment that includes multiple tenants, regions, and accounts, which makes observability a challenge.

In this blog post, we will cover how State Street used AWS Observability Services to aggregate their observability data into a single pane of glass by leveraging AWS Distro for OpenTelemetry (ADOT) and Amazon Managed Service for Prometheus.

Solution overview

When evaluating observability solutions, State Street wanted a service with low management overhead that would not require a large upfront investment.

The high-level architecture for the solution is illustrated in Figure 1.

State Street ran AWS Distro for OpenTelemetry (ADOT) as scalable tasks on Amazon Elastic Container Service (Amazon ECS) to collect and process metrics. Amazon Managed Service for Prometheus (AMP) provided data persistence, and Grafana dashboards enabled data visualization and analysis. In short, the key components were ECS for running tasks, ADOT for metric collection and processing, AMP for data storage, and Grafana for visualization.

Figure 1 – High level end-to-end architecture of the solution

State Street collected performance metrics from the resources across their monitored environments, including their on-premises data centers and multiple AWS accounts spanning several regions. They aggregated this monitoring data by funneling it into a Central Account. Centralizing the data in this way improved reliability and security while also simplifying analysis across the different environments being monitored.

The details of this workflow are divided into the following sections: Metric Collection, Aggregation, and Presentation.

Metric Collection
State Street leveraged a combination of AWS native metrics solutions and code instrumentation to generate the metrics. The metric sources fall into three main categories:

1. Native AWS service metrics:

– Collected from Amazon CloudWatch using YACE (Yet-Another-CloudWatch-Exporter)

– YACE queries and caches metrics from the AWS API and exposes them on an ADOT-compatible endpoint

2. Amazon EC2 and Amazon ECS metrics:

– Uses open-source exporters like node_exporter and cAdvisor

– Provides higher frequency and additional metrics

3. Ad-hoc metrics:

– Includes language-specific metrics, such as JMX for Java and Python metrics exported via opentelemetry-exporter-otlp

– Also includes business-oriented metrics defined at the application level

These metrics need to be collected and processed, which is done by ECS services running ADOT configured as scrapers. Each scraper periodically queries a subset of the existing resources, enriches the resulting metrics with identification metadata, and forwards them to the Central Account.
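The enrichment step can be illustrated with a minimal Python sketch. It is not State Street's implementation: the `Sample` type, the `enrich` function, and the label names (`account_id`, `region`, `tenant_group`) are assumptions made for illustration, standing in for whatever identification metadata the scrapers attach.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One scraped data point: metric name, value, and labels."""
    name: str
    value: float
    labels: dict = field(default_factory=dict)


def enrich(samples, identity):
    """Return copies of the samples with identification labels added.

    Labels already present on a sample take precedence over the identity
    labels, so the scraper never overwrites what the exporter reported.
    """
    return [Sample(s.name, s.value, {**identity, **s.labels}) for s in samples]


# Hypothetical identity of one scraper; none of these values come from the post.
identity = {"account_id": "111111111111", "region": "us-east-1", "tenant_group": "group-a"}
scraped = [
    Sample("node_cpu_seconds_total", 1234.5, {"instance": "10.0.0.12:9100", "mode": "idle"}),
]
enriched = enrich(scraped, identity)
print(enriched[0].labels)
```

With labels like these attached at the source, the Central Account can tell which account, region, and tenant group each data point came from.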

The architecture diagram showcasing this phase is illustrated in Figure 2.

Figure 2 – Diagram of scrapers in a monitored account

To optimize this process, the team deployed the scrapers in the same Availability Zones as the resources they monitor, which reduced latency and data transfer costs. For large environments, they split the scope across multiple scrapers, each responsible for querying specific resources within the configured measurement interval. For high availability, each Amazon ECS service runs multiple copies of its scraper, which preserves observability during maintenance activities and downtime. They also leveraged the deployment circuit breaker feature of Amazon ECS to provide stability during service updates.
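The post does not say how the scope was divided among scrapers. One possible approach, shown here purely as an assumption, is to assign each target to a scraper with a stable hash. The function names and the target addresses are hypothetical.

```python
import hashlib


def assign_scraper(target: str, scraper_count: int) -> int:
    """Map a target to a scraper index with a stable hash.

    hashlib is used instead of the built-in hash() because the latter is
    randomized per process for strings, which would reshuffle targets on
    every restart.
    """
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % scraper_count


def split_scope(targets, scraper_count):
    """Group targets into one list per scraper."""
    shards = [[] for _ in range(scraper_count)]
    for target in targets:
        shards[assign_scraper(target, scraper_count)].append(target)
    return shards


# Hypothetical node_exporter endpoints.
targets = [f"10.0.0.{i}:9100" for i in range(1, 21)]
shards = split_scope(targets, scraper_count=3)
for index, shard in enumerate(shards):
    print(f"scraper {index}: {len(shard)} targets")
```

Every copy of a given scraper would use the same index and therefore query the same targets, which is what makes the redundant copies interchangeable during maintenance.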

Aggregation
The Central Account receives metrics from several environments via a set of dedicated ECS services running ADOT. Those services are configured to receive telemetry via HTTPS and persist it in the AMP workspaces. State Street calls those services middleware. Because they support open industry standards, they also enable seamless integration with on-premises data sources.

This approach has advantages for the AWS environments as well:

1. Low Authentication Overhead: Requires only the set-up of basic authentication to ensure metrics are ingested only from authorized sources

2. Scalable: Reduces ingestion pressure on the AMP workspace by batching requests in the middleware

3. Flexible: Provides an abstraction layer between the resources and analysis, allowing for flexibility in choice of solutions for Metric Collection and Presentation
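To illustrate the batching idea from the list above, here is a minimal sketch in Python. It is not the ADOT batching logic itself: the `Batcher` class and the batch size of 100 are assumptions, and a production batcher would typically also flush on a timer, which is omitted here.

```python
class Batcher:
    """Buffer data points and hand them to `send` in fixed-size batches."""

    def __init__(self, send, max_batch_size):
        self._send = send
        self._max_batch_size = max_batch_size
        self._buffer = []

    def add(self, point):
        """Buffer one point and flush as soon as the batch is full."""
        self._buffer.append(point)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as a single request."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)


sent = []  # stands in for write requests to the workspace
batcher = Batcher(send=sent.append, max_batch_size=100)
for i in range(250):
    batcher.add(("requests_total", float(i)))
batcher.flush()  # push the final partial batch
print([len(batch) for batch in sent])  # [100, 100, 50]
```

In this example, 250 incoming data points turn into three write requests instead of 250, which is the kind of reduction in ingestion pressure the list describes.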

The architecture diagram for this phase is illustrated in Figure 3.

Figure 3 – Diagram of the middleware on the Central Account

Handling dynamic loads
Metric volume can fluctuate significantly due to new deployments, failovers, etc. The middleware needs to be responsive to avoid dropping data points during crucial observability events.

To ensure high availability, State Street leveraged ECS Service Auto Scaling.

However, traditional memory-based scaling is inadequate here. During regular operation, data flows out as quickly as it flows in, so there is no need to buffer or cache it. An increase in memory usage would therefore indicate backpressure, eventually leading to memory starvation and the task being replaced.

State Street adopted auto-scaling based on request count per load balancer target instead. They ran load tests to establish how many requests per second they could reliably handle, as this can vary depending on the resources and configuration. An example of the autoscaling in action is shown in Figure 4, where the number of requests over time is superimposed with the scaling events. The dotted line represents the configured limit of requests per second.

Figure 4 – Autoscaling in Action, ALB Request Count per Middleware Task

The highlighted points correspond to:

A. Regular operation under the threshold with only one task: Middleware 1.

B. Requests increase and exceed the configured threshold.

C. A new task (Middleware 2) is added, which reduces the load per task to manageable levels.

D. Once the total number of requests decreases, the additional task is no longer required and the service scales in.
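The arithmetic behind points A through D can be sketched as follows. This is only the steady-state calculation, not how ECS Service Auto Scaling is implemented, since the real service also applies alarms and cooldown periods. The threshold of 50 requests per second per task and the task limits are made-up values.

```python
import math


def desired_task_count(total_requests_per_second, target_per_task, min_tasks=1, max_tasks=10):
    """Smallest task count that keeps the per-task request rate at or below the target."""
    needed = math.ceil(total_requests_per_second / target_per_task)
    return max(min_tasks, min(max_tasks, needed))


TARGET_PER_TASK = 50  # hypothetical threshold established by load testing

# A: 30 req/s fits in one task. B/C: 70 req/s needs two. D: back to 30 req/s.
print([desired_task_count(rps, TARGET_PER_TASK) for rps in (30, 70, 30)])  # [1, 2, 1]
```

With two tasks at 70 requests per second, each task handles 35, which is back under the hypothetical threshold.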

To further improve the availability of the solution during extreme surges in requests, the team enabled the memory limiter processor available in ADOT. The memory limiter monitors memory usage and, when it approaches the maximum, preemptively drops data points so that the running task is not overwhelmed. This should rarely occur, because they also configured a safety margin on the scale-out threshold to cover the startup time of new middleware tasks.
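The behavior described above can be approximated with a short Python sketch. It is a simplified illustration, not the actual processor code: the `MemoryLimiter` class and the memory readings are hypothetical. The parameter names echo the `limit_mib` and `spike_limit_mib` settings of the OpenTelemetry Collector memory limiter processor, where refusal starts at the limit minus the spike allowance.

```python
class MemoryLimiter:
    """Refuse new batches once memory usage reaches a soft limit."""

    def __init__(self, limit_mib, spike_limit_mib, read_memory_mib):
        # The soft limit leaves headroom for a sudden spike below the hard limit.
        self.soft_limit_mib = limit_mib - spike_limit_mib
        self._read_memory_mib = read_memory_mib
        self.refused_points = 0

    def accept(self, batch):
        """Return True if the batch may proceed, False if it is refused."""
        if self._read_memory_mib() >= self.soft_limit_mib:
            self.refused_points += len(batch)
            return False
        return True


readings = iter([300, 420, 350])  # made-up memory readings in MiB
limiter = MemoryLimiter(limit_mib=512, spike_limit_mib=112, read_memory_mib=lambda: next(readings))
results = [limiter.accept(["point"] * 10) for _ in range(3)]
print(results, limiter.refused_points)  # [True, False, True] 10
```

Only the batch that arrives while usage is above the 400 MiB soft limit is refused. Once memory recovers, data flows again.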

Presentation
The middleware then persists the data into Amazon Managed Service for Prometheus workspaces, which are logical spaces dedicated to the storage and querying of Prometheus metrics. To do so, State Street used the Prometheus Remote Write Exporter. Metrics stored in the workspace are then used for dashboards and alerting through Grafana and native AMP alerting.

Figure 5 is a snapshot of this visualization, showing the AMP ingestion rate in green and the number of middleware ECS tasks in yellow.

Figure 5 – Metrics visualization using Grafana

Tenant isolation
As a multi-tenant environment, one of the requirements was to control access to tenant information, including metrics.

To provide extra degrees of isolation, State Street replicated the entire solution for different groupings of tenants, segregating the metrics at the start of the process. They dedicated scrapers, middleware, and workspaces to each tenant group. When scraping, they filter the metrics to only include the applicable resources for each tenant group. Then, they route the metrics for each tenant group separately. Furthermore, they only grant each tenant group access to their own dedicated workspace – no group can access metrics for the other groups.
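The per-group segregation can be pictured with a small Python sketch. All names here are hypothetical: the `TenantPipeline` type, the `tenant_group` tag, the example.com URLs, and the workspace names are assumptions for illustration, not State Street's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPipeline:
    """One replicated stack: a scraper scope, a middleware endpoint, and a workspace."""
    tenant_group: str
    middleware_url: str
    workspace: str

    def select(self, resources):
        """Keep only the resources that belong to this pipeline's tenant group."""
        return [r for r in resources if r.get("tenant_group") == self.tenant_group]


pipelines = [
    TenantPipeline("group-a", "https://middleware-a.example.com", "workspace-a"),
    TenantPipeline("group-b", "https://middleware-b.example.com", "workspace-b"),
]
resources = [
    {"address": "10.0.1.5:9100", "tenant_group": "group-a"},
    {"address": "10.0.2.7:9100", "tenant_group": "group-b"},
    {"address": "10.0.3.9:9100"},  # untagged: no pipeline picks it up
]
for pipeline in pipelines:
    selected = [r["address"] for r in pipeline.select(resources)]
    print(pipeline.workspace, selected)
```

Because filtering happens at scrape time, metrics from one group never enter another group's middleware or workspace, so access control on the workspace is a second layer rather than the only one.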

Conclusion

Even in simple settings, achieving comprehensive observability can be a challenge. State Street faced this challenge when tasked with consolidating their AWS cloud, on-premises environments, and multiple multi-tenant AWS accounts across regions into one cohesive and robust system.

To overcome this, they leveraged AWS Distro for OpenTelemetry, Amazon Managed Service for Prometheus, and Amazon Managed Grafana to build a monitoring framework that enables easy and effective resource monitoring. By utilizing these purpose-built AWS tools for OpenTelemetry and Prometheus, they were able to create a unified observability solution that spans environments, accounts, and regions. This allows their users to monitor resources across their complex, multi-faceted landscape.

To learn more about AWS Observability services, see the following resources:

– One Observability Workshop

– AWS Observability Best Practices Guide

– Set up metrics ingestion from Amazon ECS using AWS Distro for OpenTelemetry
