Choice Hotels adopts Amazon Managed Service for Prometheus for operational excellence and cost efficiency

This post was co-written with Stephen Cihak, Senior Director, Abhiram Madadi, Principal Engineer, and Gopi Akula, Senior Manager at Choice Hotels.

Who is Choice Hotels?

Choice Hotels International is one of the largest lodging franchisors in the world. A challenger in the upscale segment and a leader in midscale and extended stay, Choice has more than 7,400 hotels, representing over 625,000 rooms, in 45 countries and territories. A diverse portfolio of 22 brands that run the gamut from full-service upper upscale properties to midscale, extended stay and economy enables Choice to meet travelers’ needs in more places and for more occasions while driving more value for franchise owners and shareholders.

With a franchise-first focus and an industry-leading voluntary retention rate, Choice has been committed to providing its hotel owners with the support they need to succeed since it launched the country’s first hotel chain in 1941. The company’s longstanding leadership in technological innovation within the hospitality industry has been a crucial part of fulfilling that commitment: Choice was the first major hotel company to launch a cloud-based property management system, choiceADVANTAGE, in 2007 and the first to offer a consumer-facing iPhone app in 2009. It was the first to develop and deploy a cloud-based central reservation platform, launching choiceEDGE in 2018, the industry’s first new major global reservation system in over three decades. That same year, Choice became the first hotel company to partner with Google to launch voice-enabled booking via the Google Assistant app. And in 2021, Choice released the industry’s first mobile-enabled revenue management system, ChoiceMAX, continuing its legacy of technological innovation and earning the company Hospitality Technology’s prestigious Hotel Visionary Award for Enterprise Innovator. An Amazon Web Services customer since 2014, Choice was the first major hotel company to commit to going all-in with AWS and move its infrastructure entirely to the cloud, setting a goal of migrating all of its data center workloads to AWS by early 2024.

Choice Hotels’ Current Architecture and success criteria

Choice Hotels follows a platform-first IT principle. Open source-based services and products are its first choice for building a scalable, resilient, and cost-effective shared services platform. To serve millions of guests 24×7 and support thousands of hotels, the platform team needs to build solid, resilient, and cost-effective cloud shared services. One of the key pillars of the IT platform is application and infrastructure monitoring and observability. The Choice Hotels Cloud Platform and Site Reliability teams had built robust solutions to manage the organization’s needs. However, those solutions came with a heavy maintenance burden and high license costs. While looking for opportunities to improve efficiency and free up time and resources, the team successfully implemented Amazon OpenSearch Service, which reduced overall costs and maintenance effort while increasing end-user functionality. This success drove the team to review and fine-tune the entire application monitoring stack.

The new monitoring architecture needed to be easy to maintain, cost-effective, and highly scalable. In many cases, these requirements led the team to managed services. The solution also had to account for how managed services are billed, so that fully managed services are used optimally. This was a key factor in Choice’s use of Amazon Managed Service for Prometheus and OpenTelemetry, because costs did not initially align with expectations.

Choice Hotels was using a combination of open-source and packaged monitoring solutions that did not support traces. Aligning with open-source based solutions, the team decided to implement OpenTelemetry. OpenTelemetry (OTel) is an observability framework that provides a unified way to collect and export telemetry data from both cloud-native and traditional applications. This makes it easier to correlate metrics and traces, which can help troubleshoot problems more effectively. The framework is vendor agnostic and there is significant momentum in the industry towards its use. The solution also seamlessly integrates with all the languages and systems that Choice Hotels leverages.

While adopting the new observability framework, it was important for the Choice Hotels team to leverage the auto-instrumentation agents for traces and derive standard application metrics from those traces. This removed the need for the Cloud Platform team to federate the implementation of application metrics and allowed the team to manage them centrally, increasing the standardization of core metrics and visualizations.

In order to achieve this goal, the team chose to utilize the span metrics processor from the OpenTelemetry Collector’s contributions repository. In a system with distributed tracing implemented, span metrics can simplify the creation of standardized monitoring telemetry, conceptually similar to gathering standard host metrics of CPU, memory and disk usage. The span metrics processor collects telemetry from trace data, including request count, error count, and duration (RED) metrics. Additionally, the spans were exported to Jaeger which provides processing, aggregation, data mining, and visualizations of trace telemetry data.

The span metrics processor aggregates Request, Error, and Duration (RED) metrics from span data:

Request counts are generated based on unique, configurable span attributes such as service name, host IP, and status code. Each unique set of attributes creates a labeled metric series that can be aggregated as needed by the end consumer, for example, to calculate the number of requests for a service name and operation across all host IPs.
Error counts are computed from the Request counts whose status code is defined as an error by the span metrics processor.
Duration is computed from the difference between the span start and end times and exposed as a histogram for each unique set of attributes.
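
The aggregation described above can be illustrated with a minimal Python sketch. This is not the span metrics processor's implementation; the span dictionary layout, the `aggregate_red` function, the bucket bounds, and the sample data are all assumptions made for illustration only.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket upper bounds in milliseconds (illustrative only).
BUCKETS_MS = [2, 4, 8, 50, 100, 200, 400, 800, 1000]


def aggregate_red(spans, dimensions):
    """Group spans by the chosen attributes and build RED metrics per group."""
    series = defaultdict(lambda: {
        "requests": 0,
        "errors": 0,
        # One counter per bucket plus a final overflow bucket.
        # Counts are per bucket here, not cumulative as Prometheus exposes them.
        "duration_buckets": [0] * (len(BUCKETS_MS) + 1),
    })
    for span in spans:
        # Each unique combination of attribute values is its own series.
        key = tuple((dim, span["attributes"].get(dim)) for dim in dimensions)
        entry = series[key]
        entry["requests"] += 1
        if span["status"] == "ERROR":
            entry["errors"] += 1
        duration_ms = span["end_ms"] - span["start_ms"]
        # Index of the first bucket whose upper bound is >= the duration.
        entry["duration_buckets"][bisect_left(BUCKETS_MS, duration_ms)] += 1
    return dict(series)


# Hypothetical spans from two services on two hosts.
spans = [
    {"attributes": {"service.name": "booking", "host.ip": "10.0.0.1"},
     "status": "OK", "start_ms": 0, "end_ms": 3},
    {"attributes": {"service.name": "booking", "host.ip": "10.0.0.2"},
     "status": "ERROR", "start_ms": 10, "end_ms": 70},
    {"attributes": {"service.name": "search", "host.ip": "10.0.0.1"},
     "status": "OK", "start_ms": 5, "end_ms": 6},
]

per_service = aggregate_red(spans, ["service.name"])
per_service_and_host = aggregate_red(spans, ["service.name", "host.ip"])
print(len(per_service), len(per_service_and_host))
```

Adding `host.ip` as a dimension splits the `booking` service into one series per host, which is why the choice of dimensions drives the number of series that must be stored.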

The team implemented OpenTelemetry using a combined agent and gateway collector model. In this model, each microservice with an auto-instrumentation agent writes spans to an agent collector running on the same host as the microservice. The agent collector then exports the spans to a central gateway collector. The gateway collector is responsible for processing the spans and generating metrics. This central gateway allows for rapid updates to telemetry processing without requiring a change on every client. As a result, the team was able to iterate quickly when changes to telemetry configuration were needed, such as adding new attributes to the span metrics or switching the export of metric data to a snapshot strategy.

After careful consideration of multiple options, the team chose Amazon Managed Service for Prometheus for metric storage and querying. Amazon Managed Service for Prometheus is a Prometheus-compatible service that monitors and provides alerts on containerized applications and infrastructure at scale. Amazon Managed Service for Prometheus integrates with Amazon Elastic Kubernetes Service, Amazon Elastic Container Service, and AWS Distro for OpenTelemetry (ADOT), an AWS-managed OpenTelemetry distribution.

The architecture comprises microservices running with an ADOT auto-instrumentation agent and an ADOT collector as a local agent collector. Since the span metrics processor is not yet supported by the latest stable version of the ADOT collector, the team chose the OpenTelemetry Collector from the contributions repository as the Gateway collector.

Choice Hotels’ application platform has both traditional services and microservices, with an ADOT Java agent and an ADOT collector deployed on these platforms and sending telemetry to a central collector. The central OTel collector ingests up to 200 million active time series. The team was able to quickly adopt the AWS-managed open-source service and support the high-cardinality metrics while keeping the cost low.

Choice Hotel Architecture

Choice Hotels’ Solution

To achieve their goals, Choice Hotels fine-tuned the collector configuration and implemented the best practices shared by AWS that are listed below:

Send high-availability data to Amazon Managed Service for Prometheus with OpenTelemetry Contrib Collector

The OTel Contrib Collector distribution was chosen as Gateway collector because it supports the span metrics processor, which derives metrics from spans sent from ADOT agent collectors. This allows the Gateway collectors to export metrics to Amazon Managed Service for Prometheus using the prometheusremotewrite exporter.

However, the initial setup resulted in a high volume of metrics and duplicate samples being written to Amazon Managed Service for Prometheus. This was because the ADOT agent collectors export spans through the Gateway collector ALB, so spans for the same service could land on any Gateway collector pod, and several pods would then write the same metric series. To overcome this, each Gateway collector pod’s IP address was applied as an external label to the metrics so that Amazon Managed Service for Prometheus could distinguish between the different gateway pods, which further increased the metric cardinality.
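
To illustrate why the extra label removes the clash but raises cardinality, here is a simplified Python model of series identity. It is a sketch only and not how Amazon Managed Service for Prometheus is implemented; `series_id`, `write_sample`, the metric name `calls_total`, and the pod IPs are hypothetical. The one assumption it relies on is that a series is identified by its metric name plus its full label set.

```python
def series_id(name, labels):
    """A time series is identified by its metric name plus its full label set."""
    return (name, tuple(sorted(labels.items())))


def write_sample(store, name, labels, timestamp, value):
    """Store a sample; return False when the series already has that timestamp."""
    samples = store.setdefault(series_id(name, labels), {})
    if timestamp in samples:
        return False  # clash: same series, same timestamp
    samples[timestamp] = value
    return True


base = {"service_name": "booking"}

# Without a distinguishing label, two gateway pods write to the same series.
shared = {}
accepted_a = write_sample(shared, "calls_total", base, 60, 120)
accepted_b = write_sample(shared, "calls_total", base, 60, 95)

# With a per-pod label (hypothetical pod IPs), each pod owns its own series.
labelled = {}
for pod_ip, value in [("10.0.0.1", 120), ("10.0.0.2", 95)]:
    labels = dict(base, otel_collector=f"otel_collector_{pod_ip}")
    write_sample(labelled, "calls_total", labels, 60, value)

# Consumers recover the service-wide total by summing across pods.
total = sum(samples[60] for samples in labelled.values())
print(accepted_a, accepted_b, len(shared), len(labelled), total)
```

In this model the second unlabelled write is rejected, while the labelled writes both succeed at the cost of one series per gateway pod for every metric.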

Reduce the ingestion rate caused by span metrics processing

Collecting RED (Request, Error, and Duration) metrics from the span data is critical to understanding the user experience. In the initial configuration, the central OTel collector was immediately exporting derived span metrics to Amazon Managed Service for Prometheus. This resulted in an exponential increase of metric ingestion into the Amazon Managed Service for Prometheus workspace (and therefore an exponential increase in cost) as the solution was released across the environment.

The team initially attempted to reduce the metric export volume by using the batch processor and the prometheusremotewrite exporter. This helped until the volume increased to a level where the remote write exporter queue began to fill up and caused increasingly higher memory consumption on the collectors. The team quickly realized that exposing metrics as snapshots would keep memory consumption minimal and make it possible to limit the number of metrics pushed to Amazon Managed Service for Prometheus. After the configuration was updated to add a Prometheus exporter that exposes metrics as snapshots and a new Prometheus receiver that scrapes these snapshots at a controlled interval, the system could efficiently maintain metric series data and limit the number of remote write calls to Amazon Managed Service for Prometheus. This reduced the capacity required for the central collectors and significantly reduced the runtime cost of Amazon Managed Service for Prometheus. An added benefit of this pattern is that it provides a regular “heartbeat” of data points and allows for improved monitoring of metric and span collection and processing.

This approach will help to ensure that the metrics are not overwhelming the remote write endpoint and that increased volumes do not negatively impact Amazon Managed Service for Prometheus costs.
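
The effect of the snapshot pattern on ingestion volume can be sketched with a small Python simulation. It is a deliberately simplified model, not a measurement of the collector: it assumes that, without snapshots, every update to a series is exported as one sample, and that with snapshots each scrape emits one sample per known series. The function names, the three series, and the 600-second traffic pattern are hypothetical.

```python
def samples_pushed_on_update(events):
    """Assumed immediate export: every update to a series sends one sample."""
    return len(events)


def samples_pushed_on_scrape(events, scrape_interval_s, duration_s):
    """Snapshot export: counters accumulate in memory and each scrape
    emits one sample per series seen so far."""
    pending = sorted(events)
    known_series = set()
    pushed = 0
    i = 0
    for scrape_time in range(scrape_interval_s, duration_s + 1, scrape_interval_s):
        # Fold in every update that arrived before this scrape.
        while i < len(pending) and pending[i][0] <= scrape_time:
            known_series.add(pending[i][1])
            i += 1
        # One sample per series per scrape, even if nothing changed ("heartbeat").
        pushed += len(known_series)
    return pushed


# Hypothetical traffic: 3 series, each updated once per second for 10 minutes.
series_keys = ["booking", "search", "checkout"]
events = [(second, key) for second in range(600) for key in series_keys]

immediate = samples_pushed_on_update(events)
scraped = samples_pushed_on_scrape(events, scrape_interval_s=60, duration_s=600)
print(immediate, scraped)
```

Under these assumptions the number of samples sent no longer depends on traffic volume, only on the number of series and the scrape interval, which is what makes the ingestion cost predictable.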

Here are some additional details about the steps involved in setting up this solution:

Install the OTel Contrib Collector on the Gateway collector pods.
Configure the span metrics processor to send its metrics to the Prometheus exporter.
Configure the Prometheus exporter to expose the metrics as snapshots.
Configure the Prometheus receiver to scrape the snapshots at a controlled interval.
Configure the prometheusremotewrite exporter to pass the scraped metrics to Amazon Managed Service for Prometheus.

Once this solution was set up, it helped to ensure that metrics are exported to Amazon Managed Service for Prometheus in a reliable and efficient manner. The following is an example OTel configuration.

receivers:
  otlp:
    protocols:
      grpc:
  # dummy receiver that's never used, because a pipeline is required to have one.
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: "localhost:12345"
  prometheus:
    config:
      global:
        external_labels:
          otel_collector: otel_collector_${POD_IP}
      scrape_configs:
      - job_name: 'atm'
        scrape_interval: 60s
        static_configs:
        - targets: 
          - ${POD_IP}:8889

processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch:
    send_batch_size: 1000
    timeout: 10s
    send_batch_max_size: 1000
  spanmetrics: 
    metrics_exporter: prometheus
    latency_histogram_buckets: [2ms, 4ms, 8ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 2s, 3s, 5s, 10s]
    dimensions:
      - name: http.method
        default: GET
      - name: http.status_code
      - name: service.namespace

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  jaeger:
    endpoint: {{ .Values.jaeger_endpoint | quote }}
  prometheusremotewrite:
    endpoint: {{ .Values.amp_endpoint | quote }}
    auth:
      authenticator: sigv4auth

extensions:
  memory_ballast:
    size_in_percentage: 30
  sigv4auth:
    region: "us-west-2"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [spanmetrics,memory_limiter,batch]
      exporters: [jaeger]
    metrics/spanmetrics:
      # dummy receiver to pass validation, because a pipeline requires one.
      receivers: [otlp/spanmetrics]
      exporters: [prometheus]
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter,batch]
      exporters: [prometheusremotewrite]
  extensions: [sigv4auth,memory_ballast]

Note: At the time of this implementation, OTel had not yet defined a connector, a component that is both a receiver and an exporter. Since then, the spanmetrics processor has been deprecated and replaced by the spanmetrics connector, which has some breaking changes. As such, the configuration above is specific to the deprecated version. The Choice Hotels team will be moving to the new implementation in the coming months.

Conclusion

In this post, we described how Choice Hotels successfully adopted AWS Distro for OpenTelemetry and Amazon Managed Service for Prometheus for their observability platform. We shared best practices that AWS provided to help Choice Hotels reach the scale needed to ingest 200 million active time series of metrics efficiently and cost-effectively, through adjustments to how and when metrics are shipped to Amazon Managed Service for Prometheus.
