AWS Cloud Operations Blog
How to reduce Istio sidecar metric cardinality with Amazon Managed Service for Prometheus
The complexity of distributed systems has grown significantly, making monitoring and observability essential for application and infrastructure reliability. As organizations adopt microservice-based architectures and large-scale distributed systems, they face the challenge of managing an increasing volume of telemetry data, particularly high metric cardinality in systems like Prometheus. To address this, many are turning to service meshes like Istio. Istio offers a comprehensive framework for observing, connecting, and securing microservices. However, its extensive metrics capabilities, while providing valuable insights into service performance, also introduce the challenge of managing high cardinality in Istio sidecar metrics.
In this blog post, let’s explore strategies for optimizing Istio sidecar metrics by streamlining cardinality and effectively managing high-volume data using Amazon Managed Service for Prometheus. The blog will use the latest Istio version and the bookinfo sample application to illustrate the concept of high metric cardinality and how it can be efficiently managed. By tackling high metric cardinality challenges, developers, operators, site reliability engineers (SREs), and Istio users can enhance their monitoring systems, enabling efficient collection, storage, and analysis of telemetry data within a service mesh environment.
Introduction to metric cardinality
High metric cardinality refers to the large number of unique combinations of metric names and their associated labels, which can be a source of performance bottlenecks and increased resource usage in monitoring systems. Some customers may need high-cardinality data for critical apps while wanting to reduce metric cardinality for less critical ones. Service meshes like Istio, which provide a wealth of telemetry data out of the box, can inadvertently contribute to this high cardinality by generating a multitude of metrics with various labels. If not addressed, high metric cardinality can hinder application performance analysis and inflate resource usage in monitoring systems, especially in time-series databases like Prometheus, where numerous data points are collected across dimensions such as application instances, versions, and custom labels.
Potential impact of high metric cardinality on monitoring systems:
- Increased Resource Consumption: As unique time-series grow, so do the storage and memory needs, potentially leading to higher costs and slower query performance.
- Complexity: The sheer volume of unique time series can overwhelm dashboards and the people reading them, making it hard to spot patterns or trends.
- Management Challenges: More unique metrics complicate system management and maintenance, increasing the likelihood of misconfigurations and missed alerts.
- Query Performance Degradation: High cardinality slows down data retrieval and aggregation, degrading query performance and delaying issue resolution.
Reducing metric cardinality for non-critical apps can help mitigate these challenges, ensuring a balanced approach to monitoring and analysis. It’s also worth noting that many of these challenges can be mitigated by using Amazon Managed Service for Prometheus (AMP), which streamlines resource management, enhances query efficiency, and simplifies overall monitoring system maintenance.
Managing high metric cardinality
To effectively address the challenges posed by high metric cardinality, it is crucial to focus on distinguishing between necessary and unnecessary cardinality in the data collected by the monitoring system. While careful design and management of labels and metrics are essential, it’s equally important to evaluate what level of cardinality is genuinely useful for monitoring and troubleshooting. Identifying and eliminating redundant or unnecessary cardinality can be a more practical approach than attempting to reduce all high cardinality. Additionally, monitoring systems like Prometheus can be tuned and optimized to handle high cardinality more efficiently, although this may involve trade-offs in terms of resource consumption and query performance.
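As a starting point for that evaluation, Prometheus itself can show which metrics contribute the most series. A minimal PromQL sketch (the selectors are illustrative and should be scoped to what your Prometheus actually scrapes; the all-series selector can be expensive on large servers):
# Top 10 metric names by number of active time series
topk(10, count by (__name__) ({__name__=~".+"}))
# Number of active series for a single metric
count(istio_requests_total)
Running these periodically makes it easy to spot which metrics, and later which labels, are worth trimming.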
Introduction to Istio Service Mesh
Istio is a service mesh leveraging the high-performance Envoy proxy to streamline the connection, management, and security of microservices. It simplifies traffic management, security, and observability, allowing developers to delegate these tasks to the service mesh, while also enabling detailed metric collection on network traffic between microservices via sidecar proxies.
Challenges with Istio Sidecar metrics
Istio’s use of sidecar proxies can lead to high metric cardinality, potentially impacting monitoring system performance. However, these diverse metrics, with unique label combinations, provide essential insights into microservice behavior, aiding in debugging complex systems. Istio addresses these challenges with selective metrics collection and data aggregation tools, striking a balance between detailed monitoring and efficiency.
Steps to identify and address high Istio sidecar metric cardinality:
Step 1: Identify the sidecar metrics:
- The first step is to identify the sidecar metrics that are causing the high cardinality. Istio provides many metrics out of the box, and each sidecar can generate a large number of them, so it’s essential to identify the metrics that are most relevant to your use case.
Step 2: Analyze the metric cardinality:
- Once you have identified the relevant metrics, you need to analyze their cardinality. Metric cardinality refers to the number of unique combinations of labels for a given metric. High cardinality can lead to high resource consumption, slow queries, and increased storage costs. You can use Istio’s built-in Prometheus or Grafana dashboards to analyze metric cardinality (see the query sketch after this list).
Step 3: Reduce metric cardinality:
- To reduce metric cardinality, you can use several mechanisms such as label filtering, aggregation, downsampling, request classification, and Prometheus federation.
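For step 2, a quick way to gauge how much an individual label contributes to a metric’s cardinality is to count its distinct values. A small PromQL sketch (the metric and label names are examples):
# Distinct response_code values observed for istio_requests_total
count(count by (response_code) (istio_requests_total))
# Distinct (destination_service, response_code) combinations
count(count by (destination_service, response_code) (istio_requests_total))
Labels with many distinct values, or combinations that multiply quickly, are the best candidates for the reduction techniques below.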
A hypothetical example to showcase high metric cardinality
Suppose you have a microservices-based application with the following characteristics:
- 50 microservices
- Each microservice runs 10 instances (replicas)
- Each microservice instance has an associated sidecar proxy (Envoy)
- Each microservice generates 10 custom metrics
With this setup, the total number of metrics generated by the microservices is:
50 (microservices) * 10 (instances) * 10 (custom metrics) = 5,000 metrics
However, this doesn’t account for the metrics generated by the sidecar proxies. Let’s assume that each sidecar proxy generates 20 metrics related to traffic, errors, and latency. Then, the total number of metrics generated by the sidecar proxies is:
50 (microservices) * 10 (instances) * 20 (sidecar metrics) = 10,000 metrics
In total, the application generates 15,000 metrics.
Now, let’s add label dimensions to these metrics:
- Each metric has a ‘service‘ label, representing the microservice it belongs to (50 unique values).
- Each metric has an ‘instance‘ label, representing the specific instance of the microservice it belongs to (500 unique values).
With these labels, the metric cardinality becomes:
15,000 (total metrics) * 50 (service labels) * 500 (instance labels) = 375,000,000 unique time series
Techniques for reducing Istio sidecar metric cardinality
In current Istio releases the Mixer component is no longer available, and the Envoy proxy generates sidecar metrics directly.
To reduce high metric cardinality, we can implement several techniques: label filtering, aggregation, downsampling, request classification, and Prometheus federation.
Label Filtering
Label filtering is a mechanism that allows you to filter out irrelevant labels from a metric, thereby reducing its cardinality. Removing unnecessary labels reduces the number of unique label combinations and therefore the overall metric cardinality. We can also use Prometheus’s relabeling feature to filter out unwanted labels.
For example, you might have a metric that has labels for the source and destination IP addresses, but you might only be interested in the metric for a specific source IP address. In that case, you can use label filtering to filter out all other source IP addresses.
Here’s an example of label filtering:
Consider a metric called istio_requests_total, which has the following labels: source_app, destination_app, source_version, destination_version, and response_code.
To keep only the requests from a specific source app, you can use the following Prometheus query:
istio_requests_total{source_app="my-app"}
This query returns only requests from the my-app source app. Note that this filters what you query, not what Prometheus stores; to reduce stored cardinality, apply the filtering at scrape time as shown below.
You can also implement label filtering directly in the Prometheus scrape configuration. You can use labeldrop and labelkeep actions within the scrape configuration.
scrape_configs:
  - job_name: "istio"
    static_configs:
      - targets: ["istio-service:port"]
    metric_relabel_configs:
      # Drop all time series coming from a specific source app
      - source_labels: [source_app]
        regex: "my-app"
        action: drop
      # Alternatively, drop the source_app label itself from every series
      # (labeldrop matches label names, not label values)
      - action: labeldrop
        regex: "source_app"
In the configuration above, the drop action discards every series whose source_app label matches my-app before it is ingested by Prometheus, while the labeldrop action removes the source_app label entirely from the remaining series (labeldrop matches label names rather than values). Both rules are shown together for illustration; in practice you would typically pick one or the other, and when dropping a label you should confirm that the remaining labels still uniquely identify the series you care about. Either approach reduces the number of unique series stored.
Another example: the following job keeps only pod targets that run the Istio sidecar (istio-proxy) container and carry the expected application label, and then drops all temporary labels that start with __meta_.
scrape_configs:
  - job_name: "istio-sidecar-metrics"
    metrics_path: "/metrics"
    scheme: "http"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that run the Istio sidecar container
      - source_labels: [__meta_kubernetes_pod_container_name]
        regex: "istio-proxy"
        action: keep
      # Keep only pods belonging to the application of interest
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: "myapp"
        action: keep
      # Drop the temporary service-discovery labels
      - action: labeldrop
        regex: "__meta_.*"
Aggregation
Aggregation is the process of combining metrics with similar labels into a single metric. By aggregating metrics, we can reduce the overall metric cardinality.
Here’s an example of aggregation:
Consider again the metric istio_requests_total, which has the following labels: source_app, destination_app, source_version, destination_version, http_method, and response_code.
To aggregate all requests by HTTP method, you can use the following Prometheus query:
sum by (http_method) (istio_requests_total)
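A query like this only aggregates at read time. To persist the lower-cardinality view, you can record it with a Prometheus recording rule and point dashboards at the recorded series instead; a minimal sketch, assuming a rules file loaded via rule_files (rule and group names are illustrative):
groups:
  - name: istio-aggregation
    rules:
      # One series per HTTP method instead of one per full label combination
      - record: istio_requests:rate5m:by_method
        expr: sum by (http_method) (rate(istio_requests_total[5m]))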
With Mixer gone, metric dimensions are customized through Istio’s Telemetry API rather than the old config.istio.io resources. The following example (resource name is illustrative) removes high-cardinality tags from the request count metric reported by the sidecars, which collapses many per-tag series into aggregated ones:
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: reduce-request-count-tags
  namespace: istio-system
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT
            mode: CLIENT_AND_SERVER
          tagOverrides:
            # Remove tags that are not needed for analysis;
            # every removed tag reduces the number of unique series
            request_protocol:
              operation: REMOVE
            source_canonical_revision:
              operation: REMOVE
In this example, the request_protocol and source_canonical_revision tags are removed from the request count metric (istio_requests_total) for both client and server reporters, so Prometheus stores one series per remaining label combination instead of one per protocol and revision.
Downsampling
Downsampling is the process of reducing the number of samples in a metric by averaging or summarizing them over a specified time interval. It allows you to reduce the resolution of a metric over time. High-resolution metrics can consume a lot of resources, so downsampling can help reduce resource consumption and improve query performance. There are several ways to downsample metrics, such as by:
- Time range
- Aggregation function
- Sampling rate
The most common downsampling technique is to aggregate metrics over time intervals, such as every 5 minutes or every hour.
Take the metric istio_requests_total, which has a high resolution of 1 second. To reduce the resolution to 1 minute, you can use the following Prometheus query:
sum(rate(istio_requests_total[1m]))
This expression computes the per-second request rate over 1-minute windows and sums away all labels, collapsing many series into one. Recorded as a recording rule (or used on a dashboard), it gives a lower-resolution, lower-cardinality view of the same data, reducing resource consumption and improving query performance without losing too much information.
It’s important to note that downsampling can lead to loss of information, so it’s essential to choose the appropriate downsampling technique based on your use case. If you need high-resolution metrics for real-time monitoring, then downsampling might not be the best approach. However, if you need long-term storage and analysis of metrics, then downsampling can be an effective way to reduce resource consumption and improve query performance.
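One practical compromise is to keep full-resolution data in the local Prometheus for real-time debugging and thin it out before it reaches long-term storage. A sketch using remote_write with write_relabel_configs (the endpoint URL is a placeholder; the AMP endpoint shown later in this post can be substituted):
remote_write:
  - url: 'https://example.com/api/v1/remote_write'   # placeholder endpoint
    write_relabel_configs:
      # Drop per-bucket histogram series before they leave the cluster;
      # the _sum and _count series are still sent for latency averages
      - source_labels: [__name__]
        regex: 'istio_request_duration_milliseconds_bucket'
        action: drop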
Prometheus itself does not automatically downsample stored data, but you can reduce the volume and resolution of what it ingests from the Istio sidecars at scrape time: scrape less frequently, cap the number of samples per target, and drop the highest-cardinality series (such as per-bucket histogram series) before ingestion.
scrape_configs:
  - job_name: 'istio-sidecar-metrics'
    metrics_path: '/metrics'
    scheme: 'http'
    # Scrape less often to lower the effective resolution of the stored data
    scrape_interval: 60s
    # Fail the scrape if a single target exposes an unexpected number of series
    sample_limit: 10000
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that run the Istio sidecar container
      - source_labels: [__meta_kubernetes_pod_container_name]
        regex: 'istio-proxy'
        action: keep
      # Keep only the application of interest
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: 'myapp'
        action: keep
    metric_relabel_configs:
      # Keep only Istio metrics from these targets
      - source_labels: [__name__]
        regex: 'istio_.*'
        action: keep
      # Drop per-bucket histogram series; _sum and _count remain available
      # for average latency calculations
      - source_labels: [__name__]
        regex: 'istio_.*_bucket'
        action: drop
In this example, the longer scrape_interval lowers the resolution of the collected data, sample_limit protects the server from targets that suddenly expose too many series, and the metric_relabel_configs keep only Istio metrics while dropping the per-bucket histogram series, which are usually the largest contributor to sidecar metric cardinality. For true downsampling of data that is already stored, use recording rules (as shown earlier) or a long-term storage backend.
Prometheus Federation
Prometheus federation is a technique that allows you to aggregate metrics data from multiple Prometheus servers into a single, global view. This can be particularly useful when dealing with high metric cardinality in Istio environments, as it can help reduce the amount of data that needs to be processed and stored by each Prometheus server.
Figure 1: Prometheus Federation
- Run one or more local Prometheus servers, each scraping a subset of the Istio sidecar proxies in your environment (for example, per namespace or per team), and apply label filtering and aggregation there.
- Set up a central Prometheus server that acts as the federation endpoint. This server does not scrape the sidecars directly; it scrapes the /federate endpoint of each local Prometheus server.
- On the central server’s federation job, use match[] selectors so that only the aggregated, lower-cardinality series are pulled in.
- Optionally, have the central server remote write the federated series to long-term storage.
By using Prometheus federation in this way, you can reduce the amount of metrics data that any single Prometheus server needs to process and store, while still maintaining a comprehensive view of your Istio environment’s metrics data.
Prometheus Federation configuration file example based on a sample e-commerce app
Multiple Prometheus instances can each scrape metrics from a subset of your services: one instance can scrape web-frontend and product-service while another scrapes inventory-service and cart-service. Apply label filtering to reduce the number of unique label combinations for each metric; this can be done in the Prometheus configuration using the labeldrop or labelkeep actions.
Apply aggregation to further reduce the number of unique label combinations, using Prometheus’s sum, avg, or max functions (typically in recording rules) to combine series that share the same labels.
Apply downsampling to reduce the granularity of the metrics, using Prometheus’s sum_over_time, avg_over_time, or max_over_time functions to aggregate metrics over a longer time period.
# prometheus-local.yml: one of several local Prometheus servers,
# each scraping a subset of the microservices
global:
  scrape_interval: 15s
rule_files:
  - 'aggregation-rules.yml'
scrape_configs:
  - job_name: 'web-frontend'
    static_configs:
      - targets: ['web-frontend:8080']
  - job_name: 'product-service'
    static_configs:
      - targets: ['product-service:8080']
  - job_name: 'inventory-service'
    static_configs:
      - targets: ['inventory-service:8080']
  - job_name: 'cart-service'
    static_configs:
      - targets: ['cart-service:8080']

# aggregation-rules.yml: recording rules that aggregate and downsample
# the raw series before they are federated
groups:
  - name: istio-federation-rules
    rules:
      # Aggregation: collapse per-instance series into one series
      # per job and response code
      - record: job:istio_requests:rate5m
        expr: sum by (job, response_code) (rate(istio_requests_total[5m]))
      # Downsampling: 5-minute average, aggregated per job and response code
      - record: job:istio_requests_total:avg_over_time5m
        expr: avg by (job, response_code) (avg_over_time(istio_requests_total[5m]))

# prometheus-federated.yml: central Prometheus server that pulls only the
# pre-aggregated series from each local server
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Federate only the recorded, lower-cardinality series
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['prometheus-local-1:9090', 'prometheus-local-2:9090']
    metric_relabel_configs:
      # Label filtering: drop labels that are not needed in the global view
      - action: labeldrop
        regex: 'instance|pod|app'
In this example, each local Prometheus server scrapes its subset of services, recording rules apply aggregation (combining per-instance series into per-service series) and downsampling (averaging over 5-minute windows), and the central server federates only those recorded series, applying label filtering to drop labels that are not needed in the global view. The result is a single, lower-cardinality view of the whole application in the federated Prometheus instance, ready for further analysis.
Note: These configurations are examples and will need to be adapted to the requirements of your specific application.
Request Classification
Request classification is a technique used in Istio to reduce metric cardinality, typically implemented using Envoy filters. When a request is received by an Istio sidecar proxy, it is classified based on a set of rules defined in the Envoy filter chain; these rules determine which metrics, and with which dimensions, are generated for the request.
For example, let’s say we have a microservice application with three services: Service A, Service B, and Service C. Each service has a sidecar proxy that generates Istio sidecar metrics. By default, the sidecar proxies generate metrics for every request that passes through them, resulting in a large number of metrics with high cardinality.
To reduce the metric cardinality, we can use request classification to generate metrics only for specific requests. We can define rules in the Envoy filter chain to classify requests based on specific criteria, such as the service name or the HTTP method.
For example, we can define a rule to generate metrics only for requests to Service A that use the GET method. This reduces the number of metrics generated by the sidecar proxy and simplifies the metric cardinality.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: classify-get-requests
  namespace: default
spec:
  workloadSelector:
    labels:
      app: my-app
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
        listener:
          filterChain:
            filter:
              name: envoy.filters.network.http_connection_manager
              subFilter:
                name: envoy.filters.http.router
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.lua
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
            inlineCode: |
              function envoy_on_request(request_handle)
                -- Flag GET requests so they can be classified in metrics
                if request_handle:headers():get(":method") == "GET" then
                  request_handle:logInfo("GET request detected")
                  request_handle:headers():add("x-istio-metrics", "true")
                end
              end
In this example, an EnvoyFilter injects a Lua filter into the sidecar of workloads labeled app: my-app; the filter inspects each incoming request and tags GET requests with an x-istio-metrics header, which can then be used to classify or filter the metrics generated for those requests.
Request classification can also be expressed at the routing layer. The following VirtualService example implements request classification by setting a classification header on matched routes:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: product-service
spec:
  hosts:
    - product-service
  http:
    - match:
        - uri:
            prefix: /products
      route:
        - destination:
            host: product-service
      headers:
        request:
          set:
            classification: products
    - match:
        - uri:
            prefix: /reviews
      route:
        - destination:
            host: product-service
      headers:
        request:
          set:
            classification: reviews
Here, incoming requests to product-service that match the /products URI prefix are classified as products, while requests that match the /reviews URI prefix are classified as reviews. You can then use these classifications to filter or aggregate metrics in Prometheus, for example by only showing metrics for requests classified as products. Used this way, request classification can help improve the performance and scalability of metrics collection for microservice applications in Istio.
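Assuming the classification value is also surfaced as a dimension on the request metrics (for example, through the metric customization shown earlier), it can then be used to narrow queries. A hedged PromQL sketch with an illustrative label name:
# Request rate for traffic classified as "products" only
sum(rate(istio_requests_total{classification="products"}[5m]))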
Demonstrating high Istio sidecar metric cardinality on Amazon EKS:
Prerequisites:
- An Amazon EKS cluster running Kubernetes version 1.27
- Istio 1.19.1 (https://istio.io/latest/docs/setup/install/istioctl/)
- The Istio Prometheus and Grafana add-ons installed:
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.19/samples/addons/prometheus.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.19/samples/addons/grafana.yaml
- The bookinfo sample application deployed to the EKS cluster (https://istio.io/latest/docs/examples/bookinfo/)
- A working understanding of Istio and its constructs, which is essential for following along
In the istio-system namespace, patch the Grafana and Prometheus services to type LoadBalancer instead of ClusterIP:
kubectl patch svc prometheus -n istio-system --type='json' -p '[{"op":"replace","path":"/spec/type","value":"LoadBalancer"}]'
kubectl patch service grafana -n istio-system -p '{"spec": {"type": "LoadBalancer"}}'
Generate traffic by sending repeated requests to the productpage service using curl in a loop, which produces a significant number of metrics:
while true; do curl -s "http://<your-istio-ingressgateway-IP>/productpage"; done
By running the following query, you will see a noticeable increase in the number of unique time series for each metric, reflecting the increased cardinality resulting from the load on the bookinfo application:
count by(__name__) ({__name__=~"istio.*"})
Figure 2: Prometheus Query
To further visualize cardinality, you can run this query:
count({__name__=~"istio.*"})
This query gives the total count of unique time series across all Istio metrics, representing the overall cardinality of Istio metrics. For the bookinfo example this returns 2,287 unique time series, and the number will grow drastically as you add more microservices.
Figure 3: Visualize Cardinality
High cardinality can lead to a larger memory footprint and require more computational resources for storing and processing metrics.
Managing High Volume Data with Amazon Managed Service for Prometheus
AMP is a fully managed Prometheus service that makes it easy to store, query, and analyze time-series data. It is designed to handle large volumes of data and provides automatic scaling and long-term retention for your metrics. With AMP, you can easily manage high volume data and simplify the cardinality of your Istio sidecar metrics.
Benefits of Amazon Managed Service for Prometheus
Efficient Resource Utilization:
- With AMP, resource consumption is optimized. It automatically scales storage and memory as needed, reducing the risk of higher costs, slow queries, and database instability, ensuring your monitoring system remains performant and cost-effective.
Simplified Data Interpretation:
- Despite high metric cardinality, AMP provides tools for efficient data querying and visualization. It offers features for label-based filtering, aggregation, and summarization, making it easier to understand and interpret complex data. This helps identify patterns and trends even within vast amounts of unique time-series.
Streamlined Management:
- Managing and maintaining your monitoring system becomes more straightforward with AMP. AWS handles the underlying infrastructure and maintenance tasks, reducing the risk of misconfigurations and missed alerts. This ensures a more robust and reliable monitoring environment, with less administrative overhead.
Optimized Query Performance:
- AMP is designed to handle high metric cardinality efficiently. Its scalable architecture and optimized query engine enable quick data retrieval and aggregation, enhancing the user experience. Operators can promptly diagnose and resolve issues, ensuring minimal disruption to services.
How to ingest metrics into AMP:
- To ingest sidecar metrics into AMP, we need to set up a Prometheus remote write endpoint that sends the Istio sidecar metrics to AMP.
- AMP provides a remote write endpoint that can be used to send metrics from any application that supports the Prometheus remote write protocol.
- This allows Istio to send metrics data to AMP, which can then be analyzed and visualized using the Prometheus query language and Grafana dashboards.
To get started with AMP, we need to create an AMP workspace. The workspace acts as a central location for storing and querying metrics data from our microservices. We can create an AMP workspace using the AWS Management Console or the AWS CLI. Once we have created an AMP workspace, we can configure our Prometheus instances to send metrics data to AMP.
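For example, a workspace can be created and checked from the AWS CLI (the alias is illustrative):
# Create an AMP workspace and note the returned workspaceId
aws amp create-workspace --alias istio-metrics
# Confirm the workspace is listed and ACTIVE
aws amp list-workspaces --alias istio-metrics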
Figure 4: Environment Setup
To send metrics to the workspace, we modify the Prometheus configuration to include the AMP remote write endpoint; the example below shows the relevant Helm chart values for the Prometheus server, including the IAM role used for SigV4 request signing.
serviceAccounts:
  server:
    name: amp-iamproxy-ingest-service-account
    annotations:
      eks.amazonaws.com/role-arn: ${IAM_PROXY_PROMETHEUS_ROLE_ARN}
server:
  remoteWrite:
    - url: https://aps-workspaces.${REGION}.amazonaws.com/workspaces/${WORKSPACE_ID}/api/v1/remote_write
      sigv4:
        region: ${REGION}
      queue_config:
        max_samples_per_send: 1000
        max_shards: 200
        capacity: 2500
By configuring our Prometheus instances to send metrics data to AMP, we can achieve long-term retention of our metrics data, simplifying the management of our metrics infrastructure.
We can also query the metrics data using the Prometheus Query Language (PromQL), either through Grafana (for example, Amazon Managed Grafana) or directly against the workspace’s query API, and build custom dashboards and visualizations to gain insights into the performance of our microservices.
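Outside of Grafana, the workspace’s query endpoint can also be called directly from any SigV4-capable client. A sketch using the awscurl tool (region, workspace ID, and query are placeholders):
# Instant query against the AMP workspace query API
awscurl --service aps --region ${REGION} \
  "https://aps-workspaces.${REGION}.amazonaws.com/workspaces/${WORKSPACE_ID}/api/v1/query?query=istio_requests_total"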
AMP for Long-Term Metrics Retention
AMP provides automatic scaling and long-term retention for your metrics data. This allows you to store metrics data for a longer period of time and perform historical analysis. With AMP, you can easily configure retention periods and store metrics data for up to 15 months. Long-term retention of metrics data is important for analyzing trends, detecting anomalies, and identifying performance issues over time.
In addition, AMP provides built-in integrations with other AWS services, such as Amazon CloudWatch and Amazon OpenSearch Service. This makes it easier to visualize and analyze your metrics, and to set up alerts and notifications based on specific metric values or trends.
Conclusion
In conclusion, reducing the cardinality of Istio metrics is important for improving the observability of your microservices-based applications. By carefully selecting the metrics you want to collect and using AMP, you can reduce the operational overhead of managing and scaling a Prometheus deployment, and make it easier to analyze and understand the performance of your Istio-based applications.
Resources:
- Istio Docs
- Istio Metrics and Logs FAQ
- How to Ingest Metrics into AMP
- Istio Observability Best Practices