AWS Open Source Blog
Improving HA and long-term storage for Prometheus using Thanos on EKS with S3
Prometheus is an open source systems monitoring and alerting toolkit that is widely adopted as a standard monitoring tool with self-managed and provider-managed Kubernetes. Prometheus provides many useful features, such as dynamic service discovery, powerful queries, and seamless alert notification integration. Beyond a certain scale, however, problems arise when basic Prometheus capabilities do not meet requirements such as:
- Storing petabyte-scale historical data in a reliable and cost-efficient way
- Accessing all metrics using a single-query API
- Merging replicated data collected via Prometheus high-availability (HA) setups
Thanos was built in response to these challenges. Thanos, which is released under the Apache 2.0 license, offers a set of components that can be composed into a highly available Prometheus setup with long-term storage capabilities. Thanos uses the Prometheus 2.0 storage format to cost-efficiently store historical metric data in object storage, such as Amazon Simple Storage Service (Amazon S3), while retaining fast query latencies. In summary, Thanos is intended to provide:
- Global query view of metrics
- Virtually unlimited retention of metrics, including downsampling
- High availability of components, including support for Prometheus HA
In this post, we’ll learn how to implement Thanos for HA and long-term storage for Prometheus metrics using Amazon S3 on an Amazon Elastic Kubernetes Service (Amazon EKS) platform.
Overview of solution
Thanos is an open source project that integrates with a Prometheus deployment, enabling a highly available metrics system with long-term, scalable storage. For a simple setup, we can get started with three new Thanos components:
- Thanos Sidecar: The sidecar runs alongside every Prometheus instance and uploads Prometheus data to object storage (an S3 bucket in our case) every two hours. It also serves real-time metrics that have not yet been uploaded to the bucket.
- Thanos Store: Store serves metrics from Amazon S3 storage.
- Thanos Querier: Querier has a user interface similar to that of Prometheus and handles the Prometheus query API. Querier queries Store and Sidecar to return the relevant metrics. If multiple Prometheus instances are set up for HA, it can also de-duplicate the metrics.
We can also install Thanos Compactor, which applies the compaction procedure to Prometheus block data stored in an S3 bucket. It is also responsible for downsampling data.
Prerequisites
This guide has the following requirements:
- An AWS account with adequate permissions to operate IAM roles, IAM policy, Amazon EKS, and Amazon S3.
- A running Amazon EKS cluster (Kubernetes 1.13 or above).
- Prometheus or Prometheus Operator Helm Chart installed (v2.2.1+).
- Helm 3.x.
- Working knowledge of Kubernetes and using kubectl.
- AWS Command Line Interface (AWS CLI) with at least version 1.18.86 or 2.0.25.
- eksctl version 0.22.0 or above.
- All Thanos components installed in the same Kubernetes namespace as Prometheus.
- Clone the Kubernetes manifests for Thanos Querier and Store Deployment steps:
git clone -b release-0.12 https://github.com/thanos-io/kube-thanos.git
- Thanos Compact manifests.
All instructions in this document use Prometheus Operator chart version 8.15.6.
Deployment overview
Before beginning the Thanos deployment, we configure an S3 bucket to use as object storage and create the IAM policy required to access this bucket.
To deploy the Thanos components, we complete the following:
- Enable Thanos Sidecar for Prometheus.
- Deploy Thanos Querier with the ability to talk to Sidecar.
- Confirm that Thanos Sidecar is able to upload Prometheus metrics to our S3 bucket.
- Deploy Thanos Store to retrieve metrics data stored in long-term storage (in this case, our S3 bucket).
- Set up Thanos Compactor for data compaction and downsampling.
Configure S3 bucket and IAM policy
- To store metric data, create an S3 bucket in an AWS Region local to the Prometheus environment, using the console or API-based mechanisms.
- Create an IAM policy that grants access to this bucket, and attach it to the IAM role backing the ServiceAccount used by the Prometheus pod. A sketch of both steps follows this list.
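The original policy document is not reproduced in this post. A minimal sketch of both steps using the AWS CLI follows; the bucket name thanos-metrics-demo, the Region, and the policy name thanos-s3-access are placeholders, and the policy grants only the S3 actions Thanos needs on that bucket:
# Create the S3 bucket (name and Region are examples)
aws s3api create-bucket --bucket thanos-metrics-demo --region us-east-1
# Create an IAM policy scoped to that bucket
cat > thanos-s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": [
        "arn:aws:s3:::thanos-metrics-demo",
        "arn:aws:s3:::thanos-metrics-demo/*"
      ]
    }
  ]
}
EOF
aws iam create-policy --policy-name thanos-s3-access --policy-document file://thanos-s3-policy.json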
Next, create an Amazon EKS cluster using the configuration below. The cluster configuration does the following (a sample eksctl configuration file follows this list):
- Provisions the cluster with Kubernetes version 1.16 and one managed node group.
- Creates an IAM OIDC provider to enable fine-grained permission management for applications running on Amazon EKS that use other AWS services.
- Creates the monitoring namespace on the provisioned Amazon EKS cluster and the prometheus-prometheus-oper-prometheus ServiceAccount used to run the Prometheus pod.
- Maps the IAM policy to the ServiceAccount role to provide the required permissions on the S3 bucket storing Thanos metric data.
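The configuration file itself is not reproduced in this post. The following is a minimal sketch of what eks-cluster-config.yaml could look like, assuming the cluster name thanosdemo used later in this post, the policy created above, and placeholder Region, account ID, and node group settings:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: thanosdemo
  region: us-east-1
  version: "1.16"
iam:
  withOIDC: true
  serviceAccounts:
    - metadata:
        name: prometheus-prometheus-oper-prometheus
        namespace: monitoring
      attachPolicyARNs:
        - arn:aws:iam::111122223333:policy/thanos-s3-access   # placeholder account ID and policy name
managedNodeGroups:
  - name: thanosdemo-ng
    instanceType: m5.large
    desiredCapacity: 3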
Next, we complete the following steps:
1. Create the Amazon EKS cluster with the configuration stored in the file eks-cluster-config.yaml:
eksctl create cluster -f eks-cluster-config.yaml
2. After eksctl completes provisioning the cluster, verify the cluster health:
kubectl get nodes
3. Verify the IAM OIDC provider:
aws eks describe-cluster --name thanosdemo --query "cluster.identity.oidc.issuer"
4. Verify the ServiceAccount created for the Prometheus pod in the monitoring namespace:
kubectl describe serviceaccount prometheus-prometheus-oper-prometheus -n monitoring
Installing Helm CLI
Before we can get started, let’s install the Helm CLI and configure the Helm chart repository. Complete the following steps; example commands are shown after the list:
1. Install the Helm CLI:
2. Verify the Helm version:
3. Configure the chart repository:
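The original commands are not reproduced in this post. One common sequence for these three steps is shown below; the installer script URL and the stable chart repository URL are assumptions (the stable repository has moved to https://charts.helm.sh/stable since this chart version was current):
# 1. Install the Helm 3 CLI
curl -fsSL https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
# 2. Verify the Helm version
helm version
# 3. Configure the chart repository that hosts prometheus-operator
helm repo add stable https://charts.helm.sh/stable
helm repo update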
Installing and configuring Prometheus and Thanos
1. Get the prometheus-operator chart default configuration values by running the following command:
helm show values stable/prometheus-operator > values_default.yaml
2. The prometheus-operator chart creates the Kubernetes resources required to run Prometheus as part of the installation. We must disable ServiceAccount creation for the Prometheus pod, because the ServiceAccount prometheus-prometheus-oper-prometheus was already created during the cluster install. Set create: false and add the ServiceAccount name under the Deploy a Prometheus Instance section of the values_default.yaml file:
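The exact snippet is not reproduced in this post. With prometheus-operator chart 8.x, the relevant part of values_default.yaml looks roughly like the following (key names can vary between chart versions):
prometheus:
  serviceAccount:
    create: false
    name: prometheus-prometheus-oper-prometheus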
3. Add the Thanos Sidecar configuration by replacing the empty thanos: {} block in values_default.yaml:
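The original snippet is not reproduced in this post. A minimal sketch of the sidecar configuration under prometheus.prometheusSpec is shown below; the image tag and the secret name and key are assumptions and must match the secret created in step 5:
prometheus:
  prometheusSpec:
    thanos:
      image: quay.io/thanos/thanos:v0.12.2
      version: v0.12.2
      objectStorageConfig:
        key: thanos.yaml
        name: thanos-objstore-config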
4. Configure objectStorageConfig using the configuration file thanos-storage-config.yaml:
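The file contents are not reproduced in this post. A minimal sketch of thanos-storage-config.yaml for an S3 bucket is shown below; the bucket name and Region are placeholders, and credentials come from the pod's IAM role rather than static keys:
type: S3
config:
  bucket: thanos-metrics-demo
  endpoint: s3.us-east-1.amazonaws.com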
Note: Learn more about additional object storage configuration options in the Thanos documentation.
5. Create a Kubernetes secret from the storage configuration file:
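The original command is not shown in this post. Assuming the secret name and key referenced in the sidecar configuration above, it would look like:
kubectl -n monitoring create secret generic thanos-objstore-config --from-file=thanos.yaml=thanos-storage-config.yaml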
6. Install the prometheus-operator chart, which deploys the Prometheus pod with the Thanos Sidecar:
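The original command is not shown in this post. The release name prometheus is an assumption inferred from the pod and ServiceAccount names used elsewhere in this walkthrough:
helm install prometheus stable/prometheus-operator -f values_default.yaml --namespace monitoring --version 8.15.6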
7. Check the status of the Prometheus pod and Thanos Sidecar:
kubectl get po -n monitoring -l app=prometheus
8. Check the status of the Thanos Sidecar container in the Prometheus pod:
kubectl describe pod prometheus-prometheus-prometheus-oper-prometheus-0 -n monitoring
In the output, confirm that the Thanos Sidecar container is present and running.
Deploy Thanos Querier
Thanos Querier assists in retrieving metrics from all Prometheus instances. It can be used with Grafana because it is compatible with the original Prometheus PromQL and HTTP APIs.
1. Add the metric store configuration in thanos-query-deployment.yaml under the query container's args section (spec.template.spec.containers.args):
--store=thanos-store.monitoring.svc.cluster.local:10901
--store=prometheus-operated.monitoring.svc.cluster.local:10901
The preceding store configuration adds the Thanos Store service, which retrieves historical metric data from object storage (our S3 bucket), and the Prometheus service for the latest metric data. We will deploy the Thanos Store service in the next step.
2. Apply the Query deployment, service, and serviceMonitor manifests to create Kubernetes objects:
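The original command is not shown in this post. The file names below are assumptions based on the kube-thanos naming convention; adjust them to match the files in the cloned repository:
kubectl apply -f thanos-query-deployment.yaml -f thanos-query-service.yaml -f thanos-query-serviceMonitor.yaml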
Deploy Thanos Store
Thanos Store works with the Querier to retrieve historical data from the configured bucket.
Make the following changes in the Thanos Store configuration files:
1. To enable S3 bucket access, add serviceAccountName: prometheus-prometheus-oper-prometheus to spec.template.spec in thanos-store-statefulSet.yaml.
2. Change the spec.template.spec.containers.env in thanos-store-statefulSet.yaml so that the container can read the object storage configuration, as sketched below.
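The replacement values are not reproduced in this post. A plausible sketch, assuming the kube-thanos manifests read the object storage configuration from an OBJSTORE_CONFIG environment variable and that it should point at the secret created earlier:
env:
  - name: OBJSTORE_CONFIG
    valueFrom:
      secretKeyRef:
        name: thanos-objstore-config
        key: thanos.yaml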
3. Apply the Store statefulSet, service, and serviceMonitor manifests:
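As with the Querier, the service and serviceMonitor file names are assumptions based on the kube-thanos naming convention:
kubectl apply -f thanos-store-statefulSet.yaml -f thanos-store-service.yaml -f thanos-store-serviceMonitor.yaml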
Deploy Thanos Compactor
Thanos Compactor applies compaction and downsampling to historical data. It needs local disk space to store intermediate data during processing.
Make the following changes in Thanos Compactor configuration files:
1. To enable S3 bucket access, add serviceAccountName: prometheus-prometheus-oper-prometheus to spec.template.spec in thanos-compact-statefulSet.yaml.
2. Change the spec.template.spec.containers.env in thanos-compact-statefulSet.yaml in the same way as for the Store StatefulSet, so that the compactor can read the object storage configuration.
3. Apply the Compact statefulSet, service, and serviceMonitor manifests:
kubectl apply -f thanos-compact-statefulSet.yaml -f thanos-compact-service.yaml -f thanos-compact-serviceMonitor.yaml
4. Check the status of all Thanos components:
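The original command is not shown in this post. One simple way to check, assuming all Thanos components run in the monitoring namespace:
kubectl -n monitoring get pods | grep thanos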
Configure Thanos as Grafana data source
To start viewing metric data in the Grafana UI, we can add the Thanos Querier service as one of the data sources. Do so by going to Grafana, Configuration, Data Sources, Add data source.
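Because Querier implements the Prometheus HTTP API, it can be added as a Prometheus-type data source. The URL depends on how the Querier service is exposed; the example below assumes the kube-thanos default service name and HTTP port in the monitoring namespace, so check the thanos-query Service in your cluster:
http://thanos-query.monitoring.svc.cluster.local:9090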
Cleaning up
To avoid incurring future charges, delete the resources created in this walkthrough. Use the following steps to clean up the Thanos environment (example commands are shown after the list):
1. Remove Thanos Querier, Store, and Compactor:
2. Remove the Thanos Sidecar by removing the sidecar configuration added earlier from values_default.yaml, then apply the change:
3. Delete the S3 bucket being used for storing metric data:
4. To delete the EKS cluster:
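The original commands are not reproduced in this post. A sketch of the cleanup, assuming the manifest, release, bucket, and cluster names used in the earlier examples:
# 1. Remove Thanos Querier, Store, and Compactor
kubectl delete -f thanos-query-deployment.yaml -f thanos-query-service.yaml -f thanos-query-serviceMonitor.yaml
kubectl delete -f thanos-store-statefulSet.yaml -f thanos-store-service.yaml -f thanos-store-serviceMonitor.yaml
kubectl delete -f thanos-compact-statefulSet.yaml -f thanos-compact-service.yaml -f thanos-compact-serviceMonitor.yaml
# 2. After removing the Thanos Sidecar configuration from values_default.yaml, apply the change
helm upgrade prometheus stable/prometheus-operator -f values_default.yaml --namespace monitoring
# 3. Delete the S3 bucket used for metric data (bucket name is a placeholder)
aws s3 rb s3://thanos-metrics-demo --force
# 4. Delete the EKS cluster
eksctl delete cluster -f eks-cluster-config.yaml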
Costs
Thanos enables users to archive metric data from Prometheus in an object store such as Amazon S3, which provides virtually unlimited storage for our monitoring system. In terms of cost, Thanos adds the price of storing data in and querying data from object storage, plus the cost of running the store node, to the existing Prometheus setup. The querier, compactor, and rule nodes require additional compute, although this is partly offset because the Prometheus servers no longer do the same work themselves.
Data that a typical Prometheus setup accesses locally travels over the network with Thanos. However, data transfer between Amazon S3 and Thanos within the same AWS Region is free.
For metric data, Prometheus uses an average of one to two bytes per sample for storage. If we store around 100,000 samples per day at two bytes each using Thanos, the storage consumption is around 196 KB on Amazon S3 (100,000 samples × 2 bytes = 200,000 bytes), which costs less than 0.05 USD per day. The cost of retrievals by the store node depends on individual querying patterns; you can add around 20% to the total storage cost as an estimate for retrieval costs.
Applying appropriate downsampling, resolution, and retention policies on Thanos object storage allows further optimization.
Conclusion
In this blog post, we explored how to transform Prometheus into a more robust monitoring system. Using Thanos with Prometheus enables us to scale Prometheus horizontally. By using open source Thanos components and Amazon S3, we get a global query view, virtually unlimited retention, and high availability for our metrics.