Salesforce Commerce Cloud migrates from Self-hosted Prometheus to Amazon Managed Service for Prometheus

Introduction

Salesforce Commerce Cloud empowers thousands of retailers worldwide to create seamless shopping experiences. Behind these experiences lies a complex infrastructure that demands reliable monitoring at scale. As the platform evolved from static, first-party instances to dynamic cloud-based environments, the monitoring needs outgrew the self-managed Prometheus solution.

This post details Salesforce Commerce Cloud’s journey from a resource-intensive, self-hosted Prometheus and Thanos monitoring stack to Amazon Managed Service for Prometheus. Through the transition, Commerce Cloud achieved an approximately 40% reduction in direct AWS costs while eliminating maintenance overhead and improving system reliability. Commerce Cloud teams now focus on innovation rather than infrastructure management, with monitoring that scales across multiple Amazon Elastic Kubernetes Service (Amazon EKS) clusters and regions.

Understanding Salesforce Commerce Cloud Environments

Commerce Cloud Sandboxes

Traditionally, Commerce Cloud Sandboxes have been hosted on first-party instances that run continuously, 24/7. While dependable, this approach has limitations: scaling instances up or down is slow, leading to inefficiencies and higher operational costs. The public cloud’s dynamic provisioning capabilities allow resources to be allocated and de-allocated in real time, based on demand. This flexibility results in substantial cost savings and improved performance, because instances can be sized to actual usage rather than over-provisioned.

Commerce Intelligence Platform

Commerce Intelligence Platform aggregates and visualizes different data sources (orders, promotions, and visits) from the Commerce Cloud platform. With significant batch processing requirements, using Kubernetes and on-demand compute capacity from AWS brings advantages for handling fluctuating workloads.

From Monolithic Infrastructure to Agile Cloud Solutions

Commerce Cloud started with a single cluster hosting all customers on Amazon EKS, monitored by a self-hosted Prometheus setup with Thanos for long-term storage. As adoption grew, the system expanded rapidly, requiring increasingly large Amazon Elastic Compute Cloud (Amazon EC2) nodes to handle memory demands and longer scrape intervals to manage scale.

This solution had several limitations:

Data silos restricted to single clusters
No alerting across cluster boundaries
Occasional outages in the critical monitoring platform
Significant maintenance requirements for a small team
Increased overhead with expansion to multiple regions

As the regional presence broadened, managing multiple clusters became increasingly complex and resource-intensive.

The New Solution: Enhanced Monitoring and Efficiency

With Amazon Managed Service for Prometheus, Commerce Cloud addressed its key challenges: recurring monitoring failures during regional expansion, and long-term metric storage, which moved to the managed service so that local Prometheus retention could be reduced. This approach eliminates production outages and provides unified access to metrics across all clusters.
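Unified access means that a single PromQL query against one workspace can span every cluster that writes to it. The following is a minimal sketch of how such a query URL might be built, not Commerce Cloud’s actual tooling. The region, the workspace ID, and the `cluster` label are hypothetical; the sketch assumes each cluster attaches a `cluster` label to the metrics it sends. A real request would also need to be signed with AWS Signature Version 4, which is omitted here.

```python
from urllib.parse import urlencode


def amp_query_url(region: str, workspace_id: str, promql: str) -> str:
    """Build the Prometheus-compatible instant-query URL for a workspace.

    The returned URL is unsigned; sending it requires SigV4 signing.
    """
    base = f"https://aps-workspaces.{region}.amazonaws.com/workspaces/{workspace_id}"
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical values, for illustration only.
REGION = "us-east-1"
WORKSPACE_ID = "ws-EXAMPLE"

# Sum the `up` metric per cluster: the number of targets in each cluster
# whose most recent scrape succeeded. Assumes a `cluster` label exists.
PROMQL = "sum by (cluster) (up)"

url = amp_query_url(REGION, WORKSPACE_ID, PROMQL)
print(url)
```

Because every cluster reports into the same workspace, the same expression could also back an alert rule that compares clusters, which was not possible when each cluster’s data lived in its own Prometheus instance.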

In the final setup, Prometheus runs only in agent mode and is responsible solely for metric scraping, while alerting is handled in Grafana. This architecture significantly reduces operational overhead while improving reliability.

Figure 1 – Salesforce Commerce Cloud Architecture Diagram
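To make the agent-mode setup more concrete, here is a minimal sketch that renders one Prometheus configuration per cluster. It is an illustration under stated assumptions, not the configuration Commerce Cloud runs: the cluster names, region, workspace ID, scrape interval, and the `kubernetes-pods` job are all hypothetical. The `remote_write` block with `sigv4` is standard Prometheus configuration for sending samples to Amazon Managed Service for Prometheus. Agent mode itself is switched on by a command-line flag rather than in this file (`--enable-feature=agent` in Prometheus 2.x releases that support it).

```python
from textwrap import dedent


def agent_config(cluster: str, region: str, workspace_id: str,
                 scrape_interval: str = "60s") -> str:
    """Render a minimal Prometheus config that scrapes pods and forwards
    all samples to a managed workspace. No rule files are included,
    because agent mode does not evaluate alerting rules."""
    return dedent(f"""\
        global:
          scrape_interval: {scrape_interval}
          external_labels:
            cluster: {cluster}
        scrape_configs:
          - job_name: kubernetes-pods
            kubernetes_sd_configs:
              - role: pod
        remote_write:
          - url: https://aps-workspaces.{region}.amazonaws.com/workspaces/{workspace_id}/api/v1/remote_write
            sigv4:
              region: {region}
        """)


# Hypothetical cluster names; every cluster writes to the same workspace.
configs = {
    name: agent_config(name, "us-east-1", "ws-EXAMPLE")
    for name in ("sandbox-1", "sandbox-2")
}
print(configs["sandbox-1"])
```

The `external_labels` entry is the design choice that matters here: it stamps each sample with its source cluster, so metrics from many clusters can share one workspace and still be told apart at query time.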

Migration Experience and Lessons Learned

The migration to Amazon Managed Service for Prometheus was seamless. Initial concerns about metric ingestion limits proved unfounded, as the service scaled efficiently to support the growth. Drawing on its experience with self-hosted solutions, the team recommends two improvements to the service: a feature similar to the Prometheus Stats page, and streamlined service quota management that removes the challenge of forecasting requirements without historical data.

Amazon Managed Service for Prometheus by the Numbers

The Amazon Managed Service for Prometheus setup for On-Demand Sandboxes now manages 27 million active series, a figure that doubles during software releases and node patching. It ingests over 400,000 metric points per second from six different EKS clusters.

Figure 2 – On-Demand Sandboxes AMP metrics

Previously, due to the growing number of On-Demand Sandboxes, the self-hosted Prometheus required constant monitoring to ensure sufficient capacity, necessitating frequent resource increases and migrations to larger EC2 nodes. With Amazon Managed Service for Prometheus, metrics from all the clusters were consolidated without capacity planning concerns. The Commerce Intelligence system now manages 3 million active series and ingests over 25,000 metric points per second from five different EKS clusters.

Figure 3 – Commerce Intelligence AMP metrics
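The figures above can be combined into a few back-of-the-envelope numbers. The sketch below is an estimate, not data from the post: it assumes steady-state ingestion with every active series sampled at the same rate, and it treats the quoted ingestion rates as exact even though the post gives them as lower bounds ("over"), so the derived intervals are upper bounds.

```python
SECONDS_PER_DAY = 86_400


def avg_sample_interval_s(active_series: int, samples_per_second: int) -> float:
    """Average seconds between samples of one series, assuming every
    active series is sampled at the same steady rate."""
    return active_series / samples_per_second


# On-Demand Sandboxes: 27 million series, over 400,000 samples per second.
sandbox_interval = avg_sample_interval_s(27_000_000, 400_000)
sandbox_daily_samples = 400_000 * SECONDS_PER_DAY
# Active series double during software releases and node patching.
sandbox_peak_series = 2 * 27_000_000

# Commerce Intelligence: 3 million series, over 25,000 samples per second.
intelligence_interval = avg_sample_interval_s(3_000_000, 25_000)
intelligence_daily_samples = 25_000 * SECONDS_PER_DAY

print(f"Sandboxes: ~{sandbox_interval} s between samples, "
      f"{sandbox_daily_samples:,} samples/day, "
      f"{sandbox_peak_series:,} series at peak")
print(f"Intelligence: ~{intelligence_interval} s between samples, "
      f"{intelligence_daily_samples:,} samples/day")
```

Under these assumptions, the peak series count is the number to plan quotas around, since it is twice the steady-state figure.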

This transition resulted in an approximately 40% reduction in direct AWS costs, not including the significant savings in engineering time previously spent maintaining the monitoring infrastructure.

How Amazon Managed Service for Prometheus Helps Commerce Cloud Innovate Faster

Amazon Managed Service for Prometheus has significantly reduced maintenance overhead. At Salesforce, trust is the #1 value, so keeping all software updated is essential. Previously, with Thanos, each team had to implement updates individually, which occupied one engineer for a full day each month, or longer when an update involved breaking changes. Since adopting Amazon Managed Service for Prometheus, this maintenance effort has disappeared, freeing engineering capacity for customer-focused innovation. The monitoring infrastructure is now more scalable and reliable, facilitating smoother operations and better service delivery.

Next Steps

AWS recently released managed scraping support for Amazon Managed Service for Prometheus, which eliminates the need for monitoring-related components in the clusters themselves. Commerce Cloud plans to implement this capability to further reduce its operational overhead.

Conclusion

Migrating to Amazon Managed Service for Prometheus has transformed Commerce Cloud’s monitoring capabilities while reducing costs and operational overhead. The service’s scalability, reliability, and seamless integration with the rest of the AWS infrastructure have enabled Commerce Cloud to focus on innovation rather than infrastructure management. For organizations facing monitoring challenges at scale, Amazon Managed Service for Prometheus offers a compelling solution that combines Prometheus functionality with the convenience of a fully managed service.

AWS Cloud Operations Blog
