AWS Cloud Operations Blog

GPU Cost Attribution in Amazon EKS Using Amazon Managed Service for Prometheus, Amazon Managed Grafana, and OpenTelemetry

As organizations scale their AI and machine learning workloads on Amazon Elastic Kubernetes Service (Amazon EKS), GPU instances often represent the largest portion of compute costs. Without granular visibility into how these resources are consumed, teams struggle to attribute costs accurately. Consider a shared EKS cluster where Team A (Research) runs experimental ML models on […]

Transfer AWS accounts between AWS Organizations while preserving AWS Lake Formation permissions

Many AWS customers move their AWS accounts between organizations. Whether your company manages more than one organization and regularly moves accounts between them, or you are consolidating accounts after a merger, acquisition, or divestiture, account migrations are part of operating on AWS. Previously, moving an account meant removing it from the source organization, making it standalone, then inviting it to the target organization. For accounts with AWS Resource Access […]

Introducing native histogram support in Amazon Managed Service for Prometheus

If you run Kubernetes or microservices workloads on AWS, you probably track latency, request durations, and other value distributions with Prometheus histograms. To do that with classic histograms, you predefine a set of bucket boundaries, and Prometheus emits one time series per boundary plus a sum and a count. A single latency histogram with 20 […]

Build a Multi Account Patch Compliance Dashboard with Kiro Specs

Robust patch management is essential for maintaining system security, reliability, and compliance across your IT infrastructure. AWS Systems Manager Patch Manager provides a full-featured patching solution, enabling you to automate the deployment of operating system updates to managed nodes across AWS accounts, on-premises, and multicloud environments. However, as your organization scales across dozens or […]

From Monolith to Multi-Account: Pinterest’s AWS Organization Transformation Journey

Pinterest launched in 2009 with a mission to bring everyone the inspiration to create a life they love. As one of the early cloud pioneers, Pinterest grew to hundreds of thousands of resources and exabytes of data within a single AWS account well before most cloud-native organizations operated at that scale or the best […]

How Honeycomb improved resilience using AWS Fault Injection Service

Building resilience within cloud workloads is an important goal for ISVs to prevent application downtime, increase system reliability, and build customer trust. Honeycomb.io is a fast and collaborative observability platform for software developers and engineering teams to understand and troubleshoot their cloud-native applications. Honeycomb gives you the rich context at sub-second query speeds and AI-assisted […]

AWS Observability ICYMI: Jan-May 2026

Welcome to the first edition of the AWS Observability ICYMI (In Case You Missed It) recap! The first five months of 2026 have been transformational for AWS observability with over 40 launches across Amazon CloudWatch, AWS X-Ray, Amazon Managed Grafana, and Amazon Managed Service for Prometheus. Two major themes defined this period: OpenTelemetry as the […]

Import Historical data from AWS CloudTrail Lake to Amazon CloudWatch

Organizations managing workloads on AWS rely on AWS CloudTrail to answer the fundamental questions: Who did what, where, and when? Since January 2022, customers have stored their CloudTrail activity logs in CloudTrail Lake, a managed data lake purpose-built for capturing, storing, and querying user and API activity across their AWS environment. As organizations scale across multiple […]

Shift-Left Tag Compliance using AWS Organizations and Terraform

In this post you will learn about AWS Organizations tag policies, the tag_policy_compliance Terraform provider setting, a reusable tagging module that automatically applies required tags, and a test-driven approach that dynamically validates against your organizational policies.

Simplifying Prometheus metrics collection across your AWS infrastructure

If you’re running services such as Amazon EC2 instances, Amazon Elastic Container Service (Amazon ECS) containers, and Amazon Managed Streaming for Apache Kafka (Amazon MSK) clusters in AWS, maintaining separate Prometheus servers for each environment creates significant operational burden. Managing scraper configurations, high availability, scaling, and security distracts you from building great applications. AWS managed […]