AWS Cloud Operations & Migrations Blog

Category: Compute

Monitoring GPU workloads on Amazon EKS using AWS managed open-source services

As machine learning (ML) workloads continue to grow in popularity, many customers are looking to run them on Kubernetes with graphics processing unit (GPU) support. Amazon Elastic Compute Cloud (Amazon EC2) instances powered by NVIDIA GPUs deliver the scalable performance needed for fast ML training and cost-effective ML inference. Monitoring GPU utilization gives valuable information for researchers working […]

Announcing Container Insights with Enhanced Observability for Amazon EKS

Container Insights is Amazon’s fully-managed monitoring and observability service that provides DevOps engineers, developers, SREs, and IT managers with out-of-the-box visibility into their containerized applications and microservice environments. With Container Insights, you can monitor, isolate, and diagnose issues in your Kubernetes clusters with minimal effort. It delivers infrastructure telemetry like CPU, memory, network, and disk […]

Scaling GitHub usage with AWS

Introduction Customers that migrate on-premises enterprise applications to AWS often look for guidance on how to migrate GitHub to AWS. Customers find it challenging to scale as they are constrained by on premises GitHub infrastructure. Organisations that run Github on AWS can get up and running quickly. GitHub on AWS enables teams to collaborate efficiently […]

Using Tag-Based Filtering to Manage AWS Health Monitoring and Alerting at Scale

AWS provides customers regular updates of service notifications and planned activities via e-mail to the root account owners or the operational, security and billing contacts. AWS also provides granular notifications to customers via AWS Health allowing them to fine-tune their alerts on issues relating directly to them. Alongside Health Dashboard’s monitoring capabilities, customers can also […]

Centralize image administration for virtual machines and containers using EC2 Image Builder

Customers may have different processes for image building across virtual machines, containers, or both. This variation in processes introduces operational overhead in managing images, including the initial configuration and the ongoing updates. From the AWS Well-Architected Operational Excellence Pillar, section “Document and share lessons learned”, these images should be standardized, configured with the latest patches, […]

Estimating Total Cost of Ownership (TCO) for modernizing workloads on AWS using Containerization – Part 2

Introduction Part one of this series described the methodology used to calculate the TCO for containerization and we covered the first scenario of estimating TCO with server inventory information. In the second part we focus on second scenario where we will estimate TCO with application level information. Scenario 2: Estimating TCO with only application level […]

Monitor Amazon EKS Control Plane metrics using AWS Open Source monitoring services

Have you encountered situations where your Kubernetes API calls are constantly throttled by the control plane? Did you see the 429 HTTP response code “Too many requests” all over the place and have no clue on what’s wrong with your cluster? In this blog post, we will talk about monitoring some of the key metrics […]

Auto-remediate best practice deviations detected by AWS Trusted Advisor

AWS Trusted Advisor inspects your AWS infrastructure and provides best practice recommendations when opportunities exist to reduce cost, optimize your AWS infrastructure, improve system availability and performance, help close security gaps and monitor service quotas. Trusted Advisor recommendations are based on best practices identified by AWS services experts and learnings from serving thousands of customers […]

Estimating Total Cost of Ownership (TCO) for modernizing workloads on AWS using Containerization – Part 1

Introduction When you migrate your on-premises applications to the cloud, you can use a cloud migration strategy. AWS supports the seven most common migration strategies, “The 7 Rs”. Which approach makes sense for a specific workload is situational and depends on that organization’s business drivers and strategy. Understanding the total cost of ownership (TCO) is […]

Using Lambda-backed Custom Resources to Reduce Overhead in a Multi-Account Environment

Using Lambda-backed Custom Resources to Reduce Overhead in a Multi-Account Environment

Introduction Many of my customers use AWS CloudFormation to streamline provisioning operations for AWS and third-party resources, that they describe with code in JSON- or YAML-formatted CloudFormation templates. Some workloads require custom logic or inputs beyond standard parameter values. For these scenarios, an often overlooked and useful CloudFormation feature lies in AWS Lambda-backed custom resources. With Lambda-backed custom […]