AWS Cloud Operations Blog
Category: Compute
Getting insights from Amazon Managed Service for Prometheus using natural language powered by Amazon Bedrock
As applications scale, customers need more automated practices to maintain application availability and reduce the time and effort spent detecting, debugging, and resolving operational issues. Organizations allocate money and developer time to deploy and manage various monitoring tools, while also dedicating considerable effort to training teams on their usage. When issues arise, operators navigate through […]
Visualize AWS Systems Manager Patch Manager information using Amazon QuickSight
In this blog post, learn how to build an Amazon QuickSight dashboard to visualize critical patch and inventory information to speed up MTTR. Also, you can use filters to search for a specific AWS Account, specific AWS Region, Amazon Elastic Compute Cloud (Amazon EC2) name, or check installed/missed packages. You want to visualize system patching […]
Leverage Amazon Q to upgrade Lambda runtime functions
Cloud operations are at the heart of every organization. Operating in the cloud allows IT teams to focus on business outcomes, optimizing IT processes while accelerating software development and innovation. These days, it is no longer a question if your organization is moving to the cloud, but how quickly you can move with security and […]
Use AWS Systems Manager Automation runbooks to resolve Elastic Block Store related operational tasks
Customers have been using various forms of automation for years to define a sequence of actions on Amazon Elastic Block Store (EBS). While before, customers were facing operational overhead related to EBS tasks, AWS Systems Manager (SSM) Automations can now be leveraged to meet a wide variety of customer use cases. In this blog post, a […]
Automate your Multicloud operations with AWS Systems Manager and AWS Lambda
A multicloud strategy presents various challenges, including observing and managing applications and infrastructure across multiple cloud platforms. Maintaining consistent tooling for visualizing operational data and automating actions helps organizations address this challenge. Amazon CloudWatch and AWS Systems Manager are two services that provide unified monitoring, observability, and automation capabilities for workloads deployed on AWS, on-premises, […]
Developing an AWS Service Catalog self-managed engine for governance
AWS Service Catalog lets you centrally manage your cloud resources to achieve governance at scale of your Infrastructure as Code (IaC) templates. AWS Service Catalog supports AWS CloudFormation natively and allows customers to use other IaC such as Terraform Community and Terraform Cloud via Service Catalog reference engine. We often hear customers asking how to […]
How to perform Failover and Failback using AWS Elastic Disaster Recovery (AWS DRS) between VMware and AWS environments
Enterprises face a variety of threats such as natural disasters, cyber-attacks and technology failures that could severely disrupt operations. A comprehensive disaster recovery plan is crucial to quickly respond and recover from these events. In this blog post, we’ll show how to plan and implement a comprehensive disaster recovery solution between your VMware on-premises environment […]
Understanding AWS High Availability and Replication for vSphere Administrators
Introduction vSphere HA is a fundamental and frequently used feature of vSphere. If any of several failure scenarios occur, it restarts a virtual machine. The failure scenarios range from VM or host crashes to unresponsive hosts (for example, due to network isolation or outage). Translating vSphere High Availability (HA) to the public cloud can be […]
Optimize your cloud deployments with Prioritized Trusted Advisor recommendations in your operational workflows
AWS Trusted Advisor Priority helps you focus on the most important recommendations for optimizing your cloud deployments, improving resilience, and addressing security gaps. As an AWS Enterprise Support customer, you gain access to prioritized and context-driven recommendations, curated both by your AWS account team and machine-generated checks from AWS services. Note: AWS Trusted Advisor Priority […]
Gain operational insights for NVIDIA GPU workloads using Amazon CloudWatch Container Insights
As machine learning models grow more advanced, they require extensive computing power to train efficiently. Many organizations are turning to GPU-accelerated Kubernetes clusters for both model training and online inference. However, properly monitoring GPU usage is critical for machine learning engineers and cluster administrators to understand model performance and to optimize infrastructure utilization. Without visibility […]