AWS Cloud Operations Blog

Reimagine AIOps with Amazon CloudWatch Investigations and Amazon Nova Sonic

Reimagine AIOps with Amazon CloudWatch Investigations and Amazon Nova Sonic in Amazon Bedrock to transform how cloud operations teams handle incidents. Traditional monitoring approaches require engineers to navigate multiple complex dashboards, analyze extensive logs, and manually execute remediation steps—a process that becomes particularly challenging during after-hours incidents or when away from workstations. When minutes matter and business continuity is at stake, this manual approach creates critical inefficiencies in incident detection and resolution.

Next-generation AIOps (Artificial Intelligence for IT Operations) addresses these challenges by applying AI to automate and enhance IT operations processes. By integrating Amazon CloudWatch Investigations with Amazon Nova Sonic in Amazon Bedrock, we can transform incident management through speech-to-speech interactions. This solution enables operations teams to detect, analyze, and resolve issues through natural voice conversation, significantly reducing the time spent navigating between different monitoring tools during critical outages.

This post demonstrates how to implement an AIOps solution using CloudWatch Investigations and Amazon Nova Sonic in Amazon Bedrock. You’ll learn how to configure CloudWatch alarms for automated investigations and integrate them with Amazon Nova Sonic to create a conversational AI interface that reduces mean time to resolution (MTTR). We’ll walk through a sample scenario showing how this integration transforms incident management from complex technical procedures into an intuitive dialogue experience for DevOps engineers, SREs, and cloud operations managers.

Overview of Solution

This solution enables troubleshooting of infrastructure operations through real-time speech-to-speech conversation. The “Operational Investigation” system integrates Amazon CloudWatch Investigations with Amazon Nova Sonic in Amazon Bedrock to automatically detect infrastructure issues, investigate problems, and execute remediation actions through speech interactions.

The “Sample Application” provides a simple test environment to demonstrate the solution’s capabilities. When its monitored AWS resources trigger an alarm, the solution starts a CloudWatch investigation. The investigation sends its insight messages to an Amazon SNS (Simple Notification Service) topic, where they are processed by an AWS Lambda function. The function stores the processed messages in an Amazon S3 bucket, and Nova Sonic later queries them as context to provide speech-to-speech assistance for real-time incident analysis and response.
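
To make the SNS-to-S3 step more concrete, here is a minimal sketch of what such a Lambda function could look like. It is not the code deployed by the sample repository: the key layout, the `MESSAGES_BUCKET` environment variable, and the injectable `put_object` parameter are assumptions made for illustration. Only the `Records[].Sns` event shape and the `boto3` `put_object` call are standard.

```python
import json


def extract_findings(event):
    """Pull the investigation messages out of an SNS-triggered Lambda event."""
    findings = []
    for record in event.get("Records", []):
        sns = record.get("Sns", {})
        raw = sns.get("Message", "")
        try:
            body = json.loads(raw)       # structured payloads are kept as JSON
        except (TypeError, ValueError):
            body = {"text": raw}         # plain-text payloads are wrapped
        findings.append({
            "message_id": sns.get("MessageId", "unknown"),
            "timestamp": sns.get("Timestamp"),
            "body": body,
        })
    return findings


def object_key(finding, prefix="investigations/"):
    """Hypothetical key layout: one JSON object per SNS message."""
    return f"{prefix}{finding['message_id']}.json"


def handler(event, context, put_object=None):
    """Lambda entry point; put_object is injectable so the logic runs offline."""
    if put_object is None:
        import os
        import boto3  # imported lazily; only needed when running in AWS

        bucket = os.environ["MESSAGES_BUCKET"]  # assumed variable name
        s3 = boto3.client("s3")

        def put_object(key, data):
            s3.put_object(Bucket=bucket, Key=key, Body=data)

    findings = extract_findings(event)
    for finding in findings:
        put_object(object_key(finding), json.dumps(finding).encode("utf-8"))
    return {"stored": len(findings)}
```

Keeping the parsing separate from the upload means the message handling can be exercised locally with a fake `put_object` before anything is deployed.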

Figure – Solution architecture diagram

Solution Walkthrough

Prerequisites:

Deploy the Solution

  • Clone the solution’s code repository

Run the following CLI command:

git clone https://github.com/aws-samples/sample-intelligent-ops
  • Deploy the AWS CloudFormation template to create the test infrastructure

Run the following CLI command:

aws cloudformation deploy \
--template-file cf-template.yml \
--stack-name intelligent-operations-test-stack \
--capabilities CAPABILITY_IAM \
--region us-east-1
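
Later steps rely on values from the stack’s Outputs section, such as the API URL and ‘MessagesBucketName’. If you prefer to read them from the command line instead of the console, the sketch below turns the JSON printed by `aws cloudformation describe-stacks` into a simple dictionary. The helper name `stack_outputs` is our own. ‘MessagesBucketName’ is the only output key named in this post, so check the real output for any other key.

```python
import json


def stack_outputs(describe_stacks_json):
    """Map OutputKey -> OutputValue for the first stack in a describe-stacks response."""
    response = json.loads(describe_stacks_json)
    stacks = response.get("Stacks", [])
    if not stacks:
        raise ValueError("no stacks in response")
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stacks[0].get("Outputs", [])
    }
```

For example, save the output of `aws cloudformation describe-stacks --stack-name intelligent-operations-test-stack --region us-east-1` to a file, read it into a string, and pass that string to `stack_outputs`.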

Set up Amazon Q Developer in Chat Applications (AWS Chatbot)

In the AWS Console, set up Amazon Q Developer in chat applications (previously AWS Chatbot) so that investigation findings are forwarded through an SNS topic integration. If you already have a chat client configured and want to reuse it, go to the ‘Configured clients’ setting in the Amazon Q Developer in chat applications console and edit the ‘Notifications’ section so that it includes the ‘cwIntegrationTopic’ SNS topic; in that case you can skip the numbered steps that follow. See configure integration between notifications and AWS Chatbot for more details.

  1. In the AWS Console, go to the Amazon Q Developer in chat applications (previously AWS Chatbot) console.
  2. Create a new client by clicking the ‘Configure new client’ button.
  3. Select the type of client you prefer from the drop-down menu (Amazon Chime, Microsoft Teams, or Slack) and click the ‘Configure’ button.
  4. Select your Chime/Teams/Slack workspace and allow AWS to authenticate for configuration. You may be redirected to the provider’s site to complete the authentication steps. See Getting started with Amazon Q Developer in chat applications for more details about how to set up each type of workspace.
  5. In the AWS Console, go to the newly created client workspace under ‘Configured clients’ and click the ‘Configure new channel’ button to start the chat channel configuration flow.
  6. In the configuration flow, complete the basic fields in the ‘Configuration details’ and ‘Slack/Teams/Chime channel’ sections according to your preferences.
  7. In the ‘Permissions’ section, provide an AWS Identity and Access Management (IAM) role name in the ‘Role name’ field if you prefer that AWS create a new role automatically. You can leave the rest of the settings at their default values.
  8. In the ‘Notifications’ section, make sure you choose ‘us-east-1’ as the region and choose ‘cwIntegrationTopic’ from the ‘Topic1’ drop-down.
  9. Click the ‘Configure’ button to complete the setup process.

Configure CloudWatch Investigations

If you have previously set up CloudWatch Investigations in your account and region, go to step 4 directly.

  1. In the CloudWatch console, go to ‘Configuration’ under ‘AI Operations’ and make sure you are in the ‘us-east-1’ region.
  2. If you have never set up CloudWatch Investigations for the region, the ‘Pending initial configuration’ pane is shown; click the ‘Configure for this account’ button to configure it for the first time.
  3. You can leave the default settings as they are and click the ‘Create investigation group’ button; this completes the configuration with the basic settings. See CloudWatch investigations – Get started for more details about the settings.
  4. Go to ‘Configuration’ under ‘AI Operations’ and finish the optional configuration by adding the SNS topic under the ‘Chat integration’ section. Make sure you select the correct SNS topic (the one ending with ‘cwIntegrationTopic’) and click ‘Done’; the configuration update takes effect automatically.
  5. Go to ‘All alarms’ under ‘Alarms’ in CloudWatch console, edit the ‘API-Gateway-5XX-Errors’ alarm by selecting the alarm and clicking ‘Edit’ from the ‘Actions’ drop-down. This alarm was already created by the steps from the Deploy the Solution section.
  6. Keep clicking the ‘Next’ button until you reach the step that includes the ‘Investigation action’ setting, and add an ‘Investigation’ action pointing to the default investigation group created in step 3. Continue by clicking the ‘Next’ button and finally the ‘Update alarm’ button.

Test the Solution

Trigger CloudWatch Investigations using test workload

The following steps simulate traffic against the Sample Workload set up in the Deploy the Solution section. If you have another way of generating load, you can use it instead and skip these steps.

  • In the solution’s project folder, open the ‘request_generator.py’ file in the ‘test’ directory and update line 8 with the test workload API URL. The URL can be found in the CloudFormation console, in the Outputs section of the ‘intelligent-operations-test-stack’ stack.
  • Make sure the ‘aiohttp’ package is installed in your local Python environment. You can install it by running the following CLI commands:
# Change directory to ‘test’ under project folder

cd test

# Create a virtual environment

python -m venv .venv

# Activate the virtual environment

# On macOS/Linux:

source .venv/bin/activate

# On Windows:

.venv\Scripts\activate

# Ensure pip is available

python -m ensurepip --upgrade

# Install required packages

python -m pip install -r requirements.txt --force-reinstall
  • Run ‘request_generator.py’. This generates 10 concurrent requests to the API Gateway endpoint to simulate high error rates and trigger the CloudWatch Investigations flow.
python request_generator.py
  • Wait a few minutes, then go to the CloudWatch console. Under ‘All alarms’ you should see that the ‘API-Gateway-5XX-Errors’ alarm has been triggered, and in ‘Investigations’ under ‘AI Operations’ a new investigation should have started.
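
To see why a burst of concurrent requests produces 5XX errors, the following sketch models the idea with the standard library only. It is not the repository’s ‘request_generator.py’, which calls the real endpoint. Here `make_throttled_backend` is a hypothetical stand-in for a backend with a small reserved concurrency limit, and the limit of 2 is an arbitrary value chosen for the illustration.

```python
import asyncio
from collections import Counter


def make_throttled_backend(reserved_concurrency):
    """Build a fake async endpoint that rejects calls beyond its concurrency limit."""
    in_flight = 0

    async def fetch(url):
        nonlocal in_flight
        if in_flight >= reserved_concurrency:
            return 500                   # no capacity left: surfaced as a 5XX error
        in_flight += 1
        try:
            await asyncio.sleep(0.01)    # pretend to do some work
            return 200
        finally:
            in_flight -= 1

    return fetch


async def run_load(fetch, url, concurrency=10):
    """Fire `concurrency` requests at once and tally the status codes."""
    statuses = await asyncio.gather(*(fetch(url) for _ in range(concurrency)))
    return Counter(statuses)


if __name__ == "__main__":
    backend = make_throttled_backend(reserved_concurrency=2)
    tally = asyncio.run(run_load(backend, "https://example.invalid/prod"))
    print(dict(tally))
```

With 10 simultaneous requests and room for only 2, most calls come back as errors, which is the pattern the ‘API-Gateway-5XX-Errors’ alarm is watching for.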

Running the Speech-to-Speech Model Streaming Server

  • Go to the CloudFormation console and click into the ‘intelligent-operations-test-stack’ stack, locate ‘MessagesBucketName’ under ‘Outputs’, and take note of the bucket name value.
  • In your CLI terminal, find and take note of your CloudWatch Investigations group ID by running the following command.
aws aiops list-investigation-groups --region us-east-1
# the investigation group id is the last segment of the output ARN, e.g. arn:aws:aiops:us-east-1:123456789012:investigation-group/ThisPartIsTheId
  • In the solution’s project folder, open the .env.example file in any text editor, update it with your specific resource IDs, save the file, and rename it to .env:

Q_INVESTIGATION_BUCKET: Your S3 bucket name found in step 1.

INVESTIGATION_GROUP_ID: the investigation group ID noted in step 2.

  • Install dependencies by running the CLI command:
npm install # IMPORTANT – make sure current working directory is the project root, ‘cd’ into the directory if it is not
  • Build the streaming server by running the CLI command:
npm run build
  • Start the streaming server:
npm run start

# run ‘npm run dev’ if you encounter issues and need hints

By completing the steps, you have just started a speech-to-speech streaming server on your local machine that can be accessed by a web browser to interact with your operational investigations.
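
If you want to script the two values that go into the .env file, the sketch below extracts the investigation group ID from an ARN and renders the two settings. The `KEY=value` layout is an assumption based on the common dotenv convention, so treat the repository’s .env.example as the authoritative format. The function names are ours.

```python
def investigation_group_id(arn):
    """Return the last ARN segment, e.g. '.../investigation-group/<id>' -> '<id>'."""
    resource = arn.rsplit(":", 1)[-1]             # "investigation-group/<id>"
    kind, _, group_id = resource.partition("/")
    if kind != "investigation-group" or not group_id:
        raise ValueError(f"not an investigation group ARN: {arn}")
    return group_id


def render_env(bucket_name, group_id):
    """Render the two settings in an assumed KEY=value dotenv layout."""
    return (
        f"Q_INVESTIGATION_BUCKET={bucket_name}\n"
        f"INVESTIGATION_GROUP_ID={group_id}\n"
    )
```

For instance, `investigation_group_id("arn:aws:aiops:us-east-1:123456789012:investigation-group/ThisPartIsTheId")` returns `"ThisPartIsTheId"`.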

Test User Interaction

  1. Use a web browser to access the application at http://localhost:5173. The web UI has been tested with Chrome and Safari.
  2. Allow your web browser to access your microphone when prompted by a pop-up window.
  3. Use the following speech script of example questions to walk through a test interaction with the voice-enabled personal assistant. You can also find the sample questions in the ‘/test/questions.txt’ file in the code repository.

User: “Hi there, who are you?”

User: “Some users just reported problems using my application can you check if there is anything wrong?”

User: “The name of the application is called my-super-workload.”

User: “My bad, actually the name is my-production-workload.”

User: “When did the issue occur?”

User: “What does it mean the function is throttled?”

User: “Okay, what’s your recommendation?”

User: “Okay, let’s go increase that lambda concurrency limit.”

User: “Can you help me with another request?”

User: “Can you also check if there’s anything wrong with my test workload?”

User: “Okay, thank you, bye for now.”

By completing the steps, you simulated a simple test workload (Amazon API Gateway + AWS Lambda) receiving too many user requests and exhausting its reserved concurrency limit, causing HTTP 5XX errors. When the errors exceeded the alarm threshold, a CloudWatch alarm triggered an automatic CloudWatch investigation. Without checking any observability tools or troubleshooting manually, you then used speech-to-speech interaction powered by Amazon Nova Sonic in Amazon Bedrock to learn the incident insights and the root cause (a throttled Lambda function), all analyzed automatically by CloudWatch Investigations. Finally, you instructed the Nova Sonic powered assistant to apply remediation by executing a predefined AWS Systems Manager Automation runbook, which increases the Lambda function’s concurrency limit and relieves the throttling.
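
For readers curious about what the remediation step might look like in code, here is a hedged sketch of starting a Systems Manager Automation runbook with `boto3`. The runbook name and its parameter names (`FunctionName`, `ReservedConcurrency`) are hypothetical placeholders, not the ones used by the sample repository. Only the `start_automation_execution` call and its list-of-strings parameter format are standard.

```python
def build_automation_request(document_name, function_name, new_limit):
    """Build keyword arguments for ssm.start_automation_execution."""
    if new_limit < 1:
        raise ValueError("new_limit must be a positive integer")
    return {
        "DocumentName": document_name,                 # hypothetical runbook name
        "Parameters": {                                # Automation values are lists of strings
            "FunctionName": [function_name],           # assumed parameter name
            "ReservedConcurrency": [str(new_limit)],   # assumed parameter name
        },
    }


def start_remediation(ssm_client, request):
    """Start the runbook and return the execution ID for later status checks."""
    response = ssm_client.start_automation_execution(**request)
    return response["AutomationExecutionId"]
```

In a real environment you would pass `boto3.client("ssm", region_name="us-east-1")` as `ssm_client`. Accepting the client as a parameter also lets you try the logic with a stub before touching live resources.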

Cleaning up
Delete the CloudFormation stack to clean up all resources, including the S3 bucket used, by running the following CLI command:

aws cloudformation delete-stack \
--stack-name intelligent-operations-test-stack \
--region us-east-1

Conclusion
The integration of CloudWatch Investigations with Amazon Nova Sonic in Amazon Bedrock represents a significant advancement in cloud operations management. This AIOps approach directly addresses key operational challenges by:

  1. Reducing dashboard complexity – As demonstrated in our walkthrough, operations teams no longer need to navigate multiple monitoring interfaces during critical incidents. Instead of switching between CloudWatch dashboards, logs, and metrics, the solution consolidates investigation data and presents it through intuitive speech interaction.
  2. Improving incident resolution efficiency – The solution speeds up incident response by automating investigation steps and providing clear remediation options through conversation. In our example, the Lambda throttling issue was identified and resolved without manually parsing logs or metrics, reducing the time to resolution.
  3. Enabling anywhere operations – Engineers can interact with cloud resources through natural conversation regardless of location, particularly valuable during after-hours incidents when access to monitoring dashboards may be limited.

By embracing these speech-to-speech intelligent systems, organizations aren’t just solving today’s operational challenges — they’re building the foundation for a more efficient, responsive, and human-centered approach to cloud operations.

Next Steps:

To continue advancing your AIOps capabilities, consider these follow-up actions:

  • Expand your voice-enabled monitoring to additional application components and services
  • Customize remediation runbooks for your specific operational needs
  • Integrate this solution with your existing ChatOps platforms for comprehensive team collaboration
  • Explore additional Amazon Bedrock capabilities to further enhance your AI-powered operations

For more information on building resilient cloud operations, check out AWS-related content on Amazon CloudWatch Investigations best practices and the Amazon Bedrock Nova Sonic implementation guide.

Rovan Omar

Rovan is a Principal Technologist at AWS who blends deep expertise in cloud architecture, cybersecurity, and generative AI to drive resilience, innovation, and operational excellence. Her work enhances developer productivity and aligns technology with measurable business outcomes.

Andres Silva

Andres Silva is a Global Cloud Operations Leader and Principal Specialist Solutions Architect at Amazon Web Services (AWS), where he helps enterprises transform their cloud operations. With over 30 years of experience in technology, including a decade at AWS, he specializes in DevOps, cloud technologies, and SaaS infrastructure management. Based in High Point, North Carolina, Silva drives enterprise-wide cloud operations strategies with a focus on AIOps and Observability. He partners with global organizations to architect and implement intelligent cloud operations frameworks that leverage artificial intelligence to enable operational excellence and automated incident response at scale.

Sean Xiaohai Wang

Sean Xiaohai Wang is a Senior Technical Account Manager at AWS who helps enterprise customers build and operate efficiently on the cloud. With extensive tech industry experience and telecommunications expertise, Sean partners with clients to architect innovative solutions leveraging serverless and AI technologies while optimizing cloud operations. His passion for efficiency drives him to develop smarter approaches that maximize customer value while minimizing complexity, enabling organizations to achieve their business objectives through effective AWS implementation.