AWS for Industries

Multi-agent collaboration using Amazon Bedrock for Telecom Network Operations

Telecom Network Operations are complex by nature, needing constant vigilance across a range of activities such as real-time alarm monitoring, scheduled maintenance tracking, and performance analytics. One of the most critical metrics in this domain is Mean Time to Resolve (MTTR)—a key indicator of how efficiently network teams can detect, diagnose, and resolve issues. A high MTTR not only threatens service reliability but also degrades customer experience and drives up operational costs. Traditionally, managing this complexity has necessitated manual coordination or monolithic systems, both of which fall short in today’s fast-paced, high-availability environments.

To address this challenge, forward-thinking telecom operators are turning to AI-powered multi-agent collaboration, a scalable approach that distributes responsibility across specialized agents. Each agent focuses on a specific domain, such as checking for scheduled maintenance, analyzing real-time alarms, or evaluating KPI anomalies, while a supervisor agent orchestrates the overall workflow. With the fully managed multi-agent collaboration capability in Amazon Bedrock, these systems are easier to implement than before. Amazon Bedrock handles the behind-the-scenes orchestration, task delegation, and communication between agents, which enables faster diagnostics and more accurate decisions and can reduce MTTR. The result is a more agile, resilient, and responsive network operations model tailored for the demands of modern telecom infrastructure.

Background and challenges

Traditional Network Operations Center (NOC) troubleshooting workflows often follow a linear, manual process that depends heavily on static playbooks and isolated monitoring tools. When an issue arises, engineers typically need to sift through multiple systems—cross-referencing alarms, maintenance schedules, and performance metrics—all of which are frequently presented in different interfaces. This fragmented approach not only slows down the identification of root causes but also heightens the risk of missing crucial insights. The process demands significant manual effort to correlate data from various sources, which can lead to inconsistencies, errors, and ultimately, extended MTTR.

Furthermore, the reliance on tribal knowledge within NOCs presents a significant challenge. Critical operational insights often reside with experienced engineers, but they are rarely documented or easily accessible to others. This creates inefficiencies in knowledge transfer, especially when team members are unavailable or when new engineers are on-boarded. As networks become increasingly complex and distributed, the difficulty of correlating maintenance schedules, active alarms, and performance trends in real-time only intensifies. In light of these challenges, there is a pressing need for more efficient, accurate, and automated problem resolution—one that reduces manual intervention, enhances decision-making, and accelerates incident response times.

Solution overview

The multi-agent collaboration capability of Amazon Bedrock allowed us to build the Network Operations Assistant, an intelligent, chat-based tool that enables network teams to monitor and manage infrastructure in real time. The multi-agent architecture gives the assistant a unified interface for accessing alarms, maintenance schedules, KPIs, and overall network health.

Figure 1: Architecture: Network Operations Assistant

Inside the Network Operations Assistant: key technical components

Building an intelligent assistant to manage network infrastructure necessitates a thoughtful blend of AI, serverless compute, and scalable front-end technology. The following sections offer a closer look at the core components that power our Network Operations Assistant.

Amazon Bedrock and Agents: the AI foundation

Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) through a unified API. Amazon Bedrock Agents extend these capabilities by enabling purpose-built AI assistants that can execute specific tasks and integrate with back-end systems while maintaining context and delivering coherent responses.

Network Operations Assistant uses a sophisticated multi-agent architecture powered by Amazon Bedrock, with each agent specialized for specific operational tasks:

  • Supervisor Agent: The supervisor agent uses specific instructions to understand and orchestrate the network operations workflow through specialized agents. It acts as the central coordinator, breaking down user queries about network issues into discrete tasks, delegating these to appropriate sub-agents, and synthesizing their findings into comprehensive, actionable responses.
supervisor_instruction = """
You are an intelligent Network Operations Supervisor with access to specialized sub-agents
that can help you provide comprehensive network operations support.

AVAILABLE SUB-AGENTS:
- MaintenanceAgent: Specializes in checking maintenance schedules, planned work,
and service windows
- AlarmAgent: Monitors and analyzes network alarms, outages, and critical alerts
- KPIAgent: Analyzes performance metrics, identifies anomalies, and assesses
network health

CORE RESPONSIBILITIES:
You intelligently route user queries to the appropriate sub-agents based on the
nature of their request. Analyze each user query to determine which sub-agents
can provide relevant information, then coordinate their responses to deliver
comprehensive answers.
"""

Orchestration flow:

a. Receives user query about network status or issues

b. Determines which sub-agents to engage based on query context

c. Coordinates with MaintenanceAgent to check for scheduled work

d. Engages AlarmAgent to assess active network issues

e. Consults KPIAgent for performance impact analysis

f. Synthesizes all findings into a coherent response

g. Recommends specific actions based on collective analysis

Figure 2: Agent Orchestration Flow
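The orchestration flow above can be approximated locally with a small simulation. This is only an illustrative sketch: in the actual solution the supervisor's routing is performed by the foundation model inside Amazon Bedrock, not by keyword matching. The keyword table, the stub sub-agent functions, and their return strings are all assumptions made for this example.

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for the three sub-agents. In the real solution these
# are Amazon Bedrock agents backed by Lambda functions, not local functions.
def maintenance_agent(site_id: str) -> str:
    return f"MaintenanceAgent: no maintenance window found for {site_id}"


def alarm_agent(site_id: str) -> str:
    return f"AlarmAgent: 1 active alarm for {site_id}"


def kpi_agent(site_id: str) -> str:
    return f"KPIAgent: latency above baseline at {site_id}"


SUB_AGENTS: Dict[str, Callable[[str], str]] = {
    "MaintenanceAgent": maintenance_agent,
    "AlarmAgent": alarm_agent,
    "KPIAgent": kpi_agent,
}

# Assumed keyword hints, used only to imitate step (b) of the flow.
ROUTING_HINTS: Dict[str, List[str]] = {
    "MaintenanceAgent": ["maintenance", "scheduled", "window"],
    "AlarmAgent": ["alarm", "outage", "alert"],
    "KPIAgent": ["kpi", "performance", "latency", "throughput"],
}


def select_agents(query: str) -> List[str]:
    """Pick sub-agents whose hints appear in the query; engage all if none match."""
    lowered = query.lower()
    selected = [
        name
        for name, words in ROUTING_HINTS.items()
        if any(word in lowered for word in words)
    ]
    return selected or list(SUB_AGENTS)


def supervise(query: str, site_id: str) -> str:
    """Delegate to the selected sub-agents and join their findings (steps c to f)."""
    findings = [SUB_AGENTS[name](site_id) for name in select_agents(query)]
    return "\n".join(findings)
```

A broad question such as "What's the status of site_dallas_001?" matches no hint and therefore engages all three stubs, which mirrors how a general status query touches maintenance, alarms, and KPIs.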

  • Maintenance Agent: keeps track of maintenance schedules and alerts the team to upcoming or ongoing work.

Agent prompt configuration

maintenance_instruction = """You are responsible for checking network maintenance schedules.
When asked about a site:
1. Check if there is any ongoing maintenance using the check_maintenance function
2. Check if there is upcoming maintenance
3. Provide details about any scheduled work
4. Format the information clearly for the supervisor agent"""

Action group function definition

maintenance_functions = [{
    'name': 'check_maintenance',
    'description': 'Check maintenance schedule for a network site',
    'parameters': {
        'site_id': {
            'type': 'string',
            'description': 'The ID of the network site to check (format: site_location_number, e.g., site_dallas_001)'
        }
    }
}]
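To make the link between the function definition and its Lambda back end more concrete, here is a minimal sketch of a handler for check_maintenance. It follows the event and response shape that Amazon Bedrock Agents use for action groups defined with function details. The in-memory MAINTENANCE records, their field names, and the classification logic are assumptions for illustration; the sample repository reads its data from Amazon S3 and may be structured differently.

```python
import json
from datetime import datetime, timezone

# Hypothetical records standing in for the maintenance data stored in S3.
MAINTENANCE = [
    {
        "site_id": "site_dallas_001",
        "start": "2025-01-10T02:00:00+00:00",
        "end": "2025-01-10T06:00:00+00:00",
        "description": "Router firmware upgrade",
    },
]


def check_maintenance(site_id, now=None):
    """Split a site's maintenance records into ongoing and upcoming work."""
    now = now or datetime.now(timezone.utc)
    ongoing, upcoming = [], []
    for record in MAINTENANCE:
        if record["site_id"] != site_id:
            continue
        start = datetime.fromisoformat(record["start"])
        end = datetime.fromisoformat(record["end"])
        if start <= now <= end:
            ongoing.append(record)
        elif now < start:
            upcoming.append(record)
    return {"site_id": site_id, "ongoing": ongoing, "upcoming": upcoming}


def lambda_handler(event, context):
    """Handle an action group invocation that uses function details."""
    # Parameters arrive as a list of {"name", "type", "value"} entries.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    result = check_maintenance(params.get("site_id", ""))
    return {
        "messageVersion": event.get("messageVersion", "1.0"),
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {
                "responseBody": {"TEXT": {"body": json.dumps(result)}}
            },
        },
    }
```

Returning the result as a JSON string in the TEXT body leaves the formatting of the final answer to the agent, in line with the instruction to format information clearly for the supervisor.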
  • Alarm Agent: watches over network alarms, prioritizing issues by severity and recommending next steps.

Agent prompt configuration

alarm_instruction = """You are responsible for monitoring network alarms.
When asked about a site:
1. Use the check_alarms function to get active alarms
2. Analyze alarm severity and impact
3. Determine if immediate action is needed
4. Report findings clearly to the supervisor agent"""

Action group function definition

alarm_functions = [{
    'name': 'check_alarms',
    'description': 'Check active alarms for a network site',
    'parameters': {
        'site_id': {
            'type': 'string',
            'description': 'The ID of the network site to check (format: site_location_number, e.g., site_dallas_001)'
        }
    }
}]
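The severity-based prioritization that the Alarm Agent performs could look roughly like the following. This is a sketch under assumptions: the severity names and their ranking, the alarm field names, and the rule that any critical alarm needs immediate action are illustrative and not taken from the sample repository.

```python
# Assumed severity scale, ordered from most to least urgent.
SEVERITY_RANK = {"CRITICAL": 0, "MAJOR": 1, "MINOR": 2, "WARNING": 3}


def prioritize_alarms(alarms):
    """Return alarms sorted by severity; unknown severities sort last."""
    return sorted(
        alarms,
        key=lambda alarm: SEVERITY_RANK.get(
            alarm["severity"].upper(), len(SEVERITY_RANK)
        ),
    )


def needs_immediate_action(alarms):
    """Assumed rule: any critical alarm calls for immediate action."""
    return any(alarm["severity"].upper() == "CRITICAL" for alarm in alarms)


# Hypothetical active alarms for one site.
active = [
    {"alarm_type": "HIGH_TEMPERATURE", "severity": "minor"},
    {"alarm_type": "SITE_DOWN", "severity": "critical"},
    {"alarm_type": "LINK_DEGRADED", "severity": "major"},
]
ordered = prioritize_alarms(active)
```

Because Python's sort is stable, alarms with the same severity keep their original relative order, which is useful when the input is already ordered by time.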
  • KPI Agent: Analyzes key performance indicators to spot trends and anomalies. Although the code in this implementation uses threshold-based anomaly detection, production deployments should use more sophisticated approaches for accurate and reliable anomaly detection. These could include:
    • ML models for predictive analytics
    • Time series analysis for trend identification
    • Multivariate anomaly detection to consider correlations between metrics
    • Adaptive thresholding that adjusts based on historical patterns

Agent prompt configuration

kpi_instruction = """You are responsible for analyzing network KPIs.
When asked about a site:
1. Use the analyze_kpis function to check performance metrics
2. Identify any anomalies or concerning trends
3. Analyze the impact of any issues found
4. Provide clear analysis to the supervisor agent"""

Action group function definition

kpi_functions = [{
    'name': 'analyze_kpis',
    'description': 'Analyze KPI metrics for a network site',
    'parameters': {
        'site_id': {
            'type': 'string',
            'description': 'The ID of the network site to analyze (format: site_location_number, e.g., site_dallas_001)'
        }
    }
}]
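To illustrate the difference between the threshold-based detection mentioned for the KPI Agent and one of the suggested alternatives, adaptive thresholding, the following sketch compares the two on a short latency series. The function names, the window size, the multiplier k, and the sample values are assumptions for this example; it does not reproduce the repository's analyze_kpis logic.

```python
from statistics import mean, stdev


def fixed_threshold_anomalies(values, upper):
    """Flag indexes whose value exceeds a static upper limit."""
    return [i for i, value in enumerate(values) if value > upper]


def adaptive_threshold_anomalies(values, window=5, k=3.0):
    """Flag indexes that deviate more than k standard deviations from the
    mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) > k * sigma:
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples in milliseconds with one spike.
latency_ms = [20, 21, 19, 20, 22, 21, 80, 20]
static_hits = fixed_threshold_anomalies(latency_ms, upper=50)
adaptive_hits = adaptive_threshold_anomalies(latency_ms)
```

Both approaches flag the spike here. The adaptive variant derives its limit from recent history, so it does not need a per-site constant, but a single outlier inflates the window's standard deviation for the next few samples.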

FMs and agent configuration

The solution uses the Amazon Nova family of models, specifically:

  • Nova Lite: powers the specialized sub-agents for efficient, focused tasks
  • Nova Pro: can be configured for the Supervisor Agent when more complex reasoning is needed

Although this solution uses the Amazon Nova family of models, you can customize the setup to use any of the FMs supported by Amazon Bedrock Agents, such as Anthropic’s Claude 3.7 Sonnet or Claude 3 Haiku, depending on your specific requirements.

# Agent configuration with Nova models
AGENT_FOUNDATION_MODEL = 'amazon.nova-lite-v1:0'
SUPERVISOR_AGENT_FOUNDATION_MODEL = 'amazon.nova-pro-v1:0'
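As a rough sketch of how these model constants could feed into agent creation, the helper below builds the keyword arguments for the boto3 bedrock-agent client's create_agent call. The helper names, the role ARN argument, and the choice of the SUPERVISOR collaboration mode are assumptions for illustration; the sample's deployment script may configure the agents differently.

```python
AGENT_FOUNDATION_MODEL = "amazon.nova-lite-v1:0"
SUPERVISOR_AGENT_FOUNDATION_MODEL = "amazon.nova-pro-v1:0"


def build_agent_request(name, instruction, role_arn, supervisor=False):
    """Build create_agent keyword arguments for a sub-agent or the supervisor."""
    return {
        "agentName": name,
        "instruction": instruction,
        "agentResourceRoleArn": role_arn,
        "foundationModel": (
            SUPERVISOR_AGENT_FOUNDATION_MODEL if supervisor else AGENT_FOUNDATION_MODEL
        ),
        # Only the supervisor coordinates other agents.
        "agentCollaboration": "SUPERVISOR" if supervisor else "DISABLED",
    }


def create_agent(request, region="us-east-1"):
    """Create the agent and return its ID (requires AWS credentials)."""
    import boto3  # imported lazily so the builder above works without AWS access

    client = boto3.client("bedrock-agent", region_name=region)
    return client.create_agent(**request)["agent"]["agentId"]
```

After the agents exist and have aliases, the sub-agents still need to be linked to the supervisor. In boto3 that step is available as associate_agent_collaborator on the same client.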

Using Bedrock Agents features for network operations

In this implementation, we’ve demonstrated how to use Action Groups to execute various network management tasks through AWS Lambda functions, and used different FMs (Nova Pro for Supervisor Agent, Nova Lite for sub-agents) based on the complexity of each agent’s role. Beyond these features, Amazon Bedrock Agents offers more capabilities that can help solve other practical challenges in network operations:

  • Knowledge base integration
    • Challenge: network operations teams need access to extensive documentation, procedures, and historical data
    • Solution: integrating Amazon Bedrock Knowledge Bases with agents enables them to:
      • Access relevant documentation and troubleshooting guides
      • Reference historical incident resolutions
      • Maintain up-to-date operational procedures
      • Scale tribal knowledge across the organization
  • Memory and context management
    • Challenge: complex network troubleshooting often necessitates maintaining context across multiple interactions
    • Solution: Amazon Bedrock Agents’ built-in conversation memory allows:
      • Contextual awareness across troubleshooting sessions
      • Reference to previous similar incidents
      • Efficient handling of follow-up questions
      • Progressive problem resolution

This multi-agent architecture enables sophisticated problem-solving and decision-making capabilities while maintaining a conversational interface for users.

Serverless integration layer: connecting agents with network data

These AI agents rely on serverless Lambda functions to process data efficiently and scale on demand:

  • Maintenance checker: gathers and processes maintenance schedule data.
  • Alarm checker: retrieves and analyzes active network alarms.
  • KPI analyzer: examines performance metrics to identify important trends.

This solution currently uses Lambda functions for efficient data processing and scalability. However, Amazon Bedrock Agents also support the Model Context Protocol (MCP) for direct integration with external systems. Depending on the use case, this solution can be extended to use MCP for real-time, context-aware interactions. For example:

  1. Power outage integration: Direct connections with utility company APIs to correlate network issues with reported outages.
  2. Fiber cut detection: Integration with ‘Call Before You Dig’ databases to identify potential causes of fiber outages.
  3. Weather impact analysis: Real-time weather data analysis for predicting network disruptions.
  4. Traffic and road work updates: Integration with traffic systems to link network issues with infrastructure damage.

These MCP-powered enhancements would allow our Network Operations Assistant to provide more comprehensive, contextual insights, further reducing troubleshooting time and improving root cause analysis accuracy.

Frontend integration: flexible, secure, and scalable interface

In this implementation, the user-facing side is a Streamlit app deployed in a Docker container on AWS Fargate for scaling. Amazon Cognito handles user authentication and can be configured to integrate with enterprise identity providers (IdPs) for single sign-on (SSO), supporting standards such as SAML 2.0 and OpenID Connect. Lambda@Edge functions work with Amazon CloudFront to perform authorization checks and enforce access control at the edge. Content is delivered through CloudFront and an Application Load Balancer (ALB), which provides fast, secure, global access with efficient load distribution.

Enterprises can replace the Streamlit front-end with their preferred technology stack. The back-end layer for Amazon Bedrock Agents can be invoked from any modern front-end framework, such as React, Angular, Vue.js, or existing enterprise applications through API Gateway or other integration patterns. The architecture remains consistent regardless of the chosen front-end, allowing organizations to do the following:

  • Integrate AI agents into their existing network management interfaces
  • Scale the solution according to their operational requirements
  • Maintain enterprise security standards and access controls
  • Choose the front-end framework that best suits their team’s expertise and needs
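As an illustration of how any front end or back-end service could call the supervisor agent, the sketch below uses the invoke_agent operation of the boto3 bedrock-agent-runtime client and assembles the streamed reply. The function names are hypothetical, and the agent ID and alias ID are placeholders you would take from your own deployment.

```python
import uuid


def collect_completion(event_stream):
    """Join streamed text chunks and gather trace events from an invoke_agent reply."""
    text_parts, traces = [], []
    for event in event_stream:
        if "chunk" in event:
            text_parts.append(event["chunk"]["bytes"].decode("utf-8"))
        elif "trace" in event:
            traces.append(event["trace"])
    return "".join(text_parts), traces


def ask_assistant(agent_id, agent_alias_id, question, session_id=None, region="us-east-1"):
    """Send one question to the agent (requires AWS credentials and a deployed agent)."""
    import boto3  # imported lazily so collect_completion can be used offline

    client = boto3.client("bedrock-agent-runtime", region_name=region)
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        # Reusing the same session ID keeps conversation context across turns.
        sessionId=session_id or str(uuid.uuid4()),
        inputText=question,
        enableTrace=True,
    )
    return collect_completion(response["completion"])
```

A web framework would typically wrap ask_assistant in an authenticated API endpoint and pass a per-user session ID so that follow-up questions share context.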

Data storage: direct, flexible, and ready to scale

All network data lives in an Amazon S3 bucket as CSV files, including information on network sites, maintenance schedules, alarms, and KPIs. This setup allows you to update or replace data with real network sources as needed.

s3://netops-network-data-{account-id}/
├── data/
│   ├── sites.csv              # Network site information
│   ├── maintenance.csv        # Maintenance schedules
│   ├── alarms.csv            # Active and historical alarms
│   └── kpi_metrics.csv       # Performance metrics
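A Lambda function could read one of these files along the following lines. This is a minimal sketch using the standard csv module and the boto3 S3 client; the helper names are hypothetical and the site_id column is an assumption, because the exact CSV headers are not listed here.

```python
import csv
import io


def parse_csv(text):
    """Parse CSV text into a list of row dictionaries keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_for_site(rows, site_id):
    """Keep only rows for one site (assumes a 'site_id' column exists)."""
    return [row for row in rows if row.get("site_id") == site_id]


def load_csv_from_s3(bucket, key):
    """Download and parse a CSV object (requires AWS credentials)."""
    import boto3  # imported lazily so the parsing helpers work offline

    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return parse_csv(body.decode("utf-8"))
```

For example, load_csv_from_s3("netops-network-data-{account-id}", "data/alarms.csv") with your account ID substituted would return the alarm rows, which rows_for_site can then narrow to a single site.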

Deployment

The following sections walk you through how to deploy the Network Operations Assistant.

Prerequisites
Make sure you have the following set up:

  • AWS Command Line Interface (AWS CLI) configured with the right permissions
  • AWS SAM CLI installed for serverless deployment
  • Docker installed, with support for multi-architecture builds
  • Python 3.9+
  • Amazon Bedrock model access to the Nova Pro and Nova Lite models in the AWS Region where you deploy the solution
  • PowerShell 5.1+ or PowerShell Core 7+ (Windows only)

AWS Configuration
Configure your AWS credentials:

aws configure

Or set up a named profile:

aws configure --profile your-profile-name

Step 1: Clone the repository

First, get the project code from GitHub:

git clone https://github.com/aws-samples/sample-multi-agent-collaboration-using-bedrock-for-telco-network-ops.git
cd sample-multi-agent-collaboration-using-bedrock-for-telco-network-ops

Step 2: Run the deployment script

The deployment script automates everything—from setting up Lambda functions to launching the Streamlit app:

For Linux/macOS:
./build_and_deploy.sh [stack-name] [region] [profile]

For Windows PowerShell:
.\build_and_deploy.ps1 -StackName [stack-name] -Region [region] -Profile [profile]

Parameters explained:

  • stack-name: choose an AWS CloudFormation stack name (default is netops). Do not include hyphens (-) in the stack name.
  • region: specify the Region (default is us-east-1)
  • profile: choose the AWS CLI profile (default is default)

What happens behind the scenes?

When you run the script, it does the following:

  • Creates a Lambda layer with essential libraries such as pandas and numpy
  • Configures Amazon Bedrock Agents with the right permissions
  • Deploys Lambda functions that power each AI agent
  • Sets up an S3 bucket loaded with synthetic network data
  • Builds and deploys the Streamlit app in a container on AWS Fargate
  • Sets up Amazon Cognito for user authentication and management
  • Configures an ALB to distribute traffic to the Fargate containers
  • Deploys Lambda@Edge functions for real-time authorization checks
  • Sets up a CloudFront distribution integrated with Cognito and Lambda@Edge for secure, fast, global access
  • Configures AWS Identity and Access Management (IAM) roles and policies to enforce proper access controls across all components

Synthetic data structure

The solution includes preloaded synthetic data for demonstration purposes:

 

#  Dataset       Description
1  Sites data    Network site info: location, type, commissioning date
2  Maintenance   Scheduled and ongoing maintenance activities
3  Alarms        Active and cleared alarms, including severity levels
4  KPI metrics   Performance metrics: throughput, latency, packet loss, and other KPIs

Updating with your own network data (optional)

You can replace the demo data with your own real-world network data after deployment.

1. Prepare your CSV files

Make sure that they match the structure of the preceding synthetic data.

2. Upload to the S3 bucket

Replace {account-id} with your actual AWS account ID:

aws s3 cp your-sites.csv s3://netops-network-data-{account-id}/data/ --profile your-profile
aws s3 cp your-maintenance.csv s3://netops-network-data-{account-id}/data/ --profile your-profile
aws s3 cp your-alarms.csv s3://netops-network-data-{account-id}/data/ --profile your-profile
aws s3 cp your-kpi-metrics.csv s3://netops-network-data-{account-id}/data/ --profile your-profile
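Before uploading, it can help to confirm that your files have the same columns as the synthetic ones. The following sketch compares CSV header rows; the function names and the example column names are assumptions, and you would feed it the text of a synthetic file and of your replacement file.

```python
import csv
import io


def read_header(csv_text):
    """Return the first row of a CSV document, or an empty list if it is empty."""
    return next(csv.reader(io.StringIO(csv_text)), [])


def header_mismatch(reference_csv, candidate_csv):
    """List columns missing from, or unexpected in, the candidate file."""
    reference = read_header(reference_csv)
    candidate = read_header(candidate_csv)
    return {
        "missing": [col for col in reference if col not in candidate],
        "unexpected": [col for col in candidate if col not in reference],
    }


# Hypothetical headers: the candidate lacks 'end_time' and adds 'notes'.
reference_text = "site_id,start_time,end_time\nsite_dallas_001,1,2\n"
candidate_text = "site_id,start_time,notes\nsite_dallas_001,1,n/a\n"
diff = header_mismatch(reference_text, candidate_text)
```

An empty result for both lists means the column names line up. The check does not validate column order, data types, or value formats.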

Step 3: Access Your Network Operations Assistant

When the deployment finishes (usually within 15–20 minutes), the following occurs:

1. The deployment script prompts you to set up an initial Cognito user:

a. Enter a username when prompted
b. Provide a password that meets the necessary security criteria

2. The CloudFront URL is available in the CloudFormation stack outputs section. This is the access point for your Network Operations Assistant.

3. Open the CloudFront URL in your browser.

4. You should be presented with a login screen. Use the Cognito username and password you set up during deployment to authenticate.

5. After successful authentication, you can start managing your network with the Assistant.

The following is an example CloudFormation Output:

CloudFrontURL: https://d123456abcdef.cloudfront.net
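If you prefer to read this output programmatically rather than from the console, a sketch such as the following could be used. It relies on the CloudFormation describe_stacks operation in boto3 and assumes the output key is CloudFrontURL, as in the example above, and that the stack uses the default name netops; the helper names are hypothetical.

```python
def find_output(describe_stacks_response, output_key="CloudFrontURL"):
    """Return the value of one stack output from a describe_stacks response."""
    for stack in describe_stacks_response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == output_key:
                return output.get("OutputValue")
    return None


def get_cloudfront_url(stack_name="netops", region="us-east-1"):
    """Look up the assistant's URL (requires AWS credentials)."""
    import boto3  # imported lazily so find_output can be used offline

    cloudformation = boto3.client("cloudformation", region_name=region)
    return find_output(cloudformation.describe_stacks(StackName=stack_name))
```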

Remember to keep your credentials secure. You can manage more users and permissions through the Amazon Cognito console after initial setup.

Demonstration of the Network Operations Assistant

Getting started

1. Open the CloudFront URL provided in the CloudFormation stack outputs
2. Authenticate using the Cognito credentials set during deployment
3. You should see a chat interface with a welcome message
4. Type your network operations query in natural language
5. The assistant processes your query and provides a response

Try these example queries to explore the capabilities:

Basic site information

  • “What’s the status of site_dallas_001?”
  • “Give me an overview of site_birmingham_003”
  • “Show me all sites in Atlanta”

Maintenance queries

  • “Is there any ongoing maintenance at site_dallas_002?”
  • “Show me upcoming maintenance for site_birmingham_004”
  • “When is the next scheduled maintenance for site_atlanta_003?”

Alarm monitoring

  • “Are there any critical alarms active right now?”
  • “Show me all alarms for site_dallas_001”
  • “What’s the status of the SITE_DOWN alarm on site_birmingham_002?”

Performance analysis

  • “How is site_ridgeland_005 performing?”
  • “Show me the throughput metrics for site_dallas_003”
  • “Are there any anomalies in the network performance today?”
  • “Compare the latency between site_birmingham_001 and site_birmingham_002”

Complex queries

  • “Give me a full report on site_dallas_001 including maintenance, alarms, and performance”
  • “Which sites have both active alarms and scheduled maintenance?”
  • “What’s the impact of the current maintenance on site_atlanta_003’s performance?”

How it works: agentic workflow in action

The Network Operations Assistant UI has tracing enabled, allowing you to see in detail the steps that the supervisor agent takes to process your query. In this section we examine an example workflow for the question “What is the status of site_dallas_001?”.

1. Initial user request: the user inputs the query “What is the status of site_dallas_001?”

2. Supervisor agent orchestration: the supervisor agent analyzes the query and determines the need to check multiple aspects of the site’s status. Then, it orchestrates the following sub-agents:

i. Maintenance Agent
ii. Alarm Agent
iii. KPI Agent

3. Parallel data gathering: each sub-agent performs its specific task concurrently:

i. Maintenance Agent checks for ongoing and upcoming maintenance
ii. Alarm Agent retrieves active alarms for the site
iii. KPI Agent collects and analyzes performance metrics

4. Data analysis and correlation: the supervisor agent collates the information from all sub-agents and performs an analysis to identify any correlations or critical issues.

5. Response formulation: based on the collected data and analysis, the supervisor agent formulates a comprehensive response about the site’s status.

6. Presentation to user: the assistant presents the findings to the user in a clear, concise format, highlighting any critical information or necessary actions.

Observing the trace in the UI allows you to see each of these steps unfold in real-time, providing transparency into the assistant’s decision-making process and the sources of information it uses to answer your query.
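For readers who want to inspect the same information outside the UI, the sketch below tallies trace events by originating agent and step type. It assumes trace parts shaped like those emitted by the InvokeAgent API when tracing is enabled, where each part carries an orchestrationTrace and, for sub-agent activity, a collaboratorName. Treating parts without a collaboratorName as supervisor activity is an assumption of this example, as is the function name.

```python
from collections import Counter


def summarize_trace(trace_parts):
    """Count orchestration steps per agent across a list of trace parts."""
    counts = Counter()
    for part in trace_parts:
        # Assumption: parts without a collaborator name come from the supervisor.
        source = part.get("collaboratorName", "supervisor")
        orchestration = part.get("trace", {}).get("orchestrationTrace", {})
        for step_type in orchestration:
            counts[(source, step_type)] += 1
    return counts


# Hypothetical, heavily trimmed trace parts for one query.
sample_parts = [
    {"trace": {"orchestrationTrace": {"rationale": {"text": "Check all three areas"}}}},
    {"collaboratorName": "AlarmAgent", "trace": {"orchestrationTrace": {"observation": {}}}},
    {"collaboratorName": "AlarmAgent", "trace": {"orchestrationTrace": {"observation": {}}}},
]
summary = summarize_trace(sample_parts)
```

A summary like this gives a quick view of which sub-agents were engaged for a query and how many reasoning or observation steps each one contributed.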

Figure 3: Assistant in action

Figure 4: Assistant in action

Conclusion

The Network Operations Assistant, built on the multi-agent collaboration capability of Amazon Bedrock, demonstrates the potential of intelligent AI systems in telecom network management. By orchestrating specialized agents, it delivers real-time alarm insights, comprehensive maintenance tracking, and proactive KPI analysis through an intuitive chat interface. This helps network teams operate with greater efficiency and responsiveness.

Although our solution uses the fully managed capabilities of Amazon Bedrock Agents, there are other approaches to building AI agents for network operations. In Part 2 of this series, we explore implementing a similar solution using the open source Strands Agents SDK, which offers fine-grained control over agent behavior and interactions. We also demonstrate how to use MCP for real-time, context-aware interactions with external systems, showing how these tools can enhance network operations.

Regardless of the chosen framework, the power of agentic AI in network operations is clear. The flexible, scalable multi-agent architecture enhances situational awareness and paves the way for future innovations. Advanced tool integrations, greater automation, and the potential use of MCP for real-time, context-aware interactions with external systems all point to a future where network operations become increasingly proactive and efficient.

The multi-agent approach allows for:

1. Rapid problem identification and resolution
2. Enhanced decision-making through correlated data analysis
3. Proactive maintenance and performance optimization
4. Scalable expertise that can be easily updated and expanded

As telecom networks grow more complex and dynamic, solutions such as the Network Operations Assistant will be crucial in meeting operational challenges. This implementation using Amazon Bedrock Agents provides a robust foundation for operational excellence and innovation. Stay tuned for Part 2, where we explore alternative implementations using Strands Agents SDK and MCP, and provide you with a comprehensive understanding of different approaches to AI-powered network operations.

Ultimately, the adoption of these AI-driven solutions enables telecoms to not only keep pace with the increasing complexity of modern networks but also to thrive in this environment, delivering superior service quality and reliability to their customers.

Ready to build your own Network Operations Assistant? Start with this deployment guide, or explore the complete source code and documentation in our GitHub repository.

Manoj CS

Manoj CS is a Senior Solutions Architect at AWS, based in Atlanta, Georgia. He specializes in helping telecommunications customers build innovative solutions on AWS and brings over 16 years of expertise in application development with extensive experience in building secure and scalable architectures on cloud infrastructure. He's a member of AWS's internal AI/ML community and serves as a generative AI subject matter expert. Outside of work, Manoj enjoys spending quality time with his family, gardening, and traveling.

Varun Mehta

Varun Mehta is a Senior Solutions Architect at AWS, passionate about enabling customers to design and deploy enterprise-scale, well-architected solutions on the AWS Cloud. He brings 16 years of expertise in networking and cloud technologies, with deep experience in building secure, scalable, and resilient architectures. Varun specializes in cloud networking, hybrid connectivity, and data center modernization, helping enterprises accelerate their cloud adoption journey.

Vijay Veggalam

Vijay Veggalam is a Solutions Architecture Leader in the Telecom Industry Business Unit at AWS, developing innovative AI/ML solutions that use AWS infrastructure and services. With over 27 years of experience, Vijay advises telecom companies and partners worldwide on reducing the cost of operations and improving the availability of networks and services. With four AI/ML patents, he specializes in the telco journey to autonomous networks, including autonomous operations with agentic AI, which is essential for newer revenue models for telcos.