AWS for Industries
Multi-agent collaboration using Amazon Bedrock for Telecom Network Operations
Telecom Network Operations are complex by nature, needing constant vigilance across a range of activities such as real-time alarm monitoring, scheduled maintenance tracking, and performance analytics. One of the most critical metrics in this domain is Mean Time to Resolve (MTTR)—a key indicator of how efficiently network teams can detect, diagnose, and resolve issues. A high MTTR not only threatens service reliability but also degrades customer experience and drives up operational costs. Traditionally, managing this complexity has necessitated manual coordination or monolithic systems, both of which fall short in today’s fast-paced, high-availability environments.
To address this challenge, forward-thinking telecom operators are turning to AI-powered multi-agent collaboration, a scalable and intelligent approach that distributes responsibility across specialized agents. Each agent is designed to focus on a specific domain—such as checking for scheduled maintenance, analyzing real-time alarms, or evaluating KPI anomalies—while a supervisor agent orchestrates the overall workflow. With fully managed multi-agent collaboration in Amazon Bedrock, these systems are now easier to implement than ever. Amazon Bedrock handles the behind-the-scenes orchestration, task delegation, and communication between agents, enabling faster diagnostics, more accurate decisions, and a significant reduction in MTTR. The result is a more agile, resilient, and responsive network operations model tailored for the demands of modern telecom infrastructure.
Background and challenges
Traditional Network Operations Center (NOC) troubleshooting workflows often follow a linear, manual process that depends heavily on static playbooks and isolated monitoring tools. When an issue arises, engineers typically need to sift through multiple systems—cross-referencing alarms, maintenance schedules, and performance metrics—all of which are frequently presented in different interfaces. This fragmented approach not only slows down the identification of root causes but also heightens the risk of missing crucial insights. The process demands significant manual effort to correlate data from various sources, which can lead to inconsistencies, errors, and ultimately, extended MTTR.
Furthermore, the reliance on tribal knowledge within NOCs presents a significant challenge. Critical operational insights often reside with experienced engineers, but they are rarely documented or easily accessible to others. This creates inefficiencies in knowledge transfer, especially when team members are unavailable or when new engineers are onboarded. As networks become increasingly complex and distributed, the difficulty of correlating maintenance schedules, active alarms, and performance trends in real-time only intensifies. In light of these challenges, there is a pressing need for more efficient, accurate, and automated problem resolution—one that reduces manual intervention, enhances decision-making, and accelerates incident response times.
Solution overview
Amazon Bedrock multi-agent collaboration allowed us to build the Network Operations Assistant, an intelligent, chat-based tool that enables network teams to monitor and manage infrastructure in real time. Its multi-agent architecture lets the assistant deliver a unified interface for accessing alarms, maintenance schedules, KPIs, and overall network health.
Figure 1: Architecture: Network Operations Assistant
Inside the Network Operations Assistant: key technical components
Building an intelligent assistant to manage network infrastructure necessitates a thoughtful blend of AI, serverless compute, and scalable front-end technology. The following sections offer a closer look at the core components that power our Network Operations Assistant.
Amazon Bedrock and Agents: the AI foundation
Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) through a unified API. Amazon Bedrock Agents extend these capabilities by enabling purpose-built AI assistants that can execute specific tasks and integrate with back-end systems while maintaining context and delivering coherent responses.
Network Operations Assistant uses a sophisticated multi-agent architecture powered by Amazon Bedrock, with each agent specialized for specific operational tasks:
- Supervisor Agent: The supervisor agent uses specific instructions to understand and orchestrate the network operations workflow through specialized agents. It acts as the central coordinator, breaking down user queries about network issues into discrete tasks, delegating these to appropriate sub-agents, and synthesizing their findings into comprehensive, actionable responses.
Orchestration flow:
a. Receives user query about network status or issues
b. Determines which sub-agents to engage based on query context
c. Coordinates with MaintenanceAgent to check for scheduled work
d. Engages AlarmAgent to assess active network issues
e. Consults KPIAgent for performance impact analysis
f. Synthesizes all findings into a coherent response
g. Recommends specific actions based on collective analysis
Figure 2: Agent Orchestration Flow
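To make this orchestration concrete, the following is a minimal sketch of how a supervisor agent and its collaborators could be wired together with boto3. The agent names, instructions, IAM role, and sub-agent alias ARNs are placeholders; in the deployed solution this configuration is created for you by the deployment templates rather than by hand.

```python
import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Create the supervisor agent with multi-agent collaboration enabled.
# The agent name, instruction text, and IAM role ARN are placeholders.
supervisor = bedrock_agent.create_agent(
    agentName="network-ops-supervisor",
    foundationModel="amazon.nova-pro-v1:0",
    instruction=(
        "You coordinate network operations troubleshooting. Break user "
        "queries into maintenance, alarm, and KPI checks, delegate them to "
        "the matching collaborator, and synthesize one actionable answer."
    ),
    agentResourceRoleArn="arn:aws:iam::123456789012:role/NetOpsAgentRole",
    agentCollaboration="SUPERVISOR",
)
supervisor_id = supervisor["agent"]["agentId"]

# Register each specialized sub-agent (already created and aliased) as a
# collaborator. The alias ARNs are placeholders for your own sub-agents.
collaborators = {
    "MaintenanceAgent": (
        "arn:aws:bedrock:us-east-1:123456789012:agent-alias/MAINT12345/ALIAS12345",
        "Check scheduled and ongoing maintenance for the requested sites.",
    ),
    "AlarmAgent": (
        "arn:aws:bedrock:us-east-1:123456789012:agent-alias/ALARM12345/ALIAS12345",
        "Retrieve active alarms and prioritize them by severity.",
    ),
    "KPIAgent": (
        "arn:aws:bedrock:us-east-1:123456789012:agent-alias/KPIAG12345/ALIAS12345",
        "Analyze KPI metrics and flag anomalies or degradations.",
    ),
}
for name, (alias_arn, instruction) in collaborators.items():
    bedrock_agent.associate_agent_collaborator(
        agentId=supervisor_id,
        agentVersion="DRAFT",
        agentDescriptor={"aliasArn": alias_arn},
        collaboratorName=name,
        collaborationInstruction=instruction,
        relayConversationHistory="TO_COLLABORATOR",
    )

# Build the working draft so the supervisor can be aliased and invoked.
bedrock_agent.prepare_agent(agentId=supervisor_id)
```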
- Maintenance Agent: keeps track of maintenance schedules and alerts the team to upcoming or ongoing work.
Agent prompt configuration
Action group function definition
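The actual prompt and function schema for this agent ship with the repository. As an illustration of the pattern only, here is a minimal sketch of a function-details action group Lambda handler for the Maintenance Agent; the site_id parameter and the stubbed maintenance records are assumptions for the example, not the solution's exact definitions.

```python
import json

def load_maintenance_records(site_id):
    # Hypothetical stand-in: the deployed Lambda reads these rows from the
    # maintenance CSV stored in Amazon S3.
    return [
        {"site_id": site_id, "status": "scheduled",
         "window": "2025-07-01 02:00-06:00 UTC", "work": "RAN software upgrade"},
    ]

def lambda_handler(event, context):
    """Handle a Maintenance Agent action group call from Amazon Bedrock Agents."""
    # Bedrock Agents passes the resolved parameters as a list of name/value pairs.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    site_id = params.get("site_id", "")

    records = load_maintenance_records(site_id)
    result = {
        "site_id": site_id,
        "ongoing": [r for r in records if r["status"] == "in_progress"],
        "upcoming": [r for r in records if r["status"] == "scheduled"],
    }

    # Return the payload in the shape expected for function-details action groups.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {
                "responseBody": {"TEXT": {"body": json.dumps(result)}}
            },
        },
        "sessionAttributes": event.get("sessionAttributes", {}),
        "promptSessionAttributes": event.get("promptSessionAttributes", {}),
    }
```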
- Alarm Agent: watches over network alarms, prioritizing issues by severity and recommending next steps.
Agent prompt configuration
Action group function definition
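Again, the real prompt and function definition are in the repository; the core of the alarm-prioritization logic might look like the sketch below. The column names (site_id, status, severity, raised_at) and severity labels are assumptions standing in for the synthetic alarm dataset's actual schema.

```python
import pandas as pd

# Assumed severity ordering; adjust to match the labels in your alarm data.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "warning": 3}

def prioritize_alarms(alarms: pd.DataFrame, site_id: str) -> pd.DataFrame:
    """Return a site's active alarms ordered from most to least severe."""
    active = alarms[(alarms["site_id"] == site_id) & (alarms["status"] == "active")].copy()
    active["rank"] = active["severity"].str.lower().map(SEVERITY_RANK)
    return active.sort_values(["rank", "raised_at"]).drop(columns="rank")
```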
- KPI Agent: analyzes key performance indicators to spot trends and anomalies. Although the code in this implementation uses threshold-based anomaly detection, production deployments should use more sophisticated approaches for accurate and reliable anomaly detection. These could include:
- ML models for predictive analytics
- Time series analysis for trend identification
- Multivariate anomaly detection to consider correlations between metrics
- Adaptive thresholding that adjusts based on historical patterns (a simple version of this approach is sketched after this list)
Agent prompt configuration
Action group function definition
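As one possible example of adaptive thresholding, the sketch below flags KPI samples whose rolling z-score exceeds a limit, so the threshold tracks each site's recent history instead of a fixed value. The timestamp column name, window size, and threshold are assumptions; a production deployment would likely combine this with the ML and multivariate techniques listed above.

```python
import pandas as pd

def detect_anomalies(kpi: pd.DataFrame, metric: str,
                     window: int = 24, z_thresh: float = 3.0) -> pd.DataFrame:
    """Flag points where a KPI deviates sharply from its own recent behavior."""
    ordered = kpi.sort_values("timestamp").copy()
    series = ordered[metric]

    # Rolling statistics make the threshold adapt to recent history.
    rolling_mean = series.rolling(window, min_periods=window // 2).mean()
    rolling_std = series.rolling(window, min_periods=window // 2).std()

    ordered["z_score"] = (series - rolling_mean) / rolling_std
    ordered["is_anomaly"] = ordered["z_score"].abs() > z_thresh
    return ordered[ordered["is_anomaly"]]
```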
FMs and agent configuration
The solution uses the Amazon Nova family of models, specifically:
- Nova Lite: powers the specialized sub-agents for efficient, focused tasks
- Nova Pro: can be configured for the Supervisor Agent when more complex reasoning is needed
Although this solution uses the Amazon Nova family of models, you can customize the setup to use any of the FMs supported by Amazon Bedrock Agents, such as Anthropic's Claude 3.7 Sonnet or Claude 3 Haiku, depending on your specific requirements.
Using Bedrock Agents features for network operations
In this implementation, we’ve demonstrated how to use Action Groups to execute various network management tasks through AWS Lambda functions, and used different FMs (Nova Pro for Supervisor Agent, Nova Lite for sub-agents) based on the complexity of each agent’s role. Beyond these features, Amazon Bedrock Agents offers more capabilities that can help solve other practical challenges in network operations:
- Knowledge base integration
- Challenge: network operations teams need access to extensive documentation, procedures, and historical data
- Solution: the Amazon Bedrock Agents knowledge base enables agents to:
- Access relevant documentation and troubleshooting guides
- Reference historical incident resolutions
- Maintain up-to-date operational procedures
- Scale tribal knowledge across the organization
- Memory and context management
- Challenge: complex network troubleshooting often necessitates maintaining context across multiple interactions
- Solution: Amazon Bedrock Agents’ built-in conversation memory allows:
- Contextual awareness across troubleshooting sessions
- Reference to previous similar incidents
- Efficient handling of follow-up questions
- Progressive problem resolution
This multi-agent architecture enables sophisticated problem-solving and decision-making capabilities while maintaining a conversational interface for users.
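As a rough sketch of how the knowledge base and memory capabilities above could be switched on, the snippet below enables session-summary memory when creating the supervisor and attaches a knowledge base of runbooks and past incidents. All names, IDs, and ARNs are placeholders, and this configuration is not part of the default deployment.

```python
import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Enable conversation memory on the supervisor so session summaries persist
# across troubleshooting sessions. Names and ARNs are placeholders.
supervisor = bedrock_agent.create_agent(
    agentName="network-ops-supervisor",
    foundationModel="amazon.nova-pro-v1:0",
    instruction="Coordinate maintenance, alarm, and KPI checks for network sites.",
    agentResourceRoleArn="arn:aws:iam::123456789012:role/NetOpsAgentRole",
    agentCollaboration="SUPERVISOR",
    memoryConfiguration={
        "enabledMemoryTypes": ["SESSION_SUMMARY"],  # summaries of past sessions
        "storageDays": 30,                          # retention for those summaries
    },
)
supervisor_id = supervisor["agent"]["agentId"]

# Attach a knowledge base holding runbooks, troubleshooting guides, and
# historical incident resolutions. The knowledge base ID is a placeholder.
bedrock_agent.associate_agent_knowledge_base(
    agentId=supervisor_id,
    agentVersion="DRAFT",
    knowledgeBaseId="KB12345678",
    description="NOC runbooks, troubleshooting guides, and past incident notes.",
    knowledgeBaseState="ENABLED",
)

# Rebuild the draft so the new configuration takes effect.
bedrock_agent.prepare_agent(agentId=supervisor_id)
```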
Serverless integration layer: connecting agents with network data
These AI agents rely on serverless Lambda functions to process data efficiently and scale on demand:
- Maintenance checker: gathers and processes maintenance schedule data.
- Alarm checker: retrieves and analyzes active network alarms.
- KPI analyzer: examines performance metrics to identify important trends.
This solution currently uses Lambda functions for efficient data processing and scalability. However, Amazon Bedrock Agents also supports the Model Context Protocol (MCP) for direct integration with external systems. This solution can be extended, based on the use case, to use MCP for real-time, context-aware interactions. For example:
- Power outage integration: Direct connections with utility company APIs to correlate network issues with reported outages.
- Fiber cut detection: Integration with ‘Call Before You Dig’ databases to identify potential causes of fiber outages.
- Weather impact analysis: Real-time weather data analysis for predicting network disruptions.
- Traffic and road work updates: Integration with traffic systems to link network issues with infrastructure damage.
These MCP-powered enhancements would allow our Network Operations Assistant to provide more comprehensive, contextual insights, further reducing troubleshooting time and improving root cause analysis accuracy.
Frontend integration: flexible, secure, and scalable interface
In this implementation, the user-facing side is a Streamlit app deployed in a Docker container on AWS Fargate for scaling. The application’s security is handled by Amazon Cognito, which provides robust user authentication and can be configured to integrate with enterprise Identity Providers (IdPs) for Single Sign-On (SSO) capabilities, supporting standards such as SAML 2.0 and OpenID Connect. Lambda@Edge functions work in conjunction with Amazon CloudFront to provide real-time authorization checks and secure access control at the edge. Content delivery is managed through CloudFront and an Application Load Balancer (ALB), making sure of fast, secure, and global access with efficient load distribution.
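To show the shape of that front end, here is a stripped-down sketch of a Streamlit chat page that calls the supervisor agent through the bedrock-agent-runtime API. The agent ID and alias ID are placeholders, and the deployed application layers Cognito authentication, trace display, and styling on top of this.

```python
import uuid
import boto3
import streamlit as st

AGENT_ID = "AGENT12345"        # placeholder supervisor agent ID
AGENT_ALIAS_ID = "ALIAS12345"  # placeholder agent alias ID

runtime = boto3.client("bedrock-agent-runtime")

st.title("Network Operations Assistant")

# Keep one Bedrock session per browser session so the agent retains context.
if "session_id" not in st.session_state:
    st.session_state.session_id = str(uuid.uuid4())
    st.session_state.messages = []

for msg in st.session_state.messages:
    st.chat_message(msg["role"]).write(msg["content"])

if prompt := st.chat_input("Ask about a site, alarm, or KPI..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    st.chat_message("user").write(prompt)

    # Invoke the supervisor agent; the answer streams back in chunks.
    response = runtime.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=st.session_state.session_id,
        inputText=prompt,
    )
    answer = "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )
    st.session_state.messages.append({"role": "assistant", "content": answer})
    st.chat_message("assistant").write(answer)
```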
Enterprises can replace the Streamlit front-end with their preferred technology stack. The back-end layer for Amazon Bedrock Agents can be invoked from any modern front-end framework, such as React, Angular, Vue.js, or existing enterprise applications through API Gateway or other integration patterns. The architecture remains consistent regardless of the chosen front-end, allowing organizations to do the following:
- Integrate AI agents into their existing network management interfaces
- Scale the solution according to their operational requirements
- Maintain enterprise security standards and access controls
- Choose the front-end framework that best suits their team’s expertise and needs
Data storage: direct, flexible, and ready to scale
All network data lives in an Amazon S3 bucket as CSV files, including information on network sites, maintenance schedules, alarms, and KPIs. This setup allows you to update or replace data with real network sources as needed.
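For example, each Lambda function can load its dataset with a few lines of boto3 and pandas; the bucket name and object key below are placeholders for the ones your deployment creates.

```python
import io
import boto3
import pandas as pd

s3 = boto3.client("s3")

def load_dataset(bucket: str, key: str) -> pd.DataFrame:
    """Read one of the CSV datasets (sites, maintenance, alarms, or KPIs) from S3."""
    obj = s3.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))

# Placeholder bucket name and key; swap in the bucket created by your stack.
alarms = load_dataset("netops-data-123456789012", "alarms.csv")
```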
Deployment
The following sections walk you through how to deploy the Network Operations Assistant.
Prerequisites
Make sure you have the following set up:
- AWS Command Line Interface (AWS CLI) configured with the right permissions
- AWS SAM CLI installed for serverless deployment
- Docker installed, with support for multi-architecture builds
- Python 3.9+
- Amazon Bedrock model access to Nova Pro and Nova Lite models in the specific AWS Region where you’re deploying the solution.
- PowerShell 5.1+ or PowerShell Core 7+ (Windows only)
AWS Configuration
Configure your AWS credentials:
aws configure
Or set up a named profile:
aws configure --profile your-profile-name
Step 1: Clone the repository
First, get the project code from GitHub:
Step 2: Run the deployment script
The deployment script automates everything—from setting up Lambda functions to launching the Streamlit app:
For Linux/macOS:
./build_and_deploy.sh [stack-name] [region] [profile]
For Windows PowerShell:
.\build_and_deploy.ps1 -StackName [stack-name] -Region [region] -Profile [profile]
Parameters explained:
- stack-name: the AWS CloudFormation stack name (default is netops). Note: do not include hyphens (-) in the stack name.
- region: the AWS Region (default is us-east-1).
- profile: the AWS CLI profile (default is default).
What happens behind the scenes?
When you run the script, it does the following:
- Creates a Lambda layer with essential libraries such as pandas and numpy
- Configures Amazon Bedrock Agents with the right permissions
- Deploys Lambda functions that power each AI agent
- Sets up an S3 bucket loaded with synthetic network data
- Builds and deploys the Streamlit app in a container on AWS Fargate
- Sets up Amazon Cognito for user authentication and management
- Configures an ALB to distribute traffic to the Fargate containers
- Deploys Lambda@Edge functions for real-time authorization checks
- Sets up a CloudFront distribution integrated with Cognito and Lambda@Edge for secure, fast, global access
- Configures AWS Identity and Access Management (IAM) roles and policies to make sure of proper access controls across all components
Synthetic data structure
The solution includes preloaded synthetic data for demonstration purposes:
| # | Dataset | Description |
|---|---------|-------------|
| 1 | Sites data | Network site info: location, type, commissioning date |
| 2 | Maintenance | Scheduled and ongoing maintenance activities |
| 3 | Alarms | Active and cleared alarms, including severity levels |
| 4 | KPI metrics | Performance metrics: throughput, latency, packet loss, and other KPIs |
Updating with your own network data (optional)
You can replace the demo data with your own real-world network data after deployment.
1. Prepare your CSV files
Make sure that they match the structure of the preceding synthetic data.
2. Upload to the S3 bucket
Copy the files to the data bucket created by the stack, replacing {account-id} with your actual AWS account ID in the bucket name.
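As an illustrative sketch only, an upload with boto3 might look like the following; the bucket name and file names are placeholders to replace with the bucket created by your deployment and your own CSV files.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and file names; keep the same file names and column
# layout as the synthetic datasets created by the deployment.
bucket = "netops-data-123456789012"
for filename in ["sites.csv", "maintenance.csv", "alarms.csv", "kpi_metrics.csv"]:
    s3.upload_file(filename, bucket, filename)
```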
Step 3: Access Your Network Operations Assistant
When the deployment finishes (usually within 15–20 minutes), access the assistant as follows:
1. The deployment script prompts you to set up an initial Cognito user:
a. Enter a username when prompted
b. Provide a password that meets the necessary security criteria
2. The CloudFront URL is available in the CloudFormation stack outputs section. This is the access point for your Network Operations Assistant.
3. Open the CloudFront URL in your browser.
4. You should be presented with a login screen. Use the Cognito username and password that you set up during deployment to authenticate.
5. After successful authentication, you can start managing your network with the Assistant.
The following is an example CloudFormation Output:
CloudFrontURL: https://d123456abcdef.cloudfront.net
Remember to keep your credentials secure. You can manage more users and permissions through the Amazon Cognito console after initial setup.
Demonstration of the Network Operations Assistant
Getting started
1. Open the CloudFront URL provided in the CloudFormation stack outputs
2. Authenticate using the Cognito credentials set during deployment
3. You should see a chat interface with a welcome message
4. Type your network operations query in natural language
5. The assistant processes your query and provides a response
Try these example queries to explore the capabilities:
Basic site information
- “What’s the status of site_dallas_001?”
- “Give me an overview of site_birmingham_003”
- “Show me all sites in Atlanta”
Maintenance queries
- “Is there any ongoing maintenance at site_dallas_002?”
- “Show me upcoming maintenance for site_birmingham_004”
- “When is the next scheduled maintenance for site_atlanta_003?”
Alarm monitoring
- “Are there any critical alarms active right now?”
- “Show me all alarms for site_dallas_001”
- “What’s the status of the SITE_DOWN alarm on site_birmingham_002?”
Performance analysis
- “How is site_ridgeland_005 performing?”
- “Show me the throughput metrics for site_dallas_003”
- “Are there any anomalies in the network performance today?”
- “Compare the latency between site_birmingham_001 and site_birmingham_002”
Complex queries
- “Give me a full report on site_dallas_001 including maintenance, alarms, and performance”
- “Which sites have both active alarms and scheduled maintenance?”
- “What’s the impact of the current maintenance on site_atlanta_003’s performance?”
How it works: agentic workflow in action
The Network Operations Assistant UI has tracing enabled, allowing you to see in detail the steps that the supervisor agent takes to process your query. In this section we examine an example workflow for the question “What is the status of site_dallas_001?”.
1. Initial user request: the user inputs the query “What is the status of site_dallas_001?”
2. Supervisor agent orchestration: the supervisor agent analyzes the query and determines the need to check multiple aspects of the site’s status. Then, it orchestrates the following sub-agents:
i. Maintenance Agent
ii. Alarm Agent
iii. KPI Agent
3. Parallel data gathering: each sub-agent performs its specific task concurrently:
i. Maintenance Agent checks for ongoing and upcoming maintenance
ii. Alarm Agent retrieves active alarms for the site
iii. KPI Agent collects and analyzes performance metrics
4. Data analysis and correlation: the supervisor agent collates the information from all sub-agents and performs an analysis to identify any correlations or critical issues.
5. Response formulation: based on the collected data and analysis, the supervisor agent formulates a comprehensive response about the site’s status.
6. Presentation to user: the assistant presents the findings to the user in a clear, concise format, highlighting any critical information or necessary actions.
Observing the trace in the UI allows you to see each of these steps unfold in real-time, providing transparency into the assistant’s decision-making process and the sources of information it uses to answer your query.
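If you want the same visibility programmatically, a minimal sketch (with placeholder agent and alias IDs) of invoking the supervisor with tracing enabled and separating trace events from answer chunks could look like this:

```python
import boto3

runtime = boto3.client("bedrock-agent-runtime")

# Invoke the supervisor agent with tracing enabled; the completion stream
# interleaves trace events (orchestration steps, sub-agent calls) with the
# chunks of the final answer.
response = runtime.invoke_agent(
    agentId="AGENT12345",
    agentAliasId="ALIAS12345",
    sessionId="demo-session-1",
    inputText="What is the status of site_dallas_001?",
    enableTrace=True,
)

answer = ""
for event in response["completion"]:
    if "trace" in event:
        # Each trace event describes one step taken by the supervisor or by
        # one of its collaborators.
        print(event["trace"]["trace"])
    elif "chunk" in event:
        answer += event["chunk"]["bytes"].decode("utf-8")

print(answer)
```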
Figure 3: Assistant in action
Figure 4: Assistant in action
Conclusion
The Network Operations Assistant, built on the Amazon Bedrock multi-agent collaboration framework, demonstrates the transformative potential of intelligent AI systems in telecom network management. By orchestrating specialized agents, it delivers real-time insights, comprehensive maintenance tracking, and proactive KPI analysis through an intuitive chat interface. This solution empowers network teams to operate with unprecedented efficiency and responsiveness.
Although our solution uses the fully managed capabilities of Amazon Bedrock Agents, there are other approaches to building AI agents for network operations. In Part 2 of this series, we explore implementing a similar solution using the open source Strands Agents SDK, which offers greater customization for those who need fine-grained control over agent behavior and interactions. We also demonstrate how to use the MCP for real-time, context-aware interactions with external systems, showing how these tools can enhance network operations.
Regardless of the chosen framework, the power of agentic AI in network operations is clear. The flexible, scalable multi-agent architecture enhances situational awareness and paves the way for future innovations. Advanced tool integrations, greater automation, and the potential use of the MCP for real-time, context-aware interactions with external systems all point to a future where network operations become increasingly proactive and efficient.
The multi-agent approach allows for:
1. Rapid problem identification and resolution
2. Enhanced decision-making through correlated data analysis
3. Proactive maintenance and performance optimization
4. Scalable expertise that can be easily updated and expanded
As telecom networks grow more complex and dynamic, solutions such as the Network Operations Assistant will be crucial in meeting operational challenges. This implementation using Amazon Bedrock Agents provides a robust foundation for operational excellence and innovation. Stay tuned for Part 2, where we explore alternative implementations using Strands Agents SDK and MCP, and provide you with a comprehensive understanding of different approaches to AI-powered network operations.
Ultimately, the adoption of these AI-driven solutions enables telecoms to not only keep pace with the increasing complexity of modern networks but also to thrive in this environment, delivering superior service quality and reliability to their customers.
Ready to build your own Network Operations Assistant? Start with this deployment guide, or explore the complete source code and documentation in our GitHub repository.