AWS Cloud Enterprise Strategy Blog

From Tools to Teammates: CTO’s Guide to Evolving Architecture for Agentic AI

In my previous blog, I shared how to evolve leadership for agentic AI using familiar mental models. As a CTO, I’ve been thinking about the corresponding architectural shifts required: We need to move from building predictable systems to developing autonomous capabilities that augment teams. Based on hands-on explorations and working with fellow technology leaders navigating this shift, I want to share some ideas for architectural evolution.

These changes build on established principles of distributed computing that you may already use in rule-based systems. With agentic systems, the challenge isn’t inventing new concepts but applying them at scale. Running hundreds of thousands of nondeterministic agents at once introduces challenges beyond today’s implementations.

System Architecture: From Rigid Orchestration to Intelligent Coordination

Most enterprise architectures resemble assembly lines. Service A calls Service B in a predetermined sequence, which updates Database C, which triggers Process D. Every step is scripted, every handoff is defined, and outcomes are predictable. This rigid orchestration is optimized for stable, consistent processes where system boundaries and workflows are well-understood.

Orchestration becomes even more important in agentic systems, but now we’re coordinating adaptive services instead of scripted ones. Research from MIT on distributed AI systems shows that how agents interact matters more than an individual agent’s intelligence.¹

Take the example of a customer refund request. The automation follows a rigid sequence: (1) route the refund inquiry, (2) check the database for order and return policy, (3) apply refund rules, and (4) send a response. Complex cases (e.g., damaged goods, partial returns of a bundle, multiple payment methods) require manual interventions.

In an agentic system, a refund agent can assess and adapt to context. It can (1) review the customer’s history, (2) check the inventory system to offer a replacement, (3) get information from shipping to see what might have happened during the delivery, and (4) provide a personalized solution that creates a better customer experience.

You can theoretically adapt existing architecture to cover these scenarios. But managing hundreds of agents and every possible scenario is impractical. Instead of a single refund service that follows predetermined rules, you need:

  • Event-driven coordination: The refund agent publishes “customer-dispute-initiated” events that other relevant agents (inventory, shipping, customer-history) dynamically respond to.
  • Contextual workflows: If the shipping agent reports “package-damaged-in-transit,” the workflow automatically involves the delivery carrier claims agent and adjusts the refund logic.
  • Persistent memory patterns: The entire interaction is stored as “customer had shipping damage, resolved with replacement and credit, prefers expedited shipping,” not just “refund processed.” The added context shapes how future agents interact with this customer.
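
To make the three patterns above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a reference implementation: the `EventBus` class is a hypothetical in-memory stand-in for whatever managed event bus you use, the agent functions and the order ID are invented for the example, and only the topic names “customer-dispute-initiated” and “package-damaged-in-transit” come from the scenario above.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Event:
    topic: str
    payload: dict


@dataclass
class EventBus:
    """Hypothetical in-memory stand-in for a managed event bus."""
    subscribers: dict = field(default_factory=lambda: defaultdict(list))
    memory: list = field(default_factory=list)  # stand-in for persistent memory

    def subscribe(self, topic: str, handler: Callable) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, event: Event) -> None:
        # Record every event so later interactions can see the full history.
        self.memory.append((event.topic, event.payload))
        # Iterate over a copy so handlers can publish or subscribe safely.
        for handler in list(self.subscribers[event.topic]):
            handler(event, self)


def shipping_agent(event: Event, bus: EventBus) -> None:
    # Assumption for the demo: tracking data shows damage for order A-100 only.
    if event.payload["order_id"] == "A-100":
        bus.publish(Event("package-damaged-in-transit", event.payload))


def carrier_claims_agent(event: Event, bus: EventBus) -> None:
    # Joins the workflow only when damage is reported (contextual workflow).
    bus.publish(Event("carrier-claim-opened", event.payload))


def run_demo(order_id: str) -> list:
    bus = EventBus()
    bus.subscribe("customer-dispute-initiated", shipping_agent)
    bus.subscribe("package-damaged-in-transit", carrier_claims_agent)
    bus.publish(Event("customer-dispute-initiated", {"order_id": order_id}))
    return [topic for topic, _ in bus.memory]
```

Nothing in the refund agent names the carrier claims agent. The claims agent joins only because a damage event appeared, which is the difference between a scripted sequence and coordination through events. A production system would replace the in-memory list with durable storage and add delivery guarantees.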

This architectural shift mirrors how we manage high-agency employees. Instead of micromanaging agents, we orchestrate autonomous “team members” that make intelligent decisions within their areas of responsibility.

Data Architecture: From Centralized Repositories to Distributed Intelligence

A typical enterprise data architecture stores information in rigid schemas, predefined tables, fixed relationships, and static data models across many systems. Enterprises tend to centralize this information in data lakes, where the relationships are programmed, not learned. “Content” (i.e., unstructured data in org charts, SOPs, operating plans, process documents, etc.) remains separate. This works when humans are the primary decision-makers. People can wait for reports and combine data with content to add context and relationships.

Agentic AI requires distributed intelligence that includes context and relationships. Agents need on-demand access to relevant information with the context to interpret it. They need to contribute insights back to organizational knowledge in real time so the whole team (including other agents) can benefit from it.

Consider employee onboarding. Current systems typically pull structured data (e.g., employee role, department, location, manager assignment, and required training) from HR and learning management systems. But creating an effective onboarding experience requires connecting this data with unstructured content to determine which onboarding approaches work best for different personality types or career backgrounds. Today HR and managers manually piece together this information to design personalized onboarding experiences.

AI agents can discover and share these contextual patterns across the organization. An onboarding agent might learn that engineers from “fast-paced startup environments” struggle with teams that emphasize “detailed documentation” and “structured processes.” The system understands that “startup culture,” “agile environment,” and “move-fast mindset” represent similar backgrounds, while “process-heavy,” “compliance-focused,” and “documentation-driven” describe similar destination teams. This insight is made available to other similar scenarios, so agents onboarding a “scrappy marketing manager from a growth company” into a “regulated financial services team” can apply the same cultural adaptation strategies.

The learning compounds. Each successful onboarding teaches the system more about how different professional backgrounds align with team cultures, creating organizational intelligence that improves with every hire. “Data as an asset” becomes “knowledge as a capability.”

The architectural change centers on enabling real-time correlation and learning:

  • Semantic data integration: Use vector embeddings that combine structured HR data (e.g., “role” or “supervisor”) with unstructured content (e.g., team wikis or project docs) in unified vector representations.
  • Dynamic knowledge graphs: Automatically build and update relationships between employees, teams, skills, and team attributes based on onboarding outcomes.
  • Vector similarity search: Enable real-time pattern recognition across thousands of hiring decisions. Use high-dimensional vector databases to identify employees with similar backgrounds and teams with comparable cultures, even when described using different terminology.
  • Contextual retrieval systems: Combine vector similarity with graph relationships to answer complex queries like “find teams similar to the data science group that successfully onboarded engineers from startup backgrounds.” These systems require hybrid architectures that can navigate semantic similarity and explicit organizational relationships.
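
As a rough illustration of contextual retrieval, the sketch below combines vector similarity with explicit graph relationships in plain Python. Everything in it is assumed for the example: the three-dimensional vectors are toy values (real embeddings come from an embedding model and have far more dimensions), and the team names, the `successfully_onboarded` relation, and the similarity threshold are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy embeddings of team culture descriptions (hypothetical values).
TEAM_VECTORS = {
    "data-science": [0.9, 0.1, 0.3],
    "ml-platform": [0.8, 0.2, 0.4],
    "compliance": [0.1, 0.9, 0.2],
}

# Explicit knowledge-graph edges: (team, relation, hire background).
EDGES = {
    ("ml-platform", "successfully_onboarded", "startup"),
    ("compliance", "successfully_onboarded", "startup"),
    ("data-science", "successfully_onboarded", "enterprise"),
}


def similar_teams_with_outcome(reference, background, min_similarity=0.9):
    """Teams semantically close to `reference` that also have the graph edge."""
    ref_vec = TEAM_VECTORS[reference]
    results = []
    for team, vec in TEAM_VECTORS.items():
        if team == reference:
            continue
        score = cosine(ref_vec, vec)
        has_edge = (team, "successfully_onboarded", background) in EDGES
        if score >= min_similarity and has_edge:
            results.append((team, score))
    # Most similar teams first.
    return sorted(results, key=lambda item: item[1], reverse=True)
```

Calling `similar_teams_with_outcome("data-science", "startup")` keeps only the teams that pass both filters. The compliance team has the right graph edge but a dissimilar culture vector, so it is excluded. At scale, a vector database and a graph store would replace the dictionary and the set, but the two-step filter stays the same.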

Each agent interaction enhances organizational intelligence and requires different architectural patterns than traditional data warehousing and ETL pipelines.

Security: From Static Permissions to Dynamic Delegation

Enterprise security typically relies on static permissions, established user roles, assigned system access, and user authentication. This works when employees access predefined systems with consistent authority levels. Traditional security architecture grants permissions based on job function. Customer service reps get CRM access, and finance teams get ERP access with clear boundaries of what each role can do.

Consider an escalating billing dispute in customer service. A customer service rep logs into the CRM system with predetermined permissions and follows established procedures. But when the issue requires (a) accessing the customer’s payment processor data, (b) coordinating with the shipping partner’s tracking system, and (c) potentially authorizing an approval-required refund, static permissions require manual workarounds. This often results in the customer waiting while the rep checks with someone who has access. Giving broad permissions is not a good security practice either because it increases risk.

An agentic approach enables dynamic delegation. A customer service agent can (a) prove it’s acting on behalf of Customer X for specific actions, (b) authenticate with external payment systems using contextual authority that expires after the interaction, and (c) escalate to a customer support supervisor with proper context and audit trails. The customer gets a faster resolution. And the business maintains security because every action is tied to the agent’s identity and the customer’s delegated authority.

Think about how you manage security for human employees. You don’t just check their ID badge once when they enter the office. You provide role-based access, validate their authority for specific actions, and maintain audit trails of what they did and why. With agentic AI, validation must be continuous, scalable, and situation-specific. The same agent might have different authorities depending on the customer it’s helping, the escalation level, and the external systems it needs to access.

This architectural evolution requires enabling contextual, delegated authority:

  • Context-aware authentication: The customer service agent proves not just its identity, but also its current authority to act on behalf of a specific customer for a specific purpose.
  • Temporal authorization: Permissions automatically expire when the customer interaction ends. This is safer than using long-lived access tokens, which can lead to ongoing security risks.
  • Cross-organizational delegation: The agent uses the customer’s granted authority to authenticate with external systems (e.g., payment processors, shipping partners) while maintaining complete audit trails.
  • Granular delegation controls: Customers can grant agents permission to “view my payment history” but not “change my payment method.” These contextual permissions are validated in real time.
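
The controls above can be sketched as a scoped, expiring delegation that is checked on every action. This is a simplified model, not a security implementation: the `Delegation` type, the scope strings, and the identifiers are hypothetical, and a real system would use signed tokens issued by an identity provider rather than in-process objects.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Delegation:
    agent_id: str
    customer_id: str
    scopes: frozenset   # actions the customer has granted
    expires_at: float   # seconds since the epoch


def grant(agent_id, customer_id, scopes, ttl_seconds, now=None):
    """Create a delegation that expires `ttl_seconds` from `now`."""
    now = time.time() if now is None else now
    return Delegation(agent_id, customer_id, frozenset(scopes), now + ttl_seconds)


def authorize(delegation, agent_id, customer_id, action, audit_log, now=None):
    """Validate one action and record the decision, allowed or not."""
    now = time.time() if now is None else now
    allowed = (
        delegation.agent_id == agent_id          # context-aware: this agent
        and delegation.customer_id == customer_id  # acting for this customer
        and action in delegation.scopes          # granular: this action only
        and now < delegation.expires_at          # temporal: not yet expired
    )
    audit_log.append(
        {"agent": agent_id, "customer": customer_id,
         "action": action, "allowed": allowed, "at": now}
    )
    return allowed
```

With a grant for “payments:view-history” only, a request for “payments:change-method” is denied, and so is any request made after the expiry time. Denied attempts are logged as well, so the audit trail covers what the agent tried to do, not just what it did.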

This change shifts security from static role assignments to dynamic authority delegation. It requires security architectures to manage permissions that change based on customer context, interaction scope, and temporal boundaries across organizational silos.

Integration: From API Contracts to Semantic Protocols

Enterprise integration relies on API contracts, precise specifications of data formats, required fields, and expected responses. Service A knows exactly what to send to Service B and exactly what response to expect. APIs are like formal business letters—structured, precise, and ritualistic. This works when interactions are generally predefined or follow a specific structure and order.

Consider a marketing campaign that must respond to a trending topic. Typical integration handles planned campaigns well. The marketing automation system sends predefined customer segments to the email platform, which returns delivery statistics, while the social media scheduler posts predetermined content based on fixed timing rules. But when a viral moment requires an immediate response (like a competitor’s PR crisis or breaking industry news), it typically triggers a manual, all-hands-on-deck scramble.

Social media systems aren’t typically designed to tell the email platform, “Pause the competitor comparison campaign because they’re in crisis.” And the email system doesn’t typically communicate to the advertising platform, “This customer segment is responding well to authenticity messaging right now.”

An agentic approach enables dynamic coordination. A social media agent detects shifting sentiment and communicates context to email and advertising agents that can adjust messaging, pause conflicting campaigns, and amplify authentic content across channels. Instead of waiting for scheduled data exchanges, agents negotiate adjustments in real time, share contextual insights about what’s working, and coordinate responses across multiple platforms and customer touchpoints.

Agent communication starts to resemble conversations between colleagues: “Customer engagement on competitor comparison ads dropped 40% after their data breach news broke. Suggest pivoting to security messaging and increasing budget on trust-building content.”

To enable this contextual coordination at a massive scale, we need to use:

  • Context-aware protocols: Agents can communicate why they need something, not just what they need, because protocols carry intent and situational awareness, not just data.
  • Intent-based service discovery: Agents can find relevant capabilities dynamically rather than relying on predefined service catalogs.
  • Dynamic negotiation patterns: Agents can propose alternatives, suggest optimizations, and coordinate complex multichannel responses without predetermined workflows.
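
One way to picture these three ideas together is a message that carries intent and reasoning, a registry that routes by intent, and agents that can accept, decline, or counter. The sketch below is purely illustrative: `AgentMessage`, `Registry`, the intent name, and the agents’ decision rules are all assumptions made for this example, not an existing protocol or API.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMessage:
    sender: str
    intent: str   # what the sender wants to happen
    reason: str   # why, in terms another agent can act on
    context: dict = field(default_factory=dict)


class Registry:
    """Hypothetical registry where agents advertise the intents they handle."""

    def __init__(self):
        self._handlers = {}

    def advertise(self, intent, agent_name, handler):
        self._handlers.setdefault(intent, []).append((agent_name, handler))

    def dispatch(self, message):
        # Intent-based discovery: the sender never names its recipients.
        handlers = self._handlers.get(message.intent, [])
        return {name: handler(message) for name, handler in handlers}


def email_agent(message):
    # Accepts only when the shared context shows a sharp engagement drop.
    if message.context.get("engagement_change", 0) <= -0.25:
        return {"decision": "accept", "action": "pause competitor comparison emails"}
    return {"decision": "decline", "action": None}


def ads_agent(message):
    # Negotiation: responds with an alternative instead of a plain yes or no.
    return {"decision": "counter", "action": "shift budget to trust-building content"}


def build_registry():
    registry = Registry()
    registry.advertise("adjust-competitor-messaging", "email", email_agent)
    registry.advertise("adjust-competitor-messaging", "ads", ads_agent)
    return registry
```

A social media agent could dispatch a message with the intent “adjust-competitor-messaging”, a reason such as “engagement dropped after breach news”, and `{"engagement_change": -0.40}` as context. Each recipient then decides for its own channel, which keeps accountability with the agent that owns that channel.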

Instead of rigid handoffs between specialized systems, agents can form campaign optimization teams, share real-time market context, and coordinate their actions while maintaining accountability for their specific channels and capabilities.

Monitoring: From System Health to Behavioral Intelligence

Traditional monitoring tells us whether systems are up, accurate, and performant. But we need deeper insights when we work with AI agents. Standard metrics like uptime, response time, and error rates can’t tell us whether agents are making effective decisions or improving their operations over time.

Consider a site reliability scenario where infrastructure agents manage auto-scaling and incident response. Traditional monitoring works fine for predictable scenarios: When CPU usage hits 80%, it triggers scaling rules and sends alerts if thresholds are breached. This approach effectively tracks system health, response times, and resource utilization when success means maintaining predefined performance targets.

But once infrastructure agents start making autonomous decisions (e.g., preemptively scaling based on traffic patterns, coordinating failover strategies across regions, or optimizing resource allocation based on cost and performance trade-offs), traditional metrics become insufficient.

When an agent scales down during apparently peak traffic after detecting an unusual bot pattern, you need to understand its reasoning. When scaling decisions change over time, you need to distinguish between improved optimization and concerning drift in decision-making logic.

An agentic approach requires behavioral observability. You need to understand not just that scaling happened, but why the agent chose that response, how its decision patterns have evolved, and whether its predictions are improving. When an infrastructure agent escalates an incident, you need visibility into the context driving that decision. Was it unusual error patterns, cascading failure indicators, or something the agent learned from previous incidents?

This change is about understanding agent reasoning and evolution:

  • Event correlation across distributed agent decisions: Understand how infrastructure agents coordinate their actions and whether their collaborative patterns are effective.
  • Context preservation in observability pipelines: Capture the environmental conditions, historical patterns, and decision factors that influenced an agent’s actions.
  • Real-time behavioral boundary monitoring: Detect when agent decision patterns drift outside expected parameters, whether due to learning, environmental changes, or potential issues.
  • Pattern recognition at scale: Identify trends in agent behavior across different infrastructure domains and time periods.
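
A simple way to start on behavioral observability is to log each decision with its reasoning and context, then compare recent behavior against a baseline. The sketch below assumes a hypothetical `DecisionRecord` shape and a deliberately crude drift signal (the change in how often one action is chosen). Real drift detection would likely use richer statistics, and the 0.2 tolerance is an arbitrary placeholder.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    agent: str
    action: str    # e.g., "scale_up", "scale_down", "escalate"
    reason: str    # the agent's stated rationale
    context: dict = field(default_factory=dict)  # conditions at decision time


def action_share(records, action):
    """Fraction of decisions in `records` that chose `action`."""
    if not records:
        return 0.0
    return Counter(r.action for r in records)[action] / len(records)


def drifted(baseline, recent, action, tolerance=0.2):
    """True when the share of `action` moved more than `tolerance`."""
    change = abs(action_share(recent, action) - action_share(baseline, action))
    return change > tolerance
```

If an agent chose “scale_down” in 10% of baseline decisions and 50% of recent ones, `drifted` flags it. The stored `reason` and `context` fields are what let a reviewer tell improved optimization apart from concerning drift.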

Understanding why agents made decisions and how their reasoning evolves enables teams to build trust in agent decisions.

Looking Ahead

These architectural shifts lay the foundation for evolving AI models from sophisticated tools to true teammates. They apply decades of distributed systems research to the unique challenges that agentic AI introduces, challenges that require both innovation and pragmatism.

This balance between innovation and reliability is why our approach at AWS is built on what Swami Sivasubramanian describes as an “evolving foundation.” Organizations need architectures that harness an agent’s nondeterministic behavior at scale while maintaining enterprise-grade reliability and security.

This vision drove the development of Amazon Bedrock AgentCore, which provides modular services that support any framework, model, or protocol. It also helps address shifts around identity, observability, integration, and security. By handling operational complexity, AgentCore lets organizations focus on what matters most: building intelligent agent experiences that scale with their business needs.

Reference:

1. MIT Media Lab. “What is a Multi-Agent System?” MIT Media Lab Articles, 2024.

Ishit Vachhrajani

Ishit leads a global team of Enterprise Strategists consisting of former CXOs and senior executives from large enterprises. Enterprise Strategists partner with executives of some of the world’s largest companies, helping them understand how the cloud can enable them to spend more time focusing on customer needs with its ability to increase speed and agility, drive innovation, and form new operating models. Prior to joining AWS, Ishit was Chief Technology Officer at A+E Networks, responsible for global technology across cloud, architecture, applications and products, data analytics, technology operations, and cyber security. Ishit led a major transformation at A+E: moving to the cloud, reorganizing for agility, implementing a unified global financial system, creating an industry-leading data analytics platform, and revamping global content sales and advertising sales products, all while significantly reducing operational costs. He has previously held leadership positions at NBCUniversal and global consulting organizations. Ishit has been recognized with several awards, including the CEO award called “Create Great” at A+E Networks. He is passionate about mentoring the next generation of leaders and serves on a number of peer advisory groups. Ishit earned his bachelor’s degree in Instrumentation & Control Engineering with a gold medal for academic achievement from the Nirma Institute of Technology in India.