AWS Architecture Blog

Mastering millisecond latency and millions of events: The event-driven architecture behind the Amazon Key Suite

Background

Amazon Key empowers customers to securely manage access to their homes and businesses through innovative solutions. Through a suite of consumer and business products, the Amazon Key team is transforming how customers receive deliveries and manage access to their spaces. Our In-Garage Delivery service offers a secure and convenient solution for receiving Amazon packages and groceries directly inside customers’ garages. For property managers and building owners, Amazon Key provides comprehensive access management solutions that enable safe and efficient delivery operations in apartment buildings and gated communities, enhancing both security and convenience for residents.

In this post, we explore how the Amazon Key team used Amazon EventBridge to modernize its architecture, transforming a tightly coupled monolithic system into a resilient, event-driven solution. We cover the technical challenges we faced, our implementation approach, and the architectural patterns that improved reliability and scalability. This includes our solutions for managing event schemas at scale, handling multiple service integrations efficiently, and building an extensible architecture that accommodates future growth.

Opportunities

Service Coupling and System Fragility

Our legacy architecture faced significant challenges stemming from its tightly coupled design, where service interactions created a complex web of dependencies impacting system stability and scalability. Making service modifications was particularly challenging, as adding or removing services required careful consideration of numerous interdependencies. An incident highlighted this vulnerability when an issue in Service-A triggered a cascade of failures across many upstream services, with increased timeouts leading to retry attempts and ultimately resulting in service deadlocks. System fragility was further demonstrated when problems with a single device vendor, despite being responsible only for specific delivery operations, caused widespread degradation across multiple system services.

Loose Event Schemas

Our old event management infrastructure lacked explicit schema definitions and employed a loosely typed data architecture, leading to several critical issues. Events were difficult to maintain as use cases expanded, and the absence of formal schema documentation impacted transparency and team collaboration. The design made it almost impossible to implement backward-incompatible changes, such as removing unused fields or events for performance optimization. Without a repository for schema management, team-to-team collaboration for schema modifications (adding fields, removing fields, deprecating fields, or marking fields as required) became challenging. The system also lacked organized validation logic, making it difficult for publishers to identify invalid events before they entered the system. Additionally, the loosely typed schemas lost important semantic context, such as inheritance and composition relationships between different event schemas.

Inconsistent Event Routing and Management

The event routing logic was manually managed and lacked the sophistication needed for growing use cases. The system only supported basic validation of events, primarily checking for required fields, with limited capability for extending validation rules or implementing more complex routing logic. Features that were commonly available in off-the-shelf solutions, such as parallel publishing to multiple subscribers, required significant custom development and ongoing maintenance effort. The implementation only supported a limited number of subscribers to the event pipeline, with no sustainable pathway for adding more consumers. While attempts were made to reduce coupling through SNS/SQS pairs between services, these solutions were implemented on an ad-hoc basis, lacking standardization and creating additional maintenance overhead. This approach led to redundant work and failed to abstract away common functionality, resulting in an inefficient and hard-to-maintain system.

These challenges collectively highlighted the need for a more robust and flexible architectural approach that could better serve the system’s evolving needs while improving reliability, maintainability, and scalability.

Design

Given our requirements and the architectural challenges we faced, we implemented a single-bus, multi-account pattern to optimize our system architecture. In this design, each service team maintains complete ownership and autonomy over their application stack, enabling independent development and deployment cycles. Meanwhile, our DevOps team manages a centralized infrastructure stack that encompasses event bus rules, target configurations, and service integrations. This separation of concerns provides several key benefits:

  1. Clear ownership boundaries: Service teams can focus on their core business logic while leveraging a standardized event infrastructure.
  2. Centralized governance: The DevOps team facilitates consistent event routing patterns, security controls, and monitoring across service integrations.
  3. Simplified operations: A single event bus reduces operational complexity while maintaining logical separation through well-defined routing rules.
  4. Enhanced security: The multi-account structure provides natural isolation boundaries while still enabling controlled cross-account event flows.
  5. Streamlined compliance: Centralized management of data exchange patterns makes it easier to implement and maintain compliance requirements.

While EventBridge provided the foundation, we developed additional components to meet our specific requirements. Our team built three key components: a schema repository serving as the single source of truth for event definitions, a client library that handles schema validation and provides developer-friendly abstractions, and an infrastructure library offering reusable components for subscriber integration.
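To make the single-bus, multi-account pattern more concrete, here is a minimal sketch of the kind of rule and target definition a centralized infrastructure stack could hold for one subscriber. The object shapes follow the request parameters of the EventBridge PutRule and PutTargets APIs. Everything else is an assumption for illustration only: the `buildSubscriberRoute` helper, the bus and role names, the placeholder account IDs, and the `DeliveryCompleted` event type are hypothetical and not taken from the Amazon Key implementation.

```typescript
// Hypothetical description of one subscriber of the central bus.
interface SubscriberRoute {
  subscriberName: string;
  subscriberAccountId: string; // placeholder account IDs are used below
  region: string;
  busName: string;             // event bus in the subscriber's account
  detailTypes: string[];       // event types this subscriber wants to receive
}

// Shapes mirror the EventBridge PutRule and PutTargets request parameters.
interface RuleDefinition {
  Name: string;
  EventBusName: string;
  EventPattern: string; // JSON-encoded event pattern
  State: "ENABLED" | "DISABLED";
}

interface TargetDefinition {
  Rule: string;
  EventBusName: string;
  Targets: { Id: string; Arn: string; RoleArn: string }[];
}

const CENTRAL_BUS_NAME = "central-event-bus"; // assumed name
const CENTRAL_ACCOUNT_ID = "111111111111";    // placeholder

// Builds one rule on the central bus that forwards matching events
// to the subscriber's own bus in another account.
function buildSubscriberRoute(
  route: SubscriberRoute
): { rule: RuleDefinition; targets: TargetDefinition } {
  const ruleName = `${route.subscriberName}-route`;
  const rule: RuleDefinition = {
    Name: ruleName,
    EventBusName: CENTRAL_BUS_NAME,
    EventPattern: JSON.stringify({ "detail-type": route.detailTypes }),
    State: "ENABLED",
  };
  const targets: TargetDefinition = {
    Rule: ruleName,
    EventBusName: CENTRAL_BUS_NAME,
    Targets: [
      {
        Id: `${route.subscriberName}-bus`,
        Arn: `arn:aws:events:${route.region}:${route.subscriberAccountId}:event-bus/${route.busName}`,
        // Role in the central account that EventBridge assumes to deliver events (assumed name).
        RoleArn: `arn:aws:iam::${CENTRAL_ACCOUNT_ID}:role/central-bus-delivery-role`,
      },
    ],
  };
  return { rule, targets };
}

const deliveryRoute = buildSubscriberRoute({
  subscriberName: "DeliveryService",
  subscriberAccountId: "222222222222",
  region: "us-east-1",
  busName: "delivery-service-bus",
  detailTypes: ["DeliveryCompleted"],
});
```

Keeping rules and targets as data owned by one stack is what lets a single team review every cross-account route, while service teams only declare which event types they want.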

Event Schema Repository

Amazon EventBridge’s schema discovery and documentation capabilities provide powerful solutions for managing event-driven architectures. The service automatically captures event structures in the schema registry, maintaining versions as events evolve over time. While EventBridge provides developers with tools to implement validation using external solutions or custom application code, it currently does not include native schema validation capabilities. For our organization’s large-scale event-driven architecture, schema validation was a critical requirement. We evaluated two implementation approaches: a centralized validation service or client-side validation at the publisher/subscriber level. The centralized approach would have required us to manage and scale additional infrastructure, and it would have introduced latency through extra network hops. After analyzing these factors alongside our requirements for schema governance and team autonomy, we implemented a custom schema repository with client-side validation.

This architecture prioritizes developer experience through immediate validation feedback while maintaining our standards for schema versioning and release management. The repository serves as the foundation for our event-driven architecture, providing essential capabilities for data governance and quality control. By acting as the single source of truth for event definitions, it enables standardized validation across clients, enforces data quality checks, establishes clear ownership boundaries, and maintains comprehensive audit trails for schema changes. Publishers and subscribers leverage these schemas to maintain data consistency and compatibility as their services evolve. The repository has become instrumental in facilitating efficient cross-team collaboration through self-service schema discovery, documentation, and automated validation during development. It maintains a comprehensive registry of event publishers and their corresponding subscribers, providing clear visibility into event flow patterns and dependencies across the system. Teams can quickly manage schema evolution with clear deprecation policies and migration paths, while the system helps detect breaking changes early in the development cycle. This collaborative approach has significantly improved team velocity and reduced integration issues between services.
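One way to picture early detection of breaking changes is a comparison between two versions of a schema at review or build time. The following is a minimal sketch under that assumption; the `findBreakingChanges` helper and its two rules (removed properties and newly required properties) are illustrative and are not a description of the checks the repository actually runs.

```typescript
// Minimal subset of a JSON Schema object definition needed for the comparison.
interface ObjectSchema {
  properties: Record<string, unknown>;
  required?: string[];
}

// Returns human-readable descriptions of changes that could break existing clients.
function findBreakingChanges(previous: ObjectSchema, next: ObjectSchema): string[] {
  const problems: string[] = [];
  const previousRequired = previous.required || [];

  // Removing a property can break subscribers that still read it.
  for (const name of Object.keys(previous.properties)) {
    if (!(name in next.properties)) {
      problems.push(`property removed: ${name}`);
    }
  }

  // Making a property newly required can break publishers that do not send it yet.
  for (const name of next.required || []) {
    if (previousRequired.indexOf(name) === -1) {
      problems.push(`property newly required: ${name}`);
    }
  }
  return problems;
}

// Hypothetical second version: drops "time" and adds a required "region" field.
const eventV1Schema: ObjectSchema = {
  properties: { id: {}, type: {}, time: {}, publisher: {} },
  required: ["id", "type", "time", "publisher"],
};
const eventV2Schema: ObjectSchema = {
  properties: { id: {}, type: {}, publisher: {}, region: {} },
  required: ["id", "type", "publisher", "region"],
};

const breakingChanges = findBreakingChanges(eventV1Schema, eventV2Schema);
```

A check like this fits naturally into a pull request pipeline for the schema repository, so that an incompatible change is flagged before any publisher or subscriber picks it up.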

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "$id": "/resource/event/schema/EventV1.json",
    "title": "EventV1",
    "description": "Schema for a simple event.",
    "type": "object",
    "properties": {
        "id": {
            "description": "Id of the event.",
            "type": "string"
        },
        "type": {
            "description": "Type of the event.",
            "$ref": "EventType.json"
        },
        "time": {
            "description": "Time at which the event occurred. It uses ISO 8601 Date Time Format. Reference: https://www.iso.org/iso-8601-date-and-time-format.html",
            "type": "string",
            "format": "date-time"
        },
        "publisher": {
            "description": "Publisher of the event.",
            "$ref": "../core/Publisher.json"
        }
    },
    "required": [
        "id",
        "type",
        "time",
        "publisher"
    ]
}
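As a rough illustration of what client-side validation against this schema involves, the sketch below hand-codes the constraints that are visible in EventV1.json. A real client would more likely hand the schema file to a JSON Schema validator library; the `validateEventV1` function is hypothetical. Because EventType.json and Publisher.json are not shown in this post, `type` and `publisher` are only checked for presence, and the date-time check relies on the lenient `Date.parse`, so it is looser than a strict format validator.

```typescript
// Fields listed under "required" in EventV1.json.
const REQUIRED_FIELDS = ["id", "type", "time", "publisher"];

// Returns a list of validation errors; an empty list means the event passed.
function validateEventV1(candidate: unknown): string[] {
  if (typeof candidate !== "object" || candidate === null || Array.isArray(candidate)) {
    return ["event must be an object"];
  }
  const event = candidate as Record<string, unknown>;
  const errors: string[] = [];

  // "required": every listed field must be present.
  for (const field of REQUIRED_FIELDS) {
    if (!(field in event)) {
      errors.push(`missing required field: ${field}`);
    }
  }

  // "id": { "type": "string" }
  if ("id" in event && typeof event["id"] !== "string") {
    errors.push("id must be a string");
  }

  // "time": { "type": "string", "format": "date-time" } (loose check only)
  const time = event["time"];
  if ("time" in event && (typeof time !== "string" || isNaN(Date.parse(time)))) {
    errors.push("time must be a date-time string");
  }
  return errors;
}

// Hypothetical sample events.
const validEventErrors = validateEventV1({
  id: "evt-1",
  type: "DeliveryCompleted",
  time: "2024-05-01T12:00:00Z",
  publisher: { name: "DeliveryService" },
});
const invalidEventErrors = validateEventV1({
  id: "evt-2",
  type: "DeliveryCompleted",
  time: "not-a-date",
});
```

Returning every error at once, rather than failing on the first, gives publishers the immediate and complete feedback that motivated validating on the client side.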

Client Library

The client library serves as a crucial component for both publishers and subscribers, streamlining their integration with the central event bus. At its core, the library leverages our Event Schema Repository, generating code bindings at build time to provide developers with type-safe and intuitive interfaces for event creation and handling. This approach significantly enhances developer productivity by offering straightforward and convenient methods to construct events and interact with the bus, reducing the likelihood of errors and improving code readability.

A key feature of the client library is its built-in validation mechanism. By utilizing the schemas from our local repository, the library performs thorough validation of events before they are published. This proactive approach catches potential issues early in the development cycle, making sure that only well-formed events conforming to the agreed-upon schemas make it to the event bus. Once validated, the library handles the serialization process and manages the actual publishing of events to the bus, abstracting and simplifying data transformation and transport.
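The validate, serialize, and publish steps described above can be sketched as follows. This is an assumption-laden illustration, not the library's API: the `EventV1` interface stands in for a binding generated at build time, the `toPutEventsEntry` helper is hypothetical, and mapping the publisher name to `Source` and the event type to `DetailType` is a guess at one reasonable convention. The `PutEventsEntry` fields themselves follow the shape of an entry in an EventBridge PutEvents request.

```typescript
// Stand-in for a type binding generated from EventV1.json (hypothetical).
interface EventV1 {
  id: string;
  type: string;                // EventType.json is not shown; assumed to be a string
  time: string;                // ISO 8601 date-time
  publisher: { name: string }; // Publisher.json is not shown; shape assumed
}

// Shape of one entry in an EventBridge PutEvents request.
interface PutEventsEntry {
  EventBusName: string;
  Source: string;
  DetailType: string;
  Time: Date;
  Detail: string; // JSON-encoded event body
}

// Validates the event, serializes it, and wraps it in a PutEvents entry.
// Throws before anything is sent if the event is not well formed.
function toPutEventsEntry(event: EventV1, eventBusName: string): PutEventsEntry {
  const time = new Date(event.time);
  if (isNaN(time.getTime())) {
    throw new Error(`event ${event.id}: time is not a valid date-time`);
  }
  return {
    EventBusName: eventBusName,
    Source: event.publisher.name,
    DetailType: event.type,
    Time: time,
    Detail: JSON.stringify(event),
  };
}

// Hypothetical sample event and bus name.
const sampleEntry = toPutEventsEntry(
  {
    id: "evt-1",
    type: "DeliveryCompleted",
    time: "2024-05-01T12:00:00Z",
    publisher: { name: "DeliveryService" },
  },
  "central-event-bus"
);
```

The resulting entry is what a publisher would pass to the AWS SDK's PutEvents call. Because the typed interface already guarantees that required fields are present at compile time, the runtime check only has to cover constraints the type system cannot express, such as the date-time format.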

For subscribers, the client library offers equally valuable functionality. It seamlessly handles the deserialization of incoming events, presenting them to the subscribing services in a readily usable format. This feature saves development time and reduces the risk of parsing errors, allowing teams to focus on business logic rather than data handling intricacies. By providing these comprehensive capabilities, our client library has become an indispensable tool in our event-driven network, promoting consistency, reliability, and efficiency across our microservices architecture.
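On the subscriber side, deserialization amounts to unwrapping the EventBridge envelope and checking that its `detail` field matches the expected schema before handing it to business logic. The sketch below assumes the envelope arrives as a JSON string (for example, as a queue message body). The `deserializeEventV1` and `isEventV1` helpers are hypothetical, and the `publisher` shape is left open because Publisher.json is not shown in this post.

```typescript
// Stand-in for a type binding generated from EventV1.json (hypothetical).
interface EventV1 {
  id: string;
  type: string;
  time: string;
  publisher: Record<string, unknown>; // Publisher.json is not shown; shape left open
}

// Type guard that checks the fields required by EventV1.json.
function isEventV1(value: unknown): value is EventV1 {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const record = value as Record<string, unknown>;
  const publisher = record["publisher"];
  return (
    typeof record["id"] === "string" &&
    typeof record["type"] === "string" &&
    typeof record["time"] === "string" &&
    typeof publisher === "object" &&
    publisher !== null
  );
}

// Parses an EventBridge envelope and returns its typed "detail" payload.
function deserializeEventV1(rawEnvelope: string): EventV1 {
  const envelope: unknown = JSON.parse(rawEnvelope);
  if (typeof envelope !== "object" || envelope === null) {
    throw new Error("envelope must be a JSON object");
  }
  const detail = (envelope as { detail?: unknown }).detail;
  if (!isEventV1(detail)) {
    throw new Error("envelope detail does not match EventV1");
  }
  return detail;
}

// Hypothetical envelope, trimmed to the fields used here.
const receivedEvent = deserializeEventV1(
  JSON.stringify({
    version: "0",
    "detail-type": "DeliveryCompleted",
    source: "DeliveryService",
    detail: {
      id: "evt-1",
      type: "DeliveryCompleted",
      time: "2024-05-01T12:00:00Z",
      publisher: { name: "DeliveryService" },
    },
  })
);
```

Subscriber code downstream of this call works with a typed `EventV1` value and never touches raw JSON, which is where the reduction in parsing errors comes from.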

Subscriber Constructs Library

We developed a subscriber constructs library using AWS Cloud Development Kit (CDK) to simplify and standardize the integration process with our central event bus. This library abstracts the setup and management of underlying infrastructure required for event consumption, enabling teams to focus on their core business logic rather than infrastructure configuration details.

The library automates the creation of essential components required for reliable event processing. It provisions a dedicated event bus within the subscriber’s account, establishes the necessary IAM roles and permissions for secure cross-account communication with the central event bus, and configures standardized monitoring and alerting for event processing. This automation not only reduces the potential for configuration errors but also facilitates consistent implementation of our architectural patterns across different teams.

/**
 * Subscriber implementation to provision the necessary AWS infrastructure.
 * `scope` and `id` are the standard CDK construct arguments, supplied by the enclosing stack.
 */
const subscription = new Subscription(scope, id, {
    name: "DeliveryService", // Name of your application
    application: {
        region: Region.US_EAST_1, // Region of your application
    },
});
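Part of what a construct like this has to set up is permission for the central bus to deliver events into the subscriber's account. One common way to grant that is a resource-based policy on the subscriber's event bus that allows `events:PutEvents` from the central account. The sketch below builds such a policy document as plain data. It is an assumption about one possible approach, not a description of what `Subscription` provisions; the `buildSubscriberBusPolicy` helper, the bus name, and the placeholder account IDs are hypothetical.

```typescript
// Minimal typing for a resource-based policy document.
interface PolicyStatement {
  Sid: string;
  Effect: "Allow";
  Principal: { AWS: string };
  Action: string;
  Resource: string;
}

interface EventBusPolicy {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

// Builds a policy that lets the central account put events on the subscriber's bus.
function buildSubscriberBusPolicy(params: {
  centralAccountId: string;
  subscriberAccountId: string;
  region: string;
  busName: string;
}): EventBusPolicy {
  const busArn =
    `arn:aws:events:${params.region}:${params.subscriberAccountId}:event-bus/${params.busName}`;
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "AllowCentralBusDelivery",
        Effect: "Allow",
        Principal: { AWS: `arn:aws:iam::${params.centralAccountId}:root` },
        Action: "events:PutEvents",
        Resource: busArn,
      },
    ],
  };
}

// Placeholder account IDs and an assumed bus name.
const deliveryBusPolicy = buildSubscriberBusPolicy({
  centralAccountId: "111111111111",
  subscriberAccountId: "222222222222",
  region: "us-east-1",
  busName: "delivery-service-bus",
});
```

Generating the policy from a few parameters, rather than writing it by hand in each team's stack, is the kind of repetition a shared constructs library removes.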

Conclusion

The Amazon Key team’s journey to modernize their architecture and build a resilient, event-driven solution exemplifies the benefits of using Amazon EventBridge and adopting a well-designed event-driven architecture. By addressing the challenges of service coupling, loose event schemas, and inconsistent event routing, the team was able to transform their system into a more reliable, scalable, and maintainable resource. The key architectural patterns and components they implemented have had a significant impact on their ability to deliver innovative solutions to their customers.

Reliability and Scale:

  • Built a decoupled event system processing 2,000 events/second with a 99.99% success rate
  • Achieved consistent 80 ms p90 latency from ingestion to target invocation across 14M subscriber calls
  • Avoided the need for new infrastructure for event exchange through standardized event routing
  • Enabled migration of existing complex interdependencies to event-driven architecture

Developer Experience:

  • Reduced service integration time for new use cases from five days to one day (an 80% reduction)
  • New event onboarding on the Custom Event Schema repository now takes four hours, down from 48 hours
  • Publisher/subscriber integration now takes eight hours, down from 40 hours
  • Standardized client library addressed 90% of common integration errors

Security and Governance:

  • Single control plane manages 100% of event bus infrastructure
  • Automated security compliance checks catch 100% of unauthorized data exchange patterns
  • Real-time monitoring dashboard tracks every event flow and schema change
  • Schema repository provides complete audit trail for system modifications

The solutions developed by the Amazon Key team provide a blueprint for other organizations looking to modernize their architectures and leverage the power of event-driven design patterns. By adopting similar architectural patterns and components, such as the schema repository and client libraries, other organizations can be empowered to achieve similar benefits.


About the authors

Ali Ufuk Yucel

Ali Ufuk Yucel is a senior engineering leader with over 15 years of experience in software development and technology leadership. Currently serving as Senior Software Development Manager for Amazon EventBridge, he leads strategic initiatives in cloud computing and distributed systems. Previously, he led Amazon Managed Workflows for Apache Airflow (MWAA). With a Computer Engineering background and an Executive MBA, Ali brings a unique blend of technical expertise and business acumen to his leadership roles.

Karan Jaswani

Karan Jaswani is a Senior Software Developer at Amazon.com, dedicated to building scalable solutions that enhance customer experiences. Karan has contributed to the Amazon Retail pipeline for 7+ years and is currently working on Amazon Key, developing secure access solutions for homes and buildings to enable services such as deliveries. He is passionate about API contracts, event-driven architecture, and leveraging these technologies to build innovative human-computer interactions.

Avinash Kolluri

Avinash Kolluri is a Senior Solutions Architect at Amazon Web Services (AWS), where he supports Amazon and its subsidiaries in building innovative cloud solutions. With over 15 years of experience in software development and technology, he brings deep expertise in architecting scalable and resilient systems. His extensive background spans cloud architecture, distributed systems, and enterprise solutions design. Currently, he specializes in helping Amazon optimize its AWS infrastructure and implement best-in-class cloud architectures. Avinash is passionate about leveraging technology to drive business transformation and continues to be at the forefront of cloud innovation.

Michael Gasch

Michael Gasch is a Senior Product Manager at AWS for Application Integration with more than 20 years of industry experience. In his role, he works across key AWS serverless services, including EventBridge, Step Functions, and Lambda. Outside his day job, he’s a maintainer of various open source projects, such as CloudEvents SDK for Go and the Kafka connector for EventBridge.