AWS Partner Network (APN) Blog

Building Resilient Distributed Systems with Temporal and AWS

By Sai Kotagiri, Senior Partner Solutions Architect, AWS
By Neil Dahlke, Staff Solutions Architect, Temporal

Large-scale digital businesses run on distributed systems that process thousands of transactions per minute during peak sales periods. These organizations need systems that maintain accurate records and complete every transaction, even when infrastructure misbehaves. This isn’t hypothetical: e-commerce platforms, streaming services, and ride-sharing apps require 99.99% uptime, process tens of thousands of transactions per minute, and need sub-second response times. These systems must keep running smoothly through network failures, server failures, database crashes, service timeouts, and other infrastructure disruptions. Durable Execution, an approach for managing distributed architectures, ensures that every transaction, from payment processing to order fulfillment, completes successfully even if parts of the system temporarily fail.

In this post, we examine distributed system challenges and demonstrate how Temporal’s solution on AWS enables Durable Execution across organizations. You will learn how Temporal powers resilient applications, ensuring consistency and automatic recovery during system failures. We wrap up with resources to help you get started with Temporal on AWS.

Distributed System Reliability: The Enterprise Challenge

Consider a hypothetical scenario in which a major e-commerce company experiences a technical incident during a busy holiday sale. The incident, illustrated in Figure 1, originates from a hardware issue in a database cluster and temporarily affects several stages of order processing, including fraud checks, payment charging, and order preparation and shipping. The situation worsens as network issues, API timeouts, and queue overflows emerge, leading to inventory tracking discrepancies. Customers experience degraded service in the form of long page load times and failed transactions.

Figure 1: Application architectures must incorporate fault tolerance and resilience, anticipating component failures at any level.

This outage forces businesses to address these questions:

  • How can e-commerce platforms achieve fault tolerance?
  • How can we build resilient applications that prevent failures from cascading across system boundaries?
  • How can applications recover when disruptions are resolved?
  • How can companies ensure application state persistence during failures?
  • How can we pinpoint exactly what went wrong and when?

Understanding Temporal: A Durable Execution Solution

Durable Execution is a capability that ensures transaction completion in distributed systems despite disruptions. By tracking each step’s state, the system can resume operations from the exact point of failure. For example, if a system crashes between payment approval and shipping label generation, it won’t duplicate the payment but instead continues with label creation. This approach prevents transaction duplicates, preserves order integrity during outages, and maintains customer trust through reliable processing.
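
To make the payment-then-label example concrete, here is a minimal sketch of step-level state tracking in plain Python. It is not Temporal's implementation or API: `StepJournal`, `run_once`, and the JSON file used as durable storage are hypothetical names chosen for illustration. The idea is only that a step whose result is already recorded is not executed again after a restart.

```python
import json
import os
import tempfile


class StepJournal:
    """Hypothetical journal: persists each completed step's result to a JSON file."""

    def __init__(self, path):
        self.path = path
        self.done = {}
        if os.path.exists(path):
            with open(path) as f:
                self.done = json.load(f)

    def run_once(self, name, fn):
        # If the step already completed, return its recorded result
        # instead of executing it again.
        if name in self.done:
            return self.done[name]
        result = fn()
        self.done[name] = result
        with open(self.path, "w") as f:
            json.dump(self.done, f)
        return result


charges = []  # stands in for an external payment system


def charge_payment():
    charges.append("order-1")
    return "payment-approved"


def create_label(fail):
    if fail:
        raise RuntimeError("simulated crash before label creation")
    return "label-created"


def process_order(journal, crash_before_label=False):
    journal.run_once("charge_payment", charge_payment)
    return journal.run_once(
        "create_label", lambda: create_label(crash_before_label)
    )


def demo():
    path = os.path.join(tempfile.mkdtemp(), "order-1.json")
    try:
        process_order(StepJournal(path), crash_before_label=True)
    except RuntimeError:
        pass  # the "process" died after payment, before the label
    # A fresh StepJournal models a restarted process reading durable state.
    result = process_order(StepJournal(path))
    return result, len(charges)
```

Calling `demo()` runs the order twice, once with a simulated crash and once after a "restart", and the payment is charged a single time. Note that this toy still has a gap: a crash after `fn()` returns but before the journal is written would re-run the step, which is one reason side-effecting steps are usually designed to be idempotent.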

Temporal is an open-source microservices orchestration platform that provides durable execution through SDKs, enabling applications to maintain state across failures and distributed processes. It ensures organized and reliable process execution, coordinating all steps and resuming precisely where it stopped after any disruption.

Development Process

In a scenario like order processing:

  1. Define a Workflow (e.g., Process Order) to orchestrate your application’s execution flow
  2. Define Activities (e.g., Check Fraud, Prepare Shipment) within the Workflow for specific business logic, with custom retry policies as needed
  3. Define and configure Workers to communicate with the Temporal Server through task queues
  4. Set up the Temporal Service to coordinate execution. Options include self-hosting on AWS or using Temporal Cloud
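
As an illustration of what a custom retry policy for an Activity might express, the sketch below implements exponential backoff in plain Python. The `RetryPolicy` dataclass, `run_with_retries`, and `make_flaky_fraud_check` are hypothetical stand-ins written for this post, not the Temporal SDK's own classes; the field names are assumptions chosen to read naturally.

```python
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical policy: exponential backoff with a cap on delay and attempts."""

    initial_interval: float = 1.0
    backoff_coefficient: float = 2.0
    maximum_interval: float = 10.0
    maximum_attempts: int = 5

    def delay_before_retry(self, failed_attempts):
        """Seconds to wait after `failed_attempts` consecutive failures (1-based)."""
        raw = self.initial_interval * self.backoff_coefficient ** (failed_attempts - 1)
        return min(raw, self.maximum_interval)


def run_with_retries(activity, policy, sleep=time.sleep):
    """Call `activity` until it succeeds or the attempts are exhausted."""
    for attempt in range(1, policy.maximum_attempts + 1):
        try:
            return activity()
        except Exception:
            if attempt == policy.maximum_attempts:
                raise  # out of attempts: surface the last error
            sleep(policy.delay_before_retry(attempt))


def make_flaky_fraud_check(failures):
    """Hypothetical activity that times out `failures` times, then passes."""
    state = {"calls": 0}

    def check_fraud():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TimeoutError("fraud service timed out")
        return "not-fraud"

    return check_fraud
```

Passing a custom `sleep` function, for example `delays.append` on a list, makes the backoff schedule observable without actually waiting, which is convenient when experimenting with different coefficients and caps.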

Figure 2: Order processing workflow and its activities (tasks) are implemented using Temporal SDK, while Temporal service manages its execution and state.

How It Works

  1. The Temporal Client initiates workflows defined in your application and communicates with the Temporal Server to manage their execution.
  2. The Temporal Server manages workflow states and orchestrates their execution.
  3. Workers execute both workflow logic and activities by pulling tasks from task queues managed by the Temporal Server.
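
The pull-based relationship between the server, task queues, and Workers can be sketched with Python's standard library. This is a toy in-memory model, not Temporal's protocol: `ToyServer`, `worker_loop`, and the `"orders"` queue name are hypothetical, and a real Worker polls the server over the network rather than sharing a process with it.

```python
import queue
import threading


class ToyServer:
    """Hypothetical stand-in for the server: owns task queues, records results."""

    def __init__(self):
        self.task_queues = {}
        self.results = {}
        self._lock = threading.Lock()

    def queue_for(self, name):
        with self._lock:
            return self.task_queues.setdefault(name, queue.Queue())

    def schedule(self, task_queue, task_id, activity_name, payload):
        self.queue_for(task_queue).put((task_id, activity_name, payload))

    def complete(self, task_id, result):
        with self._lock:
            self.results[task_id] = result


def worker_loop(server, task_queue, activities):
    """Pull tasks from one queue, run the named activity, report the result."""
    q = server.queue_for(task_queue)
    while True:
        task = q.get()
        if task is None:  # shutdown sentinel
            return
        task_id, activity_name, payload = task
        server.complete(task_id, activities[activity_name](payload))


def demo(worker_count=2):
    server = ToyServer()
    activities = {"check_fraud": lambda order: f"{order}:ok"}
    workers = [
        threading.Thread(target=worker_loop, args=(server, "orders", activities))
        for _ in range(worker_count)
    ]
    for w in workers:
        w.start()
    for i in range(5):
        server.schedule("orders", f"task-{i}", "check_fraud", f"order-{i}")
    for _ in workers:
        server.queue_for("orders").put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    return server.results
```

Because Workers pull work rather than having it pushed to them, adding capacity in this model is just a matter of starting more `worker_loop` threads on the same queue; the scheduling side does not need to know how many Workers exist.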

Figure 3: Temporal architecture showing workflow execution through Workers authenticated via mTLS, connecting to either self-hosted Temporal servers or Temporal Cloud services.

Workflow Orchestration: The Temporal Advantage

Temporal orchestrates workflows through an event-driven architecture that ensures reliable execution and recovery. At its core, the system maintains a detailed event history that records every workflow action, task execution, and outcome as immutable events, creating an audit trail. The system records workflow states at each transaction step, storing execution data, variables, and decision outcomes in a persistent database.

When failures occur, Temporal follows a systematic recovery process. First, it detects the failure and assesses the current state. Then, it reviews the event history and resumes execution from the last recorded state. Automatic retries re-run failed Activities according to their retry policies, and because a retried Activity may execute more than once, Activities are best written to be idempotent. Steps already recorded as completed in the event history are not re-executed on recovery, and state stays consistent throughout the process. This orchestration ensures that workflows remain durable and eventually complete, regardless of infrastructure failures, network issues, or system crashes.

By combining event history tracking with state persistence, Temporal creates a resilient system capable of reliably resuming operations from any point of failure. These features make it ideal for mission-critical business processes.
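
The combination of an append-only event history and state reconstruction can be illustrated with a small event-sourcing sketch. The `Event` and `EventHistory` types, the event kinds `"ActivityCompleted"` and `"ActivityFailed"`, and the `STEPS` list are hypothetical simplifications for this post; Temporal's real event types and storage are considerably richer.

```python
from dataclasses import dataclass

# Hypothetical, fixed order of steps in the order-processing workflow.
STEPS = ["check_fraud", "charge_payment", "prepare_shipment"]


@dataclass(frozen=True)  # frozen: an event cannot be modified once recorded
class Event:
    event_id: int
    kind: str  # "ActivityCompleted" or "ActivityFailed" in this sketch
    activity: str
    detail: str = ""


class EventHistory:
    """Append-only log; readers receive an immutable snapshot."""

    def __init__(self):
        self._events = []

    def append(self, kind, activity, detail=""):
        event = Event(len(self._events) + 1, kind, activity, detail)
        self._events.append(event)
        return event

    def events(self):
        return tuple(self._events)


def rebuild_state(events):
    """Fold the history into the results of all completed activities."""
    completed = {}
    for e in events:
        if e.kind == "ActivityCompleted":
            completed[e.activity] = e.detail
    return completed


def next_step(events):
    """First step with no completion event, or None if the workflow is done."""
    completed = rebuild_state(events)
    for step in STEPS:
        if step not in completed:
            return step
    return None


def first_failure(events):
    """Audit-trail query: the earliest failure event, if any."""
    for e in events:
        if e.kind == "ActivityFailed":
            return e
    return None
```

With a history containing two completions followed by a failed `prepare_shipment`, `next_step` points at `prepare_shipment` as the place to resume, and `first_failure` answers the "what went wrong and when" question from the same log.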

Figure 4: Temporal persists the entire state of a workflow, including local variables and execution position. This allows workflows to be resumed from any point after a failure or system restart.

Maximizing Efficiency with Temporal’s Containerized Workers

Modern order processing systems must adapt to fluctuating demand, particularly during peak seasons like holiday sales. This is where containerization proves invaluable in managing Temporal Workers efficiently. By leveraging AWS container services like Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS), businesses can automatically scale their order processing capabilities to match real-time demands.

For instance, when order volume spikes, payment processing Workers and shipping label generation Workers can scale up to handle more concurrent transactions, while inventory check Workers distribute across multiple containers to maintain quick response times. This containerized approach means that each component of the order processing workflow – from payment validation to shipping – can scale up or scale down resources independently, optimizing both performance and cost. The system delivers consistent processing speeds, scaling from baseline operations to peak holiday volumes without performance degradation.
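
Independent scaling per Worker pool comes down to a sizing decision for each task queue. The sketch below shows one simple way to reason about it, assuming you can observe a queue's backlog and know roughly how many tasks one Worker completes per minute. The function name, its parameters, and the numbers in the usage note are illustrative assumptions, not values from Temporal, Amazon EKS, or Amazon ECS.

```python
import math


def desired_workers(backlog, tasks_per_worker_per_min, drain_target_min,
                    min_workers=1, max_workers=50):
    """Workers needed to drain `backlog` tasks within `drain_target_min` minutes,
    clamped to the pool's configured minimum and maximum."""
    capacity_per_worker = tasks_per_worker_per_min * drain_target_min
    needed = math.ceil(backlog / capacity_per_worker)
    return max(min_workers, min(max_workers, needed))


def plan_pools(backlogs, throughput, drain_target_min=1):
    """Size each Worker pool independently from its own backlog and throughput."""
    return {
        pool: desired_workers(backlogs[pool], throughput[pool], drain_target_min)
        for pool in backlogs
    }
```

For example, `plan_pools({"payments": 900, "shipping": 0}, {"payments": 200, "shipping": 100})` sizes the payments pool at 5 Workers while leaving shipping at its minimum of 1, which is the kind of per-component decision an autoscaler would make on your behalf.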

Figure 5: Temporal workers, running as pods inside an EKS cluster, can be scaled easily and can offer quicker responses. For detailed implementation, refer to Quick launch Temporal workers on EKS.

AWS and Temporal: Powering Scalable, Resilient Distributed Applications

AWS and Temporal deliver a powerful combination for building resilient distributed applications at scale. The integration leverages key AWS services to enhance security and operational excellence. AWS Certificate Manager handles mTLS certificate management, while AWS PrivateLink enables secure, private connectivity between a customer’s Amazon Virtual Private Cloud (Amazon VPC) and Temporal Cloud. Real-time audit logging is streamlined through Amazon Kinesis integration, providing comprehensive operational visibility.

Customers can deploy their Temporal-based applications using AWS’s comprehensive infrastructure services, running workers on Amazon Elastic Compute Cloud (Amazon EC2) or in containerized environments with Amazon ECS or Amazon EKS for optimal scalability and resource utilization.

This partnership has delivered transformative results for enterprises. As one engineering manager at a major financial services firm noted: “It currently takes a team of 3 experienced developers roughly 20 weeks to build a new feature in our legacy systems and only 2 weeks using Temporal. That’s akin to a team of 8 developers (using Temporal) being as productive as a team of 80.”

Temporal Cloud’s durable execution capabilities, combined with AWS’s global infrastructure and managed services, create a robust foundation for mission-critical applications. This integration enables organizations to build fault-tolerant distributed systems with reduced complexity while leveraging AWS’s scale, security, and reliability – empowering teams to focus on business logic rather than infrastructure challenges.

Conclusion

AWS provides the infrastructure, and Temporal adds durable execution capabilities. Together, they enable organizations to build resilient distributed systems that maintain consistency and reliability despite failures. From e-commerce and payment solutions to logistics applications, this partnership offers developers the tools to create fault-tolerant, scalable systems with improved developer experience and reduced complexity.

To get started, check out the Temporal platform on AWS Marketplace and deepen your understanding through tutorials and demos.

Temporal Technologies – AWS Partner Spotlight

Temporal, an AWS Advanced Technology Partner, offers a robust workflow orchestration solution. The platform delivers increased value to customers through its durable execution framework, which ensures consistency and automatically recovers from system failures.

Contact Temporal | Partner Overview | AWS Marketplace | Temporal Documentation