Our internal customers saw processing delays decrease from 1 second to 100 milliseconds; those latency reductions ultimately translate into Amazon customers getting their orders faster. That’s really a testament to the performance of Amazon DynamoDB. It shows how it can serve as the foundation of a highly efficient, mission-critical system.
Mike Thomas, Software Development Manager, Amazon Herd
  • About Amazon Herd

    Herd is a workflow-orchestration engine that powers more than 1,300 workflows for key functions of Amazon.com, including order processing, fulfillment-center operations, and parts of the Amazon Alexa backend. 

  • AWS Services Used

    Amazon DynamoDB

  • Benefits of AWS

    • Facilitates a 90% drop in workflow-processing delays
    • Reduces the time to scale for large events from 60 weeks to less than 6 weeks
    • Reduces operational burden and frees up the team to innovate

When customers across the globe place orders on Amazon.com, those orders are processed through many different backend systems. One of those key systems is Herd, a workflow-orchestration engine developed by the Amazon eCommerce Foundation Team. Herd controls the business logic for processing all Amazon.com customer orders worldwide, orchestrating more than 1,300 workflows for everything from order processing to fulfillment-center operations to coordinating parts of the Amazon Alexa backend. A mission-critical system used by more than 300 Amazon engineering teams, Herd executes more than 4 billion workflows on peak days.

If such a mission-critical system wasn’t working, the customer impact would be significant. “If the system goes down, Amazon order processing could stop and no customers would get what they bought,” says Mike Thomas, a software development manager on the Herd team. “For example, fulfillment-center operations would have delays, and delivery of Kindle eBooks would be impaired. It could bring things to a standstill.”

Those fears were real for the Herd team, which struggled to keep the system running smoothly on more than 150 Oracle databases. “It was hard to work around the availability and latency problems we had with database-software upgrades,” says Thomas. “And when a primary database had a hardware failure or was rebooted for a security upgrade, customers noticed a performance drop.”

The Herd team had another major challenge—scaling the system to keep pace with rapid growth. “Our workflow traffic was doubling on average year-over-year, and scaling the Oracle databases to support that was becoming a nightmare,” Thomas says. “We had to manually manage each database, and scaling up involved way too much effort as the number of parts in the system grew. To scale, we had to duplicate a lot of infrastructure in the system to talk to each database, and we had to manage a separate web service for each database.”

Scaling to support workflow spikes during events such as Amazon Prime Day and Cyber Monday required extensive manual effort. To set up each new Oracle database, software developers had to run through a 21-step checklist, on a system that at its peak consisted of 150 Oracle databases running on more than 300 hosts.

As the system scaled, team members spent an increasing percentage of their time managing system operations. This reduced their focus on building new features needed by Amazon business units, such as detailed visibility into workflow types. With the existing architecture, building new features required adding indexes in Oracle, which meant each database would have to perform more writes-to-disk per workflow. “We had a backlog of new features, and it was bad for our morale and for our internal customers to not be able to work on them,” says Thomas.

To solve its availability and scalability concerns, the Herd team moved the system from Oracle to Amazon DynamoDB (DynamoDB), a fully managed NoSQL database service. “We needed to move away from the business of managing individual database shards ourselves, and Amazon DynamoDB abstracts all that for us,” says Thomas. “Also, DynamoDB delivers powerful performance and ease of operation, and scales quickly to where we need it to be.”

To meet its specific needs, the team took advantage of the DynamoDB Global Secondary Index feature (released in December 2013), which enables the creation of indexes and lookups using attributes other than an item’s primary key. “The Global Secondary Index feature was a critical factor in our decision to move to Amazon DynamoDB,” says Thomas. “We tried to design something in DynamoDB before the feature was available and it didn’t work.”
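As a rough illustration of what a Global Secondary Index makes possible, the sketch below builds request parameters for a table keyed by workflow ID, plus an index that allows lookups by workflow type and due time instead. The table name, attribute names, and index name are all hypothetical; the case study does not describe Herd's actual schema. Only the dictionary field names follow the DynamoDB CreateTable and Query request format.

```python
# Hypothetical schema for illustration only; not Herd's real table design.
INDEX_NAME = "by_type_and_due_time"  # assumed index name


def example_table_definition(table_name="WorkflowTimers"):
    """Build CreateTable parameters with one Global Secondary Index."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "workflow_id", "AttributeType": "S"},
            {"AttributeName": "workflow_type", "AttributeType": "S"},
            {"AttributeName": "due_at", "AttributeType": "N"},
        ],
        # Primary key: items are normally fetched by workflow_id.
        "KeySchema": [{"AttributeName": "workflow_id", "KeyType": "HASH"}],
        # The index re-keys the same items by attributes other than the
        # primary key, so they can be looked up by type and due time.
        "GlobalSecondaryIndexes": [
            {
                "IndexName": INDEX_NAME,
                "KeySchema": [
                    {"AttributeName": "workflow_type", "KeyType": "HASH"},
                    {"AttributeName": "due_at", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "KEYS_ONLY"},
            }
        ],
        "BillingMode": "PAY_PER_REQUEST",
    }


def due_workflows_query(table_name, workflow_type, now_epoch):
    """Build Query parameters that read from the index, not the base table."""
    return {
        "TableName": table_name,
        "IndexName": INDEX_NAME,
        "KeyConditionExpression": "workflow_type = :t AND due_at <= :now",
        "ExpressionAttributeValues": {
            ":t": {"S": workflow_type},
            # The low-level API encodes numbers as strings.
            ":now": {"N": str(now_epoch)},
        },
    }
```

The functions only build dictionaries, so the sketch runs without AWS credentials. With boto3 installed, the results could be passed as keyword arguments to a DynamoDB client's `create_table` and `query` calls.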

The team built two new custom services on top of DynamoDB: TimerService, a scheduled priority queue that determines when workflows execute; and ViewService, which provides workflow-backlog monitoring and drill-downs used by Herd customers. Both services partition workflow-related records across a distributed set of hosts, holding the records in in-memory data structures to facilitate cost-effective, efficient querying. The services rely on DynamoDB as the underlying durable storage system for the data. Because DynamoDB stores data across multiple Availability Zones, failures of the underlying hardware are transparent to Herd.
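The general pattern described above (an in-memory scheduled priority queue, partitioned across hosts, backed by a durable store) can be sketched roughly as follows. This is not TimerService's implementation, which the case study does not detail. The class and function names are hypothetical, a plain dictionary stands in for the DynamoDB table, and hash-based partitioning is an assumed strategy.

```python
import hashlib
import heapq


def partition_for(workflow_id, num_hosts):
    """Pick the host that owns a record (assumed: stable hash of its ID)."""
    digest = hashlib.sha256(workflow_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_hosts


class TimerQueue:
    """In-memory min-heap of (due_at, workflow_id), rebuilt from a durable store."""

    def __init__(self, durable_store):
        # durable_store is a dict standing in for a DynamoDB table.
        self._store = durable_store
        self._heap = [(due_at, wid) for wid, due_at in durable_store.items()]
        heapq.heapify(self._heap)

    def schedule(self, workflow_id, due_at):
        # Write to durable storage first, then update the in-memory view.
        self._store[workflow_id] = due_at
        heapq.heappush(self._heap, (due_at, workflow_id))

    def pop_due(self, now):
        """Return IDs of workflows due at or before `now`, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due_at, wid = heapq.heappop(self._heap)
            # Skip stale heap entries left behind by a reschedule.
            if self._store.get(wid) == due_at:
                del self._store[wid]
                due.append(wid)
        return due
```

Because the heap is rebuilt from the store in the constructor, a replacement host can pick up a failed host's partition from durable storage. In this sketch, that is the role the durable layer plays.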

After migrating a production client onto the new system in September 2015, the Herd team moved all remaining clients to the system the following year. During the 2016 holiday season, all Herd workflows ran on DynamoDB for the first time.

The results were impressive, with a 90 percent drop in client workflow-processing delays. “Our internal customers saw processing delays decrease from 1 second to 100 milliseconds; those latency reductions ultimately translate into Amazon customers getting their orders faster,” says Thomas. “That’s really a testament to the performance of Amazon DynamoDB. It shows how it can serve as the foundation of a highly efficient, mission-critical system.”

Additionally, the Herd team cut the time needed to scale the system for large events by 90 percent, from 60 weeks to less than 6 weeks. “If we were still running on the old architecture, we would have had to use 1,000 hosts running Oracle software to support this year’s Amazon Prime Day,” says Thomas. “That’s a huge number of hosts to manage and scale.” The Herd system’s availability is also no longer subject to Oracle failovers, scheduled upgrades, or latency degradation during nightly index rebuilds. “Herd is a mission-critical system for Amazon, and we are extremely confident in Amazon DynamoDB as the technology on which to run it,” says Thomas.
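The two 90 percent figures follow directly from the before-and-after numbers quoted in this case study. The short check below uses a hypothetical helper, `percent_reduction`, and treats "less than 6 weeks" as exactly 6 weeks, so the scaling figure is a lower bound.

```python
def percent_reduction(before, after):
    """Percentage drop from `before` to `after` (same units for both)."""
    return 100 * (before - after) / before


# Processing delay: 1 second (1,000 ms) down to 100 ms.
delay_drop = percent_reduction(1000, 100)

# Scaling effort: 60 weeks down to (at most) 6 weeks.
scaling_drop = percent_reduction(60, 6)

print(f"Delay reduction: {delay_drop:.0f}%")
print(f"Scaling-time reduction: at least {scaling_drop:.0f}%")
```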

The Herd team now spends much less time maintaining the solution, which translates into more time spent creating new features that add value for Amazon engineering teams. For example, the team is working on a way to offer detailed visibility into active workflows, including the ability for customers to query and monitor workflow backlogs based on user-defined custom attributes. “We would never have found the time to create features like that on the old Oracle architecture,” says Thomas. “Because of Amazon DynamoDB, we can concentrate on delivering new features for customers instead of focusing on scaling and maintaining the databases. This is really transforming Herd as a product, and it will ultimately improve things for Amazon.com customers as well.”
