How Deliveroo migrated their Dispatcher service to Amazon DynamoDB

Deliveroo operates a hyperlocal, three-sided marketplace, connecting local consumers, restaurants and grocers, and riders to fulfil purchases in under 30 minutes. By offering fast and reliable delivery that consumers can track online, Deliveroo has grown rapidly. It operates in several markets worldwide, working with thousands of restaurants, grocers, and riders, and serving millions of consumers.

Deliveroo’s Dispatcher service is responsible for offering orders to riders. It continuously looks for newly placed orders and available riders in the areas Deliveroo operates in and uses optimization algorithms to make the best possible dispatch decisions. Until recently, Deliveroo stored the Dispatcher service data in Amazon Relational Database Service (Amazon RDS) for PostgreSQL and benefited from using a managed relational database service. However, due to the fast growth of demand, the relational database model posed a few operational and scalability challenges:

Some tables contained over a billion rows, making schema changes difficult and potentially slowing down queries.
Adding new indexes to tables with billions of rows took a long time and, at times, slowed down transactions while index creation was in progress.
Updating old data in large tables required careful planning because the number of deleted rows produced by updates required tuning the automatic vacuuming settings in PostgreSQL. Vacuuming is a maintenance task that reclaims the storage occupied by deleted rows.

This guest blog post by Deliveroo shows how they migrated Deliveroo’s Dispatcher database from Amazon RDS for PostgreSQL to Amazon DynamoDB. It covers the considerations and challenges of migrating a mission-critical service from a relational database to a NoSQL key-value data store, and shows the approach Deliveroo followed to maintain the business logic while changing the data model and processing architecture.

Challenges and solutions

We considered sharding the existing RDS for PostgreSQL database to reduce the number of records. This solution was appealing because we were familiar with the relational model and PostgreSQL, and our application logic would only have required small changes. However, we decided against this approach for the following reasons:

Because new major database engine versions sometimes introduce breaking changes, Amazon RDS doesn’t perform major engine version upgrades automatically. Furthermore, these upgrades can cause a brief outage while Amazon RDS upgrades each instance. Although it’s possible to minimize the downtime using AWS Database Migration Service (AWS DMS), such upgrades require a considerable level of planning.
Managing the scale of the Dispatcher service required constant tuning of various PostgreSQL parameters. Although we had become familiar with these settings, we didn’t feel that tuning the database parameters was the best use of our engineers’ time.

Therefore, we decided to migrate our data store to DynamoDB, which is a fully managed, serverless, NoSQL key-value data store. First, DynamoDB would provide us with consistent single-digit millisecond performance for reads and writes regardless of table size. Second, the schemaless nature of DynamoDB would allow us to introduce new attributes quickly without having to modify the existing data. Third, using a fully managed serverless data store would reduce the operational overhead because we no longer had to worry about planning for and performing database engine upgrades.

However, the migration to DynamoDB would introduce a number of challenges. We would have to redesign the data model to support the read access patterns. We also had to change the application code to work with the DynamoDB API. We also faced another challenge—we had to make these changes quickly before the PostgreSQL version we were running (version 9.6) reached its end of life. The short timescale wouldn’t give us enough time to fully remodel the data and make the required changes to the application, so in the interim, we decided to upgrade to the latest PostgreSQL version available at the time (version 12) to buy some time.

Upgrade Amazon RDS for PostgreSQL

To avoid any downtime during the version upgrade, we analyzed the options available as described in Upgrade Amazon RDS and Amazon Aurora PostgreSQL version 9.6, and chose the AWS DMS route, which also gave us the opportunity to perform data type conversions.

After detailed planning and several test runs, we upgraded the production database version 9.6 to version 12 using AWS DMS. The extensive planning and investment for the upgrade paid off. The upgrade process didn’t cause any downtime, and the move to the latest version available bought us the time we needed to migrate to DynamoDB.

Query patterns

One of the key concepts for NoSQL design is understanding the business problems and the query patterns before starting the design. We went through the Dispatcher service’s code base and compiled a list of the ways we queried the existing relational database. In close collaboration with AWS Solutions Architects and after a few iterations, we designed a model for DynamoDB that satisfied multiple query patterns.

Query pattern 1: Retrieve individual orders and riders

Using as few tables as possible improves scalability and reduces application overhead and overall backup costs, so we use a single table for multiple item types such as orders and riders. DynamoDB tables are schemaless—other than the partition key, you don’t need to define any extra attributes when creating a table. Because our table is going to contain items with different types, we use a generic attribute name, PK, for the partition key.

When using the partition key as the primary key, the partition key value has to be unique. To make the identifiers across item types unique, we add a prefix to the identifier values. We use o as the prefix for orders and r as the prefix for riders, and add the hash sign (#) as an arbitrary delimiter, so the partition key value for order 1 becomes o#1, for example.

The following table summarizes the partition key values and data attributes. The additional data attributes aren’t the focus of this post, so the tables don’t include them.

PK	Data Attributes
o#1	More attributes of this order
o#2	More attributes of this order
r#1	More attributes of this rider
o#3	More attributes of this order
r#2	More attributes of this rider

The first query pattern is retrieving an order by its unique identifier. To look up an order, we add the o# prefix to the order identifier and perform a single GetItem operation against the partition key. For riders, we perform a similar operation using the r# prefix.

Although we can infer the item type using the prefix in the partition key, adding an ItemType column (see the following table) adds clarity and is a recommended best practice.

PK	ItemType	Data Attributes
o#1	order	More attributes of this order
o#2	order	More attributes of this order
r#1	rider	More attributes of this rider
o#3	order	More attributes of this order
r#2	rider	More attributes of this rider

Query pattern 2: Retrieve pickups and associated deliveries

A pickup from a restaurant can contain one or more deliveries because a rider could pick up multiple orders for different consumers from the same restaurant. We model a pickup as an item with a partition key value starting with p#. To model the one-to-many relationship between pickups and deliveries, we store each delivery as a separate item using the pickup identifier as the partition key value and the delivery identifier, starting with d# as the sort key value, as shown in the following table.

PK	SK	ItemType	Data Attributes
p#1	–	pickup	More attributes of this pickup
p#1	d#1	delivery	More attributes of this delivery
p#1	d#2	delivery	More attributes of this delivery

A group of items that share the same partition key value is called an item collection. Item collections are the primary mechanism for modeling one-to-many relationships in DynamoDB. The use of item collections can speed up queries by organizing related data so that it can be efficiently retrieved together.

A sort key is optional in a DynamoDB table, but if defined, all items must have a sort key value. For top-level entities such as the pickup, orders, and riders, we repeat the partition key value as the sort key value, as shown in the following table.

PK	SK	ItemType	Data Attributes
o#1	o#1	order	More attributes of this order
o#2	o#2	order	More attributes of this order
r#1	r#1	rider	More attributes of this rider
o#3	o#3	order	More attributes of this order
r#2	r#2	rider	More attributes of this rider
p#1	p#1	pickup	More attributes of this pickup
p#1	d#1	delivery	More attributes of this delivery
p#1	d#2	delivery	More attributes of this delivery

Query pattern 3: Retrieve all orders and riders by geographic area

Our operations team divides each geographic area we operate in into sub-areas, so we can use the following dispatch logic to find active orders and available riders in an area:

For each geographic area:
- For each sub-area in this area:
  - Retrieve all active orders
  - Retrieve all available riders
Offer orders to riders

This query pattern requires finding orders and riders by sub-area identifier. To support the new query pattern, we use a global secondary index (GSI), which is one of the principles of DynamoDB NoSQL design.

For the GSI’s partition key value, we use a two-part value comprising the sub-area identifier (for example, sa#1) and item type (for example, order). For order items, setting the GSI1’s sort key value to the time the order will be ready for pickup allows us to perform a range query that returns only orders that are ready for dispatch. For rider items, we don’t need further criteria beyond the geographic sub-area, so we repeat the rider identifier as the sort key value for the GSI1 because it’s mandatory.

The following table shows the order and rider items after adding the partition and sort keys for GSI1.

PK	SK	ItemType	GSI1_PK	GSI1_SK	Data Attributes
o#1	o#1	order	sa#1.order	2022-05-27T13:05:38.245	More attributes of this order
o#2	o#2	order	sa#1.order	2022-05-27T13:05:38.245	More attributes of this order
r#1	r#1	rider	sa#1.rider	r#1	More attributes of this rider
o#3	o#3	order	sa#2.order	2022-05-27T12:57:03.477	More attributes of this order
r#2	r#2	rider	sa#1.rider	r#2	More attributes of this rider

Note that unlike the base table, the GSIs aren’t required to have a unique primary key. As the preceding table shows, the first two order items refer to two orders that belong to the sub-area 1 and were captured and stored at the same time, so they have the same value for their GSI1 partition key and sort key.

Merging the queries

Because the application needs to load all relevant orders and riders in each sub-area, there is no need for separate queries, so we can merge them into one. Removing the item type from GSI1’s partition key allows us to load all orders and riders in a particular sub-area with a single query. The ability to perform a pre-join by storing the related entities under the same partition key is one of the benefits of using DynamoDB and NoSQL databases in general, because it removes the need for performing costly joins when querying the data.

PK	SK	ItemType	GSI1_PK	GSI1_SK	Data Attributes
o#1	o#1	order	sa#1	2022-05-27T13:05:38.245	More attributes of this order
o#2	o#2	order	sa#1	2022-05-27T13:05:38.245	More attributes of this order
r#1	r#1	rider	sa#1	r#1	More attributes of this rider
o#3	o#3	order	sa#2	2022-05-27T12:57:03.477	More attributes of this order
r#2	r#2	rider	sa#1	r#2	More attributes of this rider

The pseudocode for the application logic after this improvement is as follows:

For each geographic area:
- For each sub-area in this area:
  - Retrieve all available riders and active orders
Offer orders to riders

Switching to write-sharding

While testing the service with the new data model in the pre-production environment, we observed inconsistent performance in retrieving orders and riders for different geographic areas. We were using the sub-area identifier as the partition key for the GSI because it was the natural fit for a calculated write-sharding suffix. However, the number of sub-areas each geographic area was divided into was initially defined based on operational concerns, so the number and size of sub-areas varied across geographic areas, resulting in unbalanced shards.

After discussing options with the AWS Solutions Architects, we switched to a random shard suffix and made the number of shards configurable for each geographic area. This way, the number of shards became independent of the number of sub-areas in each geographic area, allowing us to optimize the number of shards while allowing the business teams to define the sub-areas based on the operational concerns. Making the number of shards in each area configurable allows us to use more shards in densely populated cities such as London or Paris, and fewer shards in sparsely populated rural areas.

PK	SK	ItemType	GSI1_PK	GSI1_SK	Data Attributes
o#1	o#1	order	a#1.1	2022-05-27T13:05:38.245	More attributes of this order
o#2	o#2	order	a#1.2	2022-05-27T13:05:38.245	More attributes of this order
r#1	r#1	rider	a#1.1	r#1	More attributes of this rider
o#3	o#3	order	a#1.2	2022-05-27T12:57:03.477	More attributes of this order
r#2	r#2	rider	a#1.1	r#2	More attributes of this rider

Finally, we can improve performance of the lookups in each area by retrieving the riders and orders in shards in parallel. Because we perform the lookups against different shards, the parallel run of queries doesn’t impact performance of reads from the DynamoDB table.

The pseudocode for the application logic after this last improvement is as follows:

For each geographic area:
- For each shard in the area (and in parallel):
  - Retrieve all available riders and active orders in the area shard
Offer orders to riders

Keeping the GSI small

The dispatch run uses the GSI to perform its logic, so keeping GSI1 small is important. We take advantage of sparse indexes and remove the GSI key as soon as a given item is no longer needed for the dispatch logic. For example, as soon as an order is delivered, we no longer need to retrieve that order in future dispatch runs, so we delete the values for GSI_1_PK and GSI_1_SK from that order.

Implementing constraints

In the original relational data model, we enforced uniqueness using unique indexes in PostgreSQL. DynamoDB doesn’t have the concept of unique indexes, but we can take advantage of the fact that a given primary key is unique in the base table.

One of the constraints we need to enforce is ensuring a rider doesn’t have more than one pickup offered at any point in time. Whenever we create a new pickup, we perform two operations in a single transaction: a PUT operation to insert the pickup, and another PUT operation to insert a fake item that makes sure there are no existing pickups requested for that rider. If another pickup is already registered against the rider, the attribute_not_exists condition doesn’t match, the transaction fails, and the new pickup isn’t created. If we didn’t add the attribute_not_exists condition, the PUT operation would replace the existing record. We also record a Time to Live (TTL) attribute, which we use to delete the fake item against the rider as a last resort if the item has not been deleted.

The following code snippet shows how we use a number of DynamoDB features, including transactional writes, the attribute_not_exists check, and the enforcement of uniqueness of primary keys in the base table to implement the constraint:

transact_write_items:[
 {
   put: {
     table_name: TABLE_NAME,
     item: {
       pk: "p#{pickup.id}",
       sk: "p#{pickup.id}",
       # additional attributes ...
     }
   }
 },
 {
     put: {
       table_name: TABLE_NAME,
       item: {
         pk: "r#{rider.id}.requested",
         sk: "r#{rider.id}.requested",
         ttl: request_uniqueness_ttl
       },
       condition_expression: "attribute_not_exists(pk) or #ttl < :current_time",
       expression_attribute_names: {
           "#ttl" => 'ttl'
       },
       expression_attribute_values: {
           ":current_time" => now
       }
     }
 },
 # ... additional items ...
]

We delete this item as soon as the rider accepts or declines the pickup:

transact_write_items: [
 {
   update: {
     table_name: TABLE_NAME,
     key: pickup.primary_key,
     condition_expression: "version = :current_version",
     update_expression: "SET #accepted_at = :now, #status = :accepted",
     expression_attribute_names:{
         "#accepted_at" => 'accepted_at',
         "#status" => 'status'
     },
     expression_attribute_values: {
         ":now" => now,
         ":accepted" => 'accepted',
     }
   }
 },
 {
   delete: {
     table_name: TABLE_NAME,
     key: {
       pk: "r#{rider.id}.requested",
       sk: "r#{rider.id}.requested"
     },
   }
 },
 # ... additional items ...
]

Publishing to Kafka

Deliveroo uses Amazon Managed Streaming for Apache Kafka (Amazon MSK) for asynchronous real-time communication between services. When the Dispatcher service creates assignments, it needs to publish a message to Kafka for other services to consume. In our old architecture, we published the message in an after-commit hook. To mimic this after-commit behavior, we decided to set up a DynamoDB stream and use an AWS Lambda trigger to invoke a Lambda function that publishes messages to Kafka. This approach has added benefits such as automatic retries and dead-letter queues. Additionally, we’re taking advantage of Lambda event filtering to filter out irrelevant messages before they even reach the Lambda function. The event filtering reduces the number of invocations of the Lambda function and reduces costs. It also allows us to migrate filtering logic from the application code out to the infrastructure definition.

Each message we publish to Kafka includes a timestamp that represents the time the message is published, to allow the consumers to sort the messages in the correct order. To generate this stamp, we depended on another PostgreSQL feature: the NOW timestamp. Using this function to generate the timestamp on the PostgreSQL writer node allowed us to use the time from a single source and avoid a clock skew. To achieve the ordering logic in the new architecture, we opted to include a version attribute in the DynamoDB items that we publish to Kafka. Incrementing the version on every update ensures that both the Dispatcher service and consuming applications act on the same order of events. By using the version attribute for conditional writes, we can also protect against race conditions when updating the same item.

Conclusion

This migration required a considerable level of planning and incremental improvements to the data model until we arrived at the final design, but the investment has paid off. We are now supporting all our query patterns with a single DynamoDB table and two GSIs, and we benefit from using a fully managed service that provides fast and predictable performance with seamless scalability.

If you are planning to migrate from a relational database to DynamoDB, allow enough time to review the query patterns and design an effective model that makes best use of DynamoDB. Instead of trying to replicate a relational model in the new design, work backward from query requirements and use NoSQL design patterns and DynamoDB features, similar to what you saw in this post, to implement a serverless data store that supports tables of virtually any size and can scale to support petabytes of data and tens of millions of read and write requests per second.

Additional reading

About the Authors

Florian Thomas is a Staff Software Engineer at Deliveroo.

Mehran Nikoo is a Senior Solutions Architect at AWS.

Mike Mackay was a Senior NoSQL Specialist Solutions Architect at AWS.

AWS Database Blog