AWS Database Blog
Data modelling for an internet-scale online transactional system using Amazon DynamoDB
Relational database management system (RDBMS) technology was the preferred choice for data storage for decades. In recent years, the rise of internet-scale applications and exponential data growth have challenged the way industries have designed their database strategies. Organizations are seeing clear benefits in scalability and performance by migrating mission-critical online transactional systems built on RDBMS to NoSQL technologies.
In this post, we discuss the benefits of designing a highly transactional online system on Amazon DynamoDB, a fully managed, serverless, key-value NoSQL database. Using a sample IMDb application, we examine single table design, a common data modelling approach for relational datasets on DynamoDB. This is the first part of a two-part blog series. In a subsequent post, we’ll cover the migration benefits and strategy of moving a sample IMDb application built on MySQL to DynamoDB.
Prerequisites
A basic understanding of DynamoDB is required because this post introduces data modelling concepts for highly transactional internet-scale applications built on an RDBMS. This post also includes a hands-on lab for building a sample IMDb application using an AWS CloudFormation template. An AWS account with the necessary AWS Identity and Access Management (IAM) privileges is required. If you use your AWS account, you’ll incur a cost of approximately $10 per day for the resources provisioned by the CloudFormation template. To avoid future charges, make sure you delete the CloudFormation stack when done.
DynamoDB as a data platform for internet-scale applications
RDBMS technologies have their limitations when it comes to supporting large-scale online transactional systems. Your application will be bound by the strict schema, and any iteration can require significant planning. An application evolves over time, as does its schema. Building an agnostic data strategy helps minimize schema disruptions caused by its evolving business requirements. Because DynamoDB is a key-value NoSQL database, schema changes have minimal impact on the DynamoDB data model, provided the access pattern remains unchanged. DynamoDB is highly scalable and supports consistent single-digit millisecond latency at any scale. By using enterprise features, such as ACID transactions, DynamoDB Accelerator (DAX), and global tables, organizations can benefit from improved performance and availability of internet-scale applications.
DynamoDB table design for internet-scale applications
A relational database is a collection of heterogeneous business entities that are stored across multiple tables and related to each other using foreign keys. During access, SQL queries can join datasets across multiple tables and respond back to the client application. With growing datasets, it becomes challenging to get consistent performance from relational databases. DynamoDB is designed to provide consistent query performance at any scale, provided that target tables are designed as closely to the access pattern as possible. The recommendation is that data that is queried together should stay together. In this design approach, heterogeneous business entities are stored as a collection of items in a single DynamoDB table. This reduces query joins and network roundtrips, resulting in faster and more consistent query performance.
Each item stored in a DynamoDB table has a mandatory partition key and an optional sort key, together called the primary key. In a single table design approach, you need to carefully choose the partition key and sort key attributes because they can significantly improve the selectivity (query performance) of the items retrieved by a query. Choose a high-cardinality partition key to ensure effective distribution of data across partitions. When choosing the sort key, you can use techniques like composite sort keys to define a hierarchical relationship in your data so that you can query any level of the hierarchy. This will allow the application to solve access patterns using Queries and avoid table scans. Table scans are resource intensive, can cause slower performance, and are costly due to the higher consumption of the read capacity unit (RCU) of the table. In this post, we go over a sample IMDb dataset to explain single table design and demonstrate how you can use a composite sort key to solve multiple access patterns.
DynamoDB single table design for the IMDb application
Now that we’ve covered single table design, we’ll use a real-world example of the IMDb application and its sample dataset to create the target data model.
Learn about the IMDb dataset
We use IMDb data, normalized across various tables, to demonstrate examples of effective single table design in DynamoDB. IMDb is well-known for its online database of movies, films, TV shows, and other media. In this post, we use six IMDb datasets from the year 2000 that are related to movies published in the United States.
Deploy the IMDb source environment using CloudFormation
We created a CloudFormation template to build the IMDb source environment. The template installs a MySQL database on Amazon Elastic Compute Cloud (Amazon EC2), downloads a sample IMDb dataset from a public Amazon Simple Storage Service (Amazon S3) bucket, creates an imdbdata database, and loads the sample dataset into tables inside the imdbdata database.
To deploy the IMDb source environment
- Download the CloudFormation template. (The CloudFormation User Data script is used for development and testing purposes to bootstrap the instance in support of the scenario).
- Open the AWS Management Console CloudFormation console.
- Select Choose file under Upload a template file. Select the downloaded template and then choose Next.
- Under the Stack name, enter
dynamodb-designlab
. Leave the remaining parameters at the default values and choose Next. Choose Next until you reach the stack settings review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names to create the IAM resources and then choose Create Stack. - Wait until the CloudFormation stack status is
CREATE_COMPLETE
. It will take about 10–15 minutes to create the source MySQL environment. - Go to the EC2 console and choose Instances from the navigation pane. Verify that the status check column for the instance is 2/2 checks passed. Select the instance named
MySQL-Instance
and choose Connect. - Select the Session Manager tab and choose Connect.
- Use the
sudo
command to elevate your privilege to root user.sudo su
- Navigate to
/var/lib/mysql-files
to access the file directory. The sample IMDb dataset was downloaded into themysql-files
directory by the CloudFormation launch script. Information about the cast, crew, and movies is normalized into six files. The relationships are discussed in detail in Explore the IMDb data model.cd /var/lib/mysql-files/ ls -lrt Output (6 files) title_basics.tsv title_crew.tsv title_principals.tsv title_ratings.tsv name_basics.tsv title_akas.tsv
- Go to the CloudFormation console and choose the stack you created earlier. Go to the Parameters tab and copy the username and password values in
DbMasterUsername
andDbMasterPassword
. - Go back to Session Manager in the EC2 console and log in to MySQL using the username and password you just copied.
mysql -u DbMasterUsername -pDbMasterPassword
- You’re now connected to a self-managed MySQL source database on Amazon EC2. In the next section, you’ll connect to the database and tables hosting the IMDb datasets. They’re loaded into the
imdbdata
database across six tables from the files downloaded by the CloudFormation template.use imdbdata; show tables; Output (6 tables) name_basics title_akas title_basics title_crew title_principals title_ratings
Explore the IMDb data model
The MySQL database has data in a normalized relational structure across six tables. The schema of the tables adheres to the schema and relationships of the files loaded to the table. Figure 9 that follows describes the entity-relationship model across movies cast and crews.
The tables in the imdbdata
database and shown in Figure 9 are:
title_basics
stores movies published in the United States after 2000.tconst
is an alphanumeric key uniquely assigned to each movie.title_akas
stores published regions, languages, and respective movie titles. It has a one-to-many relationship withtitle_basics
.title_ratings
stores movie ratings and vote counts. For this exercise, you can assume the information has high frequency updates after movie release. It has a one-to-one relationship withtitle_basics
.title_principals
stores cast and crew information. It has a one-to-many relationship withtitle_basics
.title_crew
stores writer and director information. It has a one-to-one relationship withtitle_basics
.name_basics
stores cast and crew details. Every entry has a uniquenconst
value assigned. It has one-to-many relationships withtitle_crew
andtitle_principals
.
To support the application layer access patterns, SQL queries are used to generate materialized views of the normalized data stored across the imdbdata
tables. This step requires a significant amount of processing and will impact application performance. In the next step, we discuss the details of designing the data model on DynamoDB for the IMDb sample application.
DynamoDB access patterns for the IMDb application
DynamoDB is a better choice to support low latency response for internet-scale applications. Remember that data that is queried together should stay together. This approach helps to support single millisecond query latency at any scale and provide consistent performance for internet-scale applications like IMDb. For more details, refer to the best practices guidelines for designing solutions on DynamoDB. We identified six access patterns for the IMDb application to illustrate DynamoDB table design.
- Find all cast and crew of a movie.
- Find details of a movie (title, publication year, runtime, genres, directors, writers, and other information).
- Find IMDb rating details of a movie.
- Find all published regions, languages, and titles of a movie.
- Find original title of a movie published across multiple languages and regions.
- Find all the movies by an actor between the years 2002 and 2010.
DynamoDB single table design for the IMDb application
We now go over the single table design and composite key design for the IMDb application. Earlier, we discussed that each movie is associated with a unique identifier: tconst
. The first five access patterns of the application are based on movies. Using tconst
as the partition key helps to distribute the data across multiple partitions and support the access patterns. In this example, we use a composite sort key technique to store and access related information. The sort key attribute is prefixed with constant characters known to the applications. The prefix groups together one or more related items as a single collection. The naming conventions for the prefix are custom defined, so feel free to use any as per your application standards. The following is a detailed list of prefixes used for target table design:
- DETL prefix item collection:
title_basics
contains movie metadata and title_principals has all cast and crew information. Information in both the tables is static once a movie is released. The access patterns require the movies and their cast and crew information to be fetched together. The movie details and associated cast and crew information are denormalized as single item in the table. This will help to access the information with single query.· - REGN prefix item collection:
title_akas
contains all published regions, languages, and titles of a movie. The information changes infrequently and has a one-to-many relationship totitle_basics
. These records are classified using REGN as the prefix and modeled as one-to-one in the DynamoDB table. - RTGN prefix item collection: Each movie receives ratings based on votes by IMDb users. Every time a new votes is received, the information in the
title_ratings
is updated with the latest rating and vote count. This is dynamic and frequently changing information. In order to minimize read and write capacity used during frequent update operations, the rating information is kept outside the DETL prefixed records.
By using a composite sort key, you can create a collection of heterogenous entities and store them in a single table. Figure 10 the follows shows the tconst
as mpkey
and composite sort key inside mskey
attributes in a single DynamoDB table.
Each movie has its own cast and crew, which includes a producer, director, actors, and other crew members. A unique identifier, nconst
, is used to identify each cast and crew member. The last access pattern in the list is the selection of all movies by an actor within a given year range. This access pattern can’t be resolved using query because nconst
is not part of the partition key of the base table. To resolve this access pattern, you can follow either of the approaches mentioned below:
- Use the adjacency list technique to handle many-to-many relationships between actors and movie. Both entities have unique identifiers and don’t overlap in values. This method requires additional processing logic during the ingestion to deal with partition key overloading.
- Create a global secondary index (GSI) on the
nconst
attribute of the base table and run the query on the GSI.
Because the access pattern only requires a movie list by an actor within a specific year, you only need to create a GSI on the base table. Figure 11 that follows represents a GSI model with nconst
as the partition key and year as the sort key to allow range queries.
Figure 12 represents the detailed demonstration of all access patterns, key conditions, and filter expressions.
This video demonstrates how the base table with a GSI data model was used to solve all six access patterns on DynamoDB.
Cleanup
To avoid ongoing charges for resources you created to complete this post, delete the CloudFormation stacks you created earlier.
Conclusion
In this post, we showed you a way to create a DynamoDB data model for an internet-scale transactional system. You can use this data model to store and query heterogeneous business entities using a single table design. We went over composite key methods for dealing with one-to-many relationships and discussed the role of adjacency lists and GSIs to support queries on non-partition key attributes of the base table. We hope this post was helpful in your journey with DynamoDB. In the next post, we’ll show you how to design and build a migration strategy to move the IMDb dataset from MySQL to DynamoDB. We will also include PartiQL query mapping for each of the access patterns discussed above.
About the Authors
Abhishek Srivastav is a Sr. Solutions Architect. He is passionate about enabling customers to accelerate their cloud adoption. He is a NoSQL enthusiast and helps customers find answers to complex problems by drawing on his in-depth understanding of these technologies. He has held lead positions for NoSQL Center of Excellence roles at various enterprise customers prior to joining AWS.
Chad Tindel is a DynamoDB Specialist Solutions Architect based out of New York City. He works with large enterprises to evaluate, design, and deploy DynamoDB-based solutions. Prior to joining Amazon, he held similar roles at Red Hat, Cloudera, MongoDB, and Elastic.