AWS Database Blog

Data modelling for an internet-scale online transactional system using Amazon DynamoDB

Relational database management system (RDBMS) technology was the preferred choice for data storage for decades. In recent years, the rise of internet-scale applications and exponential data growth have challenged the way industries have designed their database strategies. Organizations are seeing clear benefits in scalability and performance by migrating mission-critical online transactional systems built on RDBMS to NoSQL technologies.

In this post, we discuss the benefits of designing a highly transactional online system on Amazon DynamoDB, a fully managed, serverless, key-value NoSQL database. Using a sample IMDb application, we examine single table design, a common data modelling approach for relational datasets on DynamoDB. This is the first part of a two-part blog series. In a subsequent post, we’ll cover the migration benefits and strategy of moving a sample IMDb application built on MySQL to DynamoDB.

Prerequisites

A basic understanding of DynamoDB is required because this post introduces data modelling concepts for highly transactional internet-scale applications built on an RDBMS. This post also includes a hands-on lab for building a sample IMDb application using an AWS CloudFormation template. An AWS account with the necessary AWS Identity and Access Management (IAM) privileges is required. If you use your AWS account, you’ll incur a cost of approximately $10 per day for the resources provisioned by the CloudFormation template. To avoid future charges, make sure you delete the CloudFormation stack when done.

DynamoDB as a data platform for internet-scale applications

RDBMS technologies have their limitations when it comes to supporting large-scale online transactional systems. Your application will be bound by the strict schema, and any iteration can require significant planning. An application evolves over time, as does its schema. Building an agnostic data strategy helps minimize schema disruptions caused by its evolving business requirements. Because DynamoDB is a key-value NoSQL database, schema changes have minimal impact on the DynamoDB data model, provided the access pattern remains unchanged. DynamoDB is highly scalable and supports consistent single-digit millisecond latency at any scale. By using enterprise features, such as ACID transactions, DynamoDB Accelerator (DAX), and global tables, organizations can benefit from improved performance and availability of internet-scale applications.

DynamoDB table design for internet-scale applications

A relational database is a collection of heterogeneous business entities that are stored across multiple tables and related to each other using foreign keys. During access, SQL queries can join datasets across multiple tables and respond back to the client application. With growing datasets, it becomes challenging to get consistent performance from relational databases. DynamoDB is designed to provide consistent query performance at any scale, provided that target tables are designed as closely to the access pattern as possible. The recommendation is that data that is queried together should stay together. In this design approach, heterogeneous business entities are stored as a collection of items in a single DynamoDB table. This reduces query joins and network roundtrips, resulting in faster and more consistent query performance.

Each item stored in a DynamoDB table has a mandatory partition key and an optional sort key, together called the primary key. In a single table design approach, you need to carefully choose the partition key and sort key attributes because they can significantly improve the selectivity (query performance) of the items retrieved by a query. Choose a high-cardinality partition key to ensure effective distribution of data across partitions. When choosing the sort key, you can use techniques like composite sort keys to define a hierarchical relationship in your data so that you can query any level of the hierarchy. This will allow the application to solve access patterns using Queries and avoid table scans. Table scans are resource intensive, can cause slower performance, and are costly due to the higher consumption of the read capacity unit (RCU) of the table. In this post, we go over a sample IMDb dataset to explain single table design and demonstrate how you can use a composite sort key to solve multiple access patterns.

DynamoDB single table design for the IMDb application

Now that we’ve covered single table design, we’ll use a real-world example of the IMDb application and its sample dataset to create the target data model.

Learn about the IMDb dataset

We use IMDb data, normalized across various tables, to demonstrate examples of effective single table design in DynamoDB. IMDb is well-known for its online database of movies, films, TV shows, and other media. In this post, we use six IMDb datasets from the year 2000 that are related to movies published in the United States.

Deploy the IMDb source environment using CloudFormation

We created a CloudFormation template to build the IMDb source environment. The template installs a MySQL database on Amazon Elastic Compute Cloud (Amazon EC2), downloads a sample IMDb dataset from a public Amazon Simple Storage Service (Amazon S3) bucket, creates an imdbdata database, and loads the sample dataset into tables inside the imdbdata database.

To deploy the IMDb source environment

  1. Download the CloudFormation template. (The CloudFormation User Data script is used for development and testing purposes to bootstrap the instance in support of the scenario).
  2. Open the AWS Management Console CloudFormation console.
  3. Select Choose file under Upload a template file. Select the downloaded template and then choose Next.

    Upload the downloaded CloudFormation template

    Figure 1: Upload the downloaded CloudFormation template

  4. Under the Stack name, enter dynamodb-designlab. Leave the remaining parameters at the default values and choose Next. Choose Next until you reach the stack settings review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names to create the IAM resources and then choose Create Stack.

    Figure 2: Enter a stack name and input parameter values

    Figure 2: Enter a stack name and input parameter values

  5. Wait until the CloudFormation stack status is CREATE_COMPLETE. It will take about 10–15 minutes to create the source MySQL environment.
  6. Go to the EC2 console and choose Instances from the navigation pane. Verify that the status check column for the instance is 2/2 checks passed. Select the instance named MySQL-Instance and choose Connect.

    Figure 3: Connect to the source EC2 MySQL instance

    Figure 3: Connect to the source EC2 MySQL instance

  7. Select the Session Manager tab and choose Connect.

    Figure 4: Connect to the source EC2 MySQL instance using Session Manager

    Figure 4: Connect to the source EC2 MySQL instance using Session Manager

  8. Use the sudo command to elevate your privilege to root user.
    sudo su 
  9. Navigate to /var/lib/mysql-files to access the file directory. The sample IMDb dataset was downloaded into the mysql-files directory by the CloudFormation launch script. Information about the cast, crew, and movies is normalized into six files. The relationships are discussed in detail in Explore the IMDb data model.
    cd /var/lib/mysql-files/
    ls -lrt 
    
    Output (6 files)
    title_basics.tsv
    title_crew.tsv
    title_principals.tsv
    title_ratings.tsv
    name_basics.tsv
    title_akas.tsv

    Figure 5: Files copied from the IMDb dataset

    Figure 5: Files copied from the IMDb dataset

  10. Go to the CloudFormation console and choose the stack you created earlier. Go to the Parameters tab and copy the username and password values in DbMasterUsername and DbMasterPassword.

    Figure 6: Copy the username and password

    Figure 6: Copy the username and password

  11. Go back to Session Manager in the EC2 console and log in to MySQL using the username and password you just copied.
    mysql -u DbMasterUsername -pDbMasterPassword

    Figure 7: Log in to MySQL

    Figure 7: Log in to MySQL

  12. You’re now connected to a self-managed MySQL source database on Amazon EC2. In the next section, you’ll connect to the database and tables hosting the IMDb datasets. They’re loaded into the imdbdata database across six tables from the files downloaded by the CloudFormation template.
    use imdbdata; 
    show tables;
    
    Output (6 tables)
    name_basics        
    title_akas	
    title_basics    
    title_crew         
    title_principals   
    title_ratings	
    

    Figure 8: Connect to database and show the tables

    Figure 8: Connect to database and show the tables

Explore the IMDb data model

The MySQL database has data in a normalized relational structure across six tables. The schema of the tables adheres to the schema and relationships of the files loaded to the table. Figure 9 that follows describes the entity-relationship model across movies cast and crews.

Figure 9: imdbdata entity-relationship model

Figure 9: imdbdata entity-relationship model

The tables in the imdbdata database and shown in Figure 9 are:

  • title_basics stores movies published in the United States after 2000. tconst is an alphanumeric key uniquely assigned to each movie.
  • title_akas stores published regions, languages, and respective movie titles. It has a one-to-many relationship with title_basics.title_ratings stores movie ratings and vote counts. For this exercise, you can assume the information has high frequency updates after movie release. It has a one-to-one relationship with title_basics.
  • title_principals stores cast and crew information. It has a one-to-many relationship with title_basics.
  • title_crew stores writer and director information. It has a one-to-one relationship with title_basics.
  • name_basics stores cast and crew details. Every entry has a unique nconst value assigned. It has one-to-many relationships with title_crew and title_principals.

To support the application layer access patterns, SQL queries are used to generate materialized views of the normalized data stored across the imdbdata tables. This step requires a significant amount of processing and will impact application performance. In the next step, we discuss the details of designing the data model on DynamoDB for the IMDb sample application.

DynamoDB access patterns for the IMDb application

DynamoDB is a better choice to support low latency response for internet-scale applications. Remember that data that is queried together should stay together. This approach helps to support single millisecond query latency at any scale and provide consistent performance for internet-scale applications like IMDb. For more details, refer to the best practices guidelines for designing solutions on DynamoDB. We identified six access patterns for the IMDb application to illustrate DynamoDB table design.

  1. Find all cast and crew of a movie.
  2. Find details of a movie (title, publication year, runtime, genres, directors, writers, and other information).
  3. Find IMDb rating details of a movie.
  4. Find all published regions, languages, and titles of a movie.
  5. Find original title of a movie published across multiple languages and regions.
  6. Find all the movies by an actor between the years 2002 and 2010.

DynamoDB single table design for the IMDb application

We now go over the single table design and composite key design for the IMDb application. Earlier, we discussed that each movie is associated with a unique identifier: tconst. The first five access patterns of the application are based on movies. Using tconst as the partition key helps to distribute the data across multiple partitions and support the access patterns. In this example, we use a composite sort key technique to store and access related information. The sort key attribute is prefixed with constant characters known to the applications. The prefix groups together one or more related items as a single collection. The naming conventions for the prefix are custom defined, so feel free to use any as per your application standards. The following is a detailed list of prefixes used for target table design:

  • DETL prefix item collection: title_basics contains movie metadata and title_principals has all cast and crew information. Information in both the tables is static once a movie is released. The access patterns require the movies and their cast and crew information to be fetched together. The movie details and associated cast and crew information are denormalized as single item in the table. This will help to access the information with single query.·
  • REGN prefix item collection: title_akas contains all published regions, languages, and titles of a movie. The information changes infrequently and has a one-to-many relationship to title_basics. These records are classified using REGN as the prefix and modeled as one-to-one in the DynamoDB table.
  • RTGN prefix item collection: Each movie receives ratings based on votes by IMDb users. Every time a new votes is received, the information in the title_ratings is updated with the latest rating and vote count. This is dynamic and frequently changing information. In order to minimize read and write capacity used during frequent update operations, the rating information is kept outside the DETL prefixed records.

By using a composite sort key, you can create a collection of heterogenous entities and store them in a single table. Figure 10 the follows shows the tconst as mpkey and composite sort key inside mskey attributes in a single DynamoDB table.

Figure 10: DynamoDB base table design for IMDb application

Figure 10: DynamoDB base table design for IMDb application

Each movie has its own cast and crew, which includes a producer, director, actors, and other crew members. A unique identifier, nconst, is used to identify each cast and crew member. The last access pattern in the list is the selection of all movies by an actor within a given year range. This access pattern can’t be resolved using query because nconst is not part of the partition key of the base table. To resolve this access pattern, you can follow either of the approaches mentioned below:

  1. Use the adjacency list technique to handle many-to-many relationships between actors and movie. Both entities have unique identifiers and don’t overlap in values. This method requires additional processing logic during the ingestion to deal with partition key overloading.
  2. Create a global secondary index (GSI) on the nconst attribute of the base table and run the query on the GSI.

Because the access pattern only requires a movie list by an actor within a specific year, you only need to create a GSI on the base table. Figure 11 that follows represents a GSI model with nconst as the partition key and year as the sort key to allow range queries.

Figure 11: DynamoDB GSI design for IMDb application

Figure 11: DynamoDB GSI design for IMDb application

Figure 12 represents the detailed demonstration of all access patterns, key conditions, and filter expressions.

Figure 12: Access patterns, Key conditions and Filter condition

Figure 12: Access patterns, Key conditions and Filter condition

This video demonstrates how the base table with a GSI data model was used to solve all six access patterns on DynamoDB.

Cleanup

To avoid ongoing charges for resources you created to complete this post, delete the CloudFormation stacks you created earlier.

Conclusion

In this post, we showed you a way to create a DynamoDB data model for an internet-scale transactional system. You can use this data model to store and query heterogeneous business entities using a single table design. We went over composite key methods for dealing with one-to-many relationships and discussed the role of adjacency lists and GSIs to support queries on non-partition key attributes of the base table. We hope this post was helpful in your journey with DynamoDB. In the next post, we’ll show you how to design and build a migration strategy to move the IMDb dataset from MySQL to DynamoDB. We will also include PartiQL query mapping for each of the access patterns discussed above.


About the Authors

Abhishek Srivastav is a Sr. Solutions Architect. He is passionate about enabling customers to accelerate their cloud adoption. He is a NoSQL enthusiast and helps customers find answers to complex problems by drawing on his in-depth understanding of these technologies. He has held lead positions for NoSQL Center of Excellence roles at various enterprise customers prior to joining AWS.

Chad Tindel is a DynamoDB Specialist Solutions Architect based out of New York City. He works with large enterprises to evaluate, design, and deploy DynamoDB-based solutions. Prior to joining Amazon, he held similar roles at Red Hat, Cloudera, MongoDB, and Elastic.