Refactoring to Amazon DynamoDB
Do you want to move from a relational database to NoSQL? In the following post, I walk you through reading, transforming, and writing SQL Server data from an Amazon EC2 instance to Amazon DynamoDB. I use AWS Glue to transform the source data model of multiple tables into two target tables in DynamoDB.
There are other options for data model transformation like AWS DMS or an AWS Marketplace tool in place of AWS Glue. For this one-time migration, I chose AWS Glue and Scala code for the transformation of several tables into one.
This demo shows you how to refactor a commercial relational database. The chosen database generates and sells sporting event tickets. I explain best practices for refactoring the database to DynamoDB and how to configure DynamoDB tables and transfer data using AWS Glue. I also show how to set up VPC endpoints and a secure IAM role for AWS Glue, how to crawl the source database and run an Apache Spark ETL job.
To get started right away, you can access the code in the Refactoring to DynamoDB repository.
The problem and proposed solution
When you spin up the AWS CloudFormation stack in your environment, the stack creates an EC2 instance with a resident ticketing database.
The database, DMS_SAMPLE, contains a set of configuration data for sporting venues, leagues, teams, and players. The data in these employee-managed tables don’t change often. The dedicated configuration application writes the data to these tables. You can also see tables for ticket sales in the SPORTING_EVENT_TICKET, TICKET_PURCHASE_HIST database, which catalogs, sells, and transfers sporting event tickets.
Because employees and consumers continually buy and sell new event tickets, this data changes frequently. Peak traffic volumes flow to the site after ticket sales launch. The database must scale to meet this fluctuating demand. The following diagram shows the database schema.
The event ticketing system examined in this demo includes two distinct use cases: ticket sales and retrievals. Each use case demonstrates distinct access patterns.
Ticket sales activity fluctuates sharply. The system’s steady-state predictably consists of a few hundred requests per hour. However, during peak events, requests can spike to hundreds of thousands of requests a minute. The service also sells tickets to resellers who then resell them to other individuals. The automated reseller systems can also cause infrequent, unpredictable activity spikes.
Users typically retrieve tickets based on the following criteria and access patterns:
- What tickets do I own?
- Get all tickets I own
- Get all tickets available for my venue and date
- Purchase tickets
- Transfer tickets to another user
- What tickets have I transferred to someone else?
- Get tickets I transferred to another user
- Get tickets transferred to me from another user
The application writing the configuration data performs well at baseline. However, peak ticket sales traffic for new events strains the relational database system. As users add more and more events, they have to increase database resources to handle the load.
On the other hand, the subsequent slowdown in ticket sales results in excess computing capacity. Because traffic is challenging to predict, users need a more flexible way to add and subtract capacity than the current system.
In general, a relational database system scales poorly for the following reasons:
- The storage of data in multiple tables, requiring multiple queries to read/write to disk
- The performance costs of an ACID-compliant relational transaction system at scale
- The expensive joins required to reassemble views of query results
So, the problem appears to be, How can users scale more efficiently to meet fluctuating demand?
The proposed solution, DynamoDB, is fully managed NoSQL database service with seamless scalability. DynamoDB provides fast scale-out for peaks and fast scale-in for valleys. It provides consistent performance and automatically scales with increased traffic to the application. DynamoDB automatic scaling dynamically adjusts provisioned throughput capacity in response to actual traffic patterns.
DynamoDB scales well because of the following reasons:
- Schema flexibility lets DynamoDB store complex hierarchical data within a single item.
- Composite key design lets DynamoDB store related items close together on the same table.
- Automatic sharding permits horizontal scale, so your data is handled concurrently by many servers with more resource transparency.
You can manage high-performance reads and writes with DynamoDB. After you design the right schema for your workload, you can expect consistent performance across widely varying loads.
Transferring data with DynamoDB
In this section, I show you how to:
- Configure DynamoDB table structure.
- Set up the files for your environment.
- Create resources using AWS CloudFormation.
- Generate some data in your source database.
- Extract, transform, and load data with AWS Glue.
To follow along and run the code provided with this demo, you should have:
- An AWS account
- A working knowledge of relational databases
- The following AWS resources:
To follow along in your environment, download the files from the repository, and run the steps in order. Depending on how much data you are moving, you may be able to use Free Tier resources.
In general, make sure to do the following before moving a relational database to DynamoDB:
- Understand the use case or business functionality that you plan to support.
- Understand the typical data access patterns.
- Agree on the goal, problem, or aspect of the system that you are trying to optimize.
For additional information, see Best Practices for DynamoDB.
Configure DynamoDB table structure
With DynamoDB, you can combine the tickets that generate with the tickets that sell into a single table and use status/owner to track the current state of the ticket. The following screenshot demonstrates a combined ticket information table.
For the configuration data, use an overloaded table structure—that is, a structure that combines multiple objects into a single table. This table structure provides data to track ticket generation.
The following screenshot shows how a few of the tables model this data. I use the partition key to group together similar data and a sort key for unique values in the configuration data. You can use the descriptive name field to find items by name (GSI) and any other fields that have keywords in them as sparse GSI values for targeted inquiries.
Set up the files for your environment
Next, set up your environment for refactoring.
- If you haven’t already, download the files from the GitHub repository to your local machine. Then, edit the
parameters.jsonfile, updating the following values with information from your AWS account environment.
- Replace the KeyName parameter with your EC2 key pair name. Remember that you use your EC2 key to log into the EC2 instance launched by AWS CloudFormation to store your SQL Server database.
- Replace the IPAddressForRDP parameter (default of 0.0.0/0) with your IP address. You can use https://www.whatsmyip.org to retrieve your IP address.
- Replace the S3ScriptLocation parameter value of <bucket name> with your S3 bucket name.
- Replace the S3LocationForJarFiles parameter value of <bucket name> with your S3 bucket name.
- Upload the following files to an S3 bucket in your environment under the key <S3 bucketname>/scripts:
- To obtain the dependencies required to follow along, clone the repository awslabs/emr-dynamodb-connector using the following command:
- After cloning completes, use the following command to build your repository:
mvn clean install
- Using the same S3 bucket from the previous step, upload the following JAR files under the key <S3 bucketname>/jars:
- After cloning completes, use the following command to build your repository:
You are now ready to create the resources using AWS CloudFormation.
Create resources using AWS CloudFormation
Create a stack from the AWS CloudFormation template using the following command from CLI. Replace stack name and S3 Location with the appropriate information from your environment.
Generate some data in your source database
To generate data for your use case, you must log in to the EC2 instance and run a query. Here are the steps to obtain the IP address of your EC2 instance and connect to it:
- Obtain the Instance ID using the following command:
- Decrypt and retrieve the password data using the following command:
- Obtain the EC2 instance public IP address from the Output section using the following command:
You can use Remote Desktop to sign into the EC2 instance using your IP address with User Name as administrator and password data from the previous commands.
- Launch SQL Server Management Studio from the instance taskbar with the following credentials:
- Run the following query:
This query should generate ticketing data and allow you to proceed with the next steps.
- Confirm ticket activity using the following queries:
Extract, transform, and load data with AWS Glue
AWS Glue is a fully managed ETL service. You can create and run an ETL job with a few steps in the AWS Management Console or with code using the AWS CLI or AWS SDK.
In this section, I show you how to use AWS Glue to connect to your DynamoDB target and transfer data privately through S3:
- Launch an AWS Glue crawler to crawl the dms_sample database and get metadata. The process populates the AWS Glue Catalog with information about your source database and source tables.
The crawler uses JDBC connection to connect to SQL Server, and in a few minutes, it populates the AWS Glue database with tables of metadata.
- Use the following get tables command to get details about your tables:
- Run the AWS Glue job GlueTicketSaleJob for ticket sale. The job reads from the SQL Server source database and extract data from SPORTING_EVENT_TICKET and TICKET_PURCHASE_HIST. The job then uses Scala code in ticketsale.scala to transform the two tables into the single DynamoDB table, ticket_sales, as outlined at the beginning of this post:
You should receive a JobRunId value as output from this command:
You can also start the second AWS Glue job, GlueConfigJob, for the supporting configuration data in parallel. The job reads from the SQL Server source database and extracts data from all the other configuration tables in the SQL Server database. The job uses the Scala code config.scala to transform the config tables into the overloaded single DynamoDB table, config:
You receive a new JobRunId as output from this command:
You can use the AWS CLI to monitor the status of your job:
The output lists the job runs for this job. It may take some time for the jobs to complete, but when they finish, the JobRunState says COMPLETE:
After the job completes, you can review your results in DynamoDB. The following screenshot represents sample config data.
The following screenshot shows sample ticket sales data.
AWS CloudFormation overview
Review the items that you created for your AWS Glue job and consider how all your components fit together in AWS CloudFormation.
The following diagram shows your EC2 instance inside your public subnet, protected by a security group. The security group has a route to the CIDR range specified in the AWS CloudFormation template and a self-referencing rule for AWS Glue. You can access DynamoDB and S3 through VPC endpoints. Data flows from the SQL Server instance on your EC2 instance through AWS Glue and S3 to your DynamoDB tables.
VPC endpoints for Amazon S3 and DynamoDB
The next portion of the AWS CloudFormation template provides a VPC endpoint to connect your VPC to Amazon S3 privately. The VPC endpoint for S3 allows AWS Glue to use private IP addresses to access S3 with no exposure to the public internet. Any requests to S3 within this Region route to a private S3 endpoint within the AWS network:
The following portion of the AWS CloudFormation template creates a VPC endpoint for DynamoDB. It’s required for AWS Glue to write to DynamoDB.
Setting an ingress rule
For AWS Glue to communicate between its components, a self-referencing Ingress rule is required to restrict the source to the same security group in the VPC and prevent traffic on other networks. To set the Ingress rule, run the following code:
Provisioning an IAM role
The AWS Glue job writes logs to Amazon CloudWatch and saves temporary logs to an S3 bucket. AWS Glue also must access the supporting jar files in your S3 bucket. AWS Glue crawler needs permissions to access your EC2 instance to crawl through Source data in SQL Server in EC2.
The role also has AWS Glue full access, allowing AWS Glue components to communicate with each other (for example, the AWS Glue crawler needs access to the AWS Glue connection). To provide IAM role access, run the following code:
Connecting AWS Glue with SQL
Your AWS Glue connection creates a new JDBC connection to connect to the SQL Server database on EC2. For more details, see AWS Glue connections.
The next portion of the AWS CloudFormation template creates an AWS Glue database where the crawler saves the crawled metadata:
Crawling the source database
AWS Glue creates an AWS Glue crawler to crawl through the dms_sample database. The crawled tables are prefixed with Crawler_ss and saved in the AWS Glue database created in the previous section:
Running an Apache Spark ETL job
The AWS Glue job takes the script in given S3 location and runs on demand. Use the command name glueetl for your Apache Spark ETL job:
You’ve now seen how to move a ticketing database from your SQL Server RDBMS to DynamoDB using AWS Glue. To do the same for your application, see Best Practices for DynamoDB. Download code examples from the Refactoring-to-DynamoDB GitHub repo and modify them for your application and environment.
As always, AWS welcomes feedback. Please submit comments or questions below.
About the Authors
Wendy Neu has worked as a Data Architect with Amazon since January 2015. Prior to joining Amazon, she worked as a consultant in Cincinnati, OH helping customers integrate and manage their data from different unrelated data sources.
Soumya Vanga is an Associate Cloud Developer with AWS Professional Services in New York, NY helping customers design solutions, workloads, and to adopt Cloud Native services.