Reduce database cost and improve availability when you migrate to the AWS Cloud
Let’s say you have an application that uses database tables to store log and clickstream data. You might store your data in a relational database to ease development and management tasks. When you launch your application, the database is manageable at first, but it then grows by hundreds of gigabytes per week. Data storage and retrieval alone consume 20 percent of the IOPS and CPU in your relational database instance. In addition, applications are storing XML, JSON, and binary documents in database tables along with transactional data. Historical data continues to grow every month. Your traditional on-premises database licensing and infrastructure costs are increasing, and scaling the database has become a big challenge. What can you do?
In this blog post, we explain strategies to reduce your database costs and improve availability as you migrate to the AWS Cloud.
Storing all of this rapidly growing data in a single relational database can:
- Cause performance issues. Bigger tables need more time to query the data. High read/write traffic on a specific set of tables slows down other queries in the database.
- Increase your total cost of ownership (TCO). To serve more read/write traffic, you have to scale the database vertically by adding more compute and storage capacity. Even when this traffic increase is seasonal, you have to make large upfront investments to serve it. And because you cannot scale the database server back down, your TCO increases.
Types of data stores
Instead of using only relational database storage, your application can store its data in purpose-built data stores when migrating to AWS. The data store that your application uses depends on its access pattern. It also depends on the scale at which you anticipate its data to grow and how readily you require access to the data. These data store types, their purposes, and examples of each type are shown in the following table.
| Data store type | Purpose | Examples |
| --- | --- | --- |
| NoSQL | Unstructured data such as ad serving, gaming data, IoT sensor data, and user profiles | Amazon DynamoDB, MongoDB, Apache Cassandra |
| Big data | Store and analyze petabyte-scale data such as Twitter feeds, clickstream data, and logs | Amazon EMR, Hadoop, Apache Spark, Apache Hive, Apache HBase |
| In-memory cache | Caching layer between the application server and the database, such as for gaming scoreboards | Amazon ElastiCache |
| Object store | Store unlimited amounts of objects and files such as log files, images, Twitter feeds, and backups | Amazon Simple Storage Service (Amazon S3) |
Use these data stores in your data processing architecture to improve the performance and availability of your application and reduce costs. In this post, we explain how to minimize the TCO of high-volume, on-premises traditional databases by using the following AWS services:
- Amazon DynamoDB
- Amazon S3
- Amazon Relational Database Service (Amazon RDS)
The following diagram shows the refactoring of a large relational database into multiple data stores such as DynamoDB and Amazon S3.
This approach differs from migrating the entire database to a new database engine (such as Oracle to PostgreSQL) on Amazon RDS or Amazon EC2, for two reasons.
First, many people build applications that use commercial database engines. These applications might use a feature that is specific to a commercial relational database engine. Such a feature might not be available in open-source database engines, and replacing it can require a significant amount of rework. Keeping the same relational database engine, with a smaller footprint, in the end-state architecture helps you move to the cloud faster and minimizes the amount of change required in the application or database layer. Like-to-like database migration is the easiest path because code and database objects are migrated using native database migration tools or AWS Database Migration Service (AWS DMS).
Second, in this approach, we suggest that you identify a portion of the database that can be migrated to Amazon S3 and Amazon DynamoDB. You can see this in the earlier diagram. This helps you address the performance and TCO issues caused by rapidly growing data in logs, clickstream data, large objects, and the like.
In the remainder of this post, we discuss the approach in three parts:
- Moving a subset of the relational database to Amazon EC2 or Amazon RDS.
- Moving large objects from the relational database to Amazon S3.
- Moving a subset of the relational database to a NoSQL data store such as DynamoDB.
Moving a subset of data to Amazon EC2 or Amazon RDS
The easiest part of this approach is identifying the part of your database you want to remain unchanged. You can migrate this part of your database to the same database engine on Amazon RDS, or you can run it as a self-managed database on Amazon EC2.
Amazon RDS is a managed service that makes it easy to set up, operate, and scale a relational database in the cloud. Amazon RDS automatically does the work involved in setting up a relational database, from provisioning the infrastructure capacity to installing the database software. Amazon RDS also takes care of common administrative tasks such as performing backups and patching the software that powers your database. With Multi-AZ deployments, Amazon RDS manages synchronous data replication across Availability Zones with automatic failover. By migrating your database to Amazon RDS, you can save on TCO, reduce operational overhead, and achieve higher levels of availability and reliability for your relational database.
To determine which part of your database to leave unchanged, consider these questions:
- Is most of the data in the dataset transactional data that is read and written frequently, as single rows or in small batches?
- Does the processing and storing of this data use database-specific features such as database options?
- Was procedural code written for this dataset that is difficult to change? For example, millions of lines of PL/SQL code in an Oracle database.
- Was application and user interface code written for this dataset that is difficult to change?
- Does your application issue complex SQL queries against this data and join data from multiple tables?
- Are table and entity definitions (the number of columns, data types, and the like) in your database schema going to remain fixed as your application evolves? Would you like to enforce constraints across different tables in your data model while storing the data?
If you answered “Yes” to all or most of the preceding questions about a subset of your relational database, that subset is more suitable to be migrated to a relational database on Amazon RDS or Amazon EC2. The subset of data you identify remains unchanged and can be the majority of your database, say 50 to 60 percent. To determine the size of this data subset, database administrators or developers can query the database’s data dictionary for the size of the data that would be migrated to Amazon RDS or Amazon EC2.
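As a minimal sketch of that sizing step, the following Python helper takes per-table sizes (as you might query them from the engine's data dictionary, for example Oracle's `DBA_SEGMENTS`) and computes the share of total storage held by the tables that will stay relational. The table names and the function itself are illustrative, not from the post.

```python
def subset_share(table_sizes_bytes, subset_tables):
    """Return the fraction of total database storage taken by the
    tables that will remain in the relational database.

    table_sizes_bytes: dict mapping table name -> size in bytes,
    typically gathered from the engine's data dictionary.
    """
    total = sum(table_sizes_bytes.values())
    subset = sum(table_sizes_bytes[t] for t in subset_tables)
    return subset / total

# Example: transactional tables stay relational; clickstream moves out.
sizes = {"orders": 600, "customers": 200, "clickstream": 1200}
share = subset_share(sizes, ["orders", "customers"])  # 800 of 2000 bytes
```

A quick check like this also gives you a concrete estimate of how much smaller the Amazon RDS instance can be after the offload.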
To migrate the data, you can use your database engine’s native tools, AWS DMS, AWS Schema Conversion Tool, or third-party tools. AWS DMS supports homogeneous and heterogeneous database migrations. It supports full-load, one-time migrations or ongoing migrations of databases to help reduce application downtime. For more information about how to migrate commercial database solutions to AWS by using native tools and AWS DMS, see Best Practices for AWS Database Migration Service.
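To give a feel for what an AWS DMS task works with, the sketch below builds the JSON table-mapping rules that a replication task uses to select which tables to migrate. The schema name `APP` and the helper function are assumptions for illustration; the JSON shape follows the DMS selection-rule format.

```python
import json

def build_table_mappings(schema_name, table_pattern="%"):
    """Build AWS DMS table-mapping rules that include every table
    matching table_pattern in the given schema."""
    return json.dumps({
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-tables",
                "object-locator": {
                    "schema-name": schema_name,
                    "table-name": table_pattern,
                },
                "rule-action": "include",
            }
        ]
    })

# The resulting string is what you would pass as the TableMappings
# parameter of a DMS replication task (for example, via
# dms.create_replication_task(...) in boto3), together with the
# source endpoint, target endpoint, and replication instance ARNs.
mappings = build_table_mappings("APP")
```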
Moving large objects from a relational database to Amazon S3
Many customers design applications that store Character Large Object (CLOB) and Binary Large Object (BLOB) data in relational database tables. It is natural to do this because most of the relational database solutions allow these data types and provide ways to store and access large objects (LOBs).
Storing CLOBs and BLOBs in a database includes the following benefits:
- Point-in-time consistency. When you restore a database to a specific time, all data and objects return to the same state as they were previously.
- Ease of access.
- Security and access controls.
If your application is generating these LOBs in gigabytes per second, storing and processing the data in a relational database can consume a high amount of network throughput, disk I/O, CPU, and memory. For applications that can tolerate transactional latency in seconds while storing and retrieving large objects, offloading this data to Amazon S3 can improve performance and availability. Amazon S3 gives you access to a highly scalable, reliable, fast, inexpensive data storage infrastructure. Because Amazon S3 is highly scalable, you can start small and grow your application as you want, with no compromise in performance or reliability.
To achieve point-in-time consistency for LOBs, you have to modify your application logic to store the version ID of the modified object in Amazon S3 along with the transaction in the relational database. This change in the application is minimal compared to the TCO savings and performance improvement of the overall application.
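One way to implement this is to capture the S3 version ID at write time and record it in the same database transaction as the business data. The sketch below assumes a versioned S3 bucket; the table, column, and function names are hypothetical, and the S3 client is passed in so the flow can be exercised without AWS access.

```python
def store_document(s3_client, db_cursor, bucket, key, payload, order_id):
    """Upload a LOB to a versioned S3 bucket and store the returned
    version ID alongside the transactional row, so a point-in-time
    database restore can fetch the matching object version later."""
    response = s3_client.put_object(Bucket=bucket, Key=key, Body=payload)
    # VersionId is returned when bucket versioning is enabled.
    version_id = response["VersionId"]
    db_cursor.execute(
        "UPDATE orders SET doc_key = ?, doc_version_id = ? WHERE id = ?",
        (key, version_id, order_id),
    )
    return version_id
```

On restore, the application reads `doc_version_id` from the restored row and retrieves exactly that object version (for example, `get_object(Bucket=..., Key=..., VersionId=...)`), which keeps the LOB consistent with the database state.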
Moving historical data to Amazon S3
Many large databases contain historical data that is not often read but consumes storage and CPU cycles. You might need to retain this data, for example, for compliance and regulatory purposes. Instead of keeping this data in a relational database, you can create archive files from data in your relational database for a particular period and offload it to Amazon S3 or Amazon S3 Glacier.
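A simple way to structure such an offload is to group historical rows by period and write each group as one archive object. The sketch below is an assumption about how you might organize it; the key layout and function names are illustrative, not prescribed by the post.

```python
from collections import defaultdict
from datetime import date

def partition_by_month(rows):
    """Group historical rows by (year, month) so each group can be
    written out as a single archive object in Amazon S3.

    Each row is a (created_on, record) tuple, where created_on is a
    datetime.date."""
    groups = defaultdict(list)
    for created_on, record in rows:
        groups[(created_on.year, created_on.month)].append(record)
    return groups

def archive_key(table, year, month):
    # Hypothetical key layout. Prefixing by table/year/month keeps
    # archives easy to locate, query selectively, and expire to
    # S3 Glacier with lifecycle rules.
    return f"archive/{table}/{year:04d}/{month:02d}.csv.gz"
```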
When your application needs access to the data in those archive files, you can use Amazon S3 features such as Amazon S3 Select or Amazon S3 Glacier Select. This approach can reduce data storage and retrieval costs significantly while providing durability (long-term data protection) of 99.999999999% and availability (system uptime) of 99.99% over a given year.
Moving a subset of data to a NoSQL data store such as DynamoDB
As a last part of the overall strategy, consider DynamoDB for storing a subset of data from your relational database. Amazon DynamoDB is a fully managed NoSQL database that supports key-value and document data structures. DynamoDB lets you offload the administrative burdens of operating and scaling a distributed database so that you don’t have to worry about hardware provisioning, setup and configuration, replication, software patching, or cluster scaling.
To identify the data that can be stored in a NoSQL data store such as DynamoDB, consider the following factors:
- Is data written frequently or at high velocity?
- Do individual records have variable numbers of fields and attributes that cannot be determined as part of your schema design?
- Is data written independently of other tables in the schema?
- Can data be read as eventually consistent? With eventual consistency, data written to the database might not be immediately available for reads, but typically becomes available within seconds.
- Do your relational database tables store JSON documents that your applications retrieve with simple queries?
- Do you need a data store that supports ACID transactions?
- Do you need to replicate database tables globally (in other words, across AWS Regions) to serve your global user base with low latency?
If you answer “Yes” to all or most of the preceding questions for a subset of your relational database, then that subset of data is more suitable to be migrated to a NoSQL data store such as Amazon DynamoDB.
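To make the "variable numbers of fields" question above concrete, here is a small sketch. The items and attribute names are invented for illustration: two items in the same hypothetical DynamoDB table share only the key attribute, while each carries its own extra attributes, something a fixed relational schema would model as many mostly-NULL columns.

```python
# Two items destined for the same (hypothetical) DynamoDB table.
# Only the partition key attribute, user_id, is common to both;
# everything else varies per item.
profile = {"user_id": "u1", "name": "Ana", "favorite_team": "FC Example"}
sensor = {"user_id": "u2", "name": "Ben", "last_gps_fix": [48.2, 16.4], "battery_pct": 71}

def extra_attributes(item):
    """Attributes beyond the partition key, which vary per item."""
    return set(item) - {"user_id"}
```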
In addition to the preceding considerations, do not move data to a NoSQL data store that:
- Is aggregated from many tables for your application’s user interface.
- Is used heavily as a source for online analytical processing (OLAP).
- Is treated as BLOB data types.
- Has individual record sizes larger than 400 KB.
- Would require significant amounts of change in the application layer.
Understanding these characteristics of your data helps you decide which data to move to the NoSQL data store and which data you should continue to store in a relational database. After making the decision to move the relevant data to a NoSQL data store, designing a data model in DynamoDB is key to performance as your data grows. DynamoDB distributes the data in a table across multiple partitions on different physical servers, based on the hash value of the table’s primary key (the partition key). Depending on your read/write pattern, you should select a partition key that allows DynamoDB to distribute your data evenly across the partitions. This improves the read/write throughput of your tables. For more information about choosing the right partition key in DynamoDB, see Choosing the Right DynamoDB Partition Key.
After you complete the design of your DynamoDB table, you can use AWS DMS to migrate data from your relational database to DynamoDB. For more information, see AWS Database Migration Service and Amazon DynamoDB: What You Need to Know.
In addition to the benefits mentioned earlier in this section, by migrating a subset of data from your relational database to DynamoDB, you pay only for the read/write capacity that you provision for your tables, which can yield significant cost savings. You can also use DynamoDB on-demand capacity mode, which serves thousands of requests per second on a table without capacity planning. For more information, see Amazon DynamoDB Pricing.
In this blog post, you read about breaking up your monolithic databases into purpose-built data stores as you migrate your databases to AWS. You can realize significant licensing and infrastructure cost savings as well as improved availability by identifying the right data store for your data.
If you have comments about this blog post, submit them in the Comments section below. If you have questions about implementing the solutions here, you can leave them in the Comments section too.
About the authors
Ejaz Sayyed is a partner solutions architect with the Global System Integrator team at Amazon Web Services. His focus areas include AWS database services as well as database and data warehouse migrations on AWS.
Karthik Thirugnanasambandam is a partner solutions architect at Amazon Web Services. He works with large Global System Integrators on AWS Cloud adoption. He is a DevOps and serverless enthusiast. When not working, he enjoys reading and spending time with his family.