Structured Data Storage

Most applications developed today need structured data stores, i.e. data stores that allow storage of data records that have multiple attributes, and allow the application to query its data based on one or more of these attributes. The most popular example of a structured data store is a relational database. Relational databases allow customers to define their data through a pre-defined relational schema and to query their data using SQL. In addition to relational databases, there are several non-relational structured data stores that allow customers to store data records with multiple attributes and query based on one or more of these attributes. These non-relational data stores differ in how they structure data, ranging from enforcing no schema at all to enforcing a semi-structured schema.

Details

Submitted By: Brett@AWS
AWS Products Used: Amazon EC2, Amazon RDS, Amazon SimpleDB
Created On: November 24, 2009 10:45 PM GMT
Last Updated: November 24, 2009 10:52 PM GMT

Amazon Web Services — Structured Data Storage Options

Amazon Web Services provides a number of options for storing structured data.

  • Amazon RDS enables you to run a fully featured relational database while offloading database administration.
  • Amazon SimpleDB provides simple index and query capabilities with seamless scalability.
  • Using one of our many relational database AMIs on Amazon EC2 and Amazon EBS allows you to operate your own relational database in the cloud.

There are important differences between these alternatives that may make one more appropriate for your use case.

Amazon EC2 - Relational Database AMIs

You can use any of a number of leading relational databases on Amazon EC2. You can use an Amazon EC2 instance to run a database, and store the data within an Amazon Elastic Block Store (Amazon EBS) volume. Amazon EBS is a fast and reliable persistent storage feature of Amazon EC2. By designing, building, and managing your own relational database on Amazon EC2, you avoid the friction of provisioning and scaling your own infrastructure, while gaining access to a variety of standard database engines over which you can exert full administrative control. Available AMIs include IBM DB2, MySQL, Oracle, PostgreSQL, SQL Server, Sybase, and Vertica.

Amazon RDS

If your application requires relational storage, but you want to reduce the time you spend on database management, Amazon RDS automates common administrative tasks to reduce the complexity and total cost of ownership. Amazon RDS automatically backs up your database and maintains your database software, allowing you to spend more time on application development.  With the native database access Amazon RDS provides, you get the programmatic familiarity, tooling and application compatibility of a traditional RDBMS. You also benefit from the flexibility of being able to scale the computing resources or storage capacity associated with your relational database instance using a single API call. 

With Amazon RDS, you still control the database settings that are specific to your business. This includes building a relational schema to fit your use case, creating indices, and tuning the performance of your database to your application's workflow.  You also take an active role in the scaling decisions for your database; you tell the service when you want to add more storage or change to a larger or smaller DB Instance class.

Amazon SimpleDB

For database implementations that do not require a relational model, and that principally demand index and query capabilities, Amazon SimpleDB eliminates the administrative overhead of running a highly available production database, and is not bound by the strict requirements of an RDBMS. With Amazon SimpleDB, you store and query your data items with simple web service requests, and Amazon SimpleDB does the rest. In addition to handling infrastructure provisioning, software installation and maintenance, Amazon SimpleDB automatically indexes your data, creates geo-redundant replicas of your data to ensure high availability, and performs database tuning on your behalf. Amazon SimpleDB also provides no-touch scaling. That is, there is no need to anticipate and respond to changes in request load or database utilization. The service simply responds to traffic as it comes and goes, charging you only for the resources you consume. Finally, Amazon SimpleDB doesn't enforce a rigid schema for your data. This gives you flexibility: if your business changes, you can easily reflect these changes in Amazon SimpleDB without any schema updates or changes to your database code.

Amazon SimpleDB, however, is not a relational database, and does not offer some features needed in certain applications, such as complex transactions or joins. For these you need an RDBMS.

Functional Overview

Build your own database on Amazon EC2/EBS

You can use an Amazon EC2 instance to run a database, and store the data within an Amazon Elastic Block Store (Amazon EBS) volume. An Amazon Machine Image (AMI) is an encrypted machine image stored in Amazon S3. It contains all the information necessary to boot instances of your software. Many existing AMIs come packaged with relational databases, including:

  • IBM DB2
  • Microsoft SQL Server
  • MySQL
  • Oracle
  • PostgreSQL
  • Sybase

Once you've launched one of these pre-built AMIs (or deployed some other database software on an Amazon EC2 instance), you'll want to create an Amazon Elastic Block Store (Amazon EBS) volume to persist your structured data. Amazon EBS is storage designed specifically for Amazon EC2 instances, allowing you to create volumes that can be mounted as devices by EC2 instances. Amazon EBS volumes behave as if they were raw unformatted external hard drives. They have user-supplied device names and provide a block device interface. You can create up to 20 Amazon EBS volumes of any size from one gigabyte up to one terabyte, whichever is appropriate for your data set.

Amazon EBS provides the ability to create snapshots of your Amazon EBS volumes to Amazon S3. You can use these snapshots as the starting point for new Amazon EBS volumes and to protect your data for long term durability.

Amazon EBS Volumes

  • Persist beyond the lifetime of instances, protecting against data loss in the unlikely event of Amazon EC2 instance failure
  • Provide high availability and reliability
  • Attach to and detach from a running instance—allowing you, for example, to snapshot your data set, instantiate a new instance, and deploy a test or development database

Amazon EBS Snapshots

  • Capture the current state of a volume
  • Provide backup protection
  • Can be used to instantiate new volumes, which contain the exact data of the snapshot

Amazon RDS

Getting started with Amazon RDS begins with the creation of a database instance (generally referred to as a DB Instance). This DB Instance is a fully functional database that you can access and interact with much like any stand-alone database server. An Amazon RDS DB Instance can contain multiple user-created databases, and can be accessed using the same command line tools and utilities used for stand-alone database servers.

Amazon RDS DB Instances are created using either command line tools or the APIs. Using the rds-create-db-instance command or the CreateDBInstance API, you can create your own Amazon RDS DB Instance by specifying:

  • Instance identifier—a unique name for your Amazon RDS DB Instance
  • Database engine—the underlying relational database engine (MySQL 5.1 currently supported)
  • Compute class—the class whose memory and compute power meet your requirements
  • Storage—amount of storage allocated to the DB Instance (from 5 GB to 1 TB)
  • Master user—user with permission to create databases, manage users, etc.
  • Master user password—password associated with the master user account

You can check the status of your create instance request with the rds-describe-db-instances command or the DescribeDBInstances API, and can start using your Amazon RDS DB Instance as soon as the instance status is "available."
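The creation inputs listed above can be sketched as a plain request-parameter set. This is a minimal illustration that sends nothing to AWS: the helper name is hypothetical, the parameter keys are written as they are believed to appear in the CreateDBInstance API and should be checked against the API reference, and the identifier, class, engine string, and credentials are example values only.

```python
# Sketch only: builds the parameter set for a CreateDBInstance request.
# Nothing is sent to AWS; request signing and transport are out of scope.

MIN_STORAGE_GB = 5      # lower bound stated in this article
MAX_STORAGE_GB = 1024   # 1 TB upper bound stated in this article


def build_create_db_instance_params(identifier, instance_class, storage_gb,
                                    master_user, master_password,
                                    engine="MySQL5.1"):
    """Return a dict of CreateDBInstance parameters (hypothetical helper)."""
    if not MIN_STORAGE_GB <= storage_gb <= MAX_STORAGE_GB:
        raise ValueError("storage must be between 5 GB and 1 TB")
    return {
        "Action": "CreateDBInstance",
        "DBInstanceIdentifier": identifier,   # instance identifier
        "Engine": engine,                     # database engine (assumed value)
        "DBInstanceClass": instance_class,    # compute class
        "AllocatedStorage": str(storage_gb),  # storage in GB
        "MasterUsername": master_user,
        "MasterUserPassword": master_password,
    }


# Example values only; none of these come from the article.
params = build_create_db_instance_params(
    "mydbinstance", "db.m1.small", 20, "admin", "example-password")
```

Validating the storage range locally catches an out-of-range request before it is ever submitted.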

Next, to protect against data loss, Amazon RDS enables point-in-time recovery by automatically creating a backup of your database. This backup occurs during a daily, user-configurable 2-hour period known as the backup window. Backups created during this window are retained for a user-configurable number of days (the retention period).

You can enable point-in-time recovery for an Amazon RDS DB Instance by setting the backup-retention-period parameter to a non-zero value using the rds-create-db-instance or rds-modify-db-instance commands, or the CreateDBInstance or ModifyDBInstance APIs. When a backup retention period is changed to a non-zero value, the first backup occurs immediately. Changing the backup retention period to 0 turns off automatic point-in-time backups for the instance, and will delete all automated backups for the instance. Turning off automated backups is discouraged.

Finally, if demand for your database grows beyond the capacity of your initial DB Instance, you can scale the computing resources and storage capacity with the ModifyDBInstance API. You can change memory and CPU resources by changing your DB Instance class, and change available storage when you modify your storage allocation. Your requested changes are applied during your specified maintenance window, or you can use the "apply-immediately" flag. Bear in mind that using this flag will apply any other pending system changes as well.
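As a rough sketch of such a scaling request, the helper below assembles only the settings being changed. The function name is hypothetical, the parameter keys are written as they are believed to appear in the ModifyDBInstance API, and nothing is sent to AWS.

```python
# Sketch only: builds the parameter set for a ModifyDBInstance request.
def build_modify_db_instance_params(identifier, instance_class=None,
                                    storage_gb=None, apply_immediately=False):
    """Return ModifyDBInstance parameters, including only requested changes."""
    params = {
        "Action": "ModifyDBInstance",
        "DBInstanceIdentifier": identifier,
    }
    if instance_class is not None:
        params["DBInstanceClass"] = instance_class    # change memory and CPU
    if storage_gb is not None:
        params["AllocatedStorage"] = str(storage_gb)  # change storage (GB)
    if apply_immediately:
        # Without this flag the change waits for the maintenance window.
        params["ApplyImmediately"] = "true"
    return params


# Example values only: move to a larger class right away.
scale_up = build_modify_db_instance_params(
    "mydbinstance", instance_class="db.m1.large", apply_immediately=True)
```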

Amazon SimpleDB

Amazon SimpleDB provides a simple web service interface to create and store multiple data sets, query your data easily, and return the results. Your data is automatically indexed, making it easy to quickly find the information that you need. There is no need to pre-define a schema or change a schema if new data is added later. And scaling out is as simple as creating new domains, rather than building out new servers.

The first step in storing data in Amazon SimpleDB is to create one or more domains. Domains are similar to database tables, except that the Amazon SimpleDB API cannot operate across domains: a single query cannot span multiple domains, and there are no foreign keys. Your application can still query several domains itself and combine the results. Either way, you should plan an Amazon SimpleDB data architecture that meets the needs of your project.
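One way an application might combine results from several domains is to run the same query against each domain and merge the results itself. The sketch below uses in-memory dictionaries as stand-ins for domains; the domain names, item names, and helper functions are all hypothetical, and a real application would issue one request per domain instead.

```python
# Sketch only: in-memory dicts stand in for SimpleDB domains.
def query_domain(domain_items, predicate):
    """Return (item_name, attributes) pairs in one domain that match."""
    return [(name, attrs) for name, attrs in domain_items.items()
            if predicate(attrs)]


def query_across_domains(domains, predicate):
    """Run the same query against every domain and merge the results."""
    results = []
    for domain_name in sorted(domains):
        for item_name, attrs in query_domain(domains[domain_name], predicate):
            results.append((domain_name, item_name, attrs))
    return results


# Hypothetical data: orders split across one domain per year.
domains = {
    "Orders2008": {"order-1": {"customer": "alice", "total": "30"}},
    "Orders2009": {"order-2": {"customer": "alice", "total": "45"},
                   "order-3": {"customer": "bob", "total": "12"}},
}
matches = query_across_domains(
    domains, lambda attrs: attrs.get("customer") == "alice")
```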

After creating a domain, you are ready to start putting data into it. The PutAttributes operation creates or replaces attributes in a single item. The attributes are specified using the Attribute.X.Name and Attribute.X.Value parameters: the first attribute by Attribute.1.Name and Attribute.1.Value, the second by Attribute.2.Name and Attribute.2.Value, and so on. To create or replace attributes for multiple items in a single call, which can increase throughput, use the BatchPutAttributes API.
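The numbered parameter scheme can be illustrated with a small helper that flattens a list of attribute pairs into request parameters. The helper name, domain, item name, and attribute values are hypothetical; the sketch only builds the parameters and sends nothing.

```python
# Sketch only: flattens attribute pairs into PutAttributes parameters.
def build_put_attributes_params(domain, item_name, attributes):
    """Number each (name, value) pair as Attribute.N.Name / Attribute.N.Value."""
    params = {
        "Action": "PutAttributes",
        "DomainName": domain,
        "ItemName": item_name,
    }
    for index, (name, value) in enumerate(attributes, start=1):
        params[f"Attribute.{index}.Name"] = name
        params[f"Attribute.{index}.Value"] = value
    return params


# Example values only.
put_params = build_put_attributes_params(
    "Products", "item-001", [("Color", "Blue"), ("Size", "Large")])
```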

To retrieve your data, simply issue a GetAttributes call to retrieve a specific item, or use Select, a query syntax very similar to the SQL Select, to query your data set for items that meet specified criteria.
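To give a feel for the Select syntax, the sketch below assembles a simple equality query. The helper names and sample values are hypothetical, and the quoting shown (backticks around names, single quotes around values, each escaped by doubling) reflects the Select quoting rules as understood here and should be checked against the Amazon SimpleDB documentation.

```python
# Sketch only: builds a Select expression string; nothing is sent.
def quote_name(name):
    """Wrap a domain or attribute name in backticks, doubling any inside."""
    return "`" + name.replace("`", "``") + "`"


def quote_value(value):
    """Wrap an attribute value in single quotes, doubling any inside."""
    return "'" + value.replace("'", "''") + "'"


def build_select(domain, conditions):
    """Build 'select * from <domain>' with optional equality conditions."""
    expression = "select * from " + quote_name(domain)
    if conditions:
        clauses = [f"{quote_name(attr)} = {quote_value(val)}"
                   for attr, val in conditions]
        expression += " where " + " and ".join(clauses)
    return expression


# Example values only.
expr = build_select("Products", [("Color", "Blue"), ("Brand", "O'Neil")])
```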

Service Differences & Implications

A primary difference between the services is the data model. For relational databases built on Amazon EC2/EBS or for Amazon RDS, the data model is, quite clearly, relational, as depicted below:

Figure 1. Simple relational data model.

In simple terms, the data is normalized into separate tables, with primary key/foreign key relationships associating the tables to one another. With Amazon SimpleDB, however, there is no notion of relations, and no requirement to develop a (sometimes complex) schema to represent your data. Rather, you organize your data set into domains, and can run queries across all of the data stored in a particular domain. Domains are collections of items that are described by attribute/value pairs. While some developers choose to mimic the relational model (e.g. creating a Products domain, an Orders domain, and so on) all of the data could in fact be co-mingled. In addition, Amazon SimpleDB allows you to easily go back later and add new attributes that only apply to certain items. Thus, Amazon SimpleDB provides the developer with greater flexibility in data storage, but at the cost of less embedded functionality.
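The schema-less model can be pictured with ordinary dictionaries. In this simplified sketch a domain is a mapping from item names to attribute/value pairs; the item names and attributes are invented for illustration, and each attribute holds a single value.

```python
# Sketch only: a domain as a dict of items, each a dict of attribute/value pairs.
products = {
    "item-001": {"Name": "Sweater", "Color": "Blue"},
    "item-002": {"Name": "Novel", "Author": "J. Doe"},
}

# Later, add an attribute that applies to one item only. No schema change
# is needed, and item-002 is unaffected.
products["item-001"]["Size"] = "Large"


def items_with_attribute(domain, attribute):
    """Return the names of the items that carry the given attribute."""
    return sorted(name for name, attrs in domain.items() if attribute in attrs)


sized_items = items_with_attribute(products, "Size")
```

Items in the same domain need not share any attributes, which is what allows unrelated records to be co-mingled.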

An example of the flexibility/functionality trade-off is complex transactions. Amazon RDS (and self-managed relational databases on Amazon EC2), by nature of their relational models, allow for set-based updates and deletes. For an example, refer again to the simple data model in Figure 1. An insert that creates a new order will flow through the various tables to update values like InStockQuantity in the Products table. Amazon SimpleDB, due to its flat data model, cannot support such cascading updates. Each item is treated autonomously, with no relation to other items in the same or different domains.
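The relational side of this trade-off can be sketched as a single transaction that records an order and adjusts stock together. SQLite from the Python standard library stands in for MySQL here so the example runs locally; the Products and Orders tables and InStockQuantity follow the article's example, while the other column names and the place_order helper are assumptions.

```python
import sqlite3

# Sketch only: SQLite stands in for a relational database such as MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (
        ProductId INTEGER PRIMARY KEY,
        InStockQuantity INTEGER NOT NULL
    );
    CREATE TABLE Orders (
        OrderId INTEGER PRIMARY KEY,
        ProductId INTEGER NOT NULL REFERENCES Products(ProductId),
        Quantity INTEGER NOT NULL
    );
    INSERT INTO Products (ProductId, InStockQuantity) VALUES (1, 10);
""")


def place_order(conn, product_id, quantity):
    """Insert an order and decrement stock in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO Orders (ProductId, Quantity) VALUES (?, ?)",
            (product_id, quantity))
        cursor = conn.execute(
            "UPDATE Products SET InStockQuantity = InStockQuantity - ? "
            "WHERE ProductId = ? AND InStockQuantity >= ?",
            (quantity, product_id, quantity))
        if cursor.rowcount != 1:
            raise ValueError("insufficient stock")


place_order(conn, 1, 3)
stock = conn.execute(
    "SELECT InStockQuantity FROM Products WHERE ProductId = 1").fetchone()[0]
```

If the stock update fails, the order row is rolled back with it, so the two tables never disagree.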

Relational databases on Amazon EC2 and Amazon RDS are strictly consistent, meaning that any attempt to read an item will always return the very latest update to that item. To ensure this strict consistency, the relational databases lock each record during an update, making it unavailable to be read. Conversely, Amazon SimpleDB uses eventual consistency and allows what are called dirty reads, returning a response from whichever replica can respond the fastest. If an update is still in progress on another replica, the user may read an outdated value for the item. In general, however, most inserts and updates propagate across the multiple Amazon SimpleDB replicas within 1-2 seconds, lowering the probability of reading an outdated value.
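A toy simulation can make the stale-read window concrete. The class below is purely illustrative and is not how Amazon SimpleDB is implemented: a write lands on one replica at once and reaches the others only when propagate() is called, so a read served by a lagging replica returns the old value.

```python
# Sketch only: a toy model of eventual consistency, not SimpleDB internals.
class EventuallyConsistentStore:
    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) not yet applied

    def put(self, key, value):
        """Apply the write to replica 0 now; queue it for the others."""
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def propagate(self):
        """Deliver all queued writes to the lagging replicas."""
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()

    def get(self, key, replica_index):
        """Read from one specific replica, which may be out of date."""
        return self.replicas[replica_index].get(key)


store = EventuallyConsistentStore()
store.put("item-001", "v1")
store.propagate()
store.put("item-001", "v2")
stale = store.get("item-001", replica_index=2)  # replica 2 has not caught up
store.propagate()
fresh = store.get("item-001", replica_index=2)  # now reflects the update
```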

A final primary difference between these solutions is in the service model. Each product provides some automation to make it much simpler to provision than an on-premises solution. However, the amount of administration that the service handles varies, as shown below:

| Build your own on Amazon EC2/EBS | Amazon RDS | Amazon SimpleDB |
| --- | --- | --- |
| Automated hardware provisioning | Automated hardware provisioning | Automated hardware provisioning |
| User-controlled software updates/patching | Automated software updates/patching | Automated software updates/patching |
| User-initiated backups or snapshots | Automated backups (administered by user) and user-initiated snapshots | Automated geo-redundant replication |
| User responsible for indexing, query tuning | User responsible for indexing, query tuning | Automated indexing, query tuning |
