Relating to Amazon SimpleDB

Do you have data in a relational database that you'd like to migrate to Amazon SimpleDB? AWS developer David Kavanagh describes the considerations involved and shows you how to use his DBLoader tool to help with the migration.


Submitted By: D. Kavanagh
AWS Products Used: Amazon SimpleDB
Language(s): Java
Created On: March 04, 2008


By David Kavanagh

The Amazon SimpleDB service has the ability to store data and perform queries on that data. At first blush, you might think of it as a database offered as a web service. This is partially true. For some applications, Amazon SimpleDB can very easily handle data storage needs, especially when coupled with the Amazon S3 service for large data files. This article will focus on ways in which existing data in a relational database can be imported into Amazon SimpleDB, and things to be aware of when undertaking this task.

Background

I'd been working with SimpleDB during a private beta, before it was announced. While there have been some minor changes to the APIs as Amazon fine-tuned the service, it has remained a convenient and reliable way to store metadata for some applications I've worked on. To understand how data is stored in Amazon SimpleDB, here's some info from the FAQ:

You organize your structured data into domains and can run queries across all of the data stored in a particular domain. Domains are comprised of items, and items are described by attribute-value pairs. To understand these elements, consider the metaphor of data stored in a spreadsheet table. An Amazon SimpleDB domain is like a worksheet, items are like rows of data, attributes are like column headers, and values are the data entered in each of the cells.

However unlike a spreadsheet, Amazon SimpleDB allows for multiple values to be associated with each "cell" (e.g., for item "123," the attribute "color" can have both value "blue" and value "red"). Additionally, in Amazon SimpleDB, each item can have its own unique set of associated attributes (e.g., item "123" might have attributes "description" and "color" whereas item "789" has attributes "description," "color" and "material"). Amazon SimpleDB automatically indexes your data, making it easy to quickly find the information that you need. There is no need to pre-define a schema or change a schema if new data is added later.

The Amazon SimpleDB service has some limits in place that are good to understand before you attempt to load data into it. These are:

  • 10GB of storage per domain
  • 100 domains per customer account
  • 256 attribute values per item (this is supposed to be raised at some point)
  • Attribute names and attribute values can be up to 1024 bytes in length each

Table Structure and Foreign Keys

Anyone who's used a relational database will know that tables have a primary key and often foreign key relationships. A heavily normalized database will have many foreign key relationships and often many levels of indirection. In Amazon SimpleDB, you could store all of your tables in one domain, but if you did that, you might run into the domain storage limit. Alternatively, you could use one domain per table to maximize the storage available to each table you move over. If you have no more than 100 tables, this would work fine.

For a little background, database normalization refers to a relational database design technique that tries to minimize duplication of data by breaking data into smaller chunks and using references to the data instead of the data itself. One example is below, where one table stores people and another stores addresses. While storing all of that information in one table is possible, the information is stored more efficiently by keeping the address information in only one row of a separate table. The act of de-normalizing a database involves putting data back together (using fewer tables, but having more data redundancy).

With normalized databases, joins are often used to combine data from different tables into one result set. In Amazon SimpleDB, there is no join equivalent. To get data from multiple tables (domains) and combine it, you'd need to perform multiple queries (and GetAttributes operations). In many cases, it might be better to de-normalize your data, which would put more columns in some domains. In fact, this approach would help you fit more tables into fewer domains, and possibly reduce your table count enough to stay under the 100-domain limit. Take a database with the following tables:

People Table

id   name             address_id   timestamp
1    David Kavanagh   1            2008-02-10 16:16:30
2    Jeff Rice        1            2008-02-10 16:16:33

Address Table

id   street             city        state   zip     timestamp
1    29 Goodway Drive   Rochester   NY      14623   2008-02-10 16:16:26

You may want to combine both tables into one domain in Amazon SimpleDB. That way, the complete mailing address can be pulled with one GetAttributes call. Another alternative would be to use some client-side caching of addresses, though you'd need to determine when the cache is dirty (which would be complicated by a distributed application using the same domain data). For the purposes of this article, we'll assume that we're OK with the current tables, which will be loaded into two separate domains in Amazon SimpleDB. Also assume that the 'id' field for each record will become the item identifier, which equates pretty well to a primary key. Given that, you'd need to take the following steps to extract my address from the two domains:

  • Issue Query on 'people' domain : ['name' = 'David Kavanagh']
  • GetAttributes on 'people' domain for item 1
  • GetAttributes on 'address' domain for item 1
  • Aggregate the results from both GetAttributes calls

Notice that none of these steps can be done in parallel, since each depends on the results of the previous one. Through clever application coding, one could provide a time-limited cache of addresses that would limit consistency issues and eliminate one GetAttributes call. For that matter, the cache key might be the item id, but it could just as easily be the query string, which would also allow caching of the "people" query.
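
To make the sequence concrete, here is a minimal sketch of those four steps in Java. The SimpleDbClient interface and its query/getAttributes methods are hypothetical stand-ins for whatever SimpleDB library you use (typica provides equivalent Query and GetAttributes calls):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AddressLookup {
    // Hypothetical thin client; typica offers equivalent Query/GetAttributes calls.
    interface SimpleDbClient {
        List<String> query(String domain, String queryExpression);
        Map<String, String> getAttributes(String domain, String itemName);
    }

    public static Map<String, String> fetchPersonWithAddress(SimpleDbClient sdb, String name) {
        // Step 1: Query the 'people' domain for matching item names.
        List<String> itemNames = sdb.query("people", "['name' = '" + name + "']");
        if (itemNames.isEmpty()) {
            return null;
        }
        // Step 2: GetAttributes on 'people' to obtain address_id (among other attributes).
        Map<String, String> person = sdb.getAttributes("people", itemNames.get(0));
        // Step 3: GetAttributes on 'address' using the foreign key we just fetched.
        Map<String, String> address = sdb.getAttributes("address", person.get("address_id"));
        // Step 4: Aggregate the results from both GetAttributes calls.
        Map<String, String> result = new HashMap<String, String>(person);
        result.putAll(address);
        return result;
    }
}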

To make a determination on just how much to de-normalize, you really have to understand your data and queries. If some tables contain ancillary information about other tables' data and don't get accessed very often, it might not make sense to combine them. Combining would mean paying for extra storage in Amazon SimpleDB, and if you don't use the data very often, you might as well pay the smaller cost of issuing another GetAttributes call from time to time.

Data Conversion

Numeric

All data stored in Amazon SimpleDB is character strings, and the comparison operators for queries do lexicographical comparisons. This means that numeric and date/time data must be stored specially to allow for sensible comparisons in the queries you'll want to run later. For example, the number "1" will compare with the number "3" in the manner you'd expect. The number "10" is larger than "3", but in a lexicographical comparison "10" would be less than "3", because the comparison is decided at the first character and "1" sorts before "3". If the "3" is zero-padded to "03", it will be less than "10" in a lexicographical comparison. The question, then, is how much to pad your numeric data. If you know the upper limit for your numbers, you can pad out to that many digits.

That works just fine for natural numbers. What happens with negative numbers? If I had "-5" and "-2", the "-5" would be seen as higher than the "-2", so that won't work for us. The solution is to add an offset to numbers that can be negative. If the range is going to be -127 to 128, we'd add 127 to the value before storing it in Amazon SimpleDB. We'd also need to subtract this offset when pulling the data out, but the comparisons would work in our queries. Here are some examples of queries that you might use for numeric comparisons:

  • ['mynumber' < '000003']
  • ['mynumber' > '000200' and 'mynumber' < '001000']

It is important to remember that while you need to encode your numeric data when storing it, and decode it when retrieving it, you also need to encode any literals used in query strings!
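
Here is a minimal sketch of the offset-plus-padding technique in Java. Typica's DataUtils class provides similar helpers; the method names below are my own, not DataUtils' API:

public class NumberEncoding {
    // Encode a signed integer for lexicographic comparison: add an offset that
    // shifts the smallest possible value up to zero, then zero-pad to a fixed width.
    public static String encodeInt(long value, long offset, int width) {
        return String.format("%0" + width + "d", value + offset);
    }

    // Reverse the encoding when reading values back out of Amazon SimpleDB.
    public static long decodeInt(String encoded, long offset) {
        return Long.parseLong(encoded) - offset;
    }

    public static void main(String[] args) {
        // Range -127 to 128, so use an offset of 127 as described above.
        System.out.println(encodeInt(-5, 127, 6));    // "000122"
        System.out.println(encodeInt(-2, 127, 6));    // "000125" -- sorts after -5, as desired
        System.out.println(decodeInt("000122", 127)); // -5
        // Remember: query literals need the same treatment, e.g.
        // "['mynumber' < '" + encodeInt(3, 127, 6) + "']"
    }
}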

Since Amazon SimpleDB charges for storage (based on published formulas), one might say it's only worth it to do data conversion on those fields that are used in query comparisons. If you had to zero-pad every number, you could really increase the size of your storage! As an example, let's say you had 100,000 attribute values, an average of 5 digits long, but could be up to 20 digits. You'd zero-pad to 20 digits and your costs (just considering value storage) would look like this:

Padded values:

Raw storage:
100,000 values X 20 bytes = 2,000,000 bytes

Overhead:
100,000 values X 45 bytes per value = 4,500,000 bytes

(2,000,000 + 4,500,000) bytes = 0.0065 GB
0.0065 GB X $1.50 = $0.00975/mo

Non-padded values:

Raw storage:
100,000 values X 5 bytes = 500,000 bytes

Overhead:
100,000 values X 45 bytes per value = 4,500,000 bytes

(500,000 + 4,500,000) bytes = 0.005 GB
0.005 GB X $1.50 = $0.0075/mo

The difference is about a quarter of a cent per month, so no big deal, right? Now consider having 10,000,000 items carrying such an attribute (if each item has 100 such attribute values, multiply the results below by 100):

Padded values:

Raw storage:
10,000,000 values X 20 bytes = 200,000,000 bytes

Overhead:
10,000,000 values X 45 bytes per value = 450,000,000 bytes

(200,000,000 + 450,000,000) bytes = 0.65 GB
0.65 GB X $1.50 = $0.975/mo

Non-padded values:

Raw storage:
10,000,000 values X 5 bytes = 50,000,000 bytes

Overhead:
10,000,000 values X 45 bytes per value = 450,000,000 bytes

(50,000,000 + 450,000,000) bytes = 0.5 GB
0.5 GB X $1.50 = $0.75/mo

The difference is about 23 cents per month for a single attribute (roughly $22.50 with 100 such attributes per item), so I think I've talked myself out of zero padding only some of the data! The reason is that unless your value is quite long, the 45 bytes of overhead assessed for each value overshadows the value length. So, I'd recommend padding all of your numbers, because the cost isn't that high, and it's simpler than trying to go back and correct the padding should you decide to use different fields in your queries.

Dates

Dates present another challenge. Some common date formats such as "Wed, 4 Jul 2007 12:08:56 -0700" or "04-19-1967 11:32:03" won't work for lexicographical comparisons. The reason is that the most significant units of time must come before the less significant units of time. One option could be to represent time as a long value of seconds since some epoch (like Jan 1st, 1970). That might work just fine, but you can also use an ISO 8601 format date and get the same results, plus the data is a bit more readable and doesn't have the limit of the epoch being the beginning of time. ISO 8601 dates look something like this: "2007-07-04T12:08:56.235-0700". A date query is pretty simple to implement with ISO 8601 format dates:

  • ['mydate' < '2008-01-01T00:00:00.000-0500']
  • ['mydate' > '2008-01-01' and 'mydate' < '2008-03-01']
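
Here is a quick sketch of producing sortable date strings in Java with the JDK's SimpleDateFormat. Typica's DataUtils also has date helpers; note that the stored values shown later in this article carry a colon in the offset ("-05:00"), while the Z pattern below emits "-0500" — either form sorts correctly, as long as you pick one and use it consistently:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateEncoding {
    // ISO 8601 with milliseconds and a numeric zone offset, e.g. 2008-01-01T00:00:00.000-0500
    private static final String ISO8601 = "yyyy-MM-dd'T'HH:mm:ss.SSSZ";

    public static String encodeDate(Date d) {
        SimpleDateFormat fmt = new SimpleDateFormat(ISO8601);
        // Normalize every value to one zone so string order matches time order.
        fmt.setTimeZone(TimeZone.getTimeZone("America/New_York"));
        return fmt.format(d);
    }

    public static Date decodeDate(String s) throws ParseException {
        return new SimpleDateFormat(ISO8601).parse(s);
    }
}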

This is an example of some datatypes and how they would be encoded. Consider this table of values from a database.

Miscellany Table

id   s_float   u_float   s_double   u_double   s_integer   u_integer   s_small   u_small   timestamp
1    -1.0      1.0       -2.0       2.0        -5          5           -10       10        2008-02-10 17:52:01

This is how they would look when properly mapped for Amazon SimpleDB. Notice that the unsigned values do not get an offset applied. The maximum must be known for both signed and unsigned numbers. For signed numbers, the minimum value is also important so that the offset can be applied to bring the lowest value above zero.

Miscellany Domain

id 000000000001
u_float 00000001.0
s_float 00999999000
u_double 0000000000000002.0
s_double 0000000999999998000
u_integer 0000000005
s_integer 0000999995
u_small 00010
s_small 09990
timestamp 2008-02-10T16:52:01.000-05:00
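
For illustration, here's how the s_float value above could be produced. Judging from the sample output, floats are scaled by 1,000, given a 10^9 offset, and padded to 11 digits; the exact scale and offset are assumptions inferred from that output, so check the DBLoader source for the real values:

public class FloatEncoding {
    // Scale to a fixed precision, add an offset to lift negatives above zero,
    // then zero-pad -- the same offset-plus-padding trick used for integers.
    public static String encodeFloat(double value, int scale, long offset, int width) {
        long scaled = Math.round(value * scale) + offset;
        return String.format("%0" + width + "d", scaled);
    }

    public static double decodeFloat(String encoded, int scale, long offset) {
        return (Long.parseLong(encoded) - offset) / (double) scale;
    }

    public static void main(String[] args) {
        // s_float = -1.0 with scale 1000, offset 10^9, width 11 -> "00999999000"
        System.out.println(encodeFloat(-1.0, 1000, 1000000000L, 11));
        System.out.println(decodeFloat("00999999000", 1000, 1000000000L)); // -1.0
    }
}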

For additional information on data conversion, I recommend Aleksey's excellent article, "Query 201: Tips & Tricks for Amazon SimpleDB Query".

Conversion Tool

I've written a tool to pull data from a database and insert it into Amazon SimpleDB. It currently works with MySQL and SQL Server, though other databases can be added by implementing a class that extends BaseDataAccess. The tool creates a domain for each table and does threaded insertions into Amazon SimpleDB, using the typica library to access the service. Typica contains a DataUtils class to help in encoding/decoding integers, decimals, and dates. The DBLoader tool makes some assumptions about the ranges of numbers, so you might need to adjust values to suit your own needs. Here's some pseudocode that describes the operation of the DBLoader:

Query database for table/column details

For each table found
  Create a domain
  Select all from table
  For each row
    For each column
      Encode value
    Done
    Thread-off -> Store item/values in domain
  Done
Wait for threads to complete
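
In real Java, the threaded portion might look roughly like the following sketch built on java.util.concurrent. The encodeRow and putAttributes methods are hypothetical placeholders for the tool's actual encoding logic and typica's PutAttributes call:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LoaderSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(30); // DBLoader uses up to 30 threads

    public void loadTable(final String domain, List<Map<String, String>> rows) throws InterruptedException {
        for (Map<String, String> row : rows) {
            final Map<String, String> encoded = encodeRow(row); // pad/offset numbers, ISO 8601 dates
            pool.execute(new Runnable() {
                public void run() {
                    // One PutAttributes call per item; the encoded primary key is the item name.
                    putAttributes(domain, encoded.get("id"), encoded);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for threads to complete
    }

    private Map<String, String> encodeRow(Map<String, String> row) { return row; } // placeholder
    private void putAttributes(String domain, String item, Map<String, String> attrs) { } // placeholder
}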

The DBLoader converts data using a method in the data access class. The BaseDataAccess class implements the JDBC connection handling and statement execution code. It requires two methods to be implemented by each database-specific class that extends it:

public abstract Map<String, List<ColDetails>> getTableStructure() throws SQLException;

public abstract String convertValue(ColDetails d, ResultSet rs, int idx) throws SQLException;

The first method executes a query to get the specific table structure information (a map of tables and a list of their column details). The convertValue method extracts a value from a given index in the result set and handles the conversion into a string to be inserted into Amazon SimpleDB. The name of the data access class is read out of the properties file mentioned below. For example, to support a MySQL database, set "db.class=com.directthought.simpledb.MYSQLAccess".
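
A database-specific convertValue implementation might look something like this sketch. The getSqlType accessor on ColDetails is an assumption, and NumberEncoding/DateEncoding are the helper sketches shown earlier, not the tool's actual code:

import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Types;

public class ExampleAccess {
    public String convertValue(ColDetails d, ResultSet rs, int idx) throws SQLException {
        switch (d.getSqlType()) { // assumed accessor for the column's java.sql.Types code
            case Types.INTEGER:
                // Signed integers get the offset-plus-padding treatment; a real
                // implementation would skip the offset for unsigned columns.
                return NumberEncoding.encodeInt(rs.getLong(idx), 1000000L, 10);
            case Types.TIMESTAMP:
                // Timestamps become ISO 8601 strings.
                return DateEncoding.encodeDate(rs.getTimestamp(idx));
            default:
                // Character data passes through unchanged.
                return rs.getString(idx);
        }
    }
}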

Running it

The DBLoader isn't meant to work as a generic tool, because the range and offset values will likely be unique to your data. The intention is to get a full source download and build it yourself. You will need to edit the config/loader.properties file to supply your AWS login details as well as the database connection information. To build and run it, you can type:

ant run -Dargs="[starting table] [starting pk]"

The starting values are helpful if you find you need to restart a conversion. The results of the conversion are idempotent, but to save time, the option is there to skip some amount of data. If you'd like to exclude some tables from being moved at all, those can be listed in the excludeTables array in the DBLoader.java file.

I created a test database in MySQL that includes the three tables mentioned above. I got the following output when converting that data and putting it into Amazon SimpleDB:

$ ant run  -Dargs=
Buildfile: build.xml
init:
compile:
run:
     [java] table name=address : 1 rows
     [java] table name=miscellany : 1 rows
     [java] table name=people : 2 rows
     [java] query = select id, street, city, state, zip, timestamp from address
     [java] query = select id, s_float, u_float, s_double, u_double, s_integer,
u_integer, s_small, u_small, timestamp from miscellany
     [java] address Record : 000000000001
     [java] query = select id, name, address_id, timestamp from people
     [java] miscellany Record : 000000000001
     [java] people Record : 000000000002
     [java] people Record : 000000000001

BUILD SUCCESSFUL
Total time: 18 seconds

Using the sdbShell command, also available in typica, this is what the data in Amazon SimpleDB looks like (attributes have been re-ordered for clarity):

sdbShell> sd people
sdbShell> gi
Item : 000000000002
  id = 000000000002
  name = Jeff Rice
  address_id = 000000000001
  timestamp = 2008-02-10T16:16:33.000-05:00
Item : 000000000001
  id = 000000000001
  name = David Kavanagh
  address_id = 000000000001
  timestamp = 2008-02-10T16:16:30.000-05:00
Time : 0.531
Number of items returned : 2
sdbShell> sd address
sdbShell> gi
Item : 000000000001
  id = 000000000001
  street = 29 Goodway Drive
  city = Rochester
  state = NY
  zip = 14623
  timestamp = 2008-02-10T16:16:26.000-05:00
Time : 0.5
Number of items returned : 1
sdbShell> sd miscellany
sdbShell> gi
Item : 000000000001
  id = 000000000001
  s_float = 00999999000
  u_float = 00000001.0
  s_double = 0000000999999998000
  u_double = 0000000000000002.0
  s_integer = 0000999995
  u_integer = 0000000005
  s_small = 09990
  u_small = 00010
  timestamp = 2008-02-10T17:52:01.000-05:00
Time : 0.515
Number of items returned : 1
sdbShell>

Performance

To get the best performance out of Amazon SimpleDB, use threading in your application for any operations that can be done in parallel. The DBLoader tool runs up to 30 threads to insert attributes into the domains. Another common task is fetching query results and item attributes. Typica has methods built in to run the Query and GetAttributes calls in threads, which can greatly improve performance, as shown below. I ran a test with 1,000 items and 14 attributes per item.

threads   My Laptop (seconds/item)   EC2 m1.small (seconds/item)
1         0.3786                     0.1213
10        0.0537                     0.0151
30        0.0390                     0.0092
100       0.0356                     0.0081

The results show an improvement with more threads, though I found 30 to be a reasonable number of threads to run. Running the test from an Amazon EC2 instance gave a pretty decent performance boost, likely due to lower network latency.
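
The fetch side parallelizes the same way. Here is a rough sketch of issuing many GetAttributes calls concurrently with an ExecutorService; getAttributes is the same hypothetical client call used in the earlier sketch:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetch {
    // Fetch attributes for many items at once; 30 threads worked well in testing.
    public static List<Map<String, String>> fetchAll(final String domain, List<String> itemNames)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(30);
        List<Callable<Map<String, String>>> tasks = new ArrayList<Callable<Map<String, String>>>();
        for (final String itemName : itemNames) {
            tasks.add(new Callable<Map<String, String>>() {
                public Map<String, String> call() {
                    return getAttributes(domain, itemName); // hypothetical client call
                }
            });
        }
        // invokeAll blocks until every GetAttributes call has completed.
        List<Map<String, String>> results = new ArrayList<Map<String, String>>();
        for (Future<Map<String, String>> f : pool.invokeAll(tasks)) {
            results.add(f.get());
        }
        pool.shutdown();
        return results;
    }

    private static Map<String, String> getAttributes(String domain, String item) { return null; } // placeholder
}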

About the Author

David Kavanagh is a software consultant for Direct Thought in Upstate New York. He has been designing and developing software for 15 years. For the past year and a half, his focus has been on leveraging Amazon Web Services. He is the author of the open source typica library which provides a Java interface for an expanding set of Amazon Web Services and has been developing AWS-based applications for a variety of customers.