By David Kavanagh
The Amazon SimpleDB service has the ability to store data and perform queries on that data. At first blush, you might think of it as a database offered as a web service, and that is partially true. For some applications, Amazon SimpleDB can very easily handle data storage needs, especially when coupled with the Amazon S3 service for large data files. This article focuses on ways to import existing data from a relational database into Amazon SimpleDB, and on things to be aware of when undertaking this task.
I'd been working with SimpleDB during a private beta, before it was announced. While there have been some minor changes to the APIs as Amazon fine-tuned the service, it has remained a convenient and reliable way to store metadata for some applications I've worked on. To understand how data is stored in Amazon SimpleDB, here's some info from the FAQ:
You organize your structured data into domains and can run queries across all of the data stored in a particular domain. Domains are comprised of items, and items are described by attribute-value pairs. To understand these elements, consider the metaphor of data stored in a spreadsheet table. An Amazon SimpleDB domain is like a worksheet, items are like rows of data, attributes are like column headers, and values are the data entered in each of the cells.
However unlike a spreadsheet, Amazon SimpleDB allows for multiple values to be associated with each "cell" (e.g., for item "123," the attribute "color" can have both value "blue" and value "red"). Additionally, in Amazon SimpleDB, each item can have its own unique set of associated attributes (e.g., item "123" might have attributes "description" and "color" whereas item "789" has attributes "description," "color" and "material"). Amazon SimpleDB automatically indexes your data, making it easy to quickly find the information that you need. There is no need to pre-define a schema or change a schema if new data is added later.
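In code terms, an item is essentially a map from attribute names to lists of string values. Here is a minimal sketch using plain Java collections; this only illustrates the data model, it is not the SimpleDB API:

```java
import java.util.*;

// A toy model of a SimpleDB item: attribute name -> one or more string values.
public class ItemModel {
    public static Map<String, List<String>> sampleItem() {
        Map<String, List<String>> item = new HashMap<>();
        item.put("description", Arrays.asList("t-shirt"));
        // Unlike a spreadsheet cell, an attribute can hold multiple values.
        item.put("color", Arrays.asList("blue", "red"));
        return item;
    }

    public static void main(String[] args) {
        // Another item in the same domain could carry a completely
        // different set of attributes; no schema ties them together.
        System.out.println(sampleItem());
    }
}
```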
There are some limits the Amazon SimpleDB service has in place that are good to understand before you attempt to load data into it. These are:
- 10GB of storage per domain
- 100 domains per customer account
- 256 attribute values per item (this is supposed to be raised at some point)
- Attribute names and attribute values can be up to 1024 bytes in length each
Table Structure and Foreign Keys
Anyone who's used a database will know that tables contain a primary key and often foreign key relationships. A database that is heavily normalized will have many foreign key relationships and often many levels of indirection. In Amazon SimpleDB, you could store all of your tables in one domain. If you did that, you might run into the domain storage limit. Alternatively, you could use one domain per table to maximize the amount of data allowed per table you move over. If you have no more than 100 tables, this would work fine.
For a little background, database normalization is a relational database design technique that minimizes duplication by breaking data into smaller chunks and storing references to the data instead of the data itself. One example is below, where one table stores people and another stores addresses. While storing all of that information in one table is possible, it is stored more efficiently by keeping the address information in only one row of a separate table. De-normalizing a database involves putting data back together (using fewer tables, but accepting more data redundancy).
With normalized databases, joins are often used to combine data from different tables into one result set. Amazon SimpleDB has no join equivalent. To combine data from multiple tables (domains), you need to perform multiple queries (and GetAttributes operations). In many cases, it might be better to de-normalize your data, which puts more columns in some domains. This approach helps you fit more tables into fewer domains, and may reduce your table count enough to stay under the 100-domain limit. Take a database with the following tables:
people:

| id | name | address_id | timestamp |
| 1 | David Kavanagh | 1 | 2008-02-10 16:16:30 |
| 2 | Jeff Rice | 1 | 2008-02-10 16:16:33 |

address:

| id | street | city | state | zip | timestamp |
| 1 | 29 Goodway Drive | Rochester | NY | 14623 | 2008-02-10 16:16:26 |
You may want to combine both tables into one domain in Amazon SimpleDB; that way, the complete mailing address can be pulled with one GetAttributes call. Another alternative would be client-side caching of addresses, though you'd need to determine when the cache is dirty (which is complicated when a distributed application shares the same domain data). For the purposes of this article, we'll assume we're OK with the current tables, which will be loaded into two separate domains in Amazon SimpleDB. Also assume that the 'id' field for each record becomes the item identifier, which equates pretty well to a primary key. Given that, you'd need to take the following steps to extract my address from the two domains:
- Issue Query on 'people' domain : ['name' = 'David Kavanagh']
- GetAttributes on 'people' domain for item 1
- GetAttributes on 'address' domain for item 1
- Aggregate the results from both GetAttributes calls
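Those steps can be sketched in plain Java, with in-memory maps standing in for the two domains. A real client would issue Query and GetAttributes calls over the network; the helper names below are mine:

```java
import java.util.*;

// Sketch of the four steps above: query 'people' by name, fetch attributes
// from both domains, and merge them into a single record. The domains here
// are in-memory maps keyed by item name.
public class JoinByLookup {
    public static final Map<String, Map<String, String>> people = Map.of(
            "1", Map.of("name", "David Kavanagh", "address_id", "1"),
            "2", Map.of("name", "Jeff Rice", "address_id", "1"));
    public static final Map<String, Map<String, String>> address = Map.of(
            "1", Map.of("street", "29 Goodway Drive", "city", "Rochester",
                        "state", "NY", "zip", "14623"));

    // Step 1: stands in for Query on 'people' : ['name' = ...]
    public static String queryPeopleByName(String name) {
        for (Map.Entry<String, Map<String, String>> e : people.entrySet())
            if (name.equals(e.getValue().get("name"))) return e.getKey();
        return null;
    }

    // Steps 2-4: two GetAttributes lookups, then aggregate the results.
    public static Map<String, String> fetchFullRecord(String name) {
        String id = queryPeopleByName(name);
        Map<String, String> result = new HashMap<>(people.get(id));
        result.putAll(address.get(result.get("address_id")));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fetchFullRecord("David Kavanagh"));
    }
}
```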
To make a determination on just how much to de-normalize, you really have to understand your data and queries. If you have some tables that contain ancillary information about some other table data, and don't get accessed very often, it might not make sense to combine it with that other table. You would be paying for extra storage in Amazon SimpleDB, and if you don't use the data very often, you might as well pay the extra costs for issuing another GetAttributes call from time to time.
All data stored in Amazon SimpleDB is character strings, and the comparison operators in queries do lexicographical comparisons. This means numeric and date/time data must be stored specially to allow sensible comparisons in the queries you'll want to run later. For example, the number "1" compares with the number "3" in the manner you'd expect. The number "10" is larger than "3", but in a lexicographical comparison "10" sorts before "3", because the comparison stops at the first differing character and "1" comes before "3". If "3" is zero-padded to "03", it correctly sorts before "10". The question, then, is how much to pad your numeric data. If you know the upper limit for your numbers, you can pad out to that many digits.
That works just fine for natural numbers, but what happens with negative numbers? Given "-5" and "-2", the "-5" would sort higher than the "-2", so that won't work for us. The solution is to add an offset to numbers that can be negative. If the range is going to be -127 to 128, we'd add 127 to the value before storing it in Amazon SimpleDB. We'd also need to subtract this offset when pulling the data out, but comparisons in our queries would then work. Here are some examples of queries you might use for numeric comparisons:
- ['mynumber' < '000003']
- ['mynumber' > '000200' and 'mynumber' < '001000']
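The padding-and-offset scheme can be sketched as a pair of helpers. The method names here are mine; typica's DataUtils class provides similar encoders:

```java
// Zero-padding plus an offset so signed numbers compare correctly as strings.
public class NumberEncoding {
    // Shift the value above zero, then left-pad with zeros to a fixed width.
    public static String encode(long value, long offset, int width) {
        return String.format("%0" + width + "d", value + offset);
    }

    // Reverse the transformation when reading data back out.
    public static long decode(String encoded, long offset) {
        return Long.parseLong(encoded) - offset;
    }

    public static void main(String[] args) {
        // With offset 127, the range -127..128 maps onto 0..255, so the
        // encoded strings sort in the same order as the original numbers.
        System.out.println(encode(-5, 127, 6));
        System.out.println(encode(-2, 127, 6));
    }
}
```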
Since Amazon SimpleDB charges for storage (based on published formulas), one might say it's only worth it to do data conversion on those fields that are used in query comparisons. If you had to zero-pad every number, you could really increase the size of your storage! As an example, let's say you had 100,000 attribute values, an average of 5 digits long, but could be up to 20 digits. You'd zero-pad to 20 digits and your costs (just considering value storage) would look like this:
| Padded Values | Non-Padded Values |
The difference is almost 2 cents, so no big deal, right? Now consider having 100 attribute values, and 10,000,000 items with these attributes:
| Padded Values | Non-Padded Values |
The difference is $2.25, so I think I've talked myself out of zero padding only some of the data! The reason is that unless your value is quite long, the 45 bytes of overhead assessed for each value overshadows the value length. So, I'd recommend padding all of your numbers, because the cost isn't that high and it would be simpler than trying to go back and correct the padding, should you decide to use different fields in your queries.
Dates present another challenge. Some common date formats such as "Wed, 4 Jul 2007 12:08:56 -0700" or "04-19-1967 11:32:03" won't work for lexicographical comparisons. The reason is that the most significant units of time must come before the less significant units of time. One option could be to represent time as a long value of seconds since some epoch (like Jan 1st, 1970). That might work just fine, but you can also use an ISO 8601 format date and get the same results, plus the data is a bit more readable and doesn't have the limit of the epoch being the beginning of time. ISO 8601 dates look something like this: "2007-07-04T12:08:56.235-0700". A date query is pretty simple to implement with ISO 8601 format dates:
- ['mydate' < '2008-01-01T00:00:00.000-0500']
- ['mydate' > '2008-01-01' and 'mydate' < '2008-03-01']
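A sketch of the date encoding using java.text.SimpleDateFormat follows (typica's DataUtils provides an equivalent helper; this version is mine). One caveat worth noting: all values in a domain should be encoded with the same time zone, or string order won't match time order:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// ISO 8601 encoding: most-significant time units first, so lexicographic
// string order matches chronological order.
public class DateEncoding {
    public static String encode(Date d, TimeZone tz) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ");
        fmt.setTimeZone(tz);
        return fmt.format(d);
    }

    public static void main(String[] args) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        String earlier = encode(new Date(0L), utc);            // the 1970 epoch
        String later = encode(new Date(1_000_000_000L), utc);  // ~11.5 days later
        // The encoded strings compare the same way the instants do.
        System.out.println(earlier.compareTo(later) < 0);
    }
}
```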
This is an example of some datatypes and how they would be encoded. Consider this table of values from a database.
This is how they would look when properly mapped for Amazon SimpleDB. Notice that the unsigned values do not get an offset applied. The maximum must be known for both signed and unsigned numbers. For signed numbers, the minimum value is also important so that the offset can be applied to bring the lowest value above zero.
For additional information on data conversion, I recommend Aleksey's excellent article, Query 201: Tips & Tricks for Amazon SimpleDB Query.
I've written a tool to pull data from a database and insert it into Amazon SimpleDB. This tool currently works with MySQL and SQLServer, though other databases can be added by implementing a class that extends BaseDataAccess. This tool creates a domain for each table and does threaded insertions into SimpleDB. This tool uses the typica library to access Amazon SimpleDB. Typica contains a DataUtils class to help in encoding/decoding integers, decimals and dates. The DBLoader tool makes some assumptions about ranges of numbers, so you might need to adjust values to suit your own needs. Here's some pseudo code that describes the operation of the DBLoader.
```
Query database for table/column details
For each table found
    Create a domain
    Select all from table
    For each row
        For each column
            Encode value
        Done
        Thread-off -> Store item/values in domain
    Done
Wait for threads to complete
```
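That loop can be sketched with a thread pool standing in for the "Thread-off" step. The domain below is an in-memory map, and the encoding step is a stub; the real tool encodes each column and issues the store calls through typica:

```java
import java.util.*;
import java.util.concurrent.*;

// Skeleton of the DBLoader loop: encode each row's columns, then hand the
// item off to a worker thread to store, and wait for all threads at the end.
public class LoaderSketch {
    public static final Map<String, Map<String, String>> domain = new ConcurrentHashMap<>();

    static Map<String, String> encodeRow(Map<String, String> row) {
        Map<String, String> encoded = new HashMap<>();
        for (Map.Entry<String, String> col : row.entrySet())
            encoded.put(col.getKey(), col.getValue()); // real code pads/offsets here
        return encoded;
    }

    public static void load(List<Map<String, String>> rows) {
        ExecutorService pool = Executors.newFixedThreadPool(30); // DBLoader runs up to 30
        int id = 0;
        for (Map<String, String> row : rows) {
            final String itemName = String.format("%012d", ++id); // zero-padded item name
            final Map<String, String> encoded = encodeRow(row);
            pool.submit(() -> domain.put(itemName, encoded));     // thread-off the store
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);           // wait for completion
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        load(List.of(Map.of("name", "David Kavanagh"), Map.of("name", "Jeff Rice")));
        System.out.println(domain.size());
    }
}
```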
The DBLoader converts data using a method in the data access class. The BaseDataAccess class implements JDBC connection handling and statement-execution code. It requires two methods to be implemented by each database-specific class that extends it:
```
public abstract Map<String, List<ColDetails>> getTableStructure() throws SQLException;
public abstract String convertValue(ColDetails d, ResultSet rs, int idx) throws SQLException;
```

The first method executes a query to get the specific table structure information (a map of tables and a list of their column details). The convertValue method extracts a value from a given index in the result set and handles the conversion into a string to be inserted into Amazon SimpleDB. The name of the data access class is read out of the properties file mentioned below. For example, to support a MySQL database, set "db.class=com.directthought.simpledb.MYSQLAccess".
The DBLoader isn't meant to work as a generic tool, because the range and offset values will likely be unique to your data. The intention is that you get a full source download and build it yourself. You will need to edit the config/loader.properties file to supply your AWS login details as well as the database connection information. To build and run it, type:
```
ant run -Dargs="[starting table] [starting pk]"
```

The starting values are helpful if you find you need to restart a conversion. The results of the conversion are idempotent, but to save time, the option is there to skip some amount of data. If you'd like to exclude some tables from being moved at all, those can be listed in the excludeTables array in the DBLoader.java file.
I created a test database in MySQL that includes the three tables mentioned above. I got the following output when converting that data and putting it into Amazon SimpleDB:
```
$ ant run -Dargs=
Buildfile: build.xml
init:
compile:
run:
     [java] table name=address : 1 rows
     [java] table name=miscellany : 1 rows
     [java] table name=people : 2 rows
     [java] query = select id, street, city, state, zip, timestamp from address
     [java] query = select id, s_float, u_float, s_double, u_double, s_integer, u_integer, s_small, u_small, timestamp from miscellany
     [java] address Record : 000000000001
     [java] query = select id, name, address_id, timestamp from people
     [java] miscellany Record : 000000000001
     [java] people Record : 000000000002
     [java] people Record : 000000000001
BUILD SUCCESSFUL
Total time: 18 seconds
```

Using the sdbShell command also available in typica, this is what the data in Amazon SimpleDB looks like (attributes have been re-ordered for clarity):
```
sdbShell> sd people
sdbShell> gi
Item : 000000000002
  id = 000000000002
  name = Jeff Rice
  address_id = 000000000001
  timestamp = 2008-02-10T16:16:33.000-05:00
Item : 000000000001
  id = 000000000001
  name = David Kavanagh
  address_id = 000000000001
  timestamp = 2008-02-10T16:16:30.000-05:00
Time : 0.531
Number of items returned : 2
sdbShell> sd address
sdbShell> gi
Item : 000000000001
  id = 000000000001
  street = 29 Goodway Drive
  city = Rochester
  state = NY
  zip = 14623
  timestamp = 2008-02-10T16:16:26.000-05:00
Time : 0.5
Number of items returned : 1
sdbShell> sd miscellany
sdbShell> gi
Item : 000000000001
  id = 000000000001
  s_float = 00999999000
  u_float = 00000001.0
  s_double = 0000000999999998000
  u_double = 0000000000000002.0
  s_integer = 0000999995
  u_integer = 0000000005
  s_small = 09990
  u_small = 00010
  timestamp = 2008-02-10T17:52:01.000-05:00
Time : 0.515
Number of items returned : 1
sdbShell>
```
To get the best performance out of Amazon SimpleDB, the best thing you can do in your application is to use threading for any operations that can be done in parallel. The DBLoader tool runs up to 30 threads to insert attributes into the domains. Another common task is fetching query results and item attributes. Typica has methods built in to run the Query and GetAttributes calls in threads, which can greatly improve performance, as shown below. I ran a test with 1,000 items and 14 attributes per item.
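The fan-out pattern looks like this in plain Java, with ExecutorService.invokeAll running the per-item fetches concurrently. The fetch here just reads a local map; the payoff comes when each fetch is a real network round trip:

```java
import java.util.*;
import java.util.concurrent.*;

// Fetch many items' attribute values in parallel, the same pattern typica's
// threaded Query/GetAttributes helpers use internally.
public class ParallelFetch {
    public static List<String> fetchAll(Map<String, String> store,
                                        List<String> itemNames, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String name : itemNames)
                tasks.add(() -> store.get(name)); // one "GetAttributes" per task
            List<String> values = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(tasks)) // results keep task order
                values.add(f.get());
            return values;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, String> store = Map.of("1", "alpha", "2", "beta");
        System.out.println(fetchAll(store, List.of("1", "2"), 30));
    }
}
```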
| threads | My Laptop | EC2 m1.small |
The results show an improvement with more threads, though I found 30 to be a reasonable number of threads to run. Running the test from an Amazon EC2 instance gave a pretty decent performance boost, likely due to lower network latency.
- Amazon Web Services: http://aws.amazon.com
- Amazon SimpleDB: http://aws.amazon.com/simpledb
- typica: http://code.google.com/p/typica/ (also listed in the Resource Center; see below)
- DBLoader: http://code.google.com/p/simpledbloader/
David Kavanagh is a software consultant for Direct Thought in Upstate New York. He has been designing and developing software for 15 years. For the past year and a half, his focus has been on leveraging Amazon Web Services. He is the author of the open source typica library which provides a Java interface for an expanding set of Amazon Web Services and has been developing AWS-based applications for a variety of customers.