by Mitch Garnaat
SimpleDB? Tell Me More...
Amazon recently announced the availability of a new web service, SimpleDB. Like all of the Amazon Web Services, SimpleDB is designed to be highly available, and highly scalable. Also, like other AWS offerings it has a pay-as-you-go pricing model that is very attractive. The SimpleDB service, as the name implies, offers database capabilities. When most developers hear the word database they think relational database but there are other types of databases in the world and SimpleDB is definitely one of those other types.
The first difference you will notice in SimpleDB is that it is schemaless. In relational databases you have to define your data model up front and changing that model later can be painful. Not so, in SimpleDB. In SimpleDB you have items (think rows) which can have up to 256 attributes (think columns) where each attribute can have up to 256 values, all of which must be strings. You can add and remove items and attributes easily and there's not a schema in sight.
The second main difference you will notice is that you won't need that handy little pocket reference to SQL with SimpleDB. SimpleDB provides it's own simple query language. Rather than the SQL reference book, you might want to brush up on your set theory because SimpleDB and it's query language are very set oriented; think unions and intersections and you'll be in the right ballpark.
Let's Get Active
An object carries both data and behavior. Much of this data is persistent and needs to be stored in a database. Active Record uses the most obvious approach, putting data access logic in the domain object. This way all people know how to read and write their data to and from the database.
Wikipedia defines Active Record as:
...an approach to accessing data in a database. A database table or view is wrapped into a class, thus an object instance is tied to a single row in the table. After creation of an object, a new row is added to the table upon save. Any object loaded gets its information from the database; when an object is updated, the corresponding row in the table is also updated. The wrapper class implements accessor methods or properties for each column in the table or view.
Probably the best known implementation of the Active Record design pattern is the ActiveRecord library used in Ruby on Rails however examples of the pattern can be seen in virtually all of the modern web frameworks such as Django, TurboGears and others.
The goal of this article is to describe a simple persistence package that implements the Active Record pattern on top of SimpleDB rather than a relational database such as MySQL. This package is Python-based and builds on the basic SimpleDB interface provided by the boto library. It allows you to wrap SimpleDB persistence into Python classes and to hide most of the details of how that persistence works. This software is still very much a work in progress and will undoubtedly continue to change and evolve based on user feedback but the code described here is currently available in the boto library as the module boto.sdb.persist.
The first thing we are going to look at in our persistence layer is Properties. If we were mapping to a relational database, the Properties would be used to define the columns within a table in our relational model. The analogy to columns in SimpleDb is an attribute so in the boto.sdb.persist module we use Properties to define what SimpleDB attributes will be associated with our Python classes. Because all attribute values in SimpleDB are strings, we could just define a single Property type called StringProperty and be done. But in the real world, we encounter data types other than strings so part of the responsibility of the Property classes is to map common data types (e.g. integers, booleans, etc.) to the underlying string data type provided by SimpleDB. In mapping these data types to strings, we want to keep in mind that all comparisons done in queries in SimpleDB are done lexicographically. So, if we store two integer values, say 42 and 9 in SimpleDB and we want to be able to query later about which value is greater or less than the other, we need to do some extra work when we convert those values to strings in SimpleDB. The Property classes take care of all of that conversion for you and allow your Python classes to deal with the attributes as native data types.
Currently, the persistence layer provides the following types of properties or fields that can be included within a Python class definition:
Accepts any string as a value. The field is limited to 1024 characters by SimpleDB but can be further constrained using the maxlength parameter to the constructor. For example, this:
short = StringProperty(maxlength=10)
Would limit the size of the property short to 10 characters.
Accepts a positive integer value in the range of 0-65535. This value is converted to a string before storing in SimpleDB. To convert, a zero-padded string is constructed. For SmallPositiveInteger, the size of that string is 5 characters. So, storing a value of 42 would result in:
being stored in SimpleDB.
Accepts an integer value in the range of -32768 - +32767. This value is converted to a string before storing in SimpleDB. To convert, an offset value of 32768 is added to the value of the property. The resulting positive integer value is then zero-padded in a manner similar to that described above. So, storing a value of -42 would result in:
being stored in SimpleDB.
Identical to the SmallPositiveInteger property except the range of value is 0 - 4294967295 and the string stored in SimpleDB is padded to a length of 10 characters.
Identical to the SmallInteger property except the range of value is -21474836480 - +2147483647 and the string stored in SimpleDB is padded to a length of 10 characters.
Identical to the SmallPositiveInteger property except the range of value is 0 - 18446744073709551615 and the string stored in SimpleDB is padded to a length of 20 characters.LargeIntegerProperty
Identical to the SmallInteger property except the range of value is -9223372036854775808 - +9223372036854775807 and the string stored in SimpleDB is padded to a length of 10 characters.
Accepts a boolean True or False value. True values are stored as the string "true" while False values are stored in SimpleDB as the string "false".
Accepts a Python datetime object as a value. Values are converted to strings by converting the datetime object to it's ISO8601 representation.
This kind of property stores a reference to another Python object. The referenced Python object must be a subclass of the SDBObject defined in the persist module. It can be further constrained using the `ref_class` parameter to the constructor. The value stored in SimpleDB will be a string consisting of information about the module the object's class is contained in as well as the name of the class and name of the object. This module and class information must be loadable by Python at the time this property is referenced.
This kind of property stores a reference to an object stored in S3. The size of attribute values in SimpleDB is limited to 1024 bytes so it's obviously not going to be possible to store the equivalent of BLOB's directly in SimpleDB. Fortunately, AWS provides the S3 service which does a fantastic job of storing as much data as you want to throw at it. This property allows you to take advantage of that. The value of this property is a Boto Key object which represents one object in S3. If you already have an instance of a Key object you can just store it here. Or you can just store a string of the form bucketname/keyname and the S3KeyProperty will make the reference to that existing S3 object. We'll see some examples of how this works a bit later on.
One of the really nice features of SimpleDB is that you can store multiple values with any attribute. To do that in a relational database you would have to create some sort of mapping table but in SimpleDB it just happens. So, to support his cool feature the boto.sdb.persist offers list versions of all of the Properties described above. These list versions do the exact same value conversions and validation as the scalar Properties, they just allow multiple values to be stored. The list Properties are:
- StringListProperty - a list of StringProperty values
- SmallIntegerListProperty - a list of SmallIntegerProperty values
- SmallPostiveIntegerListProperty - a list of SmallPositiveIntegerProperty values
- IntegerListProperty - a list of IntegerProperty values
- PostiveIntegerListProperty - a list of PositiveIntegerProperty values
- LargeIntegerListProperty - a list of LargeIntegerProperty values
- LargePostiveIntegerListProperty - a list of LargePositiveIntegerProperty values
- BooleanListProperty - a list of BooleanProperty values
- ObjectListProperty - a list of ObjectProperty values
To give a flavor of what the library is able to do, I have chosen to take an existing MySQL database that contains information about my music library and attempt to convert that data and load it into SimpleDB. The original MySQL schema consists of three tables; disc, genre, and track. The original MySQL definitions of the tables are shown below.
CREATE TABLE `genre` ( `genre_pk` int(11) NOT NULL, `genre_name` varchar(32) NOT NULL, PRIMARY KEY (`genre_pk`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8_bin; CREATE TABLE `disc` ( `disc_pk` int(11) NOT NULL auto_increment, `disc_id` varchar(255) character set latin1 NOT NULL, `disc_num_tracks` int(11) default NULL, `disc_title` longtext character set latin1 NOT NULL, `disc_genre_fk` int(11) NOT NULL, `disc_year` varchar(10) character set latin1 NOT NULL, `disc_seconds` int(11) default NULL, `disc_access_cnt` int(10) unsigned NOT NULL, `disc_access_ts` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP, PRIMARY KEY (`disc_pk`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_bin; CREATE TABLE `track` ( `track_pk` int(11) NOT NULL auto_increment, `track_num` int(11) NOT NULL, `track_frames` int(11) default NULL, `track_title` longtext NOT NULL, `track_genre_fk` int(11) NOT NULL, `track_extra` varchar(128) default NULL, `track_rating` int(11) NOT NULL, `track_disc_fk` int(11) NOT NULL, `track_access_cnt` int(11) NOT NULL, `track_access_ts` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP, PRIMARY KEY (`track_pk`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8_bin;
This is a pretty simple schema but there are foreign key relationships between disc and genre, track and genre, and track and disc. To simplify this example, I'm going to ignore some of the columns like access counts, access timestamps and CD-related information like the disc_id and frames count. So, to convert this database into SimpleDB, I need to do the following:
- Dump MySQL data out to Comma-Separated Value files, one for each table
- Define the Python classes that will represent each of these entities
- Write a utility to read the CSV files and use the data to create instances of the Python objects
Because the Python classes will include persistant properties that automatically store values in SimpleDB, by instantiating instances of the classes and populating them with the data I will automatically be storing the data in SimpleDB.2
For step 1 above, I just used standard MySQL commands to dump the data from each table into a comma-separated value file. The three resulting files are:
The Python Classes
Given those brief descriptions of the types of properties that are available, let's define the Python classes we will need to capture this music database in SimpleDB. The basic process will be to define a Python class for each of the tables in our relational data model.
While we are at it, I'm going to address a couple of shortcomings in the relational module right up front: The relational model had no explict representation for Artists. The Artist and Disc Title were combined into a single field separated by the / character. I'm going to fix that in the new SimpleDB model and create an explicit Artist class.
So, let's look at the code!
from boto.sdb.persist.object import SDBObject from boto.sdb.persist.property import * class Genre(SDBObject): number = PositiveIntegerProperty() title = StringProperty() class Artist(SDBObject): name = StringProperty() class Disc(SDBObject): title = StringProperty() num_tracks = SmallPositiveIntegerProperty() year = StringProperty(max_length=4) genre = ObjectProperty(ref_class=Genre) artist = ObjectProperty(ref_class=Artist) class Track(SDBObject): number = SmallPositiveIntegerProperty() title = StringProperty() genre = ObjectProperty(ref_class=Genre) disc = ObjectProperty(ref_class=Disc) mp3_file = S3KeyProperty() flac_file = S3KeyProperty()
Okay, that's pretty short and sweet. First, let's look at the Genre class. We can see that it is a subclass of SDBObject. The fields in genre table in the relational model consists of a primary key which is really a widely accepted numeric equivalent of the genre title. For example, the ID tags written in MP3 files will reference the genre number, not the name. The other field, of course, is the name or title of the genre. So, for our Genre class we have added two properties to hold those values. We can, of course, add other things to this class. We can have regular attributes that are not persisted in SimpleDB as well as methods and other things that Python allows in a class. The only rules are that it must subclass SDBObject and that any data that should be stored in SimpleDB must be represented as a Property within the class definition.
The Artist class is next and is even easier than the Genre class. All we have is a StringProperty to hold the name of the Artist. Easy peasy.
Now let's try the Disc class. We have a few more Properties in this class but some are things we have already seen, strings, integers, etc. One interesting new property type here is the ObjectProperty. The original MySQL data model used a foreign key relationship between the Disc table and the Genre table so each row in the Disc table could reference the primary key of the Genre that it was associated with. We need a similar capability here for both Genre and Artist. That's where the ObjectProperty comes in. An ObjectProperty is a reference to another object persisted in SDB. By default, the reference can be to any valid subclass of SDBObject but you can further limit the type of objects that can be referenced using the ref_class parameter. In our example, the genre property must be an instance of the Genre class and the artist property must be an instance of the Artist class.
Finally, we have the Track class. Most of the properties are similar to ones we have already seen but this class does include two properties of type S3KeyProperty. These two properties will be used to store the two music file that can be associated with each Track in our system; a lossless FLAC format file and a lossey MP3 format file. We'll see how those properties work in the next section.
And a One, And a Two...
To kick the tires on our new persistent classes, I wrote a simple little import utility that reads the CSV files dumped out of the relational database and creates the corresponding objects. In the process, of course, all of the data gets stored in SimpleDB. The import tool is here. Having run the import script, let's try our objects out in Python. I'll intersperse the Python commands with comments to explain what's going on.
Python 2.5 (r25:51918, Sep 19 2006, 08:49:13) [GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from boto.sdb.persist import set_domain >>> set_domain('music')
SimpleDB let's me create up to 100 domains (currently) and each of those domains can hold up to 10 GB of data. And, because SimpleDB is schemaless, you can store all sorts of data in a single Domain. The set_domain command simply tells the boto.sdb.persist module which Domain we are using at the moment. The name I'm using here is the same one I passed as the -d option to the import tool.
>>> from music import *
This just imports my music classes defined above.
>>> rs = Artist.list() >>> a = rs.next() >>> a.name u'Beck' >>> a.id u'c708ef11-fabc-4ea1-a445-67a0e4399396'
The first command says Return an iterator containing all instances of the Artist class. The second command assigns the first object in that iterator to the variable a and then we print the value of the name attribute of a. Also, if we access the id attribute of the artist object we can see the UUID that is used as the item name in SimpleDB. All of the attributes of the persisted objects are stored as regular attribute name/values in SimpleDB so you can access them outside of the boto.sdb.persist module if you want to.
So, what if we wanted to find all of the Discs associated with our Artist? Simple!
>>> discs = a.get_related_objects('artist') >>> for disc in discs: ... print disc.title, disc.year ... Mutations 1998 Sea Change 2002 Guero 2005 >>>
The get_related_objects method is available for all persisted objects and allows you to find any objects that reference this object. The argument passed in is the name of the attribute that actually references the object. In our case, we called the ObjectProperty attribute artist. Similarly, we can find all of the Tracks related to one of the Discs.
>>> disc.title u'Guero' >>> disc.genre Genre
>>> disc.genre.title u'Rock' >>> for track in disc.get_related_objects('disc'): ... print track.title, track.number ... Broken Drum 8 Missing 4 Black Tambourine 5 Rental Car 12 Emergency Exit 13 Girl 3 Go It Alone 10 Earthquake Weather 6 Scarecrow 9 Farewell Ride 11 E-Pro 1 Hell Yes 7 >>>
Finally, let's assume that the FLAC and MP3 files for the Track Black Tambourine were stored in the in the files /opt/music/bt.flac and /opt/music/bt.flac, respectively. If we wanted to associate those files with the Track object, t, which represents that Track, we could do the following.
>>> t.title u'Black Tambourine' >>> import boto >>> s3 = boto.connect_s3() >>> b = s3.get_bucket('my_music') >>> k = b.new_key('bt.flac') >>> k.set_contents_from_filename('/opt/music/bt.flac') >>> t.flac_file = k
Or, we could do it this way.
>>> t.mp3_file = 'my_music/bt.mp3' >>> k = t.mp3_file >>> k.set_contents_from_filename('/opt/music/bt.mp3')Hopefully this has given you a flavor for one approach to object-level persistence with SimpleDB. The boto.sdb.persist module is still in the early stages and there are lots of other things that can be done with it such as:
- Automatically add attributes to find related objects rather than rely on the get_related_objects method.
- Full-text search on all or some Properties within objects
- Better integration with S3 and, potentially, SQS
- A richer set of Properties
- Amazon Web Services: http://aws.amazon.com
- Amazon SimpleDB: http://aws.amazon.com/sdb
- boto: http://code.google.com/p/boto/
- Python: http://www.python.org
Mitch Garnaat is a software guy living in upstate New York.