Taming the AWS Datastore Hydra with Python

Which data store should you use in your Python application? AWS community developer Reza Lotun discusses the merits of each option available on AWS.


Submitted By: Craig@AWS
AWS Products Used: Amazon SimpleDB, Amazon RDS, Amazon EC2
Language(s): Python
Created On: July 26, 2010


By Reza Lotun, AWS Community Developer

Modern Web applications often employ a variety of datastores in their design to support different application requirements. This article takes you through a whirlwind tour of setting up an Amazon Elastic Compute Cloud (Amazon EC2) instance with a Django installation, ready to interface with three varieties of datastores—MySQL using Amazon Relational Database Service (Amazon RDS), Amazon SimpleDB, and Redis hosted on Amazon EC2 backed by Amazon Elastic Block Store (Amazon EBS). After reading this article, you should be able to incorporate any or all of these stores into your own application.

Introduction

So, you want to build a complex Web application? What does that mean, and how is this article different from any other of the thousands of articles that revolve around the time-tested Linux, Apache, MySQL, Python/Perl/PHP/etc. (LAMP) approach? Modern Web applications such as Twitter or Facebook employ "killer features"—social networks, support for huge user bases, and real-time elements like chat. Although you may not be looking to build a service operating at the massive scale these sites offer, it is often useful to design your architecture in a way that can incorporate similar concepts. The goal is to maximize flexibility by using the right datastores for the job.

The beauty of using Amazon Web Services (AWS) is that your application does not have to conform to a rigid architecture. You can take what you'd like from the following examples and adapt them to your own needs. The basic mantra you'll be adopting, though, is "why choose one?" You'll be using all three!

The following isn't meant as a polemic arguing that alternative, nonrelational datastores are the be-all and end-all. On the contrary, this article makes the point that each datastore has specific strengths and weaknesses. This article presents each datastore in an area that highlights its strengths and shows you how to leverage it in a Python setting. It also shows that AWS offers all the building blocks you need to incorporate any sort of datastore into your architecture. You'll learn about:

  • MySQL. Using Amazon RDS, you can have a managed MySQL installation in which you have full control over the machine type and storage it uses. With a few application programming interface (API) calls, you can scale resources up or down as you require. MySQL and other relational databases excel at storing highly structured data that need to be modified in a transactional way. Also, you can take advantage of the plethora of Django applications that exist already to add functionality to your application.
  • Amazon SimpleDB. Amazon SimpleDB is an example of a distributed key-value store. It is specifically designed for high scalability and is intended as a store for small pieces of indexable data—in other words, metadata. Unlike MySQL, Amazon SimpleDB is schema-less, and operations on it are eventually consistent. MySQL and Amazon SimpleDB are completely different beasts—two tools that solve very different problems. You'll be using Amazon SimpleDB to store user account information.
  • Redis. Another key-value store, Redis is similar to memcached in its ease of use and speed. Unlike memcached, it supports varying levels of persistence and advanced atomic operations on data structures such as lists, sets, and ordered sets. Some applications for Redis include a fast persistent session back end for Django, a real-time statistics tracker, and a URL shortener service. Although Redis is something you would host yourself on Amazon EC2, this article shows you how best to do it by backing its persistence store with an Amazon EBS partition.

Note: Notice that Amazon SimpleDB is "eventually consistent." What does that mean, exactly? Without getting into too much detail, it's important to understand one fact about storing data on Amazon SimpleDB: Your data are spread across many machines. The basic problem, then, is this: If you make a write to one machine, does the "knowledge" of that write propagate instantaneously to all other machines? Intuitively, you would say no—some amount of time must pass before all copies of your data on multiple machines are "consistent" in some way. Although Amazon SimpleDB is eventually consistent—that is, a read issued immediately after a write may take on the order of seconds to reflect that write—a new feature has recently been introduced that allows consistent reads. What this means is that you can issue a read whose result is guaranteed to reflect all writes that completed before the read was issued. This functionality can be useful in some situations, as you'll see later.
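
To see the difference in code, here's a minimal sketch using Boto. This assumes a Boto version recent enough to expose the consistent_read flag; the domain and item names are placeholders:

import boto

sdb = boto.connect_sdb()
dom = sdb.create_domain('demo')    # idempotent; 'demo' is a placeholder name

dom.put_attributes('greeting', {'text': 'hello'})

# an ordinary read may not reflect the write yet ...
print dom.get_attributes('greeting')
# ... whereas a consistent read is guaranteed to
print dom.get_attributes('greeting', consistent_read=True)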

Two Hats, One Head

Okay, enough philosophizing—time for some code. A lot of people who venture into cloud computing platforms such as AWS come from a mixed background of development and operations. The lines between these two areas are blurring, but it's still useful to make a distinction between them.

On the development side, you'll wear your software engineering hat and think in terms of architecture and algorithms, systems, data flow, and requirements and testing. At this level of abstraction, you'll be working mostly with your framework and your application's architecture.

On the operations side, you often think in terms of deployments, servers, and metrics. You have code that you need to put onto machines. You have tools and other systems that need to monitor those machines. There are issues of maintenance and performance measuring as well as load testing. Another requirement is thinking of failure—you have to plan for it to happen rather than be surprised when it does.

This article touches on both of these areas. As you can imagine, both influence each other in a tight feedback loop. You'll wear two hats and work with two different toolsets when dealing with the following ideas. The tools in your arsenal will be:

  • Development:
    • Django. A popular Python-based Web framework
  • Operations:
    • Boto. A Python library for the AWS APIs
    • Fabric. A Python-based systems administration tool that works over Secure Shell (SSH)

Installation and Setup

This article assumes that you have a working Python and Django environment, with easy_install or pip available. If not, please consult the relevant documentation to set up one of these two packages (preferably pip). Installation of Boto and Fabric is simple (if you're using easy_install, replace pip install with easy_install):

$ pip install boto
$ pip install fabric

Assume that your AWS access key and secret key are set as environment variables:

$ export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
$ export AWS_SECRET_ACCESS_KEY=YOUR_SECRET

Note: It's important never to give away your secret key. Keep it safe, and try not to hard-code it into any source files. Simple measures, such as environment variables or a configuration file kept outside your repository, mitigate this risk. Use them: You won't regret it.
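
For example, because you exported AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY above, calls like boto.connect_ec2() pick up your credentials automatically, so nothing sensitive appears in your source. Alternatively, Boto can read the keys from a configuration file such as ~/.boto:

# ~/.boto -- keep this file out of your code repository
[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET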

Environment

This article assumes that you're running on Amazon EC2 to take advantage of lower latency and all the free bandwidth within the cloud. Launch an Amazon EC2 instance on which to run your Django and Redis installation. You'll be using Alestic's Ubuntu 9.10 32-bit Amazon Machine Image (AMI) ami-bb709dd2, the default security group, and an existing key denoted by "key."

You collect all of your operations steps in a fabfile. A fabfile is Fabric's way of collecting a number of commands. It becomes really useful when you're executing these commands over many servers, but for now, just stick your Boto commands in there. You might find it useful to adopt the same approach in any of your Python projects on AWS—a fabfile with routines to deploy code and manage servers.

To start, save the following code into a file called fabfile.py:

import sys
import time
import boto

AMI = 'ami-bb709dd2'

def launch_ec2_instance():
    c = boto.connect_ec2()
    # get image corresponding to this AMI
    image = c.get_image(AMI)
    print 'Launching EC2 instance ...'
    # launch an instance using this image, key and security groups
    # by default this will be an m1.small instance
    res = image.run(key_name='key',
                    security_groups=['default'])
    print res.instances[0].update()
    instance = None
    while True:
        print '.',
        sys.stdout.flush()
        dns = res.instances[0].dns_name
        if dns:
            instance = res.instances[0]
            break
        time.sleep(5.0)
        res.instances[0].update()
    print 'Instance started. Public DNS: ', instance.dns_name

You've defined a Fabric command called launch_ec2_instance, which is simply a function. To run it, use the following code:

$ fab launch_ec2_instance
Launching EC2 instance...
u'pending'
.
.
.
Instance started. Public DNS: ec2-72-44-40-153.z-2.compute-1.amazonaws.com

Done.

You now have a running Amazon EC2 instance that you can access over SSH using your key. Do that with Fabric: add the following code to your fabfile.py:

from fabric.api import env, sudo

# this should be a path to your SSH key for the EC2 instance
key_path = '/home/me/keys/key'

def live():
    # DNS entry of our instance
    env.hosts = ['ec2-72-44-40-153.z-2.compute-1.amazonaws.com']
    env.user = 'ubuntu'
    env.key_filename = key_path

def setup_packages():
    sudo('apt-get -y update')
    sudo('apt-get -y dist-upgrade')
    sudo('apt-get install -y python python-django')

You've now defined two commands: one to define your "live environment" and one to execute a series of commands that update your instance to the newest software and install Django. If you had a development and staging environment, you would add corresponding functions for those, as well. To execute the setup on your live environment, use the code:

$ fab live setup_packages

Fabric runs the list of commands in sequence. The live command simply defines the list of machines over which subsequent commands run. By adding to your list of instances, you can write a task once and have it executed across many machines.
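
For example, a staging environment would be just another function that points env.hosts at a different set of machines (the host names here are placeholders):

def staging():
    # placeholder DNS names -- substitute your own staging instances
    env.hosts = ['ec2-1-2-3-4.compute-1.amazonaws.com',
                 'ec2-5-6-7-8.compute-1.amazonaws.com']
    env.user = 'ubuntu'
    env.key_filename = key_path

Running fab staging setup_packages would then run the package setup on every staging host in turn.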

Throughout the rest of this article, you'll be developing locally and deploying your code to the instance. For example, you can have a code repository with a clone on your instance, and the deployment step could simply be a Fabric command that updates the clone on the remote instance to the latest code and restarts your server.
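
As a sketch of that idea (the repository path and the restart command are assumptions; adapt them to your own layout):

from fabric.api import run

def deploy():
    # pull the latest code on the remote instance ...
    run('cd ~/mydjangoproject && git pull')
    # ... and restart the Web server
    sudo('/etc/init.d/apache2 restart')

A deployment then becomes fab live deploy.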

Now that you have your basic environment, you can set up your MySQL database.

Setting Up Amazon RDS

Although you have the option of running MySQL yourself on Amazon EC2 with Amazon EBS, Amazon RDS offers you the ability to let Amazon do all the heavy lifting and management for you. You'll use an Amazon RDS back end for a Django project. An Amazon RDS instance is like an Amazon EC2 instance, except that its sole purpose is to run MySQL.

def initialize_db():
    """ Initializes a one-time RDS instance.
    This should only be a one-time call.
    """
    rds = boto.connect_rds()
    sg = rds.create_dbsecurity_group('dbsg',
                                     'My database security group.')
    groups = rds.get_all_dbsecurity_groups()
    print 'All database security groups: ', groups
    ec2 = boto.connect_ec2()
    ec2_groups = ec2.get_all_security_groups(['default'])
    for g in ec2_groups:
        sg.authorize(ec2_group=g)

    pg = rds.create_parameter_group(name='dbparamgrp',
                                    description='My DB parameter group.')

    inst = rds.create_dbinstance(id='dbid',
                                 allocated_storage=10,
                                 instance_class='db.m1.small',
                                 master_username='data_muncher',
                                 master_password='mysecretpassword',
                                 param_group='dbparamgrp',
                                 security_groups=['dbsg'])
    print 'Launching instance...'
    time.sleep(5)


def db_status():
    rds = boto.connect_rds()
    rs = rds.get_all_dbinstances()
    if rs:
        for inst in rs:
            print 'DB instance %s, status: %s, endpoint: %s' % (
                            inst.id, inst.status, inst.endpoint)
    else:
        print 'No RDS instances.'

The first command spawns a small Amazon RDS instance with 10 GB of disk capacity from scratch. Notice that there are three steps:

  1. Create a database security group.

    Here, your security group has a name and description. Security groups are an important concept: They allow you to restrict access to your database instances to specific IP addresses, IP ranges, or Amazon EC2 security groups. You've decided to allow the default Amazon EC2 group to connect to this instance.

  2. Create a parameter group.

    This is an abstraction of the my.cnf configuration file.

  3. Create an Amazon RDS instance.

    Spawn the actual Amazon RDS instance with a given ID, resources, configuration, and user options.

Now, you can run it.

$ fab initialize_db
Launching instance...

Done.

You've begun the process of launching the instance. It takes anywhere from a few seconds to a few minutes, and you can check the progress manually by running the status command:

$ fab db_status
DB instance dbid, status: creating, endpoint: None

Done.

$ fab db_status
DB instance dbid, status: available, endpoint: (u'dbid.cuhbotbit2pb.us-east-1.rds.amazonaws.com', 3306)

Done.

Your Amazon RDS instance has started! You have the full Domain Name System (DNS) entry for it, and it's ready to accept data. Before you can do anything, however, you need to create the actual MySQL database and a database user, which you can do from the MySQL command-line client using the master account you defined earlier:

mysql -u data_muncher -h dbid.cuhbotbit2pb.us-east-1.rds.amazonaws.com -p
CREATE DATABASE dbname;
GRANT ALL ON dbname.* TO 'dbuser'@'%' IDENTIFIED BY 'mypass';

Now that your MySQL environment has been completely initialized, you can tell Django to start using it by editing your settings.py file:

DATABASE_ENGINE = 'mysql'
DATABASE_NAME = 'dbname'
DATABASE_USER = 'dbuser'
DATABASE_PASSWORD = 'mypass'
DATABASE_HOST = 'dbid.cuhbotbit2pb.us-east-1.rds.amazonaws.com'
DATABASE_PORT = ''             # Set to empty string for default.

At this point, you can initialize your Django database as usual:

$ python manage.py syncdb

Although you have code to reproduce launching your Amazon RDS instance, you can add other tools to your fabfile to aid in administration. One such tool is Amazon CloudWatch, Amazon's metrics API, which is available for several AWS services; luckily, Amazon RDS is one of them. Here's an example command that uses CloudWatch with Amazon RDS:

from datetime import datetime, timedelta

def get_cpu_stats():
    c = boto.connect_cloudwatch()
    end = datetime.now()
    start = end - timedelta(days=4)
    data = c.get_metric_statistics(60, start, end, 'CPUUtilization',
                                    'AWS/RDS', ['Average'])
    points = [(d['Average'], d['Timestamp']) for d in sorted(data, key=lambda x: x['Timestamp'])]

    print '\nCPU utilization on average for each of the past 20 minutes: '
    for p in points[-20:]:
        av, ts = p
        print '\tCPU: %.2f\t\t%s' % (av, ts)

Running this command would produce:

$ fab get_cpu_stats

CPU utilization on average for each of the past 20 minutes:
    CPU: 10.00		2010-05-25T23:05:00Z
         ...
    CPU: 13.00		2010-05-25T23:31:00Z

Done.

Next, you can integrate a completely different datastore into your Django setup, allowing you to introduce an extra dimension of scalability.

Setting Up Amazon SimpleDB

Amazon SimpleDB is a key-value datastore. It was designed to be scalable and redundant—particularly useful for applications involving heavy writes of small pieces of indexable data. This makes it particularly suited to storing metadata rather than the full data itself. For example, say that you have multiple client implementations of a product that exist on the Web, desktop, and mobile devices. To implement lightweight user accounts, it might be useful to offload authentication data and other lightweight metadata about user accounts—such as user name, password hashes, creation time, and last updated time—onto a system like Amazon SimpleDB. In this example, you'll do just that, allowing you to implement an alternative authentication back end on Django.

One thing to note about Amazon SimpleDB is its terminology and structure. At the top level, you have domains. Think of domains as namespaces with specific limits on size. Users are allowed 100 domains to name and store data as they please. Within these domains, data are stored in items. Items have an item name and attribute-to-value mappings. More than one value can be associated with an attribute. Think of Amazon SimpleDB as a sort of large distributed hash map or dictionary.
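
To make this concrete, here's a minimal sketch with Boto; the domain, item, and attribute names are all hypothetical:

import boto

conn = boto.connect_sdb()
dom = conn.create_domain('users_demo')

# an item maps attribute names to one or more values
dom.put_attributes('jsmith', {'email': 'jsmith@example.com',
                              'client': ['web', 'mobile']})
print dom.get_item('jsmith')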

One of the limits of Amazon SimpleDB is that a single domain can hold only 10 GB. Also, keep in mind that operations on different domains may be served by different machines within the Amazon SimpleDB system. You'll use a trick familiar to anyone who has used memcached: spreading data across a set of domains. If you have N domains—say, 16—you'll take a hash of the item name to get a 32-bit number, then take that number modulo 16 to pick the domain for that item.

Look at an example of how Amazon SimpleDB could work by implementing an authentication back end for Django. Although Django comes packaged with a robust model back end intended for use over a relational database like MySQL, you'll do it over Amazon SimpleDB. Why? Perhaps you have a variety of clients, such as mobile devices, desktop applications, and Web users who routinely authenticate into your system or create accounts. That's a lot of writes that you might want to parallelize as much as possible. Also, because this is highly sensitive data, you want it available and redundantly stored.

To begin, create those 16 domains. You can modify this number to suit your needs. Again, stick this initialization code in your fabfile:

NUM_USER_DOMAINS = 16

def create_user_domains():
    conn = boto.connect_sdb()
    domain_name_template = 'user_%s'
    for i in xrange(NUM_USER_DOMAINS):
        name = domain_name_template % str(i).zfill(3)
        dom = conn.create_domain(name)
        print 'Created ', dom

Domain creation is idempotent, meaning that you can safely run this command as many times as you want without affecting already-created domains. Now that you have your domains, you can implement a Django authentication back end:

""" auth.py

Custom authentication backend that delegates authentication to accounts
stored in SimpleDB.
"""

import hashlib

import boto

from django.conf import settings
from django.contrib.auth.models import User

NUM_USER_DOMAINS = 16
sdb_conn = boto.connect_sdb()

# domain_map will be a map of domain number to domain
domain_map = {}
for i in xrange(NUM_USER_DOMAINS):
    domain_name = 'user_%s' % str(i).zfill(3)
    domain_map[i] = sdb_conn.get_domain(domain_name)

def get_domain(username):
    # prepend username with formatter string
    user_hash = '!u:%s' % username
    # use Python hash, since it's fast
    bucket = hash(user_hash) % NUM_USER_DOMAINS
    return domain_map[bucket]

class SimpleDBBackend(object):
    def authenticate(self, username=None, password=None):
        if not username or not password:
            return None

        dom = get_domain(username)
        user_entry = dom.get_item(username)
        if user_entry is None:
            # the user doesn't exist
            return None

        # you'll probably also want to store a salt with the
        # password hash
        phash = hashlib.sha1(password).hexdigest()
        # assume we have an attribute 'pass' on the item which
        # maps to password hashes
        stored_phash = user_entry.get('pass', '')

        if phash == stored_phash:
            # valid password
            try:
                user = User.objects.get(username=username)
            except User.DoesNotExist:
                # create a new user object
                user = User(username=username, password='empty')
                user.save()
            return user
        else:
            return None

    def get_user(self, user_id):
        try:
            return User.objects.get(pk=user_id)
        except User.DoesNotExist:
            return None

Next, edit your Django settings.py file:

AUTHENTICATION_BACKENDS = ('mydjangoproject.auth.SimpleDBBackend',
                           'django.contrib.auth.backends.ModelBackend',
                           )

This configuration first uses your Amazon SimpleDB back end to authenticate users, then falls back on the built-in Django ModelBackend to deal with user permissions. Of course, you can do everything in Amazon SimpleDB as well by expanding the SimpleDBBackend: It's up to you.
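
For instance, account creation could be a small helper alongside the back end. The following is only a sketch: Production code would salt the password hash and guard against overwriting an existing user name:

def create_account(username, password):
    # NOTE: unsalted SHA-1 mirrors the authenticate() example above;
    # use a salted hash in real code
    dom = get_domain(username)
    item = dom.new_item(username)
    item['pass'] = hashlib.sha1(password).hexdigest()
    item.save()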

Setting Up Redis on Amazon EBS

Redis is an advanced key-value store—better thought of as a data structure server. Redis excels at use cases that involve many quick operations where some persistence is also desirable. Because Redis is software you host yourself rather than a managed service, perhaps the best way to deploy it on Amazon EC2 is to put its persistence layer on an Amazon EBS volume.
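
To get a feel for those atomic operations, here's a short sketch using the redis-py client (pip install redis), assuming a Redis server is running locally:

import redis

r = redis.Redis(host='localhost', port=6379)

# atomic counters: the basis of a real-time stats tracker
r.incr('hits:/home')
r.incr('hits:/home')
print r.get('hits:/home')        # '2'

# sets support atomic operations, too
r.sadd('online_users', 'jsmith')
print r.smembers('online_users')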

Note: For detailed information on setting up and configuring Redis, see Simon Willison's Redis tutorial.

Think of Amazon EBS as a mountable disk partition. The beauty of it is that unlike local Amazon EC2 instance storage, you won't lose the data when the instance is terminated. You can also periodically take snapshots of your Amazon EBS partition and store them on Amazon Simple Storage Service (Amazon S3).
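
Taking such a snapshot is a single Boto call, so it makes a natural fabfile command. The volume ID below is a placeholder for the one printed when you create your volume:

def snapshot_ebs_volume():
    c = boto.connect_ec2()
    # 'vol-12345678' is a placeholder volume ID
    print c.create_snapshot('vol-12345678')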

So, begin by creating a 10-GB Amazon EBS volume and attaching it to /dev/sdh on your Amazon EC2 instance. You can then mount this volume (under /volume) and start saving data on it:

def create_ebs_volume():
    c = boto.connect_ec2()
    reservations = c.get_all_instances()
    # since we only have one instance, this will do
    instance = reservations[0].instances[0]
    # create the EBS volume in the same zone where our instance lives
    volume = c.create_volume(10, instance.placement)
    print volume.attach(instance.id, '/dev/sdh')

def mount_ebs_volume():
    # the first time you use the volume, format it first:
    #   sudo mkfs.ext3 /dev/sdh
    sudo('mkdir -p /volume')
    sudo('mount -t ext3 /dev/sdh /volume')

Any time your instance is terminated, the persistent side of Redis remains available for the next instance you launch. This is especially useful if your Redis instance crashes and needs to be quickly re-created, or if you need to scale up vertically to a more powerful instance type.
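
For this to work, Redis must be told to keep its dump file on the mounted volume. The relevant redis.conf directives look like this (the save thresholds are example values only):

# excerpt from redis.conf
dir /volume
# dump to disk every 60 seconds if at least 1000 keys changed
save 60 1000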

Conclusion

This article demonstrated how to work with various datastores on AWS. There are many to choose from, but each has a potential place in your application stack. Using Boto and Fabric, you saw how to save and reuse your configuration steps, replaying them in the future.

Also, this article showed how a tool like Fabric (combined with Boto) could allow you to manage all your instances on Amazon EC2. Given a Django application, you learned how you can get started using Amazon SimpleDB to implement user accounts. Even using a NoSQL-type datastore like Redis is possible on Amazon EC2, especially when used in conjunction with Amazon EBS.

About the Author

Reza Lotun lives in London and spends his time architecting and developing the back-end systems for TweetDeck as well as contributing code to Boto. You can follow him on Twitter at @rlotun.