Getting Started with AWS and Python
New to AWS? This article walks through three popular Amazon services (Amazon S3, Amazon SQS, and Amazon EC2) using the boto library for Python.
Submitted By: Craig@AWS
AWS Products Used: Amazon SQS, Amazon EC2, Amazon S3
Language(s): Python
Created On: July 26, 2010
By Patrick Altman, AWS Community Developer
Bar none, boto is the best way to interface with Amazon Web Services (AWS) when using Python. After all, it has been around for years, has grown up alongside AWS, and is still actively maintained.
The modules in the boto package track with the services that Amazon offers, so it is a fairly intuitive package to learn. Documentation for the project is online at https://boto.cloudhackers.com and is rebuilt with every commit to the project. The mailing list, https://groups.google.com/group/boto-users, is active enough to get help on questions but not so active as to overwhelm your inbox.
This article walks you through three popular Amazon services: Amazon Simple Storage Service (Amazon S3), Amazon Simple Queue Service (Amazon SQS), and Amazon Elastic Compute Cloud (Amazon EC2). By the end of this article, you should be comfortable with:
- Booting up nodes using prebuilt images on Amazon EC2.
- Bootstrapping Amazon EC2 nodes with software you want running when your nodes boot up.
- Uploading, copying, downloading, and deleting files using Amazon S3.
- Using Amazon SQS to queue up work to run asynchronously on distributed Amazon EC2 nodes.
So, let's get started by first walking through the installation and configuration of the boto library. It's quick, I promise.
Installation and Configuration
You can find the project page for boto on Google Code at https://boto.googlecode.com. There, you can log bug reports and feature requests as well as download release distributions. Releases are also published to PyPI, so that they can easily be installed and upgraded using tools like pip.
To install boto, use the following command:
pip install -U boto
Or, if you download the tarball package, you can call the setup.py script directly:
wget https://boto.googlecode.com/files/boto-1.9b.tar.gz
tar xzf boto-1.9b.tar.gz
cd boto-1.9b
python setup.py install
Now that you have the library installed, set up the optional configuration to make working with the boto framework easier. This step is optional because you can instead pass your AWS access key ID and secret access key strings explicitly as parameters on every connection call, but that quickly becomes a hassle and leaves this sensitive data scattered about your code.
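If you do go the explicit route, the call looks something like the following sketch (the key strings shown are placeholders for your own credentials):
import boto

# Passing credentials explicitly instead of relying on a config file;
# the values here are placeholders, not real keys.
s3 = boto.connect_s3(
    aws_access_key_id='YOUR-ACCESS-KEY-ID',
    aws_secret_access_key='YOUR-SECRET-ACCESS-KEY')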
A boto configuration file takes care of this; these are the only two parameters you need to be concerned with, and the file should look like this:
[Credentials]
aws_access_key_id = {ACCESS KEY ID}
aws_secret_access_key = {SECRET ACCESS KEY}
You will find these values after logging in to your AWS account and clicking Account. Click the Security Credentials link, and then scroll down the page a bit to find your access keys.
You can place this file either at /etc/boto.cfg for system-wide use or in the home directory of the user executing the commands as ~/.boto.
That's it: You are ready to start using the boto library to drive AWS!
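As a quick sanity check, you can open a connection and list your buckets; if the credentials are being picked up correctly, this minimal sketch runs without raising an error:
import boto

# If the config file (or explicit credentials) is set up correctly,
# this lists every bucket in your account.
for b in boto.connect_s3().get_all_buckets():
    print b.name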
Working with Amazon S3
Amazon S3 is one of the most foundational and basic of all the Amazon services. You can store, retrieve, delete, and set permissions and metadata on objects—Amazon S3 parlance for what you would think of as files. You can version your objects as well as prevent deletions using multi-factor authentication. This article focuses on the basics of storing and retrieving files.
Let's jump right in with some code using the boto library. Start by creating a bucket in which to store files (the name of the bucket must be globally unique, or else you'll get an error upon creation):
import boto

s3 = boto.connect_s3()
bucket = s3.create_bucket('media.yourdomain.com')  # bucket names must be unique
key = bucket.new_key('examples/first_file.csv')
key.set_contents_from_filename('/home/patrick/first_file.csv')
key.set_acl('public-read')
After creating the bucket, create a key in which to store the data. The key name can be anything you decide; however, I like to use a forward slash (/) and organize my keys similar to how I create folders in a file system. Calling the new_key method returns a new key object, but it hasn't done anything on Amazon S3 yet. That's where the next call, set_contents_from_filename, comes in. This call opens a file handle to the specified local file and, in buffered chunks, reads the bytes from the file and writes them into the key object on Amazon S3. Finally, you set the access control list (ACL) on the file so that it can be read publicly without any credentials.
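Although Amazon S3 has no real folders, organizing key names this way means you can list everything under a given prefix later. A minimal sketch, assuming the bucket and key from the previous example:
import boto

s3 = boto.connect_s3()
bucket = s3.get_bucket('media.yourdomain.com')

# List every key whose name starts with the 'examples/' prefix.
for key in bucket.list(prefix='examples/'):
    print key.name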
Now, say that you want to download that file to another machine. You could simply use the following command:
import boto

s3 = boto.connect_s3()
key = s3.get_bucket('media.yourdomain.com').get_key('examples/first_file.csv')
key.get_contents_to_filename('/myfile.csv')
As you can see, this code is just the reverse of what you did to upload the file in the first place. But what if you wanted to move the file to a new bucket and prefix name? Simply copy and delete:
import boto

s3 = boto.connect_s3()
key = s3.get_bucket('media.yourdomain.com').get_key('examples/first_file.csv')
new_key = key.copy('media2.yourdomain.com', 'sample/file.csv')
if new_key.exists():
    key.delete()
Doing so copies the file to the new bucket called media2.yourdomain.com and gives it the key name sample/file.csv. Then, before deleting the original, you verify that the new key exists, which tells you that the copy finished successfully.
Working with Amazon SQS
Now that you know how to store and retrieve files in and from the cloud, let's take a look at Amazon SQS so that you can send work to the cloud, as well. Amazon SQS is a message-queuing service that acts as glue for building applications in the cloud. To fully appreciate Amazon SQS's simplicity, you have to think about it within the context of a working application.
The service allows you to write and read messages. In addition, you can get an approximate count of existing messages and delete messages from the queue. That's about it. You are constrained to 8 kilobytes per message, so I typically store my message data on Amazon S3 and pass the bucket and key of the actual message data in the message itself, like so:
import simplejson, boto, uuid

sqs = boto.connect_sqs()
q = sqs.create_queue('my_message_pump')

data = simplejson.dumps({'work': 'to do'})  # placeholder payload; serialize whatever your worker needs

s3 = boto.connect_s3()
bucket = s3.get_bucket('message_pump.yourdomain.com')
key = bucket.new_key('2010-03-20/%s.json' % str(uuid.uuid4()))
key.set_contents_from_string(data)

message = q.new_message(body=simplejson.dumps({'bucket': bucket.name, 'key': key.name}))
q.write(message)
This code stores your message on Amazon S3, and then puts a message on your named queue, where that message contains the pointer to your message data. Reading a message is just as simple:
import simplejson, boto

sqs = boto.connect_sqs()
q = sqs.get_queue('my_message_pump')

message = q.read()
if message is not None:  # read() returns None when the queue is empty; keep polling until you get a message
    msg_data = simplejson.loads(message.get_body())
    key = boto.connect_s3().get_bucket(msg_data['bucket']).get_key(msg_data['key'])
    data = simplejson.loads(key.get_contents_as_string())
    do_some_work(data)  # your imaginary worker function
    q.delete_message(message)
Here, you read the message, use the data in the message to get the full contents from Amazon S3, run your imaginary worker function using the data for the input, and delete the message from the queue. If something blew up in the do_some_work() function, the message would not have been deleted and could be processed again later.
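In a long-running worker, you would typically wrap this in a polling loop that sleeps briefly whenever the queue comes back empty. A minimal sketch of that pattern, with do_some_work still standing in for the imaginary worker function and a 5-second sleep chosen arbitrarily:
import time

import simplejson, boto

sqs = boto.connect_sqs()
s3 = boto.connect_s3()
q = sqs.get_queue('my_message_pump')

while True:
    message = q.read()
    if message is None:
        time.sleep(5)  # queue is empty; wait a bit before polling again
        continue
    msg_data = simplejson.loads(message.get_body())
    key = s3.get_bucket(msg_data['bucket']).get_key(msg_data['key'])
    data = simplejson.loads(key.get_contents_as_string())
    do_some_work(data)  # your imaginary worker function
    q.delete_message(message)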
Now that you know how to write, read, and delete a message when you're through with it, let's tie it all together into a functioning system. You can add messages all day long, but unless you have something responding to messages in the queue, what good is that going to do?
Working with Amazon EC2
Your next step is to boot up nodes on Amazon EC2 and execute a script that reads these messages and uses the data as input for processing on those cloud nodes. However, before you can boot up a node, you need to decide on an image. You can either build your own image or go with a prebuilt one. Many images are available through the community, and I recommend at least starting with one of them. If you have private customizations to preserve, you can create a new image based on them; otherwise, you can simply bootstrap what you need at boot time, use a stock image, and let someone else maintain and update the images.
Note: When a new Ubuntu version comes out and new images are released, you can simply boot with the new images without changing any (or at least much) of your code.
Images are available in many different Linux distributions as well as Windows. This article uses Ubuntu images maintained by Canonical—specifically, its 32-bit Ubuntu 9.10 Server.
To boot an image, use the following code:
import boto

ec2 = boto.connect_ec2()

key_pair = ec2.create_key_pair('ec2-sample-key')  # only needs to be done once
key_pair.save('/Users/patrick/.ssh')

reservation = ec2.run_instances(image_id='ami-bb709dd2', key_name='ec2-sample-key')

# Wait a minute or two while it boots
for r in ec2.get_all_instances():
    if r.id == reservation.id:
        break

print r.instances[0].public_dns_name
# output: ec2-184-73-24-97.compute-1.amazonaws.com

$ chmod 600 ~/.ssh/ec2-sample-key.pem
$ ssh -i ~/.ssh/ec2-sample-key.pem ubuntu@ec2-184-73-24-97.compute-1.amazonaws.com
First, you activate an Amazon EC2 connection object. If you did not have your boto.cfg file set up, you would need to pass your key ID and secret key into the connect_ec2() method as parameters. Because the image only accepts public/private key authentication, you need to get your public key onto the image. Amazon EC2 supports this by having AWS generate the key pair for you. When this is done, the name ec2-sample-key is attached to your account, so you can reference it in the future by name. You should only have to create this key pair once, as long as you don't lose the private key that you save in this example.
Next, launch the Amazon Machine Image (AMI) using the image ID found on the Canonical image information page mentioned earlier. The call returns a reservation object, which at this point has only one bit of useful information: the reservation ID. You need to requery the Amazon EC2 application programming interface (API) through a call to get_all_instances() to fetch updated information. You do this once or twice, waiting about a minute between calls, until r.instances[0].state (you only launched one instance, but you could have launched as many as you wanted with parameters to run_instances()) equals running. You then need the public Domain Name System (DNS) name so that you can use Secure Shell (SSH) to access the instance. Make sure your private key has its permission bits set to 600, or the public/private key authentication will fail.
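Rather than eyeballing the delay, you can poll until the instance reports that it is running. A minimal sketch, continuing from the reservation object created above (the 30-second sleep is an arbitrary choice):
import time

# Poll the instance until it reports the 'running' state,
# then grab its public DNS name for SSH access.
instance = reservation.instances[0]
while instance.state != 'running':
    time.sleep(30)
    instance.update()  # refresh state from the Amazon EC2 API
print instance.public_dns_name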
That's it. You are running in the cloud.
Now, let's make it more usable. When you boot this instance, you'll have it automatically install the software you want, saving you time and getting things ready to run a script that processes the queue without your having to log on:
reservation = ec2.run_instances(
    image_id='ami-bb709dd2',
    key_name='ec2-sample-key',
    user_data="""#!/bin/bash
apt-get update
apt-get install -y imagemagick
""")
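Taking that one step further, the same user_data script can install boto itself and launch your queue-processing worker, so the node starts pulling messages as soon as it boots. This is only a sketch: the worker-script URL below is a hypothetical location you would host yourself.
reservation = ec2.run_instances(
    image_id='ami-bb709dd2',
    key_name='ec2-sample-key',
    user_data="""#!/bin/bash
apt-get update
apt-get install -y imagemagick python-boto python-simplejson
# Hypothetical worker script that you host yourself; swap in your own URL.
wget -O /home/ubuntu/worker.py http://media.yourdomain.com/scripts/worker.py
python /home/ubuntu/worker.py &
""")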
Putting It All Together: A Sample Application
To demonstrate how everything fits together in a real-world application, take a look at an application I wrote to process Portable Document Format (PDF) documents into preview images for display on a Web site. You can grab the code from Github. This is a Django-based application that enables users to upload a PDF and a short time later have PNG previews of every page in that document.
The PDF is moved to Amazon S3 from the local Web server. Messages are queued up to process the uploaded PDFs; nodes are booted up in response to the queue count; and the nodes bootstrap themselves with software necessary to process the PDF. The nodes then write a result message to a queue that a background task is checking on the Web server, which updates the database with the appropriate record.
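The "booted up in response to the queue count" part can be as simple as a periodic task that compares the approximate queue depth against the number of worker nodes already running. The following is only a rough sketch of that idea, not the actual code from the sample application; the queue name, scaling ratio, and cap are made up for illustration:
import boto

sqs = boto.connect_sqs()
ec2 = boto.connect_ec2()

# Approximate number of waiting messages in a hypothetical job queue.
backlog = int(sqs.get_queue('pdf_jobs').count())

# Rough count: this tallies every instance in the account, not just workers.
running = sum(len(r.instances) for r in ec2.get_all_instances())

# Boot one extra worker for every 10 queued documents, capped at 5 nodes.
wanted = min(backlog // 10, 5)
if wanted > running:
    ec2.run_instances(image_id='ami-bb709dd2',
                      key_name='ec2-sample-key',
                      min_count=wanted - running,
                      max_count=wanted - running)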
Conclusion
By now, you should have enough to get started leveraging the power of the cloud in building your next application. By using Amazon SQS and Amazon S3 along with on-demand compute nodes from Amazon EC2, you can build applications that achieve massive scale with little to no up-front investment, thereby dramatically lowering the barriers to entry in building the next big thing. Furthermore, using boto, you can focus more of your time on your business logic rather than the fine details of interacting with the Web services.