Son of Monster Muck Mashup - Mass Video Conversion Using AWS
Mitch Garnaat boosts his massively scalable Monster Muck video conversion service by ripping out complexity and plugging in a turbo-charged logging feature, powered by Amazon SimpleDB.
Submitted By: M. Garnaat
AWS Products Used: Amazon SimpleDB, Amazon EC2, Amazon SQS, Amazon S3
Language(s): Python
Created On: April 14, 2008
by Mitch Garnaat
2008-03-31
Just about a year ago I wrote an article called Monster Muck Mashup that showed how to combine the Big Three of Amazon Web Services:
- Elastic Compute Cloud (EC2) for scalable compute resources
- Simple Storage Service (S3) for unlimited, reliable storage
- Simple Queue Service (SQS) for reliable messaging and loose coupling
Because of the level of interest the original article generated, it seemed like a good idea to update it. After all, lots of things have changed since then. New features have been added to all three of the services involved, and entirely new services such as SimpleDB have been launched. And the boto library used in the original article has also progressed significantly in the past year.
So, in this article we're going to revisit our video conversion service. We are going to address some of the shortcomings of the previous version and add some new functionality by leveraging new services and features from Amazon Web Services. Let's get started!
We Can Make It Better
Based on the feedback I have received over the past year, I would like to focus the improvements in the following areas:
- Make it easier to use. The original system required a bunch of boto utilities to make it work. We will still use some boto tools to enable some of the more advanced features, but it should be possible to start up the AMI and convert videos without anything other than the standard EC2 command line tools.
- Allow the actual ffmpeg command to be passed to the AMI upon startup rather than hardcoding it into the AMI. The ffmpeg package supports an enormous number of options and conversions so having all of that power reduced to a single conversion is frustrating and silly.
- Provide better error handling. Running the conversion as a remote service makes debugging a challenge. We can't solve all of those problems but we can certainly make things a bit easier.
- Take advantage of AWS's newest service, SimpleDB, to provide an alternative to logging status messages to another queue. We will allow the status information to be stored in SimpleDB so we can query our database later to generate status reports and track individual files and batches.
- Update the ffmpeg code on the AMI to the latest code base.
Updating FFMPEG
The last item in our list above is the easiest, so let's get that over with. The original AMI has been rebundled as ami-dc799cb5 and it contains the following ffmpeg version.
FFmpeg version SVN-r12488, Copyright (c) 2000-2008 Fabrice Bellard, et al.
  configuration: --enable-libmp3lame --enable-libfaac --enable-libvorbis --enable-libfaad --enable-libxvid --enable-libx264 --enable-liba52 --enable-liba52bin --disable-shared --enable-static --enable-gpl
  libavutil version: 49.6.0
  libavcodec version: 51.51.0
  libavformat version: 52.10.0
  libavdevice version: 52.0.0
In addition to updating the ffmpeg program, the new AMI also includes the latest version of boto, which will allow us to tackle some of the other items on our to-do list.
With Great Power Comes Great Responsibility...
The ffmpeg program provides a daunting array of options and supports a multitude of different input and output formats. In the original MMM article, we nullified all of that by basically hard-coding the command to convert our AVI format videos to MP4s that would work on our iPod. We're going to remedy that in this update and allow the full capabilities of ffmpeg to be utilized in our conversion service. So, the good news is that you will have the full power of ffmpeg at your disposal. The bad news is, you are responsible for determining the right set of command line options to achieve your desired conversions.
Probably the easiest way to determine which command line options work best for your videos is to fire up the sonofmmm AMI with an SSH keypair, log in, and actually run the command on the instance. You can use scp or sftp to transfer videos to and from the instance. Once you have a command that produces the output you need, write that command down. You will need it later!
Let's Start Simple
We are going to start with the simplest scenario. Let's assume that you have a bunch of videos sitting in a bucket in S3 (let's call it my_source_videos). Let's also assume that you want to perform the same conversion on all of those files and store the results in another bucket in S3 (let's call this one my_converted_videos). We can accomplish this task using only the EC2 command-line utilities. The only thing we need to do is put together the instructions that will tell our sonofmmm AMI what to do when it starts up. You may recall that in the original article, we passed data to the instance by stringing together name/value pairs, using a pipe ("|") character as a separator. That works okay for simple values but would be a pretty awkward way to pass full ffmpeg command lines. To address this, and other issues, the underlying boto system now uses config files to specify the options passed to new instances. That means you can edit the files with any text editor, and you can also manage them as files, i.e. they can be checked into your version control system. An example of a config file for the new sonofmmm AMI is shown below.
#
# Your AWS Credentials
# You only need to supply these in this file if you are not using
# the boto tools to start your service
#
[Credentials]
aws_access_key_id = AWS Access Key Here
aws_secret_access_key = AWS Secret Key Here

#
# Fill out this section if you want emails from the service
# when it starts and stops
#
#[Notification]
#smtp_host = your smtp host
#smtp_user = your smtp username, if necessary
#smtp_pass = your smtp password, if necessary
#smtp_from = email address for From: field
#smtp_to = email address for To: field

[Pyami]
scripts = boto.services.sonofmmm.SonOfMMM

[SonOfMMM]
ami_id = ami-dc799cb5
ffmpeg_args = -y -i %%s -f mov -r 29.97 -b 1200kb -mbd 2 -flags +4mv+trell -aic 2 -cmp 2 -subcmp 2 -ar 48000 -ab 19200 -s 320x240 -vcodec mpeg4 -acodec libfaac %%s
output_mimetype = video/quicktime
output_ext = .mov
input_bucket = my_input_bucket
output_bucket = my_output_bucket
input_queue = my_input_queue
First, let's look at the values that are being passed to the conversion service itself. These can be found in the section titled [SonOfMMM].
- ffmpeg_args - These are the actual options passed to the ffmpeg command. Any valid ffmpeg options can be specified here. Note the %%s characters in the list of args. These symbols will be substituted with the actual input filename and output filename at run time, based on the message read from the queue (see the sketch after this list).
- output_mimetype - The mime type of the file being produced as an output of the ffmpeg command.
- output_ext - The file extension used for the output file generated by ffmpeg. Note that the extension must include the period character.
- input_bucket - The S3 bucket that contains source videos to be converted. This is also used later as the place to upload videos to be converted when we are using the boto tools to manage our services. In our simple example, the AMI, upon startup, will iterate over all videos in the bucket and create an input message for each file. It will then start reading the messages from the input_queue and performing the conversion.
- input_queue - The SQS queue from which messages will be read. Each message represents one unit of work or, in our case, one video file to be converted.
- output_bucket - The S3 bucket in which the converted videos will be stored. If this bucket does not exist, the converter will attempt to create it. This can be the same bucket as the input_bucket.
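To make the %%s substitution a bit more concrete, here is a rough sketch of how a service could expand ffmpeg_args from this config file into a runnable command. This is illustrative only, not boto's actual code; the doubled %% is presumably there because the config file is read with Python's ConfigParser, whose interpolation collapses %% to a single %, leaving plain %s placeholders for the input and output filenames.

# Illustration only: expand ffmpeg_args into a full ffmpeg command line.
import os
import ConfigParser

cfg = ConfigParser.SafeConfigParser()
cfg.read(os.path.expanduser('~/sonofmmm.cfg'))

args = cfg.get('SonOfMMM', 'ffmpeg_args')      # the %%s placeholders come back as %s
ext = cfg.get('SonOfMMM', 'output_ext')

in_file = 'MVI_3632.AVI'                       # in the real service this comes from the queue message
out_file = os.path.splitext(in_file)[0] + ext  # e.g. MVI_3632.mov

command = 'ffmpeg ' + (args % (in_file, out_file))
print command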
Now, let's look at those other sections in the config file. The [Pyami] section contains a single entry, scripts. This basically tells the newly started instance the name of a Python class that should be run upon startup. In our case, that's our SonOfMMM class. The latest version of that code can be found in the boto distribution at boto/services/sonofmmm.py.
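For the curious, here is a simplified illustration of the idea behind the scripts entry. This is not boto's actual pyami startup code, just a sketch of how a dotted name like boto.services.sonofmmm.SonOfMMM can be resolved to a Python class at run time.

# Sketch: resolve a dotted class name (as found in the scripts entry) to a class.
def load_class(dotted_name):
    module_name, class_name = dotted_name.rsplit('.', 1)
    module = __import__(module_name, globals(), locals(), [class_name])
    return getattr(module, class_name)

cls = load_class('boto.services.sonofmmm.SonOfMMM')
print cls   # the real startup code instantiates the class and runs the service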
The [Credentials] section contains, obviously, your AWS credentials. When you are using the boto tools to manage your services, this section will be added automatically by the tools. That's kind of nice because it means you don't have to put those valuable credentials in yet another file on your system where they could be compromised. However, if you are starting the services using the AWS command line tools you need to supply these credentials explicitly in the config file. These are the credentials that will be used by the service to read the videos to be converted and to write the output videos, so make sure the credentials you supply here have sufficient access to the resources to perform the work.
The entire [Notification] section is optional but can be useful for helping you understand what's going on with your remote service. If you include the section and supply the appropriate SMTP values the conversion service will send email messages when an instance starts up and when it terminates.
So, the idea is to create your own version of this config file or possibly many versions of this config file. Each version of the file would represent a different kind of conversion or a different set of inputs and outputs. Then you would pass the appropriate config file to the new conversion service instance when it starts. For our purposes, let's assume that we have done that and our config file now exists on our local machine as the file ~/sonofmmm.cfg. Now, we want to start up a new instance of the SonOfMMM AMI and turn it loose on our videos. The command to do that is:
mitch$ ec2-run-instances -f ~/sonofmmm.cfg ami-bc6184d5
This will start up a new instance and pass the data in our config file to it, so that the data is available once the instance starts up. The instance will then iterate over all of the keys in input_bucket, create an SQS message for each one, and then start reading the messages from the queue and processing the videos. The results will be stored in output_bucket, and when all of the conversions have been performed the instance will write a log file to the output_bucket that gives details about the work performed by the instance and any errors encountered. Then the instance will shut itself down.
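If you are curious what that startup flow looks like in code, here is a minimal sketch written against the classic boto API (connect_s3, connect_sqs, MHMessage). It is not the actual service code, just an outline of the same steps: enumerate the input bucket, queue one message per key, then read the messages back and process them.

# Sketch of the startup flow: one SQS message per key, then a worker loop.
import boto
from boto.sqs.message import MHMessage

s3 = boto.connect_s3()
sqs = boto.connect_sqs()

bucket = s3.get_bucket('my_input_bucket')
queue = sqs.create_queue('my_input_queue')
queue.set_message_class(MHMessage)

# Queue up one unit of work per source video.
for key in bucket.list():
    m = MHMessage()
    m['Bucket'] = bucket.name
    m['InputKey'] = key.name
    queue.write(m)

# Worker loop: the real service downloads the file, runs ffmpeg and uploads
# the result to output_bucket for each message it reads.
m = queue.read()
while m:
    print 'processing %s/%s' % (m['Bucket'], m['InputKey'])
    queue.delete_message(m)
    m = queue.read()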
Same As It Ever Was
The previous example represents the simplest possible scenario and didn't require any boto tools to be installed locally. That's a new capability available in this version of the AMI that, hopefully, will prove useful to some folks. For the next scenario I would like to recreate the same example used in the original MMM article. In that example, we did the following:
- Submitted files to the conversion process by copying them to an S3 bucket using the submit_files.py command.
- Launched our conversion service(s) using the start_service.py command.
- Retrieved our results using the get_results.py command.
It all worked but it was far more complicated than it needed to be. For starters, each of those commands took a bunch of different options, and each command was run with no context of the service being called. So, each time we had to tell the command things like which bucket to use for inputs, which bucket to use for outputs, which queue to use, etc. Fortunately, that's exactly the information we have now captured in our config file. So, all of those commands have now been replaced with a single command, bs.py, which stands for boto service, of course. The basic syntax for the command is:
bs.py [options] config_file command
So, let's recreate the example from the first article using the new tools. Since the original example used a separate SQS queue to store results, we need to make a small edit to our config file to specify which queue to use.
...
[SonOfMMM]
ami_id = ami-bc6184d5
ffmpeg_args = -y -i %%s -f mov -r 29.97 -b 1200kb -mbd 2 -flags +4mv+trell -aic 2 -cmp 2 -subcmp 2 -ar 48000 -ab 19200 -s 320x240 -vcodec mpeg4 -acodec libfaac %%s
output_mimetype = video/quicktime
output_ext = .mov
input_bucket = my_input_bucket
output_bucket = my_output_bucket
input_queue = my_input_queue
output_queue = my_output_queue
The first step is to submit the videos to be converted. Let's assume the videos are sitting on your local machine in the directory ~/movies. We would use the following command to submit those files for processing.
mitch$ bs.py -p ~/movies ~/sonofmmm.cfg submit
Submitting /Users/mitch/movies/MVI_3110.AVI
...
Submitting /Users/mitch/movies/MVI_4902.AVI
A total of 50 files were submitted
Batch Identifier: 2008_4_11_13_53_37_4_102_0
The -p option tells the command where to find the video files to submit to the service. You can either point the command at a single video file or a directory containing lots of videos. The next thing on the command line is the path to the config file for the service to which we are submitting the videos. The config file specifies which bucket the files should be stored in and which queue to write the messages to, so there is no need to pass that information on the command line. The final element is the command we are performing; in our case we are submitting files, so the command name is submit. This command will upload all of the files in the specified directory to the appropriate bucket in S3 (based on info in the config file). If a file already exists in the bucket, the upload operation will be skipped. The command will also queue up a message in the appropriate SQS queue for each file that is submitted. Finally, the command prints some stats and a Batch Identifier. You should record that batch identifier somewhere; we'll need it later.
Now that we have submitted the files to be converted, we can start up our conversion service. We'll use the same bs.py tool to do that, but with a different command.
mitch$ bs.py ~/my_sonofmmm.cfg start
Starting AMI: ami-bc6184d5
Reservation r-1502fd7c contains the following instances:
    i-00dd1b69
There are some options I could have passed to the command, such as the -k option to specify an SSH keypair which would allow me to log in to the instance, or the -n option to control how many instances are started. You can get a complete list of available options by calling the command with the -h option. For our purposes here, though, the simplest form will suffice. The bs.py command takes care of adding our AWS credentials to the config file if they aren't there already, and it also serializes the config file and passes it to the instance, so all of the same data is available to the conversion service when it starts.
The final step is to retrieve the results of our conversion. We can't do that until the conversion process is complete. If you have filled in the [Notification] section of your config file, you will get email when the service starts and stops. If not, just wait until the instance terminates. Once the conversions are complete, you can retrieve the generated files using the retrieve command.
mitch$ bs.py -b 2008_4_11_13_53_37_4_102_0 -p ~/movies ~/sonofmmm.cfg retrieve
retrieving file: MVI_3330.mov to /Users/mitch/movies/MVI_3330.mov
...
50 results successfully retrieved.
  Minimum Processing Time: 2
  Maximum Processing Time: 39
  Average Processing Time: 13.607843
  Elapsed Time: 709
  Throughput: 4.315938 transactions / minute
So What Have You Done For Me Lately?
Okay, I can hear the comments now. Sure, the config files make it cleaner and easier to define the parameters of my service. And, yes, I now have complete control over the ffmpeg command. And I suppose the bs.py command is an easier way to control my services. But where's the sizzle? I mean, come on, show me something new and shiny already!
So, to address those anticipated comments, the new Son Of Monster Muck Mashup now includes, at no extra cost (aside from any AWS fees, of course!), access to a highly available, highly scalable database in which to store all of your status messages. Storing all of this information in SimpleDB means it can be persisted for as long as you want to keep it around, and you can use the SimpleDB query language to generate reports and statistics to your heart's content. To enable logging to SimpleDB, we need to change one line of our config file.
[SonOfMMM]
ami_id = ami-bc6184d5
ffmpeg_args = -y -i %%s -f mov -r 29.97 -b 1200kb -mbd 2 -flags +4mv+trell -aic 2 -cmp 2 -subcmp 2 -ar 48000 -ab 19200 -s 320x240 -vcodec mpeg4 -acodec libfaac %%s
output_mimetype = video/quicktime
output_ext = .mov
input_bucket = my_input_bucket
output_bucket = my_output_bucket
input_queue = my_input_queue
output_domain = my_output_domain
And that's it. Notice that I have removed the output_queue option from the config file. It is actually possible to specify both an output_queue and an output_domain, but it's not clear why you would want to. Once I have made that change, all of my status info will be stored in SimpleDB. When storing the data in SimpleDB, boto creates an item for each file that is processed. The name of the item is of the form:
2008-04-11T15:08:57Z/my_input_bucket/MVI_3632.AVI
The item name is composed of the following values:
- Timestamp for when the processing completed
- Name of the bucket in which the source file was stored
- Key of the source file
Each item also carries a set of attributes describing the transaction. Here is the full set of attributes stored for one of the converted files:

OutputKey = MVI_3632.mov;type=video/quicktime
OriginalLocation = /Users/mitch/movies
FileCreateDate = 2007-05-09T00:28:56Z
Bucket = my_input_bucket
Batch = 2008_4_11_13_53_37_4_102_0
Server = SonOfMMM
FileAccessedDate = 2008-04-11T13:54:28Z
Host = unknown
Instance-ID = i-00dd1b69
InputKey = MVI_3632.AVI
OutputBucket = my_output_bucket
Service-Read = 2008-04-11T15:08:41Z
FileModifiedDate = 2005-04-13T23:17:22Z
Content-Type = video/x-msvideo
Service-Write = 2008-04-11T15:08:57Z
OriginalFileName = MVI_3632.AVI
Size = 8365258
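Since each transaction is stored as a named item, you can also pull a single record back directly by name, without writing a query. Here is a quick sketch using boto's SimpleDB support; the item name is the example shown above.

# Fetch one transaction record directly by its item name.
import boto

sdb = boto.connect_sdb()
domain = sdb.get_domain('my_output_domain')

item = domain.get_item('2008-04-11T15:08:57Z/my_input_bucket/MVI_3632.AVI')
if item:
    print item['InputKey'], '->', item['OutputKey'], 'in', item['OutputBucket']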
As I mentioned before, the biggest benefit of having this data in SimpleDB is that it is all indexed and searchable. That opens up a ton of possibilities, and it's beyond the scope of this update to explore that fully, but a couple of quick examples should be enough to get you going in the right direction. So, what if I wanted to find all batches that include transactions that used the file MVI_3632.AVI as the source file? We could do that in boto like this:
>>> import boto
>>> domain = boto.lookup('sdb', 'my_output_domain')
>>> rs = domain.query("['InputKey'='MVI_3632.AVI']")
>>> for item in rs:
...     print item['Batch']
...
2008_4_9_17_47_11_2_100_0
2008_4_10_2_59_44_3_101_0
2008_4_9_22_13_26_2_100_0
2008_4_11_13_53_37_4_102_0
How about if I wanted to find out which bucket was used for output with batch 2008_4_9_22_13_26_2_100_0, how many input files there were, and the total size, in bytes, of all the input files?
>>> rs = domain.query("['Batch'='2008_4_9_22_13_26_2_100_0']")
>>> num_files = total_size = 0
>>> buckets = []
>>> for item in rs:
...     num_files += 1
...     total_size += int(item['Size'])
...     if item['OutputBucket'] not in buckets:
...         buckets.append(item['OutputBucket'])
...
>>> num_files
50
>>> total_size
271362454
>>> buckets
[u'my_output_videos']
Obviously, there are a lot of variations on this theme and I'm sure you can come up with a bunch on your own. You can also use other tools to access the data in SimpleDB since it's stored in a very language-independent way. If you are going to be processing lots of videos, SimpleDB can really help you keep track of them.
One Last Thing
We've seen that this updated version of the video conversion tools provides better control over ffmpeg, better tools to control and manage the service, some improved debugging tools and the ability to store your status messages in SimpleDB. But what if you want to customize this even further? What if you want to store some additional metadata in SimpleDB or generate multiple outputs for each input file or... well, you get the picture. The final thing I want to show in this update is a brief glimpse of how easy it is to customize the service and do your own thing. To accomplish this, you will need to do the following.
- First, make a copy of the current sonofmmm.py source file. The original file can be found in your boto distribution in boto/boto/services/sonofmmm.py. Let's call our copy mysonofmmm.py.
- Edit your new copy of the file and change the class name. Let's call our modified class MySonOfMMM. So,
class SonOfMMM(Service):

    def __init__(self, config_file=None):

becomes:

class MySonOfMMM(Service):

    def __init__(self, config_file=None):
- Copy your new Python file to a bucket in S3. Let's say our bucket is called my_scripts so after copying the file we can retrieve it as my_scripts/mysonofmmm.py
- Finally, edit your config file to reference your new script rather than the standard one in boto. To do that, you need to change this line:
[Pyami]
scripts = boto.services.sonofmmm.SonOfMMM
to this:

[Pyami]
packages = s3:my_scripts/mysonofmmm.py
scripts = mysonofmmm.MySonOfMMM
This tells boto to download your script from the specified location in S3, put it somewhere on your instance where it can be imported by Python, and then run it rather than the standard boto class. Once you have the ability to run your own Python code on the instance, the possibilities are endless, really. I'll leave the rest to your imagination.
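To give you a feel for what such a customization might look like, here is a small sketch. Rather than copying the whole file, it subclasses the stock SonOfMMM class that is already installed on the AMI and adds a second output (a thumbnail image) for each input file. It assumes the conversion work happens in a process_file(in_file_name, msg) method that returns a list of (filename, mimetype) pairs; check your copy of boto/services/sonofmmm.py for the exact method signature before relying on this.

# mysonofmmm.py -- sketch of a customized conversion service.
import os
from boto.services.sonofmmm import SonOfMMM

class MySonOfMMM(SonOfMMM):

    def process_file(self, in_file_name, msg):
        # Run the standard conversion first (assumed hook, see note above).
        outputs = SonOfMMM.process_file(self, in_file_name, msg)
        # Then generate one extra output per input: a single-frame thumbnail.
        thumb = os.path.splitext(in_file_name)[0] + '.jpg'
        os.system('ffmpeg -y -i %s -vframes 1 -s 160x120 -f image2 %s'
                  % (in_file_name, thumb))
        outputs.append((thumb, 'image/jpeg'))
        return outputs

The [Pyami] section shown above works unchanged with this version, since the class is still named MySonOfMMM and lives in mysonofmmm.py.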
Till Next Time
There are still lots more things we could do with our conversion service, and maybe one of these days we'll have another update. But for now, I hope these improvements help those who are using this code as is, and also inspire others to do something even better.
Additional Resources
- Amazon Web Services: https://aws.amazon.com
- Amazon S3: https://aws.amazon.com/s3
- Amazon SQS: https://aws.amazon.com/sqs
- Amazon SimpleDB: https://aws.amazon.com/simpledb
- Amazon EC2: https://aws.amazon.com/ec2
- boto: https://code.google.com/p/boto/
- Python: https://www.python.org
- See the sample under Related Documents to download the code from this article.
Mitch Garnaat is a software guy living in upstate New York.