s3funnel - Multi-threaded command-line tool for S3

Community Contributed Software

  • Amazon Web Services provides links to these packages as a convenience for our customers, but software not authored by an "@AWS" account has not been reviewed or screened by AWS.
  • Please review this software to ensure it meets your needs before using it.


Details

Submitted By: Andrey Petrov
AWS Products Used: Amazon S3
Language(s): Python
License: MIT License
Source Control Access: http://code.google.com/p/s3funnel/
Created On: August 28, 2008 3:37 PM GMT
Last Updated: November 3, 2008 4:17 PM GMT

s3funnel is a command-line tool for Amazon's Simple Storage Service (S3).

  • Written in Python; easy_install the package to install it as an egg.
  • Supports multithreaded operations for large volumes. Put, get, or delete many items concurrently, using a fixed-size pool of threads.
  • Built on workerpool for multithreading and boto for access to the Amazon S3 API (see the sketch below this list).
  • Unix-friendly input and output. Pipe things in, out, and all around.
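
To give a sense of how those two libraries fit together, here is a minimal sketch of the put path. This is not the actual s3funnel source; it assumes boto 2.x's S3Connection interface, workerpool's WorkerPool/map interface, and an existing bucket named "mybukkit":

# Rough illustration of the workerpool + boto pattern, not s3funnel itself.
import os
import workerpool                              # easy_install workerpool
from boto.s3.connection import S3Connection   # boto 2.x API

conn = S3Connection(os.environ["AWS_ACCESS_KEY_ID"],
                    os.environ["AWS_SECRET_ACCESS_KEY"])
bucket = conn.get_bucket("mybukkit")           # assumes the bucket already exists

def put(filename):
    # The key name mirrors the local filename, as s3funnel's PUT does.
    key = bucket.new_key(filename)
    key.set_contents_from_filename(filename)

pool = workerpool.WorkerPool(size=4)           # fixed-size thread pool (like --threads=4)
pool.map(put, ["1", "2", "3"])                 # upload many files concurrently
pool.shutdown()
pool.wait()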

Perfect for anything from quickly inspecting a bucket and throwing in a new file, to doing batch backups of millions of files. Multithreading helps speed things up by an order of magnitude -- finally you can max out that free bandwidth from your EC2 instance to S3.

This tool has been used in a production environment for several months, but it hasn't had much public exposure until now. Feedback and bug reports are welcome! Please create tickets!

Installation

(Assuming a Unix environment with Python and easy_install)

$ easy_install s3funnel

Remember to export your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, or pass in the --aws_key and --aws_secret_key arguments to each command.
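
For example (the key values below are placeholders):

$ export AWS_ACCESS_KEY_ID=<your access key>
$ export AWS_SECRET_ACCESS_KEY=<your secret key>
$ s3funnel list

or, equivalently, per invocation:

$ s3funnel list --aws_key=<your access key> --aws_secret_key=<your secret key>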

Usage

Usage: s3funnel BUCKET OPERATION [OPTIONS] [FILE]...

s3funnel is a multithreaded tool for performing operations on Amazon's S3.

Key Operations:
    DELETE Delete key from the bucket
    GET    Get key from the bucket
    PUT    Put file into the bucket (key corresponds to filename)

Bucket Operations:
    CREATE Create a new bucket
    DROP   Delete an existing bucket (must be empty)
    LIST   List keys in the bucket. If no bucket is given, buckets will be listed.


Options:
  -h, --help            show this help message and exit
  -a AWS_KEY, --aws_key=AWS_KEY
                        Overrides AWS_ACCESS_KEY_ID environment variable
  -s AWS_SECRET_KEY, --aws_secret_key=AWS_SECRET_KEY
                        Overrides AWS_SECRET_ACCESS_KEY environment variable
  -t N, --threads=N     Number of threads to use [default: 1]
  -T SECONDS, --timeout=SECONDS
                        Socket timeout time, 0 is never [default: 0]
  --start_key=KEY       (`list` only) Start key for list operation
  --acl=ACL             (`put` only) Set the ACL permission for each file
                        [default: public-read]
  -i FILE, --input=FILE
                        Read one file per line from a FILE manifest.
  -v, --verbose         Enable verbose output. Use twice to enable debug
                        output.

Examples

Note: Appending the -v flag will print useful progress information to stderr. Great for learning the tool and keeping track of progress.

$ s3funnel mybukkit create
$ s3funnel list
mybukkit
$ touch 1 2 3
$ s3funnel mybukkit put 1 2 3
$ s3funnel mybukkit list
1
2
3
$ rm 1 2 3
$ s3funnel mybukkit get 1 2 3 --threads=2
$ ls -1
1
2
3
$ s3funnel mybukkit list | s3funnel mybukkit delete
$ s3funnel mybukkit list
$ s3funnel mybukkit drop
$ s3funnel list
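
The -i/--input option from the help text reads one filename per line from a manifest, which is handy for batch jobs combined with a higher thread count. A hypothetical run (the manifest name and file pattern are only illustrative):

$ ls *.dat > manifest.txt
$ s3funnel mybukkit put -i manifest.txt --threads=10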

Comments

Really good code, hard to find on google
We have been looking for a solution to upload a large number (on the order of 10K-50K) of relatively small files. The underlying problem we are solving involves generating forecasts and optimizing price and inventory for a large number of products for retailers. Instead of storing the data in a relational database, we realized that we could split the data and generate a file for every product. The analysis we do allows us to treat each product, and hence each file, independently.

We did not have much luck loading this large number of files into S3. Breaking the files into chunks and running s3cmd (both the Ruby and Python versions) in parallel gave no improvement. We were testing a sample of about 1400 files, and it took us 7 minutes to upload all of them into a single bucket. With s3funnel, which was really easy to set up (using easy_install), we got this down to 14 seconds using 10 threads, running the job on a small EC2 instance. We are still testing it to make sure we can run it in production.

I am surprised that the more popular s3cmd tools do not have multithreaded support. Perhaps this problem is not that common and deviates from the core use case of S3. Thank you very much for developing this tool. Let me know if we can do anything to support and maintain this.
fhikatari on May 30, 2010 11:06 PM GMT
©2014, Amazon Web Services, Inc. or its affiliates. All rights reserved.