Getting the Most Out of the Amazon S3 CLI

Editor’s note: For the latest information on Amazon S3, visit the Amazon S3 website.

By Scott Ward and Michael Ruiz, Partner Solutions Architects at AWS

Amazon Simple Storage Service (Amazon S3) makes it possible to store unlimited numbers of objects, each up to 5 TB in size. Managing resources at this scale requires quality tooling. When it comes time to upload many objects, a few large objects or a mix of both, you’ll want to find the right tool for the job.

This post looks at one option that is sometimes overlooked: the AWS Command Line Interface (AWS CLI) for Amazon S3.

Some of the examples in this post take advantage of more advanced features of the Linux/UNIX command line environment and the bash shell. We included all of these steps for completeness, but won’t spend much time detailing the mechanics of the examples in order to keep the post at reasonable length.

What is Amazon S3?

Amazon S3 is global online object store and has been a core AWS service offering since 2006. Amazon S3 was designed for scale: it currently stores trillions of objects with peak load measured in millions of requests per second. The service is designed to be cost-effective—you pay only for what you use—durable, and highly available. See the Amazon S3 product page for more information about these and other features.

Data uploaded to Amazon S3 is stored as objects in containers called buckets and identified by keys. Buckets are associated with an AWS region and each bucket is identified with a globally unique name. See the S3 Getting Started guide for a typical Amazon S3 workflow.

Amazon S3 supports workloads as diverse as static website hosting, online backup, online content repositories, and big data processing, but integrating Amazon S3 into an existing on-premises or cloud environment can be challenging. While there is a rich landscape of tooling available from AWS Partners and open-source communities, a great place to start your search is the AWS CLI for Amazon S3.

WS Command Line Interface (AWS CLI)

The AWS CLI is an open source, fully supported, unified tool that provides a consistent interface for interacting with all parts of AWS, including Amazon S3, Amazon Elastic Compute Cloud (Amazon EC2), Amazon Virtual Private Cloud (Amazon VPC), and other services. General information about the AWS CLI can be found in the AWS CLI User Guide.

In this post we focus on the aws s3 command set in the AWS CLI. This command set is similar to standard network copy tools you might already be familiar with, like scp or rsync, and is used to copy, list, and delete Amazon S3 buckets and objects. This tool supports the key features required for scaled operations with Amazon S3, including multipart parallelized uploads, automatic pagination for queries that return large lists of objects, and tight integration with AWS Identity and Access Management (IAM) and Amazon S3 metadata.

The AWS CLI also provides the aws s3api command set, which exposes more of the unique features of Amazon S3 and provides access to bucket metadata, like lifecycle policies designed to migrate or delete data automatically.

There are two pieces of functionality built into the AWS CLI for Amazon S3 tool that help make large transfers (many files and large files) into Amazon S3 go as quickly as possible:

First, if the files are over a certain size, the AWS CLI automatically breaks the files into smaller parts and uploads them in parallel. This is done to improve performance and to minimize impact due to network errors. Once all the parts are uploaded, Amazon S3 assembles them into a single object. See the Multipart Upload Overview for much more data on this process, including information on managing incomplete or unfinished multipart uploads.

Second, the AWS CLI automatically uses up to 10 threads to upload files or parts to Amazon S3, which can dramatically speed up the upload.

These two pieces of functionality can support the majority of your data transfer requirements, eliminating the need to explore other tools or solutions.

For more information on installation, configuration and, usage of the AWS CLI and the s3 commands, see the following AWS documentation:

AWS S3 Data Transfer Scenarios

Let’s take a look at using the AWS CLI for Amazon S3 in the following scenarios and dive into some details of the Amazon S3 mechanisms in play, including parallel copies and multipart uploads.

Example 1: Uploading a large number of very small files to Amazon S3
Example 2: Uploading a small number of very large files to Amazon S3
Example 3: Periodically synchronizing a directory that contains a large number of small and large files that change over time
Example 4: Improving data transfer performance with the AWS CLI

Environment Setup

The source server for these examples is an Amazon EC2 m3.xlarge instance located in the US West (Oregon) region. This server is well equipped with 4 vCPUs and 15 GB RAM, and we can expect a sustained throughput of about 1 Gb/sec over the network interface to Amazon S3. This instance will be running the latest Amazon Linux AMI (Amazon Linux AMI 2015.03 (HVM).

The example data will reside in an Amazon EBS 100 GB General Purpose (SSD) volume, which is an SSD-based, network-attached block storage device attached to the instance as the root volume.

The target bucket is located in the US East (N. Virginia) region. This is the region you will specify for buckets created using default settings or when specifying us-standard as the bucket location. Buckets have no maximum size and no object-count limit.

All commands that are represented in this document are run from the bash command line. All command-line instructions will be represented by a $ as the starting point for the command.

We will be using the aws s3 command set throughout the examples. Here is an explanation for several common commands and options used in these examples:

The cp command initiates a copy operation to or from Amazon S3.
The --recursive option instructs the AWS CLI for Amazon S3 to descend into subdirectories on the source.
The --quiet option instructs the AWS CLI for Amazon S3 to print only errors rather than a line for each file copied.
The --sync option instructs the AWS CLI for Amazon S3 to initiate a copy to or from Amazon S3.
The Linux time command is used with each AWS CLI call in order to get statistics on how long the command took.
The Linux xargs command is used to invoke other commands based on standard output or output piped to it from other commands.

Example 1 – Uploading a Large Number of Small Files

In this example we are going to simulate a fairly difficult use case: moving thousands of little files distributed across many directories to Amazon S3 for backup or redistibution. The AWS CLI can perform this task with a single command, s3 cp --recursive, but we will show the entire example protocol for clarity. This example will utilize the multithread upload functionality of the aws s3 commands.

Create the 26 directories named for each letter of the alphabet, then create 2048 files containing 32K of pseudo-random content in each

$ for i in {a..z}; do
    mkdir $i
    seq -w 1 2048 | xargs -n1 -P 256 -I % dd if=/dev/urandom of=$i/% 
bs=32k count=1
done

Confirm the number of files we created for later verification:

$ find . -type f | wc -l
53248

Copy the files to Amazon S3 by using aws s3 cp, and time the result with the time command:

$ time aws s3 cp --recursive --quiet . s3://test_bucket/test_smallfiles/

real    19m59.551s
user    7m6.772s
sys     1m31.336s

The time command returns the ‘real’ or ‘wall clock’ time the aws s3 cp took to complete. Based on the real output value from the time command, the example took 20 minutes to complete the copy of all directories and the files in those directories.

Notes:

Our source is the current working directory (.) and the destination is s3://test_bucket/test_smallfiles.
The destination bucket is s3://test_bucket.
The destination prefix is test_smallfiles/. Note that this is not a directory in the usual sense, but rather a key prefix that will be prepended to the file name of each object to build the final key name.

TIP:

In many real-world scenarios, the naming convention you use for your Amazon S3 objects will have performance implications. See this blog post and this document for details about object key naming strategies that will ensure high performance as you scale to hundreds or thousands of requests per second.

We used the Linux lsof command to capture the number of open connections on port 443 while the above copy (cp) command was running:

$ lsof -i tcp:443
COMMAND   PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME

aws     22223 ec2-user    5u  IPv4 119954      0t0  TCP ip-10-0-0-37.us-west-2.com
pute.internal:48036->s3-1-w.amazonaws.com:https (ESTABLISHED)

aws     22223 ec2-user    7u  IPv4 119955      0t0  TCP ip-10-0-0-37.us-west-2.com
pute.internal:48038->s3-1-w.amazonaws.com:https (ESTABLISHED)

<SNIP>

aws     22223 ec2-user   23u  IPv4 118926      0t0  TCP ip-10-0-0-37.us-west-2.com
pute.internal:46508->s3-1-w.amazonaws.com:https (ESTABLISHED)

...10 open connections

You may be surprised to see there are 10 open connections to Amazon S3 even though we are only running a single instance of the copy command (we truncated the output for clarity, but there were ten connections established to the Amazon S3 endpoint ‘s3-1-w.amazonaws.com’). This demonstrates the native parallelism built into the AWS CLI.

Here is an example of a similar command that gives us the count of open threads directly:

$ lsof -i tcp:443 | tail -n +2 | wc -l

10

Let’s also peek at the CPU load during the copy operation:

$ mpstat -P ALL 10
Linux 3.14.35-28.38.amzn1.x86_64 (ip-10-0-0-37)     05/04/2015  
_x86_64_    (4 CPU)

<SNIP>
09:43:18 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
09:43:19 PM  all    6.33    0.00    1.27    0.00    0.00    0.00    0.51    0.00   91.90
09:43:19 PM    0   14.14    0.00    3.03    0.00    0.00    0.00    0.00    0.00   82.83
09:43:19 PM    1    6.06    0.00    2.02    0.00    0.00    0.00    0.00    0.00   91.92
09:43:19 PM    2    2.04    0.00    0.00    0.00    0.00    0.00    1.02    0.00   96.94
09:43:19 PM    3    2.02    0.00    0.00    0.00    0.00    0.00    1.01    0.00   96.97

The system is not seriously stressed given the small file sizes involved. Overall, the CPU is 91.90% idle. We don’t see any %iowait, %sys, or %user activity, so we can assume that almost all of the CPU time is spent running the AWS CLI commands and handling file metadata.

6. Finally, let’s use the aws s3 ls command to list the files we moved to Amazon S3 and get a count to confirm that the copy was successful:

$ aws s3 ls --recursive s3://test_bucket/test_smallfiles/ | wc -l
53248

This is the expected result: 53,248 files were uploaded, which matches the local count in step 2.

Summary:

Example 1 took 20 minutes to move 53,248 files at a rate of 44 files/sec (53,248 files / 1,200 seconds to upload) using 10 parallel streams.

Example 2 – Uploading a Small Number of Large Files

In this example we will create five 2-GB files and upload them to Amazon S3. While the previous example stressed operations per second (both on the local system and in operating the aws s3 upload API), this example will stress throughput. Note that while Amazon S3 could store each of these files in a single part, the AWS CLI for Amazon S3 will automatically take advantage of the S3 multipart upload feature. This feature breaks each file into a set of multiple parts and parallelizes the upload of the parts to improve performance.

Create five files filled with 2 GB of pseudo-random content:

$ seq -w 1 5 | xargs -n1 -P 5 -I % dd if=/dev/urandom of=bigfile.% b
s=1024k count=2048

Since we are writing 10 GB to disk, this command will take some time to run.

List the files to verify size and number:

$ du -sk .
10485804

$ find . -type f | wc -l
5

This is showing that we have 10 GB (10,485,804 KB) of data in 5 files, which matches our goal of creating five files of 2 GB each.

Copy the files to Amazon S3:

$ time aws s3 cp --recursive --quiet . s3://test_bucket/test_bigfiles/

real    1m48.286s
user    1m7.692s
sys     0m26.860s

Notes:

Our source prefix is the current working directory (.) and the destination is s3://test_bucket/test_bigfiles.
The destination bucket is s3://test_bucket.
The destination prefix is test_bigfiles/. Note that this is not a directory in the usual sense, but rather a key prefix that will be prepended to the file name of each object to build the final key name.

We again capture the number of open connections on port 443 while the copy command is running to demonstrate the parallelism built into the AWS CLI for Amazon S3:

$ lsof -i tcp:443 | tail -n +2 | wc -l
10

Looks like we still have 10 connections open. Even though we only have 5 files, we are breaking each file into multiple parts and uploading them in 10 individual streams.

Capture the CPU load:

$ mpstat -P ALL 10
Linux 3.14.35-28.38.amzn1.x86_64 (ip-10-0-0-37)     05/04/2015  _
x86_64_    (4 CPU)

<SNIP>
10:35:47 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
10:35:57 PM  all    6.30    0.00    3.57   76.51    0.00    0.17    0.75    0.00   12.69
10:35:57 PM    0    8.15    0.00    4.37   75.21    0.00    0.71    1.65    0.00    9.92
10:35:57 PM    1    5.14    0.00    3.20   75.89    0.00    0.00    0.46    0.00   15.31
10:35:57 PM    2    4.56    0.00    2.85   75.17    0.00    0.00    0.46    0.00   16.97
10:35:57 PM    3    7.53    0.00    3.99   79.36    0.00    0.00    0.57    0.00    8.55

This is a much more serious piece of work for our instance: We see around 70-80% iowait (where the CPU is sitting idle, waiting for disk I/O) on every core. This hints that we are reaching the limits of our I/O subsystem, but also demonstrates a point to consider: The AWS CLI for Amazon S3, by default and working with large files, is a powerful tool that can really stress a moderately powered system.

6. Check our count of the number of files moved to Amazon S3 to confirm that the copy was successful:

$ aws s3 ls --recursive s3://test_bucket/test_bigfiles/ | wc -l
5

7. Finally, let’s use the aws s3api command to examine the object head metadata on one of the files we uploaded.

$ aws s3api head-object --bucket test_bucket --key test_bigfiles/bigfile
.1
bytes   2147483648      binary/octet-stream     "9d071264694b3a028a22f20
ecb1ec851-256"    Thu, 07 May 2015 01:54:19 GMT

The 4th field in the command output is the ETag (opaque identifier), which contains an optional ‘-’ if the object was uploaded with multiple parts. In this case we see that the ETag ends with ‘-256’ indicating that the s3 cp command split the upload into 256 parts. Since all the parts but the last are of the same size, a little math tells us that each part is 8 MB in size.

The AWS CLI for Amazon S3 is built to optimize upload and download operations while respecting Amazon S3 part sizing rules. The Amazon S3 minimum part size (5 MB, except for the last part which can be smaller), the maximum part size (5 GB), and the maximum number of parts (10,000) are described in theS3 Quick Facts documentation.

Summary:

In example 2, we moved five 2-GB files to Amazon S3 in 10 parallel streams. The operation took 1 minute and 48 seconds. This represents an aggregate data rate of ~758 Mb/s (85,899,706,368 bytes in 108 seconds) – about 80% of the maximum bandwidth available on our host.

Example 3 – Periodically Synchronizing a Directory

In this example, we will keep the contents of a local directory synchronized with an Amazon S3 bucket using the aws s3 sync command. The rules aws s3 sync will follow when deciding when to copy a file are as follows: “A local file will require uploading if the size of the local file is different than the size of the s3 object, the last modified time of the local file is newer than the last modified time of the s3 object, or the local file does not exist under the specified bucket and prefix.” See the command reference for more information about these rules and additional arguments available to modify these behaviors.

This example will use multipart upload and parallel upload threads.

Let’s make our example files a bit more complicated and use a mix of file sizes (warning: inelegant hackery imminent):

>
$ i=1;
while [[ $i -le 132000 ]]; do
    num=$((8192*4/$i))
    [[ $num -ge 1 ]] || num=1
    mkdir randfiles/$i
    seq -w 1 $num | xargs -n1 -P 256 -I % dd if=/dev/urandom of=r
andfiles/$i/file_$i.% bs=16k count=$i;
    i=$(($i*2))
done

Check our work by getting file sizes and file counts:

$ du -sh randfiles/
12G     randfiles/
$ find ./randfiles/ -type f | wc -l
65537

So we have 65537 files totaling 12 GB in size, to sync.

Upload to Amazon S3 using the aws s3 sync command:

$ time aws s3 sync --quiet . s3://test_bucket/test_randfiles/ real    26m41.194s user    10m7.688s sys     2m17.592s

Notes:

Our source prefix is the current working directory (.) and the destination is s3://test_bucket/test_randfiles/.
The destination bucket is s3://test_bucket.
The destination prefix is test_randfiles/. Note that this is not a directory in the usual sense, but rather a key prefix that will be prepended to the file name of each object to build the final key name.

We again capture the number of open connections while the sync command is running to demonstrate the parallelism built into the AWS CLI for Amazon S3:

$ lsof -i tcp:443 | tail -n +2 | wc -l
10

Let’s check the CPU load. We are only showing one sample interval, but the load will vary much more than the other runs as the AWS CLI for Amazon S3 deals with various files of varying file sizes:

$ mpstat -P ALL 10
Linux 3.14.35-28.38.amzn1.x86_64 (ip-10-0-0-37)     05/07/2015  _
x86_64_    (4 CPU)

03:08:50 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
03:09:00 AM  all    6.23    0.00    1.70    1.93    0.00    0.08    0.31    0.00   89.75
03:09:00 AM    0   14.62    0.00    3.12    2.62    0.00    0.30    0.30    0.00   79.03
03:09:00 AM    1    3.15    0.00    1.22    0.41    0.00    0.00    0.31    0.00   94.91
03:09:00 AM    2    3.06    0.00    1.02    0.31    0.00    0.00    0.20    0.00   95.41
03:09:00 AM    3    4.00    0.00    1.54    4.41    0.00    0.00    0.31    0.00   89.74

Let’s run a quick count to verify that the synchronization is complete:

$ aws s3 ls --recursive s3://test_bucket/test_randfiles/  | wc -l
65537

Looks like all the files have been copied!

Now we’ll make some changes to our source directory:

With this command we are touching eight existing files to update the modification time (mtime) and creating a directory containing five new files.

$ touch 4096/*
$ mkdir 5_more
$ seq -w 1 5 | xargs -n1 -P 5 -I % dd if=/dev/urandom of=5_more/5
_more% bs=1024k count=5

$ find . –type f -mmin -10
.
./4096/file_4096.8
./4096/file_4096.5
./4096/file_4096.3
./4096/file_4096.6
./4096/file_4096.4
./4096/file_4096.1
./4096/file_4096.7
./4096/file_4096.2
./5_more/5_more1
./5_more/5_more4
./5_more/5_more2
./5_more/5_more3
./5_more/5_more5

Rerun the sync command. This will compare the source and destination files and upload any changed files to Amazon S3:

$ time aws s3 sync . s3://test_bucket/test_randfiles/
upload: 4096/file_4096.1 to s3://test_bucket/test_randfiles/4096/file_4096.1
upload: 4096/file_4096.2 to s3://test_bucket/test_randfiles/4096/file_4096.2
upload: 4096/file_4096.3 to s3://test_bucket/test_randfiles/4096/file_4096.3
upload: 4096/file_4096.4 to s3://test_bucket/test_randfiles/4096/file_4096.4
upload: 4096/file_4096.5 to s3://test_bucket/test_randfiles/4096/file_4096.5
upload: 4096/file_4096.6 to s3://test_bucket/test_randfiles/4096/file_4096.6
upload: 4096/file_4096.7 to s3://test_bucket/test_randfiles/4096/file_4096.7
upload: 5_more/5_more3 to s3://test_bucket/test_randfiles/5_more/5_more3
upload: 5_more/5_more5 to s3://test_bucket/test_randfiles/5_more/5_more5
upload: 5_more/5_more4 to s3://test_bucket/test_randfiles/5_more/5_more4
upload: 5_more/5_more2 to s3://test_bucket/test_randfiles/5_more/5_more2
upload: 5_more/5_more1 to s3://test_bucket/test_randfiles/5_more/5_more1
upload: 4096/file_4096.8 to s3://test_bucket/test_randfiles/4096/file_409
6.8

real    1m3.449s
user    0m31.156s
sys     0m3.620s

Notice that only the touched and new files were transferred to Amazon S3.

Summary:

This example shows the result of running the sync command to keep local and remote Amazon S3 locations synchronized over time. Synchronizing can be much faster than creating a new copy of the data in many cases.

Example 4 – Maximizing Throughput

When you’re transferring data to Amazon S3, you might want to do more or go faster than we’ve shown in the three previous examples. However, there’s no need to look for another tool—there is a lot more you can do with the AWS CLI to achieve maximum data transfer rates. In our final example, we will demonstrate running multiple commands in parallel to maximize throughput.

In the first example we uploaded a large number of small files and achieved a rate of 44 files/sec. Let’s see if we can do better. What we are going to do is string together a few additional Linux commands to help influence how the aws s3 cp command runs.

Launch 26 copies of the aws s3 cp command, one per directory:

$ time ( find smallfiles -mindepth 1 -maxdepth 1 -type d -print0 | xargs -n1 -0 -P30 -I {} aws s3 cp --recursive --quiet {}/ s3://test_bucket/{}/ )
real    2m27.878s
user    8m58.352s
sys     0m44.572s

Note how much faster this completed compared with our original example which took 20 minutes to run.

Notes:

The find part of the above command passes a null-terminated list of subdirectories to the ‘smallfiles’ directory to xargs.
xargs launches up to 30 parallel (‘-P30’) invocations of aws s3 cp. Only 26 are actually launched based on the output of the find.
xargs replaces the ‘{}’ argument in the aws s3 cp command with the file name passed from the output of the find command.
The destination here is s3://test_bucket/smallfiles/, which is slightly different from example 1.

Note the number of open connections

$ lsof -i tcp:443 | tail -n +2 | wc -l
260

We see 10 connections for each of the 26 invocations of the s3 cp command.

Let’s check system load:

$ mpstat -P ALL 10
Linux 3.14.35-28.38.amzn1.x86_64 (ip-10-0-0-37)     05/07/2015  _
x86_64_    (4 CPU)

07:02:49 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
07:02:59 PM  all   91.18    0.00    5.67    0.00    0.00    1.85    0.00    0.00    1.30
07:02:59 PM    0   85.30    0.00    6.50    0.00    0.00    7.30    0.00    0.00    0.90
07:02:59 PM    1   92.61    0.00    5.79    0.00    0.00    0.00    0.00    0.00    1.60
07:02:59 PM    2   93.60    0.00    5.10    0.00    0.00    0.00    0.00    0.00    1.30
07:02:59 PM    3   93.49    0.00    5.21    0.00    0.00    0.00    0.00    0.00    1.30

The server is finally doing some useful work! Since almost all the time is spent in %user with very little %idle or %iowait, we know that the CPU is working hard on application logic without much constraint from the storage or network subsystems. It’s likely that moving to a larger host with more CPU power would speed this process up even more.

Verify the file count:

$ aws s3 ls --recursive s3://test_bucket/smallfiles
53248

Summary:

Using 26 invocations of the command improved the execution time by a factor of 8: 2 minutes 27 seconds for 53,248 files vs. the original run time of 20 minutes. The file upload rate improved from 44 files/sec to 362 files/sec.

The application of similar logic to further parallelize our large file scenario in example 2 would easily saturate the network bandwidth on the host. Be careful when executing these examples! A well-connected host can easily overwhelm the Internet links at your source site!

Conclusion

In this post we demonstrated the use of the AWS CLI for common Amazon S3 workflows. We saw that the AWS CLI for Amazon S3 scaled to 10 parallel streams and enabled multipart uploads automatically. We also demonstrated how to accelerate the tasks with further parallelization by using common Linux CLI tools and techniques.

When using the AWS CLI for Amazon S3 to upload files to Amazon S3 from a single instance, your limiting factors are generally going to be end-to-end bandwidth to the AWS S3 endpoint for large file transfers and host CPU when sending many small files. Depending on your particular environment, your results might be different from our example results. As demonstrated in example 4, there may be an opportunity to go faster if you have the resources to support it. AWS also provides a variety of Amazon EC2 instance types, some of which might provide better results than the m3.xlarge instance type we used in our examples. Finally, networking bandwidth to the public Amazon S3 endpoint is a key consideration for overall performance.

We hope that this post helps illustrate how powerful the AWS CLI can be when working with Amazon S3, but this is just a small part of the story: the AWS CLI can launch Amazon EC2 instances, create new Amazon VPC’s and enable many of the other features of the AWS platform with just as much power and flexibility as it can for Amazon S3. Have fun exploring!

AWS Partner Network (APN) Blog

Getting the Most Out of the Amazon S3 CLI

What is Amazon S3?

WS Command Line Interface (AWS CLI)

AWS S3 Data Transfer Scenarios

Environment Setup

Example 1 – Uploading a Large Number of Small Files

Example 2 – Uploading a Small Number of Large Files

Example 3 – Periodically Synchronizing a Directory

Example 4 – Maximizing Throughput

Conclusion

Resources

Follow

Learn

Resources

Developers

Help