AWS Big Data Blog
Seven Tips for Using S3DistCp on Amazon EMR to Move Data Efficiently Between HDFS and Amazon S3
April 2024: This post was reviewed for accuracy.
Although it’s common for Amazon EMR customers to process data directly in Amazon S3, there are occasions when you might want to copy data from S3 to the Hadoop Distributed File System (HDFS) on your Amazon EMR cluster. Additionally, you might have a use case that requires moving large amounts of data between buckets or regions. In these use cases, large datasets are too big for a simple copy operation. Amazon EMR can help with this: it offers a utility, S3DistCp, for moving data from S3 to other S3 locations or to on-cluster HDFS.
In the Hadoop ecosystem, DistCp is often used to move data. DistCp provides a distributed copy capability built on top of a MapReduce framework. S3DistCp is an extension to DistCp that is optimized to work with S3 and that adds several useful features. In addition to moving data between HDFS and S3, S3DistCp is also a Swiss Army knife of file manipulations. In this post we’ll cover the following tips for using S3DistCp, starting with basic use cases and then moving to more advanced scenarios:
1. Copy or move files without transformation
2. Copy and change file compression on the fly
3. Copy files incrementally
4. Copy multiple folders in one job
5. Aggregate files based on a pattern
6. Upload files larger than 1 TB in size
7. Submit an S3DistCp step to an EMR cluster
1. Copy or move files without transformation
We’ve observed that customers often use S3DistCp to copy data from one storage location to another, whether S3 or HDFS. The syntax for this operation is simple and straightforward; the following sketch uses the sample locations that appear later in this post, so substitute your own paths:
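# Copy a sample table from S3 to HDFS on the cluster (substitute your own locations)
s3-dist-cp --src s3://my-tables/incoming/hourly_table --dest /data/input/hourly_table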
The source location may contain extra files that we don’t necessarily want to copy. Here, we can use filters based on regular expressions to do things such as copying only the files with the .log extension.
Each subfolder contains a mix of files: some ending in .log that we want to copy, and others that we don’t.
To copy only the required files, let’s use the --srcPattern option, as in the following sketch (the filtered destination folder is illustrative):
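# Sketch: copy only the files whose paths end in .log (the _filtered destination folder is illustrative)
s3-dist-cp --src s3://my-tables/incoming/hourly_table --dest s3://my-tables/incoming/hourly_table_filtered --srcPattern '.*\.log'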
After the upload has finished successfully, let’s check the folder contents in the destination location to confirm that only the files ending in .log were copied.
Sometimes, data needs to be moved instead of copied. In this case, we can use the --deleteOnSuccess option. This option is similar to aws s3 mv, which you might have used previously with the AWS CLI. The files are first copied and then deleted from the source, as in the following sketch (the archive destination folder is illustrative):
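# Sketch: copy the files, then delete them from the source (the _archive destination folder is illustrative)
s3-dist-cp --src s3://my-tables/incoming/hourly_table_filtered --dest s3://my-tables/incoming/hourly_table_archive --deleteOnSuccess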
After the preceding operation, the source location has only empty folders, and the target location contains all files.
The important thing to remember here is that, with the --deleteOnSuccess flag, S3DistCp deletes only files; it doesn’t delete parent folders, even when they are empty.
2. Copy and change file compression on the fly
Raw files often land in S3 or HDFS in an uncompressed text format. This format is suboptimal both for the cost of storage and for running analytics on that data. S3DistCp can help you efficiently store data and compress files on the fly with the --outputCodec option, as in the following sketch (the destination folder is illustrative):
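# Sketch: copy files and compress them with the gz codec on the way (the destination folder is illustrative)
s3-dist-cp --src s3://my-tables/incoming/hourly_table_archive --dest s3://my-tables/incoming/hourly_table_archive_gz --outputCodec gz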
The current version of S3DistCp supports the codecs gzip, gz, lzo, lzop, and snappy, and the keywords none and keep (the default). These keywords have the following meaning:
- “none” – Save files uncompressed. If the files are compressed, then S3DistCp decompresses them.
- “keep” – Don’t change the compression of the files but copy them as-is.
Checking the files in the target folder confirms that they have now been compressed with the gz codec.
3. Copy files incrementally
In real life, an upstream process drops files on some cadence. For instance, new files might get created every hour or every minute. A downstream process can be configured to pick them up on a different schedule.
Let’s say data lands on S3 and we want to process it on HDFS daily. Copying all files every time doesn’t scale very well. Fortunately, S3DistCp has a built-in solution for that.
For this solution, we use a manifest file. That file allows S3DistCp to keep track of copied files. Following is a sketch of the command; the manifest file names and the location of the previous run’s manifest are illustrative:
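# Sketch: copy only files not listed in the previous run's manifest, and record this run in a new manifest
# (the manifest names and the previous-manifest location are illustrative)
s3-dist-cp --src s3://my-tables/incoming/hourly_table --dest /data/input/hourly_table --srcPattern '.*\.log' --outputManifest /tmp/mymanifest.gz --previousManifest hdfs:///data/input/hourly_table/mymanifest.gz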
The command takes two manifest files as parameters, outputManifest and previousManifest. The first contains a list of all copied files (old and new), and the second contains a list of the files copied previously. This way, we can recreate the full history of operations and see which files were copied during each run.
S3DistCp creates the manifest file in the local file system using the provided path, /tmp/mymanifest.gz. When the copy operation finishes, it moves that manifest to <DESTINATION LOCATION>.
4. Copy multiple folders in one job
Imagine that we need to copy several folders. Usually, we run as many copy jobs as there are folders that need to be copied. With S3DistCp, the copy can be done in one go. All we need is to prepare a file with a list of prefixes and use it as a parameter for the tool, as in the following sketch:
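# Sketch: copy every prefix listed in folders.lst in a single job (prefixes are assumed to live under the --src location)
s3-dist-cp --src s3://my-tables/incoming/hourly_table --dest /data/input/hourly_table --srcPrefixesFile file://${PWD}/folders.lst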
In this case, the folders.lst file contains a list of source prefixes, one per line; it might look something like the following (the dates are illustrative):
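s3://my-tables/incoming/hourly_table/2017-02-01
s3://my-tables/incoming/hourly_table/2017-02-02
s3://my-tables/incoming/hourly_table/2017-02-03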
As a result, the target location contains only the requested subfolders.
5. Aggregate files based on a pattern
Hadoop is optimized for reading a smaller number of large files rather than many small files, whether from S3 or HDFS. You can use S3DistCp to aggregate small files into fewer large files of a size that you choose, which can optimize your analysis for both performance and cost.
In the following example, we combine small files into bigger files by using a regular expression with the --groupBy option. The command below is a sketch: the destination folder is illustrative, and the pattern is a simplified, single-group variant of the one used in the step example later in this post:
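# Sketch: combine small hourly files into files of roughly 10 MB, grouped by the single capture group
# (the destination folder is illustrative)
s3-dist-cp --src /data/input/hourly_table --dest /data/output/daily_table --targetSize 10 --groupBy '.*/hourly_table/.*/(\d\d)/.*\.log'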
Let’s take a look at the target folders and compare them to the corresponding source folders.
As you can see, seven data files were combined into two files whose sizes are close to the requested 10 MB. The *.meta file was filtered out because the --groupBy pattern works in a similar way to --srcPattern. We recommend keeping files larger than the default block size, which is 128 MB on EMR.
The name of each final file is composed of the groups in the regular expression used in --groupBy, plus a number to make the name unique. The pattern must have at least one group defined.
Let’s consider one more example. This time, we want the file name to be formed from three parts: year, month, and file extension (.log in this case). Here is an updated command, again sketched with an illustrative destination folder:
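# Sketch: the three capture groups (year, month, extension) form the name of each output file
# (the destination folder is illustrative; the pattern matches the one in the step example later in this post)
s3-dist-cp --src /data/input/hourly_table --dest /data/output/daily_table_2017 --targetSize 10 --groupBy '.*/hourly_table/.*(2017-).*/(\d\d)/.*\.(log)'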
Now the final files are named in a different way. As you can see, the names of the final files are formed by concatenating the three groups from the regular expression: (2017-), (\d\d), and (log).
You might occasionally find that a job fails with an error, and that the key piece of information in the output is the message Created 0 files to copy 0 files. This means that S3DistCp didn’t find any files to copy because the regular expression in the --groupBy option doesn’t match any files in the source location.
The reason for this issue varies. For example, it can be a mistake in the specified pattern: in our sample dataset there are no files for the year 2018, so a pattern that looks for that year matches nothing. Another common reason is incorrect escaping of the pattern when submitting the S3DistCp command as a step, which is addressed later in this post.
6. Upload files larger than 1 TB in size
The default upload chunk size when doing an S3 multipart upload is 128 MB. When files are larger than 1 TB, the total number of parts can reach over 10,000. Such a large number of parts can make the job run for a very long time or even fail.
In this case, you can improve job performance by increasing the size of each part. In S3DistCp, you can do this by using the --multipartUploadChunkSize option.
Let’s test how it works on several files about 200 GB in size. With the default part size, it takes about 84 minutes to copy them to S3 from HDFS.
We can increase the default part size to 1000 MB, as in the following sketch (the source and destination locations are placeholders):
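# Sketch: use 1000 MB parts instead of the 128 MB default (source and destination are placeholders)
s3-dist-cp --src /data/big-files --dest s3://my-tables/data/big-files --multipartUploadChunkSize 1000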
The maximum part size is 5 GB. Keep in mind that larger parts have a higher chance to fail during upload and don’t necessarily speed up the process. Let’s run the same job with the maximum part size; in the sketch below, a value of 5120 is assumed to correspond to that 5 GB limit:
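# Sketch: assuming the chunk size is specified in megabytes, 5120 approximates the 5 GB per-part maximum
s3-dist-cp --src /data/big-files --dest s3://my-tables/data/big-files --multipartUploadChunkSize 5120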
7. Submit an S3DistCp step to an EMR cluster
You can run the S3DistCp tool in several ways. First, you can SSH to the master node and execute the command in a terminal window as we did in the preceding examples. This approach might be convenient for many use cases, but sometimes you might want to create a cluster that has some data on HDFS. You can do this by submitting a step directly in the AWS Management Console when creating a cluster.
In the console’s Add step dialog box, we can fill in the fields in the following way:
- Step type: Custom JAR
- Name*: S3DistCp Step
- JAR location: command-runner.jar
- Arguments: s3-dist-cp --src s3://my-tables/incoming/hourly_table --dest /data/input/hourly_table --targetSize 10 --groupBy .*/hourly_table/.*(2017-).*/(\d\d)/.*\.(log)
- Action on failure: Continue
Notice that we didn’t add quotation marks around our pattern. We needed quotation marks when we were using bash in the terminal window, but not here. The console takes care of escaping and transferring our arguments to the command on the cluster.
Another common use case is to run S3DistCp recurrently or in response to some event. We can always submit a new step to an existing cluster. The syntax here is slightly different from the previous examples: we separate arguments with commas and, in the case of a complex pattern, we wrap the whole step option in single quotation marks. The following AWS CLI command is a sketch; the cluster ID and step name are placeholders for your own values:
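# Sketch: submit an S3DistCp step to a running cluster (the cluster ID and step name are placeholders)
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps 'Name=S3DistCpStep,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Type=CUSTOM_JAR,Args=[s3-dist-cp,--src,s3://my-tables/incoming/hourly_table,--dest,/data/input/hourly_table,--targetSize,10,--groupBy,.*/hourly_table/.*(2017-).*/(\d\d)/.*\.(log)]'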
Summary
This post showed you the basics of how S3DistCp works and highlighted some of its most useful features. It covered how you can use S3DistCp to optimize for raw files of different sizes and also selectively copy different files between locations. We also looked at several options for using the tool from SSH, the AWS Management Console, and the AWS CLI.
If you have questions or suggestions, leave a message in the comments.
Next Steps
Take your new knowledge to the next level! Click on the post below and learn the top 10 tips to improve query performance in Amazon Athena.
Top 10 Performance Tuning Tips for Amazon Athena
About the Author
Illya Yalovyy is a Senior Software Development Engineer with Amazon Web Services. He works on cutting-edge features of EMR and is heavily involved in open source projects such as Apache Hive, Apache Zookeeper, and Apache Sqoop. His spare time is completely dedicated to his children and family.
Audit History
Last reviewed and updated in April 2024 by Uday Narayanan | Principal Solutions Architect