How can I copy data to and from Amazon EFS in parallel to maximize performance on my EC2 instance?

Last updated: 2020-04-28

I have a large number of files to copy or delete. How can I run these jobs in parallel on an Amazon Elastic File System (Amazon EFS) file system on my Amazon Elastic Compute Cloud (Amazon EC2) instance?

Short Description

Use one of the following tools to run jobs in parallel on an Amazon EFS file system:

  • GNU parallel – For more information, see GNU Parallel on the GNU Operating System website.
  • msrsync – For more information, see msrsync on the GitHub website.
  • fpsync – For more information, see fpsync on the Ubuntu manuals website.

Resolution

GNU parallel

1.    Install GNU parallel.

For Amazon Linux and RHEL 6:

$ sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm
$ sudo yum install parallel nload -y

For RHEL 7:

$ sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
$ sudo yum install parallel nload -y

For Amazon Linux 2:

$ sudo amazon-linux-extras install epel
$ sudo yum install nload sysstat parallel -y

For Ubuntu:

$ sudo apt-get install parallel

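2.    Use GNU parallel with rsync or cp to copy the files to Amazon EFS. In the second command, the -j option sets the number of jobs that run in parallel.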

$ sudo time find -L /src -type f | parallel rsync -avR {} /dst

or

$ sudo time find /src -type f | parallel -j 32 cp {} /dst
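
The two pipelines above pass file names to parallel one per line, and time reports on the find command rather than on the whole pipeline. The following is a minimal sketch, not part of the original procedure, that passes NUL-separated names and can be timed as a single unit. It assumes bash and GNU findutils. parallel_copy is a hypothetical helper name, and the xargs branch is an assumed fallback for hosts where GNU parallel isn't installed. Like the cp command above, it places every file directly in the destination directory without recreating the source tree.

```shell
#!/usr/bin/env bash
# Sketch only: parallel_copy is a hypothetical helper, not an AWS or GNU tool.
# Usage: parallel_copy SRC_DIR DST_DIR [JOBS]
# Limitation: DST_DIR should not contain whitespace in the GNU parallel branch.
parallel_copy() {
    local src=$1 dst=$2 jobs=${3:-32}
    mkdir -p "$dst" || return 1
    if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
        # -0 reads NUL-separated names, so spaces in file names are safe.
        find "$src" -type f -print0 | parallel -0 -j "$jobs" cp {} "$dst"
    else
        # Assumed fallback: xargs -P also runs several cp processes at once.
        find "$src" -type f -print0 | xargs -0 -P "$jobs" -I{} cp {} "$dst"
    fi
}
```

For example, `time parallel_copy /src /dst 32` times the whole copy when it is run from a bash shell that has write permission on /dst.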

3.    Use the nload console application to monitor network traffic and bandwidth.

$ sudo nload -u M
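
The question at the top of this article also covers deleting many files. As a hedged sketch along the same lines as the copy commands in step 2, the function below removes the regular files under a directory with several rm processes at once. It assumes bash and GNU findutils. parallel_delete is a hypothetical name, the xargs branch is an assumed fallback for hosts without GNU parallel, and the batch size of 1000 is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch only: parallel_delete is a hypothetical helper.
# Usage: parallel_delete TARGET_DIR [JOBS]
parallel_delete() {
    local target=$1 jobs=${2:-32}
    [ -d "$target" ] || return 1
    if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
        # -m packs many file names into each rm invocation.
        find "$target" -type f -print0 | parallel -0 -j "$jobs" -m rm -f
    else
        # Assumed fallback: up to 1000 names per rm, JOBS rm processes at a time.
        find "$target" -type f -print0 | xargs -0 -P "$jobs" -n 1000 rm -f
    fi
}
```

The function deletes regular files only and leaves the directory tree in place. Test it on a scratch directory before pointing it at real data.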

msrsync

msrsync is a Python wrapper for rsync that runs multiple rsync processes in parallel.

Note: msrsync is compatible only with Python 2. You must run the msrsync script using Python 2, version 2.7.14 or later.
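
To check an interpreter against the requirement in the note above, a small helper can compare version strings. This is a sketch that assumes bash and GNU sort (for the -V option). python_ok_for_msrsync is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: returns 0 when VERSION is a Python 2 release at or above 2.7.14.
# Usage: python_ok_for_msrsync VERSION
python_ok_for_msrsync() {
    local version=$1 minimum=2.7.14
    case $version in
        2.*) ;;            # Python 2: continue to the minimum-version check
        *) return 1 ;;     # anything else (for example 3.x) is rejected
    esac
    # sort -V orders version strings; the minimum must sort first or be equal.
    [ "$(printf '%s\n%s\n' "$minimum" "$version" | sort -V | head -n 1)" = "$minimum" ]
}
```

Assuming a python2 binary is on the PATH, the check could be run as `python_ok_for_msrsync "$(python2 -c 'import platform; print(platform.python_version())')" && echo "OK"`.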

1.    Install msrsync.

$ sudo curl -s https://raw.githubusercontent.com/jbd/msrsync/master/msrsync -o /usr/local/bin/msrsync && sudo chmod +x /usr/local/bin/msrsync

2.    Use the -p option to specify the number of rsync processes that you want to run in parallel. Replace X with the number of rsync processes. The -P option shows the progress of each job.

$ sudo time /usr/local/bin/msrsync -P -p X --stats --rsync "-artuv" /src/ /dst/
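
The best value for X depends on the instance and the workload, and the steps above don't prescribe one. As an illustrative starting point only, the sketch below derives X from the number of online CPU cores. The helper name suggest_jobs and the default multiplier of 2 are assumptions, not recommendations from msrsync or Amazon EFS.

```shell
#!/usr/bin/env bash
# Sketch only: prints CORES * PER_CORE as a starting value for X.
# Usage: suggest_jobs [PER_CORE]   (PER_CORE defaults to 2, an arbitrary choice)
suggest_jobs() {
    local per_core=${1:-2} cores
    # Number of CPU cores currently online, as reported by getconf.
    cores=$(getconf _NPROCESSORS_ONLN) || return 1
    echo $(( cores * per_core ))
}
```

For example, `sudo /usr/local/bin/msrsync -P -p "$(suggest_jobs)" --stats --rsync "-artuv" /src/ /dst/` passes the computed value to -p. Adjust the multiplier while watching throughput.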

fpsync

The fpsync tool synchronizes directories in parallel using fpart and rsync. It can execute several rsync processes locally or launch rsync transfers on several nodes (workers) through SSH.

For more information on fpart, see fpart on the Ubuntu manuals website.

1.    Enable the EPEL repository, and then install the fpart package.

For Amazon Linux and RHEL 6:

$ sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm
$ sudo yum install fpart -y

For RHEL 7:

$ sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
$ sudo yum install fpart -y

For Amazon Linux 2:

$ sudo amazon-linux-extras install epel
$ sudo yum install fpart -y

For Ubuntu:

$ sudo apt-get install fpart

Note: In Ubuntu, fpsync is part of the fpart package.

2.    Use fpsync to synchronize the /src directory to the /dst directory. Replace X with the number of rsync processes that you want to run in parallel.

$ sudo fpsync -n X /src /dst
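
After a copy with any of the tools above, a quick sanity check is to compare the number of regular files and their total size in both trees. The following sketch is an addition to the original steps and assumes bash, awk, and GNU find (for -printf). tree_summary and same_tree_summary are hypothetical names. Matching totals don't prove that file contents are identical, so treat this as a first check only.

```shell
#!/usr/bin/env bash
# Sketch only: prints "FILES BYTES" for all regular files under DIR.
# Usage: tree_summary DIR
tree_summary() {
    find "$1" -type f -printf '%s\n' |
        awk '{ n++; b += $1 } END { printf "%d %.0f\n", n, b }'
}

# Succeeds when both trees hold the same number of files and bytes.
# Usage: same_tree_summary DIR_A DIR_B
same_tree_summary() {
    [ "$(tree_summary "$1")" = "$(tree_summary "$2")" ]
}
```

For example, `same_tree_summary /src /dst && echo "totals match"`. Compare the directories that actually correspond, because some commands in this article (such as rsync with -R) create the source path beneath the destination.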
