I have a large number of files to copy. I want to run these jobs in parallel on an Amazon Elastic File System (Amazon EFS) file system on my Amazon Elastic Compute Cloud (Amazon EC2) instance.
Short Description
Use one of the following tools to run jobs in parallel on an Amazon EFS file system:
- GNU parallel: For more information, see GNU Parallel on the GNU Operating System website.
- msrsync: For more information, see msrsync on the GitHub website.
- fpsync: For more information, see fpsync on the Ubuntu manuals website.
Resolution
GNU parallel
1. Install GNU parallel.
Amazon Linux and RHEL 6:
$ sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm
$ sudo yum install parallel nload -y
RHEL 7:
$ sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
$ sudo yum install parallel nload -y
Amazon Linux 2:
$ sudo amazon-linux-extras install epel
$ sudo yum install nload sysstat parallel -y
Ubuntu:
$ sudo apt-get install parallel
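2. Use GNU parallel with rsync or cp to copy the files to Amazon EFS. The rsync -R option recreates each file's full source path under /dst, while the cp command copies every file directly into /dst without subdirectories. In both commands, sudo and time apply only to find, not to the jobs that parallel starts: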
$ sudo time find -L /src -type f | parallel rsync -avR {} /dst
or
$ sudo time find /src -type f | parallel -j 32 cp {} /dst
3. Use the nload console application to monitor network traffic and bandwidth. The -u M option displays traffic rates in MByte/s:
$ sudo nload -u M
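The cp command in step 2 places every file directly in /dst, so two files that share a name in different subdirectories overwrite each other. The following is a minimal sketch of one way to keep the directory tree, not a method taken from the tools' documentation. It assumes GNU find and xargs are available, uses temporary directories as hypothetical stand-ins for /src and /dst, and fixes the job count at 4 for illustration. If GNU parallel is not installed, it falls back to xargs -P, which is a swapped-in technique used only so that the sketch runs anywhere.

```shell
#!/bin/sh
set -e

# Hypothetical stand-ins for /src and /dst so the sketch is safe to run.
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/a" "$src/b"
echo "a-one" > "$src/a/one.txt"
echo "b-one" > "$src/b/one.txt"   # same file name, different directory
echo "b-two" > "$src/b/two.txt"

# Work with paths relative to the source root.
cd "$src"

# Recreate the directory tree first, serially, so that the parallel
# copy jobs never race to create the same directory.
find . -type d | while IFS= read -r dir; do
    mkdir -p "$dst/$dir"
done

# Copy the files in parallel, keeping each relative path under $dst.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    find . -type f -print0 | parallel -0 -j 4 cp {} "$dst"/{}
else
    # Fallback (assumption): xargs -P also runs up to 4 cp jobs at once.
    find . -type f -print0 | xargs -0 -P 4 -I{} cp {} "$dst/{}"
fi

copied=$(find "$dst" -type f | wc -l | tr -d ' ')
echo "copied $copied files"
```

On a real system, replace the temporary directories with your source directory and the Amazon EFS mount point, and adjust the job count to suit your instance.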
msrsync
msrsync is a Python wrapper for rsync that runs multiple rsync processes in parallel.
Note: msrsync is a Python script, so Python must be installed on the instance. Run the msrsync script with Python version 2.7.14 or later.
1. Install msrsync.
$ sudo curl -s https://raw.githubusercontent.com/jbd/msrsync/master/msrsync -o /usr/local/bin/msrsync && sudo chmod +x /usr/local/bin/msrsync
2. Use the -p option to specify the number of rsync processes that you want to run in parallel. Replace X with the number of rsync processes. The -P option shows the progress of each job.
$ sudo time /usr/local/bin/msrsync -P -p X --stats --rsync "-artuv" /src/ /dst/
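The steps above do not say how to choose X. As a hedged starting point only, the sketch below derives the process count from the number of CPU cores and prints the resulting msrsync command without running it. The one-process-per-core rule is an assumption made for illustration, not a recommendation from msrsync or Amazon EFS, and the variable names are hypothetical.

```shell
#!/bin/sh
set -e

# Source and destination as used in the article; change as needed.
src=/src/
dst=/dst/

# Assumption: start with one rsync process per CPU core, then tune
# the value while watching throughput.
if command -v nproc >/dev/null 2>&1; then
    procs=$(nproc)
else
    procs=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)
fi

# Build the command line and print it instead of running it (dry run).
cmd="/usr/local/bin/msrsync -P -p $procs --stats --rsync \"-artuv\" $src $dst"
echo "$cmd"
```

Review the printed command, then run it with sudo time as in step 2 and raise or lower the -p value based on the throughput that you observe.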
fpsync
The fpsync tool synchronizes directories in parallel using fpart and rsync. It can run several rsync processes locally or launch rsync transfers on several nodes (workers) through SSH.
For more information on fpart, see fpart on the Ubuntu manuals website.
1. Activate the EPEL repository, and then install the fpart package.
Amazon Linux and RHEL 6:
$ sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm
$ sudo yum install fpart -y
RHEL 7:
$ sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
$ sudo yum install fpart -y
Amazon Linux 2:
$ sudo amazon-linux-extras install epel
$ sudo yum install fpart -y
Ubuntu:
$ sudo apt-get install fpart
Note: On Ubuntu, fpsync is included in the fpart package.
2. Use fpsync to synchronize the contents of the /src directory to the /dst directory. Replace X with the number of rsync processes that you want to run in parallel.
$ sudo fpsync -n X /src /dst
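After a parallel copy finishes, it can be useful to confirm that every source file reached the destination. The following is a minimal sketch that compares file names only, not file contents. It assumes that the destination mirrors the source tree path for path, and it uses temporary directories as hypothetical stand-ins for /src and /dst with one file deliberately left out to simulate an incomplete copy.

```shell
#!/bin/sh
set -e

# Hypothetical stand-ins for /src and /dst.
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/a" "$src/b" "$dst/a" "$dst/b"
echo 1 > "$src/a/one.txt"
echo 2 > "$src/b/two.txt"
echo 3 > "$src/b/three.txt"

# Simulate an incomplete copy: b/two.txt never reaches the destination.
cp "$src/a/one.txt" "$dst/a/one.txt"
cp "$src/b/three.txt" "$dst/b/three.txt"

work=$(mktemp -d)

# List files relative to each root, sorted bytewise so comm can compare them.
(cd "$src" && find . -type f | LC_ALL=C sort) > "$work/src.list"
(cd "$dst" && find . -type f | LC_ALL=C sort) > "$work/dst.list"

# Lines that appear only in the source list are missing from the destination.
missing=$(LC_ALL=C comm -23 "$work/src.list" "$work/dst.list")

if [ -n "$missing" ]; then
    echo "Missing from destination:"
    echo "$missing"
else
    echo "All source files are present in the destination."
fi
```

To use the check on a real copy, set src and dst to your own directories and remove the lines that create the sample files.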