AWS Storage Blog

Excluding and including specific data in transfer tasks using AWS DataSync filters

AWS DataSync automates and accelerates copying data between your NFS servers, Amazon S3 buckets, and Amazon Elastic File System (Amazon EFS) file systems. With the recent launch of filtering, you can now specify the set of files, folders, or objects that should be transferred, those that should be excluded from the transfer, or a combination of the two. For example, you can choose to only copy selected parts of your source file system, or you can exclude temporary files that you never want to waste time transferring.

In this post, I explain the filtering capabilities of DataSync, and share examples of useful filters.

Filters and how to apply them

DataSync filters enable you to specify a list of patterns that match files, folders, and objects. For example:

  • A filter that matches a specific folder:
    • /path/to/my-folder
  • A filter that matches multiple folders that share a common pattern:
    • */temporary-folder-*
  • A filter that matches images could be composed of multiple patterns:
    • *.png|*.jpg|*.jpeg

For more information about the complete syntax for filters, see Filtering the Data Transferred by AWS DataSync.

Filters can be configured using the AWS Management Console, AWS CLI, or AWS SDK. When applying filters using the console, you can specify each pattern separately, and the ‘|’ delimiter is not required.
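When scripting against the AWS CLI, you join the patterns into a single "|"-delimited string yourself. The following is a minimal sketch of one way to do that in a shell script. The join_patterns function is a hypothetical helper I made up for illustration; it is not part of the AWS CLI.

```shell
# join_patterns is a hypothetical helper, not an AWS CLI feature.
# It joins its arguments with '|' to form a single filter value.
# The body runs in a subshell so the IFS change does not leak out.
join_patterns() (
    IFS='|'
    printf '%s\n' "$*"
)

image_filter=$(join_patterns '*.png' '*.jpg' '*.jpeg')
echo "FilterType=SIMPLE_PATTERN,Value='${image_filter}'"
```

Quoting each pattern keeps the shell from expanding the wildcards before they reach the helper.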

Useful filter examples

The following examples cover some common scenarios. Exclude filters are specified in create-task or update-task. Include filters are specified in start-task-execution. For brevity, the following examples specify only the command name and filter name, omitting the other parameters. The syntax for complete commands would be as follows:

aws datasync create-task \
    --source-location-arn 'arn:aws:datasync:region:account-id:location/location-id' \
    --destination-location-arn 'arn:aws:datasync:region:account-id:location/location-id' \
    --cloud-watch-log-group-arn 'arn:aws:logs:region:account-id:log-group:your-log-group' \
    --name your-task-name \
    --excludes <filters will be listed here>

aws datasync start-task-execution \
    --task-arn 'arn:aws:datasync:region:account-id:task/task-id' \
    --includes <filters will be listed here>

Exclude a specific directory

Any path that you provide is interpreted as relative to the source location. Therefore, if your task’s source location is /mount-root, and you want to exclude /mount-root/not-important, you should run:

aws datasync create-task ... --excludes FilterType=SIMPLE_PATTERN,Value='/not-important'
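If your scripts work with absolute paths, you can strip the source location prefix before building the filter. This is a minimal sketch, assuming a hypothetical helper named to_filter_path; the name and its error handling are my own choices for illustration.

```shell
# to_filter_path is a hypothetical helper: it strips the source location
# prefix from an absolute path, leaving the relative path a filter expects.
to_filter_path() {
    source_root=${1%/}            # drop a trailing slash, if any
    absolute_path=$2
    case "$absolute_path" in
        "$source_root"/*)
            relative_path=${absolute_path#"$source_root"}
            printf '%s\n' "$relative_path"
            ;;
        *)
            echo "error: $absolute_path is not under $source_root" >&2
            return 1
            ;;
    esac
}

exclude_path=$(to_filter_path /mount-root /mount-root/not-important)
echo "FilterType=SIMPLE_PATTERN,Value='${exclude_path}'"
```

The helper refuses paths outside the source location, because a filter built from them could never match anything in the task.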

Exclude a specific file type or types

aws datasync create-task ... --excludes FilterType=SIMPLE_PATTERN,Value='*.temp'

When specifying more than one file type, the patterns should be delimited using “|”.

aws datasync create-task ... --excludes FilterType=SIMPLE_PATTERN,Value='*.temp|*.tmp'

Exclude folder types that are not needed at the destination

For example, a common request from customers was to exclude the .snapshot folders created by NetApp backup jobs.

aws datasync create-task ... --excludes FilterType=SIMPLE_PATTERN,Value='*/.snapshot'

Exclude multiple folders and folder types

This example includes two patterns that match multiple folders, and one specific folder.

aws datasync create-task ... --excludes FilterType=SIMPLE_PATTERN,Value='*/.snapshot|*/temp-*/|/this-one-is-also-not-needed'

Transfer only specified files

When you have a list of specific files to transfer, you can specify them as delimited by “|”. The length of the filter string is currently limited to 100,000 characters.

aws datasync start-task-execution ... --includes FilterType=SIMPLE_PATTERN,Value='/folder/file1.txt|/folder/file2.txt|/folder/file3.txt|/folder/file4.txt'
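For longer lists, it can help to build the filter string from a file and check its length before starting the task. The following sketch assumes a list file with one path per line, each relative to the source location. The build_include_filter function and the MAX_FILTER_LENGTH variable are hypothetical names, and the temporary file stands in for your own list.

```shell
# MAX_FILTER_LENGTH mirrors the 100,000-character limit mentioned above.
MAX_FILTER_LENGTH=100000

# build_include_filter is a hypothetical helper: it joins the non-empty
# lines of a list file with '|' and rejects results over the limit.
build_include_filter() {
    list_file=$1
    filter=''
    while IFS= read -r entry || [ -n "$entry" ]; do
        [ -z "$entry" ] && continue          # skip blank lines
        filter="${filter:+$filter|}$entry"   # add '|' only between entries
    done < "$list_file"
    if [ "${#filter}" -gt "$MAX_FILTER_LENGTH" ]; then
        echo "error: filter is ${#filter} characters, limit is $MAX_FILTER_LENGTH" >&2
        return 1
    fi
    printf '%s\n' "$filter"
}

# Stand-in list file for the example.
list_file=$(mktemp)
printf '%s\n' /folder/file1.txt /folder/file2.txt /folder/file3.txt > "$list_file"
include_filter=$(build_include_filter "$list_file")
echo "FilterType=SIMPLE_PATTERN,Value='${include_filter}'"
rm -f "$list_file"
```

If the list exceeds the limit, one option is to split it across several task executions.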

Transfer only a specified folder

As a reminder, when you provide a path, it’s relative to the source location. Therefore, if your task’s source location is /mount-root, and you want to transfer /mount-root/folder-to-transfer, you should run:

aws datasync start-task-execution ... --includes FilterType=SIMPLE_PATTERN,Value='/folder-to-transfer'

Transfer only a specified folder type

Currently, include filters support the * wildcard only as the rightmost character of a pattern.

aws datasync start-task-execution ... --includes FilterType=SIMPLE_PATTERN,Value='/mnt/important-folders-prefix-*'

Combining includes and excludes

Suppose that you only want to transfer /projects/important-project-folder, but it contains many temporary /work-in-progress folders scattered across its sub-folders. In this case, you can:

  1. Create a task with an exclude filter for the /work-in-progress folders.
  2. Run it with an include filter to only transfer the /projects/important-project-folder.

Exclude filters are specified in create-task or update-task. Include filters are specified in start-task-execution.

aws datasync create-task ... --excludes FilterType=SIMPLE_PATTERN,Value='*/work-in-progress'

aws datasync start-task-execution ... --includes FilterType=SIMPLE_PATTERN,Value='/projects/important-project-folder'

Split your share’s root directory between two tasks

In this example, I assume that you have the following directory structure on your source location:

/mount-root
    /project-videos
    /project-images
    ...
    other-file1.txt
    funny-image.png
    what-a-mess.jpeg
    ...

You want to transfer all the project files to one location, and all the other files to another. In this case, you can create two tasks: transfer project folders and transfer other files.

Task 1: Transfer project folders

aws datasync start-task-execution ... --includes FilterType=SIMPLE_PATTERN,Value='/project-*'

Task 2: Transfer other files

The other files under /mount-root don’t have a common pattern. Transfer them by excluding the project files transferred by Task 1.

aws datasync create-task ... --excludes FilterType=SIMPLE_PATTERN,Value='/project-*'
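Before running the two tasks, you may want a quick local check of which entries each one would pick up. The sketch below is a dry run only. It uses the shell's own glob matching, which approximates but is not guaranteed to match how DataSync evaluates patterns, and classify_entry is a hypothetical helper name.

```shell
# classify_entry is a hypothetical helper for a local dry run. It relies on
# shell glob matching, which only approximates DataSync's pattern matching.
classify_entry() {
    case "$1" in
        /project-*) echo "task-1" ;;   # matched by the include filter of Task 1
        *)          echo "task-2" ;;   # left over after the exclude filter of Task 2
    esac
}

# Paths are written relative to the source location, as filters expect.
for entry in /project-videos /project-images /other-file1.txt /funny-image.png /what-a-mess.jpeg; do
    printf '%s -> %s\n' "$entry" "$(classify_entry "$entry")"
done
```

Because both tasks use the same pattern, every entry lands in exactly one of them.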

Summary

The examples above represent use cases that we commonly discuss with customers. If you need something different, you can create your own filters. For more information, see Filtering the Data Transferred by AWS DataSync.

We are constantly enhancing DataSync. If you’re interested in additional filtering capabilities, we’d love to hear your feedback and ideas. Contact my team through the DataSync developer forum, AWS Support, or comment below.

Olga Kogan

Olga Kogan is a Senior Product Manager at AWS. Previously, she was heading the Engineering Growth team at Fundbox, a FinTech startup that raised over $100M. Earlier Olga co-founded a startup that developed educational products for kids.