AWS Storage Blog

Excluding and including specific data in transfer tasks using AWS DataSync filters

UPDATE (8/25/2021): This post reflects that AWS DataSync now supports using both include and exclude filters when you create a task, giving you more granularity when specifying the files, folders, and objects that you want to transfer.


AWS DataSync automates and accelerates copying data between your NFS servers, Amazon S3 buckets, and Amazon Elastic File System (Amazon EFS) file systems. With the recent launch of filtering, you can now specify the set of files, folders, or objects that should be transferred, those that should be excluded from the transfer, or a combination of the two. For example, you can choose to only copy selected parts of your source file system, or you can exclude temporary files that you never want to waste time transferring.

In this post, I explain the filtering capabilities of DataSync, and share examples of useful filters.

Filters and how to apply them

DataSync filters enable you to specify a list of patterns that match files, folders, and objects. For example:

  • A filter that matches a specific folder:
    • /path/to/my-folder
  • A filter that matches multiple folders that share a common pattern:
    • */temporary-folder-*
  • A filter that matches images could be composed of multiple patterns:
    • *.png|*.jpg|*.jpeg

For more information about the complete syntax for filters, see Filtering the Data Transferred by AWS DataSync.

Filters can be configured using the AWS Management Console, AWS CLI, or AWS SDK. When applying filters using the console, you specify each pattern separately, and the ‘|’ delimiter is not required.

Useful filter examples

The following examples cover some common scenarios. I use CLI commands for my examples, but the same functionality is available from the AWS Management Console. When using the CLI, filters are specified using the create-task, update-task, or start-task-execution commands. For brevity, the following examples show only the command name and the filter options, omitting the other parameters. The syntax for a complete create-task command would be as follows:

aws datasync create-task \
     --source-location-arn 'arn:aws:datasync:region:account-id:location/location-id' \
     --destination-location-arn 'arn:aws:datasync:region:account-id:location/location-id' \
     --cloud-watch-log-group-arn 'arn:aws:logs:region:account-id:log-group:your-log-group' \
     --name your-task-name \
     --excludes <filters will be listed here> \
     --includes <filters will be listed here>

Exclude a specific directory

Any path that you provide is interpreted as relative to the source location. Therefore, if your task’s source location is /mount-root, and you want to exclude /mount-root/not-important, you should run:

aws datasync create-task ... --excludes FilterType=SIMPLE_PATTERN,Value='/not-important'
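Because filter paths are relative to the source location, it can help to derive them from absolute paths rather than typing them by hand. The following is a minimal Python sketch of that conversion. The helper name to_filter_path is hypothetical and is not part of any AWS SDK.

```python
from pathlib import PurePosixPath


def to_filter_path(source_root, absolute_path):
    """Rewrite absolute_path relative to source_root, keeping a leading slash.

    Hypothetical helper for illustration; raises ValueError when the path
    is not located under the source root.
    """
    relative = PurePosixPath(absolute_path).relative_to(PurePosixPath(source_root))
    if str(relative) == ".":
        raise ValueError("path is the source root itself; nothing to filter")
    return "/" + relative.as_posix()


print(to_filter_path("/mount-root", "/mount-root/not-important"))
```

For the source location /mount-root, this prints /not-important, which is the value used in the command above.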

Exclude a specific file type or types

aws datasync create-task ... --excludes FilterType=SIMPLE_PATTERN,Value='*.temp'

When specifying more than one file type, the patterns should be delimited using “|”.

aws datasync create-task ... --excludes FilterType=SIMPLE_PATTERN,Value='*.temp|*.tmp'

Exclude folder types that are not needed at the destination

For example, a common request from customers was to exclude the .snapshot folders created by NetApp backup jobs.

aws datasync create-task ... --excludes FilterType=SIMPLE_PATTERN,Value='*/.snapshot'

Exclude multiple folders and folder types

This example combines two patterns that each match multiple folders with one pattern that matches a specific folder.

aws datasync create-task ... --excludes FilterType=SIMPLE_PATTERN,Value='*/.snapshot|*/temp-*/|/this-one-is-also-not-needed'

Transfer only specified files

When you have a list of specific files to transfer, you can specify them as delimited by “|”. The length of the filter string is currently limited to 409,600 characters.

aws datasync create-task ... --includes FilterType=SIMPLE_PATTERN,Value='/folder/file1.txt|/folder/file2.txt|/folder/file3.txt|/folder/file4.txt'
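When the file list is generated by a script, the filter string can be assembled and checked against the length limit before the task is created. The sketch below assumes the limit quoted in this post and the FilterType/Value rule shape used by the CLI examples. The function name include_rule_for_files is hypothetical.

```python
# Limit as quoted in this post; confirm against the current DataSync documentation.
MAX_FILTER_LENGTH = 409_600


def include_rule_for_files(paths):
    """Build one include rule that lists each file path explicitly."""
    value = "|".join(paths)
    if len(value) > MAX_FILTER_LENGTH:
        raise ValueError(
            f"filter string is {len(value)} characters; limit is {MAX_FILTER_LENGTH}"
        )
    return {"FilterType": "SIMPLE_PATTERN", "Value": value}


files = [f"/folder/file{n}.txt" for n in range(1, 5)]
rule = include_rule_for_files(files)
print(rule["Value"])
```

If the joined string exceeds the limit, one option is to split the file list across several tasks or task executions.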

Transfer only a specified folder

As a reminder, when you provide a path, it’s relative to the source location. Therefore, if your task’s source location is /mount-root, and you want to transfer /mount-root/folder-to-transfer, you should run:

aws datasync create-task ... --includes FilterType=SIMPLE_PATTERN,Value='/folder-to-transfer'

Transfer only a specified folder type

Currently, include filters support the * (wildcard) character only as the rightmost character of a pattern.

aws datasync create-task ... --includes FilterType=SIMPLE_PATTERN,Value='/mnt/important-folders-prefix-*'
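The wildcard restriction on include patterns is easy to check locally before creating a task. The following Python sketch encodes only the rule stated above (a * may appear only as the last character of each pattern). The function name is hypothetical, and the check does not cover the rest of the DataSync filter syntax.

```python
def include_value_is_supported(value):
    """Return True if every '|'-delimited pattern uses '*' only as its last character."""
    for pattern in value.split("|"):
        # Any '*' before the final character breaks the stated restriction.
        if "*" in pattern[:-1]:
            return False
    return True


print(include_value_is_supported("/mnt/important-folders-prefix-*"))
print(include_value_is_supported("*/temporary-folder-*"))
```

The first pattern passes the check. The second, which is valid as an exclude pattern, fails it because the wildcard also appears at the start.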

Combining includes and excludes

Suppose you want to transfer only /projects/important-project-folder, but that folder contains many temporary work-in-progress folders at different levels of its subtree. In this case, you can create your task with both include and exclude filters:

aws datasync create-task ... \
     --includes FilterType=SIMPLE_PATTERN,Value='/projects/important-project-folder' \
     --excludes FilterType=SIMPLE_PATTERN,Value='*/work-in-progress'
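The same combination can be expressed through an SDK. The sketch below only builds the filter arguments, assuming the Includes and Excludes parameters that the boto3 DataSync client accepts for create_task. The helper build_task_filters is hypothetical, and the actual call is left as a comment because it needs credentials and real location ARNs.

```python
def build_task_filters(includes=None, excludes=None):
    """Return keyword arguments carrying include and exclude filter rules."""

    def rule(patterns):
        # Each filter list holds one rule whose value is '|'-delimited.
        return [{"FilterType": "SIMPLE_PATTERN", "Value": "|".join(patterns)}]

    kwargs = {}
    if includes:
        kwargs["Includes"] = rule(includes)
    if excludes:
        kwargs["Excludes"] = rule(excludes)
    return kwargs


filters = build_task_filters(
    includes=["/projects/important-project-folder"],
    excludes=["*/work-in-progress"],
)
print(filters)

# Assumed usage, with placeholder ARNs that you would supply:
# boto3.client("datasync").create_task(
#     SourceLocationArn=..., DestinationLocationArn=..., **filters
# )
```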

Split your share’s root directory between two tasks

In this example, I assume that you have the following directory structure on your source location:

/mount-root
    /project-videos
    /project-images
    ...
    other-file1.txt
    funny-image.png
    what-a-mess.jpeg
    ...

You want to transfer all the project files to one location, and all the other files to another. In this case, you can create two tasks: transfer project folders and transfer other files.

Task 1: Transfer project folders

aws datasync create-task ... --includes FilterType=SIMPLE_PATTERN,Value='/project-*'

Task 2: Transfer other files

The other files under /mount-root don’t have a common pattern. Transfer them by excluding the project folders transferred by Task 1.

aws datasync create-task ... --excludes FilterType=SIMPLE_PATTERN,Value='/project-*'
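Because both tasks use the same pattern, once as an include and once as an exclude, every top-level entry should land in exactly one task. The following Python sketch previews that split locally. It uses Python's fnmatch module as a rough stand-in for the DataSync matcher, which is an assumption that only approximates the real behavior, and the task labels are hypothetical.

```python
from fnmatch import fnmatchcase

PROJECT_PATTERN = "/project-*"


def assign_task(entry, pattern=PROJECT_PATTERN):
    """Return which task would pick up a top-level entry under the source root."""
    # Task 1 includes the pattern; Task 2 excludes it and takes everything else.
    return "task-1" if fnmatchcase(entry, pattern) else "task-2"


entries = [
    "/project-videos",
    "/project-images",
    "/other-file1.txt",
    "/funny-image.png",
    "/what-a-mess.jpeg",
]
for entry in entries:
    print(entry, "->", assign_task(entry))
```

A preview like this is only a sanity check on the pattern. The task execution results in DataSync remain the authoritative record of what was transferred.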

Summary

The examples above represent use cases that we commonly discuss with customers. If you need something different, you can create your own filters. For more information, see Filtering the Data Transferred by AWS DataSync.

We are constantly enhancing DataSync. If you’re interested in additional filtering capabilities, we’d love to hear your feedback and ideas. Contact my team through the DataSync developer forum, AWS Support, or comment below.