Derive insights from AWS DataSync task reports using AWS Glue, Amazon Athena, and Amazon QuickSight

Update (9/22/2023): Step 6b updated to automatically detect and update the Amazon Athena table schema when crawler detects large data transfer values reported in bytes that would consume the table’s maximum integer value while storing data.

As customers scale their migration of large datasets with millions of files across multiple data transfers, they are faced with the challenge of getting an aggregated view to track and audit all of their data transfer processes. Identifying the success or failure status of each file being migrated, along with the failure reason in case of errors, is the critical first step in a successful migration.

To help address this challenge, AWS DataSync introduced task reports, a new feature that provides detailed transfer reports in an easy-to-use JSON output format. In this blog, we will show how you can leverage AWS Glue, Amazon Athena, and Amazon QuickSight to automatically catalog, query, and visualize task report output, providing valuable and timely insights into your data migrations. This solution enables you to track, audit, and troubleshoot a specific transfer task or all of your transfer tasks using a single dashboard, giving you a holistic view across your DataSync task executions.

Introducing DataSync task reports

Many of our customers use DataSync to migrate active datasets to AWS, archive data to AWS to free up on-premises storage capacity, replicate data to AWS for business continuity, or transfer data to the cloud for analysis and processing. Using task reports, you can get a summary report along with detailed reports for all files transferred, skipped, verified, and deleted, for each task execution. Task reports are generated when the execution completes, and are stored in your Amazon Simple Storage Service (S3) bucket for easy access. Task reports include file counts, bytes transferred, and file attributes such as size, path, timestamps, file checksums, and object version IDs where applicable.

DataSync task reports provide the following high-level configuration options:

Generate only a summary report – This report includes just the high-level execution details of a task such as the total number of files transferred, skipped, verified, and deleted, timestamps, total number of bytes transferred, and the overall execution status. Choose this report configuration to limit the size of the generated reports and reduce the data stored in S3.
Generate a standard report with errors only – This report includes the summary report above, along with detailed reports for all files that were skipped due to an error or failed to transfer, verify, or delete. Choose this report configuration if you need data for quick identification and troubleshooting of transfer errors.
Generate a standard report with successes and errors – This report includes the summary report above, along with detailed reports for all files that were attempted to be transferred, skipped, verified, and deleted, regardless of status. Choose this configuration if reports are required for auditing purposes.

Based on the size and number of files in the transfer, there could be multiple JSON files for each of the transferred, skipped, verified, and deleted reports. To learn more, see the AWS DataSync documentation.

Solution overview and walkthrough

In this walkthrough, we migrate files from an on-premises Network File System (NFS) server to S3 using multiple DataSync transfer tasks. We then demonstrate how we use Glue, Athena, and QuickSight to track and audit the progress of our overall data migration across all of our tasks by visualizing the consolidated data available in task reports for each task execution.

You can also use this solution to quickly verify the total number of files that were successfully transferred and verified in a specific task execution. In addition, you can use this to identify files that were skipped due to permissions or other errors, or files that failed verification, across all executions of a specific task.

To implement this solution, we first set up two DataSync tasks to transfer data across multiple tasks from NFS to S3, and configure them to generate and store task reports in a specific S3 bucket. We then set up a Glue crawler that crawls the S3 prefix where task reports are stored and catalogs them into a single database table. After that, we demonstrate the ability to query this Glue table using SQL statements in Athena, along with the creation of additional database views in Athena to optimize queries. Finally, we show how to visualize these database views in QuickSight to quickly see the number of files transferred, skipped, verified, and deleted, along with the total number of bytes transferred across both tasks, during the data migration. Our example is specific to transferring data from NFS to S3, but task reports are supported with all DataSync location types.

Derive insights from AWS DataSync task reports using AWS Glue, Amazon Athena, and Amazon QuickSight

Figure 1: Derive insights from AWS DataSync task reports using AWS Glue, Amazon Athena, and Amazon QuickSight

Step 1: Create and run DataSync tasks to generate task reports

In this section, we create and execute two DataSync tasks that will incrementally copy a source dataset comprised of 250,000 files with an approximate size of 5GB, from NFS to S3. The source dataset on the NFS server is spread across two different directories, with each defined and configured as the source location for each task and containing 100,000 files to start with. We add an additional 50,000 files to the source directory of the first task after its initial execution, to demonstrate that it skips the transfer of the existing 100,000 files while only transferring the 50,000 new files in the second execution.

Similarly, there are two different directories on S3, each defined and configured as the destination location for each task. We pre-populate the destination directory of the first task with 10,000 existing files before its execution, to demonstrate that it deletes these 10,000 files during the initial execution as those files don’t exist at the source.

To support this expected behavior across both tasks, we configure the tasks to delete any existing files at the destination that are not also present at the source, transfer only the new files, and verify the checksum of all files that exist at the destination with the source. And for the purpose of demonstrating task reports for transferred, skipped, verified, and deleted files in this blog, we configure the tasks to generate a standard task report with successes and errors.

Create the first DataSync task

Using the DataSync console, we create a DataSync task, named “NFS-TO-S3-TASK-1”, with /source-data/folder1 on NFS as the source and /destination-data/folder1 on S3 as the destination. We configure the task to verify all the data in the destination after the transfer, and to delete files at the destination that are not present at the source prior to the transfer. Task reports are configured to generate a standard report with successes and errors, and to store the reports in our S3 bucket, named “datasync-task-report-output” with a prefix of “reports”. At this stage in our example, a task with ID task-02ad9cddfdc9be4a4 is created.

Task reports configuration during the creation of DataSync task

Figure 2: Task reports configuration during the creation of DataSync task

Create the second DataSync task

We create a second task same as the first one above, to demonstrate that we can track the data transfer progress across multiple tasks using a single dashboard in QuickSight. For this task, named “NFS-TO-S3-TASK-2”, we use /source-data/folder2 on NFS as the source, and /destination-data/folder2 on S3 as the destination. Now, a task with ID task-03a436b138b91eeab is created.

Run the first DataSync task

Using the DataSync console, we run the DataSync task, NFS-TO-S3-TASK-1, and note its execution ID exec-02f437833a7526e66. It’s expected that this task execution generates a deleted files report which lists the 10,000 files that are deleted from the destination as they don’t exist at the source, along with transferred and verified reports that list the 100,000 files that are now transferred and verified.

Add additional files to the source location and rerun the first DataSync task

After the initial execution of the first task, we add 50,000 new files to its source location /source-data/folder1, and rerun the task to generate a report for skipped files. At this stage in our example, a new task execution with ID exec-0ae91431af20e848b has started. It’s expected that in this execution, it generates a report for skipped files that lists the 100,000 files that are skipped as they match at both the source and destination. The task execution also generates transferred reports that list the 50,000 files that are now transferred, and generates verified reports that list the 150,000 files that now exist at the destination after the two executions.

Run the second DataSync task

We now run the second task, NFS-TO-S3-TASK-2, and note its execution ID exec-09141a68da2abc573. This generates transferred and verified reports with another 100,000 files that are transferred and verified in this execution.

Step 2: Populate Glue catalog with task reports data using a Glue crawler

In this section, we create a Glue catalog database, and we define and run a Glue crawler that scans task reports for both tasks in the S3 prefix, extracts their schema information, and stores the metadata automatically in a unified table in the Glue catalog database. This catalog is then used to run queries in Athena and to create dashboards in QuickSight.

Create a Glue crawler

Using the Glue console, we first create a Glue database, named “datasync-task-reports-db”, to organize the task reports table. We then create a Glue crawler to crawl task reports from the S3 prefix and add them as a table to the database.

The high-level steps to configure a crawler are:

On the Glue console, navigate to the database.
Select Add tables using crawler. For Name, enter the crawler name and select Next. We entered “datasync-task-reports-crawler”.
Select Add a data source, and select the S3 bucket where the DataSync task reports are stored.
For Subsequent crawler runs, select Crawl new sub-folders only to crawl only new task reports folders for each task execution. Click Add an S3 data source, and click Next.
For IAM role, select Create new IAM role to auto generate a role that allows Glue to read the data from the specified S3 bucket, then select Next. Under Target database, select the Glue database. In our example, we selected datasync-task-reports-db.
Expand Advanced options:

a. For S3 schema grouping, select Create a single schema for each S3 path to group compatible schemas and create a single table instead of creating multiple tables for each S3 object.

b. For When the crawler detects schema changes in the data store…:

i. Select Update the table definition in the data catalog.

ii. Check Update all new and existing partitions with metadata from the table.

7. For Crawler schedule, leave the Frequency as On demand and select Next. Review the crawler configuration and select Create crawler.

Figure 3 - Glue crawler configuration to crawl DataSync task reports from the S3 prefix

Figure 3: Glue crawler configuration to crawl DataSync task reports from the S3 prefix

Run a Glue crawler

We run the crawler, datasync-task-reports-crawler, after each DataSync task execution. To automatically fetch task reports after each task execution, crawlers can be configured using Amazon EventBridge to trigger on events such as when new reports are added to the task reports S3 prefix or when there is a change in the status of DataSync tasks. When the crawler run is complete, it’s expected that a new table with task reports S3 prefix is created in the Glue catalog. In our example, “reports” table is created.

Step 3: Query task reports data in Glue tables using Athena query editor

In this section, we query against the Glue database table for task reports using the Athena query editor. To query the database efficiently, we create database views for summary, transferred, skipped, verified, and deleted reports. We then run a query against the skipped report view to identify the number of skipped files in a task execution and the reason for skipping their transfer.

Set Glue catalog as the data source

We launch the Athena query editor in the Athena console, set the Glue data source and select the Glue database, datasync-task-reports-db, to query for task reports. By default, any queries against the Glue reports table returns each of the transferred, skipped, verified, and deleted reports as a single array record instead of listing all file entries in them as individual records. This can be observed by running an example query below.

SELECT * FROM "AwsDataCatalog"."your-database-name"."your-table-name"
WHERE "Partition_1"='your-task-id' limit 1;

SQL query results showing the verified report as a single record, when queried against the default Glue database table for DataSync task reports

Figure 4: SQL query results showing the verified report as a single record, when queried against the default Glue database table for DataSync task reports

To make it easier to query specific file details in each report, we create database views for each of these reports that will unnest the arrays and extract all the file details as queryable fields.

Create database views

We run each of the queries below in the Athena query editor to create database views for transferred, skipped, verified, and deleted reports. If a specific report type is not yet produced by task reports, creation of the view for that report may fail. This step should be skipped for such reports until they exist.

Create a database view for summary report

create or replace view summary_view as
select
 partition_1 as "Task-id",
 partition_2 as "Execution-id",
 s.accountid as "Account-ID",
 s.sourcelocation.locationtype as "source-location-type",
 s.destinationlocation.locationtype as "destination-location-type",
 s.starttime as "task-execution-start-time",
 s.endtime as "task-execution-end-time",
 s.totaltime as "total-time-taken-by-task",
 s.overallstatus as "overall-status",
 s.result.filestransferred as "total-files-transferred",
 s.result.filesverified as "total-files-verified",
 s.result.filesdeleted as "total-files-deleted",
 s.result.filesskipped as "total-files-skipped",
 s.result.byteswritten/(1000*1000*1000) as "total-bytes-written-GB",
 s.result.bytestransferred/(1000*1000*1000) as "total-bytes-transferred-GB",
 s.result.bytescompressed/(1000*1000*1000) as "total-bytes-compressed-GB",
 s.result.prepareduration as "total-prepare-duration",
 s.result.preparestatus as "prepare-status",
 s.result.transferduration as "total-transfer-duration",
 s.result.transferstatus as "transfer-status",
 s.result.verifyduration as "total-verify-duration",
 s.result.verifystatus as "verify-status",
 s.result.errorcode as "error-code", /* Omit this in case of no errors */
 s.filters.includes as "include-filters",
 s.filters.excludes as "exclude-filters" 
 
from
    "datasync-task-reports-db"."reports" as s
    where partition_0='Summary-Reports'

Create a database view for transferred report

create or replace view transferred_view as
select
 partition_1 as "Task-id",
 partition_2 as "Execution-id",
 t.transferstatus as "TransferStatus",
 t.relativepath as "file-name",
 t.SrcMetadata.type as "file-type",
 t.SrcMetadata.mtime as "file-mtime",
 t.SrcMetadata.atime as "file-atime",
 t.SrcMetadata.uid as "file-uid",
 t.SrcMetadata.gid as "file-gid",
 t.SrcMetadata.mode as "file-mode",
 t.SrcMetadata.contentsize as "file-size-KB",
 t.overwrite as "is-file-overwritten-in-transfer-phase",
 t.transfertimestamp as "file-transfertimestamp",
 t.transfertype as "TransferType"
from
    "datasync-task-reports-db"."reports"
cross join unnest(transferred) as r(t)
where partition_0='Detailed-Reports'

Create a database view for verified report

create or replace view verified_view as
select
 partition_1 as "Task-id",
 partition_2 as "Execution-id",
 v.relativepath as "file-name",
 v.SrcMetadata.type as "Source-File-Type",
 v.SrcMetadata.mtime as "mtime-at-source",
 v.SrcMetadata.atime as "atime-at-source",
 v.SrcMetadata.uid as "file-uid-at-source",
 v.SrcMetadata.gid as "file-gid-at-source",
 v.SrcMetadata.mode as "file-mode-at-source",
 v.SrcMetadata.contentsize as "file-size-at-source",
 v.DstMetadata.type as "Destinatoin-File-Type",
 v.DstMetadata.mtime as "file-mtime-at-destination",
 v.DstMetadata.atime as "file-atime-destination",
 v.DstMetadata.uid as "file-uid-destination",
 v.DstMetadata.gid as "file-gid-destination",
 v.DstMetadata.mode as "file-mode-destination",
 v.DstMetadata.contentsize as "file-size-destination",
 v.verifytimestamp as "file-verify-timestamp",
 v.verifystatus as "verifyStatus",
 v.SrcChecksum as "Source-File-Checksum-Value",
 v.DstChecksum as "Destinatoin-File-Checksum-Value",
 v.FailureReason as "Verification-Failure-Reason" /* Omit this in case of no errors */
from
    "datasync-task-reports-db"."reports"
cross join unnest(verified) as r(v)
where partition_0='Detailed-Reports'

Create a database view for skipped report

create or replace view skipped_view as
select
 partition_1 as "Task-id",
 partition_2 as "Execution-id",
 s.relativepath as "file-name",
 s.SkippedTimestamp as "file-transfertimestamp",
 s.SrcMetadata.type as "source-file-type",
 s.SrcMetadata.mtime as "file-mtime-at-source",
 s.SrcMetadata.atime as "file-atime-at-source",
 s.SrcMetadata.uid as "file-uid-at-source",
 s.SrcMetadata.gid as "file-gid-at-source",
 s.SrcMetadata.mode as "file-mode-at-source",
 s.SrcMetadata.contentsize as "file-size-at-source",
 s.DstMetadata.type as "destinatoin-file-type",
 s.DstMetadata.mtime as "file-mtime-at-destination",
 s.DstMetadata.atime as "file-atime-destination",
 s.DstMetadata.uid as "file-uid-destination",
 s.DstMetadata.gid as "file-gid-destination",
 s.DstMetadata.mode as "file-mode-destination",
 s.DstMetadata.contentsize as "file-size-destination",
 s.SkippedReason as "skipped-reason"
from
    "datasync-task-reports-db"."reports"
cross join unnest(skipped) as r(s)
where partition_0='Detailed-Reports'

Create a database view for deleted report

create or replace view deleted_view as
select
 partition_1 as "Task-id",
 partition_2 as "Execution-id",
 d.relativepath as "file-name",
 d.DstMetadata.contentsize as "file-size-at-destination",
 d.deletetimestamp as "file-deletetimestamp",
 d.deletestatus as "deleteStatus",
 d.failurereason as "failure-reason" /* omit this in case of no errors */
from
    "datasync-task-reports-db"."reports"
cross join unnest(deleted) as r(d)
where partition_0='Detailed-Reports'

Run a query to troubleshoot skipped files

We now run a query against the skipped view that was created above, to identify the number of skipped files and the reason they were skipped in the second execution of our first task, exec-0ae91431af20e848b.

SELECT "task-id",
	"execution-id",
	"file-name",
	"skipped-reason"
FROM "your-database-name"."your-view-name"
WHERE "execution-id" = 'your-execution-id'

We confirm in the results below that the task skipped the transfer of 100,000 files as expected in our example, because they match at both the source and destination.

SQL query results showing the total skipped files in a DataSync task execution, when queried against the custom database view for summary report

Figure 5: SQL query results showing the total skipped files in a DataSync task execution, when queried against the custom database view for summary report

Step 4: Create QuickSight dashboards to visualize the data in task reports

In this section, we leverage the data loaded in the Glue catalog by visualizing Athena task reports views using QuickSight dashboards. We create a dataset for summary_view and publish the analyses for visualizing the total number of files transferred, skipped, verified, deleted, along with the total number of bytes transferred across all task executions.

Alternatively, you can create an analysis for any other available tables or views such as transferred_view, skipped_view, verified_view, and deleted_view to visualize specific file details as needed.

We configured dashboards to automatically refresh the data to account for new task reports generated after each task execution, enabling us to monitor the overall progress of our data migration.

Create a QuickSight dataset for Athena

We create a new dataset in QuickSight console using an existing Amazon Athena data source. We select Athena as the data source, and AwsDataCatalog for catalog. We then select Glue database, datasync-task-reports-db, and the summary_view table that were created above. We configure to query and refresh the data directly from the source dataset instead of caching it using SPICE, to automatically load new task reports in QuickSight visualizations after each task execution. We also have pre-configured the task reports S3 bucket access setup in QuickSight.

QuickSight dataset configuration

Figure 6: QuickSight dataset configuration

Visualize the QuickSight dataset

Now, we can choose any graph that appropriately depicts the summary view dataset. In this example, we create and publish a new analysis that uses a vertical bar chart visual to show all files transferred, skipped, deleted, and verified across all task executions.

Configuration of QuickSight analysis to visualize files transferred, verified, skipped, and deleted file attributes, across multiple DataSync task executions

Figure 7: Configuration of QuickSight analysis to visualize files transferred, verified, skipped, and deleted file attributes, across multiple DataSync task executions

We also create and publish another analysis that uses a donut chart visual to show the total volume of data transferred across all task executions.

Configuration of QuickSight analysis to visualize total data transferred across multiple DataSync task executions

Figure 8: Configuration of QuickSight analysis to visualize total data transferred across multiple DataSync task executions

From these graphs above, we can confirm that both tasks together have incrementally transferred and verified a total of 250,000 files successfully, while deleting 10,000 files at a destination directory during the initial execution of the first task, as configured in our examples. We can also see that we were able to track the total volume of data transferred across all task executions, which is 5GB in our examples, giving us the ability to easily calculate the remaining percentage of this total data migration.

A single dashboard containing these two published analyses provided us the total visibility that we need to periodically track and audit the progress of our data migration across all of our task executions.

QuickSight dashboard with multiple published analyses that visualizes specific file attributes across multiple DataSync task executions

Figure 9: QuickSight dashboard with multiple published analyses that visualizes specific file attributes across multiple DataSync task executions

Cleaning up

If you implemented this solution, delete the following resources to avoid incurring future costs.

AWS DataSync:
- Agent
- Locations
- Tasks
Amazon S3:
- Task Report objects
- Transferred objects
AWS Glue:
- Crawler
- Catalog database
Amazon QuickSight:
- Dataset
- Dashboards
AWS IAM:
- IAM roles

Conclusion

In this post, we walked through implementing a solution to configure and customize task reports during the creation of DataSync tasks. We then demonstrated how we were able to easily track, audit, and troubleshoot the data transfers by analyzing and visualizing task reports using Glue, Athena, and QuickSight. We used S3 to store the task reports, Glue to catalog them, Athena to query specific data in the reports, and finally, QuickSight to visualize that data.

Using this solution, you can easily identify the overall progress and status of your data migration across multiple DataSync tasks and their executions. You can streamline the troubleshooting and identification of failures during each task execution by quickly querying for individual files and their associated errors. You can identify common transfer trends or failure patterns in your data migration by creating the analyses and dashboards that exhibit the granular task execution details available in task reports. Together, these benefits will enable you to gain better insights into your data migration processes and allow you to make more informed business decisions.

For more information, visit the AWS DataSync product page, view the product documentation and task reports documentation, or try it in the AWS Management Console.

AWS Storage Blog

Derive insights from AWS DataSync task reports using AWS Glue, Amazon Athena, and Amazon QuickSight

Introducing DataSync task reports

Solution overview and walkthrough

Step 1: Create and run DataSync tasks to generate task reports

Create the first DataSync task

Create the second DataSync task

Run the first DataSync task

Add additional files to the source location and rerun the first DataSync task

Run the second DataSync task

Step 2: Populate Glue catalog with task reports data using a Glue crawler

Create a Glue crawler

Run a Glue crawler

Step 3: Query task reports data in Glue tables using Athena query editor

Set Glue catalog as the data source

Create database views

Run a query to troubleshoot skipped files

Step 4: Create QuickSight dashboards to visualize the data in task reports

Create a QuickSight dataset for Athena

Cleaning up

Conclusion

Resources

Follow