AWS Storage Blog

Migrate on-premises data to AWS for insightful visualizations

When migrating data from on premises, customers seek a data store that is scalable, durable, and cost effective. Equally important, their business intelligence (BI) tooling must support modern, interactive, and fast dashboards that can scale to tens of thousands of users seamlessly while providing the ability to create meaningful data visualizations for analysis. Visualizing on-premises business analytics datasets unlocks the power of data, creating opportunities for effective decision making and cost optimization through insights gleaned from the visualizations.

In this post, we cover automating the migration of on-premises files from a Network File System (NFS) or Server Message Block (SMB) file system to Amazon Simple Storage Service (Amazon S3), and using the data in those files to create visualizations showing cost and usage, trends, and data correlations. This involves a one-time setup of the data flow pipeline, plus automation that migrates and processes newly created files without manual intervention. This approach lets customers migrate their data seamlessly and at scale, and accelerate innovation through insightful visualizations of that data.

Solution overview

We use AWS DataSync to transfer on-premises files to Amazon S3. DataSync is a secure, online service that automates and accelerates moving data between on premises and AWS Storage services. An AWS Glue crawler scans the migrated files stored in Amazon S3 and catalogs their schema in an AWS Glue Catalog table, which Amazon Athena uses for querying and Amazon QuickSight uses for visualizations.

Architecture showing data movement from on premises to the AWS Cloud using an AWS DataSync agent

Prerequisites and limitations

Network requirements: Network requirements when using DataSync depend on how you plan to transfer data. Check this link for more information.

DataSync agent requirements:  An agent is a virtual machine (VM) or Amazon Elastic Compute Cloud (Amazon EC2) instance that DataSync uses to read from or write to a storage system. Agents are commonly used when copying data from on-premises storage to AWS. Check this link for more information.

DataSync limits: Use this link to find out about quotas when working with DataSync.

QuickSight access to Amazon S3: Grant QuickSight access to the S3 bucket where your data is stored using the following steps. You must be an Administrator user in Amazon QuickSight to complete these steps.

  1. Go to the QuickSight console. If you have not yet signed up for QuickSight, then sign up by following these steps.
  2. In the upper right corner of the console, choose Admin/username, and then choose Manage QuickSight. Choose Security and permissions. Under QuickSight access to AWS services, select Manage.
  3. Choose Athena, and then choose Next. Give QuickSight access to the required S3 bucket, and select Finish.

Solution setup

To appropriately set up this solution, we’ll cover the following steps:

  1. Migrating existing on-premises data to Amazon S3 using AWS DataSync
  2. Configuring the AWS Glue Catalog with an AWS Glue crawler
  3. Visualizing the data from the Glue Catalog in Amazon QuickSight
  4. Automating new data visualization with AWS Lambda and Amazon S3 events

Migrating existing on-premises data to Amazon S3 using AWS DataSync

In this section, we set up a DataSync agent and locations, as well as create and execute a DataSync task.

Setting up the DataSync agent

  1. A DataSync agent must be installed on your local on-premises hypervisor. The agent communicates with the DataSync service as part of the data migration.
  2. Open the DataSync console in your AWS Region and select Transfer data.

AWS DataSync console with the Transfer data option

  3. Follow the steps listed in the DataSync getting started guide to deploy your DataSync agent. Once the agent is deployed and activated, it appears on the DataSync console Agents page. Verify that the service endpoint is correct and the agent status is Online.

Agent appearing on the DataSync console Agents page
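
If you prefer to confirm the agent status programmatically, the following minimal boto3 sketch lists your DataSync agents; the Region shown is only an example and should match where you activated the agent.

import boto3

# DataSync client in the Region where the agent was activated (example Region)
datasync = boto3.client('datasync', region_name='us-east-1')

# Print each agent's name, ARN, and status; a healthy agent reports ONLINE
for agent in datasync.list_agents()['Agents']:
    print(agent['Name'], agent['AgentArn'], agent['Status'])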

DataSync locations

Once the agent is deployed and online, create the source and destination locations that the DataSync task uses. For our scenario, the source is an on-premises file system and the destination is an S3 bucket.

Follow the AWS documentation to create a source location and a destination location. Once created, both locations are listed on the Locations page of the console.
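
The locations can also be created with boto3 instead of the console. The sketch below assumes an NFS source and uses placeholder values for the server hostname, export path, agent ARN, bucket ARN, and bucket access role; if your source is SMB, use create_location_smb instead.

import boto3

datasync = boto3.client('datasync')

# Source: an on-premises NFS export, read through the activated DataSync agent
source = datasync.create_location_nfs(
    ServerHostname='nfs.example.internal',    # placeholder NFS server
    Subdirectory='/exports/reports',          # placeholder export path
    OnPremConfig={'AgentArns': ['arn:aws:datasync:us-east-1:111122223333:agent/agent-EXAMPLE']},
)

# Destination: an S3 bucket prefix, written through a bucket access IAM role
destination = datasync.create_location_s3(
    S3BucketArn='arn:aws:s3:::my-trans-reports-bucket',    # placeholder bucket
    Subdirectory='/incoming/',                             # placeholder prefix
    S3Config={'BucketAccessRoleArn': 'arn:aws:iam::111122223333:role/DataSyncS3AccessRole'},
)

print(source['LocationArn'], destination['LocationArn'])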

DataSync task

Now that the source and destination locations are created, we create a task that orchestrates the data transfer from source to destination.

The steps can be found in the task creation section of the DataSync documentation. Note the following configuration details:

  1. The default option is to keep files in the destination even if they are deleted from the source location, and to overwrite files in the destination with files from the source. Select both Keep deleted files and Overwrite files.
  2. If you want to exclude any files from migrating to the cloud, provide an exclude pattern. For our scenario, we exclude all files with the .temp extension.
  3. Keep the Frequency as Not scheduled for now.
  4. A new task is created and is visible under Tasks with a Status of Available. Select the newly created task and select Actions. From the Actions dropdown, select Start to begin the task execution.

New task is showing as available in actions

  5. A pop-up appears at the top right. Select See execution details and scroll down to the Performance section to see detailed task metrics, such as files transferred and data throughput.

Option to see execution details and look through performance

At this stage, all existing data from on premises (except files with the .temp extension) is successfully transferred to an S3 bucket under the specified prefix in the AWS Cloud.
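
If you prefer to script this step, here is a minimal boto3 sketch of the same task configuration: keep deleted files, overwrite files in the destination, exclude .temp files, and start a one-off execution. The location ARNs are placeholders for the values returned when you created the locations.

import boto3

datasync = boto3.client('datasync')

task = datasync.create_task(
    SourceLocationArn='arn:aws:datasync:us-east-1:111122223333:location/loc-source-EXAMPLE',
    DestinationLocationArn='arn:aws:datasync:us-east-1:111122223333:location/loc-dest-EXAMPLE',
    Name='onprem-to-s3-reports',    # placeholder task name
    # Keep files deleted from the source and overwrite changed files in the destination
    Options={'PreserveDeletedFiles': 'PRESERVE', 'OverwriteMode': 'ALWAYS'},
    # Exclude temporary files from the transfer
    Excludes=[{'FilterType': 'SIMPLE_PATTERN', 'Value': '*.temp'}],
)

# Start a one-off execution, equivalent to selecting Start in the console
execution = datasync.start_task_execution(TaskArn=task['TaskArn'])
print(execution['TaskExecutionArn'])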

Configure the AWS Glue Catalog with a Glue crawler

In this section, we create the AWS Glue Catalog database and AWS Glue crawler.

Creating AWS Glue Catalog database

Open the AWS Glue console in your Region and choose Databases from the left navigation pane. Select Add database. Provide a name and a description, and then select Create database. The new database appears under the Databases section.

Navigation pane with option to add databases
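
The same database can be created with a single boto3 call; the database name below matches the one used later in this post.

import boto3

glue = boto3.client('glue')

# Create the AWS Glue Catalog database that the crawler will populate
glue.create_database(
    DatabaseInput={
        'Name': 'my-trans-reports',
        'Description': 'Tables for on-premises files migrated with AWS DataSync',
    }
)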

Create an AWS Glue Crawler

  1. Go to the AWS Glue console and choose Crawlers from the left navigation pane. Select Create crawler.

Crawler on the navigation pane.

  2. Name the crawler and select Next. On the Choose data sources and classifiers page, choose Not yet, and select Add a data source.

Classifier page with option to add data source

  3. A pop-up appears asking you to add the data source details, which in our case is Amazon S3. Choose S3 as the Data source. For the S3 path, select Browse. A pop-up appears with all the S3 buckets. Select the applicable S3 bucket name (not the radio button), which shows the prefixes inside it. Select the applicable prefix, and then select Choose. This is the same prefix that holds the files migrated from on premises. Leave everything else as default, and select Add an S3 data source at the bottom.
  4. The data source is now listed in the Data sources section of the Choose data sources and classifiers page. Select Next.

Figure 10: Choose data sources and classifiers page showing that the data is not yet mapped to Glue tables

  5. The Configure security settings page opens. Select Create new IAM role and give it a suitable name, then select Next. In the Output configuration section, choose the Target database as the AWS Glue Catalog database created in the previous section, Creating AWS Glue Catalog database. Leave the frequency as On-demand and select Next.

Figure 11 Create an IAM role with output configuration

  6. Review the configurations and select Create crawler. The new crawler is visible under Crawlers with the State marked as Ready. Select the crawler checkbox and select Run. The State changes to Running. Once the crawler run is complete, the data from the files in the S3 bucket is scanned, and a table is created automatically in the AWS Glue Catalog. The State changes from Running to Stopping, then to Ready, and the crawler shows Last run as Succeeded.

Figure 12 Create crawler showing a ready state

  7. Go to Databases from the left navigation pane and select the database name (the Target database chosen in Step 5) to see the database details. The table can be viewed in the Tables section. Select Table data under the View data column. This takes you to Athena; select Proceed.

Figure 13 Databases on the navigation pane with option to select table data

By default, a query that returns 10 records appears. You can modify the query as required.

Figure 14: Default Athena query returning 10 records
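
For reference, the crawler configured in the console steps above can also be defined and run with boto3. The IAM role name and S3 path below are placeholders; the crawler and database names match the ones used in this post.

import boto3

glue = boto3.client('glue')

# Create a crawler that scans the migrated files and writes table metadata
# into the database created earlier
glue.create_crawler(
    Name='Crawl-S3-DataSync',
    Role='GlueCrawlerS3Role',    # placeholder IAM role with S3 and Glue permissions
    DatabaseName='my-trans-reports',
    Targets={'S3Targets': [{'Path': 's3://my-trans-reports-bucket/incoming/'}]},    # placeholder path
)

# Run the crawler on demand, equivalent to selecting Run in the console
glue.start_crawler(Name='Crawl-S3-DataSync')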

Visualize the data

The data is now loaded in the AWS Glue Catalog table and QuickSight access is set up, so we can visualize the data using QuickSight.

  1. Go to the QuickSight console in your AWS Region, go to Datasets in the left navigation pane, and select New dataset. Select Athena. In the pop-up, enter the Data source name as the AWS Glue Catalog database name that we created earlier; in this case, it is my-trans-reports. Select Create data source. The table is listed in the Tables section. Select the table, and then select Select.

Figure 15 QuickSight console with option to select new dataset

  2. QuickSight asks whether you want to import the data into SPICE (Super-fast, Parallel, In-memory Calculation Engine) or directly query the data. Choose Directly query your data and select Visualize.

Figure 16 table for transactions reports and query data for visualization

Note that if you have a large data set, such as millions of rows of data, SPICE provides faster query performance. You must also schedule your SPICE dataset to refresh periodically to pick up new files from the Amazon S3 bucket.

  3. Now you can choose any graph that appropriately depicts the dataset. In this example, we use a vertical bar graph to visualize the country-wise total sales for all files present.

Figure 17: Bar graph depicting total sales by country code and transaction date

In our example, the input consisted of two files, dated 07 Oct 22 and 08 Oct 22.

Setting up the automation

Now we create a Lambda function that is triggered by Amazon S3 events whenever a new file is uploaded to Amazon S3. The Lambda function starts the AWS Glue crawler.

Create an IAM role for the Lambda function

  1. Go to the AWS Identity and Access Management (IAM) console, choose Policies from the left navigation pane, and select Create policy. Go to the JSON tab and enter the following policy.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "glue:StartCrawler",
            "Resource": "<AWS Glue crawler arn>"
        }
    ]
}
  2. Select Next: Tags, then select Next: Review. The Review page appears. Give the policy a name and select Create policy. Now choose Roles from the left navigation pane and select Create role. Choose AWS service as the Trusted entity type and select Lambda. Select Next.
  3. In the Permissions policies list, search for the policy you just created and select it. For this case, it is StartCrawlerPolicy. Select Next. Give the role a name and then select Create role. The role is now created.
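
If you script your IAM setup, a minimal boto3 sketch of the same policy and role looks like the following. The policy name, role name, account ID, and crawler ARN are placeholders; in practice you may also attach AWSLambdaBasicExecutionRole so the function can write CloudWatch Logs.

import json
import boto3

iam = boto3.client('iam')

# Policy that allows only glue:StartCrawler on the specific crawler
policy = iam.create_policy(
    PolicyName='StartCrawlerPolicy',
    PolicyDocument=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Action': 'glue:StartCrawler',
            'Resource': 'arn:aws:glue:us-east-1:111122223333:crawler/Crawl-S3-DataSync',
        }],
    }),
)

# Role that Lambda can assume, with the policy attached
iam.create_role(
    RoleName='StartCrawlerLambdaRole',
    AssumeRolePolicyDocument=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Principal': {'Service': 'lambda.amazonaws.com'},
            'Action': 'sts:AssumeRole',
        }],
    }),
)
iam.attach_role_policy(RoleName='StartCrawlerLambdaRole', PolicyArn=policy['Policy']['Arn'])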

Creating a Lambda function

  1. Go to the Lambda console and select Create function. Choose Author from scratch and provide a function name. Choose Python 3.9 as the runtime. Expand Change default execution role and select Use an existing role. Select the role created in the previous section, and then select Create function.
  2. The function details page appears. Copy the following code into the Code source section, change the value assigned to the variable glueCrawlerName to the name of the AWS Glue crawler created previously (in our case it is Crawl-S3-DataSync), and select Deploy. The changes are deployed and the Lambda function is ready for use.
import boto3

# AWS Glue client used to start the crawler
client = boto3.client('glue')

# Name of the AWS Glue crawler to start; update this to match your crawler
glueCrawlerName = "Crawl-S3-DataSync"

# Lambda handler invoked by the Amazon S3 event notification
def lambda_handler(event, context):
    try:
        # Start the Glue crawler so newly uploaded files are cataloged
        response = client.start_crawler(Name=glueCrawlerName)
        print('## STARTED GLUE CRAWLER: ' + glueCrawlerName)
        print(response)
    except Exception as exp:
        print(f'## ERROR ## \n\n {exp}')

Figure 18: Code source section with the Glue crawler Lambda code
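
To sanity-check the handler before wiring up the S3 trigger, you can invoke it locally with a hand-written event. The bucket and key below are hypothetical; the handler only needs the event to exist, since it always starts the same crawler.

# Hypothetical S3 "object created" event for a quick local test of lambda_handler
sample_event = {
    'Records': [{
        'eventName': 'ObjectCreated:Put',
        's3': {
            'bucket': {'name': 'my-trans-reports-bucket'},
            'object': {'key': 'incoming/transactions-2022-10-09.csv'},
        },
    }]
}

# With valid AWS credentials and an existing crawler, this prints the STARTED message
lambda_handler(sample_event, None)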

Create an Amazon S3 Event

In this section, we create an Amazon S3 event to trigger the Lambda function every time a new file is uploaded in the Amazon S3 bucket prefix.

  1. Go to the Amazon S3 console and select your applicable S3 bucket name to open it. Go to the Properties tab and scroll down to the Event notifications section. Select Create event notification. On the Create event notification page, give the event a name and specify the prefix.

Figure 19 Properties tab with option to create event notification

  2. In the Event types section, select the All object create events checkbox, and scroll to the bottom of the page.
  3. Choose the Destination as Lambda function, and from the dropdown choose the Lambda function created previously. Select Save changes. The event notification is now created.

Figure 20 Destination tab with Lambda function
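
The same notification can be configured programmatically. The sketch below first grants Amazon S3 permission to invoke the function (the console adds this permission automatically) and then attaches a prefix-filtered notification; the function name, account ID, bucket, and prefix are placeholders. Note that put_bucket_notification_configuration replaces the bucket's existing notification configuration.

import boto3

lambda_client = boto3.client('lambda')
s3 = boto3.client('s3')

# Allow the S3 bucket to invoke the Lambda function
lambda_client.add_permission(
    FunctionName='start-glue-crawler',    # placeholder function name
    StatementId='AllowS3Invoke',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::my-trans-reports-bucket',
)

# Invoke the function for every object created under the prefix
s3.put_bucket_notification_configuration(
    Bucket='my-trans-reports-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'Id': 'new-file-start-crawler',
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:111122223333:function:start-glue-crawler',
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [{'Name': 'prefix', 'Value': 'incoming/'}]}},
        }]
    },
)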

Modify the DataSync task frequency

Go to the DataSync console, select the task ID in the task list, and then select Edit. Scroll to the Schedule section, choose a Frequency of Daily, and specify the time. Select Save changes.

When DataSync tasks use an Amazon S3 location, consider the Amazon S3 request costs incurred by the scheduled task.
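
The schedule can also be applied to the existing task with boto3; the task ARN is a placeholder, and the cron expression below runs the task daily at 02:00 UTC as an example.

import boto3

datasync = boto3.client('datasync')

# Switch the existing task from "not scheduled" to a daily run at 02:00 UTC
datasync.update_task(
    TaskArn='arn:aws:datasync:us-east-1:111122223333:task/task-EXAMPLE',    # placeholder task ARN
    Schedule={'ScheduleExpression': 'cron(0 2 * * ? *)'},
)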

Figure 21: Bar graph depicting total sales by country code and transaction date

This figure shows the updated dashboard after one more data file, dated 09 Oct 22, was added.

The fully automated solution setup is now complete. The DataSync task runs as scheduled and automatically migrates new files from on premises to Amazon S3. This triggers the Lambda function, which starts the AWS Glue crawler, and the data is updated in the AWS Glue Catalog. The changes can then be seen on the QuickSight dashboard.

Cleaning up

To avoid recurring charges, complete the following steps:

  1. Delete the AWS DataSync agent, locations, and task.
  2. Delete the Amazon S3 event notification.
  3. Delete the AWS Lambda function.
  4. Delete the AWS Glue crawler.
  5. Delete the AWS Glue Catalog database.
  6. Delete the dashboard and dataset from QuickSight.

Conclusion

In this blog, we used AWS DataSync to move data from on premises to Amazon S3, which can scale storage resources to meet fluctuating needs with 99.999999999% (11 9s) of data durability. This solution enables you to simply manage data movement with bandwidth throttling, migration scheduling, and task filtering. Furthermore, you can store data across Amazon S3 storage classes to reduce costs without upfront investment or hardware refresh cycles.

For visualizations, we used QuickSight. With QuickSight, all users can meet varying analytic needs from the same source of truth through modern interactive dashboards, paginated reports, embedded analytics, and natural language queries. Since QuickSight is serverless, it automatically scales to tens of thousands of users without the need to set up, configure, or manage your own servers.

Visualizations streamline the analysis of on-premises data and unlock the full potential of each dataset. With the right visualizations, you can make better-informed business decisions, perform deeper cost analysis, and more clearly understand trends.

Using this automated solution lets you migrate your on-premises data and BI workloads to the AWS Cloud rapidly, securely, and without manual intervention. It supports modern, interactive, and fast dashboards that scale seamlessly as the data load increases.


Abhimanyu Chhabra

Abhimanyu is a Consultant with AWS Professional Services. He helps customers in building well-architected solutions and design patterns that support business innovation and accelerate their cloud journey.

Anukriti Gupta

Anukriti Gupta is a database consultant with AWS Professional Services. She works closely with customers to assist, enable, and accelerate their cloud journey by solving operational issues, removing blockers, providing recommendations, educating on best practices, testing solutions and patterns, and automating the delivery process to migrate customer database workloads onto AWS.