AWS Marketplace
Automate ingesting updates from third-party data providers with AWS Data Exchange
As companies consume third-party data to help make business decisions, managing updates of that data can become complex and time-consuming. Auto-export, a recently released feature from AWS Data Exchange, makes data ingestion faster and easier for AWS customers.
In this blog post, we will show you how to configure auto-export in AWS Data Exchange to seamlessly incorporate updates to your third-party data into your data pipelines. Customers like Nextdoor, a neighborhood network dedicated to bringing communities closer together, have successfully used this feature to improve their end-to-end data ingestion pipeline. We provide step-by-step instructions that technical and nontechnical users can follow to set up the auto-export pipeline in minutes.
AWS Data Exchange background and concepts
AWS Data Exchange simplifies the third-party data licensing and procurement process by making it easy to find, subscribe to, and use third-party data in the cloud. For data subscribers, AWS Data Exchange provides access to thousands of third-party data products from hundreds of qualified data providers in a single catalog in AWS Marketplace.
Products are the unit of exchange in AWS Data Exchange. A product is a package of data sets that a provider publishes and others subscribe to. A product can contain one or more data sets.
A data set is a series of one or more revisions. Revisions are containers for one or more assets. Assets are the actual pieces of data that can be stored as Amazon S3 objects.
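The same hierarchy is visible through the AWS Data Exchange API. As a rough sketch (assuming the boto3 SDK, the us-east-1 Region, and at least one active subscription in your account), the following walks from your entitled data sets down to their revisions and assets:

import boto3

# Walk the data set -> revision -> asset hierarchy for entitled data.
dx = boto3.client("dataexchange", region_name="us-east-1")

# Data sets you are entitled to through your product subscriptions.
for data_set in dx.list_data_sets(Origin="ENTITLED")["DataSets"]:
    print(f"Data set: {data_set['Name']} ({data_set['Id']})")

    # A data set is a series of revisions...
    for revision in dx.list_data_set_revisions(DataSetId=data_set["Id"])["Revisions"]:
        print(f"  Revision {revision['Id']} created at {revision['CreatedAt']}")

        # ...and each revision contains one or more assets (the actual data).
        assets = dx.list_revision_assets(
            DataSetId=data_set["Id"], RevisionId=revision["Id"]
        )["Assets"]
        for asset in assets:
            print(f"    Asset: {asset['Name']}")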
Feature overview
Today on AWS Data Exchange, you receive updated data such as stock prices from yesterday’s market close from your third-party data providers in the form of new revisions. You export these updates into your Amazon S3 buckets, incorporating the changes into your downstream systems. With auto-export, you can “set-and-forget” your export preferences, and AWS Data Exchange automates the delivery of any new revisions to the S3 buckets you specify. Refer to the following architecture diagram for a visualization of this workflow.
Key points:
- You can specify up to five S3 bucket destinations per data set. This makes it easier to manage data access across teams using separate S3 buckets.
- You can specify key naming patterns for each auto-export configuration, which enables you to control the folder structure of the data as it is delivered to your bucket.
- You receive a notification at your root account email address for auto-export errors that require you to take action; for example, if the destination bucket was deleted or permissions were revoked.
Prerequisites
To set up auto-export successfully, you need the following:
- An AWS account with a subscription to a data set that receives ongoing updates. For this blog post, we will use the AWS Data Exchange Heartbeat product as our example.
- An Amazon S3 bucket with permission granted to AWS Data Exchange to export new revisions on your behalf.
Walkthrough
The following tutorial shows how to set up auto-export for your own third-party data. This video also provides step-by-step instructions for the process.
Step 1: Subscribe to a data set on AWS Data Exchange
- Sign in to your AWS account. Navigate to AWS Marketplace, and search for AWS Data Exchange Heartbeat.
- On the data product page, choose Continue to subscribe.
- At the bottom of the page, choose Subscribe.
- Once the subscription is active, go to the AWS Data Exchange console and choose Entitled data. Choose AWS Data Exchange Heartbeat (Test product) and then select the Heartbeat data set. You can also confirm the entitlement programmatically, as sketched after this list.
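As a minimal sketch of that programmatic check using boto3 (the data set name Heartbeat and the us-east-1 Region are assumptions you may need to adjust), the following looks up the entitled data set and prints its ID, which the API-level examples later in this post need:

import boto3

# Confirm the Heartbeat data set is entitled and capture its ID.
dx = boto3.client("dataexchange", region_name="us-east-1")

heartbeat = next(
    ds
    for ds in dx.list_data_sets(Origin="ENTITLED")["DataSets"]
    if ds["Name"] == "Heartbeat"
)
print(f"Entitled data set ID: {heartbeat['Id']}")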
Step 2: Grant AWS Data Exchange permission to export new revisions into the S3 bucket of your choice
- Navigate to the bucket to which you want to export revisions.
- Select the Permissions tab. In the Bucket policy section, choose Edit.
- Copy the following statement and paste it at the bottom of the bucket policy. This policy grants AWS Data Exchange permission to export data into the given S3 bucket.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "dataexchange.amazonaws.com"
            },
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": "arn:aws:s3:::<BUCKET-NAME>/*",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "<AWS ACCOUNT ID>"
                }
            }
        }
    ]
}
The following screenshot shows the Edit bucket policy page with the previous statement pasted into the bucket policy.
- Choose Save changes.
- Repeat these steps for any bucket you wish to add as a destination for your auto-export jobs. If you manage bucket policies in code, a scripted version of this step is sketched below.
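The following boto3 sketch applies the same statement programmatically; the bucket name and account ID are placeholders you must fill in. Note that put_bucket_policy replaces the entire bucket policy, so in practice you would first fetch any existing policy and merge this statement into it.

import json
import boto3

BUCKET_NAME = "my-adx-destination-bucket"  # placeholder: your destination bucket
ACCOUNT_ID = "111122223333"                # placeholder: your AWS account ID

# Grant AWS Data Exchange permission to export data into the bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "dataexchange.amazonaws.com"},
            "Action": ["s3:PutObject", "s3:PutObjectAcl"],
            "Resource": f"arn:aws:s3:::{BUCKET_NAME}/*",
            "Condition": {"StringEquals": {"aws:SourceAccount": ACCOUNT_ID}},
        }
    ],
}

# Caution: this overwrites any existing bucket policy.
boto3.client("s3").put_bucket_policy(Bucket=BUCKET_NAME, Policy=json.dumps(policy))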
Step 3: Configure auto-export on a data set
You can configure auto-export on an entitled data set at any time during your subscription. In the following procedure, you set up auto-export on the Heartbeat product you subscribed to in Step 1.
To set up auto-export on an entitled data set, do the following:
- In the left-side navigation pane under My subscriptions, choose Entitled data.
- Choose AWS Data Exchange Heartbeat (Test product) and then choose the Heartbeat data set.
- Choose Add auto-export job destination.
- Under Select Amazon S3 bucket folder destination, choose the existing S3 bucket that you’d like your new revisions to go to. By default, the revisions will be exported using {Revision.CreatedAt}/{Asset.Name} as the path name, but you can change this to the key naming pattern of your choice. Refer to the following screenshot for an example of an export revision with the default key naming pattern.
To learn more about key naming patterns for auto-export, see Step 4: Configure key naming patterns for auto-export.
- Set the encryption options you want on the data.
- Review the Amazon S3 pricing section to determine if cross-Region data transfer charges will be applied to your export. To complete your auto-export setup, choose Add bucket destination.
- To verify that the auto-export configuration was successful, navigate to the S3 bucket you selected and check for a 0-byte test file from AWS Data Exchange. The file name will start with _ADX-Test-. You can also create the same configuration programmatically, as sketched below.
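In the AWS Data Exchange API, an auto-export configuration is represented as an event action. The following boto3 sketch creates the same configuration as the console steps above; the data set ID, bucket name, and us-east-1 Region are placeholders, and the key pattern shown is the default.

import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

# Auto-export: whenever the provider publishes a new revision to this
# data set, AWS Data Exchange exports it to the given S3 bucket.
response = dx.create_event_action(
    Event={"RevisionPublished": {"DataSetId": "DATA_SET_ID"}},  # placeholder
    Action={
        "ExportRevisionToS3": {
            "RevisionDestination": {
                "Bucket": "my-adx-destination-bucket",  # placeholder
                # Default key naming pattern: one folder per revision,
                # named for the revision's creation time.
                "KeyPattern": "${Revision.CreatedAt}/${Asset.Name}",
            },
            # Server-side encryption with Amazon S3 managed keys.
            "Encryption": {"Type": "AES256"},
        }
    },
)
print(f"Event action ID: {response['Id']}")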
Step 4: Configure key naming patterns for auto-export
By default, revisions are exported using {Revision.CreatedAt}/{Asset.Name} as the path name. This means that each revision automatically exported to your S3 bucket from AWS Data Exchange gets its own folder, named for the time the revision was created, with all of its assets inside.
When setting up an auto-export job for a data set, you can also configure your own S3 key naming pattern using dynamic references, such as Revision ID or Asset Name, which resolve at export time based on attributes of the revision, such as when it was created.
To configure a key naming pattern for your auto-export job, do the following:
- In the left-side navigation pane under My subscriptions, choose Entitled data.
- Choose AWS Data Exchange Heartbeat (Test product) and then choose the Heartbeat data set.
- Choose Add auto-export job destination and choose Advanced.
- Select the bucket that you want your new revisions to go into.
- Select a Dynamic reference and choose Append to create your key naming pattern. You must include either Asset.Name or Asset.Id as part of your key naming pattern.
- Review the Amazon S3 pricing section to determine if cross-Region data transfer charges will be applied to your exports.
- Choose Add bucket destination.
You will see the details of your job appear in the Auto-export job destinations section. Each new revision will be added to a folder according to the year, month, and day it was published, as well as the asset name chosen by the provider.
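The equivalent API call differs from the one in Step 3 only in its key naming pattern. As a sketch (placeholders as before), a year/month/day pattern built from dynamic references looks like this:

import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

# Custom key naming pattern: one folder per year/month/day of publication,
# with the provider-chosen asset name as the object name. The pattern must
# include either ${Asset.Name} or ${Asset.Id}.
dx.create_event_action(
    Event={"RevisionPublished": {"DataSetId": "DATA_SET_ID"}},  # placeholder
    Action={
        "ExportRevisionToS3": {
            "RevisionDestination": {
                "Bucket": "my-adx-destination-bucket",  # placeholder
                "KeyPattern": (
                    "${Revision.CreatedAt.Year}/${Revision.CreatedAt.Month}/"
                    "${Revision.CreatedAt.Day}/${Asset.Name}"
                ),
            },
            "Encryption": {"Type": "AES256"},
        }
    },
)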
Congratulations! You have created an auto-export job. Any data updates the provider publishes to your data set will appear in your specified bucket.
Adding additional auto-export jobs to a data set
You can add up to five auto-export jobs on a single data set. To add additional jobs, choose Actions and then Add auto-export job destination in the Auto-export job destinations section.
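To see which jobs are already configured on a data set, and how close you are to the five-job limit, you can list its event actions. A minimal sketch, with the same placeholders as before:

import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

# List every auto-export job configured for a data set (at most five).
actions = dx.list_event_actions(EventSourceId="DATA_SET_ID")["EventActions"]
for action in actions:
    dest = action["Action"]["ExportRevisionToS3"]["RevisionDestination"]
    print(f"{action['Id']}: s3://{dest['Bucket']}/{dest.get('KeyPattern', '')}")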
Remove an auto-export job
To remove an auto-export job, select the job you want to remove. Then open the Actions menu in the Auto-export job destinations section and choose Remove auto-export job destination.
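Programmatically, removing a job means deleting its event action. A minimal sketch, where the event action ID is a placeholder taken from create_event_action or list_event_actions:

import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

# Stop auto-exporting to this destination; already exported data remains.
dx.delete_event_action(EventActionId="EVENT_ACTION_ID")  # placeholder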
Conclusion
AWS Data Exchange now enables you to automate the delivery of new third-party data updates. In this post, we showed you how to configure auto-export jobs for your data sets, including how to update your Amazon S3 bucket policies. To learn more about the service, see AWS Data Exchange.
About the authors
Anna Whitehouse is a Product Manager with AWS Data Exchange based out of New York, NY. Before joining AWS, Anna spent five years in financial services working with customers to solve business problems using third-party data.
Kanchan Waikar is a Senior Specialist Solutions Architect at Amazon Web Services with AWS Data Exchange. She has over 14 years of experience building, architecting, and managing natural language processing (NLP) and software development projects. She has a master's degree in computer science (data science major) and enjoys helping customers build solutions backed by AI/ML-based AWS services and partner solutions.