Unify Analytics Leveraging Amazon Athena and Teradata for Robust Query Federation

By Claire Omiyi, Product Manager – Teradata
By Divyesh Sah, Director for Cloud and AI Architecture – Teradata
By Doug Mbaya, Sr. Partner Solutions Architect – AWS

Teradata

The Amazon Athena Teradata Connector enables Athena to run SQL queries on data stored in Teradata Vantage. It’s comprised of two AWS Lambda functions—one for metadata and one for record reading.

Federated query enables SQL execution across data stored in multiple data sources. Amazon Athena uses data source connectors to run federated queries. When a query is submitted against a data source, Athena invokes the corresponding connector to identify parts of the tables that need to be read, manages parallelism, and pushes down filter predicates.

This post describes the minimum steps required to query data in Teradata Vantage from Amazon Athena using the Athena Connector for Teradata, and how to run a federated query in Athena with join against Teradata Vantage and Amazon Simple Storage Service (Amazon S3).

Teradata is an AWS Specialization Partner and AWS Marketplace Seller with the Data and Analytics Competency. It’s a cloud-first, enterprise analytics solution that delivering customer insights at scale.

Customer Challenge

A purpose-built analytics engine such as Teradata often sits alongside other datastores and may be part of a data lake architecture. Having a single engine (such as Amazon Athena) that is capable of querying multiple tables across multiple datastores in a secure and federated configuration allows for a highly scalable and decoupled environment, while also reducing data duplication as much as possible.

Using Athena, analysts or machine learning (ML) engineers can now access and centralize data from multiple sources with different time horizons, such as joining analytics data with live transactional data to train a robust model to serve the business need.

About Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. It’s serverless and works with a variety of data formats including CSV, JSON, ORC, Avro, and Parquet.

Athena is ideal for quick, ad-hoc querying but can also handle complex analysis. With Athena, no extract, transform, load (ETL) process is needed.

If you have data in sources other than S3, you can use Athena Federated Query and run SQL queries across data stored in relation, non-relational, object, and custom data sources.

About Teradata Vantage

Teradata Vantage is a platform for pervasive data intelligence and delivering real-time answers to users and systems across all parts of an organization. It leverages 100% of a business’ data, regardless of scale, volume, or complexity.

Vantage combines descriptive, predictive, prescriptive analytics, autonomous decision-making, ML functions, and visualization tools into a unified, integrated platform that uncovers real-time business intelligence at scale, no matter where the data resides.

Vantage enables companies to start small and elastically scale compute or storage, paying only for what they use, harnessing low-cost object stores and integrating their analytic workloads.
Vantage supports R, Python, Teradata Studio, and any other SQL-based tools. You can deploy Vantage across public clouds, on-premises, on optimized or commodity infrastructure, or as-a-service.

Solution Architecture

VantageCloud Enterprise and VantageCloudLake on AWS is fully managed service solution deployed in Teradata-owned AWS account. Customers deploy AWS services in their own account which then connect with the Teradata-managed Vantage AWS accounts using one of the following approved and supported connectivity options between a customer managed AWS account and software-as-a-service (SaaS):

AWS Transit Gateway (TGW)
AWS PrivateLink
AWS Site-to-Site VPN

The following diagram illustrates two connectivity methods: AWS Transit Gateway and AWS PrivateLink. You can find more connectivity options in the Teradata documentation.

Figure 1 – Teradata instances deployed with private IPs (top) and AWS PrivateLink (bottom).

Prerequisites

In this walkthrough, you are expected to be familiar with AWS concepts, Amazon Athena, and Teradata Vantage.

You’ll need the following accounts and systems:

AWS account; you can start with a free account.
Teradata Vantage environment.
Amazon S3 bucket to act as spill bucket to store data that exceeds AWS Lambda function response size limits. More details can be found on the Athena Connector documentation.

Deploy Teradata Connector Using Amazon Athena

Amazon Athena uses data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that can translate between your target data source and Athena. The Athena Teradata Connector is one of the pre-built data source connectors provided by AWS.

To use the Teradata connector with Athena, you must create a Lambda layer that includes the Teradata JDBC driver first.

Create a Lambda layer for the Teradata JDBC driver.
Download the latest Teradata JDBC driver.
Uncompress the tdjdbc4.jar from the downloaded file.
Create a \java\lib folder structure on your local system and place the .jar file in it (for example: \java\lib\terajdbc4.jar).
Create a .zip file of the entire folder structure that contains the terajdbc4.jar file.
Go to the AWS Lambda console and in the navigation pane choose Layers > Create layer.

Figure 2 – Teradata connector deployment.

For Name, enter a name for the layer (for example: TeradataJDBCDriver).
Ensure the Upload a .zip file option is selected.
Choose Upload, and then upload the zipped folder that contains the Teradata JDBC driver.
Choose Create.
On the details page for the layer, copy the layer Amazon Resource Name (ARN) by choosing the clipboard icon at the top of the page.

Figure 3 – Lambda Layers version details.

Save the ARN for later use.
Deploy the Teradata Connector. This can be deployed by using either the Athena console or the AWS Serverless Application Repository. Instructions on how to deploy using the AWS Serverless Application Repository can be found in the AWS documentation.

Note that in this post, we are using Athena Console to deploy the connector.

Go to Athena console.
If the console navigation pane is not visible, choose the expansion menu on the left.
In the navigation pane, choose Data sources under Administration.
Choose Create data source, and for Choose a data source then enter Teradata at the search bar.

Figure 4 – Teradata as a data source for Athena.

Choose Teradata if it’s not already selected, and then click Next.
On the Enter data source details page, enter Data source name (for example: TDVantage).
For Lambda function, choose Create Lambda function to open the AthenaTeradataConnector – version **** page in a different tab. The page includes detailed information about the connector.
Under Application settings, enter following values that correspond to your requirements:
- Application name: AthenaTeradataConnector
- SecretNamePrefix: Vantage-*
- SpillBucket: your S3 bucket (for example: ptctstoutput)
- DefaultConnectionString: teradata://jdbc:teradata://<Vantage Instance Endpoint>/TMODE=ANSI,CHARSET=UTF8,DATABASE=<Database Name>,USER=<Database User Name>,PASSWORD=<Database Password>
Replace the parameters highlighted in yellow with real value:
- LambdaFunctionName: athenavantagetest
- LambdaJDBCLayername: ARN for the Lambda layer created in previous step
- SecurityGroupIds: Security group that should be applied to the Lambda function; preferably the same group as your Vantage instances
- SpillPrefix: athena-spill
- SubnetIds: Subnet IDs corresponding to the subnet the Lambda function can use; preferably the same ID as your Vantage instances
Select I acknowledge that this app creates custom IAM roles.
Choose Deploy which kicks of a serverlessrepo-AthenaTeradataConnector stack.
Check the Deployments tab for the completion of the deployment.

Figure 5 – Lambda connector for Teradata.

If users want to see the Lambda function, first click on the Overview tab, and then the Resources section followed by the Logical ID.

Query Using Amazon Athena

Return to the Create data source page of the Athena console, and select Teradata as the data source if it’s not already there.
Enter the Data source name if not already there (for example: TDVantage).
At Connection details, choose the Lambda function you just created from the drop-down list (for example: athenavantagetest).
Click Next and review details and choose Create data source.

Run a Simple Query Against Vantage

If you don’t have a location for the query results, set it up first”

Click on Query editor from the left navigation panel, and then choose Settings > Manage.
At the Manage settings window, browse S3 to pick the bucket when you want the results to be stored, and then click Save.
Choose Query Editor from the top manual bar.
Select the Data source from the dropdown list (for example: TDVantage).
Choose the Database from the dropdown list (for example: test).
The table(s) under the database should show up in the Tables and views section.
Run queries against the database tables from the Query window. For example, a simple select statement returns all the rows from the employee table.

Figure 6 – Query Teradata from Athena.

Run a Federated Query Against Vantage and Amazon S3

Amazon Athena can run federated queries for data residing in different data sources. In this tutorial, we’ll join the employee table from Teradata Vantage to a customer file on Amazon S3 to retrieve relevant information. Any tabular dataset can be used for this exercise.

Sample customer file in S3 bucket input:

Figure 7 – Sample Amazon S3 file.

Join statement:

SELECT e.associate_id, 
       associate_name, 
       designation, 
       customer_name, 
       billing_street, 
       billing_city, 
       billing_state, 
       priority 
FROM test.employee e, AwsDataCatalog.s3data.input b
WHERE e.associate_id = b.associate_id;

Results:

Figure 8 – Joint query against Teradata and Amazon S3.

Cleanup

To avoid incurring charges to your AWS account for the resources used, follow these steps.

Amazon Athena:

Amazon S3 bucket and file:

Delete spill bucket

Stacks:

Delete serverlessrepo-AthenaTeradataConnector (deployed by the connector’s AWS CloudFormation stack). This will delete the Lambda function.

AWs Lambda:

Delete Lambda layer

Teradata Vantage Instance:

If source table is no longer needed, delete the source table
Stop/Terminate the instance if no longer needed

Conclusion

In this post, we walked through how Amazon Athena is capable of querying multiple tables across datastores, including Teradata, in a secure and federated configuration.

Athena is a serverless application that provides a highly scalable and decoupled query environment that allows users to reduce data duplication and data movement. Together, Athena and Teradata help you take advantage of Teradata’s parallel compute capability to handle complex queries that can be analyzed with other datastores in a serverless configuration.

.

.

Teradata – AWS Partner Spotlight

Teradata is an AWS Specialization Partner and cloud-first, enterprise analytics solution that delivering customer insights at scale.

Contact Teradata | Partner Overview | AWS Marketplace