Query SAP HANA using Athena Federated Query and join with data in your Amazon S3 data lake

This post was last reviewed and updated July, 2022 with updates in Athena federation connector.

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use SAP HANA as your transactional data store, you may need to join the data in your data lake with SAP HANA in the cloud, SAP HANA running on Amazon Elastic Compute Cloud (Amazon EC2), or with an on-premises SAP HANA, for example to build a dashboard or create consolidated reporting.

In such use cases, Amazon Athena Federated Query allows you to seamlessly access the data from your SAP HANA database without building extract, transform, and load (ETL) pipelines to copy or unload the data to the S3 data lake or SAP HANA. This removes the overhead of creating additional ETL processes and shortens the development cycle.

In this post, we walk you through a step-by-step configuration to set up Athena Federated Query using AWS Lambda to access data in an SAP HANA database running on AWS.

For this post, we use the SAP HANA Athena Federated Query connector. You can deploy the Athena Federated Query connector available in the AWS Serverless Application Repository.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface (in our case, Amazon Athena). The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and runs them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn’t fit into Lambda RAM runtime memory, it spills the data to Amazon S3 and is later accessed by Athena.

Athena uses data source connectors that internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the AWS Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

Create a secret for the SAP HANA instance using AWS Secrets Manager.
Create an S3 bucket and subfolder for Lambda to use.
Configure Athena federation with the SAP HANA instance.
Run federated queries with Athena.

Prerequisites

Before getting started, make sure you have an SAP HANA database up and running on AWS. For this post, I’m using the SAP HANA trial version.

Create a secret for the SAP HANA instance

Our first step is to create a secret for the SAP HANA instance with a user name and password using Secrets Manager.

On the Secrets Manager console, choose Secrets.
Choose Store a new secret.
For Secret type, select Other types of secrets.
Set the credentials as key-value pairs (user name, password) for your SAP HANA instance.
1. Key = username, Value = <Your username>
2. Key = password, Value = <Your password>

For Secret name, enter a name for your secret. Use the prefix saphana-athena so it’s easy to find.
Leave the remaining fields at their defaults and choose Next.
Complete your secret creation.

Setting up your S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, we use s3://navnis-athena-federated/saphana.

Configure Athena federation with the SAP HANA instance

To configure Athena federation with your SAP HANA instance, complete the following steps:

On the Athena console, choose Data sources in the navigation pane.
Choose Create data source.

For Choose a data source, search for the SAP HANA connector and choose Next.
For Data source name, provide a name for the data source (for example, saphanadb).
Under Connection details, choose Create Lambda function.

For Application name, keep the default AthenaSaphanaConnector.
For SecretNamePrefix, enter saphana-athena, which you created earlier.
For SpillBucket, enter your S3 bucket name (for example, navnis-athena-federated).
For JDBCConnectorConfig, use the format saphana://jdbc:sap://saphana1.host:port/?${Test/RDS/Saphana1}.

For example: saphana://jdbc:sap://63b80dce-90e5-407b-93f9-7dd0f1d1f156.hana.trial-us10.hanacloud.ondemand.com:443/?${saphana-athena}

For LambdaFunctionName, enter saphanaathenafederation.
For LambdaMemory, keep default 3008.
For LambdaTimeout, keep default 900.
For SecurityGroupID, enter the security group ID that is associated to the VPC ID corresponding to your subnet (a security group with default configuration should work fine).
For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, saphana).
For Subnetids, enter the corresponding subnet that the Lambda function can use to access your data source. For example: subnet1, subnet2.
Please refer the Appendix section to learn more about network setting.
Select the I acknowledge check box.
Choose Deploy.

The deployment takes 3–4 minutes to complete. Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, AWS CloudTrail, Secrets Manager, Lambda, and Athena.7

After the function is created, choose the function name under Lambda function and keep the other options at their default.
Choose Next.
Choose Create Data Source.

Now you will able to see the newly created data source.

Run federated queries with Athena

Before running your federated query, be sure that you have selected Athena engine version 2. The current Athena engine version for any workgroup can be found on the Athena console.

On the Athena query editor, when you first try to refresh the database, you get an error message. Copy the environment variable from this error message.

To fix this issue, complete the following steps:

On the Lambda console, choose Functions in the navigation pane.
Open the function saphanaathenafederation.
Go to Edit environment variables and choose Add environment variable.
Provide the following values:
1. Key – saphanaathenafederation_connection_string
2. Value – saphana://jdbc:sap://63b80dce-90e5-407b-93f9-7dd0f1d1f156.hana.trial-us10.hanacloud.ondemand.com:443/?${saphana-athena}
Choose Save.

Now, when you refresh you will able to see all the tables.

Select SYS from the list and it will list all the tables available under SYS database and run the below query:
```
SELECT * FROM "SYS"."ADAPTERS" limit 10;
```

Querying using Serverless Application Repository

The Lambda function was deployed as part of AWS Serverless Application Repository, so you can run a query using lambda:saphanaathenafederation against tables in the SAP HANA database. This is the name of Lambda function that we created in an earlier step.

lambda:saphanaathenafederation is a reference data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing federated queries.

Key performance best practice considerations

If you’re considering Athena Federated Query with an SAP HANA database, we recommend the following best practices:

Athena federation works great for queries with predicate filtering because the predicates are pushed down to the SAP HANA database. Use filter and limited-range scans in your queries to avoid full table scans.
If your SQL query requires returning a large volume of data from the SAP HANA database to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from SAP HANA to your S3 data lake.
The star schema is a commonly used data model in SAP HANA databases. In the star schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in your SAP HANA database. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the SAP HANA database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your SAP HANA database cluster to benefit from concurrent connections to queue up.

Conclusion

In this post, you learned how to configure and use Athena Federated query with SAP HANA using Lambda. Now you don’t need to wait for all the data in your SAP HANA data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best practice considerations outlined in the post to help minimize the data transferred from SAP HANA for better performance. When queries are well-written for Federated query, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.

Appendix:

This is guidance for setting up the AWS Lambda function that’s connected to an Amazon Virtual Private Cloud (Amazon VPC) to have access to the internet.

Internet access from a private subnet requires network address translation (NAT). To give internet access to an Amazon VPC-connected Lambda function, route its outbound traffic to a NAT gateway or NAT instance in a public subnet.

Provide private subnet IDs in your lambda function.
Private Subnet IDs should have route table with NAT Gateway associated to it.
NAT Gateway should have Public Subnet attached to it.
Public gateway should have route table with Internet gateway attached.

For more information, see Internet gateways in the Amazon VPC User Guide. For an example setup, see VPC with public and private subnets (NAT).

You can find additional details here.

About the Author

Navnit Shukla is AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

AWS Big Data Blog