AWS Big Data Blog

Query your Oracle database using Athena Federated Query and join with data in your Amazon S3 data lake

This post was last reviewed and updated July, 2022 with updates in Athena federation connector.

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Oracle as your transactional data store, you may need to join the data in your data lake with Oracle on Amazon Relational Database Service (Amazon RDS), Oracle running on Amazon Elastic Compute Cloud (Amazon EC2), or an on-premises Oracle database, for example to build a dashboard or create consolidated reporting.

In these use cases, Amazon Athena Federated Query allows you to seamlessly access the data you’re your Oracle database without having to move the data to the S3 data lake. This removes the overhead in managing such jobs.

In this post, we walk you through a step-by-step configuration to set up Athena Federated query using AWS Lambda to access data from Oracle database.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface (Amazon Athena). The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and executes them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn’t fit into Lambda RAM runtime memory, it spills the data to Amazon S3 and is later accessed by Athena.

Athena uses data source connectors which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

  1. Create a secret for the Oracle instance using AWS Secrets Manager.
  2. Create an S3 bucket and subfolder for Lambda to use.
  3. Configure Athena federation with the Amazon RDS for Oracle Database 19c Enterprise Edition
  4. Run federated queries with Athena.

Prerequisites

Before getting started, make sure you have an Oracle database up and running.

Create a secret for the Oracle instance

Our first step is to create a secret for the Snowflake instance with a user name and password using Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. For Secret type, select other types of secret.
  4. Enter the credentials as key-value pairs (username, password) for your Oracle instance.
    • Key = username, Value = <Your username>
    • Key = password, Value = <Your password>

  5. For Secret name, enter a name for your secret. Use the prefix oracle-athena so it’s easy to find.
  6. Leave the remaining fields at their defaults and choose Next.
  7. Complete your secret creation.

Create an S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, we use navnis-athena-oracle-federation/oracle.

Configure Athena federation with the Oracle XE instance

To configure the Athena data source connector for Oracle with your Oracle instance, complete the following steps:

  1. On the Athena console, choose Data sources in the navigation pane.
  2. Choose Create data source.
  3. For Choose a data source, search for the Oracle connector and choose Next.
  4. For Data source name, provide a name for the data source (for example, athena-oracle).
  5. Under Connection details, choose Create Lambda function.
  6. For Application name, keep the default AthenaOracleConnector.
  7. For SecretNamePrefix, enter oracle-athena, which you created earlier.
  8. For SpillBucket, enter your S3 bucket name (for example, navnis-athena-oracle-federation).
  9. For JDBCConnectorConfig, use the format oracle://jdbc:oracle:thin:${secret-name}@//hostname:port/servicename
    For example: oracle://jdbc:oracle:thin:${oracle-athena}@//ora19csource.c7jzerofc1jp.us-east-2.rds.amazonaws.com:1521/ORCL
  10. For LambdaFunctionName, enter oracleathena.
  11. For SecurityGroupID, enter the security group ID that is associated to the VPC ID corresponding to your subnet (a security group with default configuration and HTTPS rule added should work fine).
  12. For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, oracle).
  13. For Subnetids, enter the corresponding subnet that the Lambda function can use to access your data source. For example: subnet1, subnet2.
    Please refer the Appendix section to learn more about network setting.
  14. Select the I acknowledge check box.
  15. Choose Deploy.The deployment takes 3–4 minutes to complete. Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, AWS CloudTrail, Secrets Manager, Lambda, and Athena.For more information, see Example IAM Permissions Policies to Allow Athena Federated Query.
  16. After the function is created, choose the function name under Lambda function and keep the other options at their default.
  17. Choose Next.
  18. Choose Create Data Source.

Now you can see the newly created data source.

Run federated queries with Athena

On the Athena query editor, when you first try to refresh the database, you get an error message. Copy the environment variable from this error message.

To fix this error message, complete the following steps:

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Open the function oracleathena.
  3. Go to Edit environment variables and choose Add environment variable.
  4. Provide the following values:
    1. Keyoracleathena_connection_string
    2. Valueoracle://jdbc:oracle:thin:${oracle-athena}@//ora19csource.c7jzerofc1jp.us-east-2.rds.amazonaws.com:1521/ORCL
  1. Choose Save.

Now when you refresh, you can see all the tables.

  1. Select the database from the list to see all the tables available under the database and run the query.

Query using AWS Serverless Application Repository

The Lambda function was deployed as part of AWS Serverless Application Repository, so you can run a query using lambda:oracleathena against tables in the Snowflake database. This is the name of Lambda function that we created in an earlier step.

lambda:oracleathena is a reference data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing federated queries.

Key performance best practices

If you’re considering Athena Federated query with Oracle, we recommend the following best practices:

  • Athena Federated query works great for queries with predicate filtering because the predicates are pushed down to the Oracle database. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from the Oracle database to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from Oracle to your S3 data lake.
  • The star schema is a commonly used data model in Oracle. In the star schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in Oracle. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Oracle database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Oracle database cluster to benefit from concurrent connections to queue up.

Conclusion

In this post, you learned how to configure and use Athena Federated query with Oracle. Now you don’t need to wait for all the data in your Oracle data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best practice considerations outlined in the post to help minimize the data transferred from Oracle for better performance. When queries are well written for Federated query, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.

Appendix:

This is guidance for setting up the AWS Lambda function that’s connected to an Amazon Virtual Private Cloud (Amazon VPC) to have access to the internet.

Internet access from a private subnet requires network address translation (NAT). To give internet access to an Amazon VPC-connected Lambda function, route its outbound traffic to a NAT gateway or NAT instance in a public subnet.

  1. Provide private subnet IDs in your lambda function.
  2. Private Subnet IDs should have route table with NAT Gateway associated to it.
  3. NAT Gateway should have Public Subnet attached to it.
  4. Public gateway should have route table with Internet gateway attached.

For more information, see Internet gateways in the Amazon VPC User Guide. For an example setup, see VPC with public and private subnets (NAT).

You can find additional details here.


About the Author

Navnit Shukla is AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.