Amazon Athena for SQL
- Unified metadata repository: AWS Glue is integrated across various AWS services. AWS Glue supports data stored in Amazon Aurora, Amazon Relational Database Service (RDS) for MySQL, Amazon RDS for PostgreSQL, Amazon Redshift, and S3, as well as MySQL and PostgreSQL databases in your Amazon Virtual Private Cloud (VPC) running on Amazon Elastic Compute Cloud (EC2). AWS Glue provides out-of-the-box integration with Athena, Amazon EMR, Amazon Redshift Spectrum, and any application compatible with Apache Hive metastore.
- Automatic schema and partition recognition: AWS Glue automatically crawls your data sources, identifies data formats, and suggests schemas and transformations. Crawlers can help automate table creation and automatic loading of partitions.
Creating tables, data formats, and partitions
- Apache Web Logs: "org.apache.hadoop.hive.serde2.RegexSerDe"
- CSV: "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
- TSV: "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
- Custom Delimiters: "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
- Parquet: "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
- ORC: "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
- JSON: "org.apache.hive.hcatalog.data.JsonSerDe" or "org.openx.data.jsonserde.JsonSerDe"
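As a rough illustration of where one of these SerDe class names ends up, the sketch below assembles a Hive-style CREATE EXTERNAL TABLE statement for JSON data. The helper `build_create_table`, the table name, the columns, and the S3 location are all hypothetical, and the sketch only covers text-based formats where naming the SerDe is enough.

```python
# SerDe class name taken from the JSON entry in the list above.
JSON_SERDE = "org.openx.data.jsonserde.JsonSerDe"


def build_create_table(table, columns, serde, location):
    """Assemble a CREATE EXTERNAL TABLE statement (hypothetical helper)."""
    cols = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT SERDE '{serde}'\n"
        f"LOCATION '{location}'"
    )


ddl = build_create_table(
    "web_events",                                        # hypothetical table name
    [("event_id", "string"), ("event_time", "string")],  # hypothetical columns
    JSON_SERDE,
    "s3://example-bucket/events/",                       # hypothetical S3 location
)
print(ddl)
```

The resulting string could be pasted into the Athena query editor or submitted through the API.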
Querying and data formats
You can write UDFs in Java using the Athena Query Federation SDK. When a UDF is used in a SQL query submitted to Athena, it is invoked and run on AWS Lambda. UDFs can be used in both SELECT and FILTER clauses of a SQL query. You can invoke multiple UDFs in the same query.
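The following sketch shows what a query that calls a UDF might look like, built here as a Python string. It assumes the `USING EXTERNAL FUNCTION ... LAMBDA` form from the Athena User Guide; the UDF name `redact`, the Lambda function name, and the table and column names are hypothetical.

```python
def build_udf_query(udf_name, lambda_name, table):
    """Return a query that calls one Lambda-backed UDF in SELECT and WHERE."""
    return (
        f"USING EXTERNAL FUNCTION {udf_name}(value VARCHAR) RETURNS VARCHAR\n"
        f"LAMBDA '{lambda_name}'\n"                      # hypothetical Lambda function
        f"SELECT customer_id, {udf_name}(email) AS email_redacted\n"
        f"FROM {table}\n"
        f"WHERE {udf_name}(notes) <> notes"              # UDF used as a filter
    )


query = build_udf_query("redact", "my-udf-lambda", "customers")
print(query)
```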
Q: What is a federated query?
If you have data in sources other than S3, you can use Athena to query the data in place or build pipelines that extract data from multiple data sources and store them on S3. With Athena Federated Query, you can run SQL queries across data stored in relational, nonrelational, object, and custom data sources.
Q: Why should I use federated queries in Athena?
Organizations often store data in a data source that meets the needs of their applications or business processes. These can include relational, key-value, document, in-memory, search, graph, time-series, and ledger databases in addition to storing data in an S3 data lake. Performing analytics on such diverse sources can be complex and time-consuming because it typically requires learning new programming languages or database constructs and building complex pipelines to extract, transform, and duplicate data before it can be used for analysis. Athena reduces this complexity by allowing you to run SQL queries on the data where it is. You can use well-known SQL constructs to query data across multiple data sources for quick analysis, or use scheduled SQL queries to extract and transform data from multiple data sources and store it on S3 for further analysis.
Q: Which data sources are supported?
Athena provides built-in connectors to several popular data stores, including Amazon Redshift and Amazon DynamoDB. You can use these connectors to enable SQL analytics use cases on structured, semi-structured, object, graph, time-series, and other data storage types. For a list of supported sources, review the Amazon Athena User Guide: Using Athena Data Source Connectors.
You can also use Athena’s data connector SDK to create a custom data source connector and query it with Athena. Get started by reviewing the documentation and example connector implementation.
Q: Which use cases does federated query enable?
With Athena, you can use your existing SQL knowledge to extract insights from various data sources without learning a new language, developing scripts to extract (and duplicate) data, or managing infrastructure. Using Amazon Athena, you can perform the following tasks:
- Run on-demand analysis on data spread across multiple data stores using a single tool and SQL dialect.
- Visualize data in BI applications that push complex, multisource joins down to Athena’s distributed compute engine over ODBC and JDBC interfaces.
- Design self-service ETL pipelines and event-based data-processing workflows with Athena integration with AWS Step Functions.
- Unify diverse data sources to produce rich input features for ML model-training workflows.
- Develop user-facing data-as-a-product applications that surface insights across data mesh architectures.
- Support analytics use cases while your organization migrates on-premises sources to AWS.
Q: Can I use federated query for ETL?
Athena saves query results to a file on S3. This means you can use Athena to make federated data available to other users and applications. If you want to perform analysis on the data using Athena without repeatedly querying the underlying source, use the Athena CREATE TABLE AS SELECT (CTAS) statement. You can also use the Athena UNLOAD statement to query the data and store the results in a specific file format on S3.
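A minimal sketch of the two statements mentioned above, built as Python strings. The helper names, the federated catalog `ddb_catalog`, the table names, and the S3 locations are hypothetical; the `WITH (...)` properties follow the forms described in the Athena User Guide.

```python
def build_ctas(new_table, select_sql, location, data_format="PARQUET"):
    """CREATE TABLE AS SELECT: materialize query results as a new table on S3."""
    return (
        f"CREATE TABLE {new_table}\n"
        f"WITH (format = '{data_format}', external_location = '{location}')\n"
        f"AS {select_sql}"
    )


def build_unload(select_sql, location, data_format="PARQUET"):
    """UNLOAD: write query results to S3 in a chosen format, without a table."""
    return f"UNLOAD ({select_sql})\nTO '{location}'\nWITH (format = '{data_format}')"


# Hypothetical federated source registered as the catalog "ddb_catalog".
source_sql = 'SELECT order_id, total FROM "ddb_catalog"."default"."orders"'

ctas = build_ctas("orders_snapshot", source_sql, "s3://example-bucket/snapshots/orders/")
unload = build_unload(source_sql, "s3://example-bucket/exports/orders/")
print(ctas)
print(unload)
```

CTAS registers a table you can keep querying from Athena, while UNLOAD only writes files, which may suit hand-offs to other applications.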
Q: How do data source connectors work?
A data source connector is a piece of code that runs on Lambda that translates between your target data source and Athena. When you use a data source connector to register a data store with Athena, you can run SQL queries on federated data stores. When a query runs on a federated source, Athena calls the Lambda function and tasks it with running the parts of your query that are specific to the federated source. To learn more, review the Amazon Athena User Guide: Using Amazon Athena Federated Query.
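Once a connector is registered as a catalog, its tables can be referenced alongside S3-backed tables in one query. The sketch below builds such a join as a Python string; the catalog name `ddb_catalog`, the databases, tables, and columns are hypothetical, and `awsdatacatalog` is assumed to be the name of the default Data Catalog.

```python
def qualified(catalog, database, table):
    """Quote a three-part catalog.database.table reference."""
    return f'"{catalog}"."{database}"."{table}"'


s3_orders = qualified("awsdatacatalog", "sales", "orders")         # S3-backed table
ddb_customers = qualified("ddb_catalog", "default", "customers")   # hypothetical federated source

join_sql = (
    f"SELECT o.order_id, c.segment\n"
    f"FROM {s3_orders} AS o\n"
    f"JOIN {ddb_customers} AS c ON o.customer_id = c.customer_id"
)
print(join_sql)
```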
Q: Which use cases does Athena support for embedded ML?
Athena use cases for ML span different industries, as in the following examples. Financial risk data analysts can run what-if analysis and Monte Carlo simulations. Business analysts might run linear regression or forecasting models to predict future values to help them create richer and forward-looking business dashboards that forecast revenues. Marketing analysts can use k-means clustering models to help determine their different customer segments. Security analysts can use logistic regression models to find anomalies and detect security incidents from logs.
Q: Which ML models can be used with Athena?
Athena can invoke any ML model that is deployed on SageMaker. You have the flexibility to train your own model using your proprietary data, or use a model that is pretrained and deployed on SageMaker. For example, cluster analysis would likely be trained on your own data because you want to categorize new records into the same categories that you used for previous records. Alternatively, for predicting real-world sports events, you could use a publicly available model because the training data used would be in the public domain already. Domain-specific or industry-specific predictions will typically be trained on your own data in SageMaker, while undifferentiated ML needs might use external models.
Q: Can I train my ML model using Athena?
No. You cannot use Athena to train or deploy ML models on SageMaker. Instead, train your ML model in SageMaker, or use an existing pretrained model that is deployed on SageMaker, and then invoke it from Athena. Read the documentation detailing training steps on SageMaker.
Q: Can I run inference on models deployed on other services such as Comprehend, Forecasting, or models deployed on my own EC2 cluster?
Athena only supports invoking ML models deployed on SageMaker. We welcome feedback on which other services you want to use with Athena. Email your feedback to email@example.com.
Q: What are the performance implications of using Athena queries for SageMaker inference?
Operational performance improvements are constantly being added to our features and services. To enhance performance of your Athena ML queries, rows are batched when invoking your SageMaker ML model for inference. At this time, user-provided row batch size overrides are not supported.
Q: Which features does Athena ML support?
Athena offers ML inference (prediction) capabilities wrapped by a SQL interface. You can also call an Athena UDF to invoke pre- or post-processing logic on your result set. Inputs can include any column, record, or table, and multiple calls can be batched together for higher scalability. You can run inference in the Select phase or in the Filter phase. To learn more, refer to the Amazon Athena User Guide: Using Machine Learning (ML) with Amazon Athena.
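To make the Select-phase and Filter-phase usage concrete, here is a sketch of an inference query built as a Python string. It assumes the `USING EXTERNAL FUNCTION ... SAGEMAKER` form from the Athena User Guide; the function name `predict_risk`, the endpoint name, the table, and the columns are hypothetical.

```python
def build_inference_query(function_name, endpoint, table, threshold):
    """Return a query that calls a SageMaker-backed function in SELECT and WHERE."""
    return (
        f"USING EXTERNAL FUNCTION {function_name}(amount DOUBLE, item_count INTEGER)\n"
        f"RETURNS DOUBLE\n"
        f"SAGEMAKER '{endpoint}'\n"                                    # hypothetical endpoint
        f"SELECT order_id, {function_name}(amount, item_count) AS score\n"
        f"FROM {table}\n"
        f"WHERE {function_name}(amount, item_count) > {threshold}"    # filter on the prediction
    )


ml_query = build_inference_query("predict_risk", "risk-endpoint", "orders", 0.8)
print(ml_query)
```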
Q: Which ML models can I use?
SageMaker supports various ML algorithms. You can also create your proprietary ML model and deploy it on SageMaker. For example, cluster analysis would likely be trained on your own data because you want to categorize new records into the same categories that you used for previous records. Alternatively, for predicting real-world sports events, you could use a publicly available model because the training data used would be in the public domain.
We expect that domain- or industry-specific predictions will typically be trained on your own data in SageMaker, while undifferentiated ML needs such as machine translation will use external models.
Security and availability
Q: How do I control access to my data?
Amazon Athena supports fine-grained access control with AWS Lake Formation. Lake Formation lets you centrally manage permissions and access control for Data Catalog resources in your S3 data lake. You can enforce fine-grained access control policies in Athena queries for data stored in any supported file format using table formats such as Apache Iceberg, Apache Hudi, and Apache Hive. You have the flexibility to choose the table and file format best suited to your use case while benefiting from centralized data governance to secure data access when using Athena. For example, you can use the Iceberg table format to store data in your S3 data lake for reliable write transactions at scale, together with row-level security filters in Lake Formation, so that data analysts residing in different countries get access only to data for customers located in their own country, in line with regulatory requirements. The expanded support for table and file formats does not require any change in how you set up fine-grained access control policies in Lake Formation. It requires Athena engine version 3, which offers new features and improved query performance.
Athena also allows you to control access to your data by using AWS Identity and Access Management (IAM) policies, access control lists (ACLs), and S3 bucket policies. With IAM policies, you can grant IAM users fine-grained control over your S3 buckets. By controlling access to data on S3, you can restrict users from querying it using Athena.
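As one illustration of the IAM approach, the sketch below builds a policy document that allows reading a single prefix of a bucket. The helper `read_only_policy`, the bucket name, and the prefix are hypothetical, and the policy covers only the S3 read side; a real Athena user would also need permissions for Athena itself, the Data Catalog, and the query results location.

```python
import json


def read_only_policy(bucket, prefix):
    """Build an IAM policy document limited to one S3 prefix (illustrative only)."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # read objects under the prefix
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {   # list only keys under the prefix
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


policy = read_only_policy("example-bucket", "sales/eu/")   # hypothetical names
print(json.dumps(policy, indent=2))
```

A user holding only this policy could not read objects outside `sales/eu/`, so Athena queries over tables stored elsewhere in the bucket would fail for that user.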
Q: Can Athena query encrypted data in S3?
Yes, you can query data that’s encrypted using server-side encryption (SSE) with S3-managed encryption keys, SSE with AWS Key Management Service (KMS)–managed keys, and client-side encryption (CSE) with keys managed by AWS KMS. Athena also integrates with AWS KMS and provides you with an option to encrypt your result sets.
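The option to encrypt result sets is typically expressed as a result configuration passed along with the query. The sketch below builds that structure as a plain dictionary, assuming the `OutputLocation` and `EncryptionConfiguration` fields of the Athena API; the helper name, bucket, and KMS key ARN are hypothetical.

```python
KMS_OPTIONS = {"SSE_KMS", "CSE_KMS"}       # options that need a KMS key
ALL_OPTIONS = KMS_OPTIONS | {"SSE_S3"}


def result_configuration(output_location, option="SSE_S3", kms_key=None):
    """Build a result configuration dict with encryption settings."""
    if option not in ALL_OPTIONS:
        raise ValueError(f"unknown encryption option: {option}")
    if option in KMS_OPTIONS and not kms_key:
        raise ValueError(f"{option} requires a KMS key")
    encryption = {"EncryptionOption": option}
    if option in KMS_OPTIONS:
        encryption["KmsKey"] = kms_key
    return {"OutputLocation": output_location, "EncryptionConfiguration": encryption}


config = result_configuration(
    "s3://example-bucket/athena-results/",                        # hypothetical bucket
    option="SSE_KMS",
    kms_key="arn:aws:kms:us-east-1:111122223333:key/example-id",  # hypothetical key ARN
)
print(config)
```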
Q: Is Athena highly available?
Yes. Athena is highly available and runs queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable. Athena uses S3 as its underlying data store, making your data highly available and durable. S3 provides durable infrastructure to store important data. Your data is redundantly stored across multiple facilities and multiple devices in each facility.
Q: Can I provide cross-account access to someone else’s S3 bucket?
Yes, you can provide cross-account access to S3.
Pricing and billing
Q: How is Athena priced?
Athena is priced per query and charges based on the amount of data scanned by the query. You can store data in various formats on S3. If you compress or partition your data, or convert it to columnar storage formats, you pay less because you scan less data. Converting data to a columnar format allows Athena to read only the columns it needs to process the query. For more details, review the Amazon Athena pricing page.
Q: Why do I get charged less when I use a columnar format?
Athena charges you for the amount of data scanned per query. Compressing your data allows Athena to scan less data. Converting your data to columnar formats allows Athena to selectively read only required columns to process the data. Partitioning your data also allows Athena to restrict the amount of data scanned. This leads to cost savings and improved performance. For more details, review the Amazon Athena pricing page.
Q: How do I lower my costs?
You can save 30% to 90% on your query costs and get better performance by compressing, partitioning, and converting your data into columnar formats. Each of these operations reduces the amount of data that Athena must scan to run a query. Athena supports Parquet and ORC, two of the most popular open-source columnar formats. You can see the amount of data scanned per query on the Athena console.
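A back-of-the-envelope sketch of why scanning less data lowers the bill. The price per TB below is a placeholder rather than a quoted rate, 1 TB is taken as 10^12 bytes, all columns are assumed to be the same size, and any minimum charge or rounding is ignored; check the pricing page for actual terms.

```python
PRICE_PER_TB = 5.00        # placeholder rate, not taken from this FAQ
BYTES_PER_TB = 10 ** 12    # assumption: decimal terabytes


def scan_cost(bytes_scanned, price_per_tb=PRICE_PER_TB):
    """Cost of one query, proportional to the bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


raw_bytes = 1 * BYTES_PER_TB     # hypothetical uncompressed row-oriented data set
compression_ratio = 0.5          # hypothetical: columnar files are half the size
columns_read = 5 / 20            # query touches 5 of 20 equally sized columns

columnar_bytes = raw_bytes * compression_ratio * columns_read
raw_cost = scan_cost(raw_bytes)
columnar_cost = scan_cost(columnar_bytes)
savings = 1 - columnar_cost / raw_cost

print(f"raw: {raw_cost:.2f}, columnar: {columnar_cost:.3f}, savings: {savings:.1%}")
```

Partitioning would reduce `raw_bytes` further by letting the query skip whole partitions.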
Q: Does Athena charge me for failed queries?
No, you are not charged for failed queries.
Q: Does Athena charge me for canceled queries?
Yes. If you cancel a query manually, you are charged for the amount of data scanned up to the point at which you canceled the query.
Q: Are there any additional charges associated with Athena?
Athena queries data directly from S3, so your source data is billed at S3 rates. When Athena runs a query, it stores the results in an S3 bucket of your choice. You are then billed at standard S3 rates for these result sets. It is recommended that you monitor these buckets and use lifecycle policies to control how much data gets retained.
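One way to apply the lifecycle recommendation is an expiration rule on the prefix where query results land. The sketch below builds such a rule as a dictionary in the shape S3 lifecycle configurations use; the helper name, the prefix, and the 30-day retention period are hypothetical choices.

```python
def results_lifecycle(prefix, days):
    """Build an S3 lifecycle configuration that expires old query results."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Rules": [
            {
                "ID": f"expire-athena-results-{days}d",
                "Filter": {"Prefix": prefix},   # only objects under this prefix
                "Status": "Enabled",
                "Expiration": {"Days": days},   # delete objects after this many days
            }
        ]
    }


lifecycle = results_lifecycle("athena-results/", 30)   # hypothetical prefix and retention
print(lifecycle)
```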
Q: Will I be charged for using Data Catalog?
Yes. You are charged separately for using the Data Catalog. To learn more about Data Catalog pricing, review the AWS Glue pricing page.
Amazon Athena for Apache Spark
Q: What is Amazon Athena for Apache Spark?
Athena supports the Apache Spark framework, giving data analysts and data engineers the interactive, fully managed experience of Athena. Apache Spark is a popular open-source, distributed processing system that is optimized for fast analytics workloads against data of any size and offers a rich system of open-source libraries. You can now build Spark applications in expressive languages, such as Python, using a simplified notebook experience in the Athena console or through Athena APIs. You can query data from various sources, chain together multiple calculations, and visualize the results of your analyses. For interactive Spark applications, you spend less time waiting and are more productive because Athena starts running applications in under a second. You get a simplified and purpose-built Spark experience that minimizes the work required for version upgrades, performance tuning, and integration with other AWS services.
Q: Why should I use Athena for Apache Spark?
Use Athena for Apache Spark when you need an interactive, fully managed analytics experience and a tight integration with AWS services. You can use Spark to perform analytics in Athena using familiar, expressive languages such as Python and the growing environment of Spark packages. You can also submit your Spark applications through Athena APIs or enter them into simplified notebooks in the Athena console, and begin running them in under a second without setting up and tuning the underlying infrastructure. As with its SQL query capabilities, Athena offers a fully managed Spark experience and handles the performance tuning, machine configurations, and software patching automatically so that you do not need to worry about keeping current with version upgrades. Also, Athena is tightly integrated with other analytics services in the AWS system such as Data Catalog. Therefore, you can create Spark applications on data in S3 data lakes by referencing tables from your Data Catalog.
Q: How do I start working with Athena for Apache Spark?
To get started with Athena for Apache Spark, you can start a notebook in the Athena console or start a session using the AWS Command Line Interface (CLI) or Athena API. In your notebook, you can start entering and shutting down Spark applications using Python. Athena also integrates with Data Catalog, so you can work with any data source referenced in the catalog, including data directly in S3 data lakes. Using notebooks, you can now query data from various sources, chain together multiple calculations, and visualize the results of your analyses. For your Spark applications, you can check the execution status and review logs and execution history in the Athena console.
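A small sketch of what a notebook cell might contain, assuming the session exposes a Spark session object (passed in here as `spark`) and that the Data Catalog has a table named `sales_db.orders` with a `product_id` column. The function name, table, and column are hypothetical.

```python
def top_products(spark, table, limit=10):
    """Aggregate a Data Catalog table with Spark SQL and return the result."""
    query = (
        f"SELECT product_id, COUNT(*) AS orders "
        f"FROM {table} "
        f"GROUP BY product_id "
        f"ORDER BY orders DESC "
        f"LIMIT {int(limit)}"
    )
    return spark.sql(query)   # standard SparkSession.sql call


# In a notebook cell, with a live session available as `spark`:
# top_products(spark, "sales_db.orders").show()
```

Wrapping the query in a function keeps the cell reusable across tables and makes it easy to chain further DataFrame operations on the returned result.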
Q: Which Spark version is Athena based on?
Athena for Apache Spark is based on the stable Spark 3.2 release. As a fully managed engine, Athena will provide a custom build of Spark and will handle most Spark version updates automatically in a backward-compatible way without requiring your involvement.
Q: How is Athena for Apache Spark priced?
You only pay for the time your Apache Spark application takes to run. You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used to run your Apache Spark application. A single DPU provides 4 vCPU and 16 GB of memory. You will be billed in increments of 1 second, rounded up to the nearest minute.
When you start a Spark session, either by starting a notebook on the Athena console or by using the Athena API, two nodes are provisioned for your application: a notebook node that acts as the server for the notebook user interface and a Spark driver node that coordinates the Spark application and communicates with all the Spark worker nodes. Athena charges you for driver and worker nodes for the duration of the session. Amazon Athena provides notebooks on the console as a user interface for creating, submitting, and executing Apache Spark applications at no additional cost. Athena does not charge for the notebook nodes used during the Spark session.
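The billing description above can be turned into a rough estimate. The sketch below is one reading of it, not an official formula: it assumes the session duration is rounded up to the next whole minute, counts only driver and worker DPUs (the notebook node is free), and takes the hourly DPU rate as an input because no rate is stated here.

```python
import math


def spark_session_cost(driver_dpus, worker_dpus, seconds, rate_per_dpu_hour):
    """Estimate session cost from DPU count, duration, and an hourly DPU rate."""
    billed_seconds = math.ceil(seconds / 60) * 60      # assumption: round up to a whole minute
    dpu_hours = (driver_dpus + worker_dpus) * billed_seconds / 3600
    return dpu_hours * rate_per_dpu_hour


# Hypothetical session: 1 driver DPU, 4 worker DPUs, 90 seconds, placeholder rate.
estimate = spark_session_cost(1, 4, 90, rate_per_dpu_hour=0.35)
print(f"estimated cost: {estimate:.4f}")
```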
When to use Athena versus other big data services
Amazon Athena and Amazon Redshift Serverless address different needs and use cases, even though both services are serverless and enable SQL users.
EMR Serverless is the easiest way to run Spark and Hive applications in the cloud and the only serverless Hive solution in the industry. With EMR Serverless, you can eliminate the operational overhead of tuning, rightsizing, securing, patching, and managing clusters, and only pay for the resources that your applications actually use. With EMR’s performance-optimized runtime, you get more than 2x faster performance than standard open source, so your applications run faster and your compute costs go down. EMR’s performance-optimized runtime is 100% API compatible with standard open source, so you do not have to rewrite your applications to run them on EMR. You also do not need deep Spark expertise to turn on these optimizations because they are on by default.
EMR provides the option to run applications on EMR clusters, EKS clusters, or EMR Serverless. EMR clusters are suitable for customers that need maximum control and flexibility over how to run their application. With EMR clusters, customers can choose the EC2 instance type, customize the Amazon Linux AMI, customize EC2 instance configuration, customize and extend open-source frameworks, and install additional custom software on cluster instances. EMR on EKS is suitable for customers that want to standardize on EKS to manage clusters across applications or use different versions of an open-source framework on the same cluster. EMR Serverless is suitable for customers who want to avoid managing and operating clusters and simply want to run applications using open-source frameworks.
Athena SQL queries can invoke ML models deployed on Amazon SageMaker. You can specify the S3 location where you want to store the results of these Athena SQL queries.