Amazon Athena Documentation

Amazon Athena is an interactive query service that is designed to analyze data directly in Amazon S3 using standard SQL. Customers can point Athena at their data stored in S3 and begin using standard SQL to run ad hoc queries and get results. Athena is serverless, so there is no infrastructure for you to set up or manage. You can use Athena to process logs, perform ad hoc analysis, and run interactive queries. Athena is designed to scale automatically – executing queries in parallel – so results are fast, even with large datasets and complex queries.

Serverless

Amazon Athena is serverless, so there is no infrastructure for you to manage. You don’t need to worry about configuration, software updates, failures or scaling your infrastructure as your datasets and number of users grow. Athena is designed to automatically take care of all of this for you, so you can focus on the data, not the infrastructure.

Easy to get started

To get started, log in to the Athena console, define your schema using the console wizard or by entering DDL statements, and start querying using the built-in query editor. You can also use AWS Glue to crawl data sources to discover data and populate your Data Catalog with new and modified table and partition definitions. Results are promptly displayed in the console, and automatically written to a location of your choice in S3. You can also download them to your desktop. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis.
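
As a rough sketch of the "entering DDL statements" path, the Python snippet below assembles a CREATE EXTERNAL TABLE statement for comma-delimited files in S3. The table name, columns, and bucket path are hypothetical placeholders, and build_create_table_ddl is an illustrative helper, not part of any AWS SDK. The resulting string could be pasted into the query editor.

```python
def build_create_table_ddl(table, columns, s3_location):
    """Return a CREATE EXTERNAL TABLE statement for comma-delimited text files.

    `columns` is a list of (name, type) pairs. Every name used below is a
    hypothetical example, not a value taken from the Athena documentation.
    """
    column_lines = ",\n".join(
        f"  `{name}` {sql_type}" for name, sql_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical table over log files stored under an example bucket prefix.
ddl = build_create_table_ddl(
    "access_logs",
    [("request_time", "string"), ("status", "int"), ("bytes_sent", "bigint")],
    "s3://example-bucket/logs/",
)
print(ddl)
```

The statement only registers a schema over files that already exist in S3; no data is moved or transformed.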

Easy to query, just use standard SQL

Amazon Athena uses Presto, an open-source, distributed SQL query engine optimized for low-latency, ad hoc analysis of data. This means you can run queries against large datasets in Amazon S3 using ANSI SQL, with full support for large joins, window functions, and arrays. Athena supports a variety of data formats such as CSV, JSON, ORC, Avro, and Parquet. You can also connect to Athena from a wide variety of BI tools using Athena's JDBC driver.
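
To illustrate what a standard SQL query with a window function might look like when submitted programmatically, here is a minimal Python sketch. It builds the request parameters for the StartQueryExecution API; the table, database, and bucket names are hypothetical, and build_query_request is an illustrative helper. The run function assumes the boto3 library and valid AWS credentials, so it is defined but not called here.

```python
def build_query_request(sql, database, output_location):
    """Build keyword arguments for the Athena StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# A window function ranks requests by size within each status code.
# `access_logs` and its columns are hypothetical.
SQL = """
SELECT status,
       bytes_sent,
       rank() OVER (PARTITION BY status ORDER BY bytes_sent DESC) AS size_rank
FROM access_logs
""".strip()

request = build_query_request(
    SQL, "example_db", "s3://example-bucket/results/"
)


def run(request):
    """Submit the query and return its execution ID.

    Requires boto3 and AWS credentials, so it is not invoked in this sketch.
    """
    import boto3

    client = boto3.client("athena")
    return client.start_query_execution(**request)["QueryExecutionId"]
```

Keeping the request construction separate from the network call makes the parameters easy to inspect before anything is sent.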

Performance

With Amazon Athena, you don’t have to worry about managing or tuning clusters to get performance. Athena is designed to be optimized for performance with Amazon S3. Athena is designed to automatically execute queries in parallel, so that you get faster query results, even on large datasets.

Highly available & durable

Amazon Athena is designed to be highly available and execute queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable. Athena uses Amazon S3 as its underlying data store, which is designed to make your data highly available and durable. Your data is redundantly stored across multiple facilities and multiple devices in each facility.

Secure

Amazon Athena is designed to allow you to control access to your data by using AWS Identity and Access Management (IAM) policies, access control lists (ACLs), and Amazon S3 bucket policies. With IAM policies, you can grant IAM users fine-grained access to your S3 buckets. By controlling access to data in S3, you can restrict users from querying it using Athena. Athena also allows you to query encrypted data stored in Amazon S3 and write encrypted results back to your S3 bucket. Both server-side encryption and client-side encryption are supported.
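
The following Python sketch shows one way the access controls and encrypted results described above could be expressed. It builds an IAM policy document as a plain dictionary and a result configuration that requests server-side encryption with a KMS key. The bucket names and key ARN are hypothetical, both helper functions are illustrative, and a real deployment would likely need additional permissions (for example on the Data Catalog) that are omitted here.

```python
import json


def build_athena_user_policy(data_bucket, results_bucket):
    """Return an IAM policy document (as a dict) for a query-only user."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Submit queries and read their status and results.
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                ],
                "Resource": "*",
            },
            {
                # Read-only access to the source data.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{data_bucket}",
                    f"arn:aws:s3:::{data_bucket}/*",
                ],
            },
            {
                # Athena writes query results to this bucket.
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:GetObject", "s3:PutObject"],
                "Resource": [
                    f"arn:aws:s3:::{results_bucket}",
                    f"arn:aws:s3:::{results_bucket}/*",
                ],
            },
        ],
    }


def encrypted_result_configuration(output_location, kms_key_arn):
    """Result settings asking Athena to write results with SSE-KMS."""
    return {
        "OutputLocation": output_location,
        "EncryptionConfiguration": {
            "EncryptionOption": "SSE_KMS",
            "KmsKey": kms_key_arn,
        },
    }


policy = build_athena_user_policy("example-data-bucket", "example-results-bucket")
print(json.dumps(policy, indent=2))
```

Because the user is granted no write actions on the data bucket, queries can read the source objects but cannot modify them.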

Integrated

Amazon Athena is designed to integrate with AWS Glue. With Glue Data Catalog, you will be able to create a unified metadata repository across various services, crawl data sources to discover data and populate your Data Catalog with new and modified table and partition definitions, and maintain schema versioning. You can also use Glue’s managed ETL capabilities to transform data or convert it into columnar formats to optimize query performance.

Federated query

Athena enables you to run SQL queries across data stored in relational, non-relational, object, and custom data sources. You can use familiar SQL constructs to JOIN data across multiple data sources for quick analysis, and store results in Amazon S3 for subsequent use. Athena executes federated queries using Athena Data Source Connectors that run on AWS Lambda. AWS has open source data source connectors for Amazon DynamoDB, Apache HBase, Amazon DocumentDB, Amazon Redshift, Amazon CloudWatch, Amazon CloudWatch Metrics, and JDBC-compliant relational databases such as MySQL and PostgreSQL. You can use these connectors to run federated SQL queries in Athena. Additionally, using the Athena Query Federation SDK, you can build connectors to other data sources.
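
A cross-source JOIN of the kind described above might be sketched as follows. The Python helper only assembles the SQL text. It assumes two data source catalogs have already been registered for their connectors; the catalog, database, table, and column names are all hypothetical.

```python
def federated_join_sql(orders_catalog, customers_catalog):
    """Return SQL joining tables that live in two different data sources.

    Each table is addressed as "catalog"."database"."table". All
    identifiers are hypothetical placeholders.
    """
    return (
        "SELECT c.customer_id, c.name, sum(o.total) AS lifetime_total\n"
        f'FROM "{orders_catalog}"."sales"."orders" AS o\n'
        f'JOIN "{customers_catalog}"."crm"."customers" AS c\n'
        "  ON o.customer_id = c.customer_id\n"
        "GROUP BY c.customer_id, c.name"
    )


# For example, orders held in one source and customer records in another.
sql = federated_join_sql("example_dynamodb", "example_mysql")
print(sql)
```

From the query author's point of view the join reads like ordinary SQL; only the catalog prefix indicates that the tables come from different sources.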

Machine learning

You can invoke your Amazon SageMaker machine learning models in an Athena SQL query to run inference. The ability to use ML models in SQL queries can make complex tasks such as anomaly detection, customer cohort analysis, and sales predictions as simple as writing a SQL query. Athena can help anyone with SQL experience run ML models deployed on Amazon SageMaker.
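
As an illustration of calling a model from SQL, the Python sketch below assembles a query that uses Athena's USING EXTERNAL FUNCTION ... SAGEMAKER clause. The function name, endpoint name, table, and column are hypothetical, the DOUBLE input and output types are an assumption about the model, and the exact clause syntax should be checked against the current Athena documentation.

```python
def sagemaker_inference_sql(function_name, endpoint_name, table, column):
    """Return SQL that scores one numeric column with a SageMaker endpoint.

    Assumes the model takes a single DOUBLE and returns a DOUBLE; all
    names passed in are hypothetical.
    """
    return (
        f"USING EXTERNAL FUNCTION {function_name}({column} DOUBLE)\n"
        "RETURNS DOUBLE\n"
        f"SAGEMAKER '{endpoint_name}'\n"
        f"SELECT {column}, {function_name}({column}) AS score\n"
        f"FROM {table}"
    )


# Hypothetical anomaly-detection endpoint applied to a sales table.
sql = sagemaker_inference_sql(
    "detect_anomaly", "example-anomaly-endpoint", "daily_sales", "revenue"
)
print(sql)
```

Once declared, the function is used in the SELECT list like any other scalar function.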

Amazon Athena for Apache Spark

With Amazon Athena for Apache Spark, you can run interactive analytics on Apache Spark in under a second. Interactive Spark applications start and run faster with our optimized Spark runtime, and you can build Spark applications using the expressiveness of Python with a simplified notebook experience in the Athena console or through Athena APIs.

Additional Information

For additional information about service controls, security features and functionalities, including, as applicable, information about storing, retrieving, modifying, restricting, and deleting data, please see https://docs.aws.amazon.com/index.html. This additional information does not form part of the Documentation for purposes of the AWS Customer Agreement available at http://aws.amazon.com/agreement, or other agreement between you and AWS governing your use of AWS’s services.