- Unified Metadata Repository: AWS Glue is integrated across a wide range of AWS services. AWS Glue supports data stored in Amazon Aurora, Amazon RDS MySQL, Amazon RDS PostgreSQL, Amazon Redshift, and Amazon S3, as well as MySQL and PostgreSQL databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2. AWS Glue provides out-of-the-box integration with Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, and any Apache Hive Metastore-compatible application.
- Automatic schema and partition recognition: AWS Glue automatically crawls your data sources, identifies data formats, and suggests schemas and transformations. Crawlers can help automate table creation and automatic loading of partitions.
- Easy to build pipelines: AWS Glue’s ETL engine generates Python code that is customizable, reusable, and portable. You can edit the code using your favorite IDE or notebook and share it with others using GitHub. Once your ETL job is ready, you can schedule it to run on AWS Glue's fully managed, scale-out Spark infrastructure. AWS Glue is serverless, so it handles provisioning, configuration, and scaling of the resources required to run your ETL jobs, allowing you to tightly integrate ETL in your workflow.
When to use Athena vs other big data services
Creating tables, data formats and partitions
- Apache Web Logs: "org.apache.hadoop.hive.serde2.RegexSerDe"
- CSV: "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
- TSV: "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
- Custom Delimiters: "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
- Parquet: "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
- ORC: "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
- JSON: "org.apache.hive.hcatalog.data.JsonSerDe" or "org.openx.data.jsonserde.JsonSerDe"
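As a rough illustration of where one of the SerDe classes listed above ends up in a table definition, the following Python sketch assembles a Hive-style CREATE EXTERNAL TABLE statement as a string. The helper name, table name, columns, and S3 location are hypothetical and chosen only for this example. The sketch covers only the text-based formats; columnar formats such as Parquet and ORC typically also need a storage format clause that is not shown here.

```python
# SerDe class names are taken from the list above; the dictionary keys are
# labels chosen for this sketch, not Athena keywords.
TEXT_SERDES = {
    "json": "org.openx.data.jsonserde.JsonSerDe",
    "csv": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
}


def build_create_table(table, columns, data_format, location, serde_properties=None):
    """Return a CREATE EXTERNAL TABLE statement as a string (nothing is executed)."""
    serde = TEXT_SERDES[data_format]
    column_sql = ",\n  ".join(f"`{name}` {col_type}" for name, col_type in columns)
    properties_sql = ""
    if serde_properties:
        pairs = ", ".join(f"'{key}' = '{value}'" for key, value in serde_properties.items())
        properties_sql = f"WITH SERDEPROPERTIES ({pairs})\n"
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_sql}\n"
        f")\n"
        f"ROW FORMAT SERDE '{serde}'\n"
        f"{properties_sql}"
        f"LOCATION '{location}'"
    )


# Hypothetical table over JSON files in a hypothetical bucket.
json_ddl = build_create_table(
    table="clickstream_events",
    columns=[("user_id", "string"), ("event_time", "string")],
    data_format="json",
    location="s3://example-bucket/clickstream/",
)

# Hypothetical comma-delimited table; the delimiter is passed as a SerDe property.
csv_ddl = build_create_table(
    table="orders_csv",
    columns=[("order_id", "string"), ("amount", "double")],
    data_format="csv",
    location="s3://example-bucket/orders/",
    serde_properties={"field.delim": ","},
)

print(json_ddl)
print(csv_ddl)
```

The generated text can be pasted into the Athena query editor or submitted through whichever client you already use.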
Querying and data formats
Q: Can I use Amazon QuickSight with Amazon Athena?
Q: Does Athena support other BI Tools and SQL Clients?
Q: How do I access the functions supported by Amazon Athena?
You can write UDFs in Java using the Athena Query Federation SDK. When a UDF is used in a SQL query submitted to Athena, it is invoked and executed on AWS Lambda. UDFs can be used in both SELECT and FILTER clauses of a SQL query. You can invoke multiple UDFs in the same query.
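To give a sense of what invoking a UDF looks like from the SQL side, here is a minimal Python sketch that builds a query using Athena's USING EXTERNAL FUNCTION clause. The function name `redact`, the Lambda function name `example-udf-handler`, and the `customers` table are assumptions made up for this example; the UDF itself would still need to be implemented in Java and deployed to AWS Lambda as described above.

```python
def build_udf_query(function_name, arguments, return_type, lambda_name, select_sql):
    """Prefix a SELECT statement with a USING EXTERNAL FUNCTION declaration.

    arguments is a list of (name, sql_type) pairs describing the UDF inputs.
    """
    argument_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in arguments)
    return (
        f"USING EXTERNAL FUNCTION {function_name}({argument_sql}) "
        f"RETURNS {return_type} "
        f"LAMBDA '{lambda_name}'\n"
        f"{select_sql}"
    )


# Hypothetical UDF used both in the select list and in the filter (WHERE) clause.
udf_query = build_udf_query(
    function_name="redact",
    arguments=[("value", "VARCHAR")],
    return_type="VARCHAR",
    lambda_name="example-udf-handler",
    select_sql=(
        "SELECT redact(email) AS email\n"
        "FROM customers\n"
        "WHERE redact(country) <> 'xx'"
    ),
)

print(udf_query)
```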
Q: What is federated query?
If you have data in sources other than Amazon S3, you can use Athena to query the data in place or build pipelines that extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.
Q: Why should you use federated queries in Athena?
Organizations often store data in a data source that meets the needs of their applications or business processes. These may include relational, key-value, document, in-memory, search, graph, time-series, and ledger databases, in addition to an S3 data lake. Performing analytics on such diverse sources can be complex and time-consuming because it typically requires learning new programming languages or database constructs and building complex pipelines to extract, transform, and duplicate data before it can be used for analysis. Athena eliminates this complexity by allowing you to run SQL queries on the data where it is. You can use well-known SQL constructs to query data across multiple data sources for quick analysis, or use scheduled SQL queries to extract and transform data from multiple data sources and store the results in S3 for further analysis.
Q: What data sources are supported?
Athena provides built-in connectors to several popular data stores including Amazon Redshift and Amazon DynamoDB. You can use these connectors to enable SQL analytics use cases on structured, semi-structured, object, graph, time series, and other data storage types. For a list of supported sources, see Using Athena Data Source Connectors.
You can also use Athena’s data connector SDK to create a custom data source connector and query it with Athena. Get started by reviewing our documentation and example connector implementation.
Q: What use cases does federated query enable?
With Athena you can leverage your existing SQL knowledge to extract insights from a wide range of data sources without learning a new language, developing scripts to extract (and duplicate) data, or managing infrastructure. Using Amazon Athena, you can:
- Run on-demand analysis on data spread across multiple data stores using a single tool and SQL dialect
- Visualize data in business intelligence applications that push complex, multi-source joins down to Athena’s distributed compute engine over JDBC and ODBC interfaces
- Design self-service ETL pipelines and event-based data processing workflows with Athena’s integration with AWS Step Functions
- Unify diverse data sources to produce rich input features for machine learning model training workflows
- Develop user-facing data-as-a-product applications that surface insights across data mesh architectures
- Support analytics use cases while your organization migrates on-premises sources to the AWS cloud
Q: Can I use federated query for ETL (Extract, Transform, Load)?
Athena saves query results to a file in Amazon S3. This means you can use Athena to make federated data available to other users and applications. If you want to perform analysis on the data using Athena without repeatedly querying the underlying source, use Athena’s CREATE TABLE AS statement. You can also use Athena’s UNLOAD statement to query the data and store the results in a specific file format on Amazon S3.
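The two statements mentioned above can be sketched as follows. This Python example only builds the SQL text; the table names, the source query, and the S3 paths are hypothetical, and Parquet is used as the output format purely as an assumption for illustration.

```python
def build_ctas(new_table, select_sql, file_format="PARQUET", external_location=None):
    """Return a CREATE TABLE AS statement that materializes a query as a new table."""
    properties = [f"format = '{file_format}'"]
    if external_location:
        properties.append(f"external_location = '{external_location}'")
    return (
        f"CREATE TABLE {new_table}\n"
        f"WITH ({', '.join(properties)})\n"
        f"AS {select_sql}"
    )


def build_unload(select_sql, s3_prefix, file_format="PARQUET"):
    """Return an UNLOAD statement that writes query results to S3 in a chosen format."""
    return (
        f"UNLOAD ({select_sql})\n"
        f"TO '{s3_prefix}'\n"
        f"WITH (format = '{file_format}')"
    )


# Hypothetical query against a federated source.
source_query = "SELECT order_id, amount FROM orders_source WHERE amount > 100"

ctas_sql = build_ctas(
    "large_orders",
    source_query,
    external_location="s3://example-bucket/large_orders/",
)
unload_sql = build_unload(source_query, "s3://example-bucket/exports/large_orders/")

print(ctas_sql)
print(unload_sql)
```

The CREATE TABLE AS form leaves a table you can query again in Athena, while the UNLOAD form only writes files, which suits hand-offs to other applications.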
Q: How do data source connectors work?
A data source connector is a piece of code that runs on AWS Lambda that translates between your target data source and Athena. Once you use a data source connector to register a data store with Athena, you can run SQL queries on federated data stores. When a query runs on a federated source, Athena calls the Lambda function and tasks it with executing the parts of your query that are specific to the federated source. To learn more, see Using Amazon Athena Federated Query.
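Once a data store is registered, tables in it are addressed by catalog, database, and table name, so a single query can join federated data with data in Amazon S3. The Python sketch below builds such a query as text. The catalog name `orders_catalog`, the database and table names, and the join columns are all hypothetical; `awsdatacatalog` is assumed here to be the catalog that holds the S3-backed table.

```python
def qualified_name(catalog, database, table):
    """Return a fully qualified, double-quoted catalog.database.table reference."""
    return f'"{catalog}"."{database}"."{table}"'


# Hypothetical federated source registered under the catalog name "orders_catalog".
orders = qualified_name("orders_catalog", "sales", "orders")
# Hypothetical S3-backed table in the default Glue-based catalog.
customers = qualified_name("awsdatacatalog", "lake", "customers")

federated_join = (
    "SELECT c.customer_name, SUM(o.amount) AS total_amount\n"
    f"FROM {orders} AS o\n"
    f"JOIN {customers} AS c ON o.customer_id = c.customer_id\n"
    "GROUP BY c.customer_name"
)

print(federated_join)
```

In a query like this, the connector's Lambda function would handle the part that reads from the federated source, while Athena performs the join and aggregation.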