How can I store an Amazon Athena query output in a format other than CSV, such as a compressed format?
Last updated: 2021-12-20
I want to store Amazon Athena query results in a format other than CSV, such as JSON or a compressed format.
Athena supports CSV output files only. The UNLOAD query writes query results from a SELECT statement to the specified data format. Supported formats for UNLOAD include Apache Parquet, ORC, Apache Avro, and JSON. CSV is the only output format used by the Athena SELECT query, but you can use UNLOAD to write the output of a SELECT query to the formats that UNLOAD supports. Although you can use the CTAS statement to output data in formats other than CSV, those statements also require the creation of a table in Athena. The UNLOAD statement is useful when you want to output the results of a SELECT query in a non-CSV format but don't require the associated table. For example, a downstream application might require the results of a SELECT query to be in JSON format, and Parquet or ORC might provide a performance advantage over CSV if you intend to use the results of the SELECT query for additional analysis.
To store query output files in a different format, use a CREATE TABLE AS SELECT (CTAS) query, and then configure the format property. After the query completes, drop the CTAS table. Keep the following in mind:
- You can set format to ORC, PARQUET, AVRO, JSON, or TEXTFILE. If you don't specify a format for the CTAS query, then Athena uses Parquet by default.
- The name of the parameter, format , must be listed in lowercase, or your CTAS query fails. For an example, see Example: Writing query results to a different format.
- By default, all CTAS queries use GZIP compression. For Parquet and ORC, you can use the parquet_compression and orc_compression options to specify other compression types, such as SNAPPY. For an example, see Example: Specifying data storage and compression formats.
- For text-based formats, you can specify a field_delimiter like this: WITH (field_delimiter = ',') . If you don't specify a field delimiter, then Athena uses \001 by default.
- If you use the external_location parameter in the CTAS query, then be sure that the Amazon Simple Storage Service (Amazon S3) location has no data. Otherwise, you might see an error like this: "HIVE_PATH_ALREADY_EXISTS: Target directory for table 'table_name' already exists: s3://AWSDOC-EXAMPLE-BUCKET". For more information about the external_location parameter, see CTAS table properties.