How can I set the number or size of files when I run a CTAS query in Athena?

Last updated: 2020-01-03

When I run a CREATE TABLE AS SELECT (CTAS) query in Amazon Athena, I want to define the number of files or the amount of data per file.

Resolution

Use bucketing to set the file size or number of files in a CTAS query.

Note: These steps use the Global Historical Climatology Network Daily public dataset (s3://noaa-ghcn-pds/csv.gz/) to illustrate the solution. For more information about this dataset, see Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight. These steps show how to examine your dataset, create the environment, and then modify the dataset as follows:

1.    Modify the number of files in the Amazon Simple Storage Service (Amazon S3) dataset.

2.    Set the approximate size of each file.

3.    Convert the data format and set the approximate file size.

Examine the dataset

Use the AWS Command Line Interface (AWS CLI) to verify the number of files and the size of the dataset:

aws s3 ls s3://noaa-ghcn-pds/csv.gz/ --summarize --recursive --human-readable

The output looks like this:

2019-11-30 01:58:05    3.3 KiB csv.gz/1763.csv.gz
2019-11-30 01:58:06    3.2 KiB csv.gz/1764.csv.gz
2019-11-30 01:58:06    3.3 KiB csv.gz/1765.csv.gz
2019-11-30 01:58:07    3.3 KiB csv.gz/1766.csv.gz
...
2019-11-30 02:05:43  199.7 MiB csv.gz/2016.csv.gz
2019-11-30 02:05:50  197.7 MiB csv.gz/2017.csv.gz
2019-11-30 02:05:54  197.0 MiB csv.gz/2018.csv.gz
2019-11-30 02:05:57  168.8 MiB csv.gz/2019.csv.gz

Total Objects: 257
Total Size: 15.4 GiB

Create the environment

1.    Run a statement similar to the following to create a table:

CREATE EXTERNAL TABLE historic_climate_gz(
  id string,
  yearmonthday int,
  element string,
  temperature int,
  m_flag string,
  q_flag string,
  s_flag string,
  obs_time int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://noaa-ghcn-pds/csv.gz/'

2.    Run the following command to test the table:

select * from historic_climate_gz limit 10

The output shows 10 lines from the dataset. After the environment is created, use one or more of the following methods to modify the dataset when you run CTAS queries.

Modify the number of files in the dataset

It's a best practice to bucket data by a column that has high cardinality and evenly distributed values. For more information, see Bucketing vs Partitioning. In the following examples, we use the yearmonthday field.
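To see why high cardinality matters, consider a rough sketch of how hash bucketing assigns rows. This is illustrative only: cksum stands in for Athena's internal hash function, and bucket_for is a hypothetical helper, not an Athena feature.

```shell
# Illustrative sketch only: Athena assigns each row to a bucket roughly as
# hash(value of bucketed_by column) % bucket_count. cksum stands in for
# Athena's internal hash here; bucket_for is a hypothetical helper.
bucket_for () {
  local value=$1 bucket_count=$2
  local h
  h=$(printf '%s' "$value" | cksum | cut -d ' ' -f 1)
  echo $(( h % bucket_count ))
}

# A high-cardinality column such as yearmonthday produces many distinct hash
# inputs, so rows spread roughly evenly across the buckets.
bucket_for 20190101 20
```

A low-cardinality column (for example, a boolean flag) would hash to only a few distinct buckets, leaving the rest empty and the files skewed in size.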

1.    To convert the dataset into 20 files, run a statement similar to the following. Replace the values in these fields:

external_location: the Amazon S3 location where Athena saves your CTAS query
format: the format that you want for the output (such as ORC, PARQUET, AVRO, JSON, or TEXTFILE)
bucket_count: the number of files that you want (for example, 20)
bucketed_by: the field for hashing and saving the data in the bucket (for example, yearmonthday)

CREATE TABLE "historic_climate_gz_20_files"
WITH (
      external_location = 's3://awsexamplebucket/historic_climate_gz_20_files/',
      format = 'TEXTFILE',
      bucket_count=20,
      bucketed_by = ARRAY['yearmonthday']
       ) as
select * from historic_climate_gz

2.    Run the following command to confirm that the bucket contains the desired number of files:

aws s3 ls s3://awsexamplebucket/historic_climate_gz_20_files/ --summarize --recursive --human-readable
Total Objects: 20
Total Size: 15.6 GiB

Set the approximate size of each file

1.    Determine how many files you need to achieve the desired file size. For example, to split the 15.4 GiB dataset into 2 GB files, you need 8 files (15.4 / 2 = 7.7, rounded up to 8).
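This calculation is a ceiling division. As a quick sanity check (sizes converted to MiB so the arithmetic stays integral; 15,770 MiB approximates the 15.4 GiB total from the listing above):

```shell
# Files needed = ceil(dataset size / target file size), done as an
# integer ceiling division.
dataset_mib=15770   # ~15.4 GiB expressed in MiB
target_mib=2048     # 2 GiB target per file
bucket_count=$(( (dataset_mib + target_mib - 1) / target_mib ))
echo "$bucket_count"   # prints 8
```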

2.    Run a statement similar to the following. Replace the values in these fields:

external_location: the Amazon S3 location where Athena saves your CTAS query
format: must be the same format as the source data (such as ORC, PARQUET, AVRO, JSON, or TEXTFILE)
bucket_count: the number of files that you want (for example, 8)
bucketed_by: the field for hashing and saving the data in the bucket. Choose a field with high cardinality.

CREATE TABLE "historic_climate_gz_2GB_files"
WITH (
      external_location = 's3://awsexamplebucket/historic_climate_gz_2GB_file/',
      format = 'TEXTFILE',
   bucket_count=8,
   bucketed_by = ARRAY['yearmonthday']) as
select * from historic_climate_gz

3.    Run the following command to confirm that the dataset contains the desired number of files:

aws s3 ls s3://awsexamplebucket/historic_climate_gz_2GB_file/ --summarize --recursive --human-readable

The output looks like this:

2019-09-03 10:59:20    1.7 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00000.gz
2019-09-03 10:59:20    2.0 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00001.gz
2019-09-03 10:59:20    2.0 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00002.gz
2019-09-03 10:59:19    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00003.gz
2019-09-03 10:59:17    1.7 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00004.gz
2019-09-03 10:59:21    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00005.gz
2019-09-03 10:59:18    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00006.gz
2019-09-03 10:59:17    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00007.gz
Total Objects: 8
Total Size: 15.0 GiB

Convert the data format and set the approximate file size

1.    Run a statement similar to the following to convert the data to a different format. Replace the values in these fields:

external_location: the Amazon S3 location where Athena saves your CTAS query
format: the format that you want to convert to (ORC, PARQUET, AVRO, JSON, or TEXTFILE)

CREATE TABLE "historic_climate_parquet"
WITH (
      external_location = 's3://awsexamplebucket/historic_climate_parquet/',
      format = 'PARQUET') as
select * from historic_climate_gz

2.    Run the following command to confirm the size of the dataset:

aws s3 ls s3://awsexamplebucket/historic_climate_parquet/ --summarize --recursive --human-readable

The output looks like this:

Total Objects: 30
Total Size: 9.8 GiB

3.    Determine how many files you need to achieve the desired file size. For example, if you want 500 MB files and the dataset is 9.8 GB, then you need 20 files (9,800 / 500 = 19.6, rounded up to 20).
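The same ceiling-division check applies here, and also estimates the resulting size per file (9,800 MiB stands in for the 9.8 GiB total reported above):

```shell
# Files needed = ceil(9800 MiB / 500 MiB); each bucket then holds roughly
# dataset size / file count.
dataset_mib=9800
target_mib=500
files=$(( (dataset_mib + target_mib - 1) / target_mib ))
echo "$files files, ~$(( dataset_mib / files )) MiB each"   # 20 files, ~490 MiB each
```

Note that the estimate is approximate: bucketing distributes rows by hash, so actual file sizes vary, as the listing in step 5 shows.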

4.    To convert the dataset into 500 MB files, run a statement similar to the following. Replace the values in these fields:

external_location: the Amazon S3 location where Athena saves your CTAS query
bucket_count: the number of files that you want (for example, 20)
bucketed_by: the field for hashing and saving the data in the bucket. Choose a field with high cardinality.

CREATE TABLE "historic_climate_parquet_500mb"
WITH (
      external_location = 's3://awsexamplebucket/historic_climate_parquet_500mb/',
      format = 'PARQUET',
      bucket_count=20,
      bucketed_by = ARRAY['yearmonthday']
       ) as
select * from historic_climate_parquet

5.    Run the following command to confirm that the dataset contains the desired number of files:

aws s3 ls s3://awsexamplebucket/historic_climate_parquet_500mb/ --summarize --recursive --human-readable

The output looks like this:

2019-09-03 12:01:45  333.9 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00000
2019-09-03 12:01:01  666.7 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00001
2019-09-03 12:01:00  665.6 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00002
2019-09-03 12:01:06  666.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00003
2019-09-03 12:00:59  667.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00004
2019-09-03 12:01:27  666.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00005
2019-09-03 12:01:10  666.5 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00006
2019-09-03 12:01:12  668.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00007
2019-09-03 12:01:03  666.8 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00008
2019-09-03 12:01:10  646.4 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00009
2019-09-03 12:01:35  639.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00010
2019-09-03 12:00:52  529.5 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00011
2019-09-03 12:01:29  334.2 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00012
2019-09-03 12:01:32  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00013
2019-09-03 12:01:34  332.2 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00014
2019-09-03 12:01:44  333.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00015
2019-09-03 12:01:51  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00016
2019-09-03 12:01:39  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00017
2019-09-03 12:01:47  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00018
2019-09-03 12:01:49  332.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00019
Total Objects: 20
Total Size: 9.9 GiB