How can I set the number or size of files when I run a CTAS query in Athena?

Last updated: 2020-01-03

When I run a CREATE TABLE AS SELECT (CTAS) query in Amazon Athena, I want to define the number of files or the amount of data per file.

Resolution

Use bucketing to set the file size or number of files in a CTAS query.

Note: These steps use the Global Historical Climatology Network Daily public dataset (s3://noaa-ghcn-pds/csv.gz/) to illustrate the solution. For more information about this dataset, see Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight. These steps show how to examine your dataset, create the environment, and then modify the dataset as follows:

1.    Modify the number of files in the Amazon Simple Storage Service (Amazon S3) dataset.

2.    Set the approximate size of each file.

3.    Convert the data format and set the approximate file size.

Examine the dataset

Use the AWS Command Line Interface (AWS CLI) to verify the number of files and the size of the dataset:

aws s3 ls s3://noaa-ghcn-pds/csv.gz/ --summarize --recursive --human-readable

The output looks like this:

2019-11-30 01:58:05    3.3 KiB csv.gz/1763.csv.gz
2019-11-30 01:58:06    3.2 KiB csv.gz/1764.csv.gz
2019-11-30 01:58:06    3.3 KiB csv.gz/1765.csv.gz
2019-11-30 01:58:07    3.3 KiB csv.gz/1766.csv.gz
...
2019-11-30 02:05:43  199.7 MiB csv.gz/2016.csv.gz
2019-11-30 02:05:50  197.7 MiB csv.gz/2017.csv.gz
2019-11-30 02:05:54  197.0 MiB csv.gz/2018.csv.gz
2019-11-30 02:05:57  168.8 MiB csv.gz/2019.csv.gz

Total Objects: 257
Total Size: 15.4 GiB

Create the environment

1.    Run a statement similar to the following to create a table:

CREATE EXTERNAL TABLE historic_climate_gz(
  id string,
  yearmonthday int,
  element string,
  temperature int,
  m_flag string,
  q_flag string,
  s_flag string,
  obs_time int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://noaa-ghcn-pds/csv.gz/'

2.    Run the following command to test the table:

select * from historic_climate_gz limit 10

The output shows 10 lines from the dataset. After the environment is created, use one or more of the following methods to modify the dataset when you run CTAS queries.

Modify the number of files in the dataset

It's a best practice to bucket data by a column that has high cardinality and evenly distributed values. For more information, see Bucketing vs Partitioning. In the following examples, we use the yearmonthday field.
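To see why high cardinality matters, consider a rough sketch of how hash bucketing assigns rows. This is illustrative only: cksum stands in for Athena's internal hash function, and bucket_for is a hypothetical helper, not an Athena feature.

```shell
# Illustrative sketch only: Athena assigns each row to a bucket roughly as
# hash(value of bucketed_by column) % bucket_count. cksum stands in for
# Athena's internal hash here; bucket_for is a hypothetical helper.
bucket_for () {
  local value=$1 bucket_count=$2
  local h
  h=$(printf '%s' "$value" | cksum | cut -d ' ' -f 1)
  echo $(( h % bucket_count ))
}

# A high-cardinality column such as yearmonthday produces many distinct hash
# inputs, so rows spread roughly evenly across the buckets.
bucket_for 20190101 20
```

A low-cardinality column (for example, a boolean flag) would hash to only a few distinct buckets, leaving the rest empty and the files skewed in size.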

1.    To convert the dataset into 20 files, run a statement similar to the following. Replace the values in these fields:

external_location: the Amazon S3 location where Athena saves your CTAS query
format: the format that you want for the output (such as ORC, PARQUET, AVRO, JSON, or TEXTFILE)
bucket_count: the number of files that you want (for example, 20)
bucketed_by: the field for hashing and saving the data in the bucket (for example, yearmonthday)

CREATE TABLE "historic_climate_gz_20_files"
WITH (
      external_location = 's3://awsexamplebucket/historic_climate_gz_20_files/',
      format = 'TEXTFILE',
      bucket_count=20,
      bucketed_by = ARRAY['yearmonthday']
       ) as
select * from historic_climate_gz

2.    Run the following command to confirm that the bucket contains the desired number of files:

aws s3 ls s3://awsexamplebucket/historic_climate_gz_20_files/ --summarize --recursive --human-readable
Total Objects: 20
Total Size: 15.6 GiB

Set the approximate size of each file

1.    Determine how many files you need to achieve the desired file size. For example, to split the 15.4 GiB dataset into 2 GB files, you need 8 files (15.4 / 2 = 7.7, rounded up to 8).
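This calculation is a ceiling division. As a quick sanity check (sizes converted to MiB so the arithmetic stays integral; 15,770 MiB approximates the 15.4 GiB total from the listing above):

```shell
# Files needed = ceil(dataset size / target file size), done as an
# integer ceiling division.
dataset_mib=15770   # ~15.4 GiB expressed in MiB
target_mib=2048     # 2 GiB target per file
bucket_count=$(( (dataset_mib + target_mib - 1) / target_mib ))
echo "$bucket_count"   # prints 8
```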

2.    Run a statement similar to the following. Replace the values in these fields:

external_location: the Amazon S3 location where Athena saves your CTAS query
format: must be the same format as the source data (such as ORC, PARQUET, AVRO, JSON, or TEXTFILE)
bucket_count: the number of files that you want (for example, 8)
bucketed_by: the field for hashing and saving the data in the bucket. Choose a field with high cardinality.

CREATE TABLE "historic_climate_gz_2GB_files"
WITH (
      external_location = 's3://awsexamplebucket/historic_climate_gz_2GB_file/',
      format = 'TEXTFILE',
   bucket_count=8,
   bucketed_by = ARRAY['yearmonthday']) as
select * from historic_climate_gz

3.    Run the following command to confirm that the dataset contains the desired number of files:

aws s3 ls s3://awsexamplebucket/historic_climate_gz_2GB_file/ --summarize --recursive --human-readable

The output looks like this:

2019-09-03 10:59:20    1.7 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00000.gz
2019-09-03 10:59:20    2.0 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00001.gz
2019-09-03 10:59:20    2.0 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00002.gz
2019-09-03 10:59:19    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00003.gz
2019-09-03 10:59:17    1.7 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00004.gz
2019-09-03 10:59:21    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00005.gz
2019-09-03 10:59:18    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00006.gz
2019-09-03 10:59:17    1.9 GiB historic_climate_gz_2GB_file/20190903_085819_00005_bzbtg_bucket-00007.gz
Total Objects: 8
Total Size: 15.0 GiB

Convert the data format and set the approximate file size

1.    Run a statement similar to the following to convert the data to a different format. Replace the values in these fields:

external_location: the Amazon S3 location where Athena saves your CTAS query
format: the format that you want to convert to (ORC, PARQUET, AVRO, JSON, or TEXTFILE)

CREATE TABLE "historic_climate_parquet"
WITH (
      external_location = 's3://awsexamplebucket/historic_climate_parquet/',
      format = 'PARQUET') as
select * from historic_climate_gz

2.    Run the following command to confirm the size of the dataset:

aws s3 ls s3://awsexamplebucket/historic_climate_parquet/ --summarize --recursive --human-readable

The output looks like this:

Total Objects: 30
Total Size: 9.8 GiB

3.    Determine how many files you need to achieve the desired file size. For example, if you want 500 MB files and the dataset is 9.8 GB, then you need 20 files (9,800 / 500 = 19.6, rounded up to 20).
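The same ceiling-division check applies here, and also estimates the resulting size per file (9,800 MiB stands in for the 9.8 GiB total reported above):

```shell
# Files needed = ceil(9800 MiB / 500 MiB); each bucket then holds roughly
# dataset size / file count.
dataset_mib=9800
target_mib=500
files=$(( (dataset_mib + target_mib - 1) / target_mib ))
echo "$files files, ~$(( dataset_mib / files )) MiB each"   # 20 files, ~490 MiB each
```

Note that the estimate is approximate: bucketing distributes rows by hash, so actual file sizes vary, as the listing in step 5 shows.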

4.    To convert the dataset into 500 MB files, run a statement similar to the following. Replace the values in these fields:

external_location: the Amazon S3 location where Athena saves your CTAS query
bucket_count: the number of files that you want (for example, 20)
bucketed_by: the field for hashing and saving the data in the bucket. Choose a field with high cardinality.

CREATE TABLE "historic_climate_parquet_500mb"
WITH (
      external_location = 's3://awsexamplebucket/historic_climate_parquet_500mb/',
      format = 'PARQUET',
      bucket_count=20,
      bucketed_by = ARRAY['yearmonthday']
       ) as
select * from historic_climate_parquet

5.    Run the following command to confirm that the dataset contains the desired number of files:

aws s3 ls s3://awsexamplebucket/historic_climate_parquet_500mb/ --summarize --recursive --human-readable

The output looks like this:

2019-09-03 12:01:45  333.9 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00000
2019-09-03 12:01:01  666.7 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00001
2019-09-03 12:01:00  665.6 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00002
2019-09-03 12:01:06  666.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00003
2019-09-03 12:00:59  667.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00004
2019-09-03 12:01:27  666.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00005
2019-09-03 12:01:10  666.5 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00006
2019-09-03 12:01:12  668.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00007
2019-09-03 12:01:03  666.8 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00008
2019-09-03 12:01:10  646.4 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00009
2019-09-03 12:01:35  639.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00010
2019-09-03 12:00:52  529.5 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00011
2019-09-03 12:01:29  334.2 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00012
2019-09-03 12:01:32  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00013
2019-09-03 12:01:34  332.2 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00014
2019-09-03 12:01:44  333.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00015
2019-09-03 12:01:51  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00016
2019-09-03 12:01:39  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00017
2019-09-03 12:01:47  333.0 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00018
2019-09-03 12:01:49  332.3 MiB historic_climate_parquet_500mb/20190903_095742_00001_uipqt_bucket-00019
Total Objects: 20
Total Size: 9.9 GiB