I am using S3DistCp (s3-dist-cp) to concatenate files in Apache Parquet format with the --groupBy and --targetSize options. The s3-dist-cp job completes without errors, but the generated Parquet files are broken and can't be read by other applications. I get an error message similar to the following:

Expected n values in column chunk at /path/to/concatenated/parquet/file offset m but got x values instead over y pages ending at file offset z

How can I concatenate Parquet files in Amazon EMR?

S3DistCp does not support concatenation for Parquet files. Use PySpark instead.

Although the target size can't be specified in PySpark, you can specify the number of partitions. Estimate the number of partitions by using the data size and the target individual file size.
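For example, you can estimate the partition count by dividing the total input size by the target file size and rounding up. The sizes below are placeholders, not values from this article:

import math

# Placeholder values: substitute the actual size of your source data
# and the approximate size that you want each output Parquet file to be.
total_input_size_bytes = 200 * 1024**3    # for example, 200 GiB of source files
target_file_size_bytes = 1 * 1024**3      # for example, ~1 GiB per output file

# Round up so that no partition exceeds the target size.
n = math.ceil(total_input_size_bytes / target_file_size_bytes)
print(n)    # use this value for the coalesce() call in step 4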

1.    Create an Amazon EMR cluster with Apache Spark installed.

2.    Run a command similar to the following:

$ pyspark --num-executors <number_of_executors>

3.    To load the source Parquet files into an Apache Spark DataFrame, run a command similar to the following:

df = sqlContext.read.parquet("/input/path/to/parquet/files/")

4.    Repartition the Spark DataFrame. In the following example, n is the number of partitions that you estimated earlier. Each partition is saved to a separate output file. Note that coalesce can only decrease the number of partitions.

df_output = df.coalesce(n)

5.    Save the Spark DataFrame:

df_output.write.parquet("URI://path/to/destination")

6.    Run a command similar to the following to verify how many files were written to the destination directory:

hadoop fs -ls "URI://path/to/destination" | wc -l

The total number of files should be the value of n from step 4, plus one. The extra file, _SUCCESS, is written by the Parquet output committer.
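For reference, the following minimal PySpark sketch combines steps 3 through 5. The S3 paths and the partition count of 10 are placeholders, not values taken from this article:

# Run inside the pyspark shell from step 2, where sqlContext is already defined.

# Step 3: load the source Parquet files into a DataFrame.
df = sqlContext.read.parquet("s3://doc-example-bucket/input/parquet/")

# Step 4: reduce the DataFrame to 10 partitions (one output file per partition).
df_output = df.coalesce(10)

# Step 5: write the concatenated Parquet files to the destination.
df_output.write.parquet("s3://doc-example-bucket/output/parquet/")

You can then run the hadoop fs command from step 6 against s3://doc-example-bucket/output/parquet/ to confirm that 11 files (10 data files plus _SUCCESS) are present.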



Published: 2018-10-23