How can I concatenate Parquet files in Amazon EMR?

I'm using S3DistCp (s3-dist-cp) to concatenate files in Apache Parquet format with the --groupBy and --targetSize options. The s3-dist-cp job completes without errors, but the generated Parquet files are broken. When I try to read the Parquet files in applications, I get an error message similar to the following: "Expected n values in column chunk at /path/to/concatenated/parquet/file offset m but got x values instead over y pages ending at file offset z"

Short description

S3DistCp doesn't support concatenation for Parquet files. Use PySpark instead.

Resolution

You can't specify the target file size in PySpark, but you can specify the number of partitions. Spark saves each partition to a separate output file. To estimate the number of partitions that you need, divide the size of the dataset by the target individual file size.
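For example, a minimal sketch of the estimate, assuming a hypothetical 200 GB dataset and a 1 GB target file size:

import math

dataset_size_gb = 200     # hypothetical total size of the source dataset
target_file_size_gb = 1   # hypothetical target size for each output file

# Round up so that no output file exceeds the target size.
n = math.ceil(dataset_size_gb / target_file_size_gb)
print(n)  # 200, so use 200 partitions in step 4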

1.    Create an Amazon EMR cluster with Apache Spark installed.

2.    Launch the PySpark shell and specify how many executors you need. The right number depends on cluster capacity and dataset size. For more information, see Best practices for successfully managing memory for Apache Spark applications on Amazon EMR.

$ pyspark --num-executors number_of_executors

3.    Load the source Parquet files into a Spark DataFrame. The source can be an Amazon Simple Storage Service (Amazon S3) path or an HDFS path.

Amazon S3:

df = sqlContext.read.parquet("s3://awsdoc-example-bucket/parquet-data/")

HDFS:

df = sqlContext.read.parquet("hdfs:///tmp/parquet-data/")

4.    Repartition the DataFrame. In the following example, n is the target number of partitions. Note that coalesce only reduces the number of partitions (it avoids a full shuffle); if you need more partitions than the source has, use df.repartition(n) instead.

df_output = df.coalesce(n)
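If you're unsure which to use, a minimal sketch for checking the current partition count first (the target of 10 is hypothetical):

# Check how many partitions the source DataFrame currently has.
print(df.rdd.getNumPartitions())

# coalesce(n) avoids a full shuffle but can only reduce the partition count;
# repartition(n) shuffles the data and can either increase or reduce it.
df_output = df.coalesce(10)  # hypothetical target of 10 output files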

5.    Save the DataFrame to the destination. The destination can be an Amazon S3 path or an HDFS path.

Amazon S3:

df_output.write.parquet("s3://awsdoc-example-bucket1/destination/")

HDFS:

df_output.write.parquet("hdfs:///tmp/destination/")

6.    Verify how many files are now in the destination directory:

hadoop fs -ls s3://awsdoc-example-bucket1/destination/ | wc -l

The total number of files should be the value of n from step 4, plus one: the Parquet output committer writes an extra marker file named _SUCCESS. Note that the first line of the hadoop fs -ls output is a summary ("Found ... items"), not a file.
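To run the same steps non-interactively, you can combine them into a single script and submit it with spark-submit. The following is a minimal sketch; the file name, application name, paths, and partition count of 200 are all hypothetical:

from pyspark.sql import SparkSession

# Hypothetical values; substitute your own paths and partition count.
SOURCE = "s3://awsdoc-example-bucket/parquet-data/"
DESTINATION = "s3://awsdoc-example-bucket1/destination/"
NUM_PARTITIONS = 200

spark = SparkSession.builder.appName("concatenate-parquet").getOrCreate()

# Read the source files, reduce them to NUM_PARTITIONS partitions,
# and write one Parquet file per partition to the destination.
df = spark.read.parquet(SOURCE)
df.coalesce(NUM_PARTITIONS).write.parquet(DESTINATION)

spark.stop()

Submit it with, for example:

$ spark-submit --num-executors 10 concatenate_parquet.py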

