Posted On: Jun 5, 2018
You can now COPY Apache Parquet and Apache ORC file formats from Amazon S3 to your Amazon Redshift cluster. Apache Parquet and ORC are columnar data formats that allow users to store their data more efficiently and cost-effectively. With this update, Redshift now supports COPY from six file formats: AVRO, CSV, JSON, Parquet, ORC and TXT.
The nomenclature for copying Parquet or ORC is the same as existing COPY command. For example, to load the Parquet files inside “parquet” folder at the Amazon S3 location “s3://mybucket/data/listings/parquet/”, you would use the following command:
COPY listing FROM 's3://mybucket/data/listings/parquet/' IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole' FORMAT AS PARQUET;
All general purpose Amazon S3 storage classes are supported by this new feature, including S3 Standard, S3 Standard-Infrequent Access, and S3 One Zone-Infrequent Access. The current version of the COPY function supports certain parameters, such as FROM, IAM_ROLE, CREDENTIALS, STARTUPDATE, and MANIFEST. Succeeding versions will include more COPY parameters. The Amazon Redshift documentation lists the current restrictions on the function.
COPY from Parquet and ORC is available with the latest release <1.0.2294> in the following AWS regions: US East (N. Virginia, Ohio), US West (Oregon, N. California), Canada (Central), South America (Sao Paulo), EU (Frankfurt, Ireland, London), Asia Pacific (Mumbai, Seoul, Singapore, Sydney, Tokyo).