How do I resolve the "Unable to infer schema" exception in AWS Glue?

Last updated: 2019-06-12

My AWS Glue job fails with one of the following exceptions:

  • "AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'"
  • "AnalysisException: u'Unable to infer schema for ORC. It must be specified manually.;'"

Short Description

This error usually happens when AWS Glue tries to read a Parquet or Orc file that is not stored in an Apache Hive-style partitioned path that uses the key=val structure. AWS Glue expects the Amazon Simple Storage Service (Amazon S3) source files to be in key-value pairs. For example, if the AWS Glue job is processing files at s3://s3-bucket/parquet-data/, the files should have the following partition structure:

s3://s3-bucket/parquet-data/year=2018/month=10/day=10/file1.parquet

If your Parquet or Orc files are stored in a hierarchical structure, the AWS Glue job fails with the "Unable to infer schema" exception. Example:

s3://s3-bucket/parquet-data/year/month/day/file1.parquet

Resolution

Use one of the following methods to resolve the error.

Restructure your data

Copy the files into a new S3 bucket and use Hive-style partitioned paths. Run the job again.

Replace partition column names with asterisks

If restructuring your data isn't feasible, create the DynamicFrame directly from Amazon S3. In the Amazon S3 path, replace all partition column names with asterisks (*). When you use this solution, AWS Glue does not include the partition columns in the DynamicFrame—it only includes the data.

For example, assume that your files are stored in an S3 bucket with the following partition structure:

s3://s3-bucket/parquet-data/year/month/day/files.parquet

To process all files in the s3://s3-bucket/parquet-data/ path, create the DynamicFrame like this:

dynamic_frame0 = glueContext.create_dynamic_frame_from_options('s3',connection_options={'paths':['s3://s3-bucket/parquet-data/*/*/*'],},format="parquet",transformation_ctx = "dynamic_frame0")

Use a map transformation to add partition columns

To include the partition columns in the DynamicFrame, create a DataFrame first, and then add a column for the Amazon S3 file path. Then, create the DynamicFrame and apply a map transformation to add the partition columns, as shown in the following example. Before using the sample code, replace the Amazon S3 paths and enter your partition column names using the correct index values.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import *
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
df= spark.read.parquet("s3://s3-bucket/parquet-data/*/*/*")
modified_df = df.withColumn('partitions_column',input_file_name())
dyf_0 = DynamicFrame.fromDF(modified_df, glueContext, "dyf_0")

def modify_col(x):
     if x['partitions_column']:
        new_columns = x['partitions_column'].split('/')
        x['year'],x['month'],x['day'] = new_columns[4],new_columns[5],new_columns[6]
        del x['partitions_column']
     return x

modified_dyf = Map.apply(dyf_0,f=modify_col)
datasink2 = glueContext.write_dynamic_frame.from_options(frame =modified_dyf , connection_type = "s3", connection_options = {"path": "s3://my-output-bucket/output/","partitionKeys": ["year","month","day"]}, format = "parquet", transformation_ctx = "datasink2")

For more information about applying map transformations, see Map Class.


Did this article help you?

Anything we could improve?


Need more help?