Dieser Inhalt ist in der ausgewählten Sprache nicht verfügbar

Wir arbeiten ständig daran, Inhalte in der ausgewählten Sprache bereitzustellen. Vielen Dank für deine Geduld.

How do I resolve the "Unable to infer schema" exception in AWS Glue?

Lesedauer: 3 Minute

My AWS Glue job fails with one of the following exceptions:

"AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'" "AnalysisException: u'Unable to infer schema for ORC. It must be specified manually.;'"

Short description

This error usually happens when AWS Glue tries to read a Parquet or Orc file that's not stored in an Apache Hive-style partitioned path with the key=val structure. AWS Glue expects the Amazon Simple Storage Service (Amazon S3) source files to be in key-value pairs. For example, if the AWS Glue job is processing files at s3://s3-bucket/parquet-data/, the files must have the following partition structure:

s3://s3-bucket/parquet-data/year=2018/month=10/day=10/file1.parquet

If your Parquet or Orc files are stored in a hierarchical structure, the AWS Glue job fails with the "Unable to infer schema" exception. Example:

s3://s3-bucket/parquet-data/year/month/day/file1.parquet

Resolution

Use one of the following methods to resolve the error.

Restructure your data

Copy the files into a new S3 bucket and use Hive-style partitioned paths. Run the job again.

Replace partition column names with asterisks

If restructuring your data isn't feasible, create the DynamicFrame directly from Amazon S3. In the Amazon S3 path, replace all partition column names with asterisks (*). When you use this solution, AWS Glue does not include the partition columns in the DynamicFrame—it only includes the data.

For example, assume that your files are stored in an S3 bucket with the following partition structure:

s3://s3-bucket/parquet-data/year/month/day/files.parquet

To process all files in the s3://s3-bucket/parquet-data/ path, create the DynamicFrame:

dynamic_frame0 = glueContext.create_dynamic_frame_from_options('s3',connection_options={'paths':['s3://s3-bucket/parquet-data/*/*/*'],},format="parquet",transformation_ctx = "dynamic_frame0")

Use a map transformation to add partition columns

To include the partition columns in the DynamicFrame, create a DataFrame first, and then add a column for the Amazon S3 file path. Then, create the DynamicFrame and apply a map transformation to add the partition columns, as shown in the following example. Before using the sample code, replace the Amazon S3 paths and enter your partition column names using the correct index values.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import *
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
df= spark.read.parquet("s3://s3-bucket/parquet-data/*/*/*")
modified_df = df.withColumn('partitions_column',input_file_name())
dyf_0 = DynamicFrame.fromDF(modified_df, glueContext, "dyf_0")

def modify_col(x):
     if x['partitions_column']:
        new_columns = x['partitions_column'].split('/')
        x['year'],x['month'],x['day'] = new_columns[4],new_columns[5],new_columns[6]
        del x['partitions_column']
     return x

modified_dyf = Map.apply(dyf_0,f=modify_col)
datasink2 = glueContext.write_dynamic_frame.from_options(frame =modified_dyf , connection_type = "s3", connection_options = {"path": "s3://my-output-bucket/output/","partitionKeys": ["year","month","day"]}, format = "parquet", transformation_ctx = "datasink2")

For more information about applying map transformations, see Map Class.

Related information

Managing partitions for ETL output in AWS Glue

Themen

Analysen

Relevanter Inhalt

Wie vermeide ich den Fehler „Unable to validate the following destination configurations“ bei Lambda-Ereignisbenachrichtigungen in CloudFormation?
AWS OFFICIALAktualisiert vor 3 Jahren
Wie behebe ich den Fehler „The specified queue does not exist or you do not have access to it.“, wenn ich einen AWS Glue Job ausführe, um Nachrichten an Amazon SQS in einer anderen Region zu senden?
AWS OFFICIALAktualisiert vor 3 Jahren
Wie erkennt der AWS-Glue-Crawler das Schema?
AWS OFFICIALAktualisiert vor 2 Jahren
Wie behebe ich den Fehler „Unable to validate the following destination configurations“ in AWS CloudFormation?
AWS OFFICIALAktualisiert vor 2 Jahren