Why does my AWS Glue crawler fail with an internal service exception?
Last updated: 2021-08-17
My AWS Glue crawler fails with the error "ERROR : Internal Service Exception".
Crawler internal service exceptions are sometimes caused by transient issues. Before you start troubleshooting, run the crawler again. If you still get an internal service exception, check for the following common problems.
AWS Glue Data Catalog
- Be sure that the column name lengths don't exceed 255 characters and don't contain special characters. For more information about column requirements, see Column.
- Check for malformed data. For example, if a column name doesn't conform to the regular expression pattern "[\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]", then the crawler fails.
- Check for columns that have a length of 0. This happens when columns in the data don't match the data format of the table.
- If your data contains DECIMAL columns with the "(precision, scale)" format, then be sure that the scale value is less than or equal to the precision value.
- In the schema definition of your table, be sure that the Type of each of your columns isn't longer than 131,072 bytes. For more information, see Column structure.
- If your crawler fails with either of the following errors, then be sure that the total schema definition of your table is not larger than 1 MB:
- "Unable to create table in Catalog"
- "Payload size of request exceeded limit"
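Several of the Data Catalog limits above can be checked locally before you run the crawler. The following sketch is illustrative (the helper name is not part of any AWS API, and the surrogate-pair range in the pattern above is rewritten as the equivalent astral range U+10000-U+10FFFF):

```python
import re

# Allowed characters for column names, per the pattern quoted above.
_NAME_PATTERN = re.compile(
    "^[\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF\t]*$"
)
# Matches type strings such as "decimal(10,2)".
_DECIMAL_PATTERN = re.compile(r"^decimal\((\d+)\s*,\s*(\d+)\)$", re.IGNORECASE)

def check_column(name: str, type_str: str) -> list:
    """Return a list of problems that would trip the crawler."""
    problems = []
    if len(name) > 255:
        problems.append("name longer than 255 characters")
    if not _NAME_PATTERN.match(name):
        problems.append("name contains characters outside the allowed pattern")
    if len(type_str.encode("utf-8")) > 131072:
        problems.append("type definition longer than 131,072 bytes")
    m = _DECIMAL_PATTERN.match(type_str)
    if m:
        precision, scale = int(m.group(1)), int(m.group(2))
        if scale > precision:
            problems.append("decimal scale exceeds precision")
    return problems
```

Running this check over your expected schema before a crawl can save a failed run, but it only covers the limits listed here, not every Data Catalog constraint.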
Amazon Simple Storage Service (Amazon S3)
- Be sure that the Amazon S3 path doesn't contain special characters.
- Confirm that the AWS Identity and Access Management (IAM) role for the crawler has permissions to access the Amazon S3 path. For more information, see Create an IAM role for AWS Glue.
- Having a large number of small files can cause the crawler to fail with an internal service exception. To avoid this problem, use the S3DistCp tool to combine smaller files. You incur additional Amazon EMR charges when you use S3DistCp. Or, set exclude patterns and then crawl the files in batches.
- Remove special ASCII characters such as ^, %, and ~ from your data, if possible. If that's not possible, use custom classifiers to classify your data.
- Confirm that the S3 objects use the STANDARD storage class. To restore objects to the STANDARD storage class, see Restoring an archived object.
- Confirm that the include and exclude patterns in the crawler configuration match the S3 bucket paths.
- If you're crawling an encrypted S3 bucket, then confirm that the IAM role for the crawler has the appropriate permissions for the AWS Key Management Service (AWS KMS) key. For more information, see Working with security configurations on the AWS Glue console and Setting up encryption in AWS Glue.
- If you're crawling an encrypted S3 bucket, be sure that the bucket, AWS KMS key, and AWS Glue crawler are in the same AWS Region.
- Check the request rate on the S3 bucket that you're crawling. If it's high, consider creating more prefixes to parallelize reads. For more information, see Best practices design patterns: optimizing Amazon S3 performance.
- Be sure that the S3 bucket partitions and keys are consistent. For example, if the crawler expects the objects to use the path s3://awsdoc-example-bucket/yyyy=xxxx/mm=xxx/dd=xx/[files], but some of the objects use the path s3://awsdoc-example-bucket/yyyy=xxxx/mm=xxx/[files], the crawler fails with an internal service exception.
- Be sure that the S3 resource path length is less than 700 characters.
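Some of the S3 checks above (special characters, the 700-character path limit, and consistent partition structure) can be pre-validated against a listing of object keys, for example one produced by `aws s3 ls --recursive`. This is an illustrative sketch, not an official validator:

```python
MAX_PATH_LENGTH = 700
SPECIAL_CHARS = set("^%~")

def s3_path_issues(bucket: str, keys: list) -> list:
    """Flag object keys likely to cause a crawler internal service exception."""
    issues = []
    # Every object should sit at the same partition depth, e.g.
    # yyyy=.../mm=.../dd=.../file for all objects.
    depths = {key.count("/") for key in keys}
    if len(depths) > 1:
        issues.append("inconsistent partition depth across objects")
    for key in keys:
        path = f"s3://{bucket}/{key}"
        if len(path) >= MAX_PATH_LENGTH:
            issues.append(f"path too long: {path[:50]}...")
        if SPECIAL_CHARS & set(key):
            issues.append(f"special characters in key: {key}")
    return issues
```

For example, a listing that mixes `yyyy=2021/mm=01/dd=01/file` keys with `yyyy=2021/mm=01/file` keys is reported as inconsistent, matching the partition example above.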
Amazon DynamoDB
- Be sure that the DynamoDB table has enough read capacity units.
- Be sure that the IAM role that you use to run the crawler has the dynamodb:Scan permission. For more information, see DynamoDB API permissions: actions, resources, and conditions reference.
- Be sure that the table name doesn't include white space characters.
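As a hedged example, an IAM policy statement granting the crawler's role the required scan access might look like the following. The Region, account ID, and table name are placeholders; substitute your own values:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:DescribeTable",
        "dynamodb:Scan"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:111122223333:table/example-table"
    }
  ]
}
```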
JDBC
- If you're crawling a JDBC data source that's encrypted with AWS KMS, then check the subnet that you're using for the connection. The subnet's route table must have a route to the AWS KMS endpoint, either through an AWS KMS VPC endpoint or a NAT gateway.
- Be sure that you're using the correct Include path syntax. For more information, see Defining crawlers.
- If you're crawling a JDBC data store, then confirm that the SSL connection is configured correctly. If you're not using an SSL connection, then be sure that Require SSL connection isn't selected when you configure the crawler.
- Confirm that the database name in the AWS Glue connection matches the database name in the crawler's Include path. Also, be sure that you enter the Include path correctly. For more information, see Include and exclude patterns.
- Be sure that the subnet that you're using is in an Availability Zone that's supported by AWS Glue.
- Be sure that the subnet that you're using has enough available private IP addresses.
- Confirm that the JDBC data source is supported with the built-in AWS Glue JDBC driver.
- If you're using AWS KMS, then the AWS Glue crawler must have access to AWS KMS. To grant access, select the Enable Private DNS Name option when you create the AWS KMS endpoint. Then, add the AWS KMS endpoint to the VPC subnet configuration for the AWS Glue connection. For more information, see Creating an AWS KMS VPC endpoint (VPC console).
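One of the checks above, that the database name in the AWS Glue connection matches the first segment of the crawler's Include path, can be sketched as a quick local consistency check. The helper name is illustrative, not an AWS API:

```python
def include_path_matches_connection(include_path: str, connection_database: str) -> bool:
    """Check that the crawler's Include path (e.g. 'mydb/%') starts with
    the database name configured in the AWS Glue connection."""
    first_segment = include_path.split("/", 1)[0]
    return first_segment == connection_database
```

A mismatch here (for example, an Include path of `otherdb/%` against a connection configured for `mydb`) is one of the configuration errors that surfaces only as an internal service exception.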