My Athena query fails with the error "HIVE_INVALID_METADATA: Hive metadata for table is invalid: Table descriptor contains duplicate columns"

Last updated: 2020-12-17

My Amazon Athena query fails with the error "HIVE_INVALID_METADATA: Hive metadata for table sample_table is invalid: Table descriptor contains duplicate columns".

Short description

This error happens when the AWS Glue table schema contains duplicate column names or columns with the same name as partition columns.

Resolution

Check for duplicate column names

To identify the duplicate column, run SHOW CREATE TABLE to retrieve the table schema. Or, view the table schema in the AWS Glue console. In the following example, there are two columns named "column1."

CREATE EXTERNAL TABLE `athenatestingduplicatecolumn_athenatesting`(
  `column1` bigint, 
  `column2` bigint, 
  `column3` bigint, 
  `column1` bigint)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://doc-example-bucket/athenatesting/'
TBLPROPERTIES ( 
  'classification'='csv', 
  'skip.header.line.count'='1'
  )

To resolve the error, run CREATE TABLE to recreate the Athena table with unique column names. Or, use the AWS Glue console to rename the duplicate columns:

  1. Open the AWS Glue console.
  2. Choose the table name from the list, and then choose Edit schema.
  3. Choose the column name, enter a new name, and then choose Save.

Check for table columns and partition columns that have the same name

To check for duplicate partition column and table column names, view the table schema in the AWS Glue console. In the following example, the partition column and table column are both named "id".

Sample data:

{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }

Table definition:

Column name Data type Partition key
id int   
name string    
id string Partition (0)

Use one of the following options to resolve the issue:

  • Rename the partition column in the Amazon Simple Storage Service (Amazon S3) path.
  • Rename the column name in the data and in the AWS glue table definition.
  • If the source data is JSON, manually recreate the table and add partitions in Athena, using the mapping function, instead of using an AWS Glue crawler. In the following example, the AWS Glue table points to s3://doc-example-bucket/athenajsontesting/, which has this partition structure: s3://doc-example-bucket/athenajsontesting/id=12. After you create the table, use MSCK REPAIR TABLE to load the partition.
CREATE EXTERNAL TABLE `athenajsontest_athenajsontesting2`(
  `id1` int COMMENT 'from deserializer', 
  `name` string COMMENT 'from deserializer')
PARTITIONED BY ( 
  `id` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'mapping.id1'='id') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://doc-example-bucket/athenajsontesting'
TBLPROPERTIES (
  'classification'='json',
  'transient_lastDdlTime'='1578399113')

Check the JSON source data for duplicate keys

If the JSON source data contains column names that differ only by case (for example, {"Username": "bob1234", "username": "bob" }), see I get errors when I try to read JSON data in Amazon Athena. Athena is case-insensitive by default. Athena treats "Username" and "username" as duplicate keys, unless you use OpenX SerDe and set the case.insensitive property to false.


Did this article help?


Do you need billing or technical support?