How do I resolve "HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key" when reading files from AWS Config in Athena?

Last updated: 2019-12-09

When I use Amazon Athena to query AWS Config files, I get the following error: "Error: HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key."

Short Description

This error commonly occurs when AWS Config resources have multiple tags with the same name—and some of the tags are in uppercase and some are in lowercase. For example, the following record uses tc:Name and tc:name JSON keys:

{
  "fileVersion": "1.0",
  "configSnapshotId": "35eced35-a13a-45b7-81e4-446e35616e70",
  "configurationItems": [
    {
      "tags": { "tc:Name": "6", "tc:name": "abc6-38" }
    },
    {
      "tags": { "tc:Name": "6", "tc:name": "abc6-38" }
    },
    {
      "tags": { "tc:Name": "6" }
    },
   {
      "tags": { "tc:name": "6" }
    }
  ]
}

Resolution

Run a CREATE TABLE statement similar to the following. This statement creates an Athena table, sets case.insensitive to false, and maps column names to JSON keys that aren't identical to the column names. Before running this statement:

  • In the LOCATION field, replace s3://awsexamplebucket/AWSLogs/ with the name of your Amazon Simple Storage Service (Amazon S3) bucket.
  • Replace all mapping properties with your column names and JSON keys (for example, mapping.fileversion'='fileVersion').
CREATE EXTERNAL TABLE aws_config_configuration_snapshot (
 fileversion STRING,
 configsnapshotid STRING,
 configurationitems ARRAY < STRUCT <
        configurationItemVersion : STRING,
        configurationItemCaptureTime : STRING,
        configurationStateId : BIGINT,
        awsAccountId : STRING,
        configurationItemStatus : STRING,
        resourceType : STRING,
        resourceId : STRING,
        resourceName : STRING,
        ARN : STRING,
        awsRegion : STRING,
        availabilityZone : STRING,
        configurationStateMd5Hash : STRING,
        configuration : STRING,
        supplementaryConfiguration : MAP < STRING, STRING >,
        tags: MAP < STRING, STRING >,
        resourceCreationTime : STRING > >
) 
PARTITIONED BY ( dt STRING , region STRING )
ROW FORMAT SERDE 
 'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'case.insensitive'='false',
  'mapping.fileversion'='fileVersion',
  'mapping.configsnapshotid'='configSnapshotId',
  'mapping.configurationitems'='configurationItems',
  'mapping.configurationitemversion'='configurationItemVersion',
  'mapping.configurationitemcapturetime'='configurationItemCaptureTime',
  'mapping.configurationstateid'='configurationStateId',
  'mapping.awsaccountid'='awsAccountId',
  'mapping.configurationitemstatus'='configurationItemStatus',
  'mapping.resourcetype'='resourceType',
  'mapping.resourceid'='resourceId',
  'mapping.resourcename'='resourceName',
  'mapping.arn'='ARN',
  'mapping.awsregion'='awsRegion',
  'mapping.availabilityzone'='availabilityZone',
  'mapping.configurationstatemd5hash'='configurationStateMd5Hash',
  'mapping.supplementaryconfiguration'='supplementaryConfiguration',
  'mapping.configurationstateid'='configurationStateId'
  )
LOCATION 's3://awsexamplebucket/AWSLogs/';

If you already have a table with partitions loaded, you can add the new SerDe properties to the table like this:

ALTER TABLE aws_config_configuration_snapshot SET TBLPROPERTIES (  
   'case.insensitive'='false',
'mapping.fileversion'='fileVersion',
'mapping.configsnapshotid'='configSnapshotId',
'mapping.configurationitems'='configurationItems',
'mapping.configurationitemversion'='configurationItemVersion',
'mapping.configurationitemcapturetime'='configurationItemCaptureTime',
'mapping.configurationstateid'='configurationStateId',
'mapping.awsaccountid'='awsAccountId',
'mapping.configurationitemstatus'='configurationItemStatus',
'mapping.resourcetype'='resourceType',
'mapping.resourceid'='resourceId',
'mapping.resourcename'='resourceName',
'mapping.arn'='ARN',
'mapping.awsregion'='awsRegion',
'mapping.availabilityzone'='availabilityZone',
'mapping.configurationstatemd5hash'='configurationStateMd5Hash',
'mapping.supplementaryconfiguration'='supplementaryConfiguration',
'mapping.configurationstateid'='configurationStateId')

When the table is ready, you can access tags using configurationItem.tags['TAGNAME']. For example, to access the tc:Name tag, run the following query:

select configurationItem.tags['tc:Name']
  from your_table
cross join unnest(configurationItems) as t(configurationItem)
where configurationItem.tags['tc:Name'] is not null