How do I resolve "HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key" when reading files from AWS Config in Athena?

3 minute read
0

When I use Amazon Athena to query AWS Config files, I get the following error: "Error: HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key."

Short description

This error commonly occurs during the following situations:

  • AWS Config resources have multiple tags with the same name.
  • Some of the tags are in uppercase and some are in lowercase.

For example, the following record uses tc:Name and tc:name JSON keys:

{
  "fileVersion": "1.0",
  "configSnapshotId": "35eced35-a13a-45b7-81e4-446e35616e70",
  "configurationItems": [
  {
    "tags": { "tc:Name": "6", "tc:name": "abc6-38" }
  },
  {
    "tags": { "tc:Name": "6", "tc:name": "abc6-38" }
  },
  {
    "tags": { "tc:Name": "6" }
  },
   {
    "tags": { "tc:name": "6" }
  }
  ]
}

Resolution

Run a CREATE TABLE statement similar to the below example. This statement creates an Athena table, sets case.insensitive to false, and maps column names to JSON keys that aren't identical to the column names. Before running this statement, note the following:

  • In the LOCATION field, replace s3://awsexamplebucket/AWSLogs/ with the name of your Amazon Simple Storage Service (Amazon S3) bucket.
  • Replace all mapping properties with your column names and JSON keys (for example, mapping.fileversion'='fileVersion').
CREATE EXTERNAL TABLE aws_config_configuration_snapshot (
 fileversion STRING,
 configsnapshotid STRING,
 configurationitems ARRAY < STRUCT <
    configurationItemVersion : STRING,
    configurationItemCaptureTime : STRING,
    configurationStateId : BIGINT,
    awsAccountId : STRING,
    configurationItemStatus : STRING,
    resourceType : STRING,
    resourceId : STRING,
    resourceName : STRING,
    ARN : STRING,
    awsRegion : STRING,
    availabilityZone : STRING,
    configurationStateMd5Hash : STRING,
    configuration : STRING,
    supplementaryConfiguration : MAP < STRING, STRING >,
    tags: MAP < STRING, STRING >,
    resourceCreationTime : STRING > >
) 
PARTITIONED BY ( dt STRING , region STRING )
ROW FORMAT SERDE 
 'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'case.insensitive'='false,
  'mapping.fileversion'='fileVersion',
  'mapping.configsnapshotid'='configSnapshotId',
  'mapping.configurationitems'='configurationItems',
  'mapping.configurationitemversion'='configurationItemVersion',
  'mapping.configurationitemcapturetime'='configurationItemCaptureTime',
  'mapping.configurationstateid'='configurationStateId',
  'mapping.awsaccountid'='awsAccountId',
  'mapping.configurationitemstatus'='configurationItemStatus',
  'mapping.resourcetype'='resourceType',
  'mapping.resourceid'='resourceId',
  'mapping.resourcename'='resourceName',
  'mapping.arn'='ARN',
  'mapping.awsregion'='awsRegion',
  'mapping.availabilityzone'='availabilityZone',
  'mapping.configurationstatemd5hash'='configurationStateMd5Hash',
  'mapping.supplementaryconfiguration'='supplementaryConfiguration',
  'mapping.configurationstateid'='configurationStateId'
  )
  LOCATION 's3://awsexamplebucket/AWSLogs/';

If you already have a table with partitions loaded, then you can add the new SerDe properties to the table. Use a statement similar to the following:

ALTER TABLE aws_config_configuration_snapshot SET TBLPROPERTIES (  
'case.insensitive'='false',
'mapping.fileversion'='fileVersion',
'mapping.configsnapshotid'='configSnapshotId',
'mapping.configurationitems'='configurationItems',
'mapping.configurationitemversion'='configurationItemVersion',
'mapping.configurationitemcapturetime'='configurationItemCaptureTime',
'mapping.configurationstateid'='configurationStateId',
'mapping.awsaccountid'='awsAccountId',
'mapping.configurationitemstatus'='configurationItemStatus',
'mapping.resourcetype'='resourceType',
'mapping.resourceid'='resourceId',
'mapping.resourcename'='resourceName',
'mapping.arn'='ARN',
'mapping.awsregion'='awsRegion',
'mapping.availabilityzone'='availabilityZone',
'mapping.configurationstatemd5hash'='configurationStateMd5Hash',
'mapping.supplementaryconfiguration'='supplementaryConfiguration',
'mapping.configurationstateid'='configurationStateId')

When the table is ready, access the tags using configurationItem.tags['TAGNAME']. For example, to access the tc:Name tag, run the following query:

SELECT configurationItem.tags['tc:Name']
FROM your_table
CROSS JOIN unnest(configurationItems) AS t(configurationItem)
WHERE configurationItem.tags['tc:Name'] IS NOT NULL

Related information

Create tables in Amazon Athena from nested JSON and mappings using JSONSerDe

Why do I get errors when I try to read JSON data in Amazon Athena?

Hive-JSON-Serde (Github)

AWS OFFICIAL
AWS OFFICIALUpdated a year ago