Why do I get errors when I try to read JSON data in Amazon Athena?

Last updated: 2021-03-25

When I try to read JSON data in Amazon Athena, I receive NULL or incorrect data errors.

Resolution

Check the following common issues:

Use the correct JSON SerDe

Athena processes JSON data using one of two JSON SerDes:

If you're not sure which SerDe that you used, try both of the SerDe versions. If you use the OpenX SerDe, you can ignore malformed records to identify the lines that are causing the errors, as shown in the following example. When ignore.malformed.json is set to true, malformed records return as NULL.

CREATE EXTERNAL TABLE json (
    a string,
    b int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://awsexamplebucket/';

Query the new table to identify the files with malformed records. For example:

SELECT "$PATH", * FROM your_table where your_column is NULL

Use one line per record

The following JSON records are formatted correctly:

{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }

The following JSON records are formatted incorrectly:

{
  "id" : 50,
  "name":"John"
},
{
  "id" : 51,
  "name":"Jane"
}
{
  "id" : 53,
  "name":"Jill"
}

These records are also formatted incorrectly:

{ "id" : 50, "name":"John" } { "id" : 51, "name":"Jane" } { "id" : 53, "name":"Jill" }

Use the correct data type in each column

The second line in the following example contains an incorrect data type for "age". The column value should be "11" instead of "eleven". This causes the following error message: HIVE_BAD_DATA: Error parsing field value 'eleven' for field 1: For input string: "eleven".

{"name":"Patrick","age":35,"address":"North Street"}
{"name":"Carlos","age":"eleven","address":"Flowers Street"}
{"name":"Fabiana","age":22,"address":"Main Street"}

Use the correct extension for compressed JSON files

When you use a compressed JSON file, the file must end in ".json" followed by the extension of the compression format, such as ".gz". For example, this is a correctly formatted extension for a gzip file: "myfile.json.gz".

Use case-insensitive columns or set the case.insensitive property to false

Athena is case-insensitive by default. If you have column names that differ only by case (for example, “Column” and “column”), Athena generates an error ("HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key") and your data is not visible in Athena. The easiest way to avoid this problem is to generate your data with case-insensitive columns.

If you use the OpenX SerDe, you can use case-sensitive key names. To do this, set the case.insensitive SerDe property to false and add mapping for the uppercase key. For example, to use upper and lowercase columns like this:

{"Username": "bob1234", "username": "bob" }

Use these SerDe properties:

CREATE external TABLE casesensitive_json (user_name String,username String)
ROW FORMAT serde 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'mapping.user_name' = 'Username','case.insensitive'='false')
LOCATION 's3://awsexamplebucket/';

Be sure that all rows in the JSON SerDE table are in JSON format

To find if there are invalid JSON rows or file names in the Athena table, do the following:

1.    Create a table with a delimiter that's not present in the input files. Run a command similar to the following:

CREATE EXTERNAL TABLE IF NOT EXISTS json_validator (jsonrow string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '%'
location 's3://awsexamplebucket/';

2.    Run a query similar to the following to return the file name, row details, and Amazon S3 path for the invalid JSON rows.

WITH testdataset AS (SELECT "$path" s3path,jsonrow,try(json_parse(jsonrow)) isjson FROM json_validator)
SELECT * FROM testdataset WHERE ISJSON IS NULL;

Did this article help?


Do you need billing or technical support?