I get errors when I try to read JSON data in Amazon Athena

Last updated: 2019-11-18

When I try to read JSON data in Amazon Athena, I receive NULL or incorrect data errors. How can I resolve this?

Resolution

Check the following common issues:

Use the correct JSON SerDe

Athena processes JSON data using one of two JSON SerDes:

If you're not sure which SerDe you used, try both of the SerDe versions. If you use the OpenX SerDe, you can ignore malformed records to identify the lines that are causing the errors, as shown in the following example. When ignore.malformed.json is set to true, malformed records return as NULL.

CREATE EXTERNAL TABLE json (
    a string,
    b int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://awsexamplebucket';

Query the new table to identify the files with malformed records. Example:

SELECT "$PATH", * FROM your_table where your_column is NULL

Use one line per record

The following JSON records are formatted correctly:

{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }

The following JSON records are formatted incorrectly:

{
  "id" : 50,
  "name":"John"
},
{
  "id" : 51,
  "name":"Jane"
}
{
  "id" : 53,
  "name":"Jill"
}

These records are also formatted incorrectly:

{ "id" : 50, "name":"John" } { "id" : 51, "name":"Jane" } { "id" : 53, "name":"Jill" }

Use the correct data type in each column

The second line in the following example contains an incorrect data type for "age". The column should say "11" instead of "eleven". This causes the following error message: HIVE_BAD_DATA: Error parsing field value 'eleven' for field 1: For input string: "eleven".

{"name":"Patrick","age":35,"address":"North Street"}
{"name":"Carlos","age":"eleven","address":"Flowers Street"}
{"name":"Fabiana","age":22,"address":"Main Street"}

Use the correct extension for compressed JSON files

When you use a compressed JSON file, the file must end in ".json" followed by the extension of the compression format, such as ".gz". For example, this is a correctly formatted extension for a gzip file: "myfile.json.gz".

Use case-insensitive columns or set the case.insensitive property tofalse

Athena is case-insensitive by default. If you have column names that differ only by case (for example, “Column” and “column”), Athena generates an error ("HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key") and your data is not visible in Athena. The easiest way to avoid this problem is to generate your data with case-insensitive columns.

If you use the OpenX SerDe, you can use case-sensitive key names. To do this, set the case.insensitive SerDe property to false and add mapping for the uppercase key. For example, to use upper and lowercase columns like this:

{"Username": "bob1234", "username": "bob" }

Use these SerDe properties:

CREATE external TABLE casesensitive_json (user_name String,username String)
ROW FORMAT serde 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'mapping.user_name' = 'Username','case.insensitive'='false')
LOCATION 's3://awsexamplebucket';

Did this article help you?

Anything we could improve?


Need more help?