Why do I get errors when I try to read JSON data in Amazon Athena?

4 minute read
0

When I try to read JSON data in Amazon Athena, I receive NULL or incorrect data errors.

Resolution

Check the following common issues:

Use the correct JSON SerDe

Athena uses one of the following JSON SerDes to process JSON:

If you need to run INSERT INTO query against the created table, then use the Hive JSON SerDe.

If you use the OpenX JSON SerDe, then you can ignore malformed records to identify the files that cause the errors. In the following example, when ignore.malformed.json is set to true, malformed records return as NULL:

CREATE EXTERNAL TABLE json (    id int,
    name string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://awsexamplebucket/';

Query the new table to identify the files with malformed records similar to the following:

SELECT "$PATH", * FROM your_table where your_column is NULL

Note: If you run Athena queries for pretty-print JSON files, you can use the Amazon Ion Hive SerDe. The Amazon Ion Hive SerDe doesn't expect each row of data to be on a single line. Use this feature to query JSON datasets that are in pretty print format, or break up the fields in a row with newline characters.

Use one line per record

The following JSON records are formatted correctly and can be read by all three JSON SerDe:

{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }

The following JSON records can only be read by the Amazon Ion Hive SerDe:

{  "id" : 50,
  "name":"John"
}
{
  "id" : 51,
  "name":"Jane"
}
{
  "id" : 53,
  "name":"Jill"
}

The following JSON data has multiple records within a single line and is formatted incorrectly:

{ "id" : 50, "name":"John" } { "id" : 51, "name":"Jane" } { "id" : 53, "name":"Jill" }

Use the correct data type in each column

The second line in the following example contains an incorrect data type for "age". The column value should be "11" instead of "eleven". This causes the following error message: HIVE_BAD_DATA: Error parsing field value 'eleven' for field 1: For input string: "eleven".

{"name":"Patrick","age":35,"address":"North Street"}
{"name":"Carlos","age":"eleven","address":"Flowers Street"}
{"name":"Fabiana","age":22,"address":"Main Street"}

Use the correct extension for compressed JSON files

When you use a compressed JSON file, the file must end in the ".gz" format. For example, "myfile.json.gz" or "myfile.gz" are correct formatted extensions.

Use case-insensitive columns or set the case.insensitive property to false

Athena is case-insensitive by default. If you have column names that differ by case, for example, "Column" and "column", Athena generates the following error:

"HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key"

Additionally, your data isn't visible in Athena. To avoid this error, generate your data with case-insensitive columns.

If you use the OpenX SerDe, then use case-sensitive key names. To do this, set the case.insensitive SerDe property to false and add mapping for the uppercase key. For example, to use upper and lowercase columns like this:

{"Username": "bob1234", "username": "bob" }

Use these SerDe properties:

CREATE external TABLE casesensitive_json (user_name String,username String)
ROW FORMAT serde 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'mapping.user_name' = 'Username','case.insensitive'='false')
LOCATION 's3://awsexamplebucket/';

If the JSON file has duplicate case-sensitive column names, then update the file to remove the duplicate columns:

{"username": "bob1234", "username": "bob" }

Be sure that all rows in the JSON SerDe table are in JSON format

To find if there are JSON rows or file names that aren't valid in the Athena table, do the following:

  1. Create a table with a delimiter that's not present in the input files. Run a command similar to the following:

    CREATE EXTERNAL TABLE IF NOT EXISTS json_validator (jsonrow string) ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '%'
    location 's3://awsexamplebucket/';
  2. Run a query similar to the following to return the file name, row details, and Amazon S3 path for the JSON rows that aren't valid:

    WITH testdataset AS (SELECT "$path" s3path,jsonrow,try(json_parse(jsonrow)) isjson FROM json_validator)
    SELECT * FROM testdataset WHERE ISJSON IS NULL;

Related information

Best practices for reading JSON data

JSON-related errors

AWS OFFICIAL
AWS OFFICIALUpdated 19 days ago