AWS Database Blog

Analyze JSON data efficiently with Amazon Redshift SUPER

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that you can use to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data. Tens of thousands of customers use Amazon Redshift to process large amounts of data, modernize their data analytics workloads, and provide insights for their business users.

Amazon Redshift transforms how organizations analyze JSON data by combining the analytical power of a columnar data warehouse with robust JSON processing capabilities. By using Amazon Redshift SUPER datatype, you can efficiently store, query, and analyze complex hierarchical data alongside traditional structured data without sacrificing performance.

This post focuses on JSON features of Amazon Redshift.

Amazon Redshift architecture

The Amazon Redshift architecture is designed as a massively parallel processing (MPP) data warehouse system that efficiently handles online analytics processing (OLAP) workloads. At its core is a cluster consisting of a leader node and compute nodes. The leader node manages client communications, develops execution plans, and coordinates query processing, while compute nodes handle the actual data processing in parallel. This MPP architecture particularly benefits OLAP operations by distributing data processing across multiple nodes and executing queries in parallel. The system is optimized for complex analytical queries and reporting on large datasets, rather than traditional online transaction processing (OLTP) operations. Key features include automatic data compression, seamless integration with various AWS services like Amazon Simple Storage Service (Amazon S3) and AWS Glue for data lake analytics, concurrency scaling to handle fluctuating workloads, and a serverless option for simplified management and auto-scaling, alongside robust security measures including encryption and network isolation.

JSON document capabilities in Amazon Redshift

The Amazon Redshift SUPER datatype offers several key advantages for working with JSON data:

  • Direct storage of JSON documents with complex hierarchical structures without transformation overhead
  • Schema flexibility for evolving JSON documents, so you can store and query semi-structured data without requiring cumbersome database migrations for schema changes within the JSON structure
  • SQL-compatible querying with PartiQL, an intuitive syntax for semi-structured data

Amazon Redshift excels at analytical processing of JSON data at scale, proving ideal for complex aggregations, joins, and analytical queries across vast JSON document volumes. To unlock its full potential and achieve optimal performance with JSON data, adherence to Amazon Redshift best practices and to SUPER datatype best practices is crucial.

Real-world example: Retail customer interaction analysis

Consider building a retail analytics system that captures and analyzes customer interactions across multiple channels. This use case demonstrates the SUPER datatype capabilities, because customer interactions generate semi-structured data with varying attributes depending on the channel, device, and action types.

First, we create a table to store customer interactions from various channels. Instead of creating dozens of columns to accommodate interaction attributes, we’ll use a single SUPER column that can store the JSON structure:

CREATE TABLE customer_interactions (
    customer_id INT,
    interaction_date TIMESTAMP,
    interaction_data SUPER
);

This basic schema provides vast flexibility. As the retail business evolves, adding new channels, tracking new metrics, or capturing new device attributes, you won’t need to alter the database schema. This is particularly valuable for retail analytics systems that need to adapt quickly to changing business requirements.

Let’s insert a sample customer interaction from a mobile app. This JSON document captures a rich interaction where a customer viewed a product and then added it to their cart:

-- Insert a mobile app interaction with product view and cart addition
INSERT INTO customer_interactions VALUES
(1001, '2023-09-15 14:32:45',
 JSON_PARSE('{
    "channel": "mobile_app",
    "actions": [
      {"type": "view_product", "product_id": "P-5432", "duration_sec": 45},
      {"type": "add_to_cart",  "product_id": "P-5432", "quantity": 1}
    ],
    "device_info": {"os": "iOS", "version": "15.4.1", "model": "iPhone 13"}
 }')),
(1002, '2023-09-15 15:10:12',
 JSON_PARSE('{
    "channel": "web",
    "actions": [
      {"type": "view_product", "product_id": "P-1234", "duration_sec": 20},
      {"type": "checkout",     "product_id": "P-1234", "payment_method": "credit_card"}
    ],
    "device_info": {"os": "Windows", "version": "10", "model": "Dell XPS"}
 }')),
(1003, '2023-09-16 09:22:33',
 JSON_PARSE('{
    "channel": "mobile_app",
    "actions": [
      {"type": "view_product", "product_id": "P-9999", "duration_sec": 60},
      {"type": "add_to_cart",  "product_id": "P-9999", "quantity": 2},
      {"type": "checkout",     "product_id": "P-9999", "payment_method": "paypal"}
    ],
    "device_info": {"os": "Android", "version": "11", "model": "Pixel 5"}
 }')),
(1004, '2023-09-16 15:45:30',
 JSON_PARSE('{
    "channel": "mobile_app",
    "actions": [
      {"type": "store_locator", "timestamp": "2023-09-16T15:45:32Z"}
    ],
    "device_info": {"os": "Android", "version": "12", "model": "Samsung Galaxy S21"},
    "location": {
      "latitude": 37.7749, "longitude": -122.4194,
      "accuracy_meters": 10, "captured_at": "2023-09-16T15:45:30Z"
    }
 }'));

This JSON structure contains valuable information for retail analytics:

  • The customer’s shopping channel (mobile app)
  • A sequence of actions they took (viewing a product, then adding it to cart)
  • Details about the product they interacted with
  • Information about their device

In a traditional relational model, this might require joins across multiple tables (interactions, actions, devices). With JSON in Amazon Redshift, all related data stays together in a single row.

Advanced JSON querying in Amazon Redshift

The SUPER datatype offers several key features for working with JSON data. To explore these features, see these examples. Let’s discuss some of these features in the context of a retail analytics platform.

Each of these examples showcase how Amazon Redshift, with the SUPER datatype, combined with PartiQL navigation, treats JSON documents as standard SQL constructs, enabling powerful, flexible analysis of JSON-driven retail events.

Count of actions by product

This query tallies the number of times each product was viewed, added to cart, and checked out across all customer sessions.

SELECT
  action.product_id::varchar AS product_id,
  SUM(CASE WHEN action.type::varchar = 'view_product' THEN 1 ELSE 0 END) AS views,
  SUM(CASE WHEN action.type::varchar = 'add_to_cart' THEN 1 ELSE 0 END) AS add_to_cart,
  SUM(CASE WHEN action.type::varchar = 'checkout' THEN 1 ELSE 0 END) AS checkouts
FROM customer_interactions c
CROSS JOIN c.interaction_data.actions AS action
GROUP BY 1
ORDER BY 1;

Output:

product_id add_to_cart checkouts
P-1234 0 1
P-5432 1 0
P-9999 1 1

This query

  • Uses CROSS JOIN to flatten a JSON array (c.interaction_data.actions AS action).
  • Unifies relational aggregation (SUM and GROUP BY) with JSON navigation to pivot user events.

High-engagement sessions

This query identifies customers whose total product viewing time exceeds 30 seconds by summing up the duration of all view_product actions.

SELECT
  c.customer_id,
  SUM(CASE WHEN action.type::varchar = 'view_product'
           THEN action.duration_sec::int ELSE 0 END) AS total_view_time
FROM customer_interactions c
CROSS JOIN c.interaction_data.actions AS action
GROUP BY c.customer_id
HAVING SUM(CASE WHEN action.type::varchar = 'view_product'
                THEN action.duration_sec::int ELSE 0 END) > 30
ORDER BY c.customer_id;

Output:

customer_id total_view_time
1001 45
1003 60

In this query, JSON fields are used directly in aggregations (action.duration_sec).

Device model extraction

This query pulls the device model from the nested device_info object, supporting device-based segmentation.

SELECT
  customer_id,
  interaction_data.device_info.model::varchar AS device_model
FROM customer_interactions;

Output:

customer_id device_model
1001 iphone 13
1002 Dell XPS
1003 Pixel 5
1004 Samsung Galaxy S21

In this use case, native dot-notation JSON navigation on a SUPER column (interaction_data.device_info.model) enables direct in-line extraction of nested text values without casting or parsing functions.

Filtering by operating system

This query uses a PartiQL WHERE clause to filter sessions to return only those from Android devices, enabling OS-specific analysis.

SELECT customer_id
FROM customer_interactions
WHERE interaction_data.device_info.os = 'Android';

Output:

customer_id
1003
1004

In this query, JSON attributes participate directly in WHERE filters. You can segment data by JSON attributes as if they were relational columns.

Extracting second action details

This query retrieves the type, product ID, and quantity of the second action in each session, useful for analyzing follow-up behaviors.

SELECT
  c.customer_id,
  action.type::varchar       AS second_action_type,
  action.product_id::varchar AS second_action_product,
  action.quantity::int       AS quantity
FROM customer_interactions c,
     c.interaction_data.actions AS action AT idx
WHERE idx = 1
ORDER BY c.customer_id;

Output:

customer_id second_action_type second_action_product quantity
1001 add_to_cart P-5432 1
1002 checkout P-1234 (null)
1003 add_to_cart P-9999 2

In this query, PartiQL’s array indexing (AT idx) lets queries target specific JSON array elements. This gives Amazon Redshift parity with JSON document databases where array traversal is a key feature.

Pivoting customer actions

This query pivots counts of each action type into separate columns for a concise per-customer summary. It shows how to convert document-style events into relational-style summaries directly. JSON flexibility is preserved while producing structured aggregates.

SELECT
  c.customer_id,
  SUM(CASE WHEN action.type::varchar = 'view_product' THEN 1 ELSE 0 END) AS view_product,
  SUM(CASE WHEN action.type::varchar = 'add_to_cart' THEN 1 ELSE 0 END) AS add_to_cart,
  SUM(CASE WHEN action.type::varchar = 'checkout' THEN 1 ELSE 0 END) AS checkout,
  SUM(CASE WHEN action.type::varchar = 'store_locator' THEN 1 ELSE 0 END) AS store_locator
FROM customer_interactions c
CROSS JOIN c.interaction_data.actions AS action
GROUP BY c.customer_id
ORDER BY c.customer_id;

Output:

customer_id view_product add_to_cart checkout store_locator
1001 1 1 0 0
1002 1 0 1 0
1003 1 1 1 0
1004 0 0 0 1

Mobile app view_product events

Flatten actions arrays and extract only view_product events on mobile. This example flattens the actions array and filters to show only mobile app product-view events, along with viewing durations.

SELECT
  c.customer_id,
  c.interaction_date,
  action.product_id::varchar AS product_id,
  action.duration_sec::int   AS duration_sec
FROM customer_interactions c
CROSS JOIN c.interaction_data.actions AS action
WHERE action.type::varchar = 'view_product'
  AND c.interaction_data.channel = 'mobile_app'
ORDER BY c.customer_id;

Output:

customer_id interaction_date product_id duration_sec
1001 2023-09-15 14:32:45 P-5432 45
1003 2023-09-16 09:22:33 P-9999 60

This query shows a multi-level JSON querying within plain SQL; complex filters combine JSON array attributes (action.type) with parent object fields (channel).

Performance optimization for JSON in Amazon Redshift

Amazon Redshift implements several optimizations for the SUPER datatype that are particularly valuable for retail analytics:

  •  Columnar storage: Columnar storage in Amazon Redshift significantly benefits the SUPER datatype by optimizing the storage and retrieval of semi-structured data, even though the SUPER datatype can store entire JSON documents within a single column.
  • Zone maps: Zone maps can help skip blocks that don’t match query predicates on extracted values from the SUPER columns
  • Parallel processing: The MPP architecture of Amazon Redshift distributes data and query processing across multiple nodes, including operations on SUPER data, enabling high-performance queries on large datasets.
  • Materialized views: Create materialized views over commonly accessed JSON paths; they can pre-compute results for frequently accessed JSON paths, further enhancing performance.

For example, consider materialized view. For frequently accessed JSON paths, these views can significantly improve performance. In our retail example, we might create a view for mobile product views:

-- Create a materialized view to accelerate mobile shopping analysis
-- This improves performance for frequently run reports on mobile engagement
CREATE MATERIALIZED VIEW mv_mobile_product_views AS
SELECT
  c.customer_id,
  c.interaction_date,
  action.product_id::varchar AS product_id,
  c.interaction_data.device_info.model::varchar AS device_model
FROM customer_interactions c
CROSS JOIN c.interaction_data.actions AS action
WHERE action.type::varchar = 'view_product'
  AND c.interaction_data.channel = 'mobile_app';

This materialized view pre-extracts commonly accessed JSON paths, converting them to a columnar format for faster query performance. Retail analysts running reports on mobile engagement will experience significantly faster query times compared to parsing the JSON structure repeatedly. Here is an example:

SELECT * FROM mv_mobile_product_views;

Output:

customer_id interaction_date product_id device_model
1001 2023-09-15 14:32:45 P-5432 iPhone 13
1003 2023-09-16 09:22:33 P-9999 Pixel 5

Advanced use cases

In this section, we discuss how to deal with advanced JSON uses cases when working with Amazon Redshift.

Geospatial data in JSON

For retail businesses, location data is crucial for understanding customer shopping patterns. Amazon Redshift supports geospatial queries through its GEOMETRY and GEOGRAPHY data types. You can store location data inside SUPER columns and extract it for spatial operations:

-- Store a customer interaction with location data from a store visit
INSERT INTO customer_interactions VALUES (
    1004, 
    '2023-09-16 15:45:30',
    '{ 
        "channel": "mobile_app", 
        "actions": [
            {"type": "store_locator", "timestamp": "2023-09-16T15:45:32Z"}
        ],
        "device_info": {
            "os": "Android",
            "version": "12",
            "model": "Samsung Galaxy S21"
        },
        "location": {
            "latitude": 37.7749,
            "longitude": -122.4194,
            "accuracy_meters": 10,
            "captured_at": "2023-09-16T15:45:30Z"
        }
    }'
);
-- Extract location data to find customers near specific stores
-- This enables location-based marketing campaigns
SELECT
  customer_id,
  interaction_date,
  interaction_data.channel::varchar AS channel,
  ST_GeomFromText('POINT(' ||
        interaction_data.location.longitude::varchar || ' ' ||
        interaction_data.location.latitude::varchar || ')') AS customer_location
FROM customer_interactions
WHERE interaction_data.location IS NOT NULL;

Output:

customer_id interaction_date channel customer_location
1004 2023-09-16 15:45:30 mobile_app POINT(-122.4194 37.7749)

By using this capability, retail businesses can perform location-based analytics, such as identifying customers who visited physical stores, analyzing foot traffic patterns, or targeting promotions to customers in specific geographic areas.

Complex nested JSON analysis

Retail businesses often need to analyze complex customer behavior patterns across purchase history. Amazon Redshift enables sophisticated analysis of deeply nested JSON structures. Let’s add a customer and product tables and insert some data.

CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name  VARCHAR(50)
);
CREATE TABLE products (
    product_id VARCHAR(20) PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(50)
);
INSERT INTO customers VALUES
(1001, 'Alice', 'Johnson'),
(1002, 'Bob', 'Smith'),
(1003, 'Carol', 'Taylor'),
(1004, 'David', 'Brown');
INSERT INTO products VALUES
('P-5432', 'Red Sneakers', 'Shoes'),
('P-1234', 'Wireless Mouse', 'Electronics'),
('P-9999', 'Coffee Maker', 'Appliances');

-- Analyze purchasing patterns of customers over 30
-- This helps target specific demographics with appropriate products
SELECT ci.customer_id, cu.first_name, cu.last_name,
       p.product_name, p.category
FROM customer_interactions ci
JOIN customers cu ON ci.customer_id = cu.customer_id
LEFT JOIN products  p ON p.product_id = ci.interaction_data.actions[0].product_id;

Output:

customer_id first_name last_name product_name category
1001 Alice Johnson Red Sneakers Shoes
1002 Bob Smith Wireless Mouse Electronics
1003 Carol Taylor Coffee Maker Appliances
1004 David Brown NULL NULL

This query analyzes purchasing patterns of most popular products purchased by customers over 30 years old, in the current year. The ability to navigate through nested customer information, purchase history, and item details, provides powerful insights for retail merchandising and marketing teams.

Time-series analysis on JSON

Retail businesses often need to analyze time-based patterns, such as shopping behavior during specific hours or seasonal trends. JSON data often contains time-series information that can be analyzed efficiently in Amazon Redshift. Let illustrate use case through a time-series IOT example, using the store_sensors table.

CREATE TABLE store_sensors (
    store_id INT,
    readings SUPER
);
INSERT INTO store_sensors VALUES
(1, JSON_PARSE('[
  {"value": {"temperature": 22.5, "timestamp": "2023-03-01T10:00:00Z"}},
  {"value": {"temperature": 23.0, "timestamp": "2023-06-15T12:00:00Z"}},
  {"value": {"temperature": 21.8, "timestamp": "2023-11-20T09:30:00Z"}}
]')),
(2, JSON_PARSE('[
  {"value": {"temperature": 19.5, "timestamp": "2023-02-10T14:00:00Z"}},
  {"value": {"temperature": 20.0, "timestamp": "2023-07-25T16:00:00Z"}}
]'));

-- Analyze in-store temperature sensors to optimize shopping environment
-- This helps maintain optimal store conditions for customer comfort

SELECT
  s.store_id,
  COUNT(*) AS reading_count,
  AVG((reading.value.temperature)::DOUBLE PRECISION) AS avg_temp
FROM store_sensors s
CROSS JOIN s.readings AS reading
WHERE (reading.value.timestamp)::TIMESTAMP BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY s.store_id;

Output:

store_id reading_count avg_temp
1 3 69
2 2 69

This query analyzes store environmental conditions, identifying the average temperature. For retail businesses, maintaining optimal shopping conditions is crucial for customer satisfaction and can directly impact sales. The ability to store and analyze IoT sensor data as JSON provides valuable operational insights without complex data transformations.

Limitations and considerations

While the SUPER datatype offers powerful JSON capabilities for retail analytics, there are limitations that you need to be aware of. For more information, see the public documentation on the limitations.

The SUPER datatype inherits two general size limitations of Amazon Redshift. The total size of a document cannot exceed 16MB, and 64 KB is the limit for VARCHAR strings.

For retail businesses with large JSON documents, to deal with these limitations, consider:

  • Extracting critical fields: Use generated columns to extract frequently accessed JSON fields into dedicated columns, such as customer tier.
  • Amazon Redshift Spectrum: Store large JSON datasets in Amazon S3 and query them using Amazon Redshift Spectrum. Note that the size limit of 64 KB still applies to the external table definition, but you could potentially, pre-process the large objects into more compact formats and store them on Amazon S3. This approach is scalable and avoids storing extremely large, non-relational data within your analytical database, improving performance and stability.
  • Hybrid approach: Store frequently accessed data in Amazon Redshift tables and less frequently accessed data in Amazon S3, querying both with federated queries. In this case, you join external data using spectrum together with local data in Amazon Redshift.

Is Amazon Redshift right for your JSON workloads?

As powerful Amazon Redshift is, it is not always the right fit for your JSON workloads. In this section, we discuss when Amazon Redshift is and when it is not a right fit.

Amazon Redshift is ideal for JSON workloads when:

  • Analytical processing is primary: Your main use case involves complex analytical queries, such as customer segmentation, product affinity analysis, or sales trend identification.
  • Mixed data types: You need to analyze JSON alongside traditional structured data, such as combining customer interaction data with inventory and sales records.
  • Data warehouse integration: Your JSON data needs to be part of a broader data warehousing strategy, integrating customer behavior with enterprise data.
  • BI tool integration: You need to connect business intelligence tools to analyze JSON data for executive dashboards and reports.
  • Cost-effective analytics at scale: You require cost-effective analysis of large volumes of JSON data, such as years of customer interactions or Internet of Things (IoT) sensor readings from multiple store locations.

Consider alternatives when:

  • Operational workloads dominate: If you primarily need single-document lookups or updates, such as retrieving or updating individual customer profiles, consider Amazon DynamoDB or Amazon DocumentDB.
  • Advanced search is critical: If advanced text search, including full-text and semantic search, is your primary requirement (such as for searching product reviews or customer feedback), Amazon OpenSearch Service might be more appropriate.
  • Document size exceeds limits: If you are generally dealing with very large documents with total size exceeding 16MB or documents with VARCHAR strings exceeding 64KB in size.

Conclusion

Amazon Redshift offers a powerful platform for retail JSON analytics that combines the flexibility of semi-structured data with the performance of a columnar data warehouse. By using the SUPER datatype and PartiQL, retail organizations can analyze complex customer interaction data at scale without sacrificing SQL compatibility or analytical capabilities.

As data continues to grow in volume and complexity, the ability to bridge the gap between structured and semi-structured data analysis becomes increasingly valuable. By using Amazon Redshift to remove the traditional boundary between these data types, organizations can consolidate their analytical workloads on a single platform, reducing complexity, improving performance, and accelerating time to insight for business-critical decisions.

Try Amazon Redshift’s SUPER datatype and leave your comments here.


About the authors

Ezat Karimi

Ezat Karimi

Ezat is a Senior Solutions Architect at AWS, based in Austin, TX. Ezat specializes in designing and delivering modernization solutions and strategies for database applications. Working closely with multiple AWS teams, Ezat helps customers migrate their database workloads to the AWS cloud.