AWS Database Blog
Analyze JSON data efficiently with Amazon Redshift SUPER
Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that you can use to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data. Tens of thousands of customers use Amazon Redshift to process large amounts of data, modernize their data analytics workloads, and provide insights for their business users.
Amazon Redshift transforms how organizations analyze JSON data by combining the analytical power of a columnar data warehouse with robust JSON processing capabilities. By using Amazon Redshift SUPER datatype, you can efficiently store, query, and analyze complex hierarchical data alongside traditional structured data without sacrificing performance.
This post focuses on JSON features of Amazon Redshift.
Amazon Redshift architecture
The Amazon Redshift architecture is designed as a massively parallel processing (MPP) data warehouse system that efficiently handles online analytics processing (OLAP) workloads. At its core is a cluster consisting of a leader node and compute nodes. The leader node manages client communications, develops execution plans, and coordinates query processing, while compute nodes handle the actual data processing in parallel. This MPP architecture particularly benefits OLAP operations by distributing data processing across multiple nodes and executing queries in parallel. The system is optimized for complex analytical queries and reporting on large datasets, rather than traditional online transaction processing (OLTP) operations. Key features include automatic data compression, seamless integration with various AWS services like Amazon Simple Storage Service (Amazon S3) and AWS Glue for data lake analytics, concurrency scaling to handle fluctuating workloads, and a serverless option for simplified management and auto-scaling, alongside robust security measures including encryption and network isolation.
JSON document capabilities in Amazon Redshift
The Amazon Redshift SUPER datatype offers several key advantages for working with JSON data:
- Direct storage of JSON documents with complex hierarchical structures without transformation overhead
- Schema flexibility for evolving JSON documents, so you can store and query semi-structured data without requiring cumbersome database migrations for schema changes within the JSON structure
- SQL-compatible querying with PartiQL, an intuitive syntax for semi-structured data
Amazon Redshift excels at analytical processing of JSON data at scale, proving ideal for complex aggregations, joins, and analytical queries across vast JSON document volumes. To unlock its full potential and achieve optimal performance with JSON data, adherence to Amazon Redshift best practices and to SUPER datatype best practices is crucial.
Real-world example: Retail customer interaction analysis
Consider building a retail analytics system that captures and analyzes customer interactions across multiple channels. This use case demonstrates the SUPER datatype capabilities, because customer interactions generate semi-structured data with varying attributes depending on the channel, device, and action types.
First, we create a table to store customer interactions from various channels. Instead of creating dozens of columns to accommodate interaction attributes, we’ll use a single SUPER column that can store the JSON structure:
This basic schema provides vast flexibility. As the retail business evolves, adding new channels, tracking new metrics, or capturing new device attributes, you won’t need to alter the database schema. This is particularly valuable for retail analytics systems that need to adapt quickly to changing business requirements.
Let’s insert a sample customer interaction from a mobile app. This JSON document captures a rich interaction where a customer viewed a product and then added it to their cart:
This JSON structure contains valuable information for retail analytics:
- The customer’s shopping channel (mobile app)
- A sequence of actions they took (viewing a product, then adding it to cart)
- Details about the product they interacted with
- Information about their device
In a traditional relational model, this might require joins across multiple tables (interactions, actions, devices). With JSON in Amazon Redshift, all related data stays together in a single row.
Advanced JSON querying in Amazon Redshift
The SUPER datatype offers several key features for working with JSON data. To explore these features, see these examples. Let’s discuss some of these features in the context of a retail analytics platform.
Each of these examples showcase how Amazon Redshift, with the SUPER datatype, combined with PartiQL navigation, treats JSON documents as standard SQL constructs, enabling powerful, flexible analysis of JSON-driven retail events.
Count of actions by product
This query tallies the number of times each product was viewed, added to cart, and checked out across all customer sessions.
Output:
| product_id | add_to_cart | checkouts |
| P-1234 | 0 | 1 |
| P-5432 | 1 | 0 |
| P-9999 | 1 | 1 |
This query
- Uses CROSS JOIN to flatten a JSON array (c.interaction_data.actions AS action).
- Unifies relational aggregation (SUM and GROUP BY) with JSON navigation to pivot user events.
High-engagement sessions
This query identifies customers whose total product viewing time exceeds 30 seconds by summing up the duration of all view_product actions.
Output:
| customer_id | total_view_time |
| 1001 | 45 |
| 1003 | 60 |
In this query, JSON fields are used directly in aggregations (action.duration_sec).
Device model extraction
This query pulls the device model from the nested device_info object, supporting device-based segmentation.
Output:
| customer_id | device_model |
| 1001 | iphone 13 |
| 1002 | Dell XPS |
| 1003 | Pixel 5 |
| 1004 | Samsung Galaxy S21 |
In this use case, native dot-notation JSON navigation on a SUPER column (interaction_data.device_info.model) enables direct in-line extraction of nested text values without casting or parsing functions.
Filtering by operating system
This query uses a PartiQL WHERE clause to filter sessions to return only those from Android devices, enabling OS-specific analysis.
Output:
| customer_id |
| 1003 |
| 1004 |
In this query, JSON attributes participate directly in WHERE filters. You can segment data by JSON attributes as if they were relational columns.
Extracting second action details
This query retrieves the type, product ID, and quantity of the second action in each session, useful for analyzing follow-up behaviors.
Output:
| customer_id | second_action_type | second_action_product | quantity |
| 1001 | add_to_cart | P-5432 | 1 |
| 1002 | checkout | P-1234 | (null) |
| 1003 | add_to_cart | P-9999 | 2 |
In this query, PartiQL’s array indexing (AT idx) lets queries target specific JSON array elements. This gives Amazon Redshift parity with JSON document databases where array traversal is a key feature.
Pivoting customer actions
This query pivots counts of each action type into separate columns for a concise per-customer summary. It shows how to convert document-style events into relational-style summaries directly. JSON flexibility is preserved while producing structured aggregates.
Output:
| customer_id | view_product | add_to_cart | checkout | store_locator |
| 1001 | 1 | 1 | 0 | 0 |
| 1002 | 1 | 0 | 1 | 0 |
| 1003 | 1 | 1 | 1 | 0 |
| 1004 | 0 | 0 | 0 | 1 |
Mobile app view_product events
Flatten actions arrays and extract only view_product events on mobile. This example flattens the actions array and filters to show only mobile app product-view events, along with viewing durations.
Output:
| customer_id | interaction_date | product_id | duration_sec |
| 1001 | 2023-09-15 14:32:45 | P-5432 | 45 |
| 1003 | 2023-09-16 09:22:33 | P-9999 | 60 |
This query shows a multi-level JSON querying within plain SQL; complex filters combine JSON array attributes (action.type) with parent object fields (channel).
Performance optimization for JSON in Amazon Redshift
Amazon Redshift implements several optimizations for the SUPER datatype that are particularly valuable for retail analytics:
- Columnar storage: Columnar storage in Amazon Redshift significantly benefits the SUPER datatype by optimizing the storage and retrieval of semi-structured data, even though the SUPER datatype can store entire JSON documents within a single column.
- Zone maps: Zone maps can help skip blocks that don’t match query predicates on extracted values from the SUPER columns
- Parallel processing: The MPP architecture of Amazon Redshift distributes data and query processing across multiple nodes, including operations on SUPER data, enabling high-performance queries on large datasets.
- Materialized views: Create materialized views over commonly accessed JSON paths; they can pre-compute results for frequently accessed JSON paths, further enhancing performance.
For example, consider materialized view. For frequently accessed JSON paths, these views can significantly improve performance. In our retail example, we might create a view for mobile product views:
This materialized view pre-extracts commonly accessed JSON paths, converting them to a columnar format for faster query performance. Retail analysts running reports on mobile engagement will experience significantly faster query times compared to parsing the JSON structure repeatedly. Here is an example:
Output:
| customer_id | interaction_date | product_id | device_model |
| 1001 | 2023-09-15 14:32:45 | P-5432 | iPhone 13 |
| 1003 | 2023-09-16 09:22:33 | P-9999 | Pixel 5 |
Advanced use cases
In this section, we discuss how to deal with advanced JSON uses cases when working with Amazon Redshift.
Geospatial data in JSON
For retail businesses, location data is crucial for understanding customer shopping patterns. Amazon Redshift supports geospatial queries through its GEOMETRY and GEOGRAPHY data types. You can store location data inside SUPER columns and extract it for spatial operations:
Output:
| customer_id | interaction_date | channel | customer_location |
| 1004 | 2023-09-16 15:45:30 | mobile_app | POINT(-122.4194 37.7749) |
By using this capability, retail businesses can perform location-based analytics, such as identifying customers who visited physical stores, analyzing foot traffic patterns, or targeting promotions to customers in specific geographic areas.
Complex nested JSON analysis
Retail businesses often need to analyze complex customer behavior patterns across purchase history. Amazon Redshift enables sophisticated analysis of deeply nested JSON structures. Let’s add a customer and product tables and insert some data.
Output:
| customer_id | first_name | last_name | product_name | category |
| 1001 | Alice | Johnson | Red Sneakers | Shoes |
| 1002 | Bob | Smith | Wireless Mouse | Electronics |
| 1003 | Carol | Taylor | Coffee Maker | Appliances |
| 1004 | David | Brown | NULL | NULL |
This query analyzes purchasing patterns of most popular products purchased by customers over 30 years old, in the current year. The ability to navigate through nested customer information, purchase history, and item details, provides powerful insights for retail merchandising and marketing teams.
Time-series analysis on JSON
Retail businesses often need to analyze time-based patterns, such as shopping behavior during specific hours or seasonal trends. JSON data often contains time-series information that can be analyzed efficiently in Amazon Redshift. Let illustrate use case through a time-series IOT example, using the store_sensors table.
Output:
| store_id | reading_count | avg_temp |
| 1 | 3 | 69 |
| 2 | 2 | 69 |
This query analyzes store environmental conditions, identifying the average temperature. For retail businesses, maintaining optimal shopping conditions is crucial for customer satisfaction and can directly impact sales. The ability to store and analyze IoT sensor data as JSON provides valuable operational insights without complex data transformations.
Limitations and considerations
While the SUPER datatype offers powerful JSON capabilities for retail analytics, there are limitations that you need to be aware of. For more information, see the public documentation on the limitations.
The SUPER datatype inherits two general size limitations of Amazon Redshift. The total size of a document cannot exceed 16MB, and 64 KB is the limit for VARCHAR strings.
For retail businesses with large JSON documents, to deal with these limitations, consider:
- Extracting critical fields: Use generated columns to extract frequently accessed JSON fields into dedicated columns, such as customer tier.
- Amazon Redshift Spectrum: Store large JSON datasets in Amazon S3 and query them using Amazon Redshift Spectrum. Note that the size limit of 64 KB still applies to the external table definition, but you could potentially, pre-process the large objects into more compact formats and store them on Amazon S3. This approach is scalable and avoids storing extremely large, non-relational data within your analytical database, improving performance and stability.
- Hybrid approach: Store frequently accessed data in Amazon Redshift tables and less frequently accessed data in Amazon S3, querying both with federated queries. In this case, you join external data using spectrum together with local data in Amazon Redshift.
Is Amazon Redshift right for your JSON workloads?
As powerful Amazon Redshift is, it is not always the right fit for your JSON workloads. In this section, we discuss when Amazon Redshift is and when it is not a right fit.
Amazon Redshift is ideal for JSON workloads when:
- Analytical processing is primary: Your main use case involves complex analytical queries, such as customer segmentation, product affinity analysis, or sales trend identification.
- Mixed data types: You need to analyze JSON alongside traditional structured data, such as combining customer interaction data with inventory and sales records.
- Data warehouse integration: Your JSON data needs to be part of a broader data warehousing strategy, integrating customer behavior with enterprise data.
- BI tool integration: You need to connect business intelligence tools to analyze JSON data for executive dashboards and reports.
- Cost-effective analytics at scale: You require cost-effective analysis of large volumes of JSON data, such as years of customer interactions or Internet of Things (IoT) sensor readings from multiple store locations.
Consider alternatives when:
- Operational workloads dominate: If you primarily need single-document lookups or updates, such as retrieving or updating individual customer profiles, consider Amazon DynamoDB or Amazon DocumentDB.
- Advanced search is critical: If advanced text search, including full-text and semantic search, is your primary requirement (such as for searching product reviews or customer feedback), Amazon OpenSearch Service might be more appropriate.
- Document size exceeds limits: If you are generally dealing with very large documents with total size exceeding 16MB or documents with VARCHAR strings exceeding 64KB in size.
Conclusion
Amazon Redshift offers a powerful platform for retail JSON analytics that combines the flexibility of semi-structured data with the performance of a columnar data warehouse. By using the SUPER datatype and PartiQL, retail organizations can analyze complex customer interaction data at scale without sacrificing SQL compatibility or analytical capabilities.
As data continues to grow in volume and complexity, the ability to bridge the gap between structured and semi-structured data analysis becomes increasingly valuable. By using Amazon Redshift to remove the traditional boundary between these data types, organizations can consolidate their analytical workloads on a single platform, reducing complexity, improving performance, and accelerating time to insight for business-critical decisions.
Try Amazon Redshift’s SUPER datatype and leave your comments here.