AWS Big Data Blog
Working with nested data types using Amazon Redshift Spectrum
Redshift Spectrum is a feature of Amazon Redshift that allows you to query data stored on Amazon S3 directly and supports nested data types. This post discusses which use cases can benefit from nested data types, how to use Amazon Redshift Spectrum with nested data types to achieve excellent performance and storage efficiency, and some of the limitations of nested data types.
This post uses a data set generated with dummy data. You can view its table schema. If you’d like to try the dataset, deploy a Redshift cluster, execute the DDLs there, and use the example queries from this post or build your own.
Data modeling
In many scenarios, data is generated in a hierarchy. For example, assume a customer bought several items. For analytic purposes, there are various data modeling approaches to save storage or speed up data processing. One popular approach to achieve storage efficiency is the dimensional model.
The following table shows dummy customer data.
username | name | sex | address | birthdate | ||
1 | erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | josephlyons@yahoo.com | 4/24/10 |
2 | shepherdlisa | Mark Lee | M | 754 Michelle Gateway Port Johnstad, ME 35695 | guerrerotodd@hotmail.com | 11/10/32 |
3 | palmerpaul | Jennifer Marshall | F | 869 Harrell Forges Apt. 111 East Monica, MO 01243 | heather65@hotmail.com | 3/11/07 |
4 | brettmcgee | Travis Wilson | M | 535 Lisa Flat East Andrew, ID 43332 | bellkim@gmail.com | 3/22/10 |
5 | torresdiana | Ashley Hoffman | F | 7815 Lauren Ranch Ambertown, FL 93225 | franklinjonathan@hotmail.com | 5/14/60 |
The following table contains dummy order data, which is linked to the customer table via a foreign key username.
username | transaction_date | shipping_date | items | price | |
1 | erin15 | 10/11/19 | 10/13/19 | 10 | 4794 |
2 | erin15 | 10/11/19 | 10/12/19 | 7 | 1697 |
3 | erin15 | 10/7/19 | 10/9/19 | 2 | 15 |
4 | erin15 | 10/6/19 | 10/10/19 | 5 | 1744 |
5 | erin15 | 10/5/19 | 10/10/19 | 7 | 6346 |
In the dimensional model, each customer’s information is stored only one time. There is no duplicated data, even though a customer could order multiple items at various times.
The dimensional model is optimal for storage. However, it can be challenging to process data efficiently. To get a full picture of your data, you need to join the two tables together to restore the hierarchy.
For example, to find out how many items customer Mark Lee bought and his total spending in the last three months, the query needs to join the customers and orders table. See the following code:
When there are millions of customers who might buy multiple items in each transaction, the join can be very expensive. A fast-growing dataset can be so large that you need to store it in a distributed system. To perform the join, you need to shuffle data through the network, and the cost becomes even more significant.
As storage becomes cheaper and cheaper, people are starting to use a flattened model. In this model, data is pre-joined to gain processing efficiency. The following table shows that the customer and order information is stored in one record and ready to be analyzed.
username | name | sex | address | birthdate | transaction_date | shipping_date | items | price | |
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | josephlyons@yahoo.com | 4/24/10 | 9/14/19 | 10/12/19 | 2 | 1237 |
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | josephlyons@yahoo.com | 4/24/10 | 9/16/19 | 10/9/19 | 8 | 4824 |
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | josephlyons@yahoo.com | 4/24/10 | 9/17/19 | 10/10/19 | 9 | 4392 |
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | josephlyons@yahoo.com | 4/24/10 | 9/17/19 | 10/9/19 | 3 | 1079 |
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | josephlyons@yahoo.com | 4/24/10 | 9/25/19 | 10/7/19 | 1 | 208 |
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | josephlyons@yahoo.com | 4/24/10 | 10/2/19 | 10/5/19 | 10 | 3689 |
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | josephlyons@yahoo.com | 4/24/10 | 10/5/19 | 10/10/19 | 7 | 6346 |
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | josephlyons@yahoo.com | 4/24/10 | 10/6/19 | 10/10/19 | 5 | 1744 |
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | josephlyons@yahoo.com | 4/24/10 | 10/7/19 | 10/9/19 | 2 | 15 |
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | josephlyons@yahoo.com | 4/24/10 | 10/11/19 | 10/13/19 | 10 | 4794 |
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | josephlyons@yahoo.com | 4/24/10 | 10/11/19 | 10/12/19 | 7 | 1697 |
palmerpaul | Jennifer Marshall | F | 869 Harrell Forges Apt. 111 East Monica, MO 01243 | heather65@hotmail.com | 3/11/07 | 9/14/19 | 9/22/19 | 6 | 4642 |
palmerpaul | Jennifer Marshall | F | 869 Harrell Forges Apt. 111 East Monica, MO 01243 | heather65@hotmail.com | 3/11/07 | 9/17/19 | 9/21/19 | 1 | 527 |
palmerpaul | Jennifer Marshall | F | 869 Harrell Forges Apt. 111 East Monica, MO 01243 | heather65@hotmail.com | 3/11/07 | 10/9/19 | 10/12/19 | 5 | 408 |
torresdiana | Ashley Hoffman | F | 7815 Lauren Ranch Ambertown, FL 93225 | franklinjonathan@hotmail.com | 5/14/60 | 9/17/19 | 9/28/19 | 9 | 5452 |
This model also works well on a distributed system. Because each row contains complete information, you can process it on any node, and don’t need to shuffle data. You can also use the columnar format to store data, which allows the query engine to read only the needed columns instead of the whole row. This technique improves analytics performance and is storage efficient.
Both models have their pros and cons. The dimensional model trades compute power for storage efficiency, and the flattened model trades storage for processing efficiency.
Some new data types are available that achieve the best of both. Instead of putting child records into another table, you can nest them into the parent record and get the full information without performing a join. It effectively denormalizes the data without duplicating the parent record.
The following diagram illustrates this workflow.
You can apply this model to a schemaful hierarchy dataset. Continuing with the customer and order example, although a customer might buy multiple items, each order item contains the same type of information, such as product ID, price, and vendor.
The hierarchy is clear and consistent. You can map data to a nested structured schema, which you can store and access efficiently via SQL language.
The following table is a nested data presentation of the previous example.
username | name | sex | address | birthdate | transaction_date | shipping_date | items | price | |
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | josephlyons@yahoo.com | 4/24/10 | 9/14/19 | 10/12/19 | 2 | 1237 |
9/16/19 | 10/9/19 | 8 | 4824 | ||||||
9/17/19 | 10/10/19 | 9 | 4392 | ||||||
9/17/19 | 10/9/19 | 3 | 1079 | ||||||
9/25/19 | 10/7/19 | 1 | 208 | ||||||
10/2/19 | 10/5/19 | 10 | 3689 | ||||||
10/5/19 | 10/10/19 | 7 | 6346 | ||||||
10/6/19 | 10/10/19 | 5 | 1744 | ||||||
10/7/19 | 10/9/19 | 2 | 15 | ||||||
10/11/19 | 10/13/19 | 10 | 4794 | ||||||
10/11/19 | 10/12/19 | 7 | 1697 | ||||||
palmerpaul | Jennifer Marshall | F | 869 Harrell Forges Apt. 111 East Monica, MO 01243 | heather65@hotmail.com | 3/11/07 | 9/14/19 | 9/22/19 | 6 | 4642 |
9/17/19 | 9/21/19 | 1 | 527 | ||||||
10/9/19 | 10/12/19 | 5 | 408 | ||||||
torresdiana | Ashley Hoffman | F | 7815 Lauren Ranch Ambertown, FL 93225 | franklinjonathan@hotmail.com | 5/14/60 | 9/17/19 | 9/28/19 | 9 | 5452 |
The following graph compares the storage usage for the three models (all in parquet format).
The graph shows that nested structure is as storage efficient as the dimensional model.
Using nested data types
Nested data types are structured data types for some common data patterns. Nested data types support structs, arrays, and maps.
A struct is similar to a relational table. It groups object properties together. For example, if a customer profile contains their name, address, email, and birthdate, it appears as the following schema:
The data appears as the following code:
An array stores one-to-many relationships. For example, a customer may have multiple shipping addresses or phone numbers. If a customer has several phone numbers, it appears as the following schema:
The data appears as the following code:
[‘555-5555’, ‘555-1234’]
A map is a collection of key-value pairs. You can consider it as a list of struct<key, value> elements. For example, if a customer has particular reward preferences, it appears as the following schema:
The data appears as the following code:
Nested data could have another nested data type as a member. The most common one is an array of structs. For example, an order containing multiple items could appear as the following schema:
You can create a complex object by combining them. For example, a customer’s online transaction appears as the following schema:
Popular query engines such as Hive, Spark, Presto, and Redshift Spectrum support nested data types. The SQL syntax those engines support can be different. To make it straightforward and consistent, all query examples in this post use Amazon Redshift Spectrum. For more information, see Tutorial: Querying Nested Data with Amazon Redshift Spectrum.
Use cases for nested data types
Nested data types have many benefits: simplify your ETL, data modeling, and achieve the good performance. The following are some common use cases that can benefit from nested data types.
Parent-child relationship
Nested data types keep the parent-child (summary-details) relationship by storing them collocated. This often matches how you want to analyze the data. For example, to analyze customers’ purchasing habits, you may need to find the following:
- Customers who purchase often but buy only a few items each time. They likely want an annual membership that covers the shipping cost.
- Customers who purchase less frequently but buy many items in one transaction. They likely expect a free shipping benefit or discount.
You need support information from the orders data, such as how many items, on average, a customer buys per transaction.
To find a list of customers who order online at least once per week, with fewer than four items each time, use the following code:
With the nested order details, per item information is already grouped by customer per transaction. Children aggregation is straightforward; you can aggregate order details to categorize a customer. If you use a denormalized table, you have to do GROUP BY two times. The query could also take longer. See the following code:
To find customers who order only once per quarter with at least 10 items and high total spending, use the following code:
Another benefit of using nested data types for parent-child data analysis is resource usage reduction. If there are one million customer transactions, there could be over five times the item orders. For example, to find each day how many goods ship to Michigan, use the following code:
Assuming that 3% of customers ship orders to Michigan, after filtering the customer data, there could be approximately 3% of matching transactions. You only need to process 150 thousand item orders instead of 5 million. This greatly reduces the data to process and the resources to use when compared to a flattened model.
For the parent-child use case, nested data types provide straightforward aggregation on children, more efficient filtering, group by, windowing, and storage saving.
Many-to-many relationship
Customers could buy many items from various vendors, and a vendor could sell a product to many customers. This is a many-to-many relationship.
In a dimensional model, you need three tables: a customers
table, an orders
table, and a transactions
table. To find the top vendors who have the most customers, you need to join the three tables. See the following code:
With nested data types, the query is similar to the one using the dimensional model. However, because the orders data is collocated with customer transactions, you can join them on-the-fly without paying the cost. See the following code:
As another example, your vendor, Smith PLC, had a big sale event on October 10, 2019. You want to find out which customers bought your product during this sale and the top customers who spent the most. To do so, use the following code:
Compared to the dimensional model query, the nested model is two-to-three times faster. This is on a relatively small dataset with only a few million rows. For a larger dataset, the performance improvement is even greater, and with less resource usage.
Sparse and frequently changed data
Assume that you want to reward customers who order from your online store. For each transaction, the customer can choose one or more rewards, such as free shipping, one-day delivery, a discount, or a coupon. Depending on how effective a reward is, you have to frequently modify the reward types, add new ones, or remove ones that aren’t popular.
If you store the data in a flattened model, there are two common options to track this data. The first method is creating a table with one column for each type of reward. You have to think of all possible rewards at the outset and create those columns. This could lead to a wide table and very sparse data. Alternatively, you can modify your table schema when you want to add or remove a reward type. That adds more maintenance work and you may lose history data. The following table demonstrates this method (all transaction_id data in below table examples are faked one).
transaction_id | free_shipping | one_day_delivery | discount | coupon |
pklein35966659391853535 | FALSE | TRUE | TRUE | |
rebeccawiliams228880139768961 | FALSE | TRUE | ||
brooke39180013629693040 | TRUE | FALSE | TRUE | TRUE |
jchapman4283556333561927 | FALSE | TRUE | FALSE | FALSE |
mariamartin3515336516983566 | FALSE | FALSE | TRUE |
The second option is storing one reward per row. This avoids the wide table issue and the burden of constantly updating the schema. The approach is suitable if you only need to analyze a single reward. If you want to see whether there is any correlation between rewards, such as if more customers prefer free shipping and one-day delivery more than a discount and coupon, this option is more complicated. This model also needs more storage. The following table demonstrates this method.
transaction_id | rewordtype | value |
pklein35966659391853535 | free_shipping | FALSE |
pklein35966659391853535 | one_day_delivery | TRUE |
pklein35966659391853535 | coupon | TRUE |
rebeccawiliams228880139768961 | one_day_delivery | FALSE |
rebeccawiliams228880139768961 | coupon | TRUE |
brooke39180013629693040 | free_shipping | TRUE |
brooke39180013629693040 | one_day_delivery | FALSE |
brooke39180013629693040 | discount | TRUE |
brooke39180013629693040 | coupon | TRUE |
A compromise is to use a JSON string to store selected rewards together in one column, which avoids schema change. See the following code:
The following table shows how the data is stored in JSON string:
transaction_id | preference |
pklein35966659391853535 | {“coupon”:true, “free_shipping”:false,”one_day_delivery”:true} |
rebeccawiliams228880139768961 | {“coupon”:true, one_day_delivery”:false} |
brooke39180013629693040 | {“coupon”:true, “discount”:true, “free_shipping”:true,”one_day_delivery”:false} |
jchapman4283556333561927 | {“coupon”:false, “discount”:false, “free_shipping”:false, “one_day_delivery”:true} |
mariamartin3515336516983566 | {“discount”:true, “free_shipping”:false,”one_day_delivery”:false} |
You can analyze it by using a JSON function to extract the reward data. See the following code:
This solution is acceptable, but you could be more storage efficient and more performant by using the nested data type map. See the following code:
The following table shows how the data is stored in map:
transaction_id | preference |
pklein35966659391853535 | {coupon=true, free_shipping=false,one_day_delivery=true} |
rebeccawiliams228880139768961 | {coupon=true, one_day_delivery=false} |
brooke39180013629693040 | {coupon=true, discount=true, free_shipping=true,one_day_delivery=false} |
jchapman4283556333561927 | {coupon=false, discount=false, free_shipping=false, one_day_delivery=true} |
mariamartin3515336516983566 | {discount=true, free_shipping=false,one_day_delivery=false} |
You can analyze a single reward or multiple rewards using SQL. For example, to find how many customers prefer free shipping, use the following code:
To find how many customers prefer free shipping and one-day delivery more than a coupon or discount, use the following code:
The map type allows you to add any key-value pair. You can add a new reward type at any time without a schema change, and you can analyze the new reward right away.
The main advantage of the map type is that it supports flexible schema and eliminates the need to update the schema frequently. However, there is not much performance benefit. If performance is your top priority, a flattened table is recommended. You can also flatten the most-often accessed columns, and use map for the less frequently accessed columns.
Limitations of nested data types
Although nested data types are useful in many use cases, they have the following limitations:
- There is a hard limit on children size.
- You can only append, and updating data is difficult and slow. You need to rewrite the entire nested object even if you want to modify one child attribute.
- Processing is split at the parent record level. You may run into problems if the children data is heavily skewed.
- The query engine may not support all types of analytics on nested data.
- Amazon Redshift Spectrum Nested Data Limitations.
Summary
This post discussed the benefits of nested data types and use cases in which nested data types can help improve storage efficiency, performance, or simplify analysis. There are many more use cases in which nested data types can be an ideal solution. Try it out and share your experiences!
About the Author
Juan Yu is a Data Warehouse Specialist Solutions Architect at AWS.