AWS Big Data Blog
Narrativ is helping producers monetize their digital content with Amazon Redshift
Narrativ, in their own words: Narrativ is building monetization technology for the next generation of digital content producers. Our product portfolio includes a real-time bidding platform and visual storytelling tools that together generate millions of dollars of advertiser value and billions of data points each month.
At Narrativ, we have seen massive growth in our platform usage with a similar order-of-magnitude increase in data generated by our products over the past 15 months. In this blog post, we share our evolution to a robust, scalable, performant, and cost-effective analytics environment with AWS. We also discuss the best practices that we learned along the way in data warehousing and data lake analytics.
Anticipating Narrativ’s continued growth acceleration, we began planning late last year for the next order of magnitude. We have been using Amazon Redshift as our data warehouse, which has served us very well. As our data continued to grow, we utilized Amazon S3 as a data lake and used external tables in Amazon Redshift Spectrum to query data directly in S3. We were excited that this allowed us to easily scale storage and compute resources separately to meet our needs without trading off against cost or complexity.
In the process, we created Spectrify, which simplifies working with Redshift Spectrum and encapsulates a number of best practices. Spectrify accomplishes three things in one easy command. First, it exports an Amazon Redshift table to S3 in comma-separated value (CSV) format. Second, it converts exported CSVs to Apache Parquet files in parallel. Third, it creates the external tables in the Amazon Redshift cluster. Queries can now span massive volumes of Parquet data in S3 with data in Amazon Redshift, and return results in just moments.
The preceding diagram shows how Spectrify simplifies querying across Parquet data in Amazon S3 and data in Amazon Redshift, returning results in just moments.
Amazon Redshift at Narrativ
Amazon Redshift has been the foundation of our data warehouse since Narrativ’s inception. We use it to produce traffic statistics, to feed the data-driven algorithms in our programmatic systems, and as a window to explore data insights. Amazon Redshift has scaled well to support the company’s needs. Our cluster has grown from 3 nodes to 36 nodes, and queries per hour have increased significantly without losing performance. As our workload moved from statistics and accounting toward data-driven insights, Amazon Redshift kept pace with increasing query complexity.
When planning the next iteration of our data strategy, our requirements were as follows:
- Be maintainable by a small engineering team (mostly hands-off)
- Be able to query all historic data at any time
- Be able to query the previous three months of data very quickly
- Maintain performance at 10x current data volume
We considered a few ideas to meet these requirements:
- Upgrade to a DC2 cluster and a DS2 cluster
- Upgrade to a very large DC2 cluster
- Move to an open-source alternative (difficult to maintain)
- Move to another big data provider (high barrier to entry, costly, risky)
Then we discovered Redshift Spectrum, which changes the game by separating compute costs from storage costs. With Redshift Spectrum, we can offload older, infrequently used data to S3 while maintaining a modestly sized, yet fast, Amazon Redshift cluster. Compared to other solutions, Redshift Spectrum reduces complexity by allowing us to join to the same set of dimension tables for all of our queries. And finally, it even preserves, if not improves, performance (depending on the query). All of our requirements were met with Redshift Spectrum, and it was easy, fast, and cost-effective to implement.
Moving forward, we view Redshift Spectrum as a critical tool to scaling our data capabilities on AWS to the next order of magnitude. Redshift Spectrum allows us to define Amazon Redshift tables backed by data stored in Amazon S3. In this way, Redshift Spectrum turns Amazon Redshift into a hybrid data warehouse/data lake. It provides the best of both worlds—unlimited storage separated from compute costs, plus the performance and converged workflow of Amazon Redshift itself.
Unlocking Redshift Spectrum
Using best practices can provide massive performance improvements in Redshift Spectrum. We shared our Spectrify implementation best practices in an open-source Python library in GitHub to help others reap the same rewards that we have. Following are further details on our top suggestions for optimizing Redshift Spectrum, which you will find incorporated into Spectrify.
Use Apache Parquet
Apache Parquet is a columnar file format. When you store your data in Parquet, Redshift Spectrum can scan only the column data relevant to your query. In contrast, row-oriented formats like CSV or JSON require a scan of every column, regardless of the query. For Narrativ’s workload, using Parquet led to large performance gains and drastically lower bills.
Columnar formats also enable more efficient data compression. Data compresses better when similar data is grouped together. In this case, we saw about an 80 percent reduction in gzipped Parquet files vs. gzipped CSVs. This reduction means data can be ingested, and thus scanned, much faster.
When we first investigated converting our data to Parquet, we found no options that fit our needs. We don’t use Spark or any other project from the Hadoop ecosystem, which is the most common toolset for creating Parquet files. A few Python projects had write support for Parquet. However, we had reservations about data consistency caveats, especially in relation to nullable columns. Additionally, we didn’t find a library that supported the correct timestamp format.
We built Spectrify to solve all these problems. We worked with the Apache Arrow project to ensure full compatibility between Redshift Spectrum and Parquet files generated by Spectrify.
Partition your data
Narrativ stores timestamped event data in Redshift Spectrum. We partition our data on event creation date, allowing Redshift Spectrum to skip inspecting most of our data files for many of our queries. Generally, our queries are concerned with a small subset of the time domain, such as a month or even just a day (for example, answering a specific question from Black Friday 2017). Queries are rarely concerned with to the entire years-long history.
This drastically reduces the amount of data scanned, and we see most queries complete in one-tenth the time. Also, our bill is much lower. Partitioning alone reduced our average cost per query by a factor of 20.
As we prepared for an order-of-magnitude increase in data volumes, we turned to a data lake strategy built on Amazon S3 using Amazon Redshift and Redshift Spectrum for analytics. Redshift Spectrum enabled us to query data across S3 and Amazon Redshift, maintain or improve query performance, and do it easily and cost-effectively. Along the way, we learned best practices and built Spectrify to streamline data preparation and analytics even further. We hope that sharing our code examples and what we learned will also help you achieve similar results with your growing data.
For more information, see our Spectrify Usage documentation and code samples, and our full example on GitHub.
If you found this post useful, be sure to check out Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production and How I built a data warehouse using Amazon Redshift and AWS services in record time.
About the Author
Colin Nichols leads big data initiatives at Narrativ. He enjoys working at the intersection of systems, and has a strong record of shaping engineering culture around the tenets of quality, automation, and continuous learning. In his spare time, Colin enjoys teaching his dog, MacGyver, new tricks.