Presto is an open-source distributed SQL query engine optimized for low-latency, ad hoc analysis of data. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can process data from multiple data sources including the Hadoop Distributed File System (HDFS) and Amazon S3. Presto has two community projects: PrestoDB and PrestoSQL. Amazon EMR supports both projects.
You can quickly and easily create managed Presto clusters from the AWS Management Console, AWS CLI, or the Amazon EMR API. You can also use other Amazon EMR features, including fast Amazon S3 connectivity, integration with Amazon EC2 Spot instances, a wide choice of Amazon EC2 instance types (including memory optimized instances), and resize commands to easily add or remove instances from your cluster.
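As a rough illustration of the API route, the sketch below builds a cluster request that installs Presto and places task nodes on Spot capacity. It is a minimal sketch, not a recommended configuration: the helper name `presto_cluster_request`, the cluster name, the release label, the instance types, the counts, and the role names are all assumptions chosen for the example. The dictionary is shaped for the `run_job_flow` operation of the EMR API as exposed by boto3, but the block itself makes no AWS call.

```python
# Sketch: build the request for an EMR cluster with Presto installed.
# Every concrete value here (names, instance types, counts, role names)
# is an illustrative assumption, not a recommendation.

def presto_cluster_request(name, release_label, core_count=2, spot_task_count=2):
    """Return keyword arguments shaped for the EMR run_job_flow operation."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,  # caller supplies a release that ships Presto
        "Applications": [{"Name": "Presto"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "r5.xlarge", "InstanceCount": core_count},
                # Task nodes hold no HDFS data, so they are a common place
                # to use Spot capacity.
                {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
                 "InstanceType": "r5.xlarge", "InstanceCount": spot_task_count},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for ad hoc queries
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }

# "emr-5.36.0" is a placeholder release label for this example.
request = presto_cluster_request("presto-adhoc", "emr-5.36.0")

# With boto3 installed and AWS credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request as plain data makes it easy to review or version before anything is launched.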
Features and benefits
Interactive query performance
Presto uses a custom query execution engine with operators designed to support SQL semantics. Unlike Hive/MapReduce, Presto executes queries in memory and pipelines data across the network between stages, which avoids unnecessary I/O. The pipelined execution model runs multiple stages in parallel and streams data from one stage to the next as it becomes available.
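The idea of streaming rows from one stage to the next can be sketched with Python generators. This is a single-process analogy, not Presto's engine: it shows only that each row flows through every stage as soon as it is produced, with no intermediate result written out between stages. It does not show parallel execution or network transfer. The stage names (`scan`, `filter_stage`, `project`) and the sample rows are made up for the example.

```python
# Sketch: a pipelined query plan modeled as chained generators.
# Each stage pulls one row at a time from the stage before it.

def scan(rows):
    """Source stage: yield rows one at a time instead of materializing them."""
    for row in rows:
        yield row

def filter_stage(rows, predicate):
    """Pass through only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def project(rows, columns):
    """Keep only the requested columns of each row."""
    for row in rows:
        yield {column: row[column] for column in columns}

events = [
    {"user": "a", "bytes": 120, "status": 200},
    {"user": "b", "bytes": 300, "status": 500},
    {"user": "c", "bytes": 80, "status": 200},
]

# Roughly: SELECT user, bytes FROM events WHERE status = 200
pipeline = project(
    filter_stage(scan(events), lambda row: row["status"] == 200),
    ["user", "bytes"],
)

# Nothing runs until the final consumer asks for rows.
result = list(pipeline)
```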
Ease of use
You can launch an Amazon EMR cluster running Presto in minutes. You don’t need to worry about node provisioning, cluster setup, configuration, or cluster tuning. Amazon EMR takes care of these tasks so you can focus on analysis. You can also use tools such as Airpal, a web-based query execution tool open-sourced by Airbnb. Airpal’s user interface simplifies data exploration and ad hoc analysis and supports features such as syntax highlighting, the ability to export results to CSV, saving queries for later use, and the ability to explore tables to visualize schema.
Integration with Amazon EMR feature set
Run interactive queries that directly access data in Amazon S3, save costs using Amazon EC2 Spot instance capacity, use EMR Managed Scaling to dynamically add and remove capacity, and launch long-running or ephemeral clusters to match your workload. You can also add other Hadoop ecosystem applications on your cluster.
ANSI SQL support
Presto supports the ANSI SQL standard, which makes it easy for data analysts and developers to query both structured and unstructured data at scale. Currently, Presto supports a wide variety of SQL functionality, including complex queries, aggregations, joins, and window functions.
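To make the listed SQL features concrete, the sketch below combines a join, an aggregation, and a window function in one query. It runs against Python's built-in `sqlite3` module as a local stand-in, not against Presto, and it assumes the bundled SQLite is version 3.25 or later (the first with window functions). The `users` and `orders` tables and their contents are invented for the example. The SELECT statement uses only standard constructs, so a similar query should be expressible in Presto, though that is not verified here.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, country TEXT);
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    INSERT INTO users VALUES (1, 'US'), (2, 'US'), (3, 'JP');
    INSERT INTO orders VALUES (1, 50), (1, 30), (2, 20), (3, 70);
""")

# Join + aggregation in the subquery, window function in the outer query:
# rank each user by total spend within their country.
query = """
    SELECT country, user_id, total,
           RANK() OVER (PARTITION BY country ORDER BY total DESC) AS rank_in_country
    FROM (
        SELECT u.country AS country, u.user_id AS user_id, SUM(o.amount) AS total
        FROM users u
        JOIN orders o ON o.user_id = u.user_id
        GROUP BY u.country, u.user_id
    ) AS per_user
    ORDER BY country, rank_in_country
"""

rows = conn.execute(query).fetchall()
for country, user_id, total, rank_in_country in rows:
    print(country, user_id, total, rank_in_country)
```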
Netflix has chosen Presto as their interactive, ANSI-SQL compliant query engine for big data. Presto scales well, is open source, and integrates with the Hive Metastore and Amazon S3, the backbone of Netflix’s big data warehouse environment. Netflix runs Presto on persistent Amazon EMR clusters to quickly and flexibly query across their ~25 PB Amazon S3 data store. Netflix is an active contributor to Presto, and Amazon EMR provides Netflix with the flexibility to run their own build of Presto on Amazon EMR clusters. On average, Netflix runs ~3,500 queries per day on their Presto clusters.
Jampp is a mobile application marketing platform that uses advanced advertising retargeting techniques to drive engaged users to applications. Jampp achieves this by buying mobile media inventory via its own conversion-driven real-time bidding (RTB) engine, which dynamically bids on inventory across 18 RTB exchanges and over 150 mobile ad networks. Jampp leverages Presto running on Amazon EMR for advanced ad hoc log analysis, combining data from multiple sources and complex retargeting segment calculations. As Jampp's user base grew by 600%, so did the demand for complex analytical queries. Jampp moved from running a complex multi-core Python application on MySQL to running Presto, resulting in a 12x performance improvement. Jampp currently uses Presto on Amazon EMR to process 40 TB of data per day.
As a start-up incubator, Cogo Labs operates a platform for marketing analytics and business intelligence used by their portfolio companies and internal teams. To support an OLAP environment with a high rate of innovation, they standardized on SQL to interact with data. Cogo Labs chose Presto for its real-time query performance, support for ANSI SQL, and ability to process data directly from Amazon S3. Presto running on Amazon EMR allows their 100+ developers and analysts to run SQL queries on over 500 TB of data stored in Amazon S3 for data exploration, ad hoc analysis, and reporting. Cogo Labs uses a combination of short-lived and permanent clusters and relies on Amazon EMR's integration with Spot instances to lower costs.
OpenSpan provides automation and intelligence solutions that help bridge people, processes, and technology to gain insight into employee productivity, simplify transactions, and engage employees and customers. OpenSpan migrated from HBase to Presto on Amazon EMR with data in Amazon S3. OpenSpan chose Presto because of its SQL interface and ability to query data in real time directly from Amazon S3; it allowed them to quickly explore vast amounts of data and rapidly iterate on upcoming data products. OpenSpan uses the Parquet file format, and also uses PrestogreSQL to connect to Presto. OpenSpan chose Amazon EMR and Amazon S3 to cost-efficiently process the gigabytes of data they receive daily from their customers.
Kanmu is a Japanese startup in the financial services industry that provides card-linked offers based on consumers' credit card usage. Kanmu migrated from Hive to Presto on Amazon EMR because of Presto’s ability to run exploratory and iterative analytics at interactive speed, good performance with Amazon S3, and scalability to query large data sets. Kanmu uses Fluentd-plugin-s3 to push data to Amazon S3, stores data in the optimized row columnar (ORC) format, and uses shib, a Node.js-based web client, to run SQL queries.