What is Presto or PrestoDB?
Presto (or PrestoDB) is an open source, distributed SQL query engine, designed from the ground up for fast analytic queries against data of any size. It supports both non-relational sources, such as the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, and relational data sources such as MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata.
Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. You’ll find it used by many well-known companies like Facebook, Airbnb, Netflix, Atlassian, and Nasdaq.
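For a concrete sense of what this looks like in practice, here is a minimal sketch of submitting a query to a Presto coordinator from Python, assuming the presto-python-client package; the coordinator host, catalog, schema, and table name are hypothetical placeholders.

```python
# Minimal sketch: run a query against a Presto coordinator from Python.
# Assumes the presto-python-client package (pip install presto-python-client)
# and a coordinator reachable at the placeholder address below.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # placeholder coordinator host
    port=8080,
    user="analyst",
    catalog="hive",      # catalog to query, e.g. Hive tables over HDFS or S3
    schema="default",
)

cur = conn.cursor()
# The query runs where the data lives; nothing is copied into a separate system.
cur.execute("SELECT order_status, count(*) FROM orders GROUP BY order_status")
for row in cur.fetchall():
    print(row)
```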
What is the history of Presto?
Presto started as a project at Facebook to run interactive analytic queries against a 300 PB data warehouse built on large Hadoop/HDFS-based clusters. Prior to building Presto, Facebook used Apache Hive, which it created and rolled out in 2008 to bring the familiarity of SQL syntax to the Hadoop ecosystem. Hive had a significant impact on the Hadoop ecosystem, simplifying complex Java MapReduce jobs into SQL-like queries while still executing jobs at high scale. However, it was not optimized for the fast performance needed for interactive queries.
In 2012, the Facebook Data Infrastructure group built Presto, an interactive query system that could operate quickly at petabyte scale. It was rolled out company-wide in spring 2013. In November 2013, Facebook open sourced Presto under the Apache Software License and made it available for anyone to download on GitHub. Today, Presto has become a popular choice for interactive queries on Hadoop, with contributions from Facebook and many other organizations. Facebook’s implementation of Presto is used by over a thousand employees, who run more than 30,000 queries processing a petabyte of data daily.
How does Presto work?
Presto is a distributed system that runs on Hadoop and uses an architecture similar to a classic massively parallel processing (MPP) database management system. It has one coordinator node working in sync with multiple worker nodes. Users submit their SQL query to the coordinator, which uses a custom query and execution engine to parse, plan, and schedule a distributed query plan across the worker nodes. It is designed to support standard ANSI SQL semantics, including complex queries, aggregations, joins, left/right outer joins, sub-queries, window functions, distinct counts, and approximate percentiles.
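The snippet below is an illustrative query string exercising several of these ANSI SQL features (an outer join, a distinct count, a window function, and an approximate percentile); the tables and columns are hypothetical, and the string could be submitted through any Presto client such as the one sketched above.

```python
# Illustrative ANSI SQL features supported by Presto, shown as a query string.
# Table and column names (orders, customers, total_price, ...) are hypothetical.
QUERY = """
SELECT
    c.region,
    count(DISTINCT o.customer_id)                  AS customers,        -- distinct count
    approx_percentile(o.total_price, 0.95)         AS p95_order_value,  -- approximate percentile
    rank() OVER (ORDER BY sum(o.total_price) DESC) AS revenue_rank      -- window function
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id                  -- outer join
GROUP BY c.region
"""
```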
After the query is compiled, Presto processes the request into multiple stages across the worker nodes. All processing is in memory and pipelined across the network between stages, avoiding unnecessary I/O overhead. Adding more worker nodes allows for greater parallelism and faster processing.
To make Presto extensible to any data source, it was designed with a storage abstraction that makes it easy to build pluggable connectors. As a result, Presto has many connectors, both to non-relational sources such as the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, and to relational sources such as MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata. Data is queried where it is stored, without the need to move it into a separate analytics system.
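As a sketch of how a connector is plugged in, a Presto catalog is defined by a properties file under etc/catalog/ on each node; the example below assumes the MySQL connector, with placeholder host and credentials.

```
# etc/catalog/mysql.properties -- placeholder connection details
connector.name=mysql
connection-url=jdbc:mysql://mysql.example.com:3306
connection-user=presto
connection-password=secret
```

Once a catalog is registered, its tables can be referenced in queries as catalog.schema.table (for example, a hypothetical mysql.sales.orders), alongside tables from other catalogs.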
What are the differences between Presto and Hadoop?
Presto is an open source, distributed SQL query engine designed for fast, interactive queries on data in HDFS and other sources. Unlike Hadoop/HDFS, it does not have its own storage system, so Presto is complementary to Hadoop, and organizations often adopt both to solve a broader business challenge. Presto can be installed with any implementation of Hadoop, and is packaged in the Amazon EMR Hadoop distribution.
Who uses Presto?
Presto is used in production at very large scale at many well-known organizations, including Facebook, Airbnb, Netflix, Atlassian, and Nasdaq. Facebook’s implementation of Presto is used by over a thousand employees, who run more than 30,000 queries processing a petabyte of data daily. Netflix runs around 3,500 queries per day on its Presto clusters, on average. Airbnb built and open sourced Airpal, a web-based query execution tool that works on top of Presto. The broader Presto community can be found on the Presto community forum and on the Presto page on Facebook.
How can you deploy Presto in the cloud?
Presto is an ideal workload in the cloud, because the cloud provides performance, scalability, reliability, availability, and massive economies of scale. You can launch a Presto cluster in minutes. You don’t need to worry about node provisioning, cluster setup, Presto configuration, or cluster tuning.
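For example, on Amazon EMR (covered in more detail below) a Presto cluster can be requested with a single API call. The boto3 sketch below illustrates this; the release label, instance types, instance counts, and IAM role names are placeholder choices that would be adjusted for a real deployment.

```python
# Sketch: request an EMR cluster with Presto installed, using boto3.
# Release label, instance types/counts, and role names are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="presto-cluster",
    ReleaseLabel="emr-5.36.0",            # an EMR release that bundles Presto
    Applications=[{"Name": "Presto"}],    # install Presto on the cluster
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Workers", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster running for queries
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])
```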
How can AWS help build your Presto implementation in the cloud?
Amazon EMR and Amazon Athena are the best places to deploy Presto in the cloud, because they give you the integration and testing rigor of Presto combined with the scale, simplicity, and cost effectiveness of AWS. With Amazon EMR, you can launch Presto clusters in minutes without needing to do node provisioning, cluster setup, Presto configuration, or cluster tuning. EMR enables you to provision one, hundreds, or thousands of compute instances in minutes. Amazon Athena lets you deploy Presto using the AWS serverless platform, with no servers, virtual machines, or clusters to set up, manage, or tune. Simply point to your data in Amazon S3, define the schema, and start querying using the built-in query editor or your existing Business Intelligence (BI) tools. Athena automatically parallelizes your query and dynamically scales resources so queries run quickly. You pay only for the queries that you run.
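To illustrate the Athena side, the sketch below submits a query through boto3 and polls for the result; the database, table, and S3 output location are hypothetical placeholders.

```python
# Sketch: run a query on Amazon Athena via boto3 and print the result rows.
# The database, table, and S3 output bucket below are placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the query; results are written to the S3 location you choose.
start = athena.start_query_execution(
    QueryString="SELECT order_status, count(*) FROM orders GROUP BY order_status",
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = start["QueryExecutionId"]

# Poll until the serverless engine finishes the query.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```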