AWS Startups Blog
Building Managed Services: Architecting Ahana Cloud for Presto with the In-VPC Deployment Model
Guest post by James Mesney, Solutions Engineer, Ahana & Gary Stafford, Solutions Architect, AWS
Ahana is the startup that provides the first cloud-native managed service for Presto, the fast-growing, open source distributed SQL engine. Backed by GV (formerly known as Google Ventures) and Lux Ventures, the Ahana team includes experts in Presto, AWS, and big data. This blog post discusses how AWS users’ big data requirements have evolved and how we architected our managed service offering, highlighting the best practice of providing an “In-VPC” deployment. We hope other infrastructure software startups can benefit from some of the key learnings that led to the launch of Ahana Cloud for Presto on AWS.
For some more background, Presto is an open source system for federated data analytics. Federation means the system can map multiple data stores. It enables users to access data where it lives in a wide variety of sources via federated plug-in connectors without moving or copying the data. Presto was originally developed by Facebook. Today, it’s deployed in large-scale production at some of the world’s most data-driven companies, including Uber and Twitter. Presto addresses the business need of leveraging all data within an organization to generate insights and drive decision-making faster than ever before. Presto also leads in delivering on the technology trends of today: disaggregation of storage and compute, resulting in the rise of Amazon S3-based data lakes and on-demand cloud computing. You can learn more on the AWS Presto page.
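To make federation concrete, here is a minimal sketch of a query that joins tables from two different sources in one statement. It relies on Presto's `catalog.schema.table` naming. The catalog, schema, table, and column names (`hive`, `sales`, `orders`, `mysql`, `crm`, `customers`, and so on) are hypothetical and chosen only for illustration.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a Presto-style fully qualified table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def federated_join_sql() -> str:
    """Build a SQL statement that joins tables living in two different catalogs."""
    # Hypothetical catalogs: "hive" over an S3 data lake, "mysql" over Amazon RDS.
    orders = qualified("hive", "sales", "orders")
    customers = qualified("mysql", "crm", "customers")
    return (
        "SELECT c.region, SUM(o.amount) AS revenue\n"
        f"FROM {orders} AS o\n"
        f"JOIN {customers} AS c ON o.customer_id = c.id\n"
        "GROUP BY c.region"
    )


print(federated_join_sql())
```

Neither table is copied or moved. The engine reads each one through its connector at query time.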
While the SQL engine is the main component of an interactive ad hoc analytics system, the other components, such as the metadata catalog, the data sources, and the visualization tools or notebooks, require integration. Deploying and managing complex software in AWS can be challenging. Presto administrators must set up, configure, and connect a data store for Presto’s metadata, typically Apache Hive or AWS Glue. They must also create and configure connectors to access their data sources and then configure a catalog entry for each data source. Presto requires admins to manage many properties files to achieve this, which is both laborious and error-prone.
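As a rough illustration of the manual work involved, the sketch below renders the kind of per-catalog properties files a self-managed Presto deployment keeps under `etc/catalog/`. The property names follow the Presto Hive and MySQL connector documentation as we understand it, so verify them against your Presto version. The hostnames, user name, and password are placeholders, and the helper functions are hypothetical.

```python
from typing import Dict


def render_properties(props: Dict[str, str]) -> str:
    """Render key=value lines in the format of a Presto catalog properties file."""
    return "".join(f"{key}={value}\n" for key, value in props.items())


def hive_catalog(metastore_uri: str) -> Dict[str, str]:
    # Hive connector pointed at a Thrift Hive Metastore.
    return {
        "connector.name": "hive-hadoop2",
        "hive.metastore.uri": metastore_uri,
    }


def glue_catalog() -> Dict[str, str]:
    # Hive connector that uses AWS Glue as its metastore instead of Thrift.
    return {
        "connector.name": "hive-hadoop2",
        "hive.metastore": "glue",
    }


def mysql_catalog(host: str, user: str, password: str, port: int = 3306) -> Dict[str, str]:
    # MySQL connector, for example against Amazon RDS for MySQL.
    return {
        "connector.name": "mysql",
        "connection-url": f"jdbc:mysql://{host}:{port}",
        "connection-user": user,
        "connection-password": password,
    }


# Placeholder endpoints and credentials, for illustration only.
catalogs = {
    "hive.properties": hive_catalog("thrift://metastore.example.internal:9083"),
    "glue.properties": glue_catalog(),
    "mysql.properties": mysql_catalog("mysql.example.internal", "presto", "change-me"),
}

for filename, props in catalogs.items():
    print(f"# etc/catalog/{filename}")
    print(render_properties(props))
```

Each data source needs its own file, which is why hand-maintained configuration grows quickly as sources are added.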
Ahana Cloud for Presto addresses these complexities and more with an easy-to-use cloud-native managed service. In 60 minutes or less, Ahana allows users to build an end-to-end deployment: multiple clusters of Presto, their Glue or Hive metadata catalogs, their AWS data sources, and user-facing tooling. Customers get the power of Presto with the capabilities of AWS for faster, more iterative, and interactive data discovery—without the complexity. Analysts, data engineers, and data scientists enjoy the freedom to rapidly use any data in the organization and do so in a more self-service way. Additionally, AWS customers can procure services the way they’re used to—quickly and easily—on an hourly pay-as-you-go (PAYGO) listing on AWS Marketplace, simply billed to their AWS accounts.
The “In-VPC” Deployment Approach that Data-Driven Customers Want
As cloud service adoption has grown, the way companies store and analyze their data has evolved. Early adopters focused on innovation: building and deploying applications quickly with AWS and other public cloud providers. Most mission-critical data was still produced and analyzed in data centers, mainly because of concerns about control of the data, such as where it could be copied, how it could be used, and who could access it. Now that cloud adoption has become mainstream, we see companies with the majority of their data both created and stored in the cloud, especially in cost-efficient Amazon S3-based data lakes. The concerns about how and where data is sent, how it is used, and who can access it have moved to the cloud along with the data. Users do not want to lose control of their data, and they prefer not to ingest it into other environments. They want data to remain in their own Virtual Private Cloud (VPC).
A new cloud-native architecture model has emerged for data-focused managed services like Ahana. We call it the “In-VPC” deployment model, separating the control plane from the compute and data planes.
The Role of the Control Plane
Ahana Cloud has two major “planes”: the control plane, which is delivered as SaaS, and the compute plane, where Presto clusters run, which is delivered as a managed service. The Ahana control plane, just as it sounds, oversees, orchestrates, and manages the rest of the environment. The control plane runs in its own VPC in the Ahana account, separate from the VPC in the customer account, where the compute plane and data live. This makes management much easier without customers having to share control of their data with Ahana. This is important because users want their data to remain in their own VPC and not be ingested into any other environment, as some first-generation cloud data warehouse services require. In fact, the Ahana control plane running in the Ahana VPC never sees any of the customer’s data; it is totally separate from the customer’s “In-VPC” compute plane deployment.
Integrated Metastore
For further ease of use, Ahana pre-integrates an Apache Hive metastore/catalog, which is created automatically, so it’s not essential to set up other components like AWS Glue. Users who already have a metastore, including AWS Glue, can use that instead if they prefer.
Connectors Included
In terms of connectors, Ahana initially ships with support for AWS data services such as Amazon S3 and Amazon RDS for MySQL and PostgreSQL, among others. More connectors for sources like MongoDB and Amazon Redshift will follow soon. Ahana automates the creation of connections and catalogs, removes the need to juggle configuration files, and eliminates the need for Presto restarts. Catalogs can be created once and used by multiple clusters.
In the diagram, there are two core components, both created and managed by Ahana Cloud:
1. The Ahana control plane (top) and its UI orchestrate the Presto environment. Consolidated application logging, query logging, and monitoring give users full and easy management and control. The control plane also provides security and access controls, pay-as-you-go hourly billing, and support.
- The control plane runs in the Ahana AWS account, external to the user’s environment.
- Ahana and its employees have no access to the user’s data.
- It is multi-tenant to scale with customer accounts.
- The control plane supports SSO with Amazon Cognito, LDAP authentication, and SQL-based authorization for Presto (RBAC). In the future, there will be Apache Ranger support.
2. The Ahana compute plane (bottom) runs in each user’s VPC, deployed as a single-tenant environment within the user’s account. The control plane first creates a dedicated VPC for the compute plane. It then deploys Amazon EKS to provide a highly elastic, highly available environment for Presto clusters. Once the control plane completes the initial setup of the compute plane, users can create and manage any number of Presto clusters, which are then provisioned into the compute plane in Amazon EKS.
- The compute plane, and the user data it interacts with, runs in the user’s account.
- Each cluster is created in an individual node group to utilize the most advanced autoscaling and high-availability capabilities EKS provides.
- Each Presto cluster comes pre-integrated with a Hive Metastore to store metadata for schemas and tables generated via Presto and an Amazon S3 data lake where data inserted into tables gets stored.
- In addition to the pre-integrated catalog and Amazon S3 bucket, users can attach external Hive Metastores or AWS Glue catalogs pre-populated with metadata for structured data stored in Amazon S3 and databases running on Amazon RDS for MySQL or PostgreSQL.
This separation of the control, compute, and data planes is enabled by Amazon’s recommended approach of cross-account access with an external ID, a mechanism that uses trusted secure token exchange. Users simply update their policy to include the Ahana ARNs (Amazon Resource Names). The In-VPC deployment approach offers users greater security and cleaner management. For further details, we recommend this AWS blog post on architecting successful SaaS services.
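For readers unfamiliar with the external ID mechanism, the following is a minimal sketch of the general shape of an IAM role trust policy that allows a principal in another account to assume the role only when it presents an agreed external ID. This is the standard AWS pattern, not Ahana's actual policy. The account ID, principal ARN, and external ID below are placeholders, and `build_trust_policy` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def build_trust_policy(trusted_principal_arn: str, external_id: str) -> Dict[str, Any]:
    """Build a role trust policy requiring an external ID for cross-account access."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The principal in the other (vendor) account allowed to assume this role.
                "Principal": {"AWS": trusted_principal_arn},
                "Action": "sts:AssumeRole",
                # The assume-role call succeeds only if it supplies this external ID.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values, for illustration only.
policy = build_trust_policy("arn:aws:iam::111122223333:root", "example-external-id")
print(json.dumps(policy, indent=2))
```

The external ID guards against the "confused deputy" problem: another party that knows the role's ARN still cannot get the service to assume the role on its behalf, because it cannot supply the matching ID.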
Summary
Ahana Cloud for Presto is the fully managed, end-to-end environment for Presto. It gives users an interactive multi-cluster UI with single-click cluster and data source management. It provides automatic setup, security features, and resilience features. It uses the “In-VPC” deployment model, which separates the control, compute, and data planes for customers. Finally, Ahana is procured through a simple and affordable pay-as-you-go, usage-based licensing model on AWS.