Securely Querying Your Data Lake with Ahana Presto and AWS Lake Formation
By Wen Phan, Principal Product Manager – Ahana
By Ameer Elkordy, Lead Data Engineer – Metropolis
By Mert Hocanin, Principal Big Data Analyst – AWS
By Roy Hasson, Contributing Writer
AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. Customers can manage permissions to data in a single place, making it easier to enforce security across a wide range of tools and services.
Customers also use Lake Formation to securely share datasets with other AWS accounts, enabling better collaboration and self-service access to data. Ahana’s new integration with AWS Lake Formation allows customers to query their data protected with Lake Formation fine-grained permissions using Ahana Presto.
In this post, we’ll detail the integration between Ahana Cloud for Presto and AWS Lake Formation, and show how Metropolis uses Lake Formation and Ahana to build a data lake that allows their analysts and data scientists to develop a simple, hands-free parking experience for their customers.
Ahana + AWS Lake Formation Integration
In this section, we’ll detail the Ahana integration with AWS Lake Formation. At a high level, the Ahana integration does two things:
- Creates a mapping between Presto cluster users and AWS Identity and Access Management (IAM) roles associated with Lake Formation access control permissions.
Figure 1 – AWS IAM to Presto user mapping.
- Enforces Lake Formation permissions for queries executed by the IAM role.
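To make the first step concrete, here is a minimal sketch of what a user-to-role mapping could look like. It is not Ahana's implementation; the user names, account ID, role names, and the `resolve_role` helper are all hypothetical and serve only to illustrate the idea that each Presto user resolves to exactly one IAM role, whose Lake Formation permissions then govern that user's queries.

```python
# Hypothetical mapping from Presto user names to IAM role ARNs.
# The user names, account ID, and role names are placeholders.
USER_TO_ROLE = {
    "analyst_alice": "arn:aws:iam::111122223333:role/lf-bi-analyst",
    "scientist_bob": "arn:aws:iam::111122223333:role/lf-data-scientist",
}


def resolve_role(presto_user: str) -> str:
    """Return the IAM role whose Lake Formation permissions apply to this user."""
    try:
        return USER_TO_ROLE[presto_user]
    except KeyError:
        # Deny by default: a user without a mapping gets no role.
        raise PermissionError(
            f"no IAM role mapped for Presto user {presto_user!r}"
        ) from None
```

In this sketch an unmapped user is rejected rather than falling back to a shared role, so a missing mapping cannot silently widen access.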
The Ahana-managed Presto service makes it easy to enable the integration so that user access is authorized through AWS Lake Formation.
Figure 2 – Enabling AWS Lake Formation for your Ahana deployment via the Ahana UI.
Presto catalogs are exposed in Ahana as data sources, and the Lake Formation integration is included with the AWS Glue Data Catalog source.
Ahana lets you configure one or more Glue Data Catalog sources to enable cross-account and cross-region access.
Figure 3 – Configure your Glue Data Catalog sources through the Ahana Compute Plane.
Ahana Cloud for Presto integrated with AWS Lake Formation offers customers a number of benefits:
- AWS native: Customers can bring Presto to their existing AWS stack and Amazon S3-backed data lake. Customers can define and manage their metadata and access permissions in a single place using AWS Glue Data Catalog and AWS Lake Formation. Users can securely find and query data using Ahana Presto, along with their choice of AWS analytics and machine learning (ML) services.
- Easy to use: Customers can define a data source enabled with Lake Formation in just a few clicks. The same configuration can be automatically applied to one or more Ahana clusters, making it easy, quick, and secure to give more users access to data.
- Fine-grained data access control: Restrict access to data based on need-to-know to protect sensitive and private information. Ahana-managed Presto filters data based on access policies and returns only the data the user is allowed to see. With Ahana, the number of Presto clusters can easily scale to all users and teams in the organization without compromising data security.
Customer Case Study: Metropolis
Metropolis is using computer vision and mobile devices to reimagine the parking experience for drivers by eliminating the need for tickets and pay machines.
To deliver this experience, Metropolis partners with property owners and asset managers to stand up mobility hubs. Partners are then given visibility into their operational data and metrics, powered by the Metropolis data platform.
Originally, Metropolis built their data platform on traditional databases, such as PostgreSQL and MySQL. To meet their changing business needs and support the rapid increase in data volume, they wanted to build a flexible and horizontally scalable data lake architecture as their new data platform.
Metropolis embarked on a data lake transformation to consolidate all of their data into Amazon S3 and secure access via IAM. Metropolis augments their own data with third-party sources like Zendesk, Heap, and Stripe, storing the resulting datasets in the data lake.
Data pipelines clean, augment, and enrich data through various tiers of quality, classified as bronze (raw data), silver (cleaned data), and gold (BI-ready, aggregates).
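As a rough illustration of the tiering idea, the sketch below promotes a handful of records from bronze to silver to gold. It is a toy model, not Metropolis' pipeline; the field names (`site`, `amount`), the cleaning rules, and the per-site aggregate are assumptions chosen only to show the shape of each step.

```python
from collections import defaultdict


def bronze_to_silver(raw_rows):
    """Clean raw rows: drop rows missing required fields and normalize types."""
    cleaned = []
    for row in raw_rows:
        if row.get("site") is None or row.get("amount") is None:
            continue  # incomplete raw records stay in bronze only
        cleaned.append(
            {"site": row["site"].strip().lower(), "amount": float(row["amount"])}
        )
    return cleaned


def silver_to_gold(clean_rows):
    """Aggregate cleaned rows into BI-ready totals per site."""
    totals = defaultdict(float)
    for row in clean_rows:
        totals[row["site"]] += row["amount"]
    return dict(totals)


# Hypothetical raw (bronze) records.
bronze = [
    {"site": " Garage-A ", "amount": "12.50"},
    {"site": "garage-a", "amount": "7.50"},
    {"site": None, "amount": "3.00"},
    {"site": "garage-b", "amount": "4.00"},
]
gold = silver_to_gold(bronze_to_silver(bronze))
```

Keeping each promotion as its own function mirrors the tier boundaries, so users can inspect the data at any intermediate point.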
Figure 4 – Metropolis high-level architecture.
Metropolis manages staging and production environments that mirror each other’s configuration. Access to each tier is controlled via IAM roles that represent different dimensions, such as job function, geography, and partner.
Metropolis provides reports to partners, who can access only the data relevant to them. Internally, partners are managed by Metropolis’ territory managers, who produce these reports from gold-tier data and can view only the data for the territory they manage.
BI analysts are the technical workhorse for all reporting and have access to silver and gold data tiers as well as the reporting data itself. Data scientists use silver and gold tier data to build models and help define useful transformations for silver to gold aggregates. Finally, data engineers build all of the data pipelines and have access to all tiers.
The following table shows the different roles, and the dimensions at which data access is segmented.
|Role|Access Dimension|
|BI Analyst|Job Function|
|Data Scientist|Job Function|
|Data Engineer|Job Function|
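The job-function access described above can be summarized as a small lookup. This is an illustrative sketch only: the role keys, the tier names as strings, and the `can_read` helper are hypothetical, and the territory and partner dimensions are deliberately left out.

```python
# Tiers each job function may read, as described in the text.
# "reporting" stands for the reporting data that BI analysts also access.
ROLE_TIERS = {
    "bi_analyst": {"silver", "gold", "reporting"},
    "data_scientist": {"silver", "gold"},
    "data_engineer": {"bronze", "silver", "gold"},
}


def can_read(role: str, tier: str) -> bool:
    """Return True if the given job function may read the given tier."""
    # Unknown roles get an empty set, so access is denied by default.
    return tier in ROLE_TIERS.get(role, set())
```

For example, `can_read("data_scientist", "bronze")` evaluates to `False`, while `can_read("data_engineer", "bronze")` evaluates to `True`.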
Metropolis selected the Ahana-managed Presto service to enable data engineers, data scientists, and analysts to run ad-hoc queries in each of the data lake tiers.
Presto’s ability to run a large number of queries with low latency is what Metropolis needed to enable their users to inspect data at each point in the pipeline. Ahana’s cloud-first managed Presto service made it simple for Metropolis’ data platform team to serve their users and partners without the need to scale and manage Presto clusters themselves.
Solution Architecture and Access Control
Ahana-managed Presto clusters are configured with IAM roles that grant access to data stored in Amazon S3 and cataloged in the AWS Glue Data Catalog.
Managing varying access policies in S3 across many users and IAM roles is difficult and prone to mistakes. Also, restricting access at the column and row level, which gives Metropolis more control over how data is shared with territory managers and external partners, was not possible using S3 and IAM policies.
AWS Lake Formation lets Metropolis manage access to the data lake in a single place through fine-grained permissions with grant and revoke semantics. The integration between Ahana and Lake Formation allows Metropolis to enforce fine-grained data lake access controls, securely share datasets with partners, and simplify their S3 permissions in the data lake.
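To show what grant and revoke semantics can look like in practice, the sketch below builds the arguments for a Lake Formation `GrantPermissions` request that allows `SELECT` on a table while excluding certain columns. The role ARN, database, table, and column names are placeholders, and `build_grant_request` is a hypothetical helper; the block only constructs the request and does not call AWS.

```python
def build_grant_request(role_arn, database, table, excluded_columns):
    """Build keyword arguments for the Lake Formation GrantPermissions API."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Grant every column except the ones listed here.
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


# Placeholder identifiers, for illustration only.
request = build_grant_request(
    role_arn="arn:aws:iam::111122223333:role/lf-bi-analyst",
    database="gold",
    table="customer",
    excluded_columns=["phone_number"],
)

# With boto3 installed and AWS credentials configured, the request could be sent as:
#   boto3.client("lakeformation").grant_permissions(**request)
# and undone later with:
#   boto3.client("lakeformation").revoke_permissions(**request)
```

Because the same request shape is used for both the grant and the revoke, a permission can be withdrawn without touching any S3 bucket policy.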
The following diagram represents the high-level architecture where data is stored in S3, cataloged and made discoverable with AWS Glue Data Catalog, secured and made easily shareable with Lake Formation, and finally queryable using Ahana’s managed Presto service.
Figure 5 – AWS Lake Formation for Ahana high-level architecture.
Defining Access Permissions
There are two ways to define access permissions for AWS Lake Formation managed datasets: named resources and tags.
Named resources are databases, tables, and columns as represented by the AWS Glue Data Catalog. Lake Formation tags can be attached to catalog resources and provide a simple way to scale permission management.
For example, you can tag the phone number column of the customer table with “classification=sensitive”. You can then define a permission that allows access to all columns except those tagged with “classification=sensitive”.
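The effect of the tag example above can be modeled locally in a few lines. This is a conceptual sketch rather than the Lake Formation API: the column names other than the phone number, the `public` tag value, and the `visible_columns` helper are assumptions made for illustration.

```python
# Hypothetical tags attached to the columns of the customer table.
COLUMN_TAGS = {
    "customer_id": {"classification": "public"},
    "name": {"classification": "public"},
    "phone_number": {"classification": "sensitive"},
}


def visible_columns(column_tags, denied_tag):
    """Return the columns a user may see when one tag key/value pair is denied."""
    key, value = denied_tag
    return [col for col, tags in column_tags.items() if tags.get(key) != value]


visible = visible_columns(COLUMN_TAGS, ("classification", "sensitive"))
```

Here `visible` contains `customer_id` and `name` but not `phone_number`. The permission is expressed once against the tag, so any column tagged later is covered without a new grant.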
Metropolis implements Lake Formation tag-based access control (LF-TBAC) to segment environments (production, staging) and personas (data engineers, data scientists). Personally identifiable information (PII) is protected via column filtering.
With Lake Formation row-level filtering, Metropolis will be able to implement fine-grained access control for territories and external partners, restricting general managers and partners to only rows relevant to them. Metropolis is considering further segmentation of data access via purpose, such as finances and customer care.
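As a sketch of how a per-territory row filter might be defined, the block below builds the arguments for a Lake Formation `CreateDataCellsFilter` request. The account ID, database, table, the `territory` column, and the `build_territory_filter` helper are all assumptions; Metropolis' actual schema and filters are not described in this post. The block only constructs the request and does not call AWS.

```python
def build_territory_filter(account_id, database, table, territory, filter_name):
    """Build keyword arguments for the Lake Formation CreateDataCellsFilter API."""
    # Double any single quotes so the value stays inside the string literal.
    safe_value = territory.replace("'", "''")
    return {
        "TableData": {
            "TableCatalogId": account_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            # Only rows matching this expression are visible through the filter.
            "RowFilter": {"FilterExpression": f"territory = '{safe_value}'"},
            # An empty wildcard keeps all columns; only rows are restricted.
            "ColumnWildcard": {},
        }
    }


# Placeholder identifiers, for illustration only.
request = build_territory_filter(
    account_id="111122223333",
    database="gold",
    table="parking_sessions",
    territory="west",
    filter_name="territory_west",
)

# With boto3 installed and AWS credentials configured, the request could be sent as:
#   boto3.client("lakeformation").create_data_cells_filter(**request)
```

A filter defined this way would then be granted to the IAM role mapped to the relevant territory manager or partner.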
Metropolis, while modernizing their data platform, selected Ahana-managed Presto because it’s easy to manage, scales to meet their data and query volume, and provides a fast, easy-to-use, self-service experience for their data analysts and data scientists.
The Metropolis modern data platform blueprint takes advantage of central storage for structured and semi-structured data. Metropolis manages data access permissions in a single place using AWS Lake Formation to enable diverse workloads, including reporting, ad-hoc analysis, and model training.
The cloud-first elasticity built on a disaggregated stack provides Metropolis with the flexibility to scale or change workload compute to meet near-term surges and longer-term SLAs in the face of rapid growth.
To learn more about the future parking experience, visit Metropolis. If you want to get started with Presto in less than 30 minutes, check out Ahana Cloud for Presto available through AWS Marketplace—it’s free to try for 14 days, and then it’s pay-as-you-go.
Ahana – AWS Partner Spotlight
Ahana is an AWS Data and Analytics Competency Partner that provides a fully managed and easy-to-use service for running Presto on AWS.