Q: What is a data lake?
A data lake is a scalable central repository of large quantities and varieties of data, both structured and unstructured. Data lakes let you manage the full lifecycle of your data. The first step of building a data lake is ingesting and cataloging data from a variety of sources. The data is then enriched, combined, and cleaned before analysis. This makes it easy to discover and analyze the data with direct queries, visualization, and machine learning (ML). Data lakes complement traditional data warehouses, providing more flexibility, cost-effectiveness, and scalability for ingestion, storage, transformation, and analysis of your data. The traditional challenges around the construction and maintenance of data warehouses and limitations in the types of analysis can be overcome using data lakes.
Read more about "What is a data lake?"
Q: What is AWS Lake Formation?
Lake Formation is an integrated data lake service that makes it easy for you to ingest, clean, catalog, transform, and secure your data and make it available for analysis and ML. Lake Formation gives you a central console where you can discover data sources, set up transformation jobs to move data to an Amazon Simple Storage Service (S3) data lake, remove duplicates and match records, catalog data for access by analytic tools, configure data access and security policies, and audit and control access from AWS analytic and ML services. Lake Formation automatically manages access to the registered data in Amazon S3 through services including AWS Glue, Amazon Athena, Amazon Redshift, Amazon QuickSight, and Amazon EMR using Zeppelin notebooks with Apache Spark to ensure compliance with your defined policies. If you’ve set up transformation jobs spanning AWS services, Lake Formation configures the flows, centralizes their orchestration, and lets you monitor the jobs. With Lake Formation, you can configure and manage your data lake without manually integrating multiple underlying AWS services.
Q: Why should I use Lake Formation to build my data lake?
Lake Formation makes it easy to build, secure, and manage your AWS data lake. Lake Formation integrates with underlying AWS security, storage, analysis, and ML services and automatically configures them to comply with your centrally defined access policies. It also gives you a single console to monitor your jobs and data transformation and analytic workflows.
Lake Formation can manage data ingestion through AWS Glue. Data is automatically classified, and relevant data definitions, schema, and metadata are stored in the central data catalog. AWS Glue also converts your data to your choice of open data formats to be stored in Amazon S3 and cleans your data to remove duplicates and link records across datasets. Once your data is in your S3 data lake, you can define access policies, including table-and-column-level access controls, and enforce encryption for data at rest. You can then use a wide variety of AWS analytic and ML services to access your data lake. All access is secured, governed, and auditable.
Q: Can I see a presentation on AWS Lake Formation?
Yes. You can watch the full recording of the "Intro to AWS Lake Formation" session from re:Invent.
Q: What kind of problems does the FindMatches ML Transform solve?
FindMatches generally solves record linkage and data deduplication problems. Deduplication is necessary when you’re trying to identify records in a database that are conceptually the same but are stored as separate records. This problem is trivial if duplicate records can be identified by a unique key (for instance, if products can be uniquely identified by a UPC code), but it becomes very challenging when you have to do a “fuzzy match.”
Record linkage is basically the same problem as data deduplication, but this term usually means that you’re doing a “fuzzy join” of two databases that don’t share a unique key rather than deduplicating a single database. As an example, consider the problem of matching a large database of customers to a small database of known fraudsters. FindMatches can be used on both record linkage and deduplication problems.
For instance, Lake Formation's FindMatches ML Transform can help you with the following problems:
Link patient records: Link patient records between hospitals so doctors have more background information and are better able to treat patients. Use FindMatches on separate databases that both contain common fields such as name, birthday, home address, and phone number.
Deduplicate data: Deduplicate a database of movies containing columns such as “Title,” “Plot synopsis,” “Year of release,” “Run time,” and “Cast.” For example, there might be variations in how the title or the cast names are listed, resulting in duplicates rather than a clean dataset.
Group products: Automatically group all related products together in your storefront by identifying equivalent items in an apparel product catalog, where you want to define “equivalent” to mean that they’re the same when ignoring certain differences. For instance, you might consider all pants to be equivalent despite differences in size and color.
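To make the idea of a “fuzzy match” concrete, here is a minimal sketch that flags likely duplicate movie records by comparing titles with Python's standard `difflib` module. It is not how FindMatches works internally; the record layout, the `find_duplicate_pairs` helper, and the 0.85 cutoff are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def find_duplicate_pairs(movies, threshold=0.85):
    """Return index pairs whose titles look alike and whose release years agree."""
    pairs = []
    for i in range(len(movies)):
        for j in range(i + 1, len(movies)):
            same_year = movies[i]["year"] == movies[j]["year"]
            titles_close = similarity(movies[i]["title"], movies[j]["title"]) >= threshold
            if same_year and titles_close:
                pairs.append((i, j))
    return pairs


# Hypothetical sample records: the second title contains a typo.
movies = [
    {"title": "The Matrix", "year": 1999},
    {"title": "The Matrx", "year": 1999},
    {"title": "Toy Story", "year": 1995},
]
duplicate_pairs = find_duplicate_pairs(movies)
```

A hand-picked cutoff like this is exactly the kind of hand-tuned rule that becomes hard to maintain as the number of fields grows, which is the motivation for learning the matching criteria from labeled examples instead.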
Q: How does Lake Formation deduplicate my data?
Lake Formation's FindMatches ML Transform makes it easy to find and link records that refer to the same entity but don’t share a reliable identifier. Before FindMatches, developers would commonly solve data-matching problems deterministically, by writing huge numbers of hand-tuned rules. FindMatches uses ML algorithms behind the scenes to learn how to match records according to each developer's business criteria. FindMatches first identifies records for you to label as to whether they match or don’t match and then uses ML to create an ML Transform. You can then run this Transform on your database to find matching records, or you can ask FindMatches to give you additional records to label to push your ML Transform to higher levels of accuracy.
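As a rough illustration of learning a match rule from labels rather than hand-tuning it, the sketch below picks a similarity cutoff from labeled pairs. It is a deliberately simple stand-in and not the algorithm FindMatches uses; the `fit_threshold` function, the score values, and the accuracy criterion are all assumptions.

```python
def fit_threshold(labeled_scores):
    """Pick the similarity cutoff that classifies the most labeled pairs correctly.

    labeled_scores: list of (score, is_match) tuples, with score in [0, 1].
    """
    candidates = sorted({score for score, _ in labeled_scores})
    best_threshold, best_correct = 1.0, -1
    for threshold in candidates:
        # A pair is predicted to match when its score reaches the cutoff.
        correct = sum(
            (score >= threshold) == is_match for score, is_match in labeled_scores
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


# Hypothetical labels: similarity score of a record pair, and whether a person
# marked the pair as a true match.
labels = [(0.95, True), (0.90, True), (0.70, False), (0.40, False)]
threshold = fit_threshold(labels)
```

Supplying more labeled pairs near the current cutoff is the toy equivalent of asking for additional records to label in order to improve accuracy.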
Q: What are ML Transforms?
ML Transforms provide a destination for creating and managing machine-learned transforms. Once created and trained, these ML Transforms can be run in standard AWS Glue scripts. You select an algorithm (for example, the FindMatches ML Transform), then supply the input datasets, training examples, and tuning parameters that the algorithm needs. AWS Lake Formation uses those inputs to build an ML Transform that can be incorporated into a normal ETL job workflow.
Q: How do ML Transforms work?
Lake Formation includes specialized ML-based dataset transformation algorithms that you can use to create your own ML Transforms. These include record deduplication and match finding.
First, navigate to the ML Transforms tab in the Lake Formation console (or use the ML Transforms service endpoints, or access ML Transforms training through the AWS Command Line Interface [CLI]) to create your first ML Transforms model. The ML Transforms tab provides a user-friendly view for managing your transforms. ML Transforms have workflow requirements distinct from those of other transforms, including the need for separate training, parameter tuning, and run workflows; the need for estimating the quality metrics of generated transformations; and the need to manage and collect additional truth labels for training and active learning.
To create an ML Transform with the console, first select the transform type (such as Record Deduplication or Record Matching) and provide the appropriate data sources previously discovered in the data catalog. Depending on the transform, you may be asked to provide ground truth label data for training or additional parameters. You can monitor the status of your training jobs and view quality metrics for each transform. (Quality metrics are reported using a hold-out set of the label data you provided.)
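The hold-out evaluation mentioned above can be sketched as follows. The text does not say which metrics or split the service uses, so precision and recall, the every-Nth split, and both helper names are assumptions for illustration.

```python
def split_holdout(labeled_pairs, holdout_every=4):
    """Reserve every Nth labeled pair for evaluation and train on the rest."""
    train = [p for i, p in enumerate(labeled_pairs) if i % holdout_every != 0]
    holdout = [p for i, p in enumerate(labeled_pairs) if i % holdout_every == 0]
    return train, holdout


def precision_recall(predicted, actual):
    """Compute precision and recall for boolean match predictions."""
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    false_neg = sum(a and not p for p, a in zip(predicted, actual))
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical predictions on four held-out pairs, compared with their labels.
predicted = [True, True, False, False]
actual = [True, False, True, False]
precision, recall = precision_recall(predicted, actual)
```

Because the held-out pairs are never used for training, the resulting numbers estimate how the transform would behave on records it has not seen.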
Once you’re satisfied with the performance, you can promote ML Transforms models for use in production. ML Transforms can then be used during ETL workflows, both in code autogenerated by the service and in user-defined scripts submitted with other jobs, such as prebuilt transforms offered in AWS Glue libraries.
Q: Can I see a presentation on using AWS Lake Formation to find matches and deduplicate records?
Yes. The full recording of the AWS Online Tech Talk "Fuzzy Matching and Deduplicating Data with ML Transforms for AWS Lake Formation" is available here.
Q: How does Lake Formation relate to other AWS services?
Lake Formation manages data access for registered data that is stored in Amazon S3 and manages query access from AWS Glue, Athena, Redshift, Amazon QuickSight, and EMR using Zeppelin notebooks with Apache Spark through a unified security model and permissions. Lake Formation can ingest data from S3, Amazon RDS databases, and AWS CloudTrail logs, understand their formats, clean the data, and make it available for querying. Lake Formation configures the flows, centralizes their orchestration, and lets you monitor the jobs.
Read more about "Data Lakes and Analytics on AWS" including how to build a customized data lake.
Q: How does Lake Formation relate to AWS Glue?
Lake Formation uses a shared infrastructure with AWS Glue, including console controls, ETL code creation and job monitoring, blueprints to create workflows for data ingest, the same data catalog, and a serverless architecture. Although AWS Glue focuses on these types of functions, Lake Formation encompasses all AWS Glue features and provides additional capabilities designed to help build, secure, and manage a data lake. See the AWS Glue features page for more details.
ETL and catalog
Q: How does Lake Formation help me discover the data I can move into my data lake?
Lake Formation automatically discovers all AWS data sources to which it is provided access by your AWS IAM policies. It crawls Amazon S3, Amazon RDS, and AWS CloudTrail sources, and through blueprints it identifies them to you as data that can be ingested into your data lake. No data is ever moved or made accessible to analytic services without your permission. You can also use AWS Glue to ingest data from other sources, including S3 and Amazon DynamoDB.
You can also define JDBC connections to allow Lake Formation to access your AWS databases and on-premises databases including Oracle, MySQL, PostgreSQL, SQL Server, and MariaDB.
Lake Formation ensures that all your data is described in a central data catalog, giving you one location to browse the data that you have permission to view and query. The permissions are defined in your data access policy and can be set at the table and column level.
In addition to the properties automatically populated by the crawlers, you can add labels (including business attributes such as data sensitivity) at the table or column level, and add field-level comments.
Q: How does Lake Formation organize my data in a data lake?
You can use one of the blueprints available in Lake Formation to ingest data into your data lake. Lake Formation creates Glue workflows that crawl source tables, extract the data, and load it to Amazon S3. In S3, Lake Formation organizes the data for you, setting up partitions and data formats for optimized performance and cost. For data already in S3, you can register those buckets with Lake Formation to manage them.
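To show what a partitioned layout in S3 can look like, here is a small sketch that builds an object key with date-based partition prefixes. The Hive-style `key=value` naming, the Parquet file name, and the `partitioned_key` helper are assumptions; the text does not specify the exact layout Lake Formation produces.

```python
from datetime import date


def partitioned_key(table: str, event_date: date, filename: str) -> str:
    """Build an object key with Hive-style year/month/day partition prefixes."""
    return (
        f"{table}/year={event_date.year:04d}/month={event_date.month:02d}/"
        f"day={event_date.day:02d}/{filename}"
    )


# Hypothetical table name and file name.
key = partitioned_key("orders", date(2024, 1, 5), "part-0000.parquet")
```

With a layout like this, a query that filters on a date range only needs to read the objects under the matching prefixes, which is where the performance and cost benefit of partitioning comes from.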
Lake Formation also crawls your data lake to maintain a data catalog and provides an intuitive user interface for you to search entities (by type, classification, attribute, or free-form text).
Q: How does Lake Formation use machine learning to clean my data?
Lake Formation provides jobs that run ML algorithms to perform deduplication and link matching records. Creating ML Transforms is as easy as selecting your source, selecting a desired transform, and providing training data for the desired changes. Once trained to your satisfaction, the ML Transforms can be run as part of your regular data movement workflows, with no ML expertise required.
Q: What are other ways I can ingest data to AWS for use with Lake Formation?
You can move petabytes to exabytes of data from your data centers to AWS using physical appliances with AWS Snowball, AWS Snowball Edge, and AWS Snowmobile. You can also connect your on-premises applications directly to AWS with AWS Storage Gateway. You can accelerate data transfer using a dedicated network connection between your network and AWS with AWS Direct Connect, or boost long-distance global data transfers using Amazon’s globally distributed edge locations with Amazon S3 Transfer Acceleration. Amazon Kinesis also provides a useful way to load streaming data to S3. Lake Formation Data Importers can be set up to perform ongoing ETL jobs and prepare ingested data for analysis.
Q: Can I use my existing data catalog or Hive Metastore with Lake Formation?
Lake Formation provides a way for you to import your existing catalog and metastore into the data catalog. However, Lake Formation requires your metadata to reside in the data catalog to ensure governed access to your data.
Security and governance
Q: How does Lake Formation protect my data?
Lake Formation protects your data by giving you a central location where you can configure granular data access policies that protect your data, regardless of which services are used to access it.
To centralize data access policy controls using Lake Formation, first shut down direct access to your buckets in Amazon S3 so all data access is managed by Lake Formation. Next, configure data protection and access policies using Lake Formation, which enforces those policies across all the AWS services accessing data in your lake. You can configure users and roles and define the data these roles can access, down to the table and column level.
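A column-level grant can also be expressed in code. The sketch below only builds the request for the `GrantPermissions` operation as exposed by the AWS SDK for Python (boto3) and leaves the actual call commented out, so it runs without credentials. The account ID, role, database, table, and column names are hypothetical, as is the `build_column_grant` helper.

```python
def build_column_grant(principal_arn, database, table, columns):
    """Build keyword arguments for a column-level SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical role and catalog names, used only to show the request shape.
request = build_column_grant(
    "arn:aws:iam::111122223333:role/Analyst",
    "sales",
    "orders",
    ["order_id", "total"],
)

# With AWS credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**request)
```

Keeping the request construction separate from the call makes it easy to review or test a policy change before it is applied.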
Lake Formation currently supports server-side encryption on S3 (SSE-S3, AES-256). Lake Formation also supports private endpoints in your Amazon Virtual Private Cloud (VPC) and records all activity in AWS CloudTrail, so you have network isolation and auditability.
Q: How does Lake Formation work with AWS IAM?
Lake Formation integrates with IAM so authenticated users and roles can be automatically mapped to data protection policies that are stored in the data catalog. The IAM integration also lets you use Microsoft Active Directory or LDAP to federate into IAM using SAML.
Q: How do I convert an existing table in Amazon S3 to a governed table?
If you have existing Amazon S3–based tables cataloged in the AWS Glue Data Catalog, you can convert them to governed tables by running the AWS Glue blueprint available on the AWS Labs GitHub page. Additionally, you can create a new governed table and update the manifest information in Lake Formation using the AWS SDK and CLI. The manifest information contains a list of S3 objects and associated metadata that represent the current state of your table. You can also use AWS Glue ETL to read from an existing table and create a copy of it as a governed table. This allows you to migrate your applications and users to the governed table at your own pace.
Enabling data access
Q: How does Lake Formation help an analyst or data scientist discover what data they can access?
Lake Formation ensures that all your data is described in the data catalog, giving you a central location to browse the data that you have permission to view and query. The permissions are defined in your data access policy and can be set at the table and column level.
Q: Can I use third-party business intelligence tools with Lake Formation?
Yes. You can use third-party business applications, such as Tableau and Looker, to connect to your AWS data sources through services such as Athena or Redshift. Access to data is managed by the underlying data catalog, so regardless of which application you use, you’re assured that access to your data is governed and controlled.
Q: Does Lake Formation provide APIs or a CLI?
Yes. Lake Formation provides APIs and a CLI to integrate Lake Formation functionality into your custom applications. Java and C++ SDKs are also available to let you integrate your own data engines with Lake Formation.
Q: What is the AWS Lake Formation Storage API, and why should I use it?
The Lake Formation Storage API provides a single interface for AWS services, ISV solutions, and application developers to securely and reliably read and write data in the data lake. To write data, the Storage API exposes ACID (atomic, consistent, isolated, and durable) transactions that allow you to write data into governed tables, a new type of Amazon S3 table, in a reliable and consistent manner. To read data, the Storage API allows you to query data in governed tables and standard S3 tables secured with Lake Formation fine-grained permissions. The Storage API automatically enforces permissions before returning the filtered results to the calling application. Access permissions are enforced consistently across a wide range of services and tools.
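The all-or-nothing behavior of a transactional write can be illustrated with a small in-memory model. This is not the Storage API and makes no AWS calls; `ToyGovernedTable`, `ToyTransaction`, and their methods are hypothetical names that only mimic the idea of a table whose state is a manifest of S3 objects.

```python
class ToyGovernedTable:
    """In-memory stand-in for a table whose state is a manifest of object keys."""

    def __init__(self):
        self.manifest = set()

    def transaction(self):
        return ToyTransaction(self)


class ToyTransaction:
    """Stage additions and removals; apply all on commit, or none on error."""

    def __init__(self, table):
        self.table = table
        self.added = set()
        self.removed = set()

    def add_object(self, key):
        self.added.add(key)

    def remove_object(self, key):
        self.removed.add(key)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Commit: swap in the new manifest in a single assignment.
            self.table.manifest = (self.table.manifest - self.removed) | self.added
        # On error the staged changes are discarded and the exception propagates.
        return False


table = ToyGovernedTable()
with table.transaction() as txn:
    txn.add_object("orders/part-0000.parquet")
    txn.add_object("orders/part-0001.parquet")
```

Readers of `table.manifest` see either the state before the block or the state after it, never a partially written set of objects, which is the property the transactional write path is meant to provide.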