AWS Lake Formation features

Build data lakes quickly

Import data from databases already in AWS

Once you specify where your existing databases are and provide your access credentials, AWS Lake Formation reads the data and its metadata (schema) to understand the contents of the data source. It then imports the data to your new data lake and records the metadata in a central catalog. With Lake Formation, you can import data from MySQL, PostgreSQL, SQL Server, MariaDB, and Oracle databases running in Amazon Relational Database Service (RDS) or hosted in Amazon Elastic Compute Cloud (EC2). Both bulk and incremental data loading are supported. 

Import data from other external sources

You can use Lake Formation to move data from on-premises databases by connecting with Java Database Connectivity (JDBC). Identify your target sources and provide access credentials in the console, and Lake Formation reads and loads your data into the data lake. To import data from databases other than the ones listed above, you can create custom ETL jobs with AWS Glue. 
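As an illustration of what a JDBC source definition might look like outside the console, the sketch below builds the arguments for the AWS Glue `create_connection` call in boto3. It assumes that a Glue connection is the object holding the JDBC details. The connection name, host, and credentials are hypothetical placeholders.

```python
def build_jdbc_connection(name: str, jdbc_url: str, username: str, password: str) -> dict:
    """Build keyword arguments for glue.create_connection (JDBC type)."""
    return {
        "ConnectionInput": {
            "Name": name,
            "ConnectionType": "JDBC",
            "ConnectionProperties": {
                "JDBC_CONNECTION_URL": jdbc_url,
                "USERNAME": username,
                "PASSWORD": password,
            },
        }
    }


def create_connection(request: dict) -> None:
    """Send the request to AWS Glue. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("glue").create_connection(**request)


# Hypothetical on-premises MySQL source.
request = build_jdbc_connection(
    name="onprem-orders",
    jdbc_url="jdbc:mysql://db.example.internal:3306/orders",
    username="etl_user",
    password="replace-me",
)
```

In practice you would likely keep the password in a secrets store rather than in code; it is inlined here only to keep the sketch short.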

Import data from other AWS services

Using Lake Formation, you can also pull in semi-structured and unstructured data that already resides in Amazon Simple Storage Service (S3). You can identify existing Amazon S3 buckets containing data to copy into your data lake. Once you specify the S3 path to register your data sources and authorize access, Lake Formation reads the data and its schema. Lake Formation can collect and organize datasets such as logs from AWS CloudTrail, Amazon CloudFront, and Elastic Load Balancing (ELB), as well as Detailed Billing Reports. You can also load data into the data lake from Amazon Kinesis or Amazon DynamoDB using custom jobs.
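The registration step can also be sketched programmatically. The snippet below converts an S3 path into ARN form and builds the arguments for the boto3 Lake Formation `register_resource` call. The bucket name and prefix are hypothetical, and using the service-linked role is an assumption made for this example.

```python
def s3_path_to_arn(s3_path: str) -> str:
    """Convert 's3://bucket/prefix' to 'arn:aws:s3:::bucket/prefix'."""
    scheme = "s3://"
    if not s3_path.startswith(scheme):
        raise ValueError(f"not an S3 path: {s3_path}")
    return "arn:aws:s3:::" + s3_path[len(scheme):].rstrip("/")


def build_register_request(s3_path: str) -> dict:
    """Build keyword arguments for lakeformation.register_resource."""
    return {
        "ResourceArn": s3_path_to_arn(s3_path),
        # Assumption: let Lake Formation use its service-linked role.
        "UseServiceLinkedRole": True,
    }


def register_location(s3_path: str) -> None:
    """Register the location. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without boto3

    boto3.client("lakeformation").register_resource(**build_register_request(s3_path))


# Hypothetical bucket and prefix.
request = build_register_request("s3://example-data-lake/raw/logs/")
```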

Catalog and label your data

Lake Formation crawls and reads your data sources to extract technical metadata (such as schema definitions) and creates a searchable catalog to describe this information for users so they can discover available datasets. You can also add your own custom labels to your data (at the table and column level) to define attributes, such as “sensitive information” and “European sales data.” Lake Formation provides a text-based search over this metadata so your users can quickly find the data they need to analyze. 

Transform data

Lake Formation can perform transformations on your data, such as rewriting various date formats for consistency, to ensure that the data is stored in an analytics-friendly fashion. Lake Formation creates transformation templates and schedules jobs to prepare your data for analysis. Your data is transformed with AWS Glue and written in columnar formats, such as Parquet and ORC, for better performance. Less data needs to be read for analysis when data is organized into columns as opposed to scanning entire rows. You can create custom transformation jobs with AWS Glue and Apache Spark to suit your specific requirements.
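To make the columnar claim concrete, here is a toy illustration in plain Python (not Parquet or ORC, and not how AWS Glue works internally). It pivots a few made-up records into columns and counts how many values an aggregate over one field has to touch in each layout.

```python
# Made-up sample records, used only for illustration.
rows = [
    {"order_id": 1, "region": "EU", "amount": 20.0},
    {"order_id": 2, "region": "US", "amount": 35.5},
    {"order_id": 3, "region": "EU", "amount": 12.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: summing "amount" means scanning every field of every row.
values_read_row_layout = sum(len(record) for record in rows)

# Column layout: only the "amount" column needs to be read.
values_read_column_layout = len(columns["amount"])

total_amount = sum(columns["amount"])
```

With three fields per record, the row layout reads three times as many values as the column layout for the same sum. Real columnar formats add compression and encoding on top of this, which the sketch ignores.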

Clean and deduplicate data

Lake Formation helps clean and prepare your data for analysis by providing a Machine Learning (ML) Transform called FindMatches for deduplication and finding matching records. For example, use FindMatches to find duplicate records in your database of restaurants, such as when one record lists “Joe's Pizza” at “121 Main St.” and another shows “Joseph's Pizzeria” at “121 Main.” You don't need to know anything about ML to do this. FindMatches will simply ask you to label sets of records as either “matching” or “not matching.” The system will then learn your criteria for calling a pair of records a match and will build an ML Transform that you can use to find duplicate records within a database or matching records across two databases.
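The labeling input described above can be pictured as pairs of records tagged "matching" or "not matching". The sketch below shows that shape and checks a fixed token-overlap (Jaccard) rule against the labels. This rule is a simple stand-in chosen for illustration; it is not the FindMatches algorithm, which learns its criteria from your labels. The second restaurant and the 0.5 threshold are made up.

```python
import re


def tokens(text: str) -> set:
    """Lowercase, drop punctuation, and split into a set of words."""
    return set(re.sub(r"[^a-z0-9 ]", "", text.lower()).split())


def jaccard(a: str, b: str) -> float:
    """Share of distinct words that the two strings have in common."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Each record is (name, address); each labeled pair is (record, record, label).
labeled_pairs = [
    (("Joe's Pizza", "121 Main St."), ("Joseph's Pizzeria", "121 Main"), "matching"),
    (("Joe's Pizza", "121 Main St."), ("Thai Garden", "9 Oak Ave."), "not matching"),
]


def predict(record_a, record_b, threshold=0.5):
    """Naive fixed rule: compare addresses only (an assumption of this sketch)."""
    score = jaccard(record_a[1], record_b[1])
    return "matching" if score >= threshold else "not matching"


predictions = [predict(a, b) for a, b, _ in labeled_pairs]
```

A fixed rule like this breaks down quickly on real data, which is the gap a learned transform is meant to close.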

Optimize partitions

Lake Formation also optimizes the partitioning of data in Amazon S3 to improve performance and reduce costs. Raw data that is loaded may be in partitions that are too small (requiring extra reads) or too large (reading more data than needed). With Lake Formation, your data is organized by size, time period, and/or relevant keys. This enables both fast scans and parallel, distributed reads for the most commonly used queries.
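As an example of organizing data by time period, the sketch below generates date-keyed S3 prefixes in the common Hive-style `key=value` layout. That layout is an assumption for illustration; the text above does not say which layout Lake Formation produces. The bucket and table names are hypothetical.

```python
from datetime import date, timedelta


def partition_prefix(base: str, table: str, day: date) -> str:
    """Return a Hive-style prefix such as .../year=2024/month=01/day=15/."""
    return (
        f"{base.rstrip('/')}/{table}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


def prefixes_between(base: str, table: str, start: date, end: date) -> list:
    """List the prefixes a query over [start, end] would need to read."""
    day_count = (end - start).days
    return [
        partition_prefix(base, table, start + timedelta(days=offset))
        for offset in range(day_count + 1)
    ]


# Hypothetical location: a three-day query reads three prefixes, not the whole table.
needed = prefixes_between(
    "s3://example-data-lake/curated", "sales", date(2024, 1, 30), date(2024, 2, 1)
)
```

Because each day lives under its own prefix, a query restricted to a date range can skip every other prefix and read the selected ones in parallel.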

Row- and cell-level security

Lake Formation provides data filters that allow you to restrict access to a combination of columns and rows. Use row- and cell-level security to protect sensitive data such as personally identifiable information (PII).
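A data filter combining a row condition with excluded columns might be defined as in the sketch below, which builds the arguments for the boto3 Lake Formation `create_data_cells_filter` call. The account ID, database, table, column names, and filter expression are all hypothetical.

```python
def build_data_filter(catalog_id, database, table, name, row_expression, excluded_columns):
    """Build keyword arguments for lakeformation.create_data_cells_filter."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": name,
            # Rows visible through the filter.
            "RowFilter": {"FilterExpression": row_expression},
            # All columns except the ones listed here are visible.
            "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
        }
    }


def create_filter(request: dict) -> None:
    """Create the filter. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("lakeformation").create_data_cells_filter(**request)


# Hypothetical filter: EU rows only, with PII columns hidden.
request = build_data_filter(
    catalog_id="111122223333",
    database="sales_db",
    table="customers",
    name="eu_no_pii",
    row_expression="region = 'EU'",
    excluded_columns=["email", "phone"],
)
```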

Simplify security management

Enforce encryption

Lake Formation uses the encryption capabilities of Amazon S3 for data in your data lake. This approach provides automatic server-side encryption with keys managed by the AWS Key Management Service (KMS). S3 encrypts data in transit when replicating across Regions and lets you use separate accounts for source and destination Regions to protect against malicious insider deletions. These encryption capabilities provide a secure foundation for all data in your data lake.

Define and manage access controls

Lake Formation provides a single place to manage access controls for data in your data lake. You can define security policies that restrict access to data at the database, table, column, row, and cell levels. These policies apply to AWS Identity and Access Management (IAM) users and roles, and to users and groups when federating through an external identity provider. You can use fine-grained controls to access data secured by Lake Formation within Amazon Redshift Spectrum, Amazon Athena, AWS Glue ETL, and Amazon EMR for Apache Spark.
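A column-level grant could be expressed as in the sketch below, which builds the arguments for the boto3 Lake Formation `grant_permissions` call. The role ARN, database, table, and column names are hypothetical.

```python
def build_column_grant(principal_arn, database, table, columns):
    """Build keyword arguments granting SELECT on specific columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def grant(request: dict) -> None:
    """Apply the grant. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("lakeformation").grant_permissions(**request)


# Hypothetical analyst role that may read two non-sensitive columns.
request = build_column_grant(
    principal_arn="arn:aws:iam::111122223333:role/AnalystRole",
    database="sales_db",
    table="orders",
    columns=["order_id", "amount"],
)
```

The matching `revoke_permissions` call takes the same principal, resource, and permissions arguments.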

Implement audit logging

Lake Formation provides comprehensive audit logs with CloudTrail to monitor access and show compliance with centrally defined policies. You can audit data access history across analytics and ML services that read the data in your data lake via Lake Formation. This lets you see which users or roles have attempted to access what data, with which services, and when. You can access audit logs in the same way you access any other CloudTrail logs using the CloudTrail APIs and console.
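Since the audit records are ordinary CloudTrail events, one way to review them is the CloudTrail `lookup_events` API filtered by event source. The sketch below builds that request and reduces the returned events to who, what, and when. The sample event is made up for illustration.

```python
from datetime import datetime, timezone


def build_lookup_request(start: datetime, end: datetime) -> dict:
    """Build keyword arguments for cloudtrail.lookup_events."""
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventSource", "AttributeValue": "lakeformation.amazonaws.com"}
        ],
        "StartTime": start,
        "EndTime": end,
    }


def summarize(events: list) -> list:
    """Reduce CloudTrail event dicts to (user, action, time) tuples."""
    return [
        (event.get("Username"), event.get("EventName"), event.get("EventTime"))
        for event in events
    ]


def fetch_events(request: dict) -> list:
    """Fetch one page of events. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without boto3

    return boto3.client("cloudtrail").lookup_events(**request)["Events"]


# Made-up event shaped like an entry in the lookup_events response.
sample_events = [
    {
        "Username": "analyst-1",
        "EventName": "GetDataAccess",
        "EventTime": datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
        "EventSource": "lakeformation.amazonaws.com",
    }
]
summary = summarize(sample_events)
```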

Governed tables

Use ACID (atomic, consistent, isolated, and durable) transactions to allow multiple users and jobs to reliably and consistently insert data across multiple tables on Amazon S3. Transactions for Governed Tables automatically manage conflicts and errors and ensure consistent views for all users. You can query Governed Tables using transactions from Amazon Redshift, Amazon Athena, and AWS Glue.

Provide self-service access to data

Label your data with business metadata

With Lake Formation, you can designate data owners, such as data stewards and business units, by adding a field in table properties as custom attributes. Your owners can augment the technical metadata with business metadata that further defines appropriate uses for the data. You can specify appropriate use cases and label the sensitivity of your data for enforcement by using Lake Formation security and access controls.

Enable self-service access

Lake Formation facilitates requesting and vending access to datasets to give your users self-service access to the data lake for a variety of analytics use cases. You can specify, grant, and revoke permissions on tables defined in the central data catalog. The same data catalog is available for multiple accounts, groups, and services.

Discover relevant data for analysis

With Lake Formation, your users enjoy online, text-based search and filtering of datasets recorded in the central data catalog. They can search for relevant data by name, contents, sensitivity, or any other custom labels you have defined.
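A text search over the catalog can be approximated with the AWS Glue `search_tables` API, as in the sketch below. It also filters the results locally on a custom table parameter. The search text and the `sensitivity` parameter name are hypothetical; use whatever labels you have defined.

```python
def build_search_request(text: str, max_results: int = 25) -> dict:
    """Build keyword arguments for glue.search_tables."""
    return {"SearchText": text, "MaxResults": max_results}


def filter_by_parameter(tables: list, key: str, value: str) -> list:
    """Keep tables whose custom Parameters contain key == value."""
    return [
        table for table in tables
        if table.get("Parameters", {}).get(key) == value
    ]


def search(text: str) -> list:
    """Run the search. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without boto3

    return boto3.client("glue").search_tables(**build_search_request(text))["TableList"]


# Made-up results shaped like entries in the TableList response.
sample_tables = [
    {"Name": "eu_sales", "DatabaseName": "sales_db", "Parameters": {"sensitivity": "public"}},
    {"Name": "eu_customers", "DatabaseName": "sales_db", "Parameters": {"sensitivity": "pii"}},
]
public_tables = filter_by_parameter(sample_tables, "sensitivity", "public")
```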

Combine analytics approaches for more insights

With Lake Formation, you can give your analytics users the ability to directly query datasets with Athena for SQL, Redshift for data warehousing, AWS Glue for data integration and preparation, and EMR for Apache Spark-based big data processing and ML (Zeppelin notebooks). Once you point these services to Lake Formation, the datasets available are shown in the catalog and access controls are enforced consistently, allowing your users to readily combine analytics approaches on the same data.
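As one example of querying a cataloged dataset, the sketch below builds the arguments for the Athena `start_query_execution` call in boto3. The database, table, and results location are hypothetical. Whether a given user may run the query depends on the permissions granted to them.

```python
def build_athena_query(database: str, sql: str, output_location: str) -> dict:
    """Build keyword arguments for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: dict) -> str:
    """Start the query and return its execution ID. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    response = boto3.client("athena").start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical query against a table in the central catalog.
request = build_athena_query(
    database="sales_db",
    sql="SELECT region, SUM(amount) AS total FROM orders GROUP BY region",
    output_location="s3://example-athena-results/queries/",
)
```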