AWS Lake Formation Documentation

Build data lakes

Import data from databases already in AWS

Once you specify where your existing databases are and provide your access credentials, Lake Formation is designed to read the data and its metadata (schema) to understand the contents of the data source, import the data into your new data lake, and record the metadata in a central catalog. With Lake Formation, you can import data from MySQL, PostgreSQL, SQL Server, MariaDB, and Oracle databases running in Amazon RDS or hosted on Amazon EC2. Both bulk and incremental data loading are supported.

Import data from other external sources

Lake Formation is designed to move data from on-premises databases by connecting with Java Database Connectivity (JDBC). Identify your target sources and provide access credentials in the console, and Lake Formation reads and loads your data into the data lake. To import data from databases other than the ones listed above, you can create custom ETL jobs with AWS Glue.
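
As an illustration of what "identify your target sources and provide access credentials" can look like outside the console, the sketch below builds the input for an AWS Glue JDBC connection. It assumes the source is reached through a Glue connection and that credentials live in AWS Secrets Manager; the function names, host, and secret name are hypothetical, and only two engines are covered.

```python
def jdbc_connection_input(name, engine, host, port, database, secret_id):
    """Build a ConnectionInput dict for the AWS Glue CreateConnection API.

    All argument values are supplied by the caller; nothing here is
    prescribed by Lake Formation itself.
    """
    prefixes = {"mysql": "jdbc:mysql", "postgresql": "jdbc:postgresql"}
    if engine not in prefixes:
        raise ValueError(f"engine not covered by this sketch: {engine}")
    url = f"{prefixes[engine]}://{host}:{port}/{database}"
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": url,
            # Assumption: credentials are stored in Secrets Manager.
            "SECRET_ID": secret_id,
        },
    }


def create_connection(connection_input):
    """Send the request; assumes boto3 is installed and credentials are configured."""
    import boto3

    boto3.client("glue").create_connection(ConnectionInput=connection_input)
```

Keeping the request builder separate from the call makes the payload easy to review before anything is created in your account.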

Import data from other AWS services

Lake Formation is designed to pull in semi-structured and unstructured data from other Amazon S3 data sources. You can identify existing Amazon S3 buckets containing data to copy into your data lake. Once you specify the S3 path to register your data sources and authorize access, Lake Formation reads the data and its schema. Lake Formation is designed to collect and organize data sets, like logs from AWS CloudTrail, Amazon CloudFront, Detailed Billing Reports, and Elastic Load Balancing. You can also load your data into the data lake with Amazon Kinesis or Amazon DynamoDB using custom jobs.
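
Registering an S3 path can also be done programmatically. The following is a minimal sketch that converts an S3 URI into the ARN form expected by the Lake Formation RegisterResource API and builds the request. The helper names and the bucket path are hypothetical, and using the service-linked role is an assumption; your setup may require a custom role instead.

```python
def s3_uri_to_arn(s3_uri):
    """Convert 's3://bucket/prefix/' into 'arn:aws:s3:::bucket/prefix'."""
    scheme = "s3://"
    if not s3_uri.startswith(scheme):
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    path = s3_uri[len(scheme):].rstrip("/")
    if not path:
        raise ValueError("S3 URI has no bucket name")
    return "arn:aws:s3:::" + path


def register_request(s3_uri):
    """Build keyword arguments for lakeformation.register_resource."""
    return {
        "ResourceArn": s3_uri_to_arn(s3_uri),
        # Assumption: the Lake Formation service-linked role is acceptable here.
        "UseServiceLinkedRole": True,
    }


def register(s3_uri):
    """Send the request; assumes boto3 is installed and credentials are configured."""
    import boto3

    boto3.client("lakeformation").register_resource(**register_request(s3_uri))
```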

Catalog and label your data

Lake Formation is designed to crawl and read your data sources to extract technical metadata (like schema definitions) and create a searchable catalog to describe this information for users, so they can discover available data sets. You can also add your own custom labels, at the table and column level, to your data to define attributes, like “sensitive information” and “European sales data.” Lake Formation is designed to provide a text-based search over this metadata, so your users can find the data they need to analyze.

Transform data

Lake Formation is designed to perform transformations on your data, such as rewriting various date formats for consistency, so that the data is stored in an analytics-friendly fashion. Lake Formation creates transformation templates and schedules jobs to prepare your data for analysis. Your data is transformed with AWS Glue and written in columnar formats, such as Parquet and ORC. Typically, an analysis reads less data when the data is organized into columns, because it can read only the columns it needs instead of scanning entire rows. You can create custom transformation jobs with AWS Glue and Apache Spark to suit your specific requirements.
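
To make "rewriting various date formats for consistency" concrete, here is a small sketch in plain Python. It is not a Glue job or a Lake Formation template; it only shows the kind of per-value logic a custom transformation might apply. The list of accepted input formats is an assumption chosen for illustration.

```python
from datetime import datetime

# Assumption: these are the input formats present in the source data.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def normalize_date(text):
    """Return the date in ISO 8601 form (YYYY-MM-DD), trying each known format."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # Try the next candidate format.
    raise ValueError(f"unrecognized date format: {text!r}")
```

Raising on unknown formats, rather than passing values through unchanged, keeps inconsistent dates from silently reaching the columnar output.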

Clean and deduplicate data

Lake Formation can help clean and prepare your data for analysis by providing a Machine Learning Transform called FindMatches for deduplication and finding matching records. For example, use Lake Formation's FindMatches to find duplicate records in your database of restaurants, such as when one record lists “Joe's Pizza” at “121 Main St.” and another shows a “Joseph's Pizzeria” at “121 Main.” You don't need to know anything about machine learning to do this. FindMatches will just ask you to label sets of records as either “matching” or “not matching.” The system is designed to learn your criteria for calling a pair of records a “match” and will build an ML Transform that you can use to find duplicate records within a database or matching records across two databases.
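
The restaurant example can be illustrated with a toy comparison. The sketch below is not how FindMatches works: FindMatches learns matching criteria from your labels, whereas this uses a fixed string-similarity ratio from the Python standard library. The record fields and function names are hypothetical. It only shows why near-duplicate records score closer to each other than unrelated ones.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Join the name and address fields into one lowercase string."""
    return " ".join(str(record[key]).lower().strip() for key in ("name", "address"))


def similarity(a, b):
    """Return a ratio in [0, 1]; higher means the two records look more alike."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


joes = {"name": "Joe's Pizza", "address": "121 Main St."}
josephs = {"name": "Joseph's Pizzeria", "address": "121 Main"}
unrelated = {"name": "Blue Sushi", "address": "9 Oak Ave."}

# The likely duplicate pair should score higher than the unrelated pair.
likely_duplicate = similarity(joes, josephs) > similarity(joes, unrelated)
```

A fixed threshold on such a score tends to be brittle, which is the gap a learned transform trained on "matching" and "not matching" labels is meant to close.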

Optimize partitions

Lake Formation is also designed to optimize the partitioning of data in S3 to improve performance. With Lake Formation, your data is organized by size, time period, and/or relevant keys. This enables both fast scans and parallel, distributed reads for the most commonly used queries.
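
One common way to organize data by time period and relevant keys is a Hive-style key layout in S3, where each partition value appears in the object prefix. The sketch below assumes that layout and a hypothetical `region` key; the layout Lake Formation actually produces for your data may differ.

```python
from datetime import date


def partition_prefix(table_prefix, iso_day, region):
    """Build a Hive-style partition prefix such as .../region=eu/year=2021/month=03/day=05/."""
    day = date.fromisoformat(iso_day)
    base = table_prefix.rstrip("/")
    return (
        f"{base}/region={region}"
        f"/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )
```

With keys laid out this way, a query filtered on region and date only needs to list and read objects under the matching prefixes, and separate prefixes can be read in parallel.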

Security management

Enforce encryption

Lake Formation leverages the encryption capabilities of S3 for data in your data lake. This approach provides automatic server-side encryption with keys managed by the AWS Key Management Service (KMS). You can configure S3 to encrypt data in transit when replicating across regions, and use separate accounts for source and destination regions to protect against malicious insider deletions. These encryption capabilities are designed to provide a secure foundation for all data in your data lake.
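
As a sketch of the S3 side of this, the following builds a default-encryption configuration that uses a KMS key and applies it with the S3 PutBucketEncryption API. The bucket name and key ARN are supplied by the caller and the function names are hypothetical; whether you use a customer managed key or an AWS managed key is a choice this example does not make for you.

```python
def default_kms_encryption(kms_key_arn):
    """Build a ServerSideEncryptionConfiguration that defaults to SSE-KMS."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                }
            }
        ]
    }


def apply_default_encryption(bucket, kms_key_arn):
    """Send the request; assumes boto3 is installed and credentials are configured."""
    import boto3

    boto3.client("s3").put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=default_kms_encryption(kms_key_arn),
    )
```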

Define and manage access controls

Lake Formation provides central access controls for data in your data lake. You can define security policy-based rules for your users and applications by role in Lake Formation, and integration with AWS IAM authenticates those users and roles. Once the rules are defined, Lake Formation is designed to enforce your access controls at table- and column-level granularity for users of Amazon Redshift Spectrum and Amazon Athena. 
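
A column-level rule can be expressed through the Lake Formation GrantPermissions API. The sketch below builds a request that grants SELECT on specific columns of one table to an IAM principal. The database, table, column, and role names in any call are your own; the helper names are hypothetical.

```python
def column_select_grant(principal_arn, database, table, columns):
    """Build keyword arguments for lakeformation.grant_permissions (SELECT on named columns)."""
    columns = list(columns)
    if not columns:
        raise ValueError("at least one column is required")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }


def grant(request):
    """Send the request; assumes boto3 is installed and credentials are configured."""
    import boto3

    boto3.client("lakeformation").grant_permissions(**request)
```

The same request shape can be passed to `revoke_permissions` to withdraw the grant.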

Implement audit logging

Lake Formation is designed to provide comprehensive audit logs with CloudTrail to monitor access and show compliance with centrally defined policies. You can audit data access history across analytics and machine learning services that read the data in your data lake via Lake Formation. This lets you see which users or roles have attempted to access what data, with which services, and when. You can access audit logs in the same way you access any other CloudTrail logs using the CloudTrail APIs and Console.
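
Because the audit records are ordinary CloudTrail events, they can be retrieved with the CloudTrail LookupEvents API. The sketch below filters on the Lake Formation event source for a recent time window. The helper names and the choice of window are assumptions, and the example reads only the first page of results.

```python
from datetime import datetime, timedelta, timezone


def lake_formation_event_query(hours_back, now=None):
    """Build keyword arguments for cloudtrail.lookup_events covering the last N hours."""
    end = now or datetime.now(timezone.utc)
    return {
        "LookupAttributes": [
            {
                "AttributeKey": "EventSource",
                "AttributeValue": "lakeformation.amazonaws.com",
            }
        ],
        "StartTime": end - timedelta(hours=hours_back),
        "EndTime": end,
    }


def recent_access_events(hours_back=24):
    """Return (time, user, event name) tuples; assumes boto3 and configured credentials."""
    import boto3

    response = boto3.client("cloudtrail").lookup_events(
        **lake_formation_event_query(hours_back)
    )
    return [
        (event.get("EventTime"), event.get("Username"), event.get("EventName"))
        for event in response.get("Events", [])
    ]
```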

Provide self-service access to data

Label your data with business metadata

Lake Formation is designed to provide you with the ability to designate data owners, such as data stewards and business units, by adding a field in table properties as custom attributes. Your owners can augment the technical metadata with business metadata that further defines appropriate uses for the data. You can specify appropriate use cases and label the sensitivity of your data for enforcement by using Lake Formation security and access controls.

Enable self-service access

Lake Formation is designed to facilitate requesting and vending access to datasets to give your users self-service access to the data lake for a variety of analytics use cases. You can specify, grant, and revoke permissions on tables defined in the central data catalog. The same data catalog is available for multiple accounts, groups, and services.

Discover relevant data for analysis

With Lake Formation, your users can use online, text-based search and filtering of data sets recorded in the central data catalog. They can search for relevant data by name, contents, sensitivity, or any other custom labels you have defined.
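
Text-based search over the catalog is also available through the AWS Glue SearchTables API. The following sketch assumes the central catalog is the Glue Data Catalog in the caller's account; the helper names and the result limit are hypothetical, and only the first page of results is read.

```python
def catalog_search_request(text, max_results=25):
    """Build keyword arguments for glue.search_tables."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("search text must not be empty")
    return {"SearchText": cleaned, "MaxResults": max_results}


def search_catalog(text):
    """Return (database, table) pairs; assumes boto3 and configured credentials."""
    import boto3

    response = boto3.client("glue").search_tables(**catalog_search_request(text))
    return [
        (table["DatabaseName"], table["Name"])
        for table in response.get("TableList", [])
    ]
```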

Combine analytics approaches for more insights

Lake Formation is designed so that you can give your analytics users the ability to directly query datasets with Athena for SQL and Amazon Redshift for data warehousing. Once you point these services to Lake Formation, the available data sets are shown in the catalog and access controls are enforced consistently, allowing your users to combine analytics approaches on the same data.
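
Querying a cataloged data set from Athena can be sketched with the Athena StartQueryExecution API. The database name, SQL text, and results location are supplied by the caller and the helper names are hypothetical. The example assumes query results are written to an S3 location you control, and it starts the query without waiting for it to finish.

```python
def athena_query_request(sql, database, output_location):
    """Build keyword arguments for athena.start_query_execution."""
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an S3 URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(sql, database, output_location):
    """Start the query and return its execution ID; assumes boto3 and configured credentials."""
    import boto3

    response = boto3.client("athena").start_query_execution(
        **athena_query_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]
```

If the caller lacks Lake Formation permissions on a table or column, the query is expected to fail rather than return the restricted data.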

Additional Information

For additional information about service controls, security features and functionalities, including, as applicable, information about storing, retrieving, modifying, restricting, and deleting data, please see https://docs.aws.amazon.com/index.html. This additional information does not form part of the Documentation for purposes of the AWS Customer Agreement available at http://aws.amazon.com/agreement, or other agreement between you and AWS governing your use of AWS’s services.