Build data lakes quickly
Once you specify where your existing databases are and provide your access credentials, AWS Lake Formation reads the data and its metadata (schema) to understand the contents of the data source. It then imports the data to your new data lake and records the metadata in a central catalog. With Lake Formation, you can import data from MySQL, PostgreSQL, SQL Server, MariaDB, and Oracle databases running in Amazon Relational Database Service (RDS) or hosted in Amazon Elastic Compute Cloud (EC2). Both bulk and incremental data loading are supported.
You can use Lake Formation to move data from on-premises databases by connecting with Java Database Connectivity (JDBC). Identify your target sources and provide access credentials in the console, and Lake Formation reads and loads your data into the data lake. To import data from databases other than the ones listed above, you can create custom ETL jobs with AWS Glue.
Using Lake Formation, you can also pull in semi-structured and unstructured data from other Amazon Simple Storage Service (S3) data sources. You can identify existing Amazon S3 buckets containing data to copy into your data lake. Once you specify the S3 path to register your data sources and authorize access, Lake Formation reads the data and its schema. Lake Formation can collect and organize datasets, such as logs from AWS CloudTrail, Amazon CloudFront, Detailed Billing Reports, and Elastic Load Balancing (ELB). You can also load your data into the data lake with Amazon Kinesis or Amazon DynamoDB using custom jobs.
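As a rough illustration of registering an S3 path, the sketch below wraps the `register_resource` call of the boto3 Lake Formation client. The helper name, bucket, and prefix are hypothetical, and using the service-linked role is an assumption; the client is passed in so the function can be tried without AWS credentials.

```python
def register_s3_location(lf_client, bucket, prefix=""):
    """Register an S3 path with Lake Formation (illustrative sketch).

    lf_client is expected to behave like boto3.client("lakeformation").
    """
    # S3 ARNs have the form arn:aws:s3:::bucket/optional/prefix
    cleaned = prefix.strip("/")
    path = f"{bucket}/{cleaned}" if cleaned else bucket
    arn = f"arn:aws:s3:::{path}"
    lf_client.register_resource(
        ResourceArn=arn,
        UseServiceLinkedRole=True,  # assumption: rely on the service-linked role
    )
    return arn
```

With real credentials, this could be called as `register_s3_location(boto3.client("lakeformation"), "example-bucket", "logs/")`, where `example-bucket` stands in for one of your own buckets.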
Lake Formation crawls and reads your data sources to extract technical metadata (such as schema definitions) and creates a searchable catalog to describe this information for users so they can discover available datasets. You can also add your own custom labels to your data (at the table and column level) to define attributes, such as “sensitive information” and “European sales data.” Lake Formation provides a text-based search over this metadata so your users can quickly find the data they need to analyze.
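One way to run such a text search programmatically is the `search_tables` operation of the AWS Glue Data Catalog. The sketch below is a minimal example, assuming your custom labels are stored in metadata that the catalog search covers; the helper name is hypothetical, and pagination through `NextToken` is left out for brevity.

```python
def find_tables(glue_client, text, max_results=25):
    """Return (database, table) pairs whose catalog metadata matches `text`.

    glue_client is expected to behave like boto3.client("glue").
    """
    response = glue_client.search_tables(SearchText=text, MaxResults=max_results)
    # Only the first page of results is read in this sketch.
    return [(t["DatabaseName"], t["Name"]) for t in response.get("TableList", [])]
```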
Lake Formation can perform transformations on your data, such as rewriting various date formats for consistency, to ensure that the data is stored in an analytics-friendly fashion. Lake Formation creates transformation templates and schedules jobs to prepare your data for analysis. Your data is transformed with AWS Glue and written in columnar formats, such as Parquet and ORC, for better performance. Because analytic queries typically touch only a few columns, a columnar layout lets the query engine read just those columns instead of scanning entire rows. You can create custom transformation jobs with AWS Glue and Apache Spark to suit your specific requirements.
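To make the date-format example concrete, here is a small sketch of the kind of normalization a custom transformation job might apply. It is plain Python rather than a Lake Formation or AWS Glue API, and the list of input formats is an assumption (note that `%m/%d/%Y` treats slash-separated dates as month-first).

```python
from datetime import datetime

# Assumed input formats; a real job would list the formats found in its sources.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d", "%d.%m.%Y")

def normalize_date(raw):
    """Return `raw` as an ISO 8601 date string, or None if no format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning `None` for unrecognized values keeps bad records visible so they can be routed to a review step instead of being silently rewritten.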
Lake Formation helps clean and prepare your data for analysis by providing a Machine Learning (ML) Transform called FindMatches for deduplication and finding matching records. For example, use FindMatches to find duplicate records in your database of restaurants, such as when one record lists “Joe's Pizza” at “121 Main St.” and another shows “Joseph's Pizzeria” at “121 Main.” You don't need to know anything about ML to do this. FindMatches will simply ask you to label sets of records as either “matching” or “not matching.” The system will then learn your criteria for calling a pair of records a match and will build an ML Transform that you can use to find duplicate records within a database or matching records across two databases.
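To give a feel for what "matching" means here, the sketch below scores record pairs with simple character-level string similarity from Python's standard `difflib` module. This is not the FindMatches algorithm, which learns its criteria from your labeled examples; it is a hand-written stand-in, and the field names and the third record are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity in [0, 1], ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def record_score(rec_a, rec_b, fields=("name", "address")):
    """Average the per-field similarity of two records given as dicts."""
    return sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)

joes = {"name": "Joe's Pizza", "address": "121 Main St."}
josephs = {"name": "Joseph's Pizzeria", "address": "121 Main"}
sushi = {"name": "Blue Lagoon Sushi", "address": "9 Ocean Ave"}  # unrelated record

likely_duplicate = record_score(joes, josephs)
likely_distinct = record_score(joes, sushi)
```

The two pizza records score well above the unrelated pair. A fixed threshold on such a score is brittle, which is the gap a transform trained on labeled pairs is meant to close.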
Lake Formation also optimizes the partitioning of data in Amazon S3 to improve performance and reduce costs. Raw data that is loaded may be in partitions that are too small (requiring extra reads) or too large (reading more data than needed). With Lake Formation, your data is organized by size, time period, relevant keys, or a combination of these. This enables both fast scans and parallel, distributed reads for the most commonly used queries.
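As an illustration of organizing data by time period and key, the sketch below builds a Hive-style S3 key prefix of the `column=value` form that query engines commonly use to skip irrelevant partitions. The layout, table name, and `region` key are assumptions made for the example, not a description of the layout Lake Formation itself produces.

```python
from datetime import date

def partition_prefix(table, event_date, region):
    """Build a Hive-style S3 key prefix partitioned by date and a key column."""
    return (
        f"{table}/year={event_date.year:04d}/month={event_date.month:02d}/"
        f"day={event_date.day:02d}/region={region}/"
    )

# Example: objects for one day of EU sales share a single prefix.
example_prefix = partition_prefix("sales", date(2019, 8, 8), "eu")
```

A query filtered on one day and one region then only needs to list and read objects under that single prefix.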
Simplify security management
Lake Formation uses the encryption capabilities of Amazon S3 for data in your data lake. This approach provides automatic server-side encryption with keys managed by the AWS Key Management Service (KMS). S3 encrypts data in transit when replicating across Regions and lets you use separate accounts for source and destination Regions to protect against malicious insider deletions. These encryption capabilities provide a secure foundation for all data in your data lake.
Lake Formation provides comprehensive audit logs with CloudTrail to monitor access and show compliance with centrally defined policies. You can audit data access history across analytics and ML services that read the data in your data lake via Lake Formation. This lets you see which users or roles have attempted to access what data, with which services, and when. You can access audit logs in the same way you access any other CloudTrail logs using the CloudTrail APIs and console.
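For example, recent Lake Formation events can be pulled with the CloudTrail `lookup_events` operation, filtered by event source. The sketch below is a minimal version with a hypothetical helper name; it reads only the first page of results.

```python
def recent_lake_formation_events(cloudtrail_client, max_results=50):
    """Return (time, user, event name) tuples for recent Lake Formation events.

    cloudtrail_client is expected to behave like boto3.client("cloudtrail").
    """
    response = cloudtrail_client.lookup_events(
        LookupAttributes=[
            {
                "AttributeKey": "EventSource",
                "AttributeValue": "lakeformation.amazonaws.com",
            }
        ],
        MaxResults=max_results,
    )
    # Some fields can be absent on an event, so read them defensively.
    return [
        (e.get("EventTime"), e.get("Username"), e.get("EventName"))
        for e in response.get("Events", [])
    ]
```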
Provide self-service access to data
With Lake Formation, you can designate data owners, such as data stewards and business units, by adding a field in table properties as custom attributes. Your owners can augment the technical metadata with business metadata that further defines appropriate uses for the data. You can specify appropriate use cases and label the sensitivity of your data for enforcement by using Lake Formation security and access controls.
Lake Formation facilitates requesting and vending access to datasets to give your users self-service access to the data lake for a variety of analytics use cases. You can specify, grant, and revoke permissions on tables defined in the central data catalog. The same data catalog is available for multiple accounts, groups, and services.
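Granting and revoking can also be scripted. The sketch below wraps the `grant_permissions` and `revoke_permissions` operations of the boto3 Lake Formation client for table-level `SELECT` access; the helper names are hypothetical, and the principal ARN, database, and table are placeholders you would supply.

```python
def _select_request(principal_arn, database, table):
    """Build the shared request body for a table-level SELECT permission."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }

def grant_select(lf_client, principal_arn, database, table):
    """Grant SELECT on one catalog table to an IAM user or role."""
    request = _select_request(principal_arn, database, table)
    lf_client.grant_permissions(**request)
    return request

def revoke_select(lf_client, principal_arn, database, table):
    """Revoke a previously granted SELECT on one catalog table."""
    request = _select_request(principal_arn, database, table)
    lf_client.revoke_permissions(**request)
    return request
```

Building the request in one place keeps the grant and the matching revoke symmetric, so a permission can be withdrawn with exactly the same principal and resource it was given to.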
With Lake Formation, your users enjoy online, text-based search and filtering of datasets recorded in the central data catalog. They can search for relevant data by name, contents, sensitivity, or any other custom labels you have defined.