AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. A data lake lets you break down data silos and combine different types of analytics to gain insights and guide better business decisions.
Setting up and managing data lakes today involves a lot of manual, complicated, and time-consuming tasks. This work includes loading data from diverse sources, monitoring those data flows, setting up partitions, turning on encryption and managing keys, defining transformation jobs and monitoring their operation, reorganizing data into a columnar format, deduplicating redundant data, and matching linked records. Once data has been loaded into the data lake, you need to grant fine-grained access to datasets, and audit access over time across a wide range of analytics and machine learning (ML) tools and services.
Creating a data lake with Lake Formation is as simple as defining data sources and what access and security policies you want to apply. Lake Formation then helps you collect and catalog data from databases and object storage, move the data into your new Amazon Simple Storage Service (Amazon S3) data lake, clean and classify your data using ML algorithms, and secure access to your sensitive data using granular controls at the column, row, and cell levels. Your users can access a centralized data catalog that describes available datasets and their appropriate usage. They then use these datasets with their choice of analytics and ML services, such as Amazon Redshift, Amazon Athena, Amazon EMR for Apache Spark, and Amazon QuickSight. Lake Formation builds on the capabilities available in AWS Glue.
Build data lakes quickly
With Lake Formation, you can move, store, catalog, and clean your data faster. You simply point Lake Formation at your data sources, and it crawls those sources and moves the data into your new Amazon S3 data lake. Lake Formation organizes data in S3 around frequently used query terms and into right-sized chunks to increase efficiency. It also changes data into formats such as Apache Parquet and ORC for faster analytics. In addition, Lake Formation has built-in ML to deduplicate and find matching records (two entries that refer to the same thing) to increase data quality.
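To make the idea of "matching records" concrete, here is a minimal sketch in Python. It is not how Lake Formation's built-in ML works: it swaps in plain string similarity from the standard library's difflib module, and the function name, the 0.85 threshold, and the sample customer names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_matching_pairs(records: list[str], threshold: float = 0.85) -> list[tuple[int, int]]:
    """Return index pairs of records whose text similarity meets the threshold.

    Stand-in technique: character-level similarity, not a trained ML model.
    """
    # Normalize case and collapse repeated whitespace before comparing.
    normalized = [" ".join(r.lower().split()) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(normalized)), 2)
        if SequenceMatcher(None, normalized[i], normalized[j]).ratio() >= threshold
    ]


# Hypothetical sample data: the first two entries refer to the same person.
customers = ["John Smith", "Jon  Smith", "Alice Wong"]
pairs = find_matching_pairs(customers)
print(pairs)
```

A real matching job would compare several fields per record and learn from labeled examples; the sketch only shows the shape of the problem, which is pairing entries that likely describe the same thing.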
Simplify security management
Lake Formation provides a single place to define and enforce access controls that operate at the table, column, row, and cell levels for all the users and services that access your data. Your policies are consistently implemented, eliminating the need to manually configure them across security services such as AWS Identity and Access Management (IAM) and AWS Key Management Service (KMS), storage services such as S3, and analytics and ML services such as Redshift, Athena, AWS Glue, and EMR for Apache Spark. This reduces the effort of configuring policies across services and provides consistent enforcement and compliance.
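As an illustration of a column-level control, the sketch below builds the request for a SELECT grant on specific columns of one table. The helper name, the role ARN, and the database, table, and column names are hypothetical. The dictionary follows the shape accepted by the boto3 Lake Formation client's grant_permissions call, which is shown only in a comment so that the block runs without AWS credentials.

```python
from typing import Any


def build_column_grant(principal_arn: str, database: str, table: str,
                       columns: list[str]) -> dict[str, Any]:
    """Build keyword arguments for a column-level SELECT grant (hypothetical helper)."""
    if not columns:
        raise ValueError("at least one column is required")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# All identifiers below are placeholders, not values from a real account.
request = build_column_grant(
    "arn:aws:iam::111122223333:role/AnalystRole",
    "sales_db",
    "orders",
    ["order_id", "order_date"],
)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**request)
print(request)
```

Because the grant is defined once against the catalog table, the same restriction would apply to every integrated service that reads the table, which is the point of the single place of control described above.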
Provide self-service access to data
With Lake Formation, you build a data catalog that describes the different datasets available, along with which groups of users have access to each. This makes your users more productive by helping them find the right dataset to analyze. By providing a catalog of your data with consistent security enforcement, Lake Formation makes it easier for your analysts and data scientists to use their preferred analytics service. They can use EMR for Apache Spark, Redshift, Athena, AWS Glue, and Amazon QuickSight on diverse datasets now housed in a single data lake. Users can also combine these services without having to move data between silos.
How it works
Lake Formation helps you build, secure, and manage your data lake. First, identify existing data stores in S3 or relational and NoSQL databases, and move the data into your data lake. Then crawl, catalog, and prepare the data for analytics. Next, provide your users with secure self-service access to the data through their choice of analytics services. Other AWS services and third-party applications can also access data through these analytics services. Lake Formation manages all of these tasks and is integrated with the underlying data stores and analytics services.
Build data lakes quickly
Use blueprints in Lake Formation to move, store, catalog, clean, and organize your data faster. Convert data into formats such as Parquet and ORC for faster analytics, and use built-in ML to deduplicate and find matching records. Simplify how you store and maintain your data using Governed Tables, a new type of Amazon S3 table. Governed Tables use ACID (atomic, consistent, isolated, and durable) transactions that automatically manage conflicts and ensure consistent data views for all users. Governed Tables also monitor and automatically optimize your data to improve engine performance when querying them.
Centrally define and manage access controls
Enforce data classification and fine-grained access
Lake Formation enforces policies without requiring you to configure data access controls in each consuming service. Lake Formation automatically filters data and reveals to authorized users only the data permitted by the defined policy, without requiring you to duplicate data.
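The filtering behavior described above can be pictured with a toy model. This is a conceptual sketch only, not Lake Formation's implementation or API: the Policy class, the apply_policy function, and the sample rows are all assumptions. It shows one stored dataset yielding a restricted view per policy, so no filtered copy of the data has to be kept.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: which columns are visible and which rows qualify."""
    allowed_columns: frozenset[str]
    row_filter: Callable[[Row], bool]


def apply_policy(rows: list[Row], policy: Policy) -> list[Row]:
    """Return only permitted rows, each reduced to the permitted columns."""
    return [
        {key: value for key, value in row.items() if key in policy.allowed_columns}
        for row in rows
        if policy.row_filter(row)
    ]


# Hypothetical sample data; "email" stands in for a sensitive column.
orders = [
    {"order_id": 1, "region": "EU", "email": "a@example.com", "total": 40},
    {"order_id": 2, "region": "US", "email": "b@example.com", "total": 75},
]

eu_analyst = Policy(
    allowed_columns=frozenset({"order_id", "region", "total"}),
    row_filter=lambda row: row["region"] == "EU",
)

visible = apply_policy(orders, eu_analyst)
print(visible)
```

The underlying list is never copied or modified per user; each policy produces its view at read time, which mirrors the "without duplicating data" claim at a very small scale.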
Enable continuous data management, time travel, and storage optimization
Enhance data lake reliability and trustworthiness for updating batch and streaming data. Query historical data versions and audit changed data. Auto-compact small files and enable push-down filters to reduce data scans and improve query performance.
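To show why compacting small files helps, here is a minimal planning sketch. It is an assumption-laden illustration, not the algorithm Lake Formation uses: the function name, the greedy grouping strategy, the 64-unit target, and the file names are all hypothetical. It groups files smaller than a target size into batches that could each be rewritten as one larger file, so a query opens fewer objects.

```python
def plan_compaction(file_sizes: dict[str, int], target_bytes: int) -> list[list[str]]:
    """Greedily group small files into merge batches of at most target_bytes.

    Files at or above the target are left alone; batches holding a single
    file are dropped because there is nothing to merge.
    """
    # Consider only small files, smallest first, with names as a tie-breaker.
    small = sorted((size, name) for name, size in file_sizes.items() if size < target_bytes)

    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for size, name in small:
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)

    return [batch for batch in batches if len(batch) > 1]


# Hypothetical file sizes in arbitrary units.
sizes = {"a": 10, "b": 20, "c": 30, "d": 200, "e": 50}
plan = plan_compaction(sizes, target_bytes=64)
print(plan)
```

In this example three small files would be merged into one, the already large file is untouched, and the remaining small file waits until it has a partner.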
Enable federated data lakes with cross-account sharing
Deliver decentralized, domain-oriented data products across your organization using well-governed data sharing with minimal to no data movement.
Refer to "What is a data lake?" for more information.
Nu Skin Enterprises is a global direct selling company that distributes more than 200 premium-quality anti-aging products in the personal care and nutritional supplements categories.
"We were challenged with expanding capability and scaling throughput of our existing analytics systems. Our data was distributed amongst various disconnected databases and SaaS solutions, making it difficult to analyze data at scale while restricting access to sensitive data. To overcome this challenge, we built a data lake solution on AWS. This allowed us to aggregate data from various data silos into Amazon S3, where we cataloged and secured all data using AWS Lake Formation. Without AWS Lake Formation, it would have been impossible to achieve the goal of a scalable, easy-to-use security layer for all data on Amazon S3. It was easy to set up and apply fine grained access controls based on user personas."
Joe Sueper, VP Enterprise Architecture, Global Technology Services – Nu Skin Enterprises
Panasonic Avionics Corporation is the world's leading supplier of in-flight entertainment and communication systems.
“We wanted to create a data platform with the ability to manage the security settings for all the different applications in our environment. With AWS Lake Formation, we can now define policies once and enforce them in the same way, everywhere, for multiple services we use, including AWS Glue and Amazon Athena. The enhanced level of control gives us secure access to data and meta-data for columns and tables, not just for bulk objects, which is an important part of our data security and governance standard.”
Anand Desikan, Director of Cloud and Data Services – Panasonic Avionics
Accenture is a leading global professional services company, providing a broad range of services and solutions in strategy, consulting, digital, technology, and operations.
“I focus on helping clients in their ‘Data on Cloud’ journey. Specific to that, we have seen that organizations are dealing with a lack of trusted data when they need to perform analytics on data coming from multiple sources. Data cleansing is a critical step in data analytics and can greatly impact the business outcome and decision making. The new features in AWS Lake Formation have been hugely beneficial to address the challenge of data veracity and securing access to the data lake. We found it tremendously useful to make use of the advanced machine learning techniques for data preparation to find matching records, clean, and deduplicate data from different data sources. This will help reduce the time, effort, and cost, while improving the quality and accuracy of the data in a customer’s data lakes.”
Namrata Maheshwary, Senior Architect for the Data Business Group – Accenture
Zalando is Europe’s leading online platform for fashion and lifestyle.
“As Europe’s most fashionable tech company, we work hard to find digital solutions for every aspect of the fashion journey. AWS Lake Formation gave us a scalable central point of control for data access through Amazon Redshift that not only simplified the process, but improved it through granular control over how our data is being used. Now we can discover, access, and analyze data in our data lake with our preferred tools, and leverage it for business intelligence and data science. This streamlined workflow helps our executives make the right decisions on time, and fosters innovation through machine learning.”
Alberto Miorin, Engineering Lead – Zalando SE
Life360 is the world's leading peace of mind service for families. The Life360 app brings families closer with smart features designed to protect and connect the people who matter most.
“We wanted to use AWS Lake Formation to build our data lake for supporting location-based time-series data, and make it much easier to load data. The prefabricated blueprints helped get data into the data lake without our data engineering team having to write code from scratch, so they could focus on operationalizing ingest, not reinventing the wheel. With AWS Lake Formation, we were able to quickly unlock data available in Amazon S3 and make it available to analyze across a broad spectrum of AWS data services. The data remains in place in Amazon S3, we can analyze it in many different ways, and we maintain full control over it.”
Richard Chennault, Head of Cloud and Data Services – Life360, Inc.
Change Healthcare is a leading independent healthcare technology company that provides data and analytics-driven solutions that reach approximately 2,100 government and commercial payer connections, 5,500 hospitals, 900,000 physicians, and 33,000 pharmacies.
“We handle data from millions of transactions daily while maintaining compliance with healthcare industry regulations, including HIPAA. We are very excited about the launch of AWS Lake Formation, which provides a central point of control to easily load, clean, secure, and catalog data from thousands of clients to our AWS-based data lake, dramatically reducing our operational load. The data access controls in Lake Formation will make it easy for us to define our policies once and have them be enforced across all the analytics and machine learning services we use, with audit logs to show compliance.”
Aaron Symanski, CTO – Change Healthcare
Fender Digital is a part of Fender, the iconic guitar brand, that makes apps, websites, platforms, and tools to complement the guitars, amps, and audio gear that Fender makes.
“We are generating tons of user and usage data from our digital applications and devices. We are planning to build a data lake on AWS to operate alongside our Amazon Redshift–based data warehouse. I can’t wait for my team to get our hands on AWS Lake Formation. Lake Formation will make it easy for us to load, transform, and catalog our data and make it securely available within our organization, across a wide portfolio of AWS services. With an enterprise-ready option like Lake Formation, we will be able to spend more time deriving value from our data rather than doing the heavy lifting involved in manually setting up and managing our data lake.”
Joshua Couch, VP Engineering – Fender Digital
Supercharged by migration and management software platform Cloudamize, Cloudreach brings simplicity and absolute confidence to data-driven decision making.
“AWS Lake Formation is democratizing the data lake and creating a point of acceleration for enterprise data strategy. AWS Lake Formation centralizes security and governance of services, streamlining management and reducing operational overhead. By accelerating the process of integrating data across the enterprise, other data initiatives, such as machine learning, start to drive greater business value.”
Kevin Davis, CTO AWS Practice – Cloudreach
Amgen is the world's largest independent biotechnology company.
“At Amgen we've been heavy users of Amazon Redshift and Amazon EMR clusters for over three years. Setting up security and access controls for each AWS account, service, user, and dataset at the level of detail that was required could be cumbersome. AWS Lake Formation streamlines the process with a central point of control while also enabling us to manage who is using our data, and how, with more detail. AWS Lake Formation allows us to manage permissions on Amazon S3 objects like we would manage permissions on data in a database. Our users will be able to find, access, and analyze the data they need with the tools they prefer. This new workflow can make everyone more productive when using Amgen’s data.”
Kerby Johnson, Enterprise Data Lake Product Owner – Amgen
Alcon is a leader in innovation and development of life-changing vision and eye care products.
“Like a lot of companies, we started our data lake initiative to get away from having inaccessible silos of data. With AWS Lake Formation, we can quickly add access to existing Amazon S3 buckets and define what's in them and how it can be used. The data remains in place in S3, but we have full control over it for other uses.”
Srinivas Ravilisetty, IT Analytics Lead – Alcon
Quantiphi is an artificial intelligence and big data software and services company driven by the desire to solve complex business problems. Quantiphi specializes in building data lakes and AI solutions for customers to deliver quantifiable value.
“AWS Lake Formation allows us to deliver a secure data lake with access to relevant data in days. We now have the ability to deliver the best of both worlds for our customers—full security, plus simplified access to relevant data for their users to make decisions easily. Our customers can focus on making smarter, analysis-driven business decisions by tapping into a powerful, centralized data source.”
Arnav Gupta, AWS Practice Lead – Quantiphi
Curvo is a software-as-a-service company focused exclusively on the healthcare supply chain. With deep domain expertise and agile development practices, they build the analytics, the workflow, and the automation to make spend management in healthcare faster and easier.
“Data normalization is a critical step in providing better patient outcomes by bringing transparency into benchmark pricing data for clinical and medical products. Using ML Transformations in AWS Lake Formation, we now process datasets in four hours, down from one week, and our degree of accuracy improved to near 100%. This speed and accuracy allows our healthcare customers to quickly respond to market changes, ultimately delivering more affordable care without sacrificing patient outcomes. We deliver to them in one day what takes our competitors 4–6 weeks.”
Nic Sagez, CTO – Curvo