AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. A data lake enables you to break down data silos and combine different types of analytics to gain insights and guide better business decisions.
However, setting up and managing data lakes today involves a lot of manual, complicated, and time-consuming tasks. This work includes loading data from diverse sources, monitoring those data flows, setting up partitions, turning on encryption and managing keys, defining transformation jobs and monitoring their operation, re-organizing data into a columnar format, configuring access control settings, deduplicating redundant data, matching linked records, granting access to data sets, and auditing access over time.
Creating a data lake with Lake Formation is as simple as defining where your data resides and what data access and security policies you want to apply. Lake Formation then collects and catalogs data from databases and object storage, moves the data into your new Amazon S3 data lake, cleans and classifies data using machine learning algorithms, and secures access to your sensitive data. Your users can then access a centralized catalog of data that describes the available data sets and their appropriate usage. They can then use these data sets with their choice of analytics and machine learning services, such as Amazon EMR for Apache Spark, Amazon Redshift, Amazon Athena, Amazon SageMaker, and Amazon QuickSight.
Build data lakes quickly
With Lake Formation, you can move, store, catalog, and clean your data faster. You simply point Lake Formation at your data sources, and Lake Formation crawls those sources and moves the data into your new Amazon S3 data lake. Lake Formation partitions data in S3 by frequently used query terms and splits it into right-sized chunks to increase efficiency. Lake Formation also converts data into columnar formats such as Apache Parquet and ORC for faster analytics. In addition, Lake Formation has built-in machine learning to deduplicate data and find matching records (two entries that refer to the same thing) to increase data quality.
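To make the idea of organizing data around frequently used query terms concrete, here is a minimal sketch of a date-partitioned object key. The text above does not specify the layout Lake Formation produces; the Hive-style `key=value` folder convention, the date-based partition columns, and the `partitioned_key` helper are all assumptions chosen for illustration. Query engines such as Athena and Spark can skip whole folders when a query filters on a partition column.

```python
from datetime import date


def partitioned_key(table: str, event_date: date, filename: str) -> str:
    """Build a Hive-style partitioned object key (hypothetical layout).

    Each partition column becomes a `name=value` path segment, so a query
    that filters on year/month/day only has to read the matching folders.
    """
    return (
        f"{table}/"
        f"year={event_date.year:04d}/"
        f"month={event_date.month:02d}/"
        f"day={event_date.day:02d}/"
        f"{filename}"
    )


key = partitioned_key("orders", date(2019, 1, 15), "part-0000.parquet")
print(key)  # orders/year=2019/month=01/day=15/part-0000.parquet
```

Partitioning by date is only one option; the right partition columns are whichever fields your queries filter on most often.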
Simplify security management
You can use Lake Formation to centrally define security, governance, and auditing policies in one place, rather than per service, and then enforce those policies for your users across their analytics applications. Your policies are consistently implemented, eliminating the need to manually configure them across security services (AWS Identity and Access Management and AWS Key Management Service), storage services (S3), and analytics and machine learning services (Redshift, Athena, and EMR for Apache Spark). This reduces the effort of configuring policies across services and provides consistent enforcement and compliance.
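As an illustration of defining a policy once, the sketch below builds the request for a single table-level grant. It assumes the request shape of the `grant_permissions` operation on the boto3 `lakeformation` client; the account ID, role name, database, and table are hypothetical, and the helper names `build_table_grant` and `apply_grant` are invented for this example. The request is built as plain data so it can be reviewed or logged before anything is sent.

```python
from typing import Any, Dict, List


def build_table_grant(
    principal_arn: str, database: str, table: str, permissions: List[str]
) -> Dict[str, Any]:
    """Return keyword arguments for a table-level grant (assumed request shape)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


def apply_grant(client: Any, request: Dict[str, Any]) -> Any:
    """Send the grant through a client exposing grant_permissions(**kwargs).

    In a real setup `client` would be boto3.client("lakeformation");
    it is passed in so the sketch runs without AWS credentials.
    """
    return client.grant_permissions(**request)


# Hypothetical principal and table names, for illustration only.
request = build_table_grant(
    "arn:aws:iam::111122223333:role/AnalystRole", "sales_db", "orders", ["SELECT"]
)
print(request["Permissions"])  # ['SELECT']
```

Because the grant is attached to the catalog table rather than to an individual service, the same definition is what the integrated analytics services would enforce.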
Make self-service access to data both easy and secure
With Lake Formation you build a data catalog that describes the different data sets that are available along with which groups of users have access to each. This makes your users more productive by helping them find the right data set to analyze. By providing a catalog of your data with consistent security enforcement, Lake Formation makes it easier for your analysts and data scientists to use their preferred analytics service.
They can use EMR for Apache Spark, Redshift, Athena, SageMaker, or QuickSight on diverse data sets now housed in a single data lake. Users can also combine these services without having to move data between silos.
How it works
Lake Formation helps you build, secure, and manage your data lake. First, identify existing data stores in S3 or in relational and NoSQL databases, and move the data into your data lake. Next, crawl, catalog, and prepare the data for analytics. Finally, give your users secure self-service access to the data through their choice of analytics services. Other AWS services and third-party applications can also access data through the services shown. Lake Formation manages all of the tasks in the orange box and is integrated with the data stores and services shown in the blue boxes.
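The first two stages above can be sketched as an ordered plan of API requests. This is a rough illustration under stated assumptions, not the way Lake Formation itself is implemented: it assumes the data already sits under an S3 prefix, that the location is registered through the boto3 `lakeformation` operation `register_resource`, and that crawling and cataloging go through the AWS Glue operations `create_database`, `create_crawler`, and `start_crawler`. The bucket, prefix, database, and role names are hypothetical, as are the helpers `build_ingest_plan` and `run_plan`.

```python
from typing import Any, Dict, List, Tuple

# (service name, operation name, keyword arguments)
Step = Tuple[str, str, Dict[str, Any]]


def build_ingest_plan(
    bucket: str, prefix: str, database: str, crawler_role_arn: str
) -> List[Step]:
    """Return the ordered requests: register the location, then catalog it."""
    crawler_name = f"{database}-crawler"
    return [
        # 1. Put the S3 location under Lake Formation management.
        ("lakeformation", "register_resource",
         {"ResourceArn": f"arn:aws:s3:::{bucket}/{prefix}",
          "UseServiceLinkedRole": True}),
        # 2. Create a catalog database to hold the discovered tables.
        ("glue", "create_database", {"DatabaseInput": {"Name": database}}),
        # 3. Define a crawler that infers table schemas from the S3 prefix.
        ("glue", "create_crawler",
         {"Name": crawler_name,
          "Role": crawler_role_arn,
          "DatabaseName": database,
          "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{prefix}"}]}}),
        # 4. Run the crawler to populate the catalog.
        ("glue", "start_crawler", {"Name": crawler_name}),
    ]


def run_plan(clients: Dict[str, Any], plan: List[Step]) -> None:
    """Execute each step on the matching client, e.g. boto3.client("glue")."""
    for service, operation, kwargs in plan:
        getattr(clients[service], operation)(**kwargs)


# Hypothetical names, for illustration only.
plan = build_ingest_plan(
    "example-data-lake", "raw/orders", "sales_db",
    "arn:aws:iam::111122223333:role/ExampleCrawlerRole",
)
for service, operation, _ in plan:
    print(f"{service}.{operation}")
```

Keeping the plan as data separates deciding what to do from doing it, which makes the sequence easy to inspect or dry-run before any call reaches AWS. The third stage, granting users access, would follow as separate permission grants.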
Change Healthcare is a leading independent healthcare technology company that provides data and analytics-driven solutions that reach approximately 2,100 government and commercial payer connections, 5,500 hospitals, 900,000 physicians, and 33,000 pharmacies.
“We handle data from millions of transactions daily while maintaining compliance with healthcare industry regulations, including HIPAA,” said Aaron Symanski, CTO of Change Healthcare. “We are very excited about the launch of AWS Lake Formation, which provides a central point of control to easily load, clean, secure, and catalog data from thousands of clients to our AWS-based data lake, dramatically reducing our operational load. The data access controls in Lake Formation will make it easy for us to define our policies once and have them be enforced across all the analytics and machine learning services we use, with audit logs to show compliance. Additionally, Lake Formation will be HIPAA compliant from day one, meeting our security requirements and offering a compelling way for us to build and manage our data lake.”
Fender Digital, a part of Fender, the iconic guitar brand, makes apps, websites, platforms, and tools to complement the guitars, amps, and audio gear that Fender makes.
“We are generating tons of user and usage data from our digital applications and devices. We are planning to build a data lake on AWS to operate alongside our Amazon Redshift-based data warehouse,” said Joshua Couch, VP Engineering at Fender Digital. “I can’t wait for my team to get our hands on AWS Lake Formation. Lake Formation will make it easy for us to load, transform, and catalog our data and make it securely available within our organization, across a wide portfolio of AWS services. With an enterprise-ready option like Lake Formation, we will be able to spend more time deriving value from our data rather than doing the heavy lifting involved in manually setting up and managing our data lake.”