What is a Data Lakehouse?
A data lakehouse is a data management system that combines cost-efficient, flexible storage at scale with analytics capabilities such as structuring, governance, and reporting. It lets you store raw data in a range of formats from thousands or even hundreds of thousands of sources cost-effectively in a central location. Analytics tools can then use that data to train AI models and generate reports and dashboards, and the lakehouse provides the capabilities to process the raw data for further analytics.
What is the difference between a data lake, a data warehouse, and a data lakehouse?
A data lakehouse architecture emerged by combining the strengths of two traditional centralized data stores: the data warehouse and the data lake.
Data warehouse
A data warehouse is a data storage system that stores structured data based on standard data schemas. Schemas are predefined blueprints that define the format, relationships, and structure of information in a relational database.
Organizations use data warehouse systems for quick access to data processing, business intelligence analytics, and enterprise reporting. Data warehousing provides access to advanced analytics tools, robust data governance, and ease of use for non-technical users. For example, you can retrieve marketing performance reports using a dashboard in the data warehouse.
However, data warehousing introduces additional steps in the data lifecycle. To gain analytics-ready insights, data undergoes several extract, transform, load (ETL) pipelines before being stored in a data warehouse. Moreover, a data warehouse cannot handle unstructured and semi-structured data, which artificial intelligence and machine learning workloads need. In a data warehouse setup, storage and compute power are tightly coupled, which increases the costs of scaling the infrastructure.
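To make the ETL step concrete, here is a minimal sketch in Python using pandas; the file paths, column names, and derived field are hypothetical, and a production pipeline would typically run in a managed ETL service rather than a script.

```python
import pandas as pd

# Extract: read raw event data exported from a source system (hypothetical file).
raw = pd.read_csv("raw_orders.csv")

# Transform: enforce the warehouse schema - drop rows missing keys, type the
# date column, and derive the order total the reporting team expects.
orders = (
    raw.dropna(subset=["order_id", "customer_id"])
       .assign(
           order_date=lambda d: pd.to_datetime(d["order_date"]),
           total=lambda d: d["quantity"] * d["unit_price"],
       )
)

# Load: write the conformed table to a staging location the warehouse copies from.
orders.to_csv("staging/orders.csv", index=False)
```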
Data lake
A data lake is a storage system that retains data in its original format. Data scientists use a data lake to store structured, unstructured, and semi-structured data. Storing data in a data lake is fast because information doesn’t go through an ETL pipeline; raw data is stored as it is. As a result, a data lake can ingest massive volumes of information quickly, including real-time data streams.
Because of the volume of data, cloud data lakes are ideal for data exploration, machine learning, and other data science applications. A data lake is also more affordable to scale because of its low-cost storage hosting.
Unlike a data warehouse, accessing data stored in a data lake requires technical expertise, which limits data access to a smaller group of users. This means that only users proficient in data science can extract, manipulate, and analyze the raw data for business insights. Additionally, an unmanaged data lake can turn into a data swamp: a disorganized mass of data that makes it harder to extract meaningful insights.
Data lakehouse
A data lakehouse is a unified data architecture that combines the advantages of a data warehouse and a data lake. It provides high-performance, affordable, and governance-friendly storage space for various data types.
Unlike a data warehouse, a data lakehouse can store semi-structured and unstructured data for machine learning purposes. Additionally, the data lakehouse architecture consists of SQL analytics tools that business managers use for reporting and extracting actionable insights.
What are the key features of a data lakehouse?
Data lakehouses provide data management features for organizations to build scalable, complex, and low-latency data processing hubs. We share some key features of a data lakehouse below.
Supports diverse data types and workloads
Data lakehouses can store diverse data types, including text, images, videos, and audio files, without additional transformation steps or a rigid schema. This enables fast data ingestion, ensuring data freshness for connected applications.
To support data diversity, a data lakehouse stores the raw data in object-based storage. Object-based storage is a type of data storage architecture optimized for handling high volumes of unstructured data.
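As a rough sketch of what that looks like in practice, the following Python snippet uses boto3 to land files of different types in the same Amazon S3 bucket; the bucket name and file paths are placeholders.

```python
import boto3

# Object storage accepts any payload as-is, so structured, semi-structured,
# and unstructured files can land in the same bucket without a predefined schema.
s3 = boto3.client("s3")
bucket = "example-lakehouse-raw"  # placeholder bucket name

for local_path, key in [
    ("exports/customers.csv", "raw/structured/customers.csv"),           # structured
    ("logs/clickstream.json", "raw/semi-structured/clickstream.json"),   # semi-structured
    ("media/support-call.mp3", "raw/unstructured/support-call.mp3"),     # unstructured
]:
    s3.upload_file(local_path, bucket, key)
```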
Transaction support
A data lakehouse provides data management features that support ACID-compliant transactions, similar to those found in conventional databases. ACID stands for atomicity, consistency, isolation, and durability.
- Atomicity treats each transaction as a single unit, which means either all of its operations succeed or none of them take effect.
- Consistency means the database moves from one valid state to another. Every update follows predefined rules and constraints, so the data never ends up in an invalid state.
- Isolation allows multiple transactions to happen without interfering with each other. Even if several users update the database simultaneously, each operation behaves as if it ran on its own, as though one transaction completed before the next began.
- Durability is the database’s capability to retain committed changes even if the system fails.
Together, ACID ensures data integrity, allowing software teams to build applications that rely on reliable transactional data storage.
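As a generic illustration of atomicity, the sketch below uses Python's built-in sqlite3 module (not a lakehouse-specific API) with a hypothetical accounts table: two related updates are committed together or not at all.

```python
import sqlite3

conn = sqlite3.connect("ledger.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

try:
    # The context manager opens a transaction: it commits if the block
    # succeeds and rolls back if any statement raises an error.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # neither update is persisted if either one failed
conn.close()
```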
Streaming ingestion
Data streams are a continuous flow of information originating from data sources such as Internet of Things (IoT) devices, financial transactions, and application services.
Some applications require data streaming to reflect and visualize data changes in near-real-time. The data lakehouse architecture can ingest data streams and make them available for user-facing applications. Additionally, data scientists can build analytics tools on top of data streams and visualize them with charts, tables, and graphs.
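As a hedged sketch of what a stream producer might look like with Amazon Kinesis Data Streams via boto3 (the stream name and payload fields are placeholders):

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis")
stream_name = "example-iot-telemetry"  # placeholder stream name

# Each record is appended to the stream and becomes available to downstream
# consumers, including a lakehouse ingestion layer, within moments.
for reading in [{"device_id": "sensor-1", "temp_c": 21.4},
                {"device_id": "sensor-2", "temp_c": 19.8}]:
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps({**reading, "ts": time.time()}).encode("utf-8"),
        PartitionKey=reading["device_id"],
    )
```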
Zero ETL integration
Zero ETL is a data process that bypasses complex data transformation pipelines when moving data. A data lakehouse infrastructure enables zero ETL integration.
Conventionally, organizations build their workloads on a data warehouse and a data lake. These data setups require additional ETL pipelines to query and transform data. With zero ETL integration, data scientists can query different data silos without building additional data pipelines.
When a data lakehouse ingests data, it automatically transforms it into formats that align with business analytics requirements. For example, Amazon Redshift supports zero ETL integration with Amazon Aurora. Redshift is a data warehouse, while Aurora is a relational database management system. When integrated, the data that Aurora ingests is replicated automatically to Redshift within seconds. This way, organizations can reduce time to insights while maintaining a simple, cost-effective data infrastructure.
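The sketch below assumes such an integration is already in place and queries the replicated data with the Amazon Redshift Data API via boto3; the workgroup, database, and table names are placeholders.

```python
import boto3

client = boto3.client("redshift-data")

# Query data that the zero-ETL integration replicated from Aurora into Redshift,
# without building or operating a separate pipeline.
response = client.execute_statement(
    WorkgroupName="example-workgroup",   # placeholder Redshift Serverless workgroup
    Database="example_zeroetl_db",       # placeholder database created by the integration
    Sql="SELECT order_status, COUNT(*) AS orders FROM orders GROUP BY order_status;",
)
print("Statement submitted:", response["Id"])
```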
Unified analytics
A data lakehouse provides a unified data platform to access all stored data. It helps data architects overcome data duplication, inconsistency, and fragmentation across multiple systems.
Another key benefit of centralized analytics is avoiding unnecessary data movement between cloud storage systems. Instead of querying siloed data, data teams store, analyze, and share data from a single interface that connects to the data lakehouse. For example, you can retrieve unstructured data for a machine learning workload and generate marketing performance reports from a single copy of data.
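For instance, a query engine such as Amazon Athena can read that single copy where it already lives; in the boto3 sketch below, the database, table, and result bucket are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Query the single copy of data in place instead of exporting it into a
# separate analytics system first.
athena.start_query_execution(
    QueryString="SELECT channel, SUM(spend) AS total_spend FROM marketing_events GROUP BY channel;",
    QueryExecutionContext={"Database": "example_lakehouse_db"},              # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # placeholder bucket
)
```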
Query editor
Data analysts, machine learning engineers, and data users can easily access data in a data lakehouse by using a SQL query editor. They can author SQL commands to analyze and visualize data, browse historical data, create database schemas, and more. A query editor also improves collaboration by enabling data engineers to easily share the queries they create.
ML/AI support
Data lakehouses are engineered for building, testing, and scaling artificial intelligence and machine learning (AI/ML) workloads. In addition to providing direct access to unstructured data, many data lakehouse providers offer machine learning libraries, tools, and analytics that simplify AI development.
For example, Amazon SageMaker Lakehouse integrates seamlessly with Amazon SageMaker Unified Studio, providing access to tools and analytics to accelerate AI/ML workflows.
How does a data lakehouse work?
A data lakehouse combines the advanced analytics capabilities of data warehouses with the flexibility of data lakes, providing a scalable, affordable, and powerful data platform. Instead of maintaining separate data lakes and data warehouse infrastructures, organizations choose a data lakehouse to obtain business insights more rapidly.
The data lakehouse ingests data from various resources, organizes it internally, and serves the data to various data users in different formats. Moreover, a data lakehouse’s compute is separate from storage. With separate storage and compute, you can scale these functions independently to maximize cost savings.
Below, we share the data layers that form a data lakehouse.
Ingestion layer
The ingestion layer connects the data lakehouse to various types of data sources, including application logs, databases, and social media feeds. At this layer, data is preserved in the original format.
Storage layer
The storage layer receives incoming raw data and stores it in low-cost, scalable storage. In a data lakehouse setup, this layer typically connects to cloud object storage, which supports diverse types of data, including structured, semi-structured, and unstructured data.
Depending on the use case, some data is transformed after it lands in object storage. For example, if you want to train a machine learning model on the ingested data, the data lakehouse transforms and stores the data in Parquet format. Parquet is an open file format designed to store and process structured data efficiently by organizing it into columns.
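As a small sketch of that conversion using pandas (which relies on a Parquet engine such as pyarrow), with hypothetical file paths and column name:

```python
import pandas as pd

# Convert an ingested JSON export into columnar Parquet so downstream training
# and analytics jobs can read only the columns they need.
events = pd.read_json("raw/clickstream.json", lines=True)   # placeholder raw file
events.to_parquet("curated/clickstream.parquet", index=False)

# Reading back a single column touches only that column's data, not whole rows.
pages = pd.read_parquet("curated/clickstream.parquet", columns=["page_url"])
```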
Staging layer
The staging layer, or metadata layer, provides schema support to govern, organize, and optimize data stored in the data lakehouse. This layer allows you to define policies to ensure data quality and create auditable trails for compliance purposes. Additionally, data teams can create reliable data workflows using ACID transactions, file indexing, data versioning, and caching, similar to those found in a traditional data warehouse.
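As a simplified, generic illustration of a data-quality policy applied before data is committed (not a specific lakehouse API; the table schema is hypothetical):

```python
import pandas as pd

# Hypothetical policy: the curated "orders" table must contain these typed,
# non-null columns before a write is allowed to proceed.
EXPECTED_COLUMNS = {"order_id": "int64", "customer_id": "int64", "total": "float64"}

def enforce_schema(batch: pd.DataFrame) -> pd.DataFrame:
    missing = set(EXPECTED_COLUMNS) - set(batch.columns)
    if missing:
        raise ValueError(f"Rejected batch: missing columns {missing}")
    if batch[list(EXPECTED_COLUMNS)].isnull().any().any():
        raise ValueError("Rejected batch: null values in required columns")
    return batch.astype(EXPECTED_COLUMNS)

incoming = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11], "total": [19.9, 5.0]})
validated = enforce_schema(incoming)  # raises if the batch violates the policy
```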
API layer
The application programming interface (API) layer allows software developers and applications to query data stored in the data lakehouse. It provides granular access to the data, so more advanced analytics can be built programmatically on top of it. For example, software teams can make API calls to retrieve data streams in real time to power the dashboard of an investment application.
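A rough sketch of such a call, with a hypothetical endpoint, parameters, and response shape:

```python
import requests

# The endpoint, query parameters, and response fields below are illustrative
# placeholders for whatever API the lakehouse exposes.
response = requests.get(
    "https://lakehouse.example.com/api/v1/streams/stock-ticks",
    params={"symbol": "EXMPL", "limit": 100},
    headers={"Authorization": "Bearer <token>"},
    timeout=10,
)
response.raise_for_status()

for tick in response.json()["ticks"]:
    print(tick["timestamp"], tick["price"])
```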
Semantic layer
The semantic layer is the topmost layer of the data lakehouse. Also known as the data consumption layer, it consists of data analytics tools and apps that provide access to stored data and schema. Business users can generate reports, create charts, query for insights, and conduct other data analysis with the tools they find at this layer.
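As a minimal sketch of consuming query results at this layer with pandas and matplotlib (the figures are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical result of a consumption-layer query: revenue by region.
report = pd.DataFrame({
    "region": ["NA", "EMEA", "APAC"],
    "revenue": [1.2e6, 0.9e6, 1.5e6],
})

report.plot(kind="bar", x="region", y="revenue", legend=False, title="Revenue by region")
plt.ylabel("Revenue (USD)")
plt.tight_layout()
plt.savefig("revenue_by_region.png")
```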
How can AWS support your data lakehouse requirements?
Amazon SageMaker Lakehouse is a data lakehouse that organizations use to process exabytes of data for business insights and power AI workloads. Amazon SageMaker Lakehouse is deeply integrated with AWS data storage, analytics, and machine learning services to help you:
- Access data in place for near-real-time analytics
- Build artificial intelligence and machine learning models on a single data hub
- Securely access, combine, and share data with minimal movement or copying
With an architecture that separates compute and storage for efficient scaling, Amazon SageMaker Lakehouse delivers better price performance than other cloud data lakehouses.
Amazon SageMaker Lakehouse integrates with AWS data warehouses and data lakes:
- Amazon Redshift is a data warehouse solution that delivers unmatched price-performance at scale with SQL for your data lakehouse
- Amazon S3 is a data lake object storage built to retrieve any amount of data from anywhere
Get started with a data lakehouse on AWS by creating a free account today.