What is a Data Lakehouse?
A data lakehouse is a data management system that combines cost-efficient, flexible storage at scale with analytics capabilities such as structuring, governance, and reporting. It lets you store raw data in a range of formats from thousands or even hundreds of thousands of sources cost-effectively in a central location. Analytics tools can then use that data to train AI models and generate reports and dashboards, and the lakehouse provides the capabilities to process the raw data for further analytics.
What is the difference between a data lake, a data warehouse, and a data lakehouse?
A data lakehouse architecture emerged by combining the strengths of two traditional centralized data stores: the data warehouse and the data lake.
Data warehouse
A data warehouse is a data storage system that stores structured data based on standard data schemas. Schemas are predefined blueprints that define the format, relationships, and structure of information in a relational database.
Organizations use data warehouse systems for quick access to data processing, business intelligence analytics, and enterprise reporting. Data warehousing provides access to advanced analytics tools, robust data governance, and ease of use for non-technical users. For example, you can retrieve marketing performance reports using a dashboard in the data warehouse.
However, data warehousing introduces additional steps in the data lifecycle. To gain analytics-ready insights, data undergoes several extract, transform, load (ETL) pipelines before being stored in a data warehouse. Moreover, a data warehouse cannot handle unstructured and semi-structured data, which artificial intelligence and machine learning workloads need. In a data warehouse setup, storage and compute power are tightly coupled, which increases the costs of scaling the infrastructure.
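To make the ETL step concrete, here is a minimal sketch in Python using pandas; the file paths, column names, and derived field are hypothetical, and a production pipeline would typically run in a managed ETL service rather than a script.

```python
import pandas as pd

# Extract: read raw event data exported from a source system (hypothetical file).
raw = pd.read_csv("raw_orders.csv")

# Transform: enforce the warehouse schema - drop rows missing keys, type the
# date column, and derive the order total the reporting team expects.
orders = (
    raw.dropna(subset=["order_id", "customer_id"])
       .assign(
           order_date=lambda d: pd.to_datetime(d["order_date"]),
           total=lambda d: d["quantity"] * d["unit_price"],
       )
)

# Load: write the conformed table to a staging location the warehouse copies from.
orders.to_csv("staging/orders.csv", index=False)
```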
Data lake
A data lake is a storage system that retains data in its original format. Data scientists use a data lake to store structured, unstructured, and semi-structured data. Storing data in a data lake is fast because information doesn’t go through an ETL pipeline; raw data is stored as it is. As a result, a data lake can ingest massive volumes of information quickly, including real-time data streams.
Because of the volume of data, cloud data lakes are ideal for data exploration, machine learning, and other data science applications. A data lake is also more affordable to scale because of its low-cost storage hosting.
Unlike a data warehouse, accessing data stored in a data lake requires technical expertise, which limits data access to a smaller group of users. This means that only users proficient in data science can extract, manipulate, and analyze the raw data for business insights. Additionally, an unmanaged data lake can turn into a data swamp: a disorganized mass of data that makes it harder to extract meaningful insights.
Data lakehouse
A data lakehouse is a unified data architecture that combines the advantages of a data warehouse and a data lake. It provides high-performance, affordable, and governance-friendly storage space for various data types.
Unlike a data warehouse, a data lakehouse can store semi-structured and unstructured data for machine learning purposes. Additionally, the data lakehouse architecture consists of SQL analytics tools that business managers use for reporting and extracting actionable insights.
What are the key features of a data lakehouse?
Data lakehouses provide data management features for organizations to build scalable, complex, and low-latency data processing hubs. We share some key features of a data lakehouse below.
Supports diverse data types and workloads
Data lakehouses can store diverse data types, including text, images, videos, and audio files, without additional transformation steps or a rigid schema. This enables fast data ingestion, ensuring data freshness for connected applications.
To support data diversity, a data lakehouse stores the raw data in object-based storage. Object-based storage is a type of data storage architecture optimized for handling high volumes of unstructured data.
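As a rough sketch of what that looks like in practice, the following Python snippet uses boto3 to land files of different types in the same Amazon S3 bucket; the bucket name and file paths are placeholders.

```python
import boto3

# Object storage accepts any payload as-is, so structured, semi-structured,
# and unstructured files can land in the same bucket without a predefined schema.
s3 = boto3.client("s3")
bucket = "example-lakehouse-raw"  # placeholder bucket name

for local_path, key in [
    ("exports/customers.csv", "raw/structured/customers.csv"),           # structured
    ("logs/clickstream.json", "raw/semi-structured/clickstream.json"),   # semi-structured
    ("media/support-call.mp3", "raw/unstructured/support-call.mp3"),     # unstructured
]:
    s3.upload_file(local_path, bucket, key)
```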
Transaction support
A data lakehouse provides data management features that support ACID-compliant transactions, similar to those found in conventional databases. ACID stands for atomicity, consistency, isolation, and durability.
- Atomicity treats each transaction as a single unit, which means either all of its operations succeed or none of them take effect.
- Consistency means the database moves from one valid state to another. Every update follows predefined rules and constraints, so the data never ends up in an invalid state.
- Isolation allows multiple transactions to happen without interfering with each other. Even if several users update the database simultaneously, each operation behaves as if it ran on its own, as though one transaction completed before the next began.
- Durability is the database’s capability to retain committed changes even if the system fails.
Together, ACID ensures data integrity, allowing software teams to build applications that rely on reliable transactional data storage.
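As a generic illustration of atomicity, the sketch below uses Python's built-in sqlite3 module (not a lakehouse-specific API) with a hypothetical accounts table: two related updates are committed together or not at all.

```python
import sqlite3

conn = sqlite3.connect("ledger.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

try:
    # The context manager opens a transaction: it commits if the block
    # succeeds and rolls back if any statement raises an error.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # neither update is persisted if either one failed
conn.close()
```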
Streaming ingestion
Data streams are a continuous flow of information originating from data sources such as Internet of Things (IoT) devices, financial transactions, and application services.
Some applications require data streaming to reflect and visualize data changes in near-real-time. The data lakehouse architecture can ingest data streams and make them available for user-facing applications. Additionally, data scientists can build analytics tools on top of data streams and visualize them with charts, tables, and graphs.
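As a hedged sketch of what a stream producer might look like with Amazon Kinesis Data Streams via boto3 (the stream name and payload fields are placeholders):

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis")
stream_name = "example-iot-telemetry"  # placeholder stream name

# Each record is appended to the stream and becomes available to downstream
# consumers, including a lakehouse ingestion layer, within moments.
for reading in [{"device_id": "sensor-1", "temp_c": 21.4},
                {"device_id": "sensor-2", "temp_c": 19.8}]:
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps({**reading, "ts": time.time()}).encode("utf-8"),
        PartitionKey=reading["device_id"],
    )
```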
Zero ETL integration
Zero ETL is a data process that bypasses complex data transformation pipelines when moving data. A data lakehouse infrastructure enables zero ETL integration.
Conventionally, organizations build their workloads on a data warehouse and a data lake. These data setups require additional ETL pipelines to query and transform data. With zero ETL integration, data scientists can query different data silos without building additional data pipelines.
When a data lakehouse ingests data, it automatically transforms it into formats that align with business analytics requirements. For example, Amazon Redshift supports zero ETL integration with Amazon Aurora. Redshift is a data warehouse, while Aurora is a relational database management system. When integrated, the data that Aurora ingests is replicated automatically to Redshift within seconds. This way, organizations can reduce time to insights while maintaining a simple, cost-effective data infrastructure.
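The sketch below assumes such an integration is already in place and queries the replicated data with the Amazon Redshift Data API via boto3; the workgroup, database, and table names are placeholders.

```python
import boto3

client = boto3.client("redshift-data")

# Query data that the zero-ETL integration replicated from Aurora into Redshift,
# without building or operating a separate pipeline.
response = client.execute_statement(
    WorkgroupName="example-workgroup",   # placeholder Redshift Serverless workgroup
    Database="example_zeroetl_db",       # placeholder database created by the integration
    Sql="SELECT order_status, COUNT(*) AS orders FROM orders GROUP BY order_status;",
)
print("Statement submitted:", response["Id"])
```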
Unified analytics
A data lakehouse provides a unified data platform to access all stored data. It helps data architects overcome data duplication, inconsistency, and fragmentation across multiple systems.
Another key benefit of centralized analytics is avoiding unnecessary data movement between cloud storage systems. Instead of querying siloed data, data teams store, analyze, and share data from a single interface that connects to the data lakehouse. For example, you can retrieve unstructured data for a machine learning workload and generate marketing performance reports from a single copy of data.
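For instance, a query engine such as Amazon Athena can read that single copy where it already lives; in the boto3 sketch below, the database, table, and result bucket are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Query the single copy of data in place instead of exporting it into a
# separate analytics system first.
athena.start_query_execution(
    QueryString="SELECT channel, SUM(spend) AS total_spend FROM marketing_events GROUP BY channel;",
    QueryExecutionContext={"Database": "example_lakehouse_db"},              # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # placeholder bucket
)
```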
Query editor
Data analysts, machine learning engineers, and data users can easily access data in a data lakehouse by using a SQL query editor. They can author SQL commands to analyze and visualize data, browse historical data, create database schemas, and more. A query editor also improves collaboration by enabling data engineers to easily share the queries they create.
ML/AI support
Data lakehouses are engineered for building, testing, and scaling artificial intelligence and machine learning (AI/ML) workloads. In addition to providing direct access to unstructured data, many data lakehouse providers offer machine learning libraries, tools, and analytics that simplify AI development.
For example, Amazon SageMaker Lakehouse integrates seamlessly with Amazon SageMaker Unified Studio, providing access to tools and analytics to accelerate AI/ML workflows.
How does a data lakehouse work?
A data lakehouse combines the advanced analytics capabilities of data warehouses with the flexibility of data lakes, providing a scalable, affordable, and powerful data platform. Instead of maintaining separate data lakes and data warehouse infrastructures, organizations choose a data lakehouse to obtain business insights more rapidly.
The data lakehouse ingests data from various resources, organizes it internally, and serves the data to various data users in different formats. Moreover, a data lakehouse’s compute is separate from storage. With separate storage and compute, you can scale these functions independently to maximize cost savings.
Below, we share the data layers that form a data lakehouse.
Ingestion layer
The ingestion layer connects the data lakehouse to various types of data sources, including application logs, databases, and social media feeds. At this layer, data is preserved in the original format.
Storage layer
The storage layer receives incoming raw data and stores it in low-cost, scalable storage. In a data lakehouse setup, this layer typically connects to cloud object storage, which supports diverse types of data, including structured, semi-structured, and unstructured data.
Depending on the use case, some data is transformed after it lands in object storage. For example, if you want to train a machine learning model on the ingested data, the data lakehouse transforms and stores the data in Parquet format. Parquet is an open file format designed to store and process structured data efficiently by organizing it into columns.
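As a small sketch of that conversion using pandas (which relies on a Parquet engine such as pyarrow), with hypothetical file paths and column name:

```python
import pandas as pd

# Convert an ingested JSON export into columnar Parquet so downstream training
# and analytics jobs can read only the columns they need.
events = pd.read_json("raw/clickstream.json", lines=True)   # placeholder raw file
events.to_parquet("curated/clickstream.parquet", index=False)

# Reading back a single column touches only that column's data, not whole rows.
pages = pd.read_parquet("curated/clickstream.parquet", columns=["page_url"])
```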
Staging layer
The staging layer, or metadata layer, provides schema support to govern, organize, and optimize data stored in the data lakehouse. This layer allows you to define policies to ensure data quality and create auditable trails for compliance purposes. Additionally, data teams can create reliable data workflows using ACID transactions, file indexing, data versioning, and caching, similar to those found in a traditional data warehouse.
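As a simplified, generic illustration of a data-quality policy applied before data is committed (not a specific lakehouse API; the table schema is hypothetical):

```python
import pandas as pd

# Hypothetical policy: the curated "orders" table must contain these typed,
# non-null columns before a write is allowed to proceed.
EXPECTED_COLUMNS = {"order_id": "int64", "customer_id": "int64", "total": "float64"}

def enforce_schema(batch: pd.DataFrame) -> pd.DataFrame:
    missing = set(EXPECTED_COLUMNS) - set(batch.columns)
    if missing:
        raise ValueError(f"Rejected batch: missing columns {missing}")
    if batch[list(EXPECTED_COLUMNS)].isnull().any().any():
        raise ValueError("Rejected batch: null values in required columns")
    return batch.astype(EXPECTED_COLUMNS)

incoming = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11], "total": [19.9, 5.0]})
validated = enforce_schema(incoming)  # raises if the batch violates the policy
```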
API layer
The application programming interface (API) layer allows software developers and applications to query data stored in the data lakehouse. It provides granular access to the data, so more advanced analytics can be built programmatically on top of it. For example, software teams can make API calls to retrieve data streams in real time to power the dashboard of an investment application.
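A rough sketch of such a call, with a hypothetical endpoint, parameters, and response shape:

```python
import requests

# The endpoint, query parameters, and response fields below are illustrative
# placeholders for whatever API the lakehouse exposes.
response = requests.get(
    "https://lakehouse.example.com/api/v1/streams/stock-ticks",
    params={"symbol": "EXMPL", "limit": 100},
    headers={"Authorization": "Bearer <token>"},
    timeout=10,
)
response.raise_for_status()

for tick in response.json()["ticks"]:
    print(tick["timestamp"], tick["price"])
```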
Semantic layer
The semantic layer is the topmost layer of the data lakehouse. Also known as the data consumption layer, it consists of data analytics tools and apps that provide access to stored data and schema. Business users can generate reports, create charts, query for insights, and conduct other data analysis with the tools they find at this layer.
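As a minimal sketch of consuming query results at this layer with pandas and matplotlib (the figures are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical result of a consumption-layer query: revenue by region.
report = pd.DataFrame({
    "region": ["NA", "EMEA", "APAC"],
    "revenue": [1.2e6, 0.9e6, 1.5e6],
})

report.plot(kind="bar", x="region", y="revenue", legend=False, title="Revenue by region")
plt.ylabel("Revenue (USD)")
plt.tight_layout()
plt.savefig("revenue_by_region.png")
```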
How can AWS support your data lakehouse requirements?
Amazon SageMaker Lakehouse is a data lakehouse that organizations use to process exabytes of data for business insights and power AI workloads. Amazon SageMaker Lakehouse is deeply integrated with AWS data storage, analytics, and machine learning services to help you:
- Access data in place for near-real-time analytics
- Build artificial intelligence and machine learning models on a single data hub
- Securely access, combine, and share data with minimal movement or copying
With an architecture that separates compute and storage for efficient scaling, Amazon SageMaker Lakehouse delivers better price performance than other cloud data lakehouses.
Amazon SageMaker Lakehouse integrates with AWS data warehouses and data lakes:
- Amazon Redshift is a data warehouse solution that delivers unmatched price-performance at scale with SQL for your data lakehouse
- Amazon S3 is a data lake object storage built to retrieve any amount of data from anywhere
Get started with a data lakehouse on AWS by creating a free account today.