Making Data-Driven Decisions with IBM watsonx.data, an Open Data Lakehouse on AWS

By Eduardo Monich Fronza, Partner Solutions Architect – AWS
By Kevin Shen, Lead Product Manager – IBM
By Richard Porter, Product Manager – IBM


We are in the midst of an artificial intelligence (AI) revolution. Organizations are looking for ways to use data to transform their business and harness the potential of generative AI and foundation models to drive productivity, develop innovative solutions, enhance customer experience, and gain a competitive edge.

To address these customer needs, IBM announced IBM watsonx, an AI and data platform designed to scale and accelerate the impact of AI with trusted data. As a part of watsonx, IBM has launched watsonx.data.

With watsonx.data, customers can scale analytics and AI with a fit-for-purpose data store, built on an open lakehouse architecture. It provides querying, governance, and open data formats for easy data access and sharing. Customers can connect to their data within minutes, gain trusted insights quickly, and reduce data warehouse costs.

IBM watsonx.data is available on Amazon Web Services (AWS) as a fully managed software-as-a-service (SaaS) solution, as well as on Red Hat OpenShift Service on AWS (ROSA). In this post, we’ll explore the transformative capabilities of IBM watsonx.data on AWS.

Unprecedented Data Challenges to Scale AI Workloads

Data utilization has evolved significantly, serving a wide range of purposes. According to a report by Precisely and Corinium Global Intelligence, 82% of Chief Data Officers (CDOs) consider data quality a hurdle in their data integration projects. Moreover, the IDC Global DataSphere Forecast projects a staggering 250% growth in stored data by 2025.

This growing volume of data is generated in various formats, including structured, unstructured, and semi-structured data, with differing levels of quality across diverse data repositories, whether on AWS, on-premises, or elsewhere.

Consequently, customers are looking for efficient ways to implement data governance, optimize their storage and data management costs, overcome limitations, and reduce complexity of data lakes and data warehouses.

To address the challenges posed by this distributed data landscape, the data lakehouse architecture has emerged as a valuable solution. By combining the enterprise-level features and high-performance capabilities of a data warehouse with the openness, flexibility, and scalability of data lakes, it offers customers an effective way to tackle these complexities head-on.

Data lakehouses are inherently open, combining commodity cloud object storage, like Amazon Simple Storage Service (Amazon S3), with open data and table formats and high-performance open-source query engines. Additionally, data lakehouses provide governance capabilities that allow customers to selectively share data with authorized individuals or groups, ensuring controlled access and maintaining data security.

First-generation lakehouses still struggled to address cost and complexity challenges because of constraints such as a single query engine, designs aimed at specific workload types, lack of support for hybrid implementations, and minimal governance capabilities to deploy across your entire ecosystem.

The figure below shows how data ecosystems have evolved over decades to a more holistic approach to the data management lifecycle.

Figure 1 – The emergence of data lakehouse architectures.

This is where IBM watsonx.data can help you. It’s an open, hybrid, governed data store built on a data lakehouse architecture to overcome the limitations of first-generation data lakehouses.

With watsonx.data, you can:

Access data across AWS and hybrid environments: Access all of your data through a single point of entry with a shared metadata layer across AWS and on-premises environments, leveraging open data and open table formats.
Get started in minutes: Connect to storage and analytics environments in minutes and enhance trust in data with built-in governance, security, and automation. Leverage a simple user experience (UX) and “click-and-go” console to ingest, access, and transform data and run workloads (see Figure 2).
Reduce data warehouse cost through optimization: Optimize your data warehouse workloads by taking advantage of low-cost object storage like Amazon S3, and fit-for-purpose query engines that scale automatically. IBM has measured up to 50% cost reduction for compute hours of watsonx.data, relative to other data warehouse vendors (depending on configurations, workloads, and vendors).

Figure 2 – Simple UX and console to ingest, access, and transform data and run workloads.

IBM watsonx.data Combines IBM with Open Source

Figure 3 illustrates the key components of IBM watsonx.data and how it combines IBM and open-source technologies to empower customers.

Figure 3 – Overview of key components of the IBM watsonx.data.

Let’s take a closer look at each of these components:

Infrastructure: Watsonx.data is now available to customers on the AWS cloud and in on-premises environments, giving you the flexibility to run your workloads where they make the best use of available resources.
Storage: The layer that physically stores the data. The most common storage type for data lakes and lakehouses is Amazon S3-compatible object storage. In this layer, data is stored as files in open data file formats such as Parquet, Avro, or Apache ORC. An open table format like Apache Iceberg helps to bring ACID characteristics to data in an open data lakehouse.
Open formats: Watsonx.data is truly open and interoperable. It leverages open-source technologies with open-source project governance and diverse communities of users and contributors, like Apache Iceberg and Presto, which is hosted by the Linux Foundation. Support for open data formats and the Iceberg open table format allows different engines to access and share the same data at the same time.
Governance and metadata: These are required to understand what data is available in the storage layer, and who is allowed to access it. The query engine needs the unstructured data and table metadata to understand where the data is located, what it looks like, and how to read it. The de facto open metadata storage solution is the Hive Metastore.
Fit-for-purpose query engine(s): These are at the heart of the data lakehouse. The engines execute queries against the data and are often referred to as the “compute” component. There are many open-source query engines for lakehouses in the market, such as Presto and Apache Spark. These multiple engines provide a breadth of workload coverage, ranging from data exploration and data transformation to AI model training and tuning and interactive querying.
IBM Db2 Warehouse and Netezza as a Service on AWS have also been enhanced to support the Iceberg open table format to coexist seamlessly as part of the lakehouse. The query engine is fully modular and ephemeral, meaning the engine can be dynamically scaled to meet big data workload demands and concurrency. SQL query engines can attach to any number of catalogs and storage.
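To make the idea of attaching an engine to several catalogs more concrete, here is a minimal Python sketch. It relies on the three-part catalog.schema.table naming that Presto uses to address tables, which lets a single SQL statement span more than one attached catalog. The catalog, schema, table, and column names (iceberg_data, hive_data, sales, crm, orders, customers, customer_id) are hypothetical and not taken from watsonx.data, and the helper only builds a SQL string; it does not connect to any engine.

```python
from dataclasses import dataclass


def quote_ident(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


@dataclass(frozen=True)
class TableRef:
    """Fully qualified table reference: catalog.schema.table."""
    catalog: str
    schema: str
    table: str

    def qualified(self) -> str:
        parts = (self.catalog, self.schema, self.table)
        return ".".join(quote_ident(part) for part in parts)


def build_join_query(left: TableRef, right: TableRef, key: str) -> str:
    """Build a SELECT joining two tables that may live in different catalogs."""
    k = quote_ident(key)
    return (
        f"SELECT l.*, r.* FROM {left.qualified()} AS l "
        f"JOIN {right.qualified()} AS r ON l.{k} = r.{k}"
    )


# Hypothetical names, chosen only for illustration.
orders = TableRef("iceberg_data", "sales", "orders")
customers = TableRef("hive_data", "crm", "customers")
query = build_join_query(orders, customers, "customer_id")
print(query)
```

The resulting string could be submitted through whatever SQL client you already use with your engine; which catalogs are actually available depends on how your own environment is configured.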

What Can You Expect from watsonx.data?

You can provision watsonx.data as a fully managed SaaS solution on AWS, from AWS Marketplace. This includes the following capabilities:

Availability as SaaS on AWS: You can accelerate your data modernization strategy in the cloud by combining the openness, performance, and governance of IBM watsonx.data for data, analytics, and AI workloads with the scale, agility, and cost efficiency of the AWS cloud.
Presto engine: Presto is an open-source, fast, reliable, and highly scalable SQL query engine that provides one simple ANSI SQL interface for all your data analytics and your open lakehouse. It is contributed to by some of the biggest companies in the world, including Meta, Uber, Intel, and more.
Multi-engine integration: You’ll no longer need to keep multiple copies of data for various workloads or across database and data lake repositories for analytics and AI use cases. Presto, Apache Spark, Db2, and Netezza engines are fully integrated with shared metadata and data storage, and work off the Iceberg table format to access and query a single copy of data across the multiple engines. For example, customers can run resource-intensive machine learning (ML) model builds in watsonx.data without impacting business intelligence (BI) and dashboard workloads.
Open data and table format support: Store vast amounts of data in vendor-agnostic open formats, such as Parquet, Avro, and Apache ORC, while leveraging the Iceberg table format to share large volumes of data through an open table format built for high-performance analytics. Support for Iceberg brings the reliability and simplicity of SQL to big data. Iceberg time travel and rollback features in watsonx.data enable you to examine table changes over time and roll back quickly to a previous state.
Enterprise compliance and security: Protect your data, manage compliance, and maintain trust with watsonx.data’s built-in governance, automation, and enterprise security capabilities. It integrates with IBM’s centralized governance capabilities for automatic policy enforcement, and enables responsible, transparent, and explainable data and AI workflows across the enterprise.
Easy to use, integrated data console: Bring your own data and stay in control of it. In just a few clicks, connect to existing analytics environments and start deploying fit-for-purpose query engines with integrated metadata and storage through a single point of entry. IBM’s simple UX and console make it easy to ingest, access, and transform data for analytics and AI workloads within minutes.
Insights powered by generative AI: Finally, watsonx.data provides non-technical users in addition to data scientists and engineers with self-service access to high-quality, trustworthy, governed data, in a single collaborative platform. Later this year, watsonx.data will leverage watsonx.ai foundation models to simplify and accelerate the way users interact with data, with the ability to use natural language to discover, augment, refine, and visualize data and metadata in a conversational user experience.
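The time travel and rollback behavior mentioned in the capabilities above can be pictured with a small conceptual model. The Python sketch below is not the Iceberg API or anything shipped with watsonx.data; the SnapshotTable class and its method names are hypothetical, invented only to illustrate the idea that every commit produces an immutable snapshot, that a reader can ask for the table as of an earlier snapshot, and that a rollback simply makes an earlier snapshot current again.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, int]


class SnapshotTable:
    """Toy table that keeps one immutable snapshot per commit (illustrative only)."""

    def __init__(self) -> None:
        # Snapshot 0 is the empty table that exists before any commit.
        self._snapshots: Dict[int, Tuple[Row, ...]] = {0: ()}
        self._current_id = 0
        self._next_id = 1

    @property
    def current_snapshot_id(self) -> int:
        return self._current_id

    def append(self, rows: List[Row]) -> int:
        """Commit new rows as a new snapshot; earlier snapshots stay unchanged."""
        snapshot_id = self._next_id
        self._next_id += 1
        self._snapshots[snapshot_id] = self._snapshots[self._current_id] + tuple(rows)
        self._current_id = snapshot_id
        return snapshot_id

    def scan(self, as_of: Optional[int] = None) -> List[Row]:
        """Read the current state, or time travel to an earlier snapshot."""
        snapshot_id = self._current_id if as_of is None else as_of
        return list(self._snapshots[snapshot_id])

    def rollback_to(self, snapshot_id: int) -> None:
        """Make an earlier snapshot current again without deleting any snapshot."""
        if snapshot_id not in self._snapshots:
            raise KeyError(f"unknown snapshot id: {snapshot_id}")
        self._current_id = snapshot_id


table = SnapshotTable()
first = table.append([("widget", 3)])
second = table.append([("gadget", 5)])

latest_rows = table.scan()               # state after both commits
earlier_rows = table.scan(as_of=first)   # time travel to the first commit

table.rollback_to(first)                 # current state is the first commit again
rows_after_rollback = table.scan()
```

In this toy model the second snapshot is still readable after the rollback, which mirrors the general idea that rolling back changes which snapshot is current rather than rewriting history.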

Integrating IBM watsonx.data with AWS services allows you to increase the value of your IBM investments. These integrations currently include Amazon S3 and Amazon EMR, with more to come.

The figure below shows an example of a lakehouse solution that combines these AWS services with IBM watsonx.data, Netezza Performance Server, and IBM Db2 Warehouse.

Figure 4 – Data analytics architecture on AWS with Amazon EMR and IBM watsonx.data.

Conclusion

In this post, you learned how using IBM watsonx.data on AWS enables data engineers, enterprise architects, database administrators (DBAs), and data analysts to access their data and better derive value from it, through workload optimization and access to BI capabilities within a unified and governed ecosystem.

For your mission-critical workloads, you can rest easy knowing AWS and IBM are working together to provide you with highly available and secure analytics and AI workloads.

Visit the IBM watsonx.data software-as-a-service (SaaS) listing in AWS Marketplace. You can also sign up for a demo or get started with an IBM watsonx.data trial on AWS from the watsonx.data product page.

To learn more about IBM solutions available on AWS, visit the IBM on AWS partner page.


IBM – AWS Partner Spotlight

IBM Software and Technology is an AWS Partner and leading global provider of enterprise technology and services.

Contact IBM | Partner Overview | AWS Marketplace | Case Studies

AWS Partner Network (APN) Blog
