What is Data Virtualization?

Data virtualization is the process of abstracting data operations from the underlying data storage. Modern organizations store data in multiple formats, from traditional tables to real-time messages and files, across various systems and platforms. Physically moving this data to a single central system is not always practical or cost-effective.

Data virtualization uses metadata (data about data) to create a virtual layer for data manipulation. End users can read and modify data in an integrated manner within the virtual layer without needing to understand the underlying technicalities. The virtual layer, rather than the end user, interacts with the underlying storage systems to push or retrieve data as needed.

Why is data virtualization important?

Organizations today often have data spread across disparate data sources in on-premises systems, cloud services, and other siloed systems. Physical data merging capabilities are limited due to the following challenges:

  • Manually managing source data across multiple platforms can be time-consuming and prone to errors.
  • Access control for multiple independent sources can be complex due to mandated data governance.
  • Maintaining direct connections between data sources can be challenging when new sources or users are added.

Other traditional data integration methods require moving data into data warehouses or data lakes. This approach does offer centralization, but it requires keeping multiple copies synchronized, which in turn can impact real-time reporting capabilities.

Data virtualization systems offer several key advantages over these other approaches.

Abstraction

Querying is abstracted from the underlying sources, so users and developers can work with complex datasets without needing to understand every technical detail behind them.

Unified governance

Since data virtualization operates using metadata, you can implement centralized governance within the virtualization layer. It is also easy to build and iterate on data models that become available quickly and can be reused for future projects.

Real-time access

Data virtualization enables you to query multiple sources in real time. You don't need to wait for scheduled synchronizations. Your business users can interact with a single application instead of connecting to each system individually.

Single source of truth

You eliminate redundancies and confusion caused by outdated data in one system due to synchronization delays with another system. You also reduce storage costs by not copying data into centralized data warehouses or lakes.

What are the use cases of data virtualization?

By making real-time data access easier, virtualization can support several important functions.

Analytics & business intelligence

Analytics initiatives, such as for internal reporting or regulatory compliance, often require integrating data from many sources within an organization. Virtualized data access enables analysts and BI teams to easily explore data and refine queries without negatively impacting production data sources.

Cloud migration support

Migrating large systems to the cloud can be a slow and error-filled process. Data virtualization is a powerful tool for effective migration planning. Your team can test cutover scenarios and validate data integration processes without disrupting live systems.

Simplifying major system upgrades

Building test environments for major projects, such as an enterprise resource planning (ERP) system upgrade, can be time-consuming and require extensive coordination among multiple teams. Using data virtualization technology, teams can quickly generate complex data structures for efficient work. That can help reduce infrastructure costs and shorten deployment times.

Production system support

Troubleshooting complex issues in production systems sometimes requires recreating full data services for testing. Data virtualization technology allows your IT teams to quickly build and test environments without the need to copy data. That allows them to verify fixes and identify unintended side effects.

DevOps workflows

Developers and testers can work with a complete virtual data environment when preparing applications for release. They can model how software operates in the real world without needing to replicate large datasets.

What are the capabilities of a data virtualization layer?

Data virtualization software can provide several key capabilities that simplify data management.

Semantic modeling

Meaningful business concepts, such as a "customer" or a "product line," are often fragmented across multiple systems. A virtualization layer lets you define these concepts once, as virtual data models that span the underlying sources.
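As an illustration, here is a minimal sketch of a semantic model in Python. It assumes two hypothetical sources (a CRM table and a billing table) that each hold part of the "customer" concept; all system, table, and column names are invented for the example.

```python
# Hypothetical source systems, each holding a fragment of "customer".
crm_rows = [{"cust_id": 1, "full_name": "Ada Smith"}]
billing_rows = [{"customer_ref": 1, "balance_due": 42.5}]

# The semantic model: each business-level field is mapped to its
# physical (system, column) location. Only metadata lives here.
CUSTOMER_MODEL = {
    "id": ("crm", "cust_id"),
    "name": ("crm", "full_name"),
    "balance": ("billing", "balance_due"),
}

def customers():
    """Assemble unified customer records on demand, without copying data."""
    for c in crm_rows:
        # Join on the customer key that links the two systems.
        b = next(r for r in billing_rows if r["customer_ref"] == c["cust_id"])
        systems = {"crm": c, "billing": b}
        yield {field: systems[sys][col]
               for field, (sys, col) in CUSTOMER_MODEL.items()}

print(list(customers()))
# [{'id': 1, 'name': 'Ada Smith', 'balance': 42.5}]
```

The point of the sketch is that consumers see one "customer" record, while the mapping to physical columns stays in metadata and can be changed without touching the consumers.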

Universal connectivity

By accessing data sources within your organization through a virtualization layer, you can more easily break down data silos and provide every team with real-time access to a unified data set.

High-performance querying

Data virtualization software can apply query optimization techniques to combine a complex request into a single, efficient statement per source, avoiding redundant queries against different systems.
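One common optimization is filter pushdown: instead of fetching every row and filtering inside the virtual layer, the predicate is embedded in the statement sent to the source, so only matching rows travel over the network. A minimal sketch, with an illustrative table and columns:

```python
def build_pushdown_sql(table, columns, predicate):
    """Generate one source-side statement with the filter pushed down."""
    return f"SELECT {', '.join(columns)} FROM {table} WHERE {predicate}"

# Naive plan: fetch everything, filter locally (moves all rows).
naive = "SELECT * FROM orders"

# Optimized plan: the source does the filtering (moves only matches).
optimized = build_pushdown_sql("orders", ["order_id", "total"], "total > 100")
print(optimized)
# SELECT order_id, total FROM orders WHERE total > 100
```

A real engine would also decide per source whether pushdown is possible, since not every backend supports every predicate.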

Data catalogs

Virtualization enables you to store metadata, or information about your data, within the same system. You can use this metadata to track information about your existing datasets and build a data catalog that supports data discoverability.

How does data virtualization work?

Data virtualization is a type of data integration. Instead of working with data directly, data virtualization services operate only on metadata, such as information about where your data is stored, how it is categorized, and how it connects to other data.

User query

Let's say your business has a customer relationship management (CRM) database and a separate inventory system for managing your products. But you want to find all of the orders placed by customers named "Smith" in the last two months, a request that straddles the two systems. You input your query into your data virtualization service.

Data integration

The virtualization service decomposes the query into smaller components. Using its metadata, the service identifies the location of the data for each component of the query within your various sources. It generates subqueries to retrieve customer information from your CRM and order information from the inventory system.

Data presentation

As the sources return data, the data virtualization service transforms it in working memory, adjusting formatting and naming as needed. It filters out redundancies identified by metadata. Then, once transformations are complete, the service delivers an integrated result to your application.
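The three steps above can be sketched end to end in Python, using the CRM/inventory scenario. The in-memory lists stand in for the two live sources, and the schemas and two-month window are illustrative assumptions; a real engine would generate subqueries against the actual systems rather than filter Python lists.

```python
from datetime import date, timedelta

# Stand-ins for the two sources the metadata points at.
crm = [  # CRM source: customer records
    {"cust_id": 1, "last_name": "Smith"},
    {"cust_id": 2, "last_name": "Jones"},
]
inventory = [  # inventory source: order records
    {"order_id": 10, "cust_id": 1, "placed": date.today() - timedelta(days=10)},
    {"order_id": 11, "cust_id": 2, "placed": date.today() - timedelta(days=5)},
    {"order_id": 12, "cust_id": 1, "placed": date.today() - timedelta(days=200)},
]

def orders_for(last_name, days):
    # Data integration: one subquery per source, with filters pushed down.
    cust_ids = {c["cust_id"] for c in crm if c["last_name"] == last_name}
    cutoff = date.today() - timedelta(days=days)
    recent = [o for o in inventory
              if o["cust_id"] in cust_ids and o["placed"] >= cutoff]
    # Data presentation: merge the partial results into one integrated view.
    return [{"order_id": o["order_id"], "customer": last_name} for o in recent]

print(orders_for("Smith", 60))
# [{'order_id': 10, 'customer': 'Smith'}]
```

Order 12 is excluded by the date filter and order 11 by the customer filter, so only the recent "Smith" order reaches the application.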

What are data virtualization approaches in the cloud?

You have three broad approaches to implementing data virtualization in the cloud: custom-built solutions, commercial tools, or cloud-native services.

Custom-built data virtualization

Your first option is to build your own data virtualization solution using cloud infrastructure. While this can offer more control over design and features, it also requires significant development and maintenance effort.

Commercial data virtualization tools

Another option is to use a prebuilt data virtualization platform from a vendor. These tools typically offer prebuilt connectors to many data sources and performance optimizations. They might also support integration with existing corporate metadata standards.

Cloud-native data virtualization

This approach utilizes managed services provided by cloud vendors, such as Amazon Web Services (AWS), to simplify deployment and ongoing operations. It enables organizations that already work in the cloud or are transitioning to it to adopt data virtualization without requiring extensive technical expertise.

How can AWS support your data virtualization requirements?

AWS offers native capabilities that align with many of those provided by commercial data virtualization services. These native features can potentially support a wide range of data virtualization use cases.

Amazon Redshift powers modern data analytics at scale. Whether your growing data is stored in operational data stores, data lakes, streaming services, or in third-party datasets, Amazon Redshift helps you securely access, combine, and share data with minimal movement or copying.

Amazon Athena is an interactive analytics service that works directly with data stored in Amazon S3. It is serverless, so there is no infrastructure to set up or manage, and you can start analyzing data immediately.

AWS Glue is a serverless data integration service that simplifies the process of discovering, preparing, and combining data. Amazon Athena and Amazon Redshift have native integration with AWS Glue Data Catalog, a central metadata repository that supports virtualization.

AWS Lake Formation makes it easier to centrally govern, secure, and globally share data for analytics and machine learning (ML). You can centralize data security and governance using the AWS Glue Data Catalog, managing metadata and data permissions in one place with familiar database-style features. It also delivers fine-grained data access control.

Get started with data virtualization on AWS by creating a free account today.