How to Get Real-Time SAP Data into Amazon Redshift with HVR
By Ganesh Suryanarayanan, Sr. Partner Solutions Architect – SAP at AWS
By Andre Reiger, Director, Field and Marketing at HVR
By Josh Robinson, Solutions Architect at HVR
Technological advances such as cloud computing are simplifying lives and enhancing business operations.
The massive scale and efficiencies offered by cloud data lakes are best served by a continuous replication mechanism from on-premises and cloud-based enterprise resource planning (ERP) applications.
With all of this innovation, it’s great to see technologies being used together to serve customer needs. This post explores how SAP ERP, Amazon Redshift, and HVR Change Data Capture (CDC) add up to more than the sum of the individual parts.
In this post, we’ll dive deep into HVR’s architecture and the unique value proposition for SAP customers building their data lakes with Amazon Web Services (AWS).
SAP: The ERP Standard for Many Customers
SAP has been around for almost five decades. During that time, SAP has grown to be one of the most pervasive ERP applications in the world. Tens of thousands of companies use SAP systems to operate their business and organize data.
A global water technology (GWT) company saw several advantages when it chose to move its SAP data with HVR into Amazon Redshift, a cloud data warehouse that makes it easy to gain new insights from all your data.
The salient features of this CDC-based replication are:
- Near-zero overhead load on the source SAP transactional database for all tables. Load once, stream changes. Cluster and pool tables are supported.
- Lower latency between source and target – Change data queued off box and updated on target once per hour.
- Improved data quality – Log-based CDC guarantees zero change data loss, including deletes and transient updates.
- Flexibility with analytics – Support for heterogeneous platforms enables high volume data centers to span on-premises and in the cloud.
- Data trust – Built-in data validation and repair capabilities.
- Security – Industry best practices to encrypt data in transit.
For more details about the GWT customer story, see this HVR blog post: 7 Major Benefits of Using Log-Based CDC for Data Replication from SAP.
AWS: The Cloud Standard for Many Customers
With more than 20,000 data lakes already deployed on AWS, customers are benefiting from storing their data in Amazon Simple Storage Service (Amazon S3) and analyzing that data with analytics and machine learning (ML) services to increase their pace of innovation.
AWS offers the broadest and most complete portfolio of native ingest and data transfer services, plus more partner ecosystem integrations with Amazon S3.
The initial process of loading involves bringing data from several commercial off-the-shelf (COTS) applications and non-COTS applications. There are several approaches to extracting SAP data, as outlined in this AWS blog post: Building Data Lakes with SAP on AWS.
Removing Data Silos
For a long time, critical business data was locked in SAP systems. To run on different databases and provide portability between them, SAP used a proprietary encoding in the database system.
This encoding means that to read the logical table, you must utilize proprietary APIs that are outside the database system. These tables within a table are called pooled and clustered tables. Most CDC and extract, transform, and load (ETL) solutions have a hard time getting data out of SAP systems, because of this encoding.
Both pooled and clustered tables contain critical business data that can’t be traditionally collected and replicated. To access this data, SAP requires users to go through BAPI (Business Application Programming Interface), which puts the load on the source systems and can degrade the performance of the ERP system.
For some customers, this load is acceptable, but for most it means the data in those tables is locked up and can only be replicated out of SAP during certain windows of time. As a result, non-SAP analytics systems do not receive real-time updates. Even when using BAPI, some critical data is not available, so an alternative is required.
Pooled and clustered tables are a part of SAP systems that have been around for decades. In 2010, SAP released HANA, an in-memory, column-oriented, relational database to replace other database systems underneath SAP.
SAP HANA allowed SAP systems to get away from these pooled and clustered tables, but many SAP systems have not been migrated to HANA, and many have no migration planned.
For SAP systems that have migrated to HANA or started with HANA in the first place, there is a new challenge. HANA uses a multiversion concurrency control (MVCC) system sometimes called Copy-on-Write. This means that whenever a row is updated, rather than overwriting the existing data, HANA marks the existing data as obsolete and writes the data to a new row.
As a result, the row id for a given customer, order, or item changes over time. In addition, the transaction logs in HANA only provide the old row id, the new row id, and the new data. This means you have incomplete data unless you query the database or utilize the application layer, both of which incur load on the source system.
HVR avoids causing this load because of its ability to track the row id and complete the data on the target system.
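To make the row id problem concrete, here is a minimal Python sketch of how a target-side tracker could follow row ids as they change. It is an illustration only and not HVR's actual implementation: the LogRecord shape, the RowIdTracker class, and the order_id key column are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LogRecord:
    """Hypothetical, simplified log entry: old row id, new row id, new data."""
    old_row_id: Optional[int]   # None for an insert
    new_row_id: Optional[int]   # None for a delete
    new_data: Optional[dict]    # None for a delete


class RowIdTracker:
    """Follows row ids on the target side so the source is never queried."""

    def __init__(self, key_column: str):
        self.key_column = key_column
        self.key_by_row_id: Dict[int, object] = {}  # current row id -> business key
        self.rows_by_key: Dict[object, dict] = {}   # business key -> latest row

    def apply(self, rec: LogRecord) -> None:
        if rec.old_row_id is None:
            # Insert: learn the business key from the new data.
            key = rec.new_data[self.key_column]
            self.key_by_row_id[rec.new_row_id] = key
            self.rows_by_key[key] = dict(rec.new_data)
        elif rec.new_row_id is None:
            # Delete: the log carries only the old row id, so resolve the key here.
            key = self.key_by_row_id.pop(rec.old_row_id)
            del self.rows_by_key[key]
        else:
            # Update: the row moved to a new row id; re-point the mapping.
            key = self.key_by_row_id.pop(rec.old_row_id)
            self.key_by_row_id[rec.new_row_id] = key
            self.rows_by_key[key] = dict(rec.new_data)
```

In this sketch, a delete that arrives with nothing but an old row id can still be matched to the right order on the target, because the tracker remembers which business key that row id last belonged to.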
In the next section, we’ll cover the architecture and steps deployed to support log-based replication causing minimal overhead to source systems.
Unlocking Your Data with HVR
Capturing the inserts, updates, and deletes as they happen is critical to working in real-time. It’s also critical to not apply load or require changes to the application that creates this data.
HVR accomplishes this by parsing the relational database’s transaction log, allowing HVR to capture the inserts, updates, and deletes in real-time, with little to no additional load on the relational database management system (RDBMS), and without requiring any change to the application.
Beyond the ability to capture and integrate inserts, updates, and deletes, HVR provides “rich replication” that includes the ability to enrich, optimize, and bulk load your data. You no longer need one tool for bulk loads, another to keep the data in sync, and a third tool to enrich the data.
HVR also has the ability to extract data from SAP’s pooled and clustered tables directly from the underlying database system and expand them to traditional tables.
Figure 1 – HVR architecture.
HVR’s biggest customers regularly see 100-300 million rows of change data captured per hour, which means tens to hundreds of GBs of data are flowing from the source to the target per hour. Additional details can be found in this doc on real-world performance.
HVR has a unique distributed architecture that can take advantage of the elasticity of the AWS Cloud. The HVR hub can be installed on an Amazon Elastic Compute Cloud (Amazon EC2) instance to feed data into Amazon Redshift continuously.
As data volumes grow and shrink, you can grow and shrink pools of HVR agents to handle the load. This provides real-time CDC. You no longer have to wait for a batch window or for an ETL process to complete before analyzing the latest data. Real-time data moves from the on-premises SAP system into Amazon Redshift, where it can be accessed instantly for analysis.
As data is captured from the source SAP system, it’s compressed (10x compression is regularly seen), encrypted, and delivered to the HVR hub. The hub can instantly send it on to the target system in its fully compressed and encrypted format, or it can queue up the changes waiting for an optimal time or size to send them off to the target.
If the target system is not available, HVR keeps the changes queued until it can connect again. This allows changes not to be missed due to loss of connectivity to the target.
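The queue-and-retry behavior described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not HVR's hub: ChangeQueue and the send callback are hypothetical names, zlib stands in for whatever compression HVR uses, and encryption is left out to keep the example short.

```python
import json
import zlib
from collections import deque
from typing import Callable, Deque, List


class ChangeQueue:
    """Holds compressed change batches until the target accepts them."""

    def __init__(self, send: Callable[[bytes], None]):
        self._send = send                      # hypothetical delivery callback
        self._pending: Deque[bytes] = deque()  # oldest batch first

    def enqueue(self, changes: List[dict]) -> None:
        # Compress each batch once, at capture time.
        payload = zlib.compress(json.dumps(changes).encode("utf-8"))
        self._pending.append(payload)

    def flush(self) -> int:
        """Deliver batches in order; stop at the first connection failure."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # target unavailable: keep the batch queued for later
            self._pending.popleft()  # remove only after a successful send
            delivered += 1
        return delivered

    def __len__(self) -> int:
        return len(self._pending)
```

A batch is removed from the queue only after the send succeeds, so a lost connection delays delivery but does not drop changes, and ordering is preserved when the target comes back.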
Finally, sources and targets in HVR do not have a one-to-one relationship. As a result, it’s easy and efficient to have multiple sources and targets that can evolve with your data.
Once the changes are sent on to the target, they are quickly and atomically applied to the system. The apply step can include logic to enrich your data, coerce data types, or apply DDL changes.
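As a rough illustration of what applying a batch atomically means, the following Python sketch applies a list of changes inside a single transaction. It uses the standard library's sqlite3 module as a stand-in for the target database, and the orders table and apply_batch function are assumptions made for this example, not part of HVR or Amazon Redshift.

```python
import sqlite3
from typing import Iterable, Optional, Tuple

Change = Tuple[str, int, Optional[int]]  # (operation, id, qty)


def apply_batch(conn: sqlite3.Connection, changes: Iterable[Change]) -> None:
    """Apply all changes in one transaction: either every change lands or none."""
    with conn:  # commits on success, rolls back if any statement raises
        for op, key, qty in changes:
            if op == "insert":
                conn.execute(
                    "INSERT INTO orders (id, qty) VALUES (?, ?)", (key, qty)
                )
            elif op == "update":
                conn.execute(
                    "UPDATE orders SET qty = ? WHERE id = ?", (qty, key)
                )
            elif op == "delete":
                conn.execute("DELETE FROM orders WHERE id = ?", (key,))
            else:
                raise ValueError(f"unknown operation: {op}")
```

If any statement in the batch fails, the transaction is rolled back and the target stays at the last consistent state, so readers never see a half-applied batch.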
HVR’s solution includes the ability to compare data on the source and target systems to verify they are in sync. This comparison can be made at the granularity of the whole table or per row.
With the bulk granularity, HVR will make a hash of the entire table on the source and target systems and compare the hashes. If the row granularity is used, HVR will report that the data is out of sync and provide the exact SQL to run to bring it back in sync. Then, you can decide if you need to update a few rows or refresh the entire table.
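The two comparison granularities can be sketched in Python as follows. This is a minimal illustration, not HVR's compare feature: tables are modeled as dictionaries keyed by primary key, and table_hash, repair_statements, and the column layout are hypothetical. The generated statements use placeholders with separate parameters rather than literal values.

```python
import hashlib
from typing import Dict, List, Tuple

Table = Dict[int, tuple]  # primary key -> remaining column values


def table_hash(rows: Table) -> str:
    """Bulk granularity: one hash over the whole table, in key order."""
    h = hashlib.sha256()
    for key in sorted(rows):
        h.update(repr((key, rows[key])).encode("utf-8"))
    return h.hexdigest()


def repair_statements(
    table: str, columns: List[str], source: Table, target: Table
) -> List[Tuple[str, tuple]]:
    """Row granularity: SQL (with parameters) that brings target in line with source."""
    key_col, value_cols = columns[0], columns[1:]
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    assignments = ", ".join(f"{c} = ?" for c in value_cols)
    stmts: List[Tuple[str, tuple]] = []
    for key in sorted(source.keys() | target.keys()):
        if key not in target:
            sql = f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})"
            stmts.append((sql, (key, *source[key])))
        elif key not in source:
            stmts.append((f"DELETE FROM {table} WHERE {key_col} = ?", (key,)))
        elif source[key] != target[key]:
            sql = f"UPDATE {table} SET {assignments} WHERE {key_col} = ?"
            stmts.append((sql, (*source[key], key)))
    return stmts
```

A cheap first pass would compare table_hash(source) with table_hash(target) and only compute row-level repairs when the hashes differ. The number of statements returned then helps decide between patching a few rows and refreshing the whole table.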
By combining these three powerful technologies—SAP ERP, Amazon Redshift, and HVR CDC—business units can derive more value than previously thought.
Users still have the power of the SAP ERP system in their on-premises deployment. They can also benefit from the powerful analytics capability of having their data in Amazon Redshift, and they can access that data in real time thanks to HVR’s CDC technology.
For more information on how HVR can support your SAP to AWS data integration project, contact HVR for a consultation.
HVR – AWS Partner Spotlight
HVR is an AWS Competency Partner that provides real-time data replication software across cloud and on-premises platforms like Amazon Redshift.