Building data lakes with SAP on AWS
September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details.
Data is ubiquitous, and enterprises are generating it at an exponential rate. However, the majority of this data is dark data. The Gartner glossary defines dark data as the information assets that organizations collect, process, and store during regular business activities, but which they generally fail to use for other purposes. Those other purposes could include analytics, business relationships, and direct monetization.
Some of our customers, such as Thermo Fisher, are already tapping into the data generated across their enterprise to build a scalable and secure data platform on AWS. Their data comes from medical instruments and various software applications. This data platform is helping medical researchers and scientists to conduct research, collaborate, and improve medical treatment for patients. For more details, see Thermo Fisher Case Study.
We have thousands of customers running their business-critical SAP workloads on AWS and realizing big business benefits, as evident in the SAP on AWS Case Studies. Although the business case for migrating SAP workloads to AWS is well-articulated, many customers are also looking to build stronger business cases around transformations powered by data and analytics platforms. We hear firsthand from customers that they are looking for ways to tap into SAP data, non-SAP application data, and real-time streaming data generated by internet-powered devices to build data and analytics platforms on AWS.
In this post, I cover various data extraction patterns supported by SAP applications. I also cover reference architectures for using AWS services and other third-party solutions to extract data from SAP into data lakes on AWS.
Data lakes on AWS are powered by various AWS services:
- Amazon S3 for storing structured and unstructured data
- AWS Glue for extracting, transforming, and loading (ETL) data
- Amazon Kinesis for ingesting streaming data
- Amazon Redshift for data warehousing
- Amazon Athena for querying data directly from S3
- Amazon EMR for running big data frameworks, such as Apache Spark, Hadoop, and others
- Amazon QuickSight for business intelligence and visualizations
- AWS Lake Formation to accelerate setting up data lakes
Because S3 acts as both the starting point and a landing zone for all data for a data lake, I focus here on design patterns for extracting SAP data into S3. I also cover some of the key considerations in implementing these design patterns.
Data extraction patterns for SAP applications
For this post, I focus only on SAP ERP (S/4HANA, ECC, CRM, and others) and SAP BW applications as the source. This is where we see a vast majority of our customer requirements in extracting the data to S3. Although customers are also using SaaS applications as data sources, the fact that these are built API-first makes it easier to integrate them with AWS services.
In this post, I also focus only on data replication patterns and leave data federation patterns for another day, because that is a topic of its own. Data federation is a pattern where applications access data virtually from another data source instead of physically moving the data between them. Following is a high-level architecture of the various integration patterns and extraction techniques. Let’s dive deep into each of these patterns.
Database-level extraction
Database-level extraction, as the name suggests, taps into SAP data at the database level. There are various APN Partner solutions (Attunity Replicate, HVR for AWS, and others) that capture raw data as it is written to the SAP database transaction logs. They transform it with the required mappings and store the data in S3. These solutions are also able to decode SAP cluster and pool tables.
For HANA databases, especially those that have prebuilt calculation views, customers can use the SAP HANA client libraries with Python support for native integration with AWS Glue and AWS Lambda. Alternatively, they can use SAP HANA Java Database Connectivity (JDBC) drivers.
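As a rough illustration of the Python route, the following sketch reads a view through a generic DB-API cursor and writes the rows to S3 in CSV batches. It assumes the cursor comes from the SAP HANA Python client (`hdbcli.dbapi.connect(...).cursor()`) and the S3 client from `boto3.client("s3")`. Both are passed in rather than created here, so the connection details stay out of the example. The function names, the `sap/...` key layout, and the Hive-style `year=/month=/day=` partitions are hypothetical choices for this sketch, not a prescribed layout.

```python
import csv
import io
from datetime import date


def landing_key(source_system: str, object_name: str, run_date: date, part: int = 0) -> str:
    """Build a Hive-style partitioned S3 key (hypothetical layout) for one extract file."""
    return (
        f"sap/{source_system}/{object_name}/"
        f"year={run_date:%Y}/month={run_date:%m}/day={run_date:%d}/"
        f"part-{part:05d}.csv"
    )


def rows_to_csv(column_names, rows) -> str:
    """Serialize DB-API rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(column_names)
    writer.writerows(rows)
    return buffer.getvalue()


def extract_to_s3(cursor, s3_client, bucket, select_sql, source_system,
                  object_name, run_date: date, batch_size=10000):
    """Run select_sql and write the result to S3, one object per batch.

    cursor: any Python DB-API 2.0 cursor (assumed here to come from hdbcli).
    s3_client: any object exposing put_object(Bucket=, Key=, Body=),
        such as a boto3 S3 client.
    Returns the list of S3 keys written.
    """
    cursor.execute(select_sql)
    column_names = [column[0] for column in cursor.description]
    written_keys = []
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        key = landing_key(source_system, object_name, run_date, len(written_keys))
        body = rows_to_csv(column_names, rows).encode("utf-8")
        s3_client.put_object(Bucket=bucket, Key=key, Body=body)
        written_keys.append(key)
    return written_keys
```

Writing in batches keeps memory use bounded inside a Lambda function or Glue Python shell job, and partitioned keys let services such as Athena and AWS Glue prune by date. This sketch performs a full read on every run; it does not implement change data capture.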
The key considerations for database-level extraction include the following:
- As the third-party adapters pull data from transaction logs, there is minimal performance impact to the SAP database application.
- Change data capture is supported out of the box based on database change logs. This is true even for those tables where SAP doesn’t capture updated date and time at the application level.
- Certain database licenses (for example, runtime licenses) may prevent customers from pulling data directly from the database.
- This pattern doesn’t retain the SAP application logic, which is usually maintained in the ABAP layer, potentially leading to re-mapping work outside SAP. Also, changes to the SAP application data model could result in additional maintenance effort due to transformation changes.
Application-level extraction
In SAP ERP applications, business logic largely resides in the ABAP layer. Even with the code push-down capabilities of the SAP HANA database, the ABAP stack still provides an entry point for API access to business context.
Application-level extractors like SAP Data Services extract data from SAP applications using integration frameworks in the ABAP stack and store it in S3 through default connectors. Using Remote Function Call (RFC SDK) libraries, these extractors are able to natively connect with SAP applications to pull data from remote function modules, tables, views, and queries. SAP Data Services can also install arbitrary ABAP code in the target SAP application and push data from the SAP application rather than pulling it. The push pattern helps with better performance in certain cases.
SAP applications also support HTTP access to function modules, and you can use AWS Glue or Lambda to access these function modules using HTTP. SAP has also published the PyRFC library, which can be used in AWS Glue or Lambda to natively integrate using the RFC SDK. SAP IDocs can be integrated with S3 using an HTTP push pattern. I wrote about this technique in an earlier post, SAP IDoc integration with Amazon S3 by using Amazon API Gateway.
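To make the RFC route more concrete, here is a minimal sketch of reading a table in batches through a function module. It assumes a connection object with a `call(function_name, **params)` method, which is how a `pyrfc.Connection` is used, and it uses `RFC_READ_TABLE` purely as an illustration. That function module is widely used but is not officially released by SAP for customer use, so treat it as a stand-in for whichever remote-enabled function module fits your case. The helper names `parse_read_table` and `read_table` are hypothetical.

```python
def parse_read_table(result, delimiter="|"):
    """Turn an RFC_READ_TABLE result into a list of dicts keyed by field name."""
    field_names = [field["FIELDNAME"] for field in result["FIELDS"]]
    records = []
    for line in result["DATA"]:
        # Each row arrives as one delimited string in the "WA" column.
        # Values that contain the delimiter themselves would break this split.
        values = [value.strip() for value in line["WA"].split(delimiter)]
        records.append(dict(zip(field_names, values)))
    return records


def read_table(connection, table, fields, where="", batch_size=1000):
    """Yield rows from an SAP table in batches through RFC_READ_TABLE.

    connection: object exposing call(function_name, **params); assumed to be
        a pyrfc.Connection created with the caller's own logon parameters.
    where: optional selection condition (each OPTIONS line is limited in
        length on the SAP side, so long conditions need to be split).
    """
    skip = 0
    while True:
        result = connection.call(
            "RFC_READ_TABLE",
            QUERY_TABLE=table,
            DELIMITER="|",
            FIELDS=[{"FIELDNAME": name} for name in fields],
            OPTIONS=[{"TEXT": where}] if where else [],
            ROWSKIPS=skip,
            ROWCOUNT=batch_size,
        )
        records = parse_read_table(result)
        yield from records
        # A short (or empty) batch means the table has been read completely.
        if len(records) < batch_size:
            break
        skip += batch_size
```

Paging with `ROWSKIPS` and `ROWCOUNT` keeps each RFC call small, but it is a full read: rows that change between calls can be missed or duplicated, which is one reason change data capture needs a different mechanism.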
The key considerations for application-level extraction include the following:
- Extractions can happen with business context in place as the extractions happen at the application level. For example, to pull all sales order data for a particular territory, you could do so with all related data and their associations mapped through function modules. This reduces additional business logic–mapping effort outside SAP.
- Change data capture is not supported by default. Not all SAP function modules or frameworks support change data capture capabilities.
- Using AWS native services like AWS Glue or Lambda removes the requirement for a third-party application, hence reducing the total cost of ownership. However, customers might see an increase in custom development effort to wire the HTTP or RFC integrations with SAP applications.
- This pattern can be slower than database-level extraction because the data passes through the application layer. Pulling data through function modules and other frameworks also puts additional load on the SAP application servers.
Operational data provisioning–based extraction
The Operational data provisioning (ODP) framework enables data replication capabilities between SAP applications and SAP and non-SAP data targets using a provider and subscriber model. ODP supports both full data extraction and change data capture using operational delta queues.
The business logic for extraction is implemented using SAP DataSources (transaction code RSO2), SAP Core Data Services (CDS) Views, SAP HANA Information Views, or SAP Landscape Replication Server (SAP SLT). ODP, in turn, can act as a data source for OData services, enabling REST-based integrations with external applications. The ODP-Based Data Extraction via OData document details this approach.
Solutions like SAP Data Services and SAP Data Hub can integrate with ODP, using native remote function call (RFC) libraries. Non-SAP solutions like AWS Glue or Lambda can use the OData layer to integrate using HTTP.
The key considerations for extraction using ODP include the following:
- Because business logic for extractions is supported at the application layer, the business context for the extracted data is fully retained.
- All table relationships, customizations, and package configurations in the SAP application are also retained, resulting in less transformation effort.
- Change data capture is supported using operational delta queue mechanisms. Full data load with micro batches is also supported using OData query parameters.
- Using AWS native services like AWS Glue or Lambda removes the requirement for a third-party application, hence reducing the total cost of ownership. However, customers might see an increase in custom development effort to build OData-based HTTP integrations with SAP applications. I published sample extractor code in Python that can be used with AWS Glue and Lambda in the aws-lambda-sap-odp-extractor GitHub repository to accelerate your custom development.
- Data Services and Data Hub might perform better in pulling data from SAP because they access ODP through the RFC layer. SAP hasn’t opened the native RFC integration capability of ODP to non-SAP applications, so AWS Glue and Lambda have to rely on HTTP-based access through OData. Conversely, this might be an advantage for customers who want to standardize on open integration technologies. For more information about ODP capabilities and limitations, see Operational Data Provisioning (ODP) FAQ.
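The micro-batch full load mentioned above can be sketched as a simple paging loop. This is not the code from the aws-lambda-sap-odp-extractor repository. It is a minimal illustration that assumes the OData service returns OData V2 JSON (`{"d": {"results": [...]}}`) and honors the standard `$top` and `$skip` query options. The function names, the `fetch_json` callable, and the example URL are hypothetical, and authentication and delta handling are deliberately left out.

```python
import json
import urllib.request
from urllib.parse import urlencode


def http_fetch_json(url, headers=None):
    """Fetch a URL and decode the JSON body (authentication omitted in this sketch)."""
    request = urllib.request.Request(url, headers=headers or {"Accept": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def odata_pages(fetch_json, entity_set_url, page_size=1000, extra_params=None):
    """Yield lists of records from an OData V2 entity set, one page at a time.

    fetch_json: callable that takes a URL and returns the decoded JSON body,
        for example http_fetch_json or a wrapper that adds credentials.
    extra_params: optional dict of further query options such as $filter.
    """
    skip = 0
    while True:
        params = {"$format": "json", "$top": page_size, "$skip": skip}
        params.update(extra_params or {})
        # Keep "$" unescaped so the system query options stay readable.
        url = f"{entity_set_url}?{urlencode(params, safe='$')}"
        records = fetch_json(url)["d"]["results"]
        if records:
            yield records
        # A short page signals the end of the data set.
        if len(records) < page_size:
            break
        skip += page_size
```

Each yielded page can be written to S3 as its own object, which keeps the memory footprint of a Lambda function small and makes a failed run easier to resume from the last completed page.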
SAP Landscape Transformation Replication Server–based extraction
SAP Landscape Transformation Replication Server (SAP SLT) supports near real-time and batch data replication from SAP applications. Real-time data extraction is supported by creating database triggers in the source SAP application. For replication targets, SAP SLT supports SAP HANA, SAP BW, SAP Data Services, SAP Data Hub, and a set of non-SAP databases by default. For a list of non-SAP targets, see the Replicating Data to Other Databases documentation.
To replicate data to targets that SAP doesn’t support yet, customers can implement their own customizations using the SAP LT Replication Server SDK. A detailed implementation guide is available in SAP support note 2652704 – Replicating Data Using SAP LT Replication Server SDK (requires SAP ONE Support access).
In this pattern, you can use AWS Glue to pull data from SAP SLT supported target databases into S3. Or, use SAP Data Services or SAP Data Hub to store the data in S3. You can also implement ABAP extensions (BAdIs) using the SAP LT Replication Server SDK to write replicated data to S3.
The key considerations for extraction using SAP SLT include the following:
- It supports both full data extraction and change data capture. Trigger-based extraction supports change data capture even on source tables that don’t have updated date and timestamp.
- Additional custom development in ABAP is required to integrate with targets not supported by SAP.
- Additional licensing cost for SAP Data Hub, SAP Data Services, or other supported databases to replicate data to S3.
- Additional custom development effort in AWS Glue when replicating from an SAP-supported database to S3.
- An SAP SLT enterprise license might be required for replicating to non-SAP-supported targets.
End-to-end enterprise analytics
Ultimately, customers are looking to build end-to-end enterprise analytics using data lakes and analytics solutions on AWS.
SAP provides operational reporting capabilities using embedded analytics within its ERP applications. However, customers are increasingly looking at integrating data from SAP, non-SAP applications, the Internet of Things (IoT), social media streams, and various SaaS applications. They want to drive process efficiencies and build newer business models using machine learning.
A sample high-level architecture for end-to-end enterprise analytics is shown below. In this architecture, customers can extract data from SAP applications using the patterns discussed in this post. Then, they can combine it with non-SAP application data using AWS services to build end-to-end enterprise analytics. These services can include Amazon S3, Amazon Redshift, Amazon Athena, Amazon Elasticsearch Service, and Amazon QuickSight. Other visualization solutions like Kibana and SAP Fiori Apps can also be a part of the solution.
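As one possible illustration of combining SAP and non-SAP data once both are in S3, the sketch below runs a join through Amazon Athena and waits for it to finish. It assumes an Athena client created with `boto3.client("athena")` is passed in. The database, the table names (`sap_sales_orders`, `crm_accounts`), the columns, and the results location are all hypothetical and would need to exist in your own AWS Glue Data Catalog.

```python
import time

# Hypothetical tables: one cataloged over SAP extracts, one over non-SAP data.
EXAMPLE_SQL = """
SELECT a.account_name, SUM(o.net_value) AS total_net_value
FROM sap_sales_orders o
JOIN crm_accounts a ON a.sap_customer_id = o.customer_id
GROUP BY a.account_name
"""


def run_athena_query(athena_client, sql, database, output_location,
                     poll_seconds=2.0, sleep=time.sleep):
    """Start an Athena query and block until it reaches a terminal state.

    athena_client: object exposing start_query_execution and
        get_query_execution, such as a boto3 Athena client.
    output_location: S3 URI where Athena writes the result files.
    Returns the query execution ID on success; raises RuntimeError otherwise.
    """
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        state = status["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "")
            raise RuntimeError(f"Athena query {query_id} ended as {state}: {reason}")
        # Still QUEUED or RUNNING: wait before polling again.
        sleep(poll_seconds)
```

A call such as `run_athena_query(client, EXAMPLE_SQL, "enterprise_lake", "s3://my-results-bucket/athena/")` would return the execution ID, which can then be used to fetch results or to point a visualization tool at the output files.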
Today, our customers have to deal with multiple roles, such as data scientists, tech-savvy business users, executives, marketers, account managers, external partners, and customers. Each of these user groups requires access to different kinds of data from different sources. They access it through multiple channels: web portals, mobile apps, voice-enabled apps, chatbots, and APIs.
Our goal at AWS is to work with customers to simplify these integrations so that you can focus on what matters the most—innovating and transforming your business.
Watch the recording of my session at AWS re:Invent 2019, GPSTEC338 – Building data lakes for your customers with SAP on AWS, to learn more about the patterns that I discussed in this post. Until then, keep on building!