Accelerating Data Warehouse Migration to Amazon Redshift Using Cognizant Intelligent Data Works
By Arun Kannan, Partner Solutions Architect at AWS
By Dipankar Biswas, Principal Architect at Cognizant Technology Solutions
By Anirban De, Principal Architect at Cognizant Technology Solutions
Many organizations are increasingly looking to migrate existing, on-premises enterprise data warehouse systems to cloud-based data warehouse systems such as Amazon Redshift.
While migrating to a cloud-based data warehouse, one of the most important goals is to move away from database-stored procedures and create a modular architecture by decoupling code from data persistence components.
The business logic that traditionally resided within the database can then be implemented outside of the database and managed independently. Usually, the stored procedures contain business logic to implement the extract, transform, and load (ETL) processes for incremental data processing.
It’s very important that active business processes not be disrupted due to technology migration. Converting database-stored procedures to Python scripts is essentially a manual activity. Manual conversion of code from a proprietary language is extremely effort intensive and error prone.
In this post, we discuss how Cognizant’s Intelligent Migration Workbench (IMW) can be used to accelerate data warehouse migrations while converting Oracle PL/SQL and Tetradata BTEQ scripts.
Cognizant is a AWS Partner Network (APN) Premier Consulting Partner with multiple AWS Competencies and AWS Service Delivery designations. Cognizant is also a member of the AWS Managed Services Provider (MSP) Partner Program.
Cognizant AI Data Modernization Method
Cognizant AI Data Modernization Method (AI-DM) proposes a reference model that customers can use to reimagine their data and analytics foundation using a set of attributes that redefine their data ecosystem in a structured way.
AI-DM also features Intelligent Data Works (IDW), a foundry of data engineering tools and accelerators that leverage artificial intelligence and machine learning. IDW powers Cognizant’s data engineers to accelerate outcomes faster.
IDW also helps accelerate technology migration projects, particularly those involving migration from on-premises database or data warehouse to cloud-based data and analytics platforms.
Intelligent Migration Workbench (IMW) is a component of Intelligent Data Works that parses SQL dialects and procedural constructs from procedural SQL code and generates equivalent PySpark code. This technique of code conversion improves the productivity manifold while enhancing code performance using Spark-based distributed computing.
These tools help customers build a modern data ecosystem based on the principles of AI-DM.
The end-to-end process of migrating from an on-premises database or data warehouse to the Amazon Web Services (AWS) Cloud comprises several tasks. Some are concerned with re-factoring upstream and downstream processes for data ingestion and consumption, while others are specific for migrating data from the source data warehouse system to Amazon Redshift.
IMW focuses on one specific task from the list below:
- Schema Migration: For schema migration, more specifically the tables and views, AWS Schema Conversion Tool (SCT) is the tool of choice.
- Physical Data Migration: Use the SCT Data Extractor for extracting data from your on-premises data warehouse to Redshift; AWS Snowball is a secure device to migrate petabytes of data from on-premises to AWS; AWS Database Migration Service (DMS) can help to migrate your database quickly and securely to AWS.
For a legacy proprietary Unix-based system like AIX, use traditional ETL tools or Unix/SQL scripts to migrate the data if none of the AWS services above can be used.
- Programmable Objects: Cognizant’s IMW focuses on conversion of programmable objects like stored procedures and user-defined functions.
IMW delivers automation through machine-driven conversion of code from proprietary dialects such as Teradata BTEQ and Oracle PL/SQL to PySpark, thereby freeing the business logic locked in proprietary languages.
The refactored code can run on fully managed ETL services, such as AWS Glue, that provide customers the flexibility to manage the code execution environment.
How Intelligent Migration Workbench (IMW) Works
The diagram below shows the high-level internal process flow of IMW.
Figure 1 – Intelligent Migration Workbench (IMW) workflow.
How to Use IMW
For effective use of IMW, we recommend the following sequence of tasks:
The user identifies the code in the source database that needs to be converted using IMW. For a Teradata environment, it may be Teradata BTEQ scripts, as well as embedded stored procedures. For an Oracle data warehouse environment, it may be Oracle PL/SQL packages and stored procedures.
This step involves dependency analysis, so that whenever a program module is to be converted we identify all the dependent program modules that need to be converted along with it. For example, in Oracle, the ALL_DEPENDENCIES view can be queried to find out the dependencies between objects in a database.
The next step is extracting the identified code, either BTEQ scripts or Oracle PL/SQL code, into files and placing them in a common storage repository. This location is configured in IMW as the source code location.
The conversion step is where IMW analyzes the BTEQ or PL/SQL scripts and generates the corresponding PySpark code. During this conversion, the converter takes care of semantic conversion of logical constructs in the database code to corresponding Python code.
The SQL statements are converted into Spark SQL, and IMW also takes care of converting native database functions in Teradata or Oracle to their corresponding PySpark equivalents. The output is not just a lexical translation but a logical replication of the business logic contained in the source code.
The screenshots below illustrate how source code is uploaded to IMW through the web user interface. Form there, the converted code is made available for download.
IMW supports conversion of scripts from variety of sources like Oracle PL/SQL, Teradata BTEQ scripts, and PySpark scripts that can be executed on AWS Glue or Amazon EMR.
Figure 2 – Selecting the source and target in IMW.
Once the selected scripts are converted to the target format (PySpark), you can download the converted scripts.
Figure 3 – Download the converted scripts in IMW.
The following diagram shows the extracts of BTEQ source code and functionally equivalent PySpark code automatically converted using IMW.
Figure 4 – Source BTEQ script and converted PySpark script.
IMW creates a conversion log that holds important information about conversion success. The converter may encounter source code that it cannot convert successfully. In such a case, the conversion log contains information about the code snippets that requires manual conversion.
As a first step, the PySpark code generated by IMW is unit-tested using a Jupyter notebook, followed by system integration testing.
The system integration testing needs to be automated, as it could involve multiple complex and error prone steps like running the existing ETL process on a source dataset, which refreshes a target database.
The next task is to run the newly-generated PySpark code on the same source dataset to load or refresh a different target database. Finally, the results from the corresponding tables of the two target databases are compared to validate if the new PySpark code produces same results as existing ETL.
For automating the system integration testing for IMW code conversion, Cognizant leverages Jupiter, a component of AI-DM. Jupiter is a continuous data testing accelerator enabling data test automation and quality engineering at scale by aligning to a DevOps delivery mode.
5. Scheduling and Execution
Once the PySpark scripts are ready, they are executed through AWS Glue or as Spark jobs on Amazon EMR.
The following diagram depicts how IMW is deployed and leverages native AWS services like AWS Glue to run ETL jobs.
Figure 5 – Sample architecture of IMW on AWS.
Benefits of IMW for Customers
Some of the key benefits of IMW are:
- IMW provides a compelling solution for customers looking to refactor proprietary data transformation code from an existing on-premises data warehouse to AWS. It delivers cost savings by reducing the effort needed for implementation and execution of a migration solution through automation.
A leading retail enterprise customer with an inventory of 10 BTEQ scripts with 10,000+ lines of code and reasonable complexity distribution currently uses IMW to convert the complete codebase with less than 20 person hours of manual effort. This translates to code conversion effort savings of more than 70 percent.
- IMW substantially improves productivity and code quality by enabling a fast, error-free, and consistent output requiring manual intervention only for exception scenarios.
For example, with Teradata BTEQ source, typically 60-65 percent of the source code is converted by IMW directly, while the remaining code may require human effort for further analysis and conversion. Faster conversion and fewer bugs mean shorter QA cycle and faster time to market.
- The stable version of IMW supports conversion of Teradata BTEQ and PL/SQL to PySpark, while a beta version is available supporting Oracle and Netezza. The IMW utility will be further supplemented in future by extending its code conversion capability for Amazon Aurora as target.
Organizations have realized they need to move data assets and associated processes to the cloud as soon as possible to remain competitive in this digital era.
Cognizant has been helping customers modernize their data platform leveraging AWS, big data, fast data, NoSQL databases, and different polyglot strategies for the past several years.
Cognizant AI Data Modernization Method offers an AI-driven, data-first approach to simplify and transform existing data landscapes into a future-ready responsive data foundation capable of faster intelligence across the enterprise.
Customers can leverage AI-DM, and more specifically Intelligent Migration Workbench (IMW), to accelerate migration of their enterprise on-premises data warehouse to AWS.
IMW makes it easy to move mission critical proprietary code to AWS, giving customers competitive edge through faster time to market.
Cognizant – APN Partner Spotlight
Cognizant is an APN Premier Consulting Partner. They transform customers’ business, operating, and technology models for the digital era by helping organizations envision, build, and run more innovative and efficient businesses.
*Already worked with Cognizant? Rate this Partner
*To review an APN Partner, you must be an AWS customer that has worked with them directly on a project.