Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

This post is co-written with Preshen Goobiah and Johan Olivier from Capitec.

Apache Spark is a widely-used open source distributed processing system renowned for handling large-scale data workloads. It finds frequent application among Spark developers working with Amazon EMR, Amazon SageMaker, AWS Glue and custom Spark applications.

Amazon Redshift offers seamless integration with Apache Spark, allowing you to easily access your Redshift data on both Amazon Redshift provisioned clusters and Amazon Redshift Serverless. This integration expands the possibilities for AWS analytics and machine learning (ML) solutions, making the data warehouse accessible to a broader range of applications.

With the Amazon Redshift integration for Apache Spark, you can quickly get started and effortlessly develop Spark applications using popular languages like Java, Scala, Python, SQL, and R. Your applications can seamlessly read from and write to your Amazon Redshift data warehouse while maintaining optimal performance and transactional consistency. Additionally, you’ll benefit from performance improvements through pushdown optimizations, further enhancing the efficiency of your operations.

Capitec, South Africa’s biggest retail bank with over 21 million retail banking clients, aims to provide simple, affordable and accessible financial services in order to help South Africans bank better so that they can live better. In this post, we discuss the successful integration of the open source Amazon Redshift connector by Capitec’s shared services Feature Platform team. As a result of utilizing the Amazon Redshift integration for Apache Spark, developer productivity increased by a factor of 10, feature generation pipelines were streamlined, and data duplication reduced to zero.

The business opportunity

There are 19 predictive models in scope for utilizing 93 features built with AWS Glue across Capitec’s Retail Credit divisions. Feature records are enriched with facts and dimensions stored in Amazon Redshift. Apache PySpark was selected to create features because it offers a fast, decentralized, and scalable mechanism to wrangle data from diverse sources.

These production features play a crucial role in enabling real-time fixed-term loan applications, credit card applications, batch monthly credit behavior monitoring, and batch daily salary identification within the business.

The data sourcing problem

To ensure the reliability of PySpark data pipelines, it’s essential to have consistent record-level data from both dimensional and fact tables stored in the Enterprise Data Warehouse (EDW). These tables are then joined with tables from the Enterprise Data Lake (EDL) at runtime.

During feature development, data engineers require a seamless interface to the EDW. This interface allows them to access and integrate the necessary data from the EDW into the data pipelines, enabling efficient development and testing of features.

Previous solution process

In the previous solution, product team data engineers spent 30 minutes per run to manually expose Redshift data to Spark. The steps included the following:

Construct a predicated query in Python.
Submit an UNLOAD query via the Amazon Redshift Data API.
Catalog data in the AWS Glue Data Catalog via the AWS SDK for Pandas using sampling.

This approach posed issues for large datasets, required recurring maintenance from the platform team, and was complex to automate.

Current solution overview

Capitec was able to resolve these problems with the Amazon Redshift integration for Apache Spark within feature generation pipelines. The architecture is defined in the following diagram.

The workflow includes the following steps:

Internal libraries are installed into the AWS Glue PySpark job via AWS CodeArtifact.
An AWS Glue job retrieves Redshift cluster credentials from AWS Secrets Manager and sets up the Amazon Redshift connection (injects cluster credentials, unload locations, file formats) via the shared internal library. The Amazon Redshift integration for Apache Spark also supports using AWS Identity and Access Management (IAM) to retrieve credentials and connect to Amazon Redshift.
The Spark query is translated to an Amazon Redshift optimized query and submitted to the EDW. This is accomplished by the Amazon Redshift integration for Apache Spark.
The EDW dataset is unloaded into a temporary prefix in an Amazon Simple Storage Service (Amazon S3) bucket.
The EDW dataset from the S3 bucket is loaded into Spark executors via the Amazon Redshift integration for Apache Spark.
The EDL dataset is loaded into Spark executors via the AWS Glue Data Catalog.

These components work together to ensure that data engineers and production data pipelines have the necessary tools to implement the Amazon Redshift integration for Apache Spark, run queries, and facilitate the unloading of data from Amazon Redshift to the EDL.

Using the Amazon Redshift integration for Apache Spark in AWS Glue 4.0

In this section, we demonstrate the utility of the Amazon Redshift integration for Apache Spark by enriching a loan application table residing in the S3 data lake with client information from the Redshift data warehouse in PySpark.

The dimclient table in Amazon Redshift contains the following columns:

ClientKey – INT8
ClientAltKey – VARCHAR50
PartyIdentifierNumber – VARCHAR20
ClientCreateDate – DATE
IsCancelled – INT2
RowIsCurrent – INT2

The loanapplication table in the AWS Glue Data Catalog contains the following columns:

RecordID – BIGINT
LogDate – TIMESTAMP
PartyIdentifierNumber – STRING

The Redshift table is read via the Amazon Redshift integration for Apache Spark and cached. See the following code:

import pyspark.sql.functions as F
from pyspark.sql import SQLContext
sc = # existing SparkContext
sql_context = SQLContext(sc)

secretsmanager_client = boto3.client('secretsmanager')
secret_manager_response = secretsmanager_client.get_secret_value(
    SecretId='string',
    VersionId='string',
    VersionStage='string'
)
username = # get username from secret_manager_response
password = # get password from secret_manager_response
url = "jdbc:redshift://redshifthost:5439/database?user=" + username + "&password=" + password

read_config = {
    "url": url,
    "tempdir": "s3://<capitec-redshift-temp-bucket>/<uuid>/",
    "unload_s3_format": "PARQUET"
}

d_client = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .options(**read_config)
    .option("query", f"select * from edw_core.dimclient")
    .load()
    .where((F.col("RowIsCurrent") == 1) & (F.col("isCancelled") == 0))
    .select(
        F.col("PartyIdentifierNumber"),
        F.col("ClientCreateDate")
    )
    .cache()
)

Loan application records are read in from the S3 data lake and enriched with the dimclient table on Amazon Redshift information:

import pyspark.sql.functions as F
from awsglue.context import GlueContext
from pyspark import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

push_down_predicate = (
    f"meta_extract_start_utc_ms between "
    f"'2023-07-12"
    f" 18:00:00.000000' and "
    f"'2023-07-13 06:00:00.000000'"
)

database_name="loan_application_system"
table_name="dbo_view_loan_applications"
catalog_id = # Glue Data Catalog

# Selecting only the following columns
initial_select_cols=[
            "RecordID",
            "LogDate",
            "PartyIdentifierNumber"
        ]

d_controller = (glue_ctx.create_dynamic_frame.from_catalog(catalog_id=catalog_id,
                                            database=database_name,
                                            table_name=table_name,
                                            push_down_predicate=push_down_predicate)
                .toDF()
                .select(*initial_select_cols)
                .withColumn("LogDate", F.date_format("LogDate", "yyyy-MM-dd").cast("string"))
                .dropDuplicates())

# Left Join on PartyIdentifierNumber and enriching the loan application record
d_controller_enriched = d_controller.join(d_client, on=["PartyIdentifierNumber"], how="left").cache()

As a result, the loan application record (from the S3 data lake) is enriched with the ClientCreateDate column (from Amazon Redshift).

How the Amazon Redshift integration for Apache Spark solves the data sourcing problem

The Amazon Redshift integration for Apache Spark effectively addresses the data sourcing problem through the following mechanisms:

Just-in-time reading – The Amazon Redshift integration for Apache Spark connector reads Redshift tables in a just-in-time manner, ensuring the consistency of data and schema. This is particularly valuable for Type 2 slowly changing dimension (SCD) and timespan accumulating snapshot facts. By combining these Redshift tables with the source system AWS Glue Data Catalog tables from the EDL within production PySpark pipelines, the connector enables seamless integration of data from multiple sources while maintaining data integrity.
Optimized Redshift queries – The Amazon Redshift integration for Apache Spark plays a crucial role in converting the Spark query plan into an optimized Redshift query. This conversion process simplifies the development experience for the product team by adhering to the data locality principle. The optimized queries use the capabilities and performance optimizations of Amazon Redshift, ensuring efficient data retrieval and processing from Amazon Redshift for the PySpark pipelines. This helps streamline the development process while enhancing the overall performance of the data sourcing operations.

Gaining the best performance

The Amazon Redshift integration for Apache Spark automatically applies predicate and query pushdown to optimize performance. You can gain performance improvements by using the default Parquet format used for unloading with this integration.

For additional details and code samples, refer to New – Amazon Redshift Integration with Apache Spark.

Solution Benefits

The adoption of the integration yielded several significant benefits for the team:

Enhanced developer productivity – The PySpark interface provided by the integration boosted developer productivity by a factor of 10, enabling smoother interaction with Amazon Redshift.
Elimination of data duplication – Duplicate and AWS Glue cataloged Redshift tables in the data lake were eliminated, resulting in a more streamlined data environment.
Reduced EDW load – The integration facilitated selective data unloading, minimizing the load on the EDW by extracting only the necessary data.

By using the Amazon Redshift integration for Apache Spark, Capitec has paved the way for improved data processing, increased productivity, and a more efficient feature engineering ecosystem.

Conclusion

In this post, we discussed how the Capitec team successfully implemented the Apache Spark Amazon Redshift integration for Apache Spark to simplify their feature computation workflows. They emphasized the importance of utilizing decentralized and modular PySpark data pipelines for creating predictive model features.

Currently, the Amazon Redshift integration for Apache Spark is utilized by 7 production data pipelines and 20 development pipelines, showcasing its effectiveness within Capitec’s environment.

Moving forward, the shared services Feature Platform team at Capitec plans to expand the adoption of the Amazon Redshift integration for Apache Spark in different business areas, aiming to further enhance data processing capabilities and promote efficient feature engineering practices.

For additional information on using the Amazon Redshift integration for Apache Spark, refer to the following resources:

About the Authors

Preshen Goobiah is the Lead Machine Learning Engineer for the Feature Platform at Capitec. He is focused on designing and building Feature Store components for enterprise use. In his spare time, he enjoys reading and traveling.

Johan Olivier is a Senior Machine Learning Engineer for Capitec’s Model Platform. He is an entrepreneur and problem-solving enthusiast. He enjoys music and socializing in his spare time.

Sudipta Bagchi is a Senior Specialist Solutions Architect at Amazon Web Services. He has over 12 years of experience in data and analytics, and helps customers design and build scalable and high-performant analytics solutions. Outside of work, he loves running, traveling, and playing cricket. Connect with him on LinkedIn.

Syed Humair is a Senior Analytics Specialist Solutions Architect at Amazon Web Services (AWS). He has over 17 years of experience in enterprise architecture focusing on Data and AI/ML, helping AWS customers globally to address their business and technical requirements. You can connect with him on LinkedIn.

Vuyisa Maswana is a Senior Solutions Architect at AWS, based in Cape Town. Vuyisa has a strong focus on helping customers build technical solutions to solve business problems. He has supported Capitec in their AWS journey since 2019.

AWS Big Data Blog