How PayU Finance Built a Modern Data Platform with AWS Data Lab
"The AWS Data Lab collaborated with us and helped us create the architecture and foundation for a data platform that is custom-fit to our needs and has scaled to meet our growing customer base."
Chief Data Scientist, PayU Finance
This is a guest post by Praveen Singh, Director of Data Engineering at PayU Finance.
PayU Finance is a FinTech organization that has products for both Transactional Credit and Personal Loans. It extends credit to Indian customers not covered by traditional banks via its various lending programs, including unsecured personal loans. We leverage AWS cloud infrastructure, analytics, and machine learning services to power our suite of products for digital lending. This case study outlines how PayU Finance was able to reinvent its data platform to empower various stakeholders in the organization to make data-driven decisions and provide a great customer experience. The data platform includes features that give us complete control over our data and help us get the most out of it.
In this case study, we'll cover:
- An overview of the core architecture
- How PayU Finance designed and built the core architecture leveraging AWS Data Lab
- How PayU Finance is unlocking business value with the new architecture
Overview of the core architecture
The AWS Data Lab helped PayU Finance build our core data platform architecture, which set the foundation for us to continue evolving the solution to support new features, scalability, and performance requirements over time. PayU Finance started by designing and building an architecture that follows the de-identified data lake (DIDL) approach. We used this approach to ensure proper security and compliance controls were in place for handling sensitive Personally Identifiable Information (PII) throughout the data lifecycle. The architecture powers an internal data marketplace using a hub-and-spoke design: the hub is our central data lake in a producer AWS account, which hosts the raw data assets after de-identification and cleansing; the spokes are team- and business-specific consumer AWS accounts that consume raw data from the central data lake for their specific business processing requirements. The benefit of this architecture is its flexibility to adapt to a wide range of use cases and to allow the replacement of any component or service as we grow.
AWS Data Lab experience
In February 2022, PayU Finance participated in a Build Lab with the AWS Data Lab to receive guidance on how to build this data platform. A dedicated AWS Data Lab Architect was assigned to our engagement and we began a series of prep calls in the weeks leading up to our lab to set our team up for a successful prototype implementation. The PayU Finance team prioritized a few representative use cases to focus on in the lab and during these prep calls, we provided detailed functional and non-functional requirements for our use cases. Based on the requirements, our Data Lab Architect deep-dived to understand the data lifecycle, reports mapping, and transformation requirements. Our Data Lab Architect then proposed an initial architecture as a starting point and we defined the scope of the prototype from there.
Figure 1: The initial architecture designed and developed with the AWS Data Lab
The initial architecture, depicted in Figure 1, included the following components:
- The data from our source relational database tables is ingested into the Amazon Simple Storage Service (Amazon S3) landing zone in the central data lake account (hub). AWS Database Migration Service (AWS DMS) is used for a one-time full load and ongoing change data capture.
- The source data includes sensitive PII. AWS Glue ETL jobs pseudonymize specific PII in the data, creating additional attributes in the records. The output is written to the Amazon S3 raw zone. AWS Glue uses job bookmarks to track data already processed and support incremental processing.
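The pseudonymization step can be sketched in plain Python. This is only an illustration of the technique described above: the column names and the salted SHA-256 scheme are assumptions on our part (the post does not specify the hashing method), and the actual jobs run as AWS Glue ETL on Spark.

```python
import hashlib

# Hypothetical PII column names -- the real schema is not shown in this post.
PII_COLUMNS = {"customer_name", "phone_number", "email"}

def pseudonymize(record: dict, salt: str) -> dict:
    """Add a salted SHA-256 hash attribute for each PII column in the record.

    The original PII values are kept at this stage; a later cleansing job
    strips them out into the vault zone.
    """
    out = dict(record)
    for col in PII_COLUMNS & record.keys():
        digest = hashlib.sha256((salt + str(record[col])).encode("utf-8")).hexdigest()
        out[f"{col}_hash"] = digest
    return out
```

Because the hash is deterministic for a given salt, downstream joins on the pseudonymized attribute still work without exposing the underlying PII.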
- AWS Glue ETL jobs cleanse the data and prepare it to perform business transformations. Additionally, the jobs strip and separate PII into a specific Amazon S3 zone called a "vault zone". The vault zone stores the mapping of actual PII and hashed values. This helps provide controlled access to the PII. The cleansed and prepared pseudonymized data is written into our Amazon S3 "trusted zone". The trusted zone is available to various data consumers for their specific business processing needs.
- Data in our Amazon S3 trusted zone is catalogued using AWS Glue Data Catalog to allow consumption by other AWS services. AWS Lake Formation is used for access control at the database, table, and column level.
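Column-level grants of this kind are expressed through the Lake Formation `GrantPermissions` API. The sketch below only builds the request payload (the role ARN, database, table, and column names are placeholders, not our actual resources); in practice it would be passed to a boto3 `lakeformation` client's `grant_permissions` call.

```python
def lf_column_grant(principal_arn: str, database: str, table: str, columns: list) -> dict:
    """Build a lakeformation:GrantPermissions payload granting SELECT
    on specific columns of a Data Catalog table to one principal."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }

# Example: allow an analyst role to read only non-PII columns.
payload = lf_column_grant(
    "arn:aws:iam::111122223333:role/analyst",  # placeholder account and role
    "trusted_zone_db",
    "loans",
    ["loan_id", "amount", "status"],
)
```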
- In a separate consumer AWS account, AWS Glue ETL jobs perform business transformations, storing the output in an Amazon S3 "refined zone". Apache Hudi, integrated with the AWS Glue ETL jobs, allows efficient transactional updates on the refined zone. AWS Lake Formation enables cross-account data access using Lake Formation tags, and the AWS Glue Data Catalog stores the table schemas.
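A Hudi upsert from a Glue Spark job is typically configured through write options such as the following. The table name, key fields, and S3 path are illustrative assumptions; the post does not show the real job configuration.

```python
# Illustrative Hudi write options for an upsert into the refined zone.
# Table name, key fields, and paths are placeholders, not PayU Finance's config.
hudi_options = {
    "hoodie.table.name": "refined_loans",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "loan_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "ingest_date",
}

# In the Glue job, this dict would be applied to a Spark DataFrame writer:
#   df.write.format("hudi").options(**hudi_options) \
#     .mode("append").save("s3://refined-zone-bucket/loans/")
```

The `precombine.field` tells Hudi which record wins when the same key arrives more than once, which is what makes change-data-capture updates on S3 practical.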
- The PayU Finance team runs queries on the Amazon S3 refined zone using Amazon Athena and Amazon Redshift Spectrum, and consumes the results via our BI tool of choice.
- The data pipeline of AWS Glue ETL jobs in each account is orchestrated using AWS Step Functions.
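Such an orchestration can be sketched as an Amazon States Language definition that runs the Glue jobs in sequence via Step Functions' optimized Glue integration. The job names are placeholders, and retries and error handling are omitted for brevity.

```python
# Minimal Amazon States Language definition chaining two Glue jobs.
# Job names are placeholders; retries and error handling are omitted.
GLUE_SYNC = "arn:aws:states:::glue:startJobRun.sync"

state_machine = {
    "Comment": "Run de-identification, then cleansing, as one pipeline.",
    "StartAt": "Pseudonymize",
    "States": {
        "Pseudonymize": {
            "Type": "Task",
            "Resource": GLUE_SYNC,
            "Parameters": {"JobName": "pseudonymize-pii"},
            "Next": "CleanseAndVault",
        },
        "CleanseAndVault": {
            "Type": "Task",
            "Resource": GLUE_SYNC,
            "Parameters": {"JobName": "cleanse-and-vault"},
            "End": True,
        },
    },
}
```

The `.sync` suffix makes each state wait for the Glue job run to finish before moving on, so downstream jobs never read half-written zones.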
Architecture evolution post-lab
From the beginning, we wanted to be able to quickly roll out the initial version of this platform. Only two weeks after our Build Lab, we were able to start using our solution in production for a small share of jobs. The solution worked well for our use cases, and soon we were eager to go even deeper and optimize each layer for performance, scale, and security. We wanted to introduce even more elements of a modern data lake: features like enterprise metadata management and central orchestration.
These are some of the core principles that we applied post-lab to take our solution to the next level:
- Re-use the foundation of the current architecture and build more functionality and solutions on top of it. We made minimal changes to the existing functionality, so we didn't need to modify the already running data pipelines.
- Run native Apache Spark 3.x with compute provisioned on demand. The best-fit choice for us is Apache Spark on Kubernetes.
- Use configuration to drive data transformations. Commonly used transformations, such as flattening a JSON column, are reused across pipelines by passing configuration.
- Focus developer effort on developing and testing the core logic, which is shared by the business teams that consume datasets from the data lake.
- Group approval requests together to minimize the number of times individual approvals are submitted (being a FinTech company means we have many compliance requirements to follow regarding approval processes).
- Avoid unnecessary data replication while still maintaining a replica of the source. Effectively and easily manage dependencies between datasets.
- Optimize for cost. Leverage Amazon EC2 Spot Instances to the maximum extent.
- Save and index the logs for each job in our central logging system for reference, and to facilitate simple debugging of failed jobs.
- Maximize ease of deployment for the central orchestration tool, and minimize effort on integrations.
- Include alerts via email and messaging, and have highly effective monitoring and auditing functions in place.
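The configuration-driven transformation principle above can be illustrated with the JSON-flattening example: rather than writing a new job per dataset, one generic function is driven by a small config. The function name and config shape here are our own illustrative assumptions, not the actual implementation.

```python
import json

def flatten_json_column(record: dict, config: dict) -> dict:
    """Flatten a nested JSON string column into top-level fields, driven by config.

    config = {"column": <name of the JSON string column>,
              "prefix": <prefix for the flattened field names>}
    """
    out = {k: v for k, v in record.items() if k != config["column"]}
    nested = json.loads(record[config["column"]])
    for key, value in nested.items():
        out[f"{config['prefix']}{key}"] = value
    return out

# The same function serves different pipelines with different configs:
row = {"loan_id": 7, "device": '{"os": "android", "model": "m1"}'}
flat = flatten_json_column(row, {"column": "device", "prefix": "device_"})
# flat == {"loan_id": 7, "device_os": "android", "device_model": "m1"}
```

Each pipeline then only declares its config, so new datasets can reuse tested transformation code without any new job logic.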
Figure 2: PayU Finance's final architecture post-lab
Unlocking business value
We started this project with the desire to design and develop a best-in-class modern data platform that could operate at scale for several years and provide us with the level of governance and auditing we require. The architecture that we co-created with AWS Data Lab was not only a perfect fit, but it also gave us a foundation to continue building on in the future. After quickly bringing the initial solution to production, we continued learning in iterations. We ideated, developed, integrated, and deployed several additional features in the months that followed our lab. And today, we have reached a stage where we have a best-in-class modern data platform ready and live in production.
The Data Engineering team at PayU Finance is impressed to see how the AWS Data Lab was able to help us architect the foundation of a data platform that is custom-built for FinTech requirements and that scales with our needs. It was a great collaboration on a challenging project.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
About the Author
Praveen Singh is a Director of Data Engineering at PayU Finance, where he leads the data platform initiatives. He, along with PayU Finance's Data team, worked with the AWS Data Lab to build this data platform. Praveen has around a decade of experience architecting and shipping distributed and big data applications for both large enterprises and startups. He has built data-driven solutions across business-critical verticals and empowered management to make smarter and more informed decisions. In the past, he led Data Engineering at Dunzo and Fractal.ai.
About AWS Data Lab
AWS Data Lab offers accelerated, joint engineering engagements between customers and AWS technical resources to create tangible deliverables that accelerate databases, analytics, AI/ML, and application & infrastructure modernization initiatives. During the lab, AWS Data Lab Solutions Architects and AWS service experts support the customer by providing prescriptive architectural guidance, sharing best practices, and removing technical roadblocks. Customers leave the engagement with a prototype that is custom fit to their needs, a path to production, deeper knowledge of AWS services, and new relationships with AWS service experts.
Companies of all sizes across all industries are transforming their businesses every day using AWS. Contact our experts and start your own AWS Cloud journey today.