The unified data environment gives more than 350 registered users straightforward access to data for reporting, production applications, machine learning (ML), and exploratory data analysis. Previously, accessing data meant obtaining permissions for numerous separate replicas.
To create a single source of truth, PayU built an extract, transform, load (ETL) pipeline using AWS services to load data into Amazon Redshift from around 40 production databases. (See Figure 1.) Data first flows into Amazon Managed Streaming for Apache Kafka (Amazon MSK), a service to securely stream data with fully managed, highly available Apache Kafka. Amazon MSK passes the data to Amazon EMR, a cloud big data solution, which stores it in Amazon Simple Storage Service (Amazon S3), an object storage solution. Data from Amazon S3 then goes to a central Amazon Redshift cluster.
PayU uses Amazon Redshift Data Sharing, which companies use to share data securely across warehouses without copying data, to help provide read workload isolation, governance, scaling, and seamless collaboration across multiple business intelligence or analytics clusters. For shared clusters, PayU uses Amazon Redshift RA3 instances with managed storage, which scale compute and storage independently. To provide fast query performance, lower costs, and reduced operational overhead, PayU uses Amazon Redshift Serverless, a service to get insights from data in seconds without having to manage data warehouse infrastructure. Using Amazon Redshift Serverless simplified cluster management and reduced costs by around 2,500 dollars per month.
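The source does not show how PayU configured data sharing. As a rough sketch of the general pattern, the Python helper below assembles the standard Amazon Redshift statements a producer cluster runs to publish a datashare and a consumer cluster runs to mount it. The helper name, share name, schema, database name, and namespace IDs are all hypothetical placeholders.

```python
def datashare_sql(share: str, schema: str, consumer_db: str,
                  producer_namespace: str, consumer_namespace: str) -> dict:
    """Build Redshift data sharing statements (illustrative names only)."""
    # Run on the producer (for example, an ETL cluster that owns the tables).
    producer = [
        f"CREATE DATASHARE {share};",
        f"ALTER DATASHARE {share} ADD SCHEMA {schema};",
        f"ALTER DATASHARE {share} ADD ALL TABLES IN SCHEMA {schema};",
        f"GRANT USAGE ON DATASHARE {share} TO NAMESPACE '{consumer_namespace}';",
    ]
    # Run on the consumer (for example, a reporting cluster with read workloads).
    consumer = [
        f"CREATE DATABASE {consumer_db} FROM DATASHARE {share} "
        f"OF NAMESPACE '{producer_namespace}';",
    ]
    return {"producer": producer, "consumer": consumer}


if __name__ == "__main__":
    # Placeholder namespace IDs; real values come from each cluster.
    statements = datashare_sql(
        share="example_share",
        schema="public",
        consumer_db="shared_db",
        producer_namespace="producer-namespace-id",
        consumer_namespace="consumer-namespace-id",
    )
    for role, sql_list in statements.items():
        print(f"-- {role}")
        print("\n".join(sql_list))
```

Because the consumer database only references the producer's managed storage, no table data is copied, which is what allows read workloads to be isolated on separate clusters.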
PayU also uses other Amazon Redshift features to solve challenges and streamline performance. For example, PayU was the first company in India to use Amazon Redshift streaming ingestion, which ingests streaming data into data warehouses to support near real-time insights and data visualizations. This feature makes data that is ingested using Amazon MSK available for analysis in Amazon Redshift within seconds without first being stored in a relational database. PayU also uses materialized views, which let users achieve significantly faster query performance for iterative or predictable workloads such as dashboarding and ETL data processing jobs. “We were at the forefront of evaluating the latest AWS Redshift features. We have explored it from the ground up,” says Priyank.
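The source does not include PayU's own statements for streaming ingestion. As a minimal sketch of how the feature is typically set up, the Python helper below builds two Amazon Redshift statements: an external schema mapped to an Amazon MSK cluster, and an auto-refreshing materialized view over one topic. The helper name, schema, topic, view name, and both ARNs are hypothetical placeholders, and IAM authentication is an assumption.

```python
def streaming_ingestion_sql(schema: str, topic: str, view: str,
                            iam_role_arn: str, msk_cluster_arn: str) -> list:
    """Build Redshift statements for MSK streaming ingestion (illustrative)."""
    # Step 1: map the MSK cluster into Redshift as an external schema.
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM MSK\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"AUTHENTICATION iam\n"
        f"CLUSTER_ARN '{msk_cluster_arn}';"
    )
    # Step 2: a materialized view that reads the topic and refreshes itself.
    # Rows that are not valid JSON are filtered out before parsing.
    create_view = (
        f"CREATE MATERIALIZED VIEW {view} AUTO REFRESH YES AS\n"
        f"SELECT kafka_partition, kafka_offset, kafka_timestamp,\n"
        f"       JSON_PARSE(kafka_value) AS payload\n"
        f'FROM {schema}."{topic}"\n'
        f"WHERE CAN_JSON_PARSE(kafka_value);"
    )
    return [create_schema, create_view]


if __name__ == "__main__":
    # Placeholder ARNs; substitute real values before running the SQL.
    for statement in streaming_ingestion_sql(
        schema="msk_schema",
        topic="example_topic",
        view="example_topic_mv",
        iam_role_arn="arn:aws:iam::123456789012:role/example-role",
        msk_cluster_arn="arn:aws:kafka:region:123456789012:cluster/example/id",
    ):
        print(statement, end="\n\n")
```

The materialized view is the landing point for the stream, so dashboards and ML features can query it like any other table while Redshift keeps it current.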
Using Amazon Redshift, PayU’s data environment is more robust than before. The company’s current Amazon Redshift configuration has five clusters with a total of 18 nodes. Two clusters are ETL clusters for data processing and write workloads, and the other three are consumer clusters for read workloads. These consumer clusters include one for exploratory data analytics, one for business reporting, and one for specialized data scientists. Together, all five clusters create a cohesive data sharing environment. PayU deals with billions of records, scanning around 200 TB of data daily. In March 2024, the solution handled 150,000 queries per day. “These results would not have been possible in the previous cluster implementations,” says Priyank. The company then reduced this volume to 35,000 queries per day by rationalizing unnecessary queries. “In 1 month, we cut down queries by 77 percent, which would have been a 6-month exercise in the previous environment,” says Priyank.
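The 77 percent figure follows from the two daily query volumes quoted above. A quick check in Python, where the function name is illustrative only:

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


if __name__ == "__main__":
    # Daily query volumes reported in the text.
    reduction = percent_reduction(150_000, 35_000)
    print(f"{reduction:.1f} percent")  # 76.7 percent, which rounds to 77
```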
The current environment is highly reliable, with a failure rate under 1 percent compared with a failure rate of 7–10 percent in PayU’s previous environment, where complex queries often got stuck and were canceled. Queries that previously took 10–15 minutes now take less than 1 minute on Amazon Redshift, so reports reach merchants more quickly. Data is now available in under 30 minutes, and in some cases, such as data streaming for ML, in under 5 seconds. Previously, data was updated only once per day. By implementing Amazon Redshift, PayU saved 20,000 dollars per month and reduced the time it took to manage and maintain the environment.
PayU uses Amazon Redshift to better understand and use its data. As a result, the data science team built an ML model to recommend payment gateways to consumers on merchant pages. The company can also monitor potential fraud more closely, and it built an ML model to predict the authenticity of international transactions. “We are able to prevent fraudulent transactions from taking place,” says Priyank.