AWS Big Data Blog

How Paytm modernized their data pipeline using Amazon EMR

This post was co-written by Rajat Bhardwaj, Senior Technical Account Manager at AWS and Kunal Upadhyay, General Manager at Paytm.

Paytm is India’s leading payment platform, pioneering the digital payment era in India with 130 million active users. Paytm operates multiple lines of business, including banking, digital payments, bill recharges, e-wallet, stocks, insurance, lending and mobile gaming. At Paytm, the Central Data Platform team is responsible for turning disparate data from multiple business units into insights and actions for their executive management and merchants, who are small, medium or large business entities accepting payments from the Paytm platforms.

The Data Platform team modernized their legacy data pipeline with AWS services. The data pipeline collects data from different sources and runs analytical jobs, generating approximately 250K reports per day, which are consumed by Paytm executives and merchants. The legacy data pipeline was set up on premises using a proprietary solution and didn’t utilize the open-source Hadoop stack components such as Spark or Hive. This legacy setup was resource-intensive, having high CPU and I/O requirements. Analytical jobs took approximately 8–10 hours to complete, which often led to Service Level Agreements (SLA) breaches. The legacy solution was also prone to outages due to higher than expected hardware resource consumption. Its hardware and software limitations impacted the ability of the system to scale during peak load. Data models used in the legacy setup processed the entire data every time, which led to an increased processing time.

In this post, we demonstrate how the Paytm Central Data Platform team migrated their data pipeline to AWS and modernized it using Amazon EMR, Amazon Simple Storage Service (Amazon S3) and underlying AWS Cloud infrastructure along with Apache Spark. We optimized the hardware usage and reduced the data analytical processing, resulting in shorter turnaround time to generate insightful reports, all while maintaining operational stability and scale irrespective of the size of daily ingested data.

Overview of solution

The key to modernizing a data pipeline is to adopt an optimal incremental approach, which helps reduce the end-to-end cycle to analyze the data and get meaningful insights from it. To achieve this state, it’s vital to ingest incremental data in the pipeline, process delta records and reduce the analytical processing time. We configured the data sources to inherit the unchanged records and tuned the Spark jobs to only analyze the newly inserted or updated records. We used temporal data columns to store the incremental datasets until they’re processed. Data intensive Spark jobs are configured in incremental on-demand deduplicating mode to process the data. This helps to eliminate redundant data tuples from the data lake and reduces the total data volume, which saves compute and storage capacity. We also optimized the scanning of raw tables to restrict the scans to only the changed record set which reduced scanning time by approximately 90%. Incremental data processing also helps to reduce the total processing time.

At the time of this writing, the existing data pipeline has been operationally stable for 2 years. Although this modernization was vital, there is a risk of an operational outage while the changes are being implemented. Data skewing needs to be handled in the new system by an appropriate scaling strategy. Zero downtime is expected from the stakeholders because the reports generated from this system are vital for Paytm’s CXO, executive management and merchants on a daily basis.

The following diagram illustrates the data pipeline architecture.

Benefits of the solution

The Paytm Central Data Office team, comprised of 10 engineers, worked with the AWS team to modernize the data pipeline. The team worked for approximately 4 months to complete this modernization and migration project.

Modernizing the data pipeline with Amazon EMR 6.3 helped efficiently scale the system at a lower cost. Amazon EMR managed scaling helped reduce the scale-in and scale-out time and increase the usage of Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances for running the Spark jobs. Paytm is now able to utilize a Spot to On-Demand ratio of 80:20, resulting in higher cost savings. Amazon EMR managed scaling also helped automatically scale the EMR cluster based on YARN memory usage with the desired type of EC2 instances. This approach eliminates the need to configure multiple Amazon EMR scaling policies tied to specific types of EC2 instances as per the compute requirements for running the Spark jobs.

In the following sections, we walk through the key tasks to modernize the data pipeline.

Migrate over 400 TB of data from the legacy storage to Amazon S3

Paytm team built a proprietary data migration application with the open-source AWS SDK for Java for Amazon S3 using the Scala programming language. This application can connect with multiple cloud providers , on-premises data centers and migrate the data to a central data lake built on Amazon S3.

Modernize the transformation jobs for over 40 data flows

Data flows are defined in the system for ingesting raw data, preprocessing the data and aggregating the data that is used by the analytical jobs for report generation. Data flows are developed using Scala programming language on Apache Spark. We use an Azkaban batch workflow job scheduler for ordering and tracking the Spark job runs. Workflows are created on Amazon EMR to schedule these Spark jobs multiple times during a day. We also implemented Spark optimizations to improve the operational efficiency for these jobs. We use Spark broadcast joins to handle the data skewness, which can otherwise lead to data spillage, resulting in extra storage needs. We also tuned the Spark jobs to avoid a  large number of small files, which is a known problem with Spark if not handled effectively. This is mainly because Spark is a parallel processing system and data loading is done through multiple tasks where each task can load into multiple partition. Data-intensive jobs are run using Spark stages.

The following is the code snippet for the Scala jobs:

  - name: jobC
    type: noop
    # jobC depends on jobA and jobB
      - jobA
      - jobB

  - name: jobA
    type: command
      command: echo "This is an echoed text."

  - name: jobB
    type: command
      command: pwd

Validate the data

Accuracy of the data reports is vital for the modern data pipeline. The modernized pipeline has additional data reconciliation steps to improve the correctness of data across the platform. This is achieved by having greater programmatic control over the processed data. We could only reconcile data for the legacy pipeline after the entire data processing was complete. However, the modern data pipeline enables all the transactions to be reconciled at every step of the transaction, which gives granular control for data validation. It also helps isolate the cause of any data processing errors. Automated tests were done before go-live to compare the data records generated by the legacy vs. the modern system to ensure data sanity. These steps helped ensure the overall sanity of the processed data by the new system. Deduplication of data is done frequently via on-demand queries to eliminate redundant data, thereby reducing the processing time. As an example, if there are transactions which are already consumed by the end clients but still a part of the data-set, these can be eliminated by the deduplication, resulting in processing of only the newer transactions for the end client consumption.

The following sample query uses Spark SQL for on-demand deduplication of raw data at the reporting layer:

Insert over table  <<table>>
select col1,col2,col3 ---...coln 
from (select t.*
            ,row_number() over(order by col) as rn 
      from <<table>>
     ) t
where rn = 1

What we achieved as part of the modernization

With the new data pipeline, we reduced the compute infrastructure by 400% which helps to save  compute cost. The earlier legacy stack was running on over 6,000 virtual cores. Optimization techniques helped to run the same system at an improved scale, with approximately 1,500 virtual cores. We are able to reduce the compute and storage capacity for 400 TB of data and 40 data flows after migrating to Amazon EMR. We also achieved Spark optimizations, which helped to reduce the runtime of the jobs by 95% (from 8–10 hours to 20–30 minutes), CPU consumption by 95%, I/O by 98% and overall computation time by 80%. The incremental data processing approach helped to scale the system despite data skewness, which wasn’t the case with the legacy solution.


In this post, we showed how Paytm modernized their data lake and data pipeline using Amazon EMR, Amazon S3, underlying AWS Cloud infrastructure and Apache Spark. Choice of these cloud & big-data technologies helped to address the challenges for operating a big data pipeline because the type and volume of data from disparate sources adds complexity to the analytical processing.

By partnering with AWS, the Paytm Central Data Platform team created a modern data pipeline in a short amount of time. It provides reduced data analytical times with astute scaling capabilities, generating high-quality reports for the executive management and merchants on a daily basis.

As next steps, do a deep dive bifurcating the data collection and data processing stages for your data pipeline system. Each stage of the data pipeline should be appropriately designed and scaled to reduce the processing time while maintaining integrity of the reports generated as an output.

If you have feedback about this post, submit comments in the Comments section below.

About the Authors

Rajat Bhardwaj is a Senior Technical Manager with Amazon Web Services based in India, having 23 years of work experience with multiple roles in software development, telecom, and cloud technologies. He works along with AWS Enterprise customers, providing advocacy and strategic technical guidance to help plan and build solutions using AWS services and best practices. Rajat is an avid runner, having competed several half and full marathons in recent years.

Kunal Upadhyay is a General Manager with Paytm Central Data Platform team based out of Bengaluru, India. Kunal has 16 years of experience in big data, distributed computing, and data intelligence. When not building software, Kunal enjoys travel and exploring the world, spending time with friends and family.