How Razorpay achieved 11% performance improvement and 21% cost reduction with Amazon EMR

This is a guest post by Narendra Kumar, Head of Platform – Data at Razorpay, in partnership with AWS.

In this post, we explore how Razorpay, India’s leading FinTech company, transformed their data platform by migrating from a third-party solution to Amazon EMR, unlocking improved performance and significant cost savings. We’ll walk through the architectural decisions that guided this migration, the implementation strategy, and the measurable benefits Razorpay achieved.

Founded in 2014, Razorpay has become a powerhouse in comprehensive payment solutions, enabling businesses to accept, process, and disburse payments online. With offerings like RazorpayX for business banking and Razorpay Capital for lending solutions, the company has experienced explosive growth, now serving millions of businesses. This rapid expansion brought significant data challenges. When Razorpay’s data platform began straining under the weight of more than 1PB daily processing demands, the engineering team faced a critical decision: continue scaling their existing third-party solution or modernize with a platform offering greater flexibility and control. They chose Amazon EMR to build a comprehensive data architecture spanning batch warehousing, real-time stream processing, and interactive analytics – all running on Apache Spark with open-source Delta Lake for ACID transactions. This wasn’t simply an ETL migration; it was a complete platform transformation that gave Razorpay’s 800 daily users access to more than 60 concurrent streaming pipelines, more than 3,000 orchestrated workflows, and the ability to query 6PB of data daily. The results validated their architectural choices: 11% better overall performance, 21% cost reduction, and the operational flexibility to optimize Spark resource allocation, leverage EC2 Spot instances, and implement advanced features like liquid clustering – all without vendor lock-in.

Achieving data insights cost-effectively with AWS

The data architecture has a data ingestion layer, data processing layer, and data consumption layer. Razorpay ingests more than 20 TB of new data every day, processes more than 1 PB of daily data using more than 60 data stream processing pipelines. This data is then consumed by querying more than 6 PB of daily data through more than 3,000 scheduled workflows.

Data flows from a variety of sources such as online transaction processing (OLTP) databases – traditional transactional or entity stores, events such as clickstream and application events, and third-party events like reverse extract, transform, and load (ETL). Most of the data consumption use cases power merchant reporting and internal analytics of the organization. The architecture powers a variety of data science use cases and financial infrastructure around a reconciliation service.

Solution overview

As shown in the following diagram, in its early stages, Razorpay operated on a small scale, using Sqoop to dump transactional data daily into a data lake and managing a Presto layer for querying this data. As they grew, the demand for near real-time data increased, prompting the setup of a change data capture (CDC) collector using Maxwell to stream data manipulation language (DML) events to Kafka. To further enhance data processing, Razorpay built a processing layer that consumed data from Kafka to UPSERT information into the lake using Apache Hudi.

Architecture diagram showing a five-layer big data processing pipeline: data stores feed into Kafka for message streaming, which connects to Apache Spark, Apache Hudi, and Sqoop for stream and batch processing, followed by a data storage and query layer using Apache Hive and Apache Presto, and finally a visualization layer with Looker, redash, and Qubole.

Additionally, the company onboarded data from third-party sources such as Freshdesk and Google Sheets and automated event ingestion from frontend applications using Lumberjack, thereby streamlining their data management processes.

As Razorpay scaled its operations, the demand for multiple real-time use cases became mission-critical, prompting the development of a robust data warehouse ingestion framework to efficiently ingest data into TiDB. To enhance service reliability and support dashboard querying, a low-latency, high-throughput service called Harvester was created, which stored pre-aggregated data for effective monitoring. Over time, reporting use cases emerged, leading to the use of a warehouse service to establish a denormalized report data layer while also exploring a real-time layer for dynamic insights. Additionally, to facilitate a smooth transition to microservices, Razorpay built a unified storage layer capable of supporting data from both its existing monolithic architecture and the new microservices, ensuring seamless integration and improved data accessibility across the organization.

Razorpay implemented a comprehensive data service migration to Amazon EMR using a phased approach. The solution architecture as shown in the following diagram comprises multiple layers handling data ingestion, processing, and consumption.

Technical implementation

A modern and scalable analytics platform focuses on real-time data ingestion, petabyte-scale processing, and cost-optimized storage – all orchestrated with robust workflow management:

Data ingestion layer

To handle large-scale and diverse data sources, they implemented a combination of CDC and file ingestion patterns:

CDC using Amazon Aurora MySQL-Compatible Edition – Used Debezium and Maxwell for low-latency replication and streaming of database changes
High-volume streaming pipelines – Configured streaming pipelines capable of processing more than 20 TB of daily inbound data
Third-party data integration: Implemented secure file push mechanisms to ingest partner and software as a service (SaaS) data into the service

Data processing layer

Razorpay designed the processing stack on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) with Spark as the primary compute engine

Batch warehousing – Daily ETL and aggregation jobs processing more than 1 PB of data
Stream processing – Real-time analytics pipelines across more than 60 concurrent processing streams
Delta merge operations – High-performance incremental updates across more than 25 Delta Lake tables

Data storage and organization

Their data storage follows the medallion architecture pattern layered on an Amazon Simple Storage Service (Amazon S3):

Raw zone – Immutable ingestion zone for original source data
Processed and aggregated zone – Optimized datasets ready for analytics and reporting
Open source software (OSS) Delta Lake format – Implemented open source Delta Lake for ACID transactions, schema enforcement, and faster query performance

Workflow orchestration

Complex data workflows are automated and monitored using a hybrid orchestration approach:

Apache Airflow integration – Scheduling and coordinating more than 3,000 workflows per day
dbt on Amazon EMR – SQL-based transformations for business logic and metric definitions
Specialized compliance jobs – Dedicated workflows meeting the 15-minute SLA for sensitive regulatory reporting

Performance optimizations

To ensure cost efficiency and high throughput, the following optimizations were applied:

Spark tuning – Custom configurations for executor memory, shuffle partitions, and serialization to maximize hardware utilization
Liquid clustering – Implemented in delta lake tables to improve query performance over large datasets
Optimized delta merges – Reduced merge latency for incremental updates.
Auto scaling – Dynamic scaling policies based on workload patterns to balance performance and cost

To enable a secure migration, they implemented Amazon EMR security best practices following AWS guidance on encryption, authentication, and authorization as documented in the Amazon EMR security best practices.

This architecture delivers low-latency ingestion, petabyte-scale processing, and robust workflow orchestration so that analytics teams can derive faster insights while maintaining compliance and optimizing for cost.

The combination of Debezium and Maxwell for CDC, Spark on Amazon EMR, OSS Delta Lake on Amazon S3, and Airflow with dbt has proven to be a scalable and resilient approach for modern data analytics workloads

Business Impact: What Amazon EMR Enabled

11% performance improvement enabling faster insights for 800 daily active users
13-15% faster execution for large warehouse jobs, accelerating time-to-insight for critical business decisions
21% cost reduction reinvested into product innovation for merchant customers
Seamless scaling from 20 TB to 1 PB+ daily processing without performance degradation
Enterprise reliability supporting 350,000 operational reports and compliance requirements

Key learnings and best practices

Throughout their migration to Amazon EMR, Razorpay learned valuable lessons that helped optimize their data platform. We are sharing these insights to help other customers accelerate their own modernization journeys while avoiding common pitfalls.

Infrastructure Stability and Performance

Optimizing Spark Resource Allocation – Razorpay initially assumed that Spark’s dynamic allocation would automatically optimize resource utilization. However, they discovered it introduced overhead that degraded performance for certain workload patterns. To address this challenge, they took two approaches depending on workload characteristics – setting explicit maxExecutors values for predictable workloads, and enabling maximizeResourceAllocation to create “fat executors” that fully utilized available cluster resources. These targeted configurations improved job execution times by 13-15% for large-scale data processing workloads.
Ensuring Stability with Yet Another Resource Negotiator (YARN) node labels – When using EC2 Spot instances for cost optimization, Razorpay encountered a critical issue in which Spot instance interruptions occasionally terminated nodes running critical driver containers, causing entire job failures. Their solution was elegant and effective. They configured YARN node labels to ensure driver containers always spawn on On-Demand Instances, while task nodes use cost-effective Spot capacity. This architecture delivered both cost efficiency and reliability, making their jobs resilient to Spot interruptions while maintaining 21% cost savings.
Managing Spot Instances Effectively – Razorpay’s initial approach of switching entirely to On-Demand Instances during Spot availability constraints eliminated the cost benefits they were seeking. They implemented several best practices to address this such as using instance fleets with allocation strategies (price-capacity optimized and capacity optimized) to maximize Spot availability, spreading primary instances across multiple Availability Zones for fault tolerance, and accepting that heterogeneous executors create varying executor sizes while planning capacity accordingly. They maintained high Spot utilization rates while ensuring workload continuity, achieving optimal price performance.

Cost Optimization

Achieving Sustainable Cost Efficiency – As data volumes grew to more than 20 TB daily, Razorpay needed to scale infrastructure while controlling costs. They implemented a comprehensive cost optimization strategy that included multiple components. First, they right-sized primary nodes by avoiding over-provisioning and selecting instance types matching actual workload requirements. They consolidated workloads by combining multiple jobs on fewer large clusters to maximize resource utilization. For SLA-sensitive jobs, they migrated to Amazon EKS and Amazon EMR Serverless for automatic scaling and pay-per-use pricing. They adopted Graviton instances, migrating compatible workloads to AWS Graviton processors for superior price-performance. Finally, they diversified instance fleets by employing multiple instance types to reduce Spot interruption impact.

These optimizations delivered 21% cost savings while supporting 800 daily active users and processing 1 PB of data daily. This enabled Razorpay to invest savings back into product innovation for their merchant customers, demonstrating how technical optimization directly translates to business value.

Conclusion

Razorpay’s migration to Amazon EMR demonstrates how the right data processing platform can transform business outcomes at scale. By achieving 11% better performance, 13-15% faster execution times, and 21% cost savings, EMR enabled Razorpay to build an enterprise-grade data platform that supports 800 daily users, more than 3,000 dashboards, and 10 million monthly queries.

To learn more about building similar data analytics solutions on AWS, check out the following resources.

Documentation:

AWS solutions:

Get started:

Visit the Amazon EMR product page
Explore Amazon EMR Notebooks overview

AWS Big Data Blog