AWS Storage Blog
How Delhivery migrated 500 TB of data across AWS Regions using Amazon S3 Replication
Delhivery is one of the largest third-party logistics providers in India. It fulfills millions of packages every day, serving over 18,000 pin codes in India through more than 20 automated sort centers, 90 warehouses, and over 2,800 delivery centers.
Data is at the core of Delhivery’s business. In response to recent regulatory changes, Delhivery needed to migrate its critical data to reside within India. With Amazon S3, Delhivery was able to use features like S3 Replication and S3 Batch Operations to migrate over 500 TB of data and over 70 million objects across AWS Regions to achieve compliance with the regulations.
In this post, we will explore the various approaches, migration phases, and key learnings that enabled Delhivery to successfully migrate their data across AWS Regions. The migration process involved implementing near real-time replication mechanisms to ensure seamless data synchronization between the source and destination Regions, while also maintaining uninterrupted service for downstream systems and applications. Ultimately, this enabled Delhivery to achieve data residency requirements mandated by Indian regulators.
Delhivery’s use case
Delhivery had set up its data lake in the US East (Northern Virginia) Region. Over the years, this data lake had scaled exponentially, reaching more than 500 TB in size with over 70 million objects, powering critical data and analytics applications for the business. In 2023, India enacted the Digital Personal Data Protection (DPDP) Act, mandating that organizations store critical data within Indian geographical boundaries.
To ensure compliance with this new regulation, migrating Delhivery’s data lake to an AWS Region in India became a top priority for the company. We laid out a set of key requirements to guide this complex data migration project. For example, the solution should keep the data in the source and destination locations synchronized until the data pipeline was fully migrated to the target Region. Furthermore, existing applications relying on the data in the source Region should not be impacted.
Existing setup
The existing data platform fetches data through Change Data Capture (CDC) connectors from multiple data sources, which then push relevant events and payloads to Amazon Managed Streaming for Apache Kafka (Amazon MSK). From there, the data gets stored in Amazon S3 buckets comprising the data lake. Using Kafka Connect transformations, the data is transformed into a batch or merge layer for further consumption, as shown in Figure 1.
Figure 1: Delhivery’s Data Ingestion Pipeline with Apache Kafka, Amazon MSK, and Amazon S3
Regarding scale, Delhivery has over 800 data pipelines that ingest 60,000 messages per second and process over 350 GB of data every day.
Determining the best approach
To address the requirement of migrating data across AWS Regions while ensuring consistency, compliance, and cost-effectiveness, we evaluated several approaches. The key strategies considered for this migration were:
Approach 1: Running duplicate connectors
In this approach, we proposed populating historical data from the source US East (Northern Virginia) Region to the target Asia Pacific (Mumbai) Region using S3 Batch Replication, and then setting up duplicate connectors in the target Asia Pacific (Mumbai) Region for live data replication. Once the read workload has been successfully migrated to the target Asia Pacific (Mumbai) Region, we could disable the connector in the source US East (Northern Virginia) Region.
However, implementing duplicate connectors for each AWS Region would introduce unnecessary complexity to our current architecture and increase overall costs.
Approach 2: Cross-Region data consumption
In this approach, the plan was to use AWS Glue to serve both historical and recent data from two separate buckets: one for source data (historical) and another for target data (recent). Specifically, we would point recent partitions to the target Asia Pacific (Mumbai) Region bucket while leaving old partitions unchanged, so that they continued to point to the source US East (Northern Virginia) Region bucket. Then, using S3 Batch Replication, we would replicate the historical data from the source bucket to the target Region bucket.
The major downside of this approach is the high data transfer cost for the read workload, as the data would be read from the Asia Pacific (Mumbai) Region from day 1. If d is the volume of data scanned on day 1 that incurs the data transfer cost, then this volume would grow up to n * d, where n is the number of days the migration activity takes. Therefore, if the migration took longer than expected, the data transfer costs could become large.
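To make the n * d growth concrete, here is a minimal Python sketch. It assumes one reading of the estimate above: the cross-Region scan stays at roughly d per day, so the cumulative volume after n days is n * d. The function name, the sample volumes, and the per-GB price are hypothetical placeholders, not Delhivery’s figures or actual AWS pricing.

```python
def cumulative_transfer(daily_scan_gb: float, days: int, price_per_gb: float):
    """Return (total GB read across Regions, estimated cost) after `days` days.

    Assumes a constant daily cross-Region scan of `daily_scan_gb` (the "d"
    in the text), so the total after n days is n * d.
    """
    total_gb = daily_scan_gb * days           # n * d
    return total_gb, total_gb * price_per_gb  # cost scales linearly with n


# Hypothetical inputs: 1,000 GB scanned per day at a placeholder price per GB.
for n in (10, 30, 45):
    total_gb, cost = cumulative_transfer(1_000, n, 0.02)
    print(f"day {n}: {total_gb:,.0f} GB transferred, estimated cost {cost:,.2f}")
```

The point of the sketch is only that the bill grows linearly with the length of the migration window, which is why a schedule slip is costly under this approach.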
Approach 3: Amazon S3 Cross-Region Replication and S3 Batch Replication (recommended)
In this approach, we explored using S3 Cross-Region Replication (CRR) for live data replication, so that fresh data written by the connectors is transmitted in near real time to the Asia Pacific (Mumbai) Region. Then, using S3 Batch Replication, we would load the historical data from the US East (Northern Virginia) Region bucket to the Asia Pacific (Mumbai) Region bucket. One downside of this approach is the potential lag in the data pipeline for consumers in the target Region while the data is being replicated across Regions.
Existing data pipelines need the replication lag to be no more than five minutes. Through our load testing, we found that S3 CRR replicated objects within five minutes. Therefore, we chose this approach as our primary migration strategy.
The planning phase
After finalizing the approach to use S3 Cross-Region Replication and S3 Batch Replication, we began by conducting a comprehensive assessment of the data. This involved diving into the data metrics and identifying the specific buckets, prefixes, and tables that would be part of the migration process.
After analyzing the data usage patterns using our Apache Presto query layer, we determined that migrating the top 1% most frequently accessed historical data (the 99th percentile by access frequency) would best meet our requirements. We also identified the non-transactional data flowing into certain S3 buckets that needed complete data migration from inception.
The execution phase
With preparation complete, we ventured into the execution phase, starting with completing our prerequisites.
Addressing prerequisites
First, we had to address our solution prerequisites:
Amazon S3: We created an Asia Pacific (Mumbai) Region bucket corresponding to each US East (Northern Virginia) Region bucket within the same account. To implement S3 CRR, we ensured that S3 Versioning was enabled for both the source and destination buckets, as it is a prerequisite for replication. Similarly, we disabled the S3 Lifecycle rules for both the Asia Pacific (Mumbai) Region and US East (Northern Virginia) Region buckets, as per the considerations for S3 Batch Replication. Finally, we created one artifact bucket in the US East (Northern Virginia) Region to store manifest files for S3 CRR and another artifact bucket in the Asia Pacific (Mumbai) Region to store the S3 Batch Replication results.
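The versioning prerequisite can be checked and applied programmatically. The following is a minimal sketch, not Delhivery’s actual tooling: it assumes `s3_client` is a boto3 S3 client (for example `boto3.client("s3")`), and the helper names and bucket names are hypothetical.

```python
def enable_versioning(s3_client, bucket: str) -> None:
    """Turn on S3 Versioning for one bucket."""
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )


def ensure_versioning(s3_client, buckets):
    """Enable versioning on every bucket that does not already report 'Enabled'.

    Returns the list of buckets that were changed, so the result can be logged.
    """
    changed = []
    for bucket in buckets:
        # The response has no "Status" key if versioning was never configured.
        status = s3_client.get_bucket_versioning(Bucket=bucket).get("Status")
        if status != "Enabled":
            enable_versioning(s3_client, bucket)
            changed.append(bucket)
    return changed


# Example usage (requires boto3 and AWS credentials; bucket names are placeholders):
#   import boto3
#   ensure_versioning(boto3.client("s3"), ["example-source-bucket", "example-destination-bucket"])
```

Both the source and the destination bucket need to pass this check before a replication configuration can be attached to the source bucket.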
AWS Identity and Access Management (IAM): To perform S3 CRR and S3 Batch Replication, we made sure that an IAM role had the appropriate permissions to execute the operations, such as read/write/delete access to the buckets, Amazon CloudWatch access, and permission to execute S3 CRR and S3 Batch Replication.
Monitoring: We created two Amazon Simple Notification Service (Amazon SNS) topics, one in the US East (Northern Virginia) Region (for monitoring ReplicationLatency metrics) and another in the Asia Pacific (Mumbai) Region (for tracking OperationsFailed metrics). We subscribed the appropriate email addresses to both SNS topics to receive alerts in case of anomalies.
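As an illustration of the S3 portion of such a role, the sketch below builds a permissions policy document as a Python dictionary. It is an assumption-laden starting point rather than the policy Delhivery used: the function name and bucket names are hypothetical, the action list covers only the core replication and batch replication actions, and access to the manifest and report buckets as well as CloudWatch is left out. The S3 User Guide remains the authoritative list.

```python
import json


def replication_role_policy(source_bucket: str, destination_bucket: str) -> dict:
    """Build a minimal IAM policy document for a replication role (sketch only)."""
    src = f"arn:aws:s3:::{source_bucket}"
    dst = f"arn:aws:s3:::{destination_bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Read the replication configuration and list the source bucket.
                "Effect": "Allow",
                "Action": ["s3:GetReplicationConfiguration", "s3:ListBucket"],
                "Resource": [src],
            },
            {   # Read source object versions; InitiateReplication is for batch replication.
                "Effect": "Allow",
                "Action": [
                    "s3:GetObjectVersionForReplication",
                    "s3:GetObjectVersionAcl",
                    "s3:GetObjectVersionTagging",
                    "s3:InitiateReplication",
                ],
                "Resource": [f"{src}/*"],
            },
            {   # Write replicas, delete markers, and tags to the destination bucket.
                "Effect": "Allow",
                "Action": ["s3:ReplicateObject", "s3:ReplicateDelete", "s3:ReplicateTags"],
                "Resource": [f"{dst}/*"],
            },
        ],
    }


# Placeholder bucket names; print the document for review before attaching it to a role.
print(json.dumps(replication_role_policy("example-source-bucket", "example-destination-bucket"), indent=2))
```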
Completing the migration
First, we set up S3 CRR. We then created S3 Batch Operations jobs to replicate the historical objects, in which we specified that the manifest be generated from the S3 CRR rules.
After generating a manifest file using S3 Batch Operations, we validated its contents by querying it with Amazon Athena. Following validation, we executed the job with specific filters applied, such as object creation time (to target historical data migration) and replication status (set to ‘not replicated’ to exclude objects that had already been replicated through S3 CRR).
We also set up CloudWatch alarms to monitor the replication metrics. After that, we created the tables in the Asia Pacific (Mumbai) Region over the desired S3 location and loaded the partitions in AWS Glue.
After creating the necessary tables, we performed an in-depth verification process to ensure data integrity. This involved verifying key metrics such as record counts at randomly selected partitions, file totals, and ETag validation. Upon completing these stages, we updated the read workload location from the US East (Northern Virginia) Region to the Asia Pacific (Mumbai) Region and enabled users to access data directly from the Asia Pacific (Mumbai) Region. Simultaneously, we migrated the connectors responsible for uploading data to S3 to the Asia Pacific (Mumbai) Region.
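A batch replication job with these filters could be described roughly as follows. This is a sketch under stated assumptions: the helper name, account ID, bucket names, role ARN, and report prefix are hypothetical, and the replication status value `NONE` is assumed to correspond to the ‘not replicated’ filter described above. The resulting dictionary is shaped to be passed to the `create_job` call of a boto3 `s3control` client.

```python
import uuid
from datetime import datetime, timezone


def batch_replication_job_request(account_id, source_bucket, manifest_bucket,
                                  report_bucket, role_arn, created_before):
    """Build keyword arguments for an S3 Batch Replication job (sketch only)."""
    return {
        "AccountId": account_id,
        "ConfirmationRequired": True,            # review the job before it runs
        "Operation": {"S3ReplicateObject": {}},  # replicate existing objects
        "Priority": 1,
        "RoleArn": role_arn,
        "ClientRequestToken": str(uuid.uuid4()),
        "ManifestGenerator": {
            "S3JobManifestGenerator": {
                "ExpectedBucketOwner": account_id,
                "SourceBucket": f"arn:aws:s3:::{source_bucket}",
                "EnableManifestOutput": True,    # keep the manifest for validation
                "ManifestOutputLocation": {
                    "Bucket": f"arn:aws:s3:::{manifest_bucket}",
                    "ManifestFormat": "S3InventoryReport_CSV_20211130",
                },
                "Filter": {
                    "EligibleForReplication": True,         # covered by a replication rule
                    "CreatedBefore": created_before,        # historical objects only
                    "ObjectReplicationStatuses": ["NONE"],  # not yet replicated
                },
            }
        },
        "Report": {
            "Enabled": True,
            "Bucket": f"arn:aws:s3:::{report_bucket}",
            "Format": "Report_CSV_20180820",
            "Prefix": "batch-replication-reports",
            "ReportScope": "AllTasks",
        },
    }


# Placeholder values throughout; the cutoff would be the time S3 CRR was enabled.
request = batch_replication_job_request(
    "111122223333", "example-source-bucket", "example-manifest-bucket",
    "example-report-bucket", "arn:aws:iam::111122223333:role/example-replication-role",
    datetime(2024, 1, 1, tzinfo=timezone.utc),
)
# To submit (requires boto3 and credentials): boto3.client("s3control").create_job(**request)
```

Saving the generated manifest is what makes the Athena validation step possible before the job is confirmed and run.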
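An alarm on replication lag could be defined along these lines. The sketch assumes replication metrics are enabled on the replication rule (otherwise the ReplicationLatency metric is not published), uses the five-minute lag requirement mentioned earlier as the threshold, and treats the helper name, alarm name, rule ID, and SNS topic ARN as hypothetical. The dictionary is shaped for the `put_metric_alarm` call of a boto3 CloudWatch client.

```python
def replication_latency_alarm(source_bucket, destination_bucket, rule_id,
                              sns_topic_arn, max_lag_seconds=300):
    """Build keyword arguments for a CloudWatch alarm on S3 replication lag."""
    return {
        "AlarmName": f"s3-replication-latency-{source_bucket}-{rule_id}",
        "Namespace": "AWS/S3",
        "MetricName": "ReplicationLatency",  # reported in seconds
        "Dimensions": [
            {"Name": "SourceBucket", "Value": source_bucket},
            {"Name": "DestinationBucket", "Value": destination_bucket},
            {"Name": "RuleId", "Value": rule_id},
        ],
        "Statistic": "Maximum",
        "Period": 60,                        # evaluate one-minute datapoints
        "EvaluationPeriods": 5,              # alarm after five breaching minutes
        "Threshold": float(max_lag_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic is not treated as an incident
        "AlarmActions": [sns_topic_arn],     # notify the SNS topic's subscribers
    }


# Placeholder identifiers; to create the alarm (requires boto3 and credentials):
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**alarm)
alarm = replication_latency_alarm(
    "example-source-bucket", "example-destination-bucket", "example-rule",
    "arn:aws:sns:us-east-1:111122223333:example-replication-alerts",
)
```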
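The file-total and ETag checks can be pictured as a comparison of two object listings. The sketch below is illustrative only: it assumes each listing has already been collected into a dictionary mapping object key to ETag (for example from paginated `list_objects_v2` calls or an inventory report), and that ETags are comparable across the two buckets for the objects in question. The function name and sample keys are hypothetical.

```python
def compare_listings(source: dict, destination: dict) -> dict:
    """Compare key -> ETag maps from the source and destination buckets."""
    # Keys present in the source but absent from the destination.
    missing = sorted(key for key in source if key not in destination)
    # Keys present in both buckets whose ETags differ.
    mismatched = sorted(
        key for key in source
        if key in destination and source[key] != destination[key]
    )
    return {
        "source_count": len(source),
        "destination_count": len(destination),
        "missing": missing,
        "etag_mismatch": mismatched,
    }


# Hypothetical sample listings for one partition.
source_listing = {"dt=2024-01-01/a.parquet": "e1", "dt=2024-01-01/b.parquet": "e2",
                  "dt=2024-01-01/c.parquet": "e3"}
destination_listing = {"dt=2024-01-01/a.parquet": "e1", "dt=2024-01-01/b.parquet": "eX"}

print(compare_listings(source_listing, destination_listing))
```

Running the comparison on randomly selected partitions, as described above, keeps the check cheap while still catching missing or mismatched objects.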
In this way, we successfully migrated over 500 TB of data spread across 70 million objects across AWS Regions in 45 days, without disrupting our extensive network of more than 800 data pipelines and applications. Our teams worked together to plan and execute every step carefully to ensure a smooth transition. Additionally, we maintained a backup of the data pipelines that drew data from the US East (Northern Virginia) Region as a contingency plan.
Learnings
Here are Delhivery’s learnings from its data replication journey:
- Execute S3 CRR before S3 Batch Replication, because data can be missed if S3 Batch Replication is executed first and S3 CRR afterward. For example, if S3 Batch Replication is started at 08:05:10 PM and S3 CRR is enabled at 08:05:58 PM, then with continuous data ingestion the objects written during that interval are missed.
- We learned to review the existing replication rules. We noticed that in certain buckets, objects were already being replicated. This meant that our batch operation was not executing due to the filter’s requirement for ‘not replicated’ objects. To resolve this, we revised the filter to consider all objects, regardless of their replication status.
- Versioning is a prerequisite for replication. However, once versioning is enabled, it’s crucial to make sure that the data within the bucket remains immutable (in other words, not deleted or moved). Otherwise, multiple non-current versions accumulate and the bucket size can grow rapidly, leading to increased storage costs. To mitigate this risk, we implemented S3 Lifecycle rules for non-current object versions to keep storage expenses under control.
- Prioritize S3 CRR rules correctly. It’s possible to add multiple S3 CRR rules to a bucket, each scoped to a specific prefix. However, each rule needs a unique priority number, because attempting to set up multiple S3 CRR rules with the same priority number results in an error.
- Live replication is needed even if live data is not there. The existing objects of a prefix are only replicated through S3 Batch Replication if an S3 CRR rule is enabled for that prefix. For example, consider an S3 prefix where live data isn’t being generated, and we want to replicate historical data. To achieve this, we need to set up an S3 CRR rule over the prefix, making sure that it’s included in the manifest file created by the batch operation.
- Use the S3 Replication configuration to generate the manifest file. We had three options to create an S3 Batch Replication manifest: S3 Inventory report, manual CSV generation, or using an S3 Replication configuration. We chose the third option because manually creating a CSV manifest would have been impractical due to the sheer number of objects.
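The learnings about per-prefix rules and unique priorities can be sketched as a small builder for a replication configuration. This is an illustrative sketch, not Delhivery’s configuration: the helper name, role ARN, bucket name, and prefixes are hypothetical, and delete marker replication is set to disabled as an assumption. The output is shaped for the `ReplicationConfiguration` argument of boto3’s `put_bucket_replication`.

```python
def build_replication_config(role_arn: str, destination_bucket: str,
                             prefix_priorities: dict) -> dict:
    """Build one replication rule per prefix, rejecting duplicate priorities."""
    priorities = list(prefix_priorities.values())
    if len(set(priorities)) != len(priorities):
        # Fail locally instead of waiting for the service to reject the configuration.
        raise ValueError("each replication rule needs a unique priority")

    rules = [
        {
            "ID": f"crr-{prefix.strip('/').replace('/', '-') or 'root'}",
            "Priority": priority,
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
        }
        for prefix, priority in prefix_priorities.items()
    ]
    return {"Role": role_arn, "Rules": rules}


# Placeholder values. A prefix with no live data still gets a rule, so that
# S3 Batch Replication treats its existing objects as eligible.
config = build_replication_config(
    "arn:aws:iam::111122223333:role/example-replication-role",
    "example-destination-bucket",
    {"raw/": 1, "merged/": 2, "archive/": 3},
)
# To apply (requires boto3 and credentials):
#   boto3.client("s3").put_bucket_replication(
#       Bucket="example-source-bucket", ReplicationConfiguration=config)
```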
Conclusion
Delhivery’s massive cross-Region data migration project showcases the complexities and challenges associated with large-scale data migrations. Through careful planning, execution, and refinement, Delhivery ensured compliance with regulatory requirements while enhancing operational resilience and efficiency.
Key takeaways:
- S3 CRR was used for near real-time replication of live data from the source US East (Northern Virginia) Region to the target Asia Pacific (Mumbai) Region, while ensuring minimal lag in data pipelines.
- S3 Batch Operations was used to replicate historical data from the source to the target Region by generating and executing manifest files based on the S3 Replication configuration.
- Careful planning, phased execution, and meticulous monitoring were crucial in managing the complexities of such a large-scale data migration, ensuring minimal disruption to downstream systems and applications.
- Delhivery’s success was a result of a comprehensive approach that combined technical expertise, regulatory acumen, and collaborative teamwork.
To learn more about the permissions required to execute S3 Cross-Region Replication and S3 Batch Replication, check out the S3 User Guide. Additionally, explore best practices for managing replication rules to optimize your data replication strategy.