AWS Partner Network (APN) Blog

Accelerate Hadoop-to-Amazon EMR Migration Using Virtusa’s Migration Factory

By Suhrid Saran, Solution Architect, Data and Analytics – Virtusa
By Hussain Shabbir, AWS CoE Lead and Sr. Director – Virtusa
By Néstor Gándara, Sr. Global Partner Solutions Architect – AWS
By Dipankar Ghosal, Sr. Principal Data Architect – AWS


The global Hadoop-as-a-service market is growing at a compound annual growth rate (CAGR) of ~39% and is projected to reach $75 billion by 2026. Although the market is still growing, many small, mid-size, and large organizations are left out because of the inherent pains of migration.

Virtusa Corporation is an AWS Premier Tier Services Partner that provides digital business strategy, engineering, and IT services. Virtusa developed the Hadoop to Amazon EMR Migration Factory to help organizations move their Hadoop ecosystem to Amazon Web Services (AWS), catering to market growth and better value offerings.

Virtusa holds the Amazon EMR Service Delivery designation, and has eight AWS Competencies in areas like Migration, DevOps, Data and Analytics, and Digital Workplace.

In this post, we will discuss how Virtusa complements Amazon EMR Migration by providing approaches and utilities to streamline and manage large-scale Hadoop-to-Amazon EMR migrations by creating a migration framework of automated modules.

The following figure gives an overview of the migration framework:

Figure 1 – Overall view of migration framework.

Advantages of Migrating Big Data Applications to the Cloud

Productivity and Efficiency

  • Unified, streamlined, and secure platform ops for higher efficiency and productivity.
  • Rapid development of new apps.

Reduce Costs

  • Optimize infrastructure for the workload; only pay for what you use, thereby reducing the operational cost.
  • Scaled operations—build to fit for the cloud.
  • Dynamic resource allocation as needed.
  • Provide self-services for users.

Business Outcome

  • Accelerated data delivery to business users for accurate decision making.
  • Reliable and governed data delivery through shared data lake.
  • High scalability to deliver strong machine learning (ML) and artificial intelligence (AI) use cases that predict business outcomes for more value.

Figure 2 – Advantages of migrating to the cloud.

Challenges with Migrating to the Cloud

Most companies understand the need to migrate their big data solutions to the cloud and take advantage of its benefits, but some key factors hold them back from moving away from on premises:

  • Readiness of big data workflow to be migrated to AWS.
  • Prolonged migration timelines that add to the cost of migration.
  • Technical debt and identifying the right set of infrastructure on AWS.
  • Compute and storage bundled together.
  • Shared resource conflicts and resilient switch over strategy to cloud.

Virtusa’s Hadoop to Amazon EMR Migration Factory helps eliminate these barriers to cloud adoption.

Figure 3 – Challenges with migrating to the cloud.

Migration Factory Solution In-Depth

The first step in cloud migration is to pre-analyze the workflows and make the necessary changes to their corresponding scripts to avoid potential errors on AWS.

Before running the scripts end to end against the data, conduct an initial data-independent execution. This approach reduces migration cost and helps reduce errors in the individual workflow scripts on their first execution in AWS.

The next step is to automate the process so that failures are remediated in the early stages. The architecture in Figure 4 depicts how to build a migration accelerator called the “Migration Factory.” It executes the scripts in the correct workflow order and creates a custom log.

Based on these logs, the scripts are separated by success or failure. The inputs to the Migration Factory are the scripts to be migrated and a dependency matrix file that defines the script sequence.

Figure 4 – Migration Factory solution architecture.

The Migration Factory is driven by two AWS Lambda functions, function 1 and function 2.

AWS Lambda Function 1

Lambda function 1 reads the script sequence from the dependency matrix, fetches the corresponding scripts from Amazon Simple Storage Service (Amazon S3), and submits them for execution on Amazon EMR.

Lambda function 1 is invoked when scripts (such as Spark, Hive, and Pig), along with the dependency matrix that defines the sequence of workflow scripts, are pushed to S3 through GitHub.
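As a rough illustration of this flow, the sketch below turns a dependency matrix into an ordered list of Amazon EMR steps. It is not Virtusa's implementation. The CSV layout of the matrix (columns `sequence`, `script`, `type`), the `scripts/` prefix, and the `CLUSTER_ID` environment variable are all assumptions made for the example.

```python
import csv
import io


def step_args(script_type, script_uri):
    """Return command-runner.jar arguments for one script (assumed type names)."""
    if script_type == "spark":
        return ["spark-submit", "--deploy-mode", "cluster", script_uri]
    if script_type == "hive":
        return ["hive-script", "--run-hive-script", "--args", "-f", script_uri]
    if script_type == "pig":
        return ["pig-script", "--run-pig-script", "--args", "-f", script_uri]
    raise ValueError(f"unsupported script type: {script_type}")


def build_steps(matrix_csv, scripts_prefix):
    """Turn the dependency matrix (hypothetical CSV layout) into ordered EMR steps."""
    rows = sorted(csv.DictReader(io.StringIO(matrix_csv)),
                  key=lambda row: int(row["sequence"]))
    steps = []
    for row in rows:
        uri = f"{scripts_prefix}/{row['script']}"
        steps.append({
            "Name": row["script"],
            # CONTINUE lets every script get a first run even if an earlier one fails.
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": step_args(row["type"], uri),
            },
        })
    return steps


def lambda_handler(event, context):
    """Assumed trigger: an S3 event for the uploaded dependency matrix."""
    import os
    from urllib.parse import unquote_plus

    import boto3  # imported lazily so build_steps can be exercised without AWS access

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    steps = build_steps(body.decode("utf-8"), f"s3://{bucket}/scripts")
    return boto3.client("emr").add_job_flow_steps(
        JobFlowId=os.environ["CLUSTER_ID"], Steps=steps)
```

Keeping the step construction in a pure function separate from the handler makes the ordering logic easy to check locally before any cluster is involved.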

Figure 5 – Migration Factory: AWS Lambda function 1.

AWS Lambda Function 2

After the scripts are executed, the execution logs are available. Lambda function 2 is invoked based on the success or failure of the steps on Amazon EMR. Successful scripts are moved to a “valid bucket” and those with errors to an “invalid bucket” on S3.

Lambda function 2 also writes a custom exception log in JSON format to a third S3 bucket, named the “exception bucket.” This is handy for keeping track of the exceptions the scripts generate while executing on EMR, and the log can be easily queried via Amazon Athena.

This log can be used to categorize scripts by similar exceptions so that a common fix can be applied; it can be archived and referred to until the workflow executes successfully end to end.
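A minimal sketch of this routing step might look like the following. It assumes the function is triggered by an EMR step status change event whose `detail` carries `state` and `name`, that the step name equals the script's S3 key, and that the bucket names come from the hypothetical environment variables `SCRIPTS_BUCKET`, `VALID_BUCKET`, and `INVALID_BUCKET`. None of these details are confirmed by the architecture described here.

```python
TERMINAL_FAILURES = ("FAILED", "CANCELLED")


def destination_bucket(state, valid_bucket, invalid_bucket):
    """Pick the target bucket from a step state, or None if the step is not finished."""
    if state == "COMPLETED":
        return valid_bucket
    if state in TERMINAL_FAILURES:
        return invalid_bucket
    return None  # for example PENDING or RUNNING: nothing to move yet


def move_script(s3, source_bucket, key, target_bucket):
    """Copy the script to the target bucket, then delete the original."""
    s3.copy_object(Bucket=target_bucket, Key=key,
                   CopySource={"Bucket": source_bucket, "Key": key})
    s3.delete_object(Bucket=source_bucket, Key=key)


def lambda_handler(event, context):
    """Assumed trigger: an EMR step status change event."""
    import os

    import boto3  # imported lazily so the helpers above can be tested offline

    detail = event["detail"]
    target = destination_bucket(detail["state"],
                                os.environ["VALID_BUCKET"],
                                os.environ["INVALID_BUCKET"])
    if target is not None:
        # Assumption: the step name is the script's key in the scripts bucket.
        move_script(boto3.client("s3"), os.environ["SCRIPTS_BUCKET"],
                    detail["name"], target)
    return {"script": detail["name"], "moved_to": target}
```

Passing the S3 client into `move_script` keeps the move logic testable with a stand-in object.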
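To make the idea of a queryable exception log concrete, here is a small sketch that builds one JSON record per failed script and groups scripts by exception class. The record fields, the regular expression used to pull a Java-style exception name out of step log text, and the `Unknown` fallback are illustrative assumptions, not the format used by the Migration Factory.

```python
import json
import re
from collections import defaultdict

# Matches fully qualified names such as java.io.FileNotFoundException.
EXCEPTION_PATTERN = re.compile(r"\b(?:[A-Za-z_]\w*\.)+\w*(?:Exception|Error)\b")


def exception_class(log_text):
    """Return the first exception class found in the log text, or 'Unknown'."""
    match = EXCEPTION_PATTERN.search(log_text)
    return match.group(0) if match else "Unknown"


def exception_record(script, step_id, log_text):
    """Build one log record (hypothetical field names) for a failed script."""
    lines = log_text.strip().splitlines()
    return {
        "script": script,
        "step_id": step_id,
        "exception": exception_class(log_text),
        "message": lines[0] if lines else "",
    }


def to_json_line(record):
    """Serialize a record as a single line of JSON, one object per line."""
    return json.dumps(record, sort_keys=True)


def group_by_exception(records):
    """Map each exception class to the scripts that raised it."""
    groups = defaultdict(list)
    for record in records:
        groups[record["exception"]].append(record["script"])
    return dict(groups)
```

Writing each record on its own line keeps the files simple to expose as a table, and the grouping shows which scripts are likely to share a common fix.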

Figure 6 – Migration Factory: AWS Lambda function 2.

Pre-Handling Errors in the Scripts

Before the initial run, some common script errors related to on-premises paths and locations can be ruled out by creating an S3 location that mirrors the HDFS location. The on-premises HDFS paths in the scripts can then be updated to use this S3 location as a prefix.

Because these are large-scale migrations, it’s good to automate such mundane tasks by writing small utilities. One way is to write a Python or Unix utility that loops through each script in a workflow, searches for anything starting with an on-premises HDFS path, and updates it with the S3 location.

Another common source of errors is custom JARs used in the workflow. These can be handled by pushing the JARs to S3 as a dependency and referencing them when launching the Amazon EMR cluster.
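A utility of this kind could be sketched in Python as follows. The file suffixes and the example prefixes are assumptions, and the sketch uses plain prefix replacement, so paths that are built dynamically or written without the HDFS prefix would still need manual review.

```python
from pathlib import Path

# Assumed script extensions for Spark (Python), Hive, Pig, and shell wrappers.
SCRIPT_SUFFIXES = (".py", ".hql", ".pig", ".sh")


def rewrite_hdfs_paths(text, hdfs_prefix, s3_prefix):
    """Replace every occurrence of the on-premises HDFS prefix with the S3 prefix."""
    return text.replace(hdfs_prefix, s3_prefix)


def rewrite_workflow(workflow_dir, hdfs_prefix, s3_prefix, suffixes=SCRIPT_SUFFIXES):
    """Rewrite all scripts under workflow_dir in place; return the names of changed files."""
    changed = []
    for path in sorted(Path(workflow_dir).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        original = path.read_text(encoding="utf-8")
        updated = rewrite_hdfs_paths(original, hdfs_prefix, s3_prefix)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path.name)
    return changed
```

For example, `rewrite_workflow("workflows/daily", "hdfs://namenode:8020", "s3://my-bucket")` would update every matching script under a hypothetical `workflows/daily` directory and report which files were touched.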
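For Spark scripts, one way to reference custom JARs stored on S3 is through the `--jars` option of `spark-submit`. The helper below is a sketch under that assumption; the bucket layout and function names are hypothetical, and Hive or Pig scripts would need their own mechanism for registering JARs.

```python
def jar_uri(bucket, prefix, jar_name):
    """Build the S3 URI for a custom JAR uploaded under an assumed prefix."""
    return f"s3://{bucket}/{prefix.strip('/')}/{jar_name}"


def spark_submit_args(script_uri, jar_uris=()):
    """Build spark-submit arguments, adding --jars only when custom JARs are present."""
    args = ["spark-submit", "--deploy-mode", "cluster"]
    if jar_uris:
        # spark-submit expects a single comma-separated list of JAR locations.
        args += ["--jars", ",".join(jar_uris)]
    args.append(script_uri)
    return args
```

The resulting argument list can be passed as the `Args` of an EMR step that runs through `command-runner.jar`.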

Key Benefits to Customers

Virtusa delivered the following benefits and outcomes by implementing this Migration Factory during one of its Amazon EMR migrations:

  • 50% reduction in migration effort and timelines.
  • 70% of the codebase ready to migrate after automated transformations.
  • ~40% reduction in the cost of compute per application run, resulting in a 40% reduction in total cost of ownership (TCO).
  • 80% faster infrastructure build, incorporating industry best practices of governance, consistency, attribution, and learnings from Virtusa’s experience in transparency and price predictability.
  • 90% reduction in data storage costs during migration and a similar reduction in overall storage costs.
  • Additional 67% savings on storage costs for the migration effort for all future developments.

Figure 7 – Key benefits and outcomes.

Conclusion

In this post, we shared how to simplify large-scale Hadoop-to-Amazon EMR migrations by creating a migration framework of automated modules.

With Virtusa’s Migration Factory, you can rapidly and successfully migrate your Hadoop application to Amazon EMR with minimum business disruption. Organizations can achieve a more predictable migration journey by combining these approaches and utilities.

To learn more about Virtusa’s solutions, visit their website.



Virtusa – AWS Partner Spotlight

Virtusa is an AWS Premier Tier Services Partner and global provider of digital business strategy, engineering, and IT services and solutions. Virtusa accelerates clients’ cloud adoption through technical, training, and GTM investments.

