Flipboard Teams with Mactores to Modernize a High Volume HBase Data Platform to Fully-Managed Amazon EMR
By Bal Heroor, CEO and Principal Consultant – Mactores Cognition
By Kiran Randhi, Sr. Partner Solutions Architect – AWS
The social media company Flipboard started off with a mission to create a user-centric news curation platform that connects every user with the most relevant stories from across the internet. The idea was driven by the growing need for a unified data platform that identifies and aggregates the most engaging stories published across a myriad of online media channels.
From a user perspective, news aggregation is a dynamic and rapidly evolving problem—every user demands convenient access to the most interesting stories based on ever-changing preferences.
From a technology perspective, the solution involves a unified and managed serverless cloud platform capable of running large-scale distributed big data workloads that are integrated with open-source frameworks and ready to run advanced artificial intelligence (AI) applications at scale.
Flipboard currently has over 100 million active users generating vast volumes of data, and the company’s existing self-managed Apache HBase platform was failing to support the scaling of big data workload processing demands. The platform saw an average of 300,000 read requests per second and around 120,000 write requests per second.
The cluster, which ran on Amazon Elastic Compute Cloud (Amazon EC2) across 5,600 HBase regions and processed 40TB of data on Amazon Elastic Block Store (Amazon EBS), was entirely self-managed. Growing demands for big data processing, throughput, and business analytics were overwhelming the team responsible for managing a platform with limited capabilities.
To address these challenges, Flipboard engaged Mactores Cognition for a thorough assessment of the self-managed platform and for help migrating existing data workloads to a fully managed Amazon EMR serverless big data platform.
The cloud migration and modernization process streamlined Flipboard’s distributed database capabilities, allowing the social media platform to support user spikes at scale, maximize throughput performance, and prepare to expand the user base exponentially.
In this post, we’ll take an in-depth view of the cloud migration and data platform modernization process for Flipboard. Mactores is an AWS Advanced Tier Services Partner with AWS Competencies in Data and Analytics, DevOps, Migration, and Machine Learning consulting services, as well as the Amazon EMR service delivery designation.
Amazon EMR Migration Strategy
Flipboard had originally adopted the self-managed HBase platform to realize its goals of high-speed distributed database, scalable Hadoop operations, and the flexibility to manage vast volumes of data. The company took advantage of the variety of Amazon Web Services (AWS) infrastructure as a service (IaaS) capabilities and developed a platform that best satisfied its technology demands for the foreseeable future.
In doing so, Flipboard identified two key challenges. First, the infrastructure demands of a data-driven company are a moving target that varies with limited predictability and control. Second, the company was overwhelmed by the effort and engineering resources required to self-manage a large-scale distributed database platform.
As a solution, Mactores devised a strategic plan to:
- Migrate existing HBase systems running on the self-managed platform to a fully managed Amazon EMR platform integrated with AWS storage solutions.
- Migrate all tables and associated processes from the self-managed cluster to EMR.
- Reconfigure protocol buffers and HBase client from the existing MapReduce jobs to support HBase on EMR.
- Migrate all MapReduce jobs to a separate transient EMR cluster and support the Flipboard team to instantiate the cluster from Jenkins workflow.
- Migrate HBase 1.4 to HBase 2.2.6 on Amazon EMR 6.2.0.
Journey to a Fully Managed EMR HBase Platform
The migration and data platform modernization process started off with a thorough assessment and evaluation of Flipboard’s current state of cloud readiness. This approach allowed Mactores to establish a migration plan that was well aligned with Flipboard’s long-term business goals and end-user expectations.
The assessment also identified gaps and opportunities throughout the migration journey. The result was high stakeholder commitment and motivated teams ready to maximize the value potential of Amazon EMR to power advanced big data workloads on the managed cloud platform.
The migration process was accelerated by following an iterative approach across three phases: assess, migrate and modernize.
Phase 1: Assess
The assessment phase involves building a solid business case and action plan based on the prior assessment and evaluation. This brings together the people, process, and technology required to adopt and execute distributed database capabilities within a secure, automated, and efficient AWS environment.
The overarching goals of this phase included the following:
- Mobilizing teams and preparing for the EMR migration. The transition was designed to be user-centric, reducing the learning curve and transforming teams to maximize the value of the cloud environment.
- Defining and automating policies for security, operations, and compliance.
- Running a cloud-based Hadoop platform in production capacity with the goal of improved performance across several benchmarks and metrics.
In order to realize these goals, Mactores performed the following cloud migration, provisioning, and management tasks:
- Amazon EMR architecture with Amazon S3 storage: Configured three r6g.12xlarge Master nodes and nine r6g.12xlarge core nodes running region servers, with additional nodes added for auto scaling.
- Configurations: Pre-validated EMR-specific configurations applied to Apache HBase, Amazon S3, Region Server, WAL, Block Cache, and Memstore.
- Data migration: Self-managed to Amazon S3 HBase migration with a single snapshot one week before the cutoff date, followed by incremental periodic updates until final migration.
- Stress testing: The Yahoo Cloud Serving Benchmark (YCSB) framework was used to gather performance benchmarks throughout the migration process.
Figure 1 – Final deployment architecture.
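As a rough illustration of the architecture described above, the following Python sketch builds a cluster request shaped like the keyword arguments of boto3's EMR `run_job_flow` call. The `hbase` and `hbase-site` configuration classifications are the ones Amazon EMR documents for HBase on S3; the cluster name, bucket path, and function name are hypothetical, and a real request would also need IAM roles and network settings that are omitted here. Auto scaling is not shown.

```python
def hbase_on_s3_cluster_request(root_dir: str) -> dict:
    """Build a request dict shaped like boto3 EMR run_job_flow kwargs.

    Nothing is sent to AWS; this only assembles the settings.
    """
    if not root_dir.startswith("s3://"):
        raise ValueError("HBase root directory must be an s3:// URI")
    return {
        "Name": "hbase-on-s3",            # hypothetical cluster name
        "ReleaseLabel": "emr-6.2.0",      # release named in the migration plan
        "Applications": [{"Name": "HBase"}],
        "Configurations": [
            # Tell EMR to keep HBase data in S3 instead of on-cluster HDFS.
            {"Classification": "hbase",
             "Properties": {"hbase.emr.storageMode": "s3"}},
            # Point the HBase root directory at the S3 location.
            {"Classification": "hbase-site",
             "Properties": {"hbase.rootdir": root_dir}},
        ],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "r6g.12xlarge", "InstanceCount": 3},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "r6g.12xlarge", "InstanceCount": 9},
            ],
            # A long-running HBase cluster stays up when no steps are queued.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
    }


# Hypothetical bucket and prefix, for illustration only.
request = hbase_on_s3_cluster_request("s3://example-bucket/hbase-root")
```

Keeping the request as plain data makes it easy to review the node counts and storage mode before anything is provisioned.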
Phase 2: Migrate
The migration phase extends the action tasks from the assessment phase and applies them to migrate data workloads at scale. This is an ongoing and iterative process that covers the process, technology, and design best practices pertaining to EMR migration, and prepares the organization for a fully managed AWS data platform service offering.
The following tasks were involved in the migration process:
- Amazon EMR architecture for MapReduce jobs: One m5.2xlarge Master node and five m5.4xlarge core nodes of region servers and auto scaling of core nodes for MapReduce jobs.
- EMR configurations: EMR-specific HDFS and MapReduce configurations based on pre-validated Apache Hadoop platform previously running on the self-managed cluster.
- Workload migration: From MapReduce to EMR v5.30.0 by recompiling the MapReduce code.
- Performance testing: Writing tests and executing MapReduce jobs on the EMR cluster with EMR HBase for benchmark comparisons.
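A transient cluster for MapReduce jobs, as listed above, could be described along the following lines. This is a sketch only: the field names follow boto3's EMR `run_job_flow` request shape, while the cluster name, step name, jar location, and main class are hypothetical placeholders. IAM roles, networking, and auto scaling rules are left out.

```python
def transient_mapreduce_request(job_jar: str, main_class: str) -> dict:
    """Build a run_job_flow-shaped request for a one-shot MapReduce cluster."""
    return {
        "Name": "mapreduce-transient",    # hypothetical cluster name
        "ReleaseLabel": "emr-5.30.0",     # version named in the task list
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.2xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.4xlarge", "InstanceCount": 5},
            ],
            # Transient: the cluster shuts down once its steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "run-mapreduce-job",          # hypothetical step name
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {"Jar": job_jar, "MainClass": main_class},
            }
        ],
    }


# Hypothetical jar location and class name.
request = transient_mapreduce_request(
    "s3://example-bucket/jobs/aggregation.jar", "com.example.AggregationJob"
)
```

A CI tool such as Jenkins could assemble a request like this per job run, so that compute is paid for only while the job executes.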
The HBase region server core nodes installed on the Amazon EC2 cluster offer improved read performance by caching data and using efficient in-memory filters. The HBase Write Ahead Log (WAL) offers durable write performance for data stored in the on-cluster HDFS.
Similarly, the task nodes write the HBase WAL requests in HDFS running on the core nodes with the EMR file system implementation, providing convenient storage of persistent data on Amazon Simple Storage Service (Amazon S3) for strong read-after-write consistency.
Figure 2 – Options available to migrate data.
Specifically, Mactores evaluated three data migration options and recommended an optimal mix for the various MapReduce jobs. The following options were exercised:
- Snapshot: Create snapshot from source > Export snapshot to S3 bucket of the EMR cluster > Restore table from S3 bucket.
  - Evaluation: This option was the easiest to use and required a low overall runtime, but the performance and table availability improvements were minimal.
- Table exports: Create snapshot from source > Export snapshot to S3 bucket > Import snapshot from the S3 bucket > Clone snapshot.
  - Evaluation: Easy to use, but limited performance and availability improvements at the cost of high overall runtime.
- Copy table: Use CopyTable to transfer individual tables from the source to the target cluster.
  - Evaluation: A complex task that requires the highest overall runtime, but offers significant performance and table availability improvements.
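To make the snapshot and CopyTable options above more concrete, the sketch below assembles the command lines for HBase's `ExportSnapshot` and `CopyTable` tools. Both tools and the flags shown are part of HBase; the snapshot name, bucket, table name, and ZooKeeper address are hypothetical, and the time-window arguments are only one possible way to copy incremental changes, not necessarily the approach Mactores used.

```python
from typing import List, Optional


def export_snapshot_cmd(snapshot: str, dest_root: str, mappers: int = 16) -> List[str]:
    """Command that copies an existing HBase snapshot to another root dir."""
    return [
        "hbase", "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", snapshot,
        "-copy-to", dest_root,
        "-mappers", str(mappers),
    ]


def copy_table_cmd(table: str, peer_zk: str,
                   start_ms: Optional[int] = None,
                   end_ms: Optional[int] = None) -> List[str]:
    """Command that copies one table to a peer cluster, optionally
    restricted to cells written inside a time window (epoch millis)."""
    cmd = [
        "hbase", "org.apache.hadoop.hbase.mapreduce.CopyTable",
        f"--peer.adr={peer_zk}",  # quorum:port:znode of the target cluster
    ]
    if start_ms is not None:
        cmd.append(f"--starttime={start_ms}")
    if end_ms is not None:
        cmd.append(f"--endtime={end_ms}")
    cmd.append(table)
    return cmd


# Hypothetical names, for illustration only.
print(" ".join(export_snapshot_cmd("articles-snap", "s3://example-bucket/hbase-root")))
print(" ".join(copy_table_cmd("articles", "zk1.example.internal:2181:/hbase",
                              start_ms=1600000000000)))
```

Building the commands as argument lists keeps them easy to log, review, and hand to a job runner without shell quoting problems.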
Phase 3: Modernize
The modernization phase largely covers the implementation plan from the assessment phase and action tasks from the migration phase. Modernization achieves the design goals of unbounded scalability, an efficient and highly compatible open systems architecture, and the managed service offering with maximal visibility, transparency, and control over the underlying infrastructure.
These goals were achieved by implementing the following processes:
- Migration: Each table is migrated from self-managed HBase 1.4 to EMR running HBase 2.2.6.
- Validation: Data consistency and integration are validated and guaranteed across all MapReduce jobs and other applications.
- Support: Ensure performance improvements, compatibility, and integration of data workloads and apps interacting with the HBase client and protobuf for the HBase 2.2.6 implementation on EMR clusters.
- Strategy: Devising a strategic plan to perform the cutover with the Flipboard team, streamlining the transition process, and ensuring the company can maximize EMR performance improvements immediately following the cutover.
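The post does not describe how validation was performed. One simple check, sketched below under that caveat, is to compare per-table row counts between the source and target clusters before cutover. The counts could come from a tool such as HBase's `RowCounter` MapReduce job; the table names and numbers here are hypothetical.

```python
from typing import Dict, Optional, Tuple


def find_mismatches(
    source_counts: Dict[str, int], target_counts: Dict[str, int]
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return tables whose row count differs, or that are missing, on the target."""
    mismatches = {}
    for table, expected in source_counts.items():
        actual = target_counts.get(table)  # None when the table was not migrated
        if actual != expected:
            mismatches[table] = (expected, actual)
    return mismatches


# Hypothetical counts gathered from each cluster.
source = {"articles": 1_200_000, "users": 350_000, "feeds": 90_000}
target = {"articles": 1_200_000, "users": 349_990}

for table, (expected, actual) in find_mismatches(source, target).items():
    print(f"{table}: expected {expected}, found {actual}")
# prints:
# users: expected 350000, found 349990
# feeds: expected 90000, found None
```

Matching row counts do not prove the data is identical, so a check like this is best treated as a quick first gate ahead of deeper comparisons.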
Performance Comparison and Results
Mactores conducted stress testing to discover performance bottlenecks across the network using the Yahoo Cloud Serving Benchmark (YCSB) across a set of common workloads.
The tests covered three operation mixes (100% Read, 100% Write, and a 50/50 mix of Read and Write operations) on the following clusters:
- Cluster 1: EMR with HDFS storage mode (three m5.2xlarge Master nodes and 15 i3.8xlarge core nodes).
- Cluster 2: EMR with S3 storage mode (three r5a.8xlarge Master nodes and 15 r5a.8xlarge core nodes).
- Cluster 3: EMR with S3 storage mode (three r6g.8xlarge Master nodes and 15 r6g.8xlarge core nodes).
- Cluster 4: EMR with S3 storage mode (three r6g.12xlarge Master nodes and nine r6g.12xlarge core nodes).
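The three operation mixes can be expressed as YCSB workload property files. The sketch below generates them using YCSB's core workload properties (`readproportion`, `updateproportion`, and so on). It assumes that "Write" maps to YCSB update operations and uses placeholder record and operation counts, since the post does not give the actual settings. The workload class name shown is the one used by recent YCSB releases; older releases use the `com.yahoo.ycsb` package instead.

```python
# (read proportion, update proportion) for each mix described in the post.
MIXES = {
    "read_only": (1.0, 0.0),
    "write_only": (0.0, 1.0),
    "mixed": (0.5, 0.5),
}


def ycsb_workload(mix: str, record_count: int, operation_count: int) -> str:
    """Render a YCSB core workload properties file for the given mix."""
    read, update = MIXES[mix]
    props = {
        "workload": "site.ycsb.workloads.CoreWorkload",
        "recordcount": record_count,
        "operationcount": operation_count,
        "readproportion": read,
        "updateproportion": update,  # assumption: writes modeled as updates
        "insertproportion": 0,
        "scanproportion": 0,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Placeholder sizes; real benchmark runs would use far larger values.
print(ycsb_workload("mixed", record_count=1_000_000, operation_count=1_000_000))
```

Running the same three files against every cluster keeps the comparison between configurations like for like.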
The first cluster was chosen to simulate the configurations from the self-managed HBase cluster. Mactores enabled Flipboard to realize the following improvements by migrating to a fully managed EMR HBase data platform:
|Throughput (ops/sec)||Average read latency||Average read latency|
Mactores further compared the self-managed cluster with the Cluster 4 configuration and found roughly 3X improved performance across Read and Write requests:
|Cluster||Read requests||Write requests|
|EC2-based self-managed cluster||100,000||55,000|
|Cluster 4 on Amazon EMR||300,000||150,000|
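The improvement factors follow directly from the figures in the table above. A short calculation, using only those four numbers:

```python
# Request figures taken from the comparison table.
self_managed = {"read": 100_000, "write": 55_000}
cluster_4 = {"read": 300_000, "write": 150_000}


def speedup(before: dict, after: dict) -> dict:
    """Ratio of new to old request volume for each operation type."""
    return {op: after[op] / before[op] for op in before}


factors = speedup(self_managed, cluster_4)
for op, factor in factors.items():
    print(f"{op}: {factor:.2f}x")
# prints:
# read: 3.00x
# write: 2.73x
```

Reads improve by exactly 3X and writes by a little under 3X.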
After migrating to the Amazon EMR platform, Flipboard achieved a roughly 3X performance improvement, which drove user engagement, while also realizing the benefits of a fully managed cloud service.
“We’re more resilient, and Flipboard’s overall user engagement has improved as a direct result of this migration,” said Greg Scallan, Vice President of Engineering at Flipboard. “The user experience with the app is faster because the API responses are quicker, and we can deliver new features more quickly.”
According to Mactores, Amazon EMR is an out-of-the-box solution that updates clusters quickly with little manual effort. Upgrading clusters to the newest HBase versions is much simpler for Flipboard because the Amazon EMR platform already handles most configuration changes and the S3 integration.
Flipboard is no longer overwhelmed by the complex and resource-intensive tasks of self-managing a big data platform as it scales its business across a growing user base.
If you want to know your path forward in achieving highly scalable performance along with successful deployment of Apache HBase on Amazon EMR, contact Mactores to help identify the automated accelerators, reference architectures, and total cost of ownership (TCO) calculations that align best with your environment and requirements.
Mactores Cognition – AWS Partner Spotlight
Mactores is an AWS Advanced Tier Services Partner and trusted leader among businesses in providing modern data platform solutions.