AWS Big Data Blog

Amazon EMR HBase on Amazon S3 transitioning to EMR S3A with comparable EMRFS performance

Starting with version 7.10, Amazon EMR is transitioning from EMR File System (EMRFS) to EMR S3A as the default file system connector for Amazon Simple Storage Service (Amazon S3) access. For HBase on Amazon S3, this transition offers performance parity with EMRFS while delivering additional benefits: better standardization, improved portability, stronger community support, non-blocking I/O and asynchronous clients, and better credential management through AWS SDK V2 integration.

In this post, we discuss this transition and its benefits.

Understanding file system usage in HBase with Amazon EMR

HBase on Amazon S3 uses Amazon S3 as the primary storage layer instead of HDFS. When the memstore is flushed, HBase writes HFiles directly to Amazon S3 using the file system connector. The Write Ahead Logs (WALs) and other operational files are still maintained in HDFS on the local cluster for performance and durability reasons. Amazon EMR also provides a durable off-cluster EMR WAL implementation to improve the durability of the data.

With the HBase on Amazon S3 architecture, you can take advantage of the virtually unlimited storage capacity and cost-effectiveness of Amazon S3 while maintaining acceptable read/write performance. When data is read, HBase retrieves the HFiles directly from Amazon S3, and the in-memory block cache helps optimize frequent read operations. This design alleviates the need for a large HDFS cluster for data storage, reducing operational costs and management overhead. The Amazon S3 file system connector handles the communication between HBase and Amazon S3, managing aspects like authentication, retry logic, and consistency. This setup might have slightly higher latency than traditional HBase on HDFS because of the network calls to Amazon S3. The trade-off is justified by the scalability and cost-effectiveness that Amazon S3 provides, with the block cache absorbing much of the latency for frequently read data.
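To make the storage layout above concrete, the following is a minimal Python sketch of the cluster configuration that places the HBase root directory on Amazon S3. The `hbase-site` and `hbase` classifications and the `hbase.rootdir` and `hbase.emr.storageMode` properties come from the Amazon EMR documentation for HBase on Amazon S3. The helper name `hbase_on_s3_configurations` and the bucket path are hypothetical placeholders, not values from this post.

```python
import json
from typing import Any, Dict, List


def hbase_on_s3_configurations(root_dir: str) -> List[Dict[str, Any]]:
    """Build EMR configuration classifications for HBase on Amazon S3.

    root_dir is the S3 location for HFiles, for example
    "s3://amzn-s3-demo-bucket/hbase-root" (placeholder, not a real bucket).
    """
    if not root_dir.startswith("s3://"):
        raise ValueError("root_dir must be an s3:// URI")
    return [
        # Points the HBase root directory (HFiles) at Amazon S3.
        {"Classification": "hbase-site",
         "Properties": {"hbase.rootdir": root_dir}},
        # Tells Amazon EMR to run HBase in S3 storage mode.
        {"Classification": "hbase",
         "Properties": {"hbase.emr.storageMode": "s3"}},
    ]


if __name__ == "__main__":
    # Placeholder bucket name, used for illustration only.
    configs = hbase_on_s3_configurations("s3://amzn-s3-demo-bucket/hbase-root")
    print(json.dumps(configs, indent=2))
```

The printed JSON could be supplied as the configurations document when creating a cluster. WALs are not affected by this setting and remain on the cluster's HDFS, as described above.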

Performance comparison of EMR S3A with EMRFS and OSS S3A from the 7.3 release

Amazon EMR is transitioning how it connects to Amazon S3 storage. Through Amazon EMR 7.9, Amazon EMR has used EMRFS as its primary connector to interact with Amazon S3 for HBase storage. Starting with the 7.3 release, HBase on Amazon S3 with EMR S3A performs significantly better than with OSS S3A and matches the performance of EMRFS. This enhancement was thoroughly tested using Yahoo! Cloud Serving Benchmark (YCSB) workloads with 100 million rows in Amazon EMR 7.3 (using Hadoop 3.3 with AWS SDK V1) and Amazon EMR 7.10 (using Hadoop 3.4 with AWS SDK V2).

YCSB includes various workloads with different read and write proportions and data distribution patterns, such as:

  • Workload A (50% reads, 50% writes) – Simulates an update-heavy scenario with equal read and write operations. This is ideal for applications requiring frequent updates and reads, such as session stores.
  • Workload B (95% reads, 5% writes) – Models a read-heavy application. This is well-suited for scenarios where retrieval operations dominate, like content delivery networks.
  • Workload C (100% reads) – Simulates read-only patterns such as a user profile cache or a content delivery system.
  • Workload D (read latest data) – Simulates user status updates where users want to read the latest status.
  • Workload E (scan heavy) – Simulates threaded conversations where users scan through message threads.
  • Workload F (read/modify/write operations) – Simulates user record update patterns such as online gaming platforms where player scores are frequently read and updated based on game outcomes.
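The workload mixes listed above can be sketched as YCSB property overrides. The property names (`recordcount`, `readproportion`, `updateproportion`, `scanproportion`, `insertproportion`, `readmodifywriteproportion`, `requestdistribution`) and the proportions below follow the stock YCSB core workload files, where the writes in workloads A and B are updates. They are not taken from the benchmark setup used in this post, which states only the 100 million row count. The names `WORKLOAD_MIX` and `render_workload` are hypothetical.

```python
# Operation mix per stock YCSB core workload (assumed, not from this post).
WORKLOAD_MIX = {
    "a": {"readproportion": 0.5, "updateproportion": 0.5,
          "requestdistribution": "zipfian"},
    "b": {"readproportion": 0.95, "updateproportion": 0.05,
          "requestdistribution": "zipfian"},
    "c": {"readproportion": 1.0, "requestdistribution": "zipfian"},
    "d": {"readproportion": 0.95, "insertproportion": 0.05,
          "requestdistribution": "latest"},
    "e": {"scanproportion": 0.95, "insertproportion": 0.05,
          "requestdistribution": "zipfian"},
    "f": {"readproportion": 0.5, "readmodifywriteproportion": 0.5,
          "requestdistribution": "zipfian"},
}

PROPORTION_KEYS = (
    "readproportion",
    "updateproportion",
    "scanproportion",
    "insertproportion",
    "readmodifywriteproportion",
)


def render_workload(name: str, recordcount: int = 100_000_000) -> str:
    """Render a YCSB properties fragment for one core workload.

    Every proportion is written explicitly, with 0 for unused operations,
    so that YCSB defaults cannot leak into the mix.
    """
    mix = WORKLOAD_MIX[name.lower()]
    lines = [f"recordcount={recordcount}"]
    for key in PROPORTION_KEYS:
        lines.append(f"{key}={mix.get(key, 0)}")
    lines.append(f"requestdistribution={mix['requestdistribution']}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_workload("a"))
```

Writing the zero proportions explicitly matters because the YCSB core workload applies non-zero defaults to reads and updates when those properties are omitted.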

The performance comparison between EMRFS, EMR S3A, and OSS S3A for Amazon EMR 7.3 (AWS SDK V1) and 7.10 (AWS SDK V2) is illustrated in the following graphs, which show substantial improvements across different workload types. The graphs demonstrate how Amazon EMR 7.3 and 7.10 with EMR S3A achieve performance comparable with EMRFS and up to 65% faster than OSS S3A, especially in read-heavy and mixed read/write workloads.


EMR S3A as the default file system from Amazon EMR 7.10

These performance improvements demonstrate a significant evolution in the capabilities of Amazon EMR. Well before EMR S3A became the default file system in version 7.10, EMR HBase users were already experiencing enhanced Amazon S3 access performance through EMR S3A. The critical enhancements implemented in Amazon EMR 7.3 successfully minimized the performance differential between EMRFS and EMR S3A for HBase operations. This achievement delivered optimal performance to users while preserving EMR S3A’s distinct benefits within the analytics ecosystem, including improved standardization, better community integration, and enhanced portability.

Amazon EMR 7.10 marks a significant change for HBase on Amazon S3 users. EMR S3A becomes the default file system connector automatically, independent of how your root directory’s file system is configured. This seamless transition enables EMR HBase customers to use EMR S3A’s expanding feature set and improvements without manual intervention.

Conclusion

The evolution of file system connectors in EMR HBase demonstrates AWS’s commitment to delivering high-performance, scalable solutions for big data workloads. From EMR S3A achieving performance parity with EMRFS and improving on OSS S3A in Amazon EMR 7.3 (as validated through extensive YCSB benchmark tests with 100 million rows) to the transition to EMR S3A as the default connector in Amazon EMR 7.10, AWS continues to enhance its storage interface capabilities.

The transition represents more than just a technical upgrade; it delivers a trifecta of benefits: enhanced standardization across Hadoop ecosystems, improved workload portability, and robust community support. Most importantly, this advancement maintains the high-performance standards established by EMRFS while positioning EMR HBase for future innovations in storage interface capabilities. AWS’s strategic evolution of file system connectors demonstrates its commitment to providing enterprise-grade solutions that combine performance, scalability, and architectural excellence.

As big data workloads continue to grow and evolve, this foundation of reliable, high-performance storage access will become increasingly crucial for organizations using EMR HBase for their data processing needs. We recommend that you stay up to date with the latest Amazon EMR release to take advantage of the latest performance and feature benefits.


About the Authors

Dong Li

Dong is a Senior Software Development Engineer for Amazon EMR at Amazon Web Services. His expertise is in big data systems, including Hadoop, HBase, and Hive. His customer obsession and dedication to solving big data system problems help Amazon EMR achieve more performance improvements.

Ramesh Kandasamy

Ramesh is an Engineering Manager for Amazon EMR at Amazon Web Services. He is a long-tenured Amazonian dedicated to solving distributed system problems.

Giovanni Matteo Fumarola

Giovanni is the Senior Manager for the Amazon EMR Spark and Iceberg group. He is an Apache Hadoop Committer and PMC member. He has been focusing on the big data analytics space since 2013.