AWS Open Source Blog

Community collaboration: The S3A story

Sometimes the best open source contributions involve doing less, not more. As Charity Majors has posited, “The best senior engineers I’ve worked with are the ones who worked the hardest not to have to write new code.” It’s not that writing new code is bad; rather, it’s a matter of keeping code as simple as possible, which reduces technical debt and keeps the code approachable.

For years, developers, such as Cloudera’s Steve Loughran, have made impressive code contributions to Amazon Simple Storage Service (Amazon S3) adapters like S3A. These contributions enable Apache Hadoop to directly read and write Amazon S3 objects. But figuring out the optimal way to match the differing semantics—or consistency guarantees—between Hadoop’s HDFS and Amazon S3 hasn’t been easy. This is largely because communication between the Amazon S3 team and the S3A community had been minimal.

In anticipation of the introduction of strong consistency to Amazon S3, AWS engineers, such as Jimmy Zuber, decided to get involved with the S3A community. They started by talking with S3A contributors like Loughran and listening to the community’s needs, which led to guidance that helped S3A developers write less code despite a major change to Amazon S3.

Making Amazon S3 work for the Hadoop community

Amazon S3 is a popular way for organizations to store data, currently holding trillions of objects and regularly peaking at millions of requests per second. Although many customers choose to process their Amazon S3 data using Amazon EMR, others opt to run their own Hadoop instances. For organizations hoping to use Amazon S3 instead of HDFS as their data store, Jordan Mendelson of Common Crawl created the open source project S3A, which enables Hadoop to read and write Amazon S3 objects directly.
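
From the application side, S3A keeps the standard Hadoop FileSystem API; only the URI scheme changes. The sketch below uses a placeholder bucket and key, and assumes credentials come from the usual AWS provider chain:

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3AReadExample {
        public static void main(String[] args) throws Exception {
            // "my-bucket" and the key are placeholders for this sketch.
            Path object = new Path("s3a://my-bucket/data/sample.txt");
            Configuration conf = new Configuration();

            // The s3a:// scheme routes the call through the S3A connector,
            // so the S3 object reads like any other Hadoop file.
            try (FileSystem fs = FileSystem.get(object.toUri(), conf);
                 FSDataInputStream in = fs.open(object)) {
                byte[] buffer = new byte[4096];
                int read = in.read(buffer);
                System.out.println(new String(buffer, 0, Math.max(read, 0),
                    StandardCharsets.UTF_8));
            }
        }
    }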

Mendelson’s pioneering work attracted interest from developers like Loughran at Cloudera (formerly Hortonworks). As Loughran tells it, the early work on S3A performance came from Western Digital engineers. “Their work on an S3-compatible store meant they knew subtleties the rest of us wouldn’t have thought of,” Loughran says. Developers from NetApp likewise highlighted areas of S3A ripe for optimization.

Cloudera took on much of the development of S3A. Meanwhile, developers like Loughran, Gabor Bota, Sean Mackrory, Chris Nauroth, and others spent considerable time figuring out how to match the differing semantics, or consistency guarantees, between Hadoop’s HDFS and Amazon S3. Some of the critical questions, according to Loughran, revolved around areas like:

  • When there is a seek() to a new location, is it better to abort the active HTTPS connection and negotiate a new one, or to drain the outstanding ranged GET and recycle the current one? (A sketch of this trade-off follows the list.)
  • How best to predict and minimize throttling? As Loughran tells it, dealing with throttled I/O has been a primary occupation over the past year, because Hive-partitioned directory trees, and the S3A code’s propensity to issue DELETE requests, can overload versioned buckets (for example, HADOOP-16823).
  • Failure modes. There are rare and odd corner cases, such as when a copy returns 200 “success” but carries an error in the response body (for example, HADOOP-16188). Knowing these things before they surface makes for more resilient applications.
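
As a concrete illustration of the first question, here is a hypothetical sketch of an abort-versus-drain policy. The class name, method, and threshold are invented for illustration; they are not S3A’s actual internals:

    // Hypothetical policy: when seek() jumps outside the current ranged GET,
    // decide whether to drain the remaining bytes and recycle the pooled
    // HTTPS connection, or abort it and negotiate a fresh one.
    final class SeekPolicy {

        // Assumed cutoff: below this many outstanding bytes, draining tends
        // to beat the cost of a new TLS handshake. Real tuning would depend
        // on observed latency and bandwidth.
        private static final long DRAIN_THRESHOLD_BYTES = 64 * 1024;

        static boolean shouldDrain(long currentPos, long rangeEnd) {
            long remaining = rangeEnd - currentPos;
            return remaining >= 0 && remaining < DRAIN_THRESHOLD_BYTES;
        }
    }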

The problem was that it was hard for the community to know how to resolve such issues given that they were operating in relative isolation from the Amazon S3 team. Without insight from the Amazon S3 team into implementation details, Loughran says the community was left to its own experience and “superstitions passed on by others.”

Better together

With AWS planning to update Amazon S3 to offer strong read-after-write consistency automatically for all applications, AWS engineers were worried. Without modification, S3A wouldn’t allow customers to take advantage of this change, leading to a poor customer experience. The Amazon S3 team decided it needed to work more closely with the S3A community.

Zuber says this interaction involved surveying the code base and assessing what needed to be done to launch consistency on the Amazon S3 side while making sure S3A was well positioned to adapt to it. The main outcome of this effort was the realization that they could remove the need for the then-standard S3Guard module. This promised to be a great improvement for all Hadoop/Amazon S3 customers, because it meant they would no longer need to run a somewhat expensive secondary index on Amazon DynamoDB.
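
In practice, retiring S3Guard is a configuration change rather than new code. A minimal sketch, assuming the S3A property names used in recent Hadoop releases (exact names can vary by version):

    import org.apache.hadoop.conf.Configuration;

    public class DisableS3Guard {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // With Amazon S3 now strongly consistent, point S3A's metadata
            // store at the no-op implementation instead of DynamoDB.
            conf.set("fs.s3a.metadatastore.impl",
                "org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore");
        }
    }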

At the same time, as AWS engineers worked with the S3A community, they identified a number of inefficiencies in how S3A worked with Amazon S3. One example is the addition of the option to stop deleting directory markers, thereby eliminating the I/O throttling such operations can cause. In other cases, AWS engineers have been able to make direct improvements to S3A, such as Zuber’s fix for HADOOP-17105, which eliminated superfluous HEAD requests. According to Loughran, “Things like that demonstrate good due diligence—that someone isn’t just adding a big, new feature, but trying to understand what the code does today.”
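
The directory-marker behavior is likewise opt-in through configuration. A minimal sketch, assuming a Hadoop release that supports marker retention; note that older clients that predate the option should not share buckets where markers are kept:

    import org.apache.hadoop.conf.Configuration;

    public class KeepDirectoryMarkers {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // "keep" stops S3A from deleting parent directory markers as
            // files land under them, avoiding the bursts of DELETE requests
            // that can trigger throttling; "delete" is the old behavior.
            conf.set("fs.s3a.directory.marker.retention", "keep");
        }
    }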

When considering the different ways people can contribute to open source projects, S3A contributor Nauroth says it’s important to remember that open source projects are filled with thousands of rough edges: small bugs, erroneous documentation, insufficient test coverage, and so on. As such, he says no contribution is too small. Beyond code, there are critically important ways to contribute, including documentation, testing and release verification, and weighing in during design reviews for major new features.

In this way, suggests Arvinth Ravi, senior software development manager with the Amazon S3 team, close collaboration between AWS and the S3A community has resulted in a seamless experience for customers who integrate with Amazon S3 through open source libraries like S3A. This sounds a bit magical, but results from significant behind-the-scenes collaboration between developers at Cloudera, Western Digital, NetApp, AWS, and others. It’s open source done right.

Matt Asay

Matt Asay (pronounced "Ay-see") has been involved in open source and all that it enables (cloud, machine learning, data infrastructure, mobile, etc.) for nearly two decades, working for a variety of open source companies and writing regularly for InfoWorld and TechRepublic. You can follow him on Twitter (@mjasay).