AWS Open Source Blog

What Amazon gets by giving back to Apache Lucene

[Screenshot: Lucene in action, powering Amazon product search]

At pretty much any scale, search is hard. It becomes dramatically harder, however, at Amazon scale: think billions of products, millions of sellers updating those products daily, and hundreds of millions of customers searching that inventory at all hours. Although Amazon has powered its product search for years with a homegrown C++ search engine, today when you search for a new book or dishwasher on Amazon (or when you ask Alexa to search for you), you’re tapping the power of Apache Lucene (“Lucene”), an open source full-text search engine.

To get a deeper appreciation for Amazon’s embrace of Lucene, I caught up with Mike McCandless, a 12-year veteran of the Lucene community. McCandless, who joined Amazon in 2017, says that “the incredible challenge” of configuring Apache Lucene to run at Amazon scale was “too hard to resist”…

…so long as he could continue to contribute changes upstream, back to the open source Lucene project.

Why Apache Lucene?

In a Berlin Buzzwords 2019 talk, McCandless (and Amazon search colleague Mike Sokolov) walked through the reasons that Amazon, after years of success with a homegrown search engine, elected to embrace Lucene. In a follow-up discussion with me, McCandless stressed that the decision wasn’t trivial given our “very large, high-velocity catalog with exceptionally strong latency requirements and extremely peaky query rates.” Against such stringent demands, the product search team was unsure whether Lucene could keep up.

And yet it was worth an evaluation. Why?

First off, McCandless said, Lucene has attracted a massive community of passionate people who are constantly iterating on the technology. Second, while we might have worried about whether Lucene could meet our functionality and performance requirements, it’s not as if we’d be alone in using it at serious scale. Lucene “isn’t a toy,” McCandless declared. “It’s used in practice all over the place by companies like Twitter, Uber, LinkedIn, and Tinder.” Many other teams at Amazon have used Lucene for years across a variety of applications, though not previously for product search.

Ultimately, it’s that community of sophisticated users that makes Lucene hum, and which made it such an attractive option for Amazon’s product search team. Compared to Amazon’s internal product search service, McCandless argued, “Lucene has more features, is moving faster, has lots of developers working on it, offers a much bigger talent pool of experienced search developers, and more.”

All of which, while true, doesn’t necessarily explain why Amazon contributes to upstream Lucene.

Getting more by giving more

In pushing Lucene to its limits, Amazon developers uncovered “rough edges,” bugs, and other issues, according to McCandless. While the Apache License (Version 2.0) allows developers to modify the code without contributing changes back to the upstream community, Amazon chooses to actively contribute back to Lucene and other projects. Indeed, over time Amazon developers have steadily increased our participation in open source projects as a way to better serve customers, even in strategic areas like search that could yield competitive differentiation.

There are a few reasons for doing so.

First, as McCandless says, “The community is a fabulous resource: they suggest changes, and make the source code better.” By working with the Lucene community, we are better able to help our customers find the products they want, faster.

Second, we want to collaborate with that community to help bolster the main branch of innovation. Yes, at times a temporary branch is necessary, according to McCandless: “Sometimes we need to deal quickly with a short-term need. But then we take that change and propose it to the community. Once the change is merged upstream, we re-base our branch on top of the upstream version and switch back to a standard Lucene release or snapshot.” Keeping code branches as short-lived as possible has emerged as a software engineering best practice, in part thanks to the collaborative development processes pioneered by open source projects.

In this give-and-take of open source, Amazon developers have introduced several significant improvements to Lucene, including:

  • Concurrent updates and deletes. For those using Lucene for simple log analytics (i.e., appending new documents to a Lucene index, and never updating previously indexed documents), Lucene works great. For others, like Amazon, with update-heavy workloads, Lucene had a small but important single-threaded section in the code to resolve deleted IDs to their documents, which proved a major bottleneck for such use cases. “Substantial, low-level changes to Lucene’s indexing code” were necessary, McCandless acknowledges, changes that McCandless and team contributed. “With this change,” he writes, “IndexWriter still buffers deletes and updates into packets, but whereas before, when each packet was also buffered for later single-threaded application, instead IndexWriter now immediately resolves the deletes and updates in that packet to the affected documents using the current indexing thread. So you gain as much concurrency as indexing threads you are sending through IndexWriter.” The result? “A gigantic speed-up on concurrent hardware” (e.g., a 53% indexing throughput speedup when updating whole documents, and a 7.4X – 8.6X speedup when updating doc values).
  • Indexing custom term frequencies. Amazon needed to add the ability to fold behavioral signals into a ranking (i.e., what do customers do after they have searched for something?), a feature long requested in the Lucene community. McCandless’ proposed patch? Creation of “a new token attribute, TermDocFrequencyAttribute, and tweak[ing] the indexing chain to use that attribute’s value as the term frequency if it’s present, and if the index options are DOCS_AND_FREQS for that field.” Sounds simple, right? Not really. “Getting behavioral signals into Lucene was hard work,” McCandless notes. The work, however, was doubly worth it since “Once we added it, others went in and built on top of it.” That community collaboration is clearly visible in the discussion of how best to implement McCandless’ proposed patch.
  • Lazy loading of Finite State Transducers. Our third significant contribution came from Ankit Jain with help from his colleagues Min Zhou and Adithya Chandra on the AWS Search Services team. Lucene’s finite state transducers (FSTs) are always loaded into heap memory during index open, which, as Jain describes, “caus[es] frequent JVM [out-of-memory] issues if the terms dictionary size is very large.” His solution was to move the FST off-heap and lazily load it using memory-mapped IO, thereby ensuring only required portions of the terms index would be loaded into memory. This AWS contribution resulted in substantial improvements in heap memory usage, without suffering much of a performance ding for hot indices.
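The concurrency change described in the first bullet lives deep inside Lucene’s `IndexWriter`, but the core idea can be sketched with a toy model (hypothetical Python, not Lucene code): each indexing thread resolves its own packet of updates on the calling thread, rather than queuing packets for a single-threaded applier.

```python
import threading

# Toy model of the IndexWriter change described above (illustrative only,
# not Lucene's actual code): instead of buffering delete/update packets for
# one single-threaded applier, each indexing thread resolves its own packet
# immediately, so concurrency scales with the number of indexing threads.

class ToyIndex:
    def __init__(self):
        self.docs = {}                 # doc_id -> field values
        self.lock = threading.Lock()   # fine-grained, per-operation lock

    def update_document(self, doc_id, fields):
        # "Update" = delete-then-add; the delete is resolved right here,
        # on the calling thread, not handed off to a global applier.
        with self.lock:
            self.docs.pop(doc_id, None)
            self.docs[doc_id] = fields

index = ToyIndex()
threads = [
    threading.Thread(
        target=lambda start: [index.update_document(i % 100, {"v": i})
                              for i in range(start, start + 1000)],
        args=(t * 1000,),
    )
    for t in range(4)
]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(len(index.docs))  # 100 distinct doc ids remain after all updates
```

In real Lucene the win comes from moving the expensive ID-to-document resolution out of a serialized section; the toy only models the threading shape, not the data structures.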
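The custom term frequencies idea from the second bullet can likewise be sketched with a toy inverted index (hypothetical code, not Lucene’s implementation): at index time the caller supplies a frequency derived from behavioral signals, and scoring uses that supplied value instead of counting occurrences in the text.

```python
import math
from collections import defaultdict

# Toy sketch of indexing custom term frequencies (illustrative only, not
# Lucene's TermDocFrequencyAttribute): the indexer accepts an externally
# supplied per-term frequency -- e.g. a behavioral signal such as purchases
# following a query -- rather than counting term occurrences itself.

class ToyBehavioralIndex:
    def __init__(self):
        self.postings = defaultdict(dict)   # term -> {doc_id: custom freq}
        self.num_docs = 0

    def add_document(self, doc_id, term_freqs):
        # term_freqs maps term -> caller-supplied frequency
        self.num_docs += 1
        for term, freq in term_freqs.items():
            self.postings[term][doc_id] = freq

    def score(self, term, doc_id):
        # Simple tf-idf, using the supplied frequency as tf
        posting = self.postings.get(term, {})
        tf = posting.get(doc_id, 0)
        if tf == 0:
            return 0.0
        idf = math.log((1 + self.num_docs) / (1 + len(posting)))
        return tf * (idf + 1)

idx = ToyBehavioralIndex()
idx.add_document("dishwasher-A", {"dishwasher": 97})  # strong behavioral signal
idx.add_document("dishwasher-B", {"dishwasher": 3})   # weak behavioral signal
print(idx.score("dishwasher", "dishwasher-A") >
      idx.score("dishwasher", "dishwasher-B"))  # True
```

The point of the real feature is the same as in this sketch: the ranking function never needs to know where the frequency came from, so behavioral signals slot into existing scoring machinery.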
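The off-heap FST change in the third bullet relies on memory-mapped IO, so that only the touched pages of the terms index are faulted in by the operating system. A minimal Python sketch of that idea (illustrative only, not Lucene’s implementation):

```python
import mmap
import os
import tempfile

# Minimal sketch of off-heap, lazily loaded data via memory-mapped IO
# (illustrative only, not Lucene's FST code): the file is never read into
# the heap up front; the OS pages in only the byte ranges we actually touch.

path = os.path.join(tempfile.mkdtemp(), "terms.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * (16 * 1024 * 1024))   # 16 MB stand-in for a terms index

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Only the pages backing this 64-byte slice are faulted in, not all 16 MB:
    offset = 8 * 1024 * 1024
    chunk = mm[offset : offset + 64]
    print(len(chunk))  # 64
    mm.close()
```

This is why the contribution cut heap usage sharply with little penalty for hot indices: frequently accessed pages stay resident in the OS page cache even though they never occupy JVM heap.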

With these improvements (not to mention concurrent query execution improvements) and a bevy of existing features the Lucene community has built (e.g., BM25F scoring, disjunctions, language-specific tokenizers), the Amazon search team is on track to run Lucene for all Amazon product searches in 2020. This is phenomenal progress just two-and-a-half years into our Lucene product search experiment, progress made possible by the exceptional community powering Lucene development. It’s enabling Amazon to improve the customer shopping experience with unique-to-Lucene features like dimensional points and query-time joins.

But there’s more to this story than how Amazon has improved Lucene for the customers’ benefit, important as that is.

Take, for example, the custom term frequencies contribution that Amazon pushed to Lucene. For Amazon, this enabled us to migrate our machine-learned ranking models to Lucene. Adding behavioral signals into Lucene rankings is “powerful,” says McCandless, and that new capability enables each of the companies “powered by Lucene,” and many others not listed on the Apache project page, to tap into this ability to better serve their customers. Amazon could have maintained the custom term frequencies feature in an internal code branch but, in addition to the ongoing costs of maintaining that branch over time, the team saw even more value in collaborating with the Lucene community to make the software better for everyone.

The same is true of the efficiency improvements for concurrent updates/deletes noted above. These help Amazon, of course, but the improvements also benefit any Lucene user who is doing heavy updates. Our goal is to directly improve the customer experience, while also making these powerful new enhancements available to the world.

For Amazon, whether in Lucene, ROS (Robot Operating System), Xen, or any number of other open source projects, we know that delivering great customer experiences over the long term often requires investing in the open source software that is part of our systems, even when that software is invisible to customers. Our contributions to Lucene illustrate the ongoing evolution in how we serve our customers by actively participating in open source software.

P.S. We’re hiring

Are you a Lucene expert who enjoys big challenges? Come work with us.

Matt Asay

Matt Asay (pronounced "Ay-see") has been involved in open source and all that it enables (cloud, machine learning, data infrastructure, mobile, etc.) for nearly two decades, working for a variety of open source companies and writing regularly for InfoWorld and TechRepublic. You can follow him on Twitter (@mjasay).