AWS Open Source Blog

Embracing natural language processing with Hugging Face

In a previous post, I talked about how open source projects often work backwards from a specific problem or challenge, as one of their key motivators. In this post, I’ll explore another area where open source projects emerge: the need to follow an area of interest, a genuine passion, or an itch that needs to be scratched.

One open source project that embraces this approach is Hugging Face. I recently sat down with Julien Chaumond (co-founder and CTO), Jeff Boudier (Product and Growth), and Philipp Schmid (Technical Lead) to understand more about the origins of Hugging Face, a New York-based startup that helps developers accelerate the use of natural language processing (NLP) technologies. Hugging Face is a great story about how a group of passionate developers came together to speed up the adoption of new and emerging technologies, improve developer agility, and give developers choice.

Hugging Face

More than five years ago, the founders of Hugging Face came together out of a shared enthusiasm for natural language processing. Their interest grew out of a general fascination with all things AI, but they were also curious about how humans communicate.

Even before launching Hugging Face, the founders had released open source projects. Chaumond told me about Circular, an open source Buffer clone that you can still use today, and about Magic Sand, a project by co-founder and Chief Science Officer Thomas Wolf that uses augmented reality to provide a virtual sandbox. It’s unsurprising, then, that open source plays a key role in the story of Hugging Face.

Agility, choice, and speed

As Hugging Face developers were creating their first consumer product, they decided to explore open sourcing some of the building blocks. One such building block was the coreference resolution system, a library that helps you work out what the pronouns in a sentence refer to.

Screenshot of Hugging Face library.

The Hugging Face team chose to open source this library because they wanted to get feedback from the community. In fact, Jeff Boudier said that it was important to get as diverse a set of people as possible involved and contributing, particularly given the wide variation in how language is used. They were surprised at how well this move was received by the community, which convinced them that releasing more open source libraries was the way to go.

The publication in 2018 of a research paper, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, and a TensorFlow-only checkpoint changed the natural language processing landscape. Thomas Wolf was excited about these developments, but wanted something that would work in PyTorch rather than TensorFlow. Wolf created an implementation in PyTorch, which laid the foundations for what would become Transformers.

Hugging Face Transformers are pre-trained machine learning models that make it easier for developers to get started with natural language processing. The transformers library lets you download and start using the latest state-of-the-art natural language processing models in 164 languages.

Because the library provides a higher-level API, you can use the same common calls within your application to perform tasks such as text classification, question answering, text generation, and more, and you can plug in different models that are available in the Hugging Face model hub. From a developer perspective, this provides great choice and allows you to find the model and framework best suited to your use case. That flexibility extends to the tools that developers use: you can work on and then deploy models using your favorite open source tools or managed services, such as Amazon SageMaker.
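
To make the idea of one common API with pluggable models a little more concrete, here is a minimal Python sketch. It assumes the transformers package and a backend such as PyTorch are installed, and that the model can be downloaded from the hub. The helper names load_classifier, top_label, and demo are hypothetical and exist only for this illustration, and the model id is simply one publicly listed sentiment model picked as an example.

```python
from typing import Callable, Dict, List, Optional

# A classifier maps a piece of text to a list of {"label": ..., "score": ...} dicts.
Classifier = Callable[[str], List[Dict]]


def load_classifier(model_name: Optional[str] = None) -> Classifier:
    """Build a sentiment-analysis pipeline (hypothetical helper).

    With no model_name, the library picks its default model for the task;
    passing a hub model id swaps the model without changing the calling code.
    """
    # Imported lazily so the rest of this sketch works without the library.
    from transformers import pipeline

    if model_name is None:
        return pipeline("sentiment-analysis")
    return pipeline("sentiment-analysis", model=model_name)


def top_label(classifier: Classifier, text: str) -> str:
    """Return the highest-scoring label the classifier assigns to text."""
    results = classifier(text)
    best = max(results, key=lambda r: r["score"])
    return best["label"]


def demo() -> None:
    """Downloads a model on first use, so it needs network access."""
    # Example model id; any compatible text-classification model should work.
    clf = load_classifier("distilbert-base-uncased-finetuned-sst-2-english")
    print(top_label(clf, "Open source makes this so much easier."))
```

Calling demo() runs the full path end to end. The point of the sketch is that top_label never needs to know which model sits behind the classifier, so trying a different model from the hub is a one-line change.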

This also makes it possible to go from research paper to reference implementation, create a transformer, and then make it readily available within your applications, reducing the time it takes to go from research to implementation. Just create your model, upload it to the hub, and you are good to go. A recent example is when Adrian de Wynter from the Amazon Alexa AI team published a new paper on Bort: A version of the BERT language model that’s 20 times as fast, which is available as of the v4.3.0 release.

This flexibility and choice also extend beyond the models you can use. Hugging Face provides datasets that you can use to train your own models, and it hosts the largest public repository of NLP datasets, with more than 600 datasets contributed by the AI community.
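
As a rough illustration of working with one of those community datasets, the Python sketch below loads a split and counts its labels. It assumes the datasets package is installed and that network access is available. The helpers load_split, label_counts, and demo are hypothetical names used only here, and "imdb" is just an example dataset id whose rows are assumed to carry a "label" field.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def load_split(name: str, split: str = "train"):
    """Fetch one split of a public dataset from the hub (hypothetical helper)."""
    # Imported lazily; requires the `datasets` package and network access.
    from datasets import load_dataset

    return load_dataset(name, split=split)


def label_counts(rows: Iterable[Mapping], label_key: str = "label") -> Dict:
    """Count how often each label value appears across the rows."""
    return dict(Counter(row[label_key] for row in rows))


def demo() -> None:
    # "imdb" is used purely as an example dataset id.
    train = load_split("imdb", split="train")
    print(label_counts(train))
```

A quick label count like this is a cheap sanity check on class balance before you use a dataset to train or fine-tune a model.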

Embracing open source

Hugging Face continues to improve how it approaches working with the community. In the beginning, Hugging Face followed a more organic approach, but today the project is more organized. For example, Hugging Face now has calls for specific, focused contributions, as well as initiatives to help mentor new contributors (shadowing more experienced members to give them confidence when working with the project) and a real focus on getting a diverse range of contributors.

These efforts appear to be well received by the open source developer communities, as Hugging Face is now the sixth most uniquely contributed project (according to the Octoverse report in 2020), with thousands of open source builders contributing to the project.

So, what does it look like when you combine the passion behind a project with solving real developer pain points? More than five thousand companies, from SMEs to the largest enterprises, are using Hugging Face in production. In the past 30 days alone, their models have been downloaded more than 25 million times. Find out more by visiting the project home page.

Ricardo Sueiras

Cloud Evangelist at AWS. Enjoy most things where technology, innovation and culture collide into sometimes brilliant outcomes. Passionate about diversity and education and helping to inspire the next generation of builders and inventors with Open Source.