Behind the Scenes on AWS Contributions to Open Source Databases

Behind the scenes on AWS open source

AWS managed open source services make it easier for customers to set up and operate their favorite open source projects on AWS. We help reduce the overhead of self-managing open source and provide better integrations with AWS services. In the process, we hire maintainers or develop engineers to become experts and leaders in the open source projects that we operate. They contribute back to the upstream community through code contributions, performance testing and improvements, and releasing our own innovations as open source.

Amazon Elastic Kubernetes Service (Amazon EKS), for example, is a managed open source service that helps customers run Kubernetes in the AWS Cloud and on premises with custom-built integrations to Amazon EC2, Amazon Aurora, and other AWS services. AWS engineers contribute to Kubernetes and many other projects in the cloud native ecosystem, such as containerd, Cortex, etcd, Fluentd, nerdctl, Notary, OpenTelemetry, Thanos, and Tinkerbell.

Similarly, AWS engineers are significant contributors to the open source databases that our managed services are built on and that our customers depend on. Aurora PostgreSQL-compatible and MySQL-compatible editions, Amazon Relational Database Service (Amazon RDS) for PostgreSQL, MySQL, and MariaDB, and Amazon ElastiCache for Redis are AWS services built on, or compatible with, open source databases.

Amazon employs six committers to PostgreSQL, including three security team members. We contribute to pgvector, the open source extension for PostgreSQL, and have released Trusted Language Extensions (pg_tle) as open source to improve security for all PostgreSQL users.

The managed open source service Amazon RDS for MariaDB is built on open source MariaDB, a fork of MySQL. The MariaDB Foundation has recognized Amazon as a major contributor to MariaDB with 11 contributors and 33 commits since the start of 2023, largely focused on reliability and security improvements. AWS is a diamond-level sponsor of the MariaDB Foundation and contributes to the broader MariaDB ecosystem, including projects such as Percona XtraBackup, HammerDB, and RocksDB.

Involvement in open source communities is critical to a project’s long-term success. Maintainers play an important role by overseeing code maintenance and releases, community management, issue triage, and security, among other core project responsibilities. Amazon ElastiCache for Redis is a managed open source service built on open source Redis. Amazon is the third-largest contributor to Redis and employs two active contributors, one of whom is also one of five maintainers.

“We provide our customers with a breadth of open source services to build on. As a result, developers not only benefit from the innovations driven by AWS, but also advancements from familiar open source communities,” said Sirish Chandrasekaran, General Manager and Engineering Director for Amazon RDS.

In this post, we will share some of the more substantial open source contributions AWS has made in the past two years to upstream databases, introduce some of our key contributors, and share how we approach upstream work in our database services.

Open source databases in the cloud

Over the past 15 years, cloud managed databases have changed the economics of database management. Customers have moved away from on-premises databases to the cloud to take advantage of a more agile, elastically scalable, and flexible architecture that allows them to use multiple databases for a single application. Open source databases have accelerated this transformation as an essential part of a cost-effective, modern data strategy.

Many AWS services are built with open source components, and over time our customers have asked us to provide managed versions of the open source projects that we build on. In 2009, we launched Amazon RDS, one of the first managed relational database services in the cloud, which has helped accelerate the adoption of open source databases such as MySQL and PostgreSQL.

Today, five of the seven Amazon RDS engines are open source or open source compatible. Our customers turn to open source databases first when they’re building new applications in the cloud. Our managed services help them scale by removing the operational overhead of self-managing open source.

Our engineers and our customers rely on the security and long-term stability of open source databases. So, collaboration with the upstream open source communities is critical to how we build and operate database services.

“Increasingly, what you’re seeing with AWS is ‘We’re listening.’ We’re approaching open source and making sure we’re engaging, communicating, listening, and understanding,” Barry Morris, general manager of in-memory and emerging databases at AWS, said at Percona Live in 2022. “The users and customers behind those communities understand where the technologies need to go. We listen for ways that we can help accelerate innovation in a particular community, and then we give back.”

Once we’ve identified areas where our engineers can add value, we look for ways to introduce our code changes upstream first so that they benefit all users as well as AWS customers. For example, the Amazon RDS team has been working closely over the past year with committers at EDB, Fujitsu, Microsoft, and others in the PostgreSQL community to improve logical replication. Logical replication is a data copying technique that allows PostgreSQL and PostgreSQL-compatible systems to replay changes from a source database regardless of its architecture. AWS customers, and Postgres users in general, are asking for logical replication for use cases such as online major version upgrades, extract, transform, and load (ETL) processes, and migrations between heterogeneous systems.
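
As a rough illustration of the idea, and not of PostgreSQL’s actual protocol or internals, the Python sketch below models a stream of row-level changes being replayed on a subscriber. The names Change and apply_changes are hypothetical and exist only for this example. The point it tries to show is that each change is described logically (table, key, new values) rather than as physical storage blocks, which is why the receiving side does not need to share the source’s architecture.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Change:
    """One row-level change, identified by table and primary key (hypothetical format)."""
    op: str                              # "insert", "update", or "delete"
    table: str
    key: int
    row: Optional[Dict[str, Any]] = None  # column values; unused for deletes


def apply_changes(replica: Dict[str, Dict[int, Dict[str, Any]]],
                  changes: List[Change]) -> Dict[str, Dict[int, Dict[str, Any]]]:
    """Replay changes in order against an in-memory stand-in for a subscriber."""
    for change in changes:
        table = replica.setdefault(change.table, {})
        if change.op == "insert":
            table[change.key] = dict(change.row or {})
        elif change.op == "update":
            # Merge only the columns carried by the change.
            table.setdefault(change.key, {}).update(change.row or {})
        elif change.op == "delete":
            table.pop(change.key, None)
        else:
            raise ValueError(f"unknown operation: {change.op}")
    return replica


stream = [
    Change("insert", "users", 1, {"name": "Ada", "plan": "free"}),
    Change("insert", "users", 2, {"name": "Lin", "plan": "free"}),
    Change("update", "users", 1, {"plan": "pro"}),
    Change("delete", "users", 2),
]
replica = apply_changes({}, stream)
print(replica)
```

Because the subscriber only needs the ordered list of changes, the same stream could in principle feed a newer major version, an ETL pipeline, or a different system entirely, which matches the use cases listed above.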

“This is what I love about the Postgres community, even if there are commercial interests behind getting features in, there’s a shared common goal. This is something that benefits everyone, so by working together and collaborating on it, we can move farther, faster,” said Jonathan Katz, Product Manager – Technical on Amazon RDS, at PGCon 2023.

How AWS contributes to PostgreSQL

AWS contributes in many ways to PostgreSQL, which underpins our managed database services and represents an important component of Amazon.com, among other Amazon technologies. We employ engineers dedicated to working on the upstream project to improve performance and resiliency, fix bugs, and review patches, as well as provide new features and open source extensions. At the same time, we are aware of each community’s guidelines and preferences and aim to accommodate them. For example, with the PostgreSQL community, one tenet is to not have any single entity that’s too influential, so we are moderating our overall authorship and commit levels to honor this.

This year, AWS engineers rank among the leading contributors to the forthcoming PostgreSQL 16 release. For example, Postgres committer Masahiko Sawada has contributed changes to upstream PostgreSQL that allow parallelization of applying transactions on subscribers. Similarly, Bertrand Drouvot added support for logical decoding on standby instances, which also adds support for failover with logical replication. This was a major milestone for the project, as the feature was stalled for several years. Previously, it was only possible to stream logical changes from a primary, which could be taxing on systems under high load. Customers with heavy workloads on their primaries can now offload logical replication to standbys.

We also help support numerous open source Postgres projects, including the JDBC driver and extensions like pg_tle, PostGIS, pg_hint_plan, and pgvector. Customers are asking to store vector data in PostgreSQL as the popularity and accessibility of generative AI (GenAI) and other machine learning tools have led to a demand for permanent storage systems for this data. The pgvector extension provides a vector data type and search functionality for high-dimensional vectors. We’ve collaborated with the PostgreSQL community to continue making performance enhancements to this extension, such as the new HNSW index type made available in pgvector 0.5.0. We’ve been supporting pgvector development through direct code contributions and performance testing on behalf of our customers.
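
To make “search functionality for high-dimensional vectors” concrete, here is a minimal Python sketch of exact nearest-neighbor search by Euclidean distance. It is not pgvector’s implementation, and the function names and sample data are made up for illustration. It compares the query against every stored vector; an index such as HNSW is generally used to approximate the same answer without scanning everything.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(items: Dict[str, Sequence[float]],
            query: Sequence[float],
            k: int) -> List[Tuple[str, float]]:
    """Return the k stored items closest to the query, nearest first."""
    scored = [(name, l2_distance(vec, query)) for name, vec in items.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Tiny three-dimensional "embeddings"; real ones often have hundreds of dimensions.
embeddings = {
    "doc_a": [0.0, 0.0, 1.0],
    "doc_b": [1.0, 0.0, 0.0],
    "doc_c": [0.0, 1.0, 1.0],
}
results = nearest(embeddings, [0.0, 0.1, 0.9], k=2)
for name, distance in results:
    print(name, round(distance, 3))
```

The brute-force scan costs one distance computation per stored vector, which is the cost that approximate index types aim to reduce on large tables.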

Because of the sheer scale at which we operate, we also noticed that the hundreds of available PostgreSQL extensions were becoming a source of attacks on Postgres databases. We decided to address the issue and improve security for all Postgres users by creating Trusted Language Extensions (pg_tle), which we released as an open source project in November 2022. pg_tle is a development kit for building Postgres extensions that provides database administrators control over who can install extensions and a permissions model for running them, letting application developers deliver new functionality as soon as they determine an extension meets their needs. Similarly, we added support for Rust, which gives developers a high-performance option for writing stored procedures on PostgreSQL databases.

“We identified that problem and then worked to not just shut down the specific issues, but to create a safer environment to run those extensions in. We released this [as open source] because we recognized that everyone should have a more secure and a more safe PostgreSQL,” said David Nalley, head of open source strategy and marketing at AWS, at re:Inforce 2023.

PostgreSQL users need not use AWS to benefit from the open source extension. PostgreSQL cloud provider Supabase is now also using the pg_tle extension, which they called a “surprise gift” from AWS.

“When AWS released pg_tle, we realized immediately how powerful the tool is and how it will change the way people release and install trusted extensions forever. We started collaborating with the TLE team at AWS, exchanging ideas and resources to further promote TLEs in both our database platform and theirs,” writes Michel Pelletier, a Supabase engineer, on the company’s blog. “By providing the TLE extension in both Supabase and AWS, we are unifying a standard package platform across two of the largest Postgres providers.”

How AWS contributes to Redis

Developers love to use Redis for real-time applications, such as a cache in front of a database or as a high-speed message broker between microservices. In these real-time environments, every second of outage or increased latency can cause a degraded user experience or cascading failures throughout the system. Because AWS is a large cloud provider, our engineers are able to observe issues that arise at scale, allowing us to bring a unique perspective to the open source community and contribute what we’ve learned.

“So, what’s really hard to do, but really important, is building resiliency and making sure that, even if something fails, the rest of the application continues working. So that’s something AWS as a whole can bring to the conversation because we operate at such a high scale. We’ve really learned a lot of lessons about how to make sure applications stay running and stay resilient,” Madelyn Olson, Principal Engineer at AWS and a Redis committer, said.

AWS is also able to use its scale to identify patterns of use across multiple users of our Redis services, Amazon ElastiCache for Redis and Amazon MemoryDB for Redis. In the past year, we’ve heard from our customers that they were looking for ways to improve the efficiency of their caching applications to help lower costs. In response, AWS has worked on improving the efficiency of Redis by decreasing the overhead of storing data in Redis. This feature, which should be available in Redis 8, allows all users of Redis to enjoy a lower cost to operate.

These contributions build on a long history of AWS involvement in Redis. In 2022, AWS engineers and Redis contributors Harkrishn Patro and Madelyn Olson contributed several major features in Redis 7: fine-grained access control over keys and commands, native hostname support for clustered configurations, which enables TLS security, and partitioned channels for scalable pub/sub. We are committed to continuing to support the Redis community in order to make sure it’s sustainable in the long term, which includes more than just features. We have also been working to improve client support, such as adding cluster mode support to redis-py and adding important availability improvements such as exponential backoffs.
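
For readers unfamiliar with the term, the sketch below shows one common form of exponential backoff (with a cap and random jitter) in plain Python. It is an illustrative assumption of how such a retry loop can look, not the code in redis-py or any other client; backoff_delays, call_with_backoff, and flaky_get are hypothetical names, and the default timings are arbitrary.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.05, cap: float = 2.0) -> List[float]:
    """Upper bound for each wait: base, 2*base, 4*base, ... limited to cap (seconds)."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def call_with_backoff(operation: Callable[[], T],
                      retries: int = 5,
                      base: float = 0.05,
                      cap: float = 2.0,
                      sleep: Callable[[float], object] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Run operation, retrying on ConnectionError with jittered, growing waits."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries: surface the last error to the caller
            # Jitter: wait a random fraction of the current upper bound so that
            # many clients do not all retry at the same instant.
            sleep(rng() * delays[attempt])
    raise AssertionError("unreachable")


# Usage: an operation that fails twice and then succeeds. The sleep and rng
# hooks are replaced so the example runs instantly and deterministically.
attempts = {"count": 0}


def flaky_get() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("node unavailable")
    return "value"


slept: List[float] = []
result = call_with_backoff(flaky_get, sleep=slept.append, rng=lambda: 1.0)
print(result, slept)
```

Spacing retries out this way gives a recovering node room to come back instead of being hit by an immediate wave of reconnects, which is why it counts as an availability improvement.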

Conclusion

Open source databases form the foundation of many AWS managed services. By building with databases as a service in the cloud, customers benefit from online monitoring and expert customer support, as well as the scalability, elasticity, and data portability that are the basis of a modern data and analytics strategy.

AWS is investing in upstream open source database projects, including Redis, MariaDB, and PostgreSQL, to help ensure the long-term sustainability of the projects that our customers rely on. Our engineers contribute new code and bug fixes upstream, review code contributions, provide operational expertise, help drive project roadmaps, and build consensus in the open source communities where we participate.

Open source is important to AWS, our customers, and the world. These are the three pillars of open source at AWS. We’re here to support and contribute to open source databases for the long haul, and we will continue to make substantial, and growing, investments in open source in the coming years.

Do you have feedback on our open source involvement? We encourage you to get involved in our projects or partner with us to improve the long-term sustainability of the technologies we all build our lives and businesses on.