Improve the performance of generative AI workloads on Amazon Aurora with Optimized Reads and pgvector

Generative AI has expanded the possibilities for businesses to build applications that search and compare unstructured data such as text, images, and video. Embeddings, or vectors, capture the meaning and context of this unstructured data in a machine-readable form, which is the basis for making similarity comparisons directly in Amazon Aurora PostgreSQL-Compatible Edition using pgvector. pgvector adds capabilities for efficient storage and fast retrieval of data stored as high-dimensional vectors.

Some use cases, such as searching ecommerce product catalogs or retrieval augmented generation (RAG), require high-performance vector comparisons in real time, often across millions or billions of stored vector embeddings. Furthermore, because vectors with hundreds or thousands of dimensions can be very large, the entire vector index may not fit in memory, and the database must then read from disk. As an example, the Amazon Titan Embeddings G1 – Text embedding model generates embeddings with 1,536 dimensions. Moving these embeddings in and out of memory for similarity search is costly for performance when they aren’t already cached in memory.

Amazon Aurora Optimized Reads is a new Amazon Aurora feature that helps vector workloads be more performant while using fewer resources for applications with large datasets that exceed the memory capacity of a database instance. Using Amazon Aurora Optimized Reads with pgvector hierarchical navigable small worlds (HNSW) indexing delivers 20x improved query performance as compared to pgvector inverted file flat (IVFFlat) indexing, helping your applications to deliver low latency responses.

In this post, we discuss how Optimized Reads improves performance for vector workloads running on Amazon Aurora PostgreSQL with pgvector, specifically with HNSW indexing. In our benchmark, Optimized Reads instances using HNSW indexing delivered up to nine times the query throughput of Aurora PostgreSQL instances without Optimized Reads, which equates to a cost per query that is 75–80% lower.

How Aurora Optimized Reads benefits vector workloads

The most common technique for searching vectors is similarity search, also called nearest neighbor search, in which vectors are compared by calculating the distance between them. There are different types of similarity search, including k-nearest neighbor (KNN), which searches over the entire dataset, and approximate nearest neighbor (ANN), which searches over a subset. Each method has a trade-off: KNN returns the exact nearest neighbors and therefore the most relevant results, but ANN searches are usually faster because they compare fewer vectors.
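
The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not how pgvector is implemented: the function name `search`, the sample vectors, and the hand-picked `candidate_ids` are all hypothetical, and a real ANN index such as HNSW chooses its candidates by traversing a graph rather than receiving them as an argument.

```python
import math

def search(query, vectors, k, candidate_ids=None):
    """Return the ids of the k closest vectors by Euclidean (L2) distance.

    With candidate_ids=None every vector is compared (exact KNN).
    Passing a subset of ids imitates an ANN search that only visits
    some of the data; choosing that subset is the job of a real index.
    """
    ids = range(len(vectors)) if candidate_ids is None else candidate_ids
    scored = sorted((math.dist(query, vectors[i]), i) for i in ids)
    return [i for _, i in scored[:k]]

vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
query = (1.0, 1.0)

exact = search(query, vectors, k=2)                            # compares all 4 vectors
approx = search(query, vectors, k=2, candidate_ids=[0, 2, 3])  # compares only 3
print(exact, approx)
```

In this toy run the restricted search does less work but misses vector 1, the true nearest neighbor, which is the recall cost that ANN methods trade for speed.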

Vectors generated from embedding models can take up large amounts of memory. For example, the Amazon Titan Embeddings G1 – Text embedding model generates embeddings that have 1,536 dimensions, which is approximately 6 KiB of data per vector (1,536 four-byte floating-point values). Performing a KNN search on 1,000,000,000 of these vectors requires moving approximately 5.6 TiB of data into memory to complete the operation. Using a database like Aurora PostgreSQL with pgvector offers a reliable, durable solution for loading vectors into memory when they are needed for comparison. This does create a trade-off, because keeping vectors in memory for both KNN and ANN searches is usually more performant than reading them from a network-based storage system. When evaluating how to query your vectors from a database, you must consider the cost, performance, and scalability that can be achieved by using larger instances with more memory.
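
The sizing arithmetic above can be checked with a short calculation. This sketch assumes each dimension is stored as a 4-byte single-precision float and ignores per-row and index overhead, so it is a lower bound rather than a measurement; the helper names are made up for illustration.

```python
BYTES_PER_DIMENSION = 4  # assumption: 4-byte single-precision floats

def vector_bytes(dimensions):
    """Raw payload size of one embedding, excluding any storage overhead."""
    return dimensions * BYTES_PER_DIMENSION

def dataset_tib(dimensions, vector_count):
    """Raw payload size of a whole dataset in TiB (2**40 bytes)."""
    return vector_bytes(dimensions) * vector_count / 2**40

per_vector_kib = vector_bytes(1536) / 1024
total_tib = dataset_tib(1536, 1_000_000_000)
print(f"{per_vector_kib:.0f} KiB per vector, {total_tib:.2f} TiB for one billion vectors")
```

Under these assumptions one billion 1,536-dimension vectors come to several TiB of raw data, which is more memory than the instance sizes discussed in this post provide.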

Aurora Optimized Reads provides a cost-effective, performant solution for managing workloads with large datasets. Optimized Reads uses the local NVMe-based SSD block-level storage available on r6gd and r6id instances to store ephemeral data. This reduces data access to network-based storage, offering improved read latency and increased throughput. The capability shown to benefit vector workloads is the tiered cache: when pages, the fundamental unit of storage in PostgreSQL, are evicted from memory, they are cached on local storage. When a query later accesses an evicted page, Aurora loads it from the local NVMe instead of retrieving it from network storage, improving query latency. Optimized Reads also stores temporary tables on the local NVMe, delivering improved query performance for complex queries and faster index rebuild operations. The tiered cache feature, which provides the performance benefits for vector workloads described here, is available only on instances that support Aurora I/O-Optimized.
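
To make the eviction path concrete, here is a toy read-only model of a two-tier cache. It is a sketch under loud assumptions and not Aurora's implementation: the class name `TieredCache`, the LRU eviction policy, and the tiny page counts are all invented for illustration. It shows only the idea described above, that a page evicted from memory is kept on a local tier so a later read avoids the network.

```python
from collections import OrderedDict

class TieredCache:
    """Toy model: an in-memory LRU tier backed by a local-disk LRU tier."""

    def __init__(self, memory_pages, local_pages):
        self.memory_pages = memory_pages  # capacity of the memory tier
        self.local_pages = local_pages    # capacity of the local (NVMe-like) tier
        self.memory = OrderedDict()       # page ids, least recently used first
        self.local = OrderedDict()
        self.reads = {"memory": 0, "local": 0, "network": 0}

    def read(self, page_id):
        """Read a page and report which tier served it."""
        if page_id in self.memory:
            self.memory.move_to_end(page_id)  # mark as most recently used
            source = "memory"
        else:
            if page_id in self.local:
                del self.local[page_id]       # promote back into memory
                source = "local"
            else:
                source = "network"            # slowest path: remote storage
            self._load_into_memory(page_id)
        self.reads[source] += 1
        return source

    def _load_into_memory(self, page_id):
        self.memory[page_id] = True
        if len(self.memory) > self.memory_pages:
            evicted_id, _ = self.memory.popitem(last=False)
            self.local[evicted_id] = True     # evicted page stays on the local tier
            if len(self.local) > self.local_pages:
                self.local.popitem(last=False)

cache = TieredCache(memory_pages=2, local_pages=4)
sources = [cache.read(page) for page in [1, 2, 3, 1, 2, 3, 3]]
print(sources)
print(cache.reads)
```

In this run the working set of three pages does not fit in the two-page memory tier, so after the first pass the repeated reads are served from the local tier instead of the network.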

Optimized Reads can deliver up to 9x improved query latency when compared to instances without it. This will benefit applications with large datasets that exceed the memory capacity of a database instance, and lets you scale your workload further on the same instance size. This enables the database to achieve near in-memory speeds for accessing vector data with Aurora PostgreSQL and pgvector before upgrading to a larger instance size. The following section shows an experiment that highlights the benefits of Optimized Reads with a billion-scale vector workload.

Benchmarking the performance benefits of Aurora Optimized Reads

To see how Aurora Optimized Reads benefits vector workloads, we benchmarked vectors stored in Aurora with pgvector and measured the performance with and without Optimized Reads instances. We used a modified version of the BIGANN Benchmark (1 billion vectors) benchmarking tool, which allowed us to run concurrent similarity searches on vector datasets loaded into Aurora. This benchmark was performed using the BIGANN-1B dataset, which consists of scale-invariant feature transform (SIFT) descriptors applied to images extracted from a large image dataset. Each vector in the dataset has 128 dimensions. To determine the accuracy of the benchmark queries, measured in recall, we used the ground truth files available for the BIGANN-1B dataset.

We use this benchmark to compare Aurora PostgreSQL performance with and without Optimized Reads. When evaluating the performance of Aurora and pgvector in this test, it’s important to evaluate two characteristics:

Performance and throughput – What is the number of queries a database can run per second?
Recall – What is the quality of the results?

These two characteristics must be evaluated together. Performance is measured in queries per second in this test. Recall is the percentage of relevant results returned by a query. ANN algorithms often provide parameters for managing the trade-offs between recall and query throughput. For the HNSW index type, which pgvector has supported since version 0.5.0, one way you can manage search quality is the hnsw.ef_search parameter. This parameter defines the size of the queue of candidate nodes to visit during traversal of neighbors.
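
Recall against a ground truth file can be computed as sketched below. This is a generic recall@k calculation, not the benchmark tool's own code; the function names and the small id lists are hypothetical stand-ins for real query results and ground truth entries.

```python
def recall_at_k(retrieved_ids, ground_truth_ids, k):
    """Fraction of the true top-k neighbors found in the retrieved top-k."""
    true_top_k = set(ground_truth_ids[:k])
    found = set(retrieved_ids[:k]) & true_top_k
    return len(found) / k

def mean_recall(results, ground_truth, k):
    """Average recall@k over a batch of queries."""
    scores = [recall_at_k(r, g, k) for r, g in zip(results, ground_truth)]
    return sum(scores) / len(scores)

# Two example queries: the first is answered perfectly, the second misses one neighbor.
results = [[1, 2, 3, 4], [5, 6, 7, 9]]
ground_truth = [[1, 2, 3, 4], [5, 6, 7, 8]]
batch_recall = mean_recall(results, ground_truth, k=4)
print(batch_recall)
```

A recall of 0.9578, as reported for this benchmark, means that on average just under 96% of the true nearest neighbors were returned.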

Benchmark setup: we used pgvector version 0.5.0 with hnsw.ef_search=400, the candidate node queue size, which provided a recall of 0.9578. The index was built using the HNSW algorithm with the build parameters m=16, the number of bi-directional links created for every new element during construction, and ef_construction=64, which controls the trade-off between index build time and index accuracy. As previously mentioned, we modified the BIGANN Benchmark to enable parallel query runs using multiple threads.
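
For readers who want to try similar settings, the following sketch builds the corresponding pgvector SQL statements as strings. The table name `items`, the columns `id` and `embedding`, the choice of the L2 operator class, and the `LIMIT` value are assumptions for illustration and are not taken from the benchmark; only the parameter values m=16, ef_construction=64, and ef_search=400 come from the setup above.

```python
def hnsw_index_sql(table, column, m=16, ef_construction=64):
    """CREATE INDEX statement for a pgvector HNSW index using L2 distance."""
    # Identifiers are interpolated directly, so pass trusted names only.
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_l2_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )

def ef_search_sql(ef_search=400):
    """Session-level setting for the HNSW candidate queue size."""
    return f"SET hnsw.ef_search = {ef_search};"

def nearest_neighbors_sql(table, column, k=10):
    """Top-k query by L2 distance; %s is a driver placeholder for the query vector."""
    return f"SELECT id FROM {table} ORDER BY {column} <-> %s LIMIT {k};"

statements = [
    hnsw_index_sql("items", "embedding"),
    ef_search_sql(),
    nearest_neighbors_sql("items", "embedding"),
]
for statement in statements:
    print(statement)
```

Raising hnsw.ef_search generally improves recall at the cost of throughput, so it is worth measuring both when tuning the value for your own dataset.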

We ran the test on two different instance sizes that were selected so that the workload would not fit into memory, meaning that Aurora would have to read data from storage. The tests were conducted on the R6gd-12xlarge (R6gd-12xl) and R6gd-16xlarge (R6gd-16xl) instances, which both use Optimized Reads, and the R6g-12xl and R6g-16xl instances, which are the baseline instances without Optimized Reads. The table and index size on the 16xl instances was 781 GB. The table and index size on the 12xl instances was 614 GB.

Instance details

Optimized Reads	Instance Type	vCPUs	Memory (GB)	Table/Index Size (GB)	Shared_Bufferpool_Size (GB)	NVMe_Cache_Size (GB)	Monthly Cost (us-east-1)
Yes	R6gd-12xlarge	48	384	614/781	250	1,500	$7,008
Yes	R6gd-16xlarge	64	512	614/781	322	2,100	$9,345
No	R6g-12xlarge	48	384	614/781	250	N/A	$5,831
No	R6g-16xlarge	64	512	614/781	350	N/A	$7,775

The following figure shows the throughput measured in queries per second (QPS), as the query workload against the BIGANN dataset is increased with simultaneous threads at a recall rate of 0.9578. We can observe that the Optimized Reads R6gd 12xl and 16xl instances are able to process more vector queries than the R6g instances, which rely on disk I/O for searching the vectors. Note that in the following figure, all four instances are configured for Aurora I/O-Optimized.

The following table shows that Optimized Reads R6gd-12xl and R6gd-16xl instances deliver on average six to seven times greater query throughput (with a range of 4.1–9.3) when compared to the baseline Aurora instances without Optimized Reads (tiered cache feature). Optimized Reads instances remove the disk I/O bottleneck to vector search, allowing the CPU utilization in this benchmark to reach as high as 95% on the R6gd-12xl. With the standard instances, the CPU utilization never rises above 15% because storage I/O is the bottleneck to performance. We also observe further performance benefits from Optimized Reads instances as concurrency increases, because page evictions become more frequent as the workload grows. The improved throughput of the Optimized Reads instances enables them to deliver an average monthly cost per query that is 20–25% of the cost of instances without Optimized Reads.

Threads	R6gd-12xl QPS (Optimized Reads)	Throughput vs. R6g-12xl (x)	R6g-12xl QPS	R6gd-16xl QPS (Optimized Reads)	Throughput vs. R6g-16xl (x)	R6g-16xl QPS
4	7.03	9.3	0.76	7.24	8.6	0.84
8	13.87	9.2	1.51	15.42	8.2	1.88
16	25.34	8	3.16	28.43	6.7	4.22
32	43.60	7.1	6.12	49.67	5.8	8.62
48	55.89	6.1	9.09	60.50	5.1	11.82
64	65.70	6.1	10.81	73.62	4.6	16.16
80	74.26	6.3	11.87	81.77	4.2	19.42
96	80.83	7.2	11.24	90.19	4.1	21.82

Metric	R6gd-12xl (Optimized Reads)	R6g-12xl	R6gd-16xl (Optimized Reads)	R6g-16xl
Monthly avg. queries	118,746,480	17,676,688	131,815,205	27,467,485
Monthly cost (us-east-1)	$7,008	$5,831	$9,345	$7,775
Cost per million queries	$59.02	$329.84	$70.89	$283.05
Cost reduction using Optimized Reads	82%	N/A	75%	N/A
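
The cost figures in the table can be approximately reproduced from the QPS columns. The sketch below assumes that monthly queries are the average QPS across the eight thread counts multiplied by the seconds in a 30-day month; that assumption is ours, not stated in the post, and it matches the published figures only to within rounding.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month

def cost_per_million_queries(monthly_cost, qps_samples):
    """Monthly instance cost divided by millions of queries served per month."""
    average_qps = sum(qps_samples) / len(qps_samples)
    monthly_queries = average_qps * SECONDS_PER_MONTH
    return monthly_cost / (monthly_queries / 1_000_000)

# QPS at 4, 8, 16, 32, 48, 64, 80, and 96 threads, copied from the table.
r6gd_12xl_qps = [7.03, 13.87, 25.34, 43.60, 55.89, 65.70, 74.26, 80.83]
r6g_12xl_qps = [0.76, 1.51, 3.16, 6.12, 9.09, 10.81, 11.87, 11.24]

optimized = cost_per_million_queries(7008, r6gd_12xl_qps)
baseline = cost_per_million_queries(5831, r6g_12xl_qps)
reduction = 1 - optimized / baseline
print(f"${optimized:.2f} vs. ${baseline:.2f} per million queries, {reduction:.0%} lower")
```

The same function can be applied to the 16xl columns with their monthly costs to compare the larger instance pair.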

In this experiment, the Optimized Reads instances deliver both better performance and better price/performance than the instances without Optimized Reads. Although this is one experiment, Optimized Reads can benefit vector workloads that exceed available instance memory, because the local NVMe cache reduces the need to fetch data from storage. Not every vector workload will benefit from Optimized Reads. The local NVMe only caches evicted pages that are unmodified, so if your vector data is updated frequently, you may not see the same speedup. Additionally, if your vector workload fits entirely in memory, you may not need an Optimized Reads instance, although running one will help your workload continue to scale on the same instance size.

Conclusion

Developers can build innovative generative AI applications and augment their foundation models with new or proprietary data (RAG) by using vector similarity search of images and text in Aurora PostgreSQL databases using SQL and pgvector. Amazon’s integration of Knowledge Bases for Amazon Bedrock with Aurora automates the RAG process, which you can learn more about in Build generative AI applications with Amazon Aurora and Knowledge Bases for Amazon Bedrock. Using the BIGANN-1B benchmark, we saw that Aurora Optimized Reads with HNSW indexing delivers up to nine times the query throughput of the equivalent instance type without Optimized Reads, at a cost per query that is 75–80% lower.

For vector workloads that exceed instance memory, Aurora Optimized Reads provides a high-performance, cost-effective option that delivers up to a 20x improvement in queries per second for pgvector HNSW indexing over IVFFlat indexing. For more information on how to get started with Optimized Reads, see Improving query performance for Aurora PostgreSQL with Aurora Optimized Reads.

We invite you to leave feedback in the comments.

About the Authors

Steve Dille is a Senior Product Manager for Amazon Aurora, and leads all generative AI strategy and product initiatives with Aurora databases for AWS. Before this role, Steve founded the performance and benchmark team for Aurora and then built and launched the Amazon RDS Data API for Amazon Aurora Serverless v2. He has been with AWS for 4 years. Prior to this, he served as a software developer at NCR, product manager at HP, and Data Warehousing Director at Sybase (SAP). He has over 20 years of experience as VP of Product or CMO on the executive teams of companies. Steve earned a Master’s in Information and Data Science at UC Berkeley, an MBA from the University of Chicago Booth School of Business, and a BS in Computer Science/Math with distinction from the University of Pittsburgh.

Mark Greenhalgh is a senior database engineer with over 20 years of experience designing, developing, and optimizing high-performance database systems. He specializes in analyzing database benchmarks and metrics to improve performance and scalability.

Sunil Kamath is the head of engineering for Amazon’s Aurora database Performance and PostgreSQL Engine development. His team drives the performance and scalability charter of the Amazon Aurora database as well as Aurora PostgreSQL’s Serverless and Engine features. Sunil has over 24 years of experience with databases and has previously worked at Microsoft and IBM. He earned an M.S. in Computer Science at the University of Alberta, Canada.

Jonathan Katz is a Principal Product Manager – Technical on the Amazon RDS team and is based in New York. He is a Core Team member of the open-source PostgreSQL project and an active open-source contributor.

Sudhir Kumar is a seasoned Senior Performance Engineer at Amazon based in East Palo Alto, leveraging his expertise to optimize Aurora PostgreSQL/MySQL efficiency and enhance overall Aurora RDS performance. With a proven track record in the systems performance domain, he plays a pivotal role in ensuring Amazon RDS operates at peak performance.

AWS Database Blog
