AWS Database Blog

Load vector embeddings up to 67x faster with pgvector and Amazon Aurora

pgvector is the open source PostgreSQL extension for vector similarity search that powers generative artificial intelligence (AI) applications using techniques such as semantic search and retrieval-augmented generation (RAG). Amazon Aurora PostgreSQL-Compatible Edition has supported pgvector 0.5.1 since 2023. Amazon Aurora now supports pgvector version 0.7.0, which adds parallelism to improve the performance of building Hierarchical Navigable Small World (HNSW) indexes. pgvector 0.7.0 also supports scalar and binary quantization, which can reduce the size or the number of dimensions of a vector to further improve cost and performance. In this post, we show you the performance tests we ran in the database performance lab at AWS comparing the index build and query times of pgvector 0.7.0 with the prior pgvector 0.5.1 version in Aurora PostgreSQL.

In use cases such as semantic search and RAG, Aurora PostgreSQL-Compatible Edition serves as a vector store knowledge base that contains data that the large language model (LLM) wasn't trained on. This could be your business data or more recent data that became available after the model was trained. This data, or a link to the text data, is saved in a row and is associated with a vector generated by an embedding model such as Amazon Titan Text Embeddings or Cohere Embed English. Vector search is then used to find the stored vectors that are most similar to a query vector initiated by a user or application against the knowledge base. In its most basic form, this is accomplished by comparing the query vector with every vector in the database using a distance function to determine the k-nearest neighbors (k-NN). Even though each individual vector comparison is fast, this full-scan process can make vector search hard to scale.
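To make the basic k-NN flow concrete, the following is a minimal sketch of an exact search in pgvector. The documents table, its columns, and the use of cosine distance are illustrative assumptions, not part of our benchmark setup.

-- Hypothetical knowledge base table; embeddings sized for a 1,536-dimension model.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
    id bigint PRIMARY KEY,
    content text,
    embedding vector(1536)
);

-- Exact k-NN search: compare the query embedding against every stored vector with
-- the cosine distance operator (<=>) and return the 10 closest matches. Without an
-- ANN index, this is a full scan of the table.
-- $1 is the query embedding, bound by the application.
SELECT id, content
FROM documents
ORDER BY embedding <=> $1::vector(1536)
LIMIT 10;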

To improve vector search, pgvector implemented approximate nearest neighbor (ANN) distance measures and indexes. These constructs accelerate data scans by performing vector similarity searches over a subset of the database. pgvector supports two ANN index types: Inverted File Flat (IVFFlat) and Hierarchical Navigable Small World (HNSW). IVFFlat divides vectors into clusters, and then searches a subset of those clusters that are closest to the query vector. HNSW uses a multi-layered, graph-based approach designed for vector search over billions of rows. Because HNSW is fast, efficient at scale, and widely used, we chose it for our tests.

HNSW has two drawbacks:

  1. High memory requirement: The HNSW index requires more memory than the IVFFlat index. You can address this by using a larger database instance, or by using Aurora Serverless, which can scale up to build an HNSW index and then scale back down to reduce memory costs. pgvector 0.7.0 also addresses this with new scalar and binary quantization, which compress the index to reduce the amount of memory used.
  2. Long index build time: HNSW indexes can take hours to build because of the time spent calculating the distances between vectors used for similarity search. pgvector 0.7.0 helps solve this by supporting parallel index builds for HNSW indexes (see the sketch after this list). By using parallel workers, you can now build your HNSW index up to 30x faster with Aurora and pgvector 0.7.0 compared to Aurora and pgvector 0.5.1. When this is combined with the new quantization techniques, we see index builds up to 67x faster with pgvector 0.7.0 compared to pgvector 0.5.1 on Aurora.
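The following is a minimal sketch of a parallel HNSW index build on the hypothetical documents table from the earlier example. The parameter values shown are illustrative; max_parallel_maintenance_workers and maintenance_work_mem are standard PostgreSQL settings that also govern pgvector's parallel HNSW builds.

-- Allow parallel workers and extra memory for the index build (illustrative values).
SET max_parallel_maintenance_workers = 46;
SET maintenance_work_mem = '8GB';

-- Build an HNSW index on the embedding column using cosine distance.
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);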

Test results

Before we can talk about our test results, we have to address the important topic of recall. Recall is a metric that tells you the quality of your search results. It is defined as the percentage of the expected results that are actually returned during a vector search. For example, if we expected a query to return 10 vectors but only 9 of the expected results are returned, recall is 0.90 (90%). In our testing, we used a target recall of 0.95 (95%) or greater with pgvector 0.7.0. This is important because it would be simple to produce very fast results of poor quality with low recall. When comparing the results of vector databases, you should always understand the recall assumptions to make sure that you are comparing apples to apples.

All of our tests were run using VectorDBBench: A Benchmark Tool for VectorDB, modified to support parallel table load and binary quantization. This tool includes a variety of publicly available datasets. Our first test used the OpenAI 5M row data set and the second test used the larger Cohere 10M row data set. Both tests use dense vectors with 768 or 1,536 dimensions, which are typical output sizes for modern embedding models. For example, the embedding dimension of Hugging Face Instructor-xl is 768, and Amazon Titan Text Embeddings outputs a vector of 1,536 dimensions. To make sure that we achieved maximum parallelism, we set the number of parallel workers to 46 (48 vCPUs minus 2) so we would never consume all compute capacity on the r7g.12xlarge instance.

Table 1 summarizes our benchmark and data sets.

Table 1
Benchmark | Dataset | Dimensions / type | # Vectors | Dataset size
VectorDBBench | openai5m | 1,536 / fp32 | 5M | 31 GB
VectorDBBench | cohere10m | 768 / fp32 | 10M | 31 GB

HNSW index build and loading speeds up to 30x faster with pgvector 0.7.0

Table 2 shows the load duration results of our OpenAI 5M row data set test.

Table 2
Dataset | pgvector version | Load duration (seconds) | Load duration speedup | Recall | Instance type
openai5m | 0.5.1 | 29,752.6 | 1.0 | 0.973 | r7g.12xlarge
openai5m | 0.7.0 | 1,272.1 | 23.4 | 0.972 | r7g.12xlarge

Table 3 shows the load duration results of our Cohere 10M row data set test.

Table 3
Dataset | pgvector version | Load duration (seconds) | Load duration speedup | Recall | Instance type
cohere10m | 0.5.1 | 49,836.9 | 1.0 | 0.934 | r7g.12xlarge
cohere10m | 0.7.0 | 1,685.6 | 29.6 | 0.951 | r7g.12xlarge

Figure 1 shows the results of the OpenAI 5M row and Cohere 10M row load and index creation tests for pgvector 0.5.1 compared to pgvector 0.7.0 with Aurora PostgreSQL.

Figure 1: Load and index creation time, pgvector 0.5.1 compared to pgvector 0.7.0

As the test results in Tables 2 and 3 and Figure 1 show, the promise of using parallelism on index builds really pays off with pgvector 0.7.0 in Aurora when compared to the prior shipping version of Aurora with pgvector 0.5.1. Without using quantization methods, our tests show up to a 30x speedup in load duration on the larger Cohere 10M row data set and a 23x speedup on the smaller OpenAI 5M row data set. With a 30x speedup, an index build that was taking 15 hours can now complete in about 30 minutes.

Scalar quantization with halfvec saves 50% of vector memory consumption and storage costs with similar search quality

pgvector 0.7.0 includes quantization features that reduce the vector and index sizes to reduce memory and storage consumption. Vector search applications will run fastest when the entire index can be kept in memory. Quantization allows vector-based applications to run fast on less costly, smaller instances. We tested how scalar quantization with halfvec reduces memory and storage costs without compromising the performance and efficiency of pgvector 0.7.0 indexes.

Quantization allows us to reduce the size of the vector dimensions. pgvector offers various types of quantization, but in this post we tested scalar quantization, which uses half-precision vectors (halfvec) that represent floats in 16 bits instead of the 32 bits used by the normal vector data type (fullvec). For example, OpenAI's text-embedding-3-small embedding vector has 1,536 dimensions and requires 6,148 bytes per vector (4 bytes x 1,536 dimensions + 4). With halfvec, the embedding vector requires only 3,076 bytes of storage. This allows us to fit 2 vectors into each 8 KB index page instead of 1, letting us store our data in fewer pages.
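As a hedged sketch, there are two common ways to use halfvec; the table names below are hypothetical, and halfvec_cosine_ops is the half-precision cosine operator class introduced with pgvector 0.7.0.

-- Option 1: store embeddings directly as half-precision values (2 bytes per dimension).
CREATE TABLE documents_half (
    id bigint PRIMARY KEY,
    content text,
    embedding halfvec(1536)
);

-- Option 2: keep full-precision vectors in the table and quantize only inside the
-- HNSW index by casting in the index expression.
CREATE INDEX ON documents USING hnsw ((embedding::halfvec(1536)) halfvec_cosine_ops);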

pgvector 0.7.0 support for halfvec reduces vector and index storage space by 50%. However, this can impact accuracy, speed, and resource usage compared to using the full vector type. Remember that the HNSW algorithm is used for efficiently finding similar vectors in large datasets. It constructs a multi-layered graph, where each layer represents a subset of the dataset. The top layer contains the fewest data points, while the bottom layer contains all the data points. The search process begins at the highest layer, where the algorithm selects the vector closest to the query, and proceeds down layer by layer, each time selecting the nearest vector. This continues until it reaches the bottom layer, where it returns the set of similar vectors.

We tested with two HNSW parameters that determine the quality of the index build (ef_construction) and the quality of search results (ef_search). There is also an m parameter that determines the maximum number of connections (edges) a data point (node) can have in the graph; we used m = 16 in all tests. The sketch after the following list shows where each parameter is set.

  • ef_construction: This parameter is used when the graph is being built. Think of it as the algorithm’s thoroughness when it’s adding a new point to the graph. A higher ef_construction means the algorithm will search more extensively for neighbors, which can make the graph more accurate. However, this also means it will take more time and resources to build the graph. We ran tests with ef_construction = 256 for the highest quality results.
  • ef_search: This parameter comes into play when you’re searching for the nearest neighbors of a specific point in the graph. A higher ef_search means the algorithm will look more extensively for nearest neighbors, which can improve the accuracy of the search. However, this might slow down the search process. We ran tests with ef_search = 256 for the highest quality results.
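The following is a minimal sketch of where these parameters appear, again using the hypothetical documents table; m and ef_construction are index storage parameters, and hnsw.ef_search is a session setting read at query time.

-- Build-time parameters: m controls graph connectivity, ef_construction controls how
-- thoroughly neighbors are searched while the graph is built.
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 256);

-- Query-time parameter: ef_search controls how widely the graph is explored per query.
SET hnsw.ef_search = 256;

SELECT id
FROM documents
ORDER BY embedding <=> $1::vector(1536)
LIMIT 10;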

Table 4 compares the pgvector 0.7.0 halfvec and fullvec test results.

Table 4
Dataset / # of rows | ef_search and ef_construction | Vector type | Table/index size | Database memory % | Load duration (seconds) | Recall
cohere 10m | 256 | fullvec | 38 GB | 15.12% | 1,685.6 | 0.950
cohere 10m | 256 | halfvec | 19 GB | 7.55% | 1,696.3 | 0.950
openai 5m | 256 | fullvec | 38 GB | 15.12% | 1,272.1 | 0.971
openai 5m | 256 | halfvec | 19 GB | 7.55% | 1,284.1 | 0.969

Figure 2 visually compares the table/index size of halfvec and fullvec.

Figure 2: Table/index size, halfvec compared to fullvec

Figure 3 visually compares the load/index creation time in seconds achieved using halfvec and fullvec.

Figure 3: Load/index creation time in seconds, halfvec compared to fullvec

Figure 4 visually compares the recall achieved using halfvec and fullvec.

Figure 4: Recall, halfvec compared to fullvec

So, what impact does halfvec have on index build/load, cost, and search performance? From the results in Table 4 and Figure 2, we see that index size is reduced by 50%, which clearly saves on memory and storage costs. For example, if an r7g.12xlarge instance were required to keep the index in memory when using fullvec, that same index could now fit in a smaller r7g.6xlarge instance with halfvec. Alternatively, you can double the number of dimensions on the same hardware: if you used 2,000 dimensions in pgvector with fullvec, you can now support 4,000 dimensions with halfvec. The good news is that halfvec provides a large benefit at very low cost. These tests show that taking advantage of the 50% reduction in index size has very little impact on load duration or recall, as seen in Table 4 and Figures 3-4.

Binary quantization can improve index build and loading speeds up to 67x with pgvector 0.7.0

If we take this test further and apply binary quantization together with halfvec 2-byte floats, we see a 67x speedup in HNSW index build performance for pgvector 0.7.0 compared to our original pgvector 0.5.1 results, as shown in Tables 5 and 6 and Figure 5. Binary quantization is a more extreme quantization technique in that it reduces each dimension of a vector to a single bit of information. Specifically, binary quantization reduces any positive value to 1, and any zero or negative value to 0. This leads to a large reduction in memory usage and an increase in queries per second (QPS), but it will likely have a negative impact on recall due to the loss of precision in many uses. Our testing has shown that to improve the recall, we need to re-rank the results, and that is what we did in our test. With re-ranking, we used binary quantization to narrow down the result set of vectors, and then reordered the reduced set of vectors using the original full-precision vectors stored in the table. Here is an example of this re-ranking query; the %s placeholders are parameters bound at runtime by the benchmark client.

select i.id
from (
    -- Inner query: use the HNSW index over binary-quantized vectors (Hamming
    -- distance, <~>) to narrow the candidates, while also computing the exact
    -- distance against the original full-precision vectors.
    select id, embedding <=> %s as distance
    from public."pg_vector_collection"
    order by binary_quantize(embedding)::bit(1536) <~> binary_quantize(%s)
    limit %s::int
) i
-- Outer query: re-rank the candidates by exact distance and return the top results.
order by distance
limit %s::int;

If you are curious as to how we reduced vector values to a single bit, pgvector supports binary quantization through vector operations on the PostgreSQL bit type. pgvector provides two bitwise distance functions: Jaccard and Hamming distance. Jaccard similarity measures how similar two vectors are by comparing their common elements; it's like comparing two groups of friends and seeing how many friends they have in common. Hamming distance measures in how many positions two strings of equal length differ; it's like comparing two words to see how many letters are different. Hamming distance had better performance characteristics for this test, so we chose it.
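For reference, the following is a minimal sketch of how an HNSW index over binary-quantized vectors can be declared, reusing the hypothetical documents table; bit_hamming_ops is the Hamming distance operator class for the bit type, and bit_jaccard_ops is the Jaccard alternative.

-- Index the binary-quantized form of each embedding; candidates are compared with the
-- Hamming distance operator (<~>). The ORDER BY expression in a query must match this
-- index expression for the index to be used.
CREATE INDEX ON documents USING hnsw
    ((binary_quantize(embedding)::bit(1536)) bit_hamming_ops);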

Table 5 shows the results of our load duration OpenAI 5M row data set test comparing the performance of pgvector 0.5.1 and pgvector 0.7.0 with and without binary quantization.

Table 5
Dataset | pgvector version | Load duration (seconds) | Load duration speedup | Recall | Instance type
openai5m | 0.5.1 | 29,752.6 | 1.0 | 0.973 | r7g.12xlarge
openai5m | 0.7.0 | 1,272.0 | 23.4 | 0.972 | r7g.12xlarge
openai5m | 0.7.0 with binary quantization | 445.4 | 66.8 | 0.822 | r7g.12xlarge

Table 6 shows the results of our load duration Cohere 10M row data set test comparing the performance of pgvector 0.5.1 and pgvector 0.7.0 with and without binary quantization.

Table 6
Dataset | pgvector version | Load duration (seconds) | Load duration speedup | Recall | Instance type
cohere10m | 0.5.1 | 49,836.9 | 1.0 | 0.934 | r7g.12xlarge
cohere10m | 0.7.0 | 1,680.0 | 29.7 | 0.952 | r7g.12xlarge
cohere10m | 0.7.0 with binary quantization | 738.6 | 67.5 | 0.659 | r7g.12xlarge

Figure 5 visually compares the load/index creation duration of pgvector 0.5.1 and pgvector 0.7.0 with and without binary quantization.

Figure 5: Load/index creation duration, pgvector 0.5.1 compared to pgvector 0.7.0 with and without binary quantization

As predicted, we did see a drop in recall once we employed binary quantization with this particular dataset. This is because a vector value carries a large amount of information, and that precision is lost when each dimension is reduced to a single bit. Other data sets may not experience the same drop in recall, or may experience a greater one. Be sure to test this technique with your data to make sure that you are getting the recall quality and cost your use case needs.

Summary

We compared the index build and query times of the new pgvector 0.7.0 with the prior pgvector 0.5.1 version in Aurora PostgreSQL to demonstrate the performance and cost improvements of parallel index builds and scalar and binary quantization. We found that HNSW index build and loading speeds were up to 30x faster with pgvector 0.7.0 in Aurora PostgreSQL compared to pgvector 0.5.1. In addition, we found that scalar quantization with halfvec saved 50% of vector memory consumption and storage costs while providing similar search quality and query performance in our tests. Last, we found a 67x speedup in HNSW index build performance when using binary quantization available in pgvector 0.7.0 compared to pgvector 0.5.1 index build times in Aurora PostgreSQL. We noted a significant drop in recall when using binary quantization in our tests, which can be improved with re-ranking. We advise that you test binary quantization and the search result quality, as measured by recall, with your data before adopting it.

We invite your comments on this post. You can learn more about pgvector and the generative artificial intelligence (AI) ecosystem on the Aurora features page.


About the authors

Steve Dille is a Senior Product Manager for Amazon Aurora, and leads all generative AI strategy and product initiatives with Aurora databases for AWS. Previous to this role, Steve founded the performance and benchmark team for Aurora and then built and launched the Amazon RDS Data API for Amazon Aurora Serverless v2. He has been with AWS for 5 years. Prior to this, he served as a software developer at NCR, product manager at HP, and Data Warehousing Director at Sybase (SAP). He has over 20 years of experience as VP of Product or CMO on the executive teams of companies. Steve earned a Master’s in Information and Data Science at UC Berkeley, an MBA from the University of Chicago Booth School of Business, and a BS in Computer Science/Math with distinction from the University of Pittsburgh.

Mark Greenhalgh is a senior database engineer with over 20 years of experience designing, developing, and optimizing high-performance database systems. He specializes in analyzing database benchmarks and metrics to improve performance and scalability.