AWS Big Data Blog
Save big on OpenSearch: Unleashing Intel AVX-512 for binary vector performance
With OpenSearch version 2.19, Amazon OpenSearch Service now delivers hardware-accelerated improvements in latency and throughput for binary vectors. When you choose the latest-generation Intel Xeon instances for your data nodes, OpenSearch uses AVX-512 acceleration to bring up to 48% higher throughput compared with previous-generation R5 instances, and 10% higher throughput compared with OpenSearch 2.17 and earlier. There’s no need to change your settings. You will simply see the improvements when you upgrade to OpenSearch 2.19 and use C7i, M7i, or R7i instances.
In this post, we discuss the improvements these advanced processors provide to your OpenSearch workloads, and how they can help you lower your total cost of ownership (TCO).
Difference between full-precision and binary vectors
When you use OpenSearch Service for semantic search, you create vector embeddings that you store in OpenSearch. OpenSearch’s k-nearest neighbors (k-NN) plugin provides engines—Facebook AI Similarity Search (FAISS), Non-Metric Space Library (NMSLib), and Apache Lucene—and algorithms—Hierarchical Navigable Small World (HNSW) and Inverted File (IVF)—that store embeddings and compute nearest neighbor matches.
Vector embeddings are high-dimension arrays of 32-bit floating-point numbers (FP32). Large language models (LLMs), foundation models (FMs), and other machine learning (ML) models generate vector embeddings from their inputs. A typical 384-dimension embedding takes 384 * 4 = 1,536 bytes. As the number of vectors in the solution grows into the millions (or billions), it becomes costly to store and work with that much data.
OpenSearch Service supports binary vectors. These vectors use 1 bit to store each dimension, so a 384-dimension binary embedding takes 384 / 8 = 48 bytes to store. Of course, in reducing the number of bits, you also lose information: binary vectors don’t provide recall as accurate as full-precision vectors. In exchange, binary vectors are substantially less costly and provide substantially better latency.
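To make the storage arithmetic concrete, here is a minimal Python sketch (illustration only, not part of OpenSearch; sign-thresholding is just one common way to binarize an embedding) that packs a 384-dimension FP32 embedding into a 48-byte binary vector:

```python
import numpy as np

# Illustration only: compress a 384-dimension FP32 embedding into a binary vector
# by keeping just the sign of each dimension (one common binarization scheme).
fp32_embedding = np.random.randn(384).astype(np.float32)

bits = (fp32_embedding > 0).astype(np.uint8)   # 1 bit of information per dimension
binary_embedding = np.packbits(bits)           # pack 8 dimensions into each byte

print(fp32_embedding.nbytes)    # 384 * 4 = 1536 bytes
print(binary_embedding.nbytes)  # 384 / 8 = 48 bytes
```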
Hardware acceleration: AVX-512 and popcount instructions
Binary vectors rely on Hamming distance to measure similarity. The Hamming distance between two binary vectors is the number of bit positions at which the two values differ. Computing it relies on a technique called popcount (population count), which is briefly described below.
For example, for finding the Hamming distance between 5 and 3:
- 5 = 101
- 3 = 011
- Differences at two positions (bitwise XOR): 101 ⊕ 011 = 110 (2 ones)
Therefore, Hamming distance (5, 3) = 2.
Popcount is an operation that counts the number of 1 bits in a binary input. The Hamming distance between two binary inputs is equivalent to the popcount of their bitwise XOR result. The AVX-512 instruction set includes a native vectorized popcount instruction, which makes popcount and Hamming distance calculations fast.
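The same calculation is straightforward to express in code. The following minimal Python sketch mirrors the worked example above, with Python’s built-in bit counting standing in for the hardware popcount instruction:

```python
def hamming_distance(a: int, b: int) -> int:
    # XOR marks the positions where the bits differ; popcount counts them.
    return (a ^ b).bit_count()  # int.bit_count() requires Python 3.10+

print(hamming_distance(5, 3))  # 0b101 ^ 0b011 = 0b110 -> 2
```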
OpenSearch 2.19 integrates advanced Intel AVX-512 instructions in the FAISS engine. When you use binary vectors with OpenSearch 2.19 in OpenSearch Service, OpenSearch can maximize performance on the latest Intel Xeon processors. The OpenSearch k-NN plugin with FAISS uses a specialized build mode, avx512_spr, that accelerates the Hamming distance computation with the _mm512_popcnt_epi64 vector instruction. _mm512_popcnt_epi64 counts the number of logical 1 bits in eight 64-bit integers at once, which reduces the instruction pathlength (the number of instructions the CPU executes) by a factor of eight. The benchmarks in the next sections demonstrate the improvements this optimization brings to OpenSearch binary vectors.
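The following conceptual sketch (plain Python, not the FAISS implementation) illustrates the pathlength reduction: a 512-bit chunk of a binary vector needs eight scalar popcounts, but a single vectorized popcount covers all eight 64-bit lanes at once.

```python
# Eight 64-bit words: the amount of data one 512-bit AVX-512 register holds.
words = [0x0123456789ABCDEF, 0xDEADBEEFDEADBEEF] * 4

# Scalar path: one popcount per word -> eight popcount operations.
scalar_total = sum(w.bit_count() for w in words)

# Vectorized path (conceptually what _mm512_popcnt_epi64 does): all eight words
# handled in one operation. Emulated here by packing the words into a single
# integer and counting its bits once; the real instruction returns one count per lane.
packed = int.from_bytes(b"".join(w.to_bytes(8, "little") for w in words), "little")
vector_total = packed.bit_count()

assert scalar_total == vector_total
```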
There is no special configuration required to take advantage of the optimization, because it’s enabled by default. The requirements for using the optimization are as follows (a sample index mapping follows the list):
- OpenSearch version 2.19 or later
- Intel 4th Generation Xeon or newer instances (C7i, M7i, or R7i) for data nodes
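As an illustration of a setup that benefits from this optimization, the following sketch uses the opensearch-py client to create a binary vector index on the FAISS engine with the Hamming space type. The endpoint, index name, field name, and HNSW parameters are placeholders; adjust them for your domain.

```python
from opensearchpy import OpenSearch

# Placeholder endpoint and connection settings; replace with your domain's values.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 768,
                "data_type": "binary",  # 1 bit per dimension
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "hamming",
                    "parameters": {"ef_construction": 100, "m": 16},
                },
            }
        }
    },
}

client.indices.create(index="binary-vectors", body=index_body)
```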
Where do binary vector workloads spend the bulk of their time?
To put our system through its paces, we created a test dataset of 10 million binary vectors. We chose the Hamming space for measuring distances between vectors because it’s particularly well-suited for binary data. This substantial dataset helped us generate enough stress on the system to pinpoint exactly where performance bottlenecks might occur. If you’re interested in the details, you can find the complete cluster configuration and index settings for this analysis in Appendix 2 at the end of this post.
The following profile analysis of binary vector workloads, captured as a flame graph, shows that the bulk of the time is spent in the FAISS library computing Hamming distances. We observed up to 66% of the time spent in BinaryIndices in the FAISS library.
Benchmarks and results
In the next sections, we look at the results of optimizing this logic and the benefits to OpenSearch workloads along two dimensions:
- Price-performance: with reduced CPU consumption, you might be able to reduce the number of instances in your domain
- Performance gains due to the Intel popcount instruction
Price-performance and TCO gains for OpenSearch users
If you want to take advantage of the performance gains, we recommend R7i instances, which have a high memory-to-core ratio, for your data nodes. We benchmarked a 10-million-vector and a 100-million-vector dataset and measured the resulting improvements on an R7i instance compared to an R5 instance. R5 instances support avx512 instructions, but not the advanced instructions present in avx512_spr, which are only available on R7i and newer Intel instances.
On average, we observed 20% gains in indexing throughput and up to 48% gains in search throughput when comparing R5 and R7i instances. R7i instances are about 13% more costly than R5 instances, so the price-performance favors the R7is. The 100-million-vector dataset showed slightly better results, with search throughput improving by more than 40%. We document the test configuration in Appendix 1 and present the tabular results in Appendix 3.
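As a rough illustration using the numbers above (up to 48% higher search throughput at roughly 13% higher instance price), the relative search throughput per dollar improves by about 31%:

```python
# Relative numbers taken from the benchmark summary above; actual prices vary by Region.
r5_throughput, r7i_throughput = 1.00, 1.48   # relative search throughput
r5_price, r7i_price = 1.00, 1.13             # relative instance price

gain = (r7i_throughput / r7i_price) / (r5_throughput / r5_price) - 1
print(f"~{gain * 100:.0f}% more search throughput per dollar")  # ~31%
```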
The following figures visualize the results with the 10-million-vector dataset.
The following figures visualize the results with the 100-million-vector dataset.
Performance gains due to popcount instruction in AVX-512
This section is for advanced users interested in the extent of the improvements the new avx512_spr build mode provides and in where the performance gains come from. The OpenSearch configuration used in this experiment is documented in Appendix 2.
We ran an OpenSearch benchmark on R7i instances with and without the Hamming distance optimization. You can disable avx512_spr by setting knn.faiss.avx512_spr.disabled to true in your opensearch.yml file, as described in SIMD optimization. The data shows that the feature provides a 10% throughput improvement on indexing and search and a 10% reduction in latency when the client load is constant.
The gain comes from the _mm512_popcnt_epi64 hardware instruction present on Intel processors, which reduces the pathlength for these workloads. The hotspot identified in the earlier section is optimized with code that uses the hardware instruction. Fewer CPU cycles are spent running the same workload, which translates to a 10% speed-up for binary vector indexing and a corresponding latency reduction for search workloads on OpenSearch.
The following figures visualize the benchmarking results.
Conclusion
Improving storage, memory, and compute is key to optimizing vector search. Binary vectors already offer storage and memory benefits over FP32/FP16. This post detailed how our improvements to Hamming distance calculations improve compute performance by up to 48% when comparing R5 and R7i instances on AWS. Although binary vectors fall short of matching the recall of their FP32 counterparts, techniques such as oversampling and rescoring help improve recall rates. If you’re handling massive datasets, compute costs become a major expense. We’ve shown that migrating to Intel’s R7i and newer offerings on AWS can substantially reduce infrastructure costs, making these processors a highly efficient choice for users.
Hamming distance with the newer AVX-512 instructions is available in OpenSearch 2.19 and later. We encourage you to give it a try on the latest Intel instances in your preferred cloud environment.
The new instructions also provide additional opportunities to use hardware acceleration in other areas of vector search, such as FP16 and BF16 quantization. We are also interested in exploring other hardware accelerators for vector search, such as AMX and AVX-10.
About the Authors
Akash Shankaran is a Software Architect and Tech Lead in the Xeon software team at Intel. He works on pathfinding opportunities and enabling optimizations on OpenSearch.
Mulugeta Mammo is a Senior Software Engineer and currently leads the OpenSearch Optimization team at Intel.
Noah Staveley is a Cloud Development Engineer currently working in the OpenSearch Optimization team at Intel.
Assane Diop is a Cloud Development Engineer, and currently works in the OpenSearch Optimization team at Intel.
Naveen Tatikonda is a software engineer at AWS, working on the OpenSearch Project and Amazon OpenSearch Service. His interests include distributed systems and vector search.
Vamshi Vijay Nakkirtha is a software engineering manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems.
Dylan Tong is a Senior Product Manager at Amazon Web Services. He leads the product initiatives for AI and machine learning (ML) on OpenSearch including OpenSearch’s vector database capabilities. Dylan has decades of experience working directly with customers and creating products and solutions in the database, analytics and AI/ML domain. Dylan holds a BSc and MEng degree in Computer Science from Cornell University.
Notices and disclaimers
Intel and the OpenSearch team collaborated on adding the Hamming distance feature. Intel contributed by designing and implementing the feature, and Amazon contributed by updating the toolchain, including compilers, release management, and documentation. Both teams collected data points showcased in the post.
Performance varies by use, configuration, and other factors. Learn more on the Performance Index website.
Your costs and results may vary.
Intel technologies might require enabled hardware, software, or service activation.
Appendix 1
The following table summarizes the test configuration for results in Appendix 3.
| Parameter | avx512 | avx512_spr |
| --- | --- | --- |
| Vector dimension | 768 | 768 |
| ef_construction | 100 | 100 |
| ef_search | 100 | 100 |
| Primary shards | 8 | 8 |
| Replicas | 1 | 1 |
| Data nodes | 2 | 2 |
| Data node instance type | R5.4xl | R7i.4xl |
| vCPUs per data node | 16 | 16 |
| Cluster manager nodes | 3 | 3 |
| Cluster manager node instance type | c5.xl | c5.xl |
| Data type | binary | binary |
| Space type | Hamming | Hamming |
Appendix 2
The following table summarizes the OpenSearch configuration used for this benchmarking.
| Parameter | Value (avx512 and avx512_spr runs) |
| --- | --- |
| OpenSearch version | 2.19 |
| Engine | faiss |
| Dataset | random-768-10M |
| Vector dimension | 768 |
| ef_construction | 256 |
| ef_search | 256 |
| Primary shards | 4 |
| Replicas | 1 |
| Data nodes | 2 |
| Cluster manager nodes | 1 |
| Data node instance type | R7i.2xl |
| Client instance | m6id.16xlarge |
| Data type | binary |
| Space type | Hamming |
| Indexing clients | 20 |
| Query clients | 20 |
| Force merge segments | 1 |
Appendix 3
This appendix contains the results of the 10-million-vector and 100-million-vector dataset runs.
The following table summarizes the query results in queries per second (QPS).
| Dataset | Dimension | Build mode | Query clients | Mean throughput, no force merge (QPS) | Median throughput, no force merge (QPS) | Mean throughput, force merge to 1 segment (QPS) | Median throughput, force merge to 1 segment (QPS) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| random-768-10M | 768 | avx512 | 10 | 397.00 | 398.00 | 1321.00 | 1319.00 |
| random-768-10M | 768 | avx512_spr | 10 | 516.00 | 525.00 | 1542.00 | 1544.00 |
| %gain | – | – | – | 29.97 | 31.91 | 16.73 | 17.06 |
| random-768-10M | 768 | avx512 | 20 | 424.00 | 426.00 | 1849.00 | 1853.00 |
| random-768-10M | 768 | avx512_spr | 20 | 597.00 | 600.00 | 2127.00 | 2127.00 |
| %gain | – | – | – | 40.81 | 40.85 | 15.04 | 14.79 |
| random-768-100M | 768 | avx512 | 10 | 219 | 220 | 668 | 668 |
| random-768-100M | 768 | avx512_spr | 10 | 324 | 324 | 879 | 887 |
| %gain | – | – | – | 47.95 | 47.27 | 31.59 | 32.78 |
| random-768-100M | 768 | avx512 | 20 | 234 | 235 | 756 | 757 |
| random-768-100M | 768 | avx512_spr | 20 | 338 | 339 | 1054 | 1062 |
| %gain | – | – | – | 44.44 | 44.26 | 39.42 | 40.29 |
The following table summarizes the indexing results.
| Dataset | Dimension | Build mode | Indexing clients | Mean throughput (documents/second) | Median throughput (documents/second) | Force merge (minutes) |
| --- | --- | --- | --- | --- | --- | --- |
| random-768-10M | 768 | avx512 | 20 | 58729 | 57135 | 61 |
| random-768-10M | 768 | avx512_spr | 20 | 63595 | 65240 | 57 |
| %gain | – | – | – | 8.29 | 14.19 | 7.02 |
| random-768-100M | 768 | avx512 | 16 | 28006 | 25381 | 682 |
| random-768-100M | 768 | avx512_spr | 16 | 33477 | 30581 | 634 |
| %gain | – | – | – | 19.54 | 20.49 | 7.04 |