AWS Big Data Blog
Save big on OpenSearch: Unleashing Intel AVX-512 for binary vector performance
With OpenSearch version 2.19, Amazon OpenSearch Service now delivers hardware-accelerated improvements in latency and throughput for binary vectors. When you choose the latest-generation Intel Xeon instances for your data nodes, OpenSearch uses AVX-512 acceleration to bring up to 48% higher throughput compared with previous-generation R5 instances, and 10% higher throughput compared with OpenSearch 2.17 and earlier. There’s no need to change your settings. You will simply see the improvements when you upgrade to OpenSearch 2.19 and use C7i, M7i, or R7i instances.
In this post, we discuss the improvements these advanced processors provide to your OpenSearch workloads, and how they can help you lower your total cost of ownership (TCO).
Difference between full-precision and binary vectors
When you use OpenSearch Service for semantic search, you create vector embeddings that you store in OpenSearch. OpenSearch’s k-nearest neighbors (k-NN) plugin provides engines—Facebook AI Similarity Search (FAISS), Non-Metric Space Library (NMSLib), and Apache Lucene—and algorithms—Hierarchical Navigable Small World (HNSW) and Inverted File (IVF)—that store embeddings and compute nearest neighbor matches.
Vector embeddings are high-dimension arrays of 32-bit floating-point numbers (FP32). Large language models (LLMs), foundation models (FMs), and other machine learning (ML) models generate vector embeddings from their inputs. A typical 384-dimension embedding takes 384 * 4 = 1,536 bytes. As the number of vectors in the solution grows into the millions (or billions), it becomes costly to store and work with that much data.
OpenSearch Service supports binary vectors. These vectors use 1 bit to store each dimension, so a 384-dimension binary embedding takes 384 / 8 = 48 bytes to store. Of course, in reducing the number of bits, you also lose information: binary vectors don’t provide recall as accurate as full-precision vectors. In exchange, binary vectors are substantially less costly and provide substantially better latency.
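To make the storage arithmetic concrete, here is a minimal Python sketch (illustration only, not part of OpenSearch; sign-thresholding is just one common way to binarize an embedding) that packs a 384-dimension FP32 embedding into a 48-byte binary vector:

```python
import numpy as np

# Illustration only: compress a 384-dimension FP32 embedding into a binary vector
# by keeping just the sign of each dimension (one common binarization scheme).
fp32_embedding = np.random.randn(384).astype(np.float32)

bits = (fp32_embedding > 0).astype(np.uint8)   # 1 bit of information per dimension
binary_embedding = np.packbits(bits)           # pack 8 dimensions into each byte

print(fp32_embedding.nbytes)    # 384 * 4 = 1536 bytes
print(binary_embedding.nbytes)  # 384 / 8 = 48 bytes
```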
Hardware acceleration: AVX-512 and popcount instructions
Binary vectors rely on Hamming distance to measure similarity. The Hamming distance between two binary vectors is the number of bit positions at which the two values differ. Computing it relies on a technique called popcount (population count), which is briefly described below.
For example, for finding the Hamming distance between 5 and 3:
- 5 = 101
- 3 = 011
- Differences at two positions (bitwise XOR): 101 ⊕ 011 = 110 (2 ones)
Therefore, Hamming distance (5, 3) = 2.
Popcount is an operation that counts the number of 1 bits in a binary input. The Hamming distance between two binary inputs is equivalent to the popcount of their bitwise XOR result. The AVX-512 instruction set includes a native vectorized popcount instruction, which makes popcount and Hamming distance calculations fast.
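The same calculation is straightforward to express in code. The following minimal Python sketch mirrors the worked example above, with Python’s built-in bit counting standing in for the hardware popcount instruction:

```python
def hamming_distance(a: int, b: int) -> int:
    # XOR marks the positions where the bits differ; popcount counts them.
    return (a ^ b).bit_count()  # int.bit_count() requires Python 3.10+

print(hamming_distance(5, 3))  # 0b101 ^ 0b011 = 0b110 -> 2
```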
OpenSearch 2.19 integrates advanced Intel AVX-512 instructions in the FAISS engine. When you use binary vectors with OpenSearch 2.19 in OpenSearch Service, OpenSearch can maximize performance on the latest Intel Xeon processors. The OpenSearch k-NN plugin with FAISS uses a specialized build mode, avx512_spr, that accelerates the Hamming distance computation with the _mm512_popcnt_epi64 vector instruction. _mm512_popcnt_epi64 counts the number of logical 1 bits in eight 64-bit integers at once, which reduces the instruction pathlength (the number of instructions the CPU executes) by a factor of eight. The benchmarks in the next sections demonstrate the improvements this optimization brings to OpenSearch binary vectors.
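The following conceptual sketch (plain Python, not the FAISS implementation) illustrates the pathlength reduction: a 512-bit chunk of a binary vector needs eight scalar popcounts, but a single vectorized popcount covers all eight 64-bit lanes at once.

```python
# Eight 64-bit words: the amount of data one 512-bit AVX-512 register holds.
words = [0x0123456789ABCDEF, 0xDEADBEEFDEADBEEF] * 4

# Scalar path: one popcount per word -> eight popcount operations.
scalar_total = sum(w.bit_count() for w in words)

# Vectorized path (conceptually what _mm512_popcnt_epi64 does): all eight words
# handled in one operation. Emulated here by packing the words into a single
# integer and counting its bits once; the real instruction returns one count per lane.
packed = int.from_bytes(b"".join(w.to_bytes(8, "little") for w in words), "little")
vector_total = packed.bit_count()

assert scalar_total == vector_total
```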
There is no special configuration required to take advantage of the optimization, because it’s enabled by default. The requirements for using the optimization are as follows (a sample index mapping follows the list):
- OpenSearch version 2.19 or later
- Intel 4th Generation Xeon or newer instances (C7i, M7i, or R7i) for data nodes
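As an illustration of a setup that benefits from this optimization, the following sketch uses the opensearch-py client to create a binary vector index on the FAISS engine with the Hamming space type. The endpoint, index name, field name, and HNSW parameters are placeholders; adjust them for your domain.

```python
from opensearchpy import OpenSearch

# Placeholder endpoint and connection settings; replace with your domain's values.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 768,
                "data_type": "binary",  # 1 bit per dimension
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "hamming",
                    "parameters": {"ef_construction": 100, "m": 16},
                },
            }
        }
    },
}

client.indices.create(index="binary-vectors", body=index_body)
```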
Where do binary vector workloads spend the bulk of their time?
To put our system through its paces, we created a test dataset of 10 million binary vectors. We chose the Hamming space for measuring distances between vectors because it’s particularly well-suited for binary data. This substantial dataset helped us generate enough stress on the system to pinpoint exactly where performance bottlenecks might occur. If you’re interested in the details, you can find the complete cluster configuration and index settings for this analysis in Appendix 2 at the end of this post.
The following profile analysis of binary vector workloads, captured as a flame graph, shows that the bulk of the time is spent in the FAISS library computing Hamming distances. We observed up to 66% of the time spent in BinaryIndices in the FAISS library.
Benchmarks and results
In the next sections, we look at the results of optimizing this logic and the benefits to OpenSearch workloads along two dimensions:
- Price-performance: with reduced CPU consumption, you might be able to reduce the number of instances in your domain
- Performance gains due to the Intel popcount instruction
Price-performance and TCO gains for OpenSearch users
If you want to take advantage of the performance gains, we recommend R7i instances, which have a high memory-to-core ratio, for your data nodes. We benchmarked a 10-million-vector and a 100-million-vector dataset and measured the resulting improvements on an R7i instance compared to an R5 instance. R5 instances support avx512 instructions, but not the advanced instructions present in avx512_spr, which are only available on R7i and newer Intel instances.
On average, we observed 20% gains in indexing throughput and up to 48% gains in search throughput when comparing R5 and R7i instances. R7i instances are about 13% more costly than R5 instances, so the price-performance favors the R7is. The 100-million-vector dataset showed slightly better results, with search throughput improving by more than 40%. We document the test configuration in Appendix 1 and present the tabular results in Appendix 3.
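As a rough illustration using the numbers above (up to 48% higher search throughput at roughly 13% higher instance price), the relative search throughput per dollar improves by about 31%:

```python
# Relative numbers taken from the benchmark summary above; actual prices vary by Region.
r5_throughput, r7i_throughput = 1.00, 1.48   # relative search throughput
r5_price, r7i_price = 1.00, 1.13             # relative instance price

gain = (r7i_throughput / r7i_price) / (r5_throughput / r5_price) - 1
print(f"~{gain * 100:.0f}% more search throughput per dollar")  # ~31%
```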
The following figures visualize the results with the 10-million-vector dataset.
The following figures visualize the results with the 100-million-vector dataset.
Performance gains due to popcount instruction in AVX-512
This section is for advanced users interested in the extent of the improvements the new avx512_spr build mode provides and in where the performance gains come from. The OpenSearch configuration used in this experiment is documented in Appendix 2.
We ran an OpenSearch benchmark on R7i instances with and without the Hamming distance optimization. You can disable avx512_spr by setting knn.faiss.avx512_spr.disabled to true in your opensearch.yml file, as described in SIMD optimization. The data shows that the feature provides a 10% throughput improvement on indexing and search and a 10% reduction in latency when the client load is constant.
The gain comes from the _mm512_popcnt_epi64 hardware instruction present on Intel processors, which reduces the pathlength for these workloads. The hotspot identified in the earlier section is optimized with code that uses the hardware instruction. Fewer CPU cycles are spent running the same workload, which translates to a 10% speed-up for binary vector indexing and a corresponding latency reduction for search workloads on OpenSearch.
The following figures visualize the benchmarking results.
Conclusion
Improving storage, memory, and compute is key to optimizing vector search. Binary vectors already offer storage and memory benefits over FP32/FP16. This post detailed how our improvements to Hamming distance calculations improve compute performance by up to 48% when comparing R5 and R7i instances on AWS. Although binary vectors fall short of matching the recall of their FP32 counterparts, techniques such as oversampling and rescoring help improve recall rates. If you’re handling massive datasets, compute costs become a major expense. We’ve shown that migrating to Intel’s R7i and newer offerings on AWS can substantially reduce infrastructure costs, making these processors a highly efficient choice for users.
Hamming distance with the newer AVX-512 instructions is available in OpenSearch 2.19 and later. We encourage you to give it a try on the latest Intel instances in your preferred cloud environment.
The new instructions also provide additional opportunities to use hardware acceleration in other areas of vector search, such as FP16 and BF16 quantization. We are also interested in exploring other hardware accelerators for vector search, such as AMX and AVX-10.
About the Authors
Akash Shankaran is a Software Architect and Tech Lead in the Xeon software team at Intel. He works on pathfinding opportunities and enabling optimizations on OpenSearch.
Mulugeta Mammo is a Senior Software Engineer and currently leads the OpenSearch Optimization team at Intel.
Noah Staveley is a Cloud Development Engineer currently working in the OpenSearch Optimization team at Intel.
Assane Diop is a Cloud Development Engineer, and currently works in the OpenSearch Optimization team at Intel.
Naveen Tatikonda is a software engineer at AWS, working on the OpenSearch Project and Amazon OpenSearch Service. His interests include distributed systems and vector search.
Vamshi Vijay Nakkirtha is a software engineering manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems.
Dylan Tong is a Senior Product Manager at Amazon Web Services. He leads the product initiatives for AI and machine learning (ML) on OpenSearch including OpenSearch’s vector database capabilities. Dylan has decades of experience working directly with customers and creating products and solutions in the database, analytics and AI/ML domain. Dylan holds a BSc and MEng degree in Computer Science from Cornell University.
Notices and disclaimers
Intel and the OpenSearch team collaborated on adding the Hamming distance feature. Intel contributed by designing and implementing the feature, and Amazon contributed by updating the toolchain, including compilers, release management, and documentation. Both teams collected data points showcased in the post.
Performance varies by use, configuration, and other factors. Learn more on the Performance Index website.
Your costs and results may vary.
Intel technologies might require enabled hardware, software, or service activation.
Appendix 1
The following table summarizes the test configuration for results in Appendix 3.
| Parameter | avx512 | avx512_spr |
| --- | --- | --- |
| Vector dimension | 768 | 768 |
| ef_construction | 100 | 100 |
| ef_search | 100 | 100 |
| Primary shards | 8 | 8 |
| Replicas | 1 | 1 |
| Data nodes | 2 | 2 |
| Data node instance type | R5.4xl | R7i.4xl |
| vCPUs per data node | 16 | 16 |
| Cluster manager nodes | 3 | 3 |
| Cluster manager node instance type | c5.xl | c5.xl |
| Data type | binary | binary |
| Space type | Hamming | Hamming |
Appendix 2
The following table summarizes the OpenSearch configuration used for this benchmarking.
| Parameter | Value (avx512 and avx512_spr runs) |
| --- | --- |
| OpenSearch version | 2.19 |
| Engine | faiss |
| Dataset | random-768-10M |
| Vector dimension | 768 |
| ef_construction | 256 |
| ef_search | 256 |
| Primary shards | 4 |
| Replicas | 1 |
| Data nodes | 2 |
| Cluster manager nodes | 1 |
| Data node instance type | R7i.2xl |
| Client instance | m6id.16xlarge |
| Data type | binary |
| Space type | Hamming |
| Indexing clients | 20 |
| Query clients | 20 |
| Force merge segments | 1 |
Appendix 3
This appendix contains the results of the 10-million-vector and 100-million-vector dataset runs.
The following table summarizes the query results in queries per second (QPS).
| Dataset | Dimension | Build mode | Query clients | Mean throughput, no force merge (QPS) | Median throughput, no force merge (QPS) | Mean throughput, force merge to 1 segment (QPS) | Median throughput, force merge to 1 segment (QPS) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| random-768-10M | 768 | avx512 | 10 | 397.00 | 398.00 | 1321.00 | 1319.00 |
| random-768-10M | 768 | avx512_spr | 10 | 516.00 | 525.00 | 1542.00 | 1544.00 |
| %gain | – | – | – | 29.97 | 31.91 | 16.73 | 17.06 |
| random-768-10M | 768 | avx512 | 20 | 424.00 | 426.00 | 1849.00 | 1853.00 |
| random-768-10M | 768 | avx512_spr | 20 | 597.00 | 600.00 | 2127.00 | 2127.00 |
| %gain | – | – | – | 40.81 | 40.85 | 15.04 | 14.79 |
| random-768-100M | 768 | avx512 | 10 | 219 | 220 | 668 | 668 |
| random-768-100M | 768 | avx512_spr | 10 | 324 | 324 | 879 | 887 |
| %gain | – | – | – | 47.95 | 47.27 | 31.59 | 32.78 |
| random-768-100M | 768 | avx512 | 20 | 234 | 235 | 756 | 757 |
| random-768-100M | 768 | avx512_spr | 20 | 338 | 339 | 1054 | 1062 |
| %gain | – | – | – | 44.44 | 44.26 | 39.42 | 40.29 |
The following table summarizes the indexing results.
| Dataset | Dimension | Build mode | Indexing clients | Mean throughput (documents/second) | Median throughput (documents/second) | Force merge (minutes) |
| --- | --- | --- | --- | --- | --- | --- |
| random-768-10M | 768 | avx512 | 20 | 58729 | 57135 | 61 |
| random-768-10M | 768 | avx512_spr | 20 | 63595 | 65240 | 57 |
| %gain | – | – | – | 8.29 | 14.19 | 7.02 |
| random-768-100M | 768 | avx512 | 16 | 28006 | 25381 | 682 |
| random-768-100M | 768 | avx512_spr | 16 | 33477 | 30581 | 634 |
| %gain | – | – | – | 19.54 | 20.49 | 7.04 |