
Overview
The Cohere Rerank v3.5 endpoint helps businesses improve search and retrieval-augmented generation (RAG) systems. The model takes a query and a list of potentially relevant documents, and returns the documents sorted by semantic similarity to the query. As a cross-encoding AI model, it is able to understand the meaning behind enterprise data and user questions. The model can be implemented with just a few lines of code and delivers leading performance across more than 100 languages. Rerank is uniquely capable of understanding complex information that requires reasoning. Rerank v3.5 can be added to existing systems to improve performance. Please note that, as of July 2025, the minimum requirements to deploy this model are NVIDIA driver version 535 and CUDA version 12.2.
Highlights
- Rerank v3.5 is uniquely capable of understanding complex documents and queries. This leads to more accurate search results when user questions have multiple aspects and require reasoning. Rerank v3.5 also offers strong performance on semi-structured data such as code, tables, and JSON documents. These attributes make the model ideal for global organizations in industries such as Finance, Healthcare, Energy, Government, and Manufacturing.
- Rerank v3.5 can be added to existing search and retrieval-augmented generation (RAG) systems with just a few lines of code. This ease of implementation makes it simple to boost semantic understanding and improve search results. Rerank v3.5 is also highly efficient in terms of throughput and can satisfy the demanding requirements of large organizations.
- Rerank v3.5 offers leading multilingual performance in over 100 languages, including but not limited to: Arabic, Chinese, English, French, German, Hindi, Japanese, Korean, Portuguese, Russian, and Spanish. This is useful for global organizations that operate across various languages and require a performant AI model to improve their search systems.
Details
Pricing
Free trial
| Dimension | Description | Cost/host/hour |
|---|---|---|
| ml.g5.2xlarge Inference (Batch) (Recommended) | Model inference on the ml.g5.2xlarge instance type, batch mode | $3.50 |
| ml.g5.2xlarge Inference (Real-Time) (Recommended) | Model inference on the ml.g5.2xlarge instance type, real-time mode | $3.50 |
| ml.g6.2xlarge Inference (Real-Time) | Model inference on the ml.g6.2xlarge instance type, real-time mode | $3.50 |
| ml.g5.xlarge Inference (Real-Time) | Model inference on the ml.g5.xlarge instance type, real-time mode | $3.50 |
| ml.g6.xlarge Inference (Real-Time) | Model inference on the ml.g6.xlarge instance type, real-time mode | $3.50 |
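As a rough illustration of how the per-host hourly rate in the table adds up, the sketch below multiplies hosts, hours, and the listed $3.50 rate. The function name `estimate_cost` and the example of two hosts running for 30 days are assumptions made for illustration; any charges not shown in the table are outside this sketch.

```python
HOURLY_RATE_USD = 3.50  # rate listed for every instance type in the table above


def estimate_cost(hosts, hours, rate=HOURLY_RATE_USD):
    """Return hosts * hours * rate, using only the listed per-host hourly rate."""
    return hosts * hours * rate


# Hypothetical scenario: two hosts running continuously for 30 days.
monthly_cost = estimate_cost(hosts=2, hours=24 * 30)
print(f"${monthly_cost:,.2f}")
```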
Vendor refund policy
No refunds.
Delivery details
Amazon SageMaker model
An Amazon SageMaker model package is a pre-trained machine learning model ready to use without additional training. Use the model package to create a model on Amazon SageMaker for real-time inference or batch processing. Amazon SageMaker is a fully managed platform for building, training, and deploying machine learning models at scale.
Version release notes
A key feature update adjusts the default maximum token limit for per-model reranking to balance performance and resource use, with customization options available via API or configuration. Critical bug fixes resolve the "Empty EncodedTexts" issue in Rerank and Embed endpoints by improving chunking logic for oversized inputs and adding safeguards to ensure valid outputs.
Additional details
Inputs
- Summary
The model accepts JSON requests that specify the input texts to be reranked. The maximum number of documents that can be passed in a single rerank call is 1000. Note: The documentation below is for Version 2 of the Rerank API.
Request:
{ "model": "...", "query": "...?", "documents": ["", ...], "max_tokens_per_doc": 1, "top_n": 100 }
Response:
{ "results": [ { "index": 0, "relevance_score": 0.0048297215 } ], ... }
- Input MIME type
- application/json
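The request and response shapes in the summary can be exercised with a short sketch like the one below. It is not an official client: the helper names `build_rerank_request` and `rank_documents` are hypothetical, the `model` field is left out because its value is elided in the sample, and the response is a hand-written stand-in shaped like the sample rather than real endpoint output.

```python
import json


def build_rerank_request(query, documents, top_n=None, max_tokens_per_doc=None):
    """Serialize a rerank request using the fields shown in the request sample."""
    body = {"query": query, "documents": list(documents)}
    # Optional fields are omitted when unset so the endpoint applies its defaults.
    if top_n is not None:
        body["top_n"] = top_n
    if max_tokens_per_doc is not None:
        body["max_tokens_per_doc"] = max_tokens_per_doc
    return json.dumps(body)


def rank_documents(documents, response_body):
    """Pair each result with its original document, highest score first."""
    results = json.loads(response_body)["results"]
    ordered = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [(documents[r["index"]], r["relevance_score"]) for r in ordered]


docs = ["Reset a password from the login page.", "Quarterly revenue grew 4%."]
request_body = build_rerank_request("How do I reset my password?", docs, top_n=2)

# Hand-written stand-in for an endpoint response (illustrative scores only).
sample_response = json.dumps({"results": [
    {"index": 1, "relevance_score": 0.01},
    {"index": 0, "relevance_score": 0.93},
]})
ranked = rank_documents(docs, sample_response)
print(ranked[0][0])
```

The serialized string would be sent as the request body with the `application/json` content type, for example through the `invoke_endpoint` call of the boto3 `sagemaker-runtime` client against a deployed endpoint.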
Input data descriptions
The following table describes supported input data fields for real-time inference and batch transform.
| Field name | Description | Constraints | Required |
|---|---|---|---|
| query | The search query. Queries longer than 2000 tokens get automatically truncated. | Type: FreeText | Yes |
| documents | A list of texts that will be compared to the `query`. For optimal performance we recommend against sending more than 1,000 documents in a single request. **Note**: long documents will automatically be truncated to the value of `max_tokens_per_doc`. **Note**: structured data should be formatted as YAML strings for best performance. | Type: FreeText | No |
| top_n | Limits the number of returned rerank results to the specified value. If not passed, all the rerank results will be returned. | Default value: [] Type: Integer Minimum: 1 | No |
| max_tokens_per_doc | Defaults to 4096. Long documents will be automatically truncated to the specified number of tokens. Compatibility: `max_tokens_per_doc` is a parameter introduced in Rerank API Version 2 (`"api_version": 2`). | Default value: 4096 Type: Integer Minimum: 1 Maximum: 40000 | No |
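The constraints in the table, together with the note about YAML strings, could be applied on the client side before a request is sent. The sketch below is one possible approach, not part of the Rerank API: `record_to_yaml` and `check_rerank_params` are hypothetical helpers, and the YAML rendering only handles flat records with simple keys (it relies on JSON scalars also being valid YAML scalars). Nested data would call for a proper YAML library.

```python
import json

MAX_DOCUMENTS = 1000              # per-call document limit from the summary
MAX_TOKENS_PER_DOC_LIMIT = 40000  # "Maximum" listed for max_tokens_per_doc


def record_to_yaml(record):
    """Render a flat dict as 'key: value' lines, one field per line."""
    return "\n".join(f"{key}: {json.dumps(value)}" for key, value in record.items())


def check_rerank_params(documents, top_n=None, max_tokens_per_doc=None):
    """Return a list of constraint violations; an empty list means none found."""
    problems = []
    if len(documents) > MAX_DOCUMENTS:
        problems.append(f"more than {MAX_DOCUMENTS} documents")
    if top_n is not None and top_n < 1:
        problems.append("top_n must be at least 1")
    if max_tokens_per_doc is not None and not (
        1 <= max_tokens_per_doc <= MAX_TOKENS_PER_DOC_LIMIT
    ):
        problems.append(
            f"max_tokens_per_doc must be between 1 and {MAX_TOKENS_PER_DOC_LIMIT}"
        )
    return problems


# Made-up structured rows, converted to YAML strings before reranking.
rows = [{"ticker": "ACME", "quarter": "Q3", "revenue_musd": 12.5}]
documents = [record_to_yaml(row) for row in rows]
print(documents[0])
print(check_rerank_params(documents, top_n=0, max_tokens_per_doc=50000))
```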
Resources
Vendor resources
Support
Vendor support
Contact us at support+aws@cohere.com or at https://cohere.com/contact-sales.
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.
Customer reviews
RAG assistant workflows have improved answer relevance and now deliver faster, accurate decisions
What is our primary use case?
My main use case for Cohere Rerank v3.5 is our enterprise RAG-based AI assistant solution hosted on AWS, which is integrated with Amazon Bedrock, OpenSearch, and vector embeddings. We used Cohere Rerank v3.5 to improve document retrieval relevance for internal enterprise workflows such as access management, ticketing, and knowledge retrieval.
A specific example from one of those workflows: we often ran into bugs when asking questions because the knowledge base (KB) search was making multiple calls. To avoid that, we decided to use a reranking model to reduce the number of KB calls and ensure we get the proper KBs on the first attempt.
What is most valuable?
The best features Cohere Rerank v3.5 offers are high-quality reranking relevance, better contextual retrieval, easy AWS integration, fast response time, and strong multilingual support.
The easy AWS integration and fast response time help my team because the whole solution is on AWS, and AWS already offers Cohere as a provider with reranking available. We just have to call the ARN of the reranking model, and it integrates easily into the KB search call.
The most valuable feature is the reranking quality. After we introduced Cohere Rerank v3.5 into our pipeline, the relevance of the retrieved chunks improved significantly, which directly reduced hallucinated responses from the downstream LLMs. The latency was also good, making it acceptable for an enterprise-grade application.
Cohere Rerank v3.5 has positively impacted my organization by improving answer accuracy in our AI assistant workflows and reducing irrelevant retrieval results. This improved end-user trust in the system and helped move some proof-of-concept implementations closer to production readiness.
We have seen significant outcomes: answer quality has improved since we implemented reranking.
What needs improvement?
Latency can certainly be improved. It is currently around one to one and a half seconds, and it would be great if that could be reduced substantially. Other than that, I have not seen much room for improvement, because the version I am using is already much improved.
Better native observability could be implemented, and price transparency is lacking on AWS. AWS-native analytics would also be helpful for developers, and if fine-tuning were available for these reranking models, it would give us much better control over the model, in my opinion.
I do not think there are any other improvements for Cohere Rerank v3.5 beyond the latency, observability, and pricing points already mentioned.
For how long have I used the solution?
I have been using Cohere Rerank v3.5 for the last one and a half years.
What do I think about the stability of the solution?
Cohere Rerank v3.5 is quite stable and fast.
What do I think about the scalability of the solution?
Cohere Rerank v3.5 scales on demand, as AWS already supports scaling for enterprise-specific needs. With one hundred users, fewer resources are used; with one thousand to ten thousand users, the load scales accordingly. We have not seen any performance degradation as user numbers increase, and scaling happens quickly.
How are customer service and support?
I have not yet contacted customer support for Cohere Rerank v3.5 because we have not needed to. In terms of stability and scalability, the solution was stable during testing on enterprise workloads and handled large document retrieval scenarios with acceptable performance degradation. The API integration through AWS was straightforward and reliable, as everything was covered in the AWS documentation, which was quite good.
Which solution did I use previously and why did I switch?
We were not initially using any reranking models, but after switching to reranking models, the performance and answer quality have improved significantly.
How was the initial setup?
The setup process is very easy because the AWS documentation already explains how to use the inference model, which I think is very good to have.
What was our ROI?
I have seen a return on investment: significant time has been saved, the end-user experience has improved considerably, and end users are impressed with the quality of the responses we provide.
Which other solutions did I evaluate?
Before choosing Cohere Rerank v3.5, I evaluated other options, including AWS Titan, which provides embedding retrieval, as well as OpenSearch k-NN and some open-source reranking models from Hugging Face. We found that Cohere performs much better than those options.
What other advice do I have?
I would suggest to any developer who wants to increase their RAG response quality to look into Cohere Rerank v3.5 for a significant improvement in response quality. I gave this product a rating of nine out of ten.