
Overview
The NVIDIA NeMo Retriever Llama3.2 embedding model is optimized for multilingual and cross-lingual text question-answering retrieval with support for long documents (up to 8192 tokens) and dynamic embedding size (Matryoshka Embeddings). This model was evaluated on 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish.
In addition to enabling multilingual and cross-lingual question-answering retrieval, this model reduces the data storage footprint by up to 35x through dynamic embedding sizing and support for longer token lengths, making it practical to handle large-scale datasets efficiently.
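As a rough illustration of how Matryoshka-style dynamic embedding sizing works, the sketch below truncates an embedding to its leading dimensions and re-normalizes it; the 2048-dimension full size and 384-dimension truncation are assumptions for illustration, not published parameters of this model.

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the leading `dim` Matryoshka dimensions and re-normalize to unit length."""
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

# Hypothetical unit-norm embedding; a 2048-dim full size is assumed for illustration.
rng = np.random.default_rng(0)
full = rng.normal(size=2048)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 384)
print(small.shape)  # (384,) -- roughly 5x fewer floats per vector to store
```

Storing the shorter vectors trades a small amount of retrieval accuracy for a proportional reduction in index size, which is how dynamic embedding sizing shrinks the storage footprint.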
For additional information, please contact NVIDIA: https://www.nvidia.com/en-us/data-center/lp/aws-marketplace-offer
This model is ready for commercial use.
Highlights
- The NeMo Retriever Llama3.2 embedding model is most suitable for users who want to build a multilingual question-and-answer application over a large text corpus, leveraging the latest dense retrieval technologies.
- NVIDIA NIM, a part of the [NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/) software platform available on the [AWS Marketplace](https://aws.amazon.com/marketplace/pp/prodview-ozgjkov6vq3l6), is a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inferencing.
Details
Pricing
| Dimension | Description | Cost/host/hour |
|---|---|---|
| ml.g5.2xlarge Inference (Batch), recommended | Model inference on the ml.g5.2xlarge instance type, batch mode | $1.00 |
| ml.g5.2xlarge Inference (Real-Time), recommended | Model inference on the ml.g5.2xlarge instance type, real-time mode | $1.00 |
| ml.g5.xlarge Inference (Batch) | Model inference on the ml.g5.xlarge instance type, batch mode | $1.00 |
| ml.g5.12xlarge Inference (Batch) | Model inference on the ml.g5.12xlarge instance type, batch mode | $4.00 |
| ml.g5.8xlarge Inference (Batch) | Model inference on the ml.g5.8xlarge instance type, batch mode | $1.00 |
| ml.g5.4xlarge Inference (Batch) | Model inference on the ml.g5.4xlarge instance type, batch mode | $1.00 |
| ml.g5.48xlarge Inference (Batch) | Model inference on the ml.g5.48xlarge instance type, batch mode | $8.00 |
| ml.g5.16xlarge Inference (Batch) | Model inference on the ml.g5.16xlarge instance type, batch mode | $1.00 |
| ml.g5.24xlarge Inference (Batch) | Model inference on the ml.g5.24xlarge instance type, batch mode | $4.00 |
| ml.g6.16xlarge Inference (Real-Time) | Model inference on the ml.g6.16xlarge instance type, real-time mode | $1.00 |
Vendor refund policy
No refund
Delivery details
Amazon SageMaker model
An Amazon SageMaker model package is a pre-trained machine learning model ready to use without additional training. Use the model package to create a model on Amazon SageMaker for real-time inference or batch processing. Amazon SageMaker is a fully managed platform for building, training, and deploying machine learning models at scale.
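As a minimal sketch of that workflow using the SageMaker Python SDK: the IAM role, model package ARN, and endpoint name below are placeholders to replace with the values from your Marketplace subscription.

```python
import sagemaker
from sagemaker import ModelPackage

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder IAM role

# Placeholder ARN; use the model package ARN from your subscription.
model_package_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/example"

model = ModelPackage(
    role=role,
    model_package_arn=model_package_arn,
    sagemaker_session=session,
)

# ml.g5.2xlarge is the recommended instance type per the pricing table above.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="nv-embedqa-1b-v2",  # hypothetical endpoint name
)
```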
Version release notes
Added the NIM_SERVED_MODEL_NAME environment variable. Updated the LangChain Playbook to use the Llama-3.2-NV-EmbedQA-1B-v2 NIM.
Additional details
Inputs
- Summary
The model accepts a JSON request that specifies the input text to embed. For example:
{ "input": ["Hello world"], "model": "nvidia/llama-3.2-nv-embedqa-1b-v2", "input_type": "query" }
- Input MIME type
- application/json
Input data descriptions
The following table describes supported input data fields for real-time inference and batch transform.
| Field name | Description | Constraints | Required |
|---|---|---|---|
| input | Input text to embed. Max length is 8192 tokens. | Default value: ""; Type: FreeText | No |
| model | ID of the embedding model. | Type: FreeText | Yes |
| input_type | passage is used when generating embeddings during indexing; query is used when generating embeddings during querying. | Default value: passage; Type: Categorical; Allowed values: passage, query | No |
| encoding_format | The format in which to return the embeddings. | Default value: float; Type: Categorical; Allowed values: float, base64 | No |
| truncate | Specifies how inputs longer than the maximum token length of the model are handled. | Default value: NONE; Type: Categorical; Allowed values: NONE, START, END | No |
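For reference, here is a sketch of a real-time request with boto3 using the fields above. The endpoint name is a placeholder, and the response parsing assumes an OpenAI-style embeddings shape; check your deployed endpoint's actual response format.

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "input": ["What is dense retrieval?"],
    "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    "input_type": "query",  # use "passage" when embedding documents for indexing
    "truncate": "END",      # drop tokens past the model's maximum length
}

response = runtime.invoke_endpoint(
    EndpointName="nv-embedqa-1b-v2",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)

# Response shape assumed to follow the OpenAI embeddings convention.
result = json.loads(response["Body"].read())
embedding = result["data"][0]["embedding"]
print(len(embedding))
```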
Support
Vendor support
Free support via NVIDIA NIM Developer Forum: https://forums.developer.nvidia.com/c/ai-data-science/nvidia-nim/
Global enterprise support is available with an NVIDIA AI Enterprise subscription.
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.