Overview
Nemotron-3-Super-120B-A12B-FP8 is a large language model (LLM) trained by NVIDIA, designed to deliver strong agentic, reasoning, and conversational capabilities. It is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Like other models in the family, it responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be configured through a flag in the chat template.
The model employs a hybrid Latent Mixture-of-Experts (LatentMoE) architecture, utilizing interleaved Mamba-2 and MoE layers, along with select Attention layers. Distinct from the Nano model, the Super model incorporates Multi-Token Prediction (MTP) layers for faster text generation and improved quality, and it is trained using NVFP4 quantization to maximize compute efficiency. The model has 12B active parameters and 120B parameters in total.
Supported languages: English, French, German, Italian, Japanese, Spanish, and Chinese.
This model is ready for commercial use.
Highlights
- Architecture Type: Mamba2-Transformer Hybrid Latent Mixture of Experts (LatentMoE) with Multi-Token Prediction (MTP)
- Network Architecture: Nemotron Hybrid LatentMoE
- Number of model parameters: 120B Total / 12B Active
Details
Pricing
| Dimension | Description | Cost/host/hour |
|---|---|---|
| ml.g5.48xlarge Inference (Batch), Recommended | Model inference on the ml.g5.48xlarge instance type, batch mode | $8.00 |
| ml.p5.48xlarge Inference (Real-Time), Recommended | Model inference on the ml.p5.48xlarge instance type, real-time mode | $8.00 |
| ml.p4de.24xlarge Inference (Real-Time) | Model inference on the ml.p4de.24xlarge instance type, real-time mode | $8.00 |
| ml.p5e.48xlarge Inference (Real-Time) | Model inference on the ml.p5e.48xlarge instance type, real-time mode | $8.00 |
| ml.p5en.48xlarge Inference (Real-Time) | Model inference on the ml.p5en.48xlarge instance type, real-time mode | $8.00 |
Vendor refund policy
No Refunds.
Delivery details
Amazon SageMaker model
An Amazon SageMaker model package is a pre-trained machine learning model ready to use without additional training. Use the model package to create a model on Amazon SageMaker for real-time inference or batch processing. Amazon SageMaker is a fully managed platform for building, training, and deploying machine learning models at scale.
Version release notes
Additional details
Inputs
- Summary
Input Type(s): Text
Input Format(s): String
Input Parameters: One-Dimensional (1D): Sequences
NVIDIA Nemotron-3-Super-120B-A12B accepts JSON requests via the `/invocations` API. A request contains a list of messages and optional generation controls. Reasoning behavior is controlled via the `chat_template_kwargs` parameter, which supports reasoning on, reasoning off, and low-effort reasoning modes. Both streaming (`invoke_endpoint_with_response_stream`) and non-streaming (`invoke_endpoint`) invocation are supported.
- Limitations for input type
- Other Properties Related to Input: Maximum context length up to 1M tokens. Supported languages include: English, French, German, Italian, Japanese, Spanish, and Chinese
- Input MIME type
- application/json
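The request format described above can be sketched as a small helper that builds the `/invocations` JSON payload. This is a minimal sketch, not official sample code: the endpoint name is a placeholder, and placing `chat_template_kwargs` at the top level of the raw JSON body (rather than under an OpenAI-client `extra_body` wrapper) is an assumption to verify against your deployment.

```python
import json

# Hypothetical endpoint name -- substitute your deployed SageMaker endpoint.
ENDPOINT_NAME = "nemotron-3-super-120b-a12b"


def build_request(user_prompt, enable_thinking=True, max_tokens=1024):
    """Build an /invocations JSON payload using the fields from this listing.

    temperature=1.0 and top_p=0.95 are the values recommended above.
    """
    return {
        "model": "nvidia/nemotron-3-super-120b-a12b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": max_tokens,
        # Reasoning-trace toggle; assumed top-level placement in a raw
        # JSON body (the listing documents it via the OpenAI-client path
        # extra_body.chat_template_kwargs.enable_thinking).
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


# Non-streaming invocation (requires AWS credentials and a live endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName=ENDPOINT_NAME,
#     ContentType="application/json",
#     Body=json.dumps(build_request("Summarize this IT ticket: ...")),
# )
# print(json.loads(response["Body"].read()))
```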
Input data descriptions
The following table describes supported input data fields for real-time inference and batch transform.
| Field name | Description | Constraints | Required |
|---|---|---|---|
| model | Model identifier; must be `nvidia/nemotron-3-super-120b-a12b`. | - | Yes |
| messages | Array of message objects with `role` (system/user/assistant) and `content` (string). | - | Yes |
| temperature | Optional. `temperature=1.0` is recommended for all tasks. | - | No |
| top_p | Optional. `top_p=0.95` is recommended for all tasks. | - | No |
| max_tokens | Optional. Maximum number of tokens to generate. | - | No |
| extra_body.chat_template_kwargs.enable_thinking | Optional. Enables the internal reasoning trace before the final response. | - | No |
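For streaming via `invoke_endpoint_with_response_stream`, the response arrives as SageMaker `PayloadPart` byte chunks. The sketch below assumes the container emits OpenAI-style server-sent events (`data: {...}` lines with `choices[0].delta.content`); this stream format is an assumption to verify against your deployment, and the endpoint name is a placeholder.

```python
import json


def extract_stream_text(payload_parts):
    """Assemble streamed text from SageMaker PayloadPart byte chunks.

    Assumes OpenAI-style server-sent events ("data: {...}" lines,
    terminated by "data: [DONE]") -- verify against the actual stream
    format of your deployment before relying on this.
    """
    buffer = b"".join(payload_parts).decode("utf-8")
    pieces = []
    for line in buffer.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk.get("choices", [{}])[0].get("delta", {})
        pieces.append(delta.get("content") or "")
    return "".join(pieces)


# Streaming invocation sketch (requires AWS credentials and a live endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# stream = runtime.invoke_endpoint_with_response_stream(
#     EndpointName="nemotron-3-super-120b-a12b",  # hypothetical name
#     ContentType="application/json",
#     Body=json.dumps(request_payload),  # JSON body per the table above
# )
# parts = [event["PayloadPart"]["Bytes"] for event in stream["Body"]]
# print(extract_stream_text(parts))
```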
Resources
Support
Vendor support
Free support is available via the NVIDIA NIM Developer Forum.
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.