Overview
Llama 3.3 Nemotron Super 49B V1.5 offers a strong tradeoff between model accuracy and efficiency, and efficiency (throughput) translates directly into cost savings. Using a novel Neural Architecture Search (NAS) approach, we greatly reduce the model's memory footprint, enabling larger workloads and allowing the model to run on a single GPU (H200) at high load. This NAS approach makes it possible to select a desired point on the accuracy-efficiency tradeoff curve.
The model underwent a multiphase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Science, and Tool Calling. Additionally, the model went through multiple stages of Reinforcement Learning (RL) including Reward-aware Preference Optimization (RPO) for chat, Reinforcement Learning with Verifiable Rewards (RLVR) for reasoning, and iterative Direct Preference Optimization (DPO) for Tool Calling capability enhancements. The final checkpoint was achieved after merging several RL and DPO checkpoints.
This model is part of the Llama Nemotron Collection. The other models in this family are:
- Llama-3.1-Nemotron-Nano-4B-v1.1
- Llama-3.1-Nemotron-Ultra-253B-v1

This model is ready for commercial use.
Highlights
- **Architecture Type:** Dense decoder-only Transformer model
- **Network Architecture:** Llama 3.3 70B Instruct, customized through Neural Architecture Search (NAS)
- The model is a derivative of Meta's Llama-3.3-70B-Instruct, produced via NAS. The NAS algorithm results in non-standard and non-repetitive blocks, including skip attention: in some blocks, the attention sub-block is skipped entirely or replaced with a single linear layer.
- We utilize block-wise distillation of the reference model: for each block, we create multiple variants offering different tradeoffs of quality vs. computational complexity, discussed in more depth below. We then search over these blocks to assemble a model that meets the required throughput and memory budget (optimized for a single H100-80GB GPU) while minimizing quality degradation. The model then undergoes knowledge distillation (KD), with a focus on English single- and multi-turn chat.
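To make the skip-attention idea concrete, the toy sketch below (with made-up dimensions and random weights; not the model's actual architecture or code) shows the three kinds of attention sub-block variants the NAS search could choose between for a given block:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size, far smaller than the real model's

Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W_linear = rng.standard_normal((d, d)) * 0.1

def attention(x):
    # Minimal single-head self-attention (no masking, no multi-head split).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Hypothetical block variants, in decreasing order of compute cost:
variants = {
    "full_attention": attention,               # keep the attention sub-block
    "linear_replace": lambda x: x @ W_linear,  # replace attention with one linear layer
    "skip": lambda x: x,                       # skip the attention sub-block entirely
}

x = rng.standard_normal((4, d))  # 4 tokens of hidden size d
outputs = {name: f(x) for name, f in variants.items()}
for name, y in outputs.items():
    print(name, y.shape)
```

All variants preserve the activation shape, which is what lets the search swap them per block while trading quality against compute.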
Details
Pricing
Free trial
| Dimension | Description | Cost/host/hour |
|---|---|---|
| ml.g5.48xlarge Inference (Batch), Recommended | Model inference on the ml.g5.48xlarge instance type, batch mode | $1.00 |
| ml.g6e.24xlarge Inference (Real-Time), Recommended | Model inference on the ml.g6e.24xlarge instance type, real-time mode | $1.00 |
| ml.g5.12xlarge Inference (Batch) | Model inference on the ml.g5.12xlarge instance type, batch mode | $1.00 |
| ml.g5.24xlarge Inference (Batch) | Model inference on the ml.g5.24xlarge instance type, batch mode | $1.00 |
| ml.g5.48xlarge Inference (Real-Time) | Model inference on the ml.g5.48xlarge instance type, real-time mode | $1.00 |
| ml.g6e.12xlarge Inference (Real-Time) | Model inference on the ml.g6e.12xlarge instance type, real-time mode | $1.00 |
| ml.g6e.48xlarge Inference (Real-Time) | Model inference on the ml.g6e.48xlarge instance type, real-time mode | $1.00 |
| ml.p4d.24xlarge Inference (Real-Time) | Model inference on the ml.p4d.24xlarge instance type, real-time mode | $1.00 |
| ml.p4de.24xlarge Inference (Real-Time) | Model inference on the ml.p4de.24xlarge instance type, real-time mode | $1.00 |
| ml.p5.48xlarge Inference (Real-Time) | Model inference on the ml.p5.48xlarge instance type, real-time mode | $1.00 |
Vendor refund policy
No refund.
Legal
Vendor terms and conditions
Content disclaimer
Delivery details
Amazon SageMaker model
An Amazon SageMaker model package is a pre-trained machine learning model ready to use without additional training. Use the model package to create a model on Amazon SageMaker for real-time inference or batch processing. Amazon SageMaker is a fully managed platform for building, training, and deploying machine learning models at scale.
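As a sketch of how a model package becomes a deployable model, the helper below builds the request dictionary for SageMaker's `create_model` API from a model-package ARN. The ARNs and names here are placeholders, not identifiers from this listing; the subscribed package's real ARN comes from the AWS Marketplace console.

```python
def make_create_model_request(model_name, model_package_arn, execution_role_arn):
    """Build the request dict for sagemaker.create_model() when deploying
    a Marketplace model package (no custom container image is needed)."""
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": execution_role_arn,
        "PrimaryContainer": {"ModelPackageName": model_package_arn},
        # Marketplace model packages typically require network isolation.
        "EnableNetworkIsolation": True,
    }

req = make_create_model_request(
    "nemotron-super-49b",                                               # placeholder name
    "arn:aws:sagemaker:us-east-1:123456789012:model-package/example",   # placeholder ARN
    "arn:aws:iam::123456789012:role/ExampleSageMakerRole",              # placeholder role
)
# With boto3 this dict would be passed as:
#   boto3.client("sagemaker").create_model(**req)
print(req["PrimaryContainer"]["ModelPackageName"])
```

From there, the model is attached to an endpoint configuration for real-time inference or referenced by a batch transform job.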
Version release notes
Initial release of v1.5.
Additional details
Inputs
- Summary
The model exposes /invocations and /ping APIs and accepts JSON requests whose parameters control the generated text. See the examples and field descriptions below.
- Input MIME type
- application/json
Input data descriptions
The following table describes supported input data fields for real-time inference and batch transform.
| Field name | Description | Constraints | Required |
|---|---|---|---|
| model | Name of the model: nvidia/llama-3_3-nemotron-super-49b-v1_5 | Type: FreeText | No |
| messages.role | Role of the entity in the conversation | Type: Categorical. Allowed values: system, user, assistant | Yes |
| max_tokens | Maximum number of tokens that can be generated in the model's response | Integer, 1 to 65536. Defaults to 65536 | No |
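The fields above can be assembled into an /invocations request body as sketched below. The message contents, endpoint name, and `max_tokens` value of 1024 are arbitrary example values, not defaults from this listing.

```python
import json

# Example /invocations request body using the documented fields.
payload = {
    "model": "nvidia/llama-3_3-nemotron-super-49b-v1_5",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about GPUs."},
    ],
    "max_tokens": 1024,  # example value; must be between 1 and 65536
}
body = json.dumps(payload)  # sent with Content-Type: application/json

# With boto3 this would be sent to a deployed endpoint as:
#   boto3.client("sagemaker-runtime").invoke_endpoint(
#       EndpointName="my-endpoint",          # placeholder
#       ContentType="application/json",
#       Body=body,
#   )
print(json.loads(body)["messages"][0]["role"])
```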
Resources
Support
Vendor support
Free support is available via the NVIDIA NIM Developer Forum.
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.