Nemotron 3 Super 120B-A12B
NVIDIA's large hybrid Mixture-of-Experts reasoning model — 120B total / 12B active — NVFP4 for Blackwell/Thor.
Serve the model
Start server
Choose module, then engine and optional parameters on the left, then copy the serve command by clicking the button on the right.
Call the model over Web API
Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.
curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
llama.cpp server (OpenAI-compatible API)
After llama-server is running with --network host, call it from another machine on the LAN by setting ${JETSON_HOST} to the Jetson's hostname or IP address. The default port is usually 8080 unless you set --port.
curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my_model",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Benchmark
Nemotron 3 Super 120B-A12B · vLLM · NVFP4 · ISL 2048 / OSL 128
ISL = input sequence length and OSL = output sequence length, both in tokens; C = concurrent requests. Results will vary with container image, clocks, and workload.
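To make the C and OSL settings concrete, here is a minimal load-generation sketch in Python. It assumes the OpenAI-compatible `/v1/chat/completions` route used in the curl examples and assumes the server reports `usage.completion_tokens` in its response. The helper names (`build_request`, `send`, `run`) are hypothetical, and this is not the harness that produced the numbers on this page.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"  # model id from the curl example


def build_request(base_url, prompt, max_tokens):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # caps the output sequence length (OSL)
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send one request and return the generated-token count it reports.

    Assumption: the response carries a usage.completion_tokens field.
    """
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["usage"]["completion_tokens"]


def output_tokens_per_second(token_counts, elapsed_seconds):
    """Aggregate output throughput across all concurrent requests."""
    return sum(token_counts) / elapsed_seconds


def run(base_url, prompt, concurrency, max_tokens=128):
    """Fire `concurrency` identical requests at once and time the whole batch."""
    requests = [build_request(base_url, prompt, max_tokens) for _ in range(concurrency)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        counts = list(pool.map(send, requests))
    return output_tokens_per_second(counts, time.perf_counter() - start)
```

Calling `run(base_url, prompt, concurrency=8)` against a running server would return aggregate output tokens per second at C = 8. Matching ISL 2048 additionally requires a prompt of about 2048 tokens, which has to be measured with the model's tokenizer; the sketch leaves that step out.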
Model Details
Nemotron 3 Super 120B-A12B is a large hybrid Mixture-of-Experts reasoning model from the NVIDIA Nemotron family — 120B total parameters with ~12B active per forward pass. This page covers the NVFP4 checkpoint, which fits in ~60 GB and runs natively on Jetson Thor (Blackwell, sm_110) for efficient 4-bit inference. The checkpoint is ungated — no Hugging Face token required.
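The ~60 GB figure can be sanity-checked from the parameter count and the bits stored per weight. The sketch below is a rough estimate only: it assumes every weight is quantized, and it assumes one FP8 scale is shared by each block of 16 weights (the block size is not stated on this page). Layers kept at higher precision would change the result.

```python
def weight_gigabytes(num_params, bits_per_weight):
    """Storage for `num_params` weights, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 120e9   # total parameters, from the model description
FP4_BITS = 4           # 4-bit payload per weight
BLOCK_SIZE = 16        # ASSUMPTION: weights sharing one FP8 block scale
SCALE_BITS = 8         # one FP8 scale per block

payload_only = weight_gigabytes(TOTAL_PARAMS, FP4_BITS)
with_scales = weight_gigabytes(TOTAL_PARAMS, FP4_BITS + SCALE_BITS / BLOCK_SIZE)
bf16 = weight_gigabytes(TOTAL_PARAMS, 16)

print(f"4-bit payload only: {payload_only:.1f} GB")
print(f"with block scales:  {with_scales:.1f} GB")
print(f"BF16 for reference: {bf16:.1f} GB")
```

The payload-only estimate comes out at 60.0 GB, in line with the figure quoted above. Under the stated block-size assumption the scales add a further 0.5 bits per weight, and the same parameter count in BF16 would need 240.0 GB, well beyond a 128 GB device.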
Architecture
A hybrid Mamba-2 / attention Mixture-of-Experts design (NemotronHForCausalLM):
- Mamba-2 (state-space) layers interleaved with sparse MoE layers and a small number of attention layers
- ~12B active parameters routed per token out of 120B total
- 256K context window
- NVFP4 quantization (E2M1 weights with FP8 block scales) for Blackwell FP4 Tensor Cores
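As an illustration of what "E2M1 weights with block scales" means, the following Python sketch decodes 4-bit E2M1 codes (1 sign bit, 2 exponent bits, 1 mantissa bit) and multiplies them by a shared block scale. The function names are hypothetical, the scale is passed as an ordinary float rather than an encoded FP8 value, and the 4-weight block is shortened for readability. It shows the number format only, not how the Tensor Cores or any inference engine implement it.

```python
def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit) to a float."""
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        # Subnormal range: the value is 0 or 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal range with exponent bias 1: (1 + m/2) * 2^(e - 1).
        magnitude = (1 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


def dequantize_block(codes, block_scale):
    """Recover approximate weights: each decoded value times the block's shared scale."""
    return [decode_e2m1(c) * block_scale for c in codes]


# The eight non-negative magnitudes E2M1 can represent.
print([decode_e2m1(c) for c in range(8)])
# A HYPOTHETICAL 4-weight block with a HYPOTHETICAL scale of 0.25.
print(dequantize_block([0b0001, 0b0111, 0b1010, 0b1101], 0.25))
```

The first line prints `[0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]`, which is the full set of E2M1 magnitudes. Because only these eight values (plus their negatives) exist, the per-block scale carries the dynamic range that lets a small block of weights be stored in 4 bits each.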
Inputs and Outputs
Input: Text
Output: Text
Intended Use Cases
- Agentic Workflows: Function calling and tool use with chain-of-thought reasoning
- Complex Reasoning: Math, coding, and multi-step problem solving where a larger expert pool helps
- Chatbots and RAG: High-quality conversational and retrieval-augmented generation
- On-device Frontier-class Inference: Serving a 120B-class model on a single Jetson Thor via NVFP4
Supported Platforms
- Jetson Thor (T5000, 128 GB) — the ~60 GB of weights plus KV cache require the 128 GB SKU
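The "weights plus KV cache" budget can be estimated with the standard attention-cache formula. In a hybrid design only the attention layers keep a cache that grows with sequence length, while state-space layers hold a fixed-size state. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the model's real configuration, so treat the output as an illustration of the method.

```python
def kv_cache_gigabytes(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """KV cache size in decimal GB: one key and one value vector per token, head, and layer."""
    values = 2 * attn_layers * kv_heads * head_dim * seq_len * batch
    return values * bytes_per_value / 1e9


# HYPOTHETICAL shape values for illustration only; read the real ones from the
# checkpoint's config.json before relying on the result.
ATTN_LAYERS = 8
KV_HEADS = 8
HEAD_DIM = 128
CONTEXT = 256 * 1024      # 256K-token context window, from the architecture list

WEIGHTS_GB = 60           # approximate NVFP4 weight footprint, from this page
TOTAL_GB = 128            # Jetson Thor T5000 memory, from this page

cache_gb = kv_cache_gigabytes(ATTN_LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
headroom_gb = TOTAL_GB - WEIGHTS_GB - cache_gb
print(f"KV cache at full context: {cache_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```

The headroom is not all available to the model: on Jetson the CPU and GPU share one memory pool, so the operating system, the serving engine, and activation buffers draw from the same budget.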
Nemotron 3 Family
| Model | Parameters | Memory | Best For |
|---|---|---|---|
| Nemotron 3 Nano 4B | 4B | 4 GB RAM | Lightweight edge deployment |
| Nemotron 3 Nano 30B-A3B | 30B total / 3B active | 32 GB RAM | Efficient MoE reasoning on AGX Orin |
| Nemotron 3 Nano Omni | 30B total / 3B active | 64 GB RAM | Multimodal reasoning (text, image, audio, video) |
| Nemotron 3 Super 120B-A12B | 120B total / 12B active | 128 GB RAM | Frontier-class reasoning on Jetson Thor |