Nemotron 3 Super 120B-A12B

NVIDIA's large hybrid Mixture-of-Experts reasoning model — 120B total / 12B active — NVFP4 for Blackwell/Thor.

Parameters: 120B total / 12B active
Modalities: Text
Context Length: 256K
License: NVIDIA Open Model License
Precision: NVFP4

Serve the model

Call the model over Web API

Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
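
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are illustrative and not part of any NVIDIA or server SDK, and the default port of 8000 simply mirrors the curl command above.

```python
import json
import urllib.request

MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"


def build_chat_request(host, prompt, port=8000, model=MODEL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(host, prompt, port=8000, model=MODEL, timeout=120):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(host, prompt, port=port, model=model)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers place the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat(os.environ["JETSON_HOST"], "Hello!")` should return the reply as a string (after `import os`). Splitting request construction from sending keeps the payload easy to inspect without a live server.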

llama.cpp server (OpenAI-compatible API)

Once llama-server is running in a container started with --network host, you can call it from another machine on the LAN by setting ${JETSON_HOST} to the Jetson's address. The default port is 8080 unless you set --port.

curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
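
Before sending requests from another machine, it can help to confirm that the server port is reachable over the LAN. The sketch below is a plain TCP connection test, not a llama.cpp API call; `port_is_open` is a hypothetical helper name, and 8080 is assumed only because it matches the curl command above.

```python
import socket


def port_is_open(host, port=8080, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

A `False` result usually points to a wrong host or port, a firewall, or a server bound only to the loopback interface. A `True` result only shows that something is listening, not that the model has finished loading.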

Benchmark

Nemotron 3 Super 120B-A12B · vLLM · NVFP4 · ISL 2048 / OSL 128 (input / output sequence length, in tokens)

C = concurrent requests. Results will vary with container image, clocks, and workload.
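
To relate a run at a given concurrency to a throughput number, one rough approach is to divide the generated tokens by wall-clock time. The sketch below assumes all C requests start together and finish within the measured window. The function names are illustrative, and the timing in the usage note is a made-up placeholder, not a measured result from this page.

```python
def tokens_per_wave(concurrency, isl=2048, osl=128):
    """Total tokens (prompt plus generated) handled by one wave of requests."""
    return concurrency * (isl + osl)


def output_tokens_per_second(concurrency, osl, wall_seconds):
    """Aggregate generation throughput for one wave of concurrent requests."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return concurrency * osl / wall_seconds
```

For example, with a hypothetical wall time of 16 seconds at C = 8, `output_tokens_per_second(8, 128, 16.0)` gives 64.0 output tokens per second across all requests.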

Model Details

Nemotron 3 Super 120B-A12B is a large hybrid Mixture-of-Experts reasoning model from the NVIDIA Nemotron family — 120B total parameters with ~12B active per forward pass. This page covers the NVFP4 checkpoint, which fits in ~60 GB and runs natively on Jetson Thor (Blackwell, sm_110) for efficient 4-bit inference. The checkpoint is ungated — no Hugging Face token required.
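
The ~60 GB figure follows from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope estimate only: it counts the raw weight payload and ignores block scales, any tensors kept in higher precision, and runtime buffers. `weight_gigabytes` is an illustrative helper name.

```python
def weight_gigabytes(params, bits_per_weight):
    """Raw weight payload in GB (10**9 bytes), ignoring scales and metadata."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 120e9  # total parameters, from the model name

fp4_gb = weight_gigabytes(PARAMS, 4)    # 4-bit payload
bf16_gb = weight_gigabytes(PARAMS, 16)  # 16-bit weights, for comparison
print(f"4-bit: {fp4_gb:.0f} GB, 16-bit: {bf16_gb:.0f} GB")
```

At 4 bits per weight the payload is about 60 GB, versus about 240 GB at 16 bits, which is why the 4-bit checkpoint is the one that fits on a single 128 GB device.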

Architecture

A hybrid Mamba-2 / attention Mixture-of-Experts design (NemotronHForCausalLM):

  • Mamba-2 (state-space) layers interleaved with sparse MoE layers and a small number of attention layers
  • ~12B active parameters routed per token out of 120B total
  • 256K context window, NVFP4 (E2M1 weights with FP8 block scales) for Blackwell FP4 Tensor Cores
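
To make the E2M1-plus-block-scale idea concrete, here is a simplified sketch of 4-bit block quantization. E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit) can represent the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, and 6. The sketch keeps the shared block scale as an ordinary Python float, whereas the real format stores it in FP8. The function names are illustrative, and this is not NVIDIA's implementation.

```python
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (top bit is the sign) to a float."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]


def quantize_block(values):
    """Quantize one block of floats to (scale, codes) with a shared scale."""
    peak = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto 6.0, the largest E2M1 value.
    scale = peak / 6.0 if peak > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        # Pick the nearest representable magnitude.
        idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - target))
        codes.append(idx | (0b1000 if v < 0 else 0))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from a scale and 4-bit codes."""
    return [decode_e2m1(c) * scale for c in codes]
```

Each weight costs 4 bits plus a share of its block's scale. Values that do not land on the scaled grid are rounded to the nearest representable point, which is where quantization error comes from.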

Inputs and Outputs

Input: Text

Output: Text

Intended Use Cases

  • Agentic Workflows: Function calling and tool use with chain-of-thought reasoning
  • Complex Reasoning: Math, coding, and multi-step problem solving where a larger expert pool helps
  • Chatbots and RAG: High-quality conversational and retrieval-augmented generation
  • On-device Frontier-class Inference: Serving a 120B-class model on a single Jetson Thor via NVFP4

Supported Platforms

  • Jetson Thor (T5000, 128 GB) — the ~60 GB of weights plus KV cache require the 128 GB SKU

Nemotron 3 Family

Model                        Parameters                 Memory       Best For
Nemotron 3 Nano 4B           4B                         4 GB RAM     Lightweight edge deployment
Nemotron 3 Nano 30B-A3B      30B total / 3B active      32 GB RAM    Efficient MoE reasoning on AGX Orin
Nemotron 3 Nano Omni         30B total / 3B active      64 GB RAM    Multimodal reasoning (text, image, audio, video)
Nemotron 3 Super 120B-A12B   120B total / 12B active    128 GB RAM   Frontier-class reasoning on Jetson Thor