Nemotron 3 Super 120B-A12B

NVIDIA's large hybrid Mixture-of-Experts reasoning model — 120B total / 12B active — NVFP4 for Blackwell/Thor.

Parameters 120B total / 12B active

Modalities Text

Context Length 256K

License NVIDIA Open Model License

Precision NVFP4

Serve the model

Start server

Choose module, then engine and optional parameters on the left, then copy the serve command by clicking the button on the right.

Call the model over Web API

Copy a client example below (curl for the terminal, Python for a script) to send a Web API request to the model you just served. The first pair targets port 8000, the vLLM default.

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

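import os

from openai import OpenAI

# Python does not expand ${JETSON_HOST}; read it from the environment instead.
host = os.environ.get("JETSON_HOST", "localhost")

client = OpenAI(
    base_url=f"http://{host}:8000/v1",
    api_key="not-needed",  # vLLM / llama.cpp typically do not enforce a key
)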

completion = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)

llama.cpp server (OpenAI-compatible API)

Once llama-server is running in a container started with --network host, you can call it from another machine on the LAN. Set JETSON_HOST to the Jetson's hostname or IP address first. llama-server listens on port 8080 by default unless you pass --port.

curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

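import os

from openai import OpenAI

# Python does not expand ${JETSON_HOST}; read it from the environment instead.
host = os.environ.get("JETSON_HOST", "localhost")

client = OpenAI(
    base_url=f"http://{host}:8080/v1",
    api_key="not-needed",  # vLLM / llama.cpp typically do not enforce a key
)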

completion = client.chat.completions.create(
    model="my_model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
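
If the openai package is not installed on the client machine, the same request can be sent with only the Python standard library. The sketch below is a minimal illustration, assuming the server exposes the OpenAI-compatible /v1/chat/completions and /v1/models routes. The helper names build_chat_request, list_model_ids, and main are made up for this example, and JETSON_HOST is read from the environment with a localhost fallback.

```python
import json
import os
import urllib.request


def build_chat_request(host, port, model, prompt):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def list_model_ids(host, port, timeout=10):
    """Return the model ids the server reports at /v1/models."""
    url = f"http://{host}:{port}/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return [entry["id"] for entry in json.load(resp)["data"]]


def main():
    host = os.environ.get("JETSON_HOST", "localhost")
    port = 8080  # llama-server default; use 8000 for the vLLM example
    model = list_model_ids(host, port)[0]  # avoids hard-coding the served name
    request = build_chat_request(host, port, model, "Hello!")
    with urllib.request.urlopen(request, timeout=120) as resp:
        reply = json.load(resp)
    print(reply["choices"][0]["message"]["content"])
```

Calling main() performs the network round trip. Asking /v1/models for the served id first is one way to avoid guessing a placeholder such as my_model.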

Benchmark

[Benchmark chart: Nemotron 3 Super 120B-A12B · vLLM · NVFP4 · ISL 2048 / OSL 128; controls: Engine, Concurrency]

C = concurrent requests; ISL and OSL are the input and output sequence lengths in tokens. Results will vary with container image, clocks, and workload.

Model Details

Nemotron 3 Super 120B-A12B is a large hybrid Mixture-of-Experts reasoning model from the NVIDIA Nemotron family — 120B total parameters with ~12B active per forward pass. This page covers the NVFP4 checkpoint, which fits in ~60 GB and runs natively on Jetson Thor (Blackwell, sm_110) for efficient 4-bit inference. The checkpoint is ungated — no Hugging Face token required.
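
The ~60 GB figure follows from simple arithmetic on the parameter count and the bits stored per weight. The sketch below is a back-of-the-envelope estimate, not a measurement of the checkpoint. weight_gb is a hypothetical helper, a GB is taken as 10^9 bytes, and only the raw weight payload is counted, so block scales and any layers kept in higher precision would shift the real number.

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes), payload only."""
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 120e9  # 120B total parameters, from the model description

print(f"4-bit payload:  {weight_gb(TOTAL_PARAMS, 4):.0f} GB")
print(f"16-bit weights: {weight_gb(TOTAL_PARAMS, 16):.0f} GB")
```

Under these assumptions the 4-bit payload comes to about 60 GB, while the same weights at 16 bits would need about 240 GB.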

Architecture

A hybrid Mamba-2 / attention Mixture-of-Experts design (NemotronHForCausalLM):

Mamba-2 (state-space) layers interleaved with sparse MoE layers and a small number of attention layers
~12B active parameters routed per token out of 120B total
256K context window, NVFP4 (E2M1 weights with FP8 block scales) for Blackwell FP4 Tensor Cores
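
To make "E2M1 weights with block scales" concrete, here is a simplified sketch of block-scaled 4-bit quantization. It is an illustration only, not NVIDIA's implementation. The block length of 16 is an assumption, the scale is kept as an ordinary Python float rather than being rounded to FP8, and quantize_block and dequantize_block are hypothetical names. The one firm ingredient is the E2M1 value set: with 1 sign bit, 2 exponent bits, and 1 mantissa bit, the representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6.

```python
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # non-negative E2M1 values
BLOCK_SIZE = 16  # assumed block length for this sketch


def quantize_block(values):
    """Map one block of floats to (scale, list of signed E2M1 values)."""
    peak = max(abs(v) for v in values)
    # Scale so the largest magnitude in the block lands on 6.0, the E2M1 maximum.
    scale = peak / 6.0 if peak > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from a scale and E2M1 values."""
    return [scale * c for c in codes]


block = [0.03 * (i - 8) for i in range(BLOCK_SIZE)]  # made-up sample weights
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
print("scale:", scale)
print("max abs error:", max(abs(a - b) for a, b in zip(block, restored)))
```

Because each block carries its own scale, a block of small weights still uses the full E2M1 range instead of collapsing to zero, which is the motivation for per-block scaling.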

Inputs and Outputs

Input: Text

Output: Text

Intended Use Cases

Agentic Workflows: Function calling and tool use with chain-of-thought reasoning
Complex Reasoning: Math, coding, and multi-step problem solving where a larger expert pool helps
Chatbots and RAG: High-quality conversational and retrieval-augmented generation
On-device Frontier-class Inference: Serving a 120B-class model on a single Jetson Thor via NVFP4

Supported Platforms

Jetson Thor (T5000, 128 GB) — the ~60 GB of weights plus KV cache require the 128 GB SKU

Nemotron 3 Family

Model	Parameters	Memory	Best For
Nemotron 3 Nano 4B	4B	4 GB RAM	Lightweight edge deployment
Nemotron 3 Nano 30B-A3B	30B total / 3B active	32 GB RAM	Efficient MoE reasoning on AGX Orin
Nemotron 3 Nano Omni	30B total / 3B active	64 GB RAM	Multimodal reasoning (text, image, audio, video)
Nemotron 3 Super 120B-A12B	120B total / 12B active	128 GB RAM	Frontier-class reasoning on Jetson Thor