
Nemotron 3 Nano 30B-A3B

NVIDIA's flagship hybrid MoE reasoning model with 30B total / 3.5B active parameters

Memory Requirement: 32GB RAM
Precision: FP4
Size: 17GB
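The 17GB checkpoint size follows roughly from storing the 30B weights at 4 bits each. A quick sanity check (rough arithmetic only; it ignores the scale factors, higher-precision layers, and metadata that make up the remaining ~2GB):

```python
# Rough size estimate for a 30B-parameter model quantized to FP4 (4 bits/weight).
total_params = 30e9
bits_per_weight = 4
weight_gb = total_params * bits_per_weight / 8 / 1e9
print(f"raw FP4 weights: {weight_gb:.0f} GB")  # ~15 GB of raw weights
```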

Jetson Inference - Supported Inference Engines

Container
# Run Command
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ \
    --gpu-memory-utilization 0.8 \
    --trust-remote-code

Note: The Thor command requires a Hugging Face access token with access to the gated NVFP4 checkpoint. The Orin command uses a community AWQ checkpoint that does not require authentication. If you see “Free memory on device … is less than desired GPU memory utilization”, lower --gpu-memory-utilization in the Advanced options.
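Once the container is up, vLLM exposes an OpenAI-compatible API (port 8000 by default). A minimal stdlib-only sketch for querying it; `build_chat_request` and `chat` are hypothetical helpers, and the model name must match whatever was passed to `vllm serve`:

```python
import json
import urllib.request

def build_chat_request(prompt,
                       model="stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ",
                       max_tokens=256):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, host="http://localhost:8000"):
    # vLLM serves the OpenAI-compatible endpoint at /v1/chat/completions.
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Inspect the payload; call chat(...) once the server is running.
    print(json.dumps(build_chat_request("Why is the sky blue?"), indent=2))
```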

Architecture

The model employs a hybrid Mixture-of-Experts (MoE) architecture:

  • 23 Mamba-2 and MoE layers
  • 6 Attention layers
  • 128 experts + 1 shared expert per MoE layer
  • 6 experts activated per token
  • 3.5B active parameters / 30B total parameters
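The 3.5B active figure comes from each token touching only 6 of the 128 routed experts (plus the shared expert and the dense layers). A toy sketch of top-k routing, just to make the activation fraction concrete; the real router is a learned gate, not random logits:

```python
import math
import random

NUM_EXPERTS = 128  # routed experts per MoE layer (from the model card)
TOP_K = 6          # experts activated per token
# ...plus 1 shared expert that every token always passes through.

def route(logits, k=TOP_K):
    """Toy top-k router: softmax the gate logits, keep the k largest."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    topk = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return topk, [probs[i] for i in topk]

random.seed(0)
experts, weights = route([random.gauss(0, 1) for _ in range(NUM_EXPERTS)])
print(f"token routed to experts {experts} (plus the shared expert)")
print(f"fraction of routed experts active: {TOP_K / NUM_EXPERTS:.1%}")  # 4.7%
```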

Inputs and Outputs

Input: Text

Output: Text

Intended Use Cases

  • AI Agent Systems: Build autonomous agents with strong reasoning capabilities
  • Chatbots: General purpose conversational AI
  • RAG Systems: Retrieval-augmented generation applications
  • Reasoning Tasks: Complex problem-solving with configurable reasoning traces
  • Instruction Following: General instruction-following tasks

Supported Languages

English, Spanish, French, German, Japanese, Italian, and programming languages.

Reasoning Configuration

The model’s reasoning capabilities can be configured through a flag in the chat template:

  • With reasoning traces: Higher-quality solutions for complex queries
  • Without reasoning traces: Faster responses with slight accuracy trade-off for simpler tasks

Skipping reasoning (minimize TTFT)

For low-latency or single-token tasks (e.g. picking a number for a pre-scripted response), disable reasoning so the model does not generate a <think> block first:

  • Per request: Pass extra_body={"chat_template_kwargs": {"enable_thinking": False}} in your chat completion call, and use max_tokens=1 (or 2) if you only need one token.
  • Server default: Add --default-chat-template-kwargs '{"enable_thinking": false}' to the vllm serve command so all requests skip reasoning by default and TTFT stays minimal.
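When calling the server over raw HTTP, chat_template_kwargs goes at the top level of the request body (the OpenAI client's extra_body is merged there). A minimal sketch of the payload for the single-token case; the model name is the checkpoint served above:

```python
import json

def reasoning_kwargs(enable_thinking: bool) -> dict:
    """Wrap the reasoning toggle the way vLLM forwards it to the chat template."""
    return {"chat_template_kwargs": {"enable_thinking": enable_thinking}}

payload = {
    "model": "stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ",
    "messages": [{"role": "user", "content": "Pick a number from 1 to 4. "
                                             "Answer with the digit only."}],
    "max_tokens": 1,  # single-token answer: no budget wasted on a <think> block
    **reasoning_kwargs(enable_thinking=False),
}
print(json.dumps(payload, indent=2))
```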