Nemotron 3 Nano 30B-A3B

NVIDIA's flagship hybrid MoE reasoning model with 30B total / 3.5B active parameters

Parameters: 30B total / 3.5B active
Modalities: Text
Context Length: 256K
License: NVIDIA Nemotron Open Model License
Precision: FP4

Serve the model

Start server

Choose a module, then an engine and any optional parameters on the left, then copy the serve command with the button on the right.

Call the model over Web API

Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
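
The same request can be issued from Python. The sketch below uses only the standard library and assumes the server exposes the OpenAI-compatible chat completions route shown in the curl command above. `build_chat_request` and `extract_reply` are hypothetical helper names, and the response shape (`choices[0].message.content`) is assumed to follow the OpenAI chat completions format.

```python
import json
import os
import urllib.request


def build_chat_request(host, model, prompt, port=8000):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json):
    """Return the assistant text from a chat completions response (assumed shape)."""
    return response_json["choices"][0]["message"]["content"]


# Only contact the server when JETSON_HOST is set in the environment.
if __name__ == "__main__" and os.environ.get("JETSON_HOST"):
    req = build_chat_request(
        os.environ["JETSON_HOST"],
        "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
        "Hello!",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        print(extract_reply(json.load(resp)))
```

Keeping request construction separate from the network call makes it easy to inspect the exact URL and JSON body before anything is sent.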

With ollama serve running on the Jetson, call its OpenAI-compatible endpoint from another host. Set JETSON_HOST to the Jetson's address, and match the model name to the one you pulled on the device.

curl -s http://${JETSON_HOST}:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'

With ollama serve running on the Jetson, Ollama's native /api/generate endpoint can be called from another host in the same manner: set JETSON_HOST to the Jetson's address, and match the model name to the one you pulled on the device.

curl -s http://${JETSON_HOST}:11434/api/generate -d '{
  "model": "NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
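
The `"stream": false` field asks the endpoint for one JSON object rather than a stream. As a minimal sketch of reading either form, the helper below assumes Ollama's native response format: the generated text sits in a `response` field, and when streaming is left on, each newline-delimited JSON chunk carries a fragment in that field until a chunk with `"done": true` arrives. `collect_generate_text` is a hypothetical name, and the sample bodies are illustrative, not real model output.

```python
import json


def collect_generate_text(raw_body):
    """Return the generated text from an /api/generate body, streamed or not.

    Assumes one compact JSON object per line, each with a "response" field.
    """
    parts = []
    for line in raw_body.splitlines():
        line = line.strip()
        if not line:
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # final chunk reached
    return "".join(parts)


# Illustrative bodies only.
single = '{"model": "m", "response": "Rayleigh scattering.", "done": true}'
streamed = (
    '{"response": "Rayleigh ", "done": false}\n'
    '{"response": "scattering.", "done": false}\n'
    '{"response": "", "done": true}\n'
)
print(collect_generate_text(single))
print(collect_generate_text(streamed))
```

Both calls print the same sentence, which is the point of the non-streaming flag: the caller gets the full answer without reassembling chunks.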

One-shot inference

Choose a Jetson module, adjust optional parameters, then copy the command to run a single inference on the device.


Benchmark

Nemotron 3 Nano 30B-A3B · vLLM · NVFP4 / W4A16 · ISL 2048 / OSL 128

[Benchmark chart: results by engine and concurrency]

ISL / OSL = input / output sequence length in tokens; C = concurrent requests. Results will vary with the image, clocks, and workload.

Note: The Thor command requires a Hugging Face access token with access to the gated NVFP4 checkpoint. The Orin command uses a community AWQ checkpoint that does not require authentication. If you see “Free memory on device … is less than desired GPU memory utilization”, lower --gpu-memory-utilization in the Advanced options.
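
As a rough illustration of that memory error, the sketch below assumes the check compares a requested budget (total device memory multiplied by `--gpu-memory-utilization`) against the memory that is free at startup, which is how the message reads. The helper name, the default 5% margin, and the sample figures are hypothetical; read the actual free and total values from the error message or a system monitor.

```python
def max_gpu_memory_utilization(free_gib, total_gib, margin=0.05):
    """Largest utilization fraction whose budget still fits in free memory.

    Assumption: the requested budget is total_gib * utilization and must not
    exceed free_gib. `margin` leaves headroom for other processes.
    """
    if total_gib <= 0:
        raise ValueError("total_gib must be positive")
    fraction = free_gib / total_gib - margin
    # Truncate to two decimals, then clamp to the valid range [0, 1].
    truncated = int(fraction * 100) / 100
    return max(0.0, min(1.0, truncated))


# Hypothetical figures: 60 GiB free out of 120 GiB, no extra margin.
print(max_gpu_memory_utilization(60, 120, margin=0.0))
```

The result is only a starting point for the flag; the model weights and KV cache must still fit within whatever budget remains.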

Architecture

The model employs a hybrid Mixture-of-Experts (MoE) architecture:

  • 23 Mamba-2 and MoE layers
  • 6 Attention layers
  • 128 experts + 1 shared expert per MoE layer
  • 6 experts activated per token
  • 3.5B active parameters / 30B total parameters
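
To make the sparsity concrete, the figures above imply that only a small share of the expert weights takes part in each token. The short calculation below uses just the numbers listed; the variable names are illustrative, and treating the shared expert as always active is an assumption about how "shared" is meant.

```python
routed_experts = 128     # routed experts per MoE layer
active_routed = 6        # routed experts activated per token
shared_experts = 1       # assumed to run for every token
total_params_b = 30.0    # total parameters, in billions
active_params_b = 3.5    # active parameters, in billions

routed_fraction = active_routed / routed_experts
experts_run_per_token = active_routed + shared_experts
active_param_fraction = active_params_b / total_params_b

print(f"Routed experts used per token: {routed_fraction:.1%}")   # 4.7%
print(f"Experts evaluated per MoE layer: {experts_run_per_token} "
      f"of {routed_experts + shared_experts}")                    # 7 of 129
print(f"Active parameter share: {active_param_fraction:.1%}")     # 11.7%
```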

Inputs and Outputs

Input: Text

Output: Text

Intended Use Cases

  • AI Agent Systems: Build autonomous agents with strong reasoning capabilities
  • Chatbots: General purpose conversational AI
  • RAG Systems: Retrieval-augmented generation applications
  • Reasoning Tasks: Complex problem-solving with configurable reasoning traces
  • Instruction Following: General instruction-following tasks

Supported Languages

English, Spanish, French, German, Japanese, Italian, and programming languages.

Reasoning Configuration

The model’s reasoning capabilities can be configured through a flag in the chat template:

  • With reasoning traces: Higher-quality solutions for complex queries
  • Without reasoning traces: Faster responses for simpler tasks, at a slight cost in accuracy

Skipping reasoning (minimize TTFT)

For low-latency or single-token tasks (e.g. picking a number for a pre-scripted response), disable reasoning so the model does not generate a <think> block first:

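  • Per request: Pass extra_body={"chat_template_kwargs": {"enable_thinking": False}} in your chat completion call when using the OpenAI Python client (in a raw JSON body, send "chat_template_kwargs": {"enable_thinking": false} as a top-level field), and set max_tokens=1 (or 2) if you only need one token.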
  • Server default: Add --default-chat-template-kwargs '{"enable_thinking": false}' to the vllm serve command so all requests skip reasoning by default and TTFT stays minimal.
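
For clients that post JSON directly, a minimal sketch of the per-request option follows. It assumes the server accepts `chat_template_kwargs` as a top-level field of the chat completions body, which is where `extra_body` entries end up. `build_no_think_payload` is a hypothetical helper name and the prompt is only an example.

```python
import json


def build_no_think_payload(model, prompt, max_tokens=1):
    """Chat completions body that asks the chat template to skip reasoning."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # 1 (or 2) when a single token is enough
        # Assumed top-level field, equivalent to extra_body in the Python client.
        "chat_template_kwargs": {"enable_thinking": False},
    }


payload = build_no_think_payload(
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
    "Pick a number from 1 to 5. Answer with the digit only.",
)
print(json.dumps(payload, indent=2))
```

Serializing with `json.dumps` turns Python's `False` into the JSON literal `false`, so the printed body can be passed unchanged to curl's `-d` option.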