
Nemotron 3 Nano 30B-A3B

NVIDIA's flagship hybrid MoE reasoning model with 30B total / 3.5B active parameters

Memory Requirement: 32GB RAM
Precision: FP4
Size: 17GB
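The 17GB checkpoint size follows roughly from storing the 30B weights at 4 bits each. A quick sanity check (rough arithmetic only; it ignores the scale factors, higher-precision layers, and metadata that make up the remaining ~2GB):

```python
# Rough size estimate for a 30B-parameter model quantized to FP4 (4 bits/weight).
total_params = 30e9
bits_per_weight = 4
weight_gb = total_params * bits_per_weight / 8 / 1e9
print(f"raw FP4 weights: {weight_gb:.0f} GB")  # ~15 GB of raw weights
```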

Jetson Inference - Supported Inference Engines

Container
# Run Command
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ \
    --gpu-memory-utilization 0.8 \
    --trust-remote-code

Note: The Thor command requires a Hugging Face access token with access to the gated NVFP4 checkpoint. The Orin command uses a community AWQ checkpoint that does not require authentication. If you see “Free memory on device … is less than desired GPU memory utilization”, lower --gpu-memory-utilization in the Advanced options.
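Once the container is up, vLLM exposes an OpenAI-compatible API (port 8000 by default). A minimal stdlib-only sketch for querying it; `build_chat_request` and `chat` are hypothetical helpers, and the model name must match whatever was passed to `vllm serve`:

```python
import json
import urllib.request

def build_chat_request(prompt,
                       model="stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ",
                       max_tokens=256):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, host="http://localhost:8000"):
    # vLLM serves the OpenAI-compatible endpoint at /v1/chat/completions.
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Inspect the payload; call chat(...) once the server is running.
    print(json.dumps(build_chat_request("Why is the sky blue?"), indent=2))
```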

Architecture

The model employs a hybrid Mixture-of-Experts (MoE) architecture:

  • 23 Mamba-2 and MoE layers
  • 6 Attention layers
  • 128 experts + 1 shared expert per MoE layer
  • 6 experts activated per token
  • 3.5B active parameters / 30B total parameters
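The 3.5B active figure comes from each token touching only 6 of the 128 routed experts (plus the shared expert and the dense layers). A toy sketch of top-k routing, just to make the activation fraction concrete; the real router is a learned gate, not random logits:

```python
import math
import random

NUM_EXPERTS = 128  # routed experts per MoE layer (from the model card)
TOP_K = 6          # experts activated per token
# ...plus 1 shared expert that every token always passes through.

def route(logits, k=TOP_K):
    """Toy top-k router: softmax the gate logits, keep the k largest."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    topk = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return topk, [probs[i] for i in topk]

random.seed(0)
experts, weights = route([random.gauss(0, 1) for _ in range(NUM_EXPERTS)])
print(f"token routed to experts {experts} (plus the shared expert)")
print(f"fraction of routed experts active: {TOP_K / NUM_EXPERTS:.1%}")  # 4.7%
```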

Inputs and Outputs

Input: Text

Output: Text

Intended Use Cases

  • AI Agent Systems: Build autonomous agents with strong reasoning capabilities
  • Chatbots: General purpose conversational AI
  • RAG Systems: Retrieval-augmented generation applications
  • Reasoning Tasks: Complex problem-solving with configurable reasoning traces
  • Instruction Following: General instruction-following tasks

Supported Languages

English, Spanish, French, German, Japanese, Italian, and programming languages.

Reasoning Configuration

The model’s reasoning capabilities can be configured through a flag in the chat template:

  • With reasoning traces: Higher-quality solutions for complex queries
  • Without reasoning traces: Faster responses with slight accuracy trade-off for simpler tasks

Skipping reasoning (minimize TTFT)

For low-latency or single-token tasks (e.g. picking a number for a pre-scripted response), disable reasoning so the model does not generate a <think> block first:

  • Per request: Pass extra_body={"chat_template_kwargs": {"enable_thinking": False}} in your chat completion call, and use max_tokens=1 (or 2) if you only need one token.
  • Server default: Add --default-chat-template-kwargs '{"enable_thinking": false}' to the vllm serve command so all requests skip reasoning by default and TTFT stays minimal.
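When calling the server over raw HTTP, chat_template_kwargs goes at the top level of the request body (the OpenAI client's extra_body is merged there). A minimal sketch of the payload for the single-token case; the model name is the checkpoint served above:

```python
import json

def reasoning_kwargs(enable_thinking: bool) -> dict:
    """Wrap the reasoning toggle the way vLLM forwards it to the chat template."""
    return {"chat_template_kwargs": {"enable_thinking": enable_thinking}}

payload = {
    "model": "stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ",
    "messages": [{"role": "user", "content": "Pick a number from 1 to 4. "
                                             "Answer with the digit only."}],
    "max_tokens": 1,  # single-token answer: no budget wasted on a <think> block
    **reasoning_kwargs(enable_thinking=False),
}
print(json.dumps(payload, indent=2))
```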