Nemotron 3 Nano 30B-A3B
NVIDIA's flagship hybrid MoE reasoning model with 30B total / 3.5B active parameters
# Run Command
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
vllm serve stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ \
--gpu-memory-utilization 0.8 \
--trust-remote-code
Model Details
Note: The Thor command requires a Hugging Face access token with access to the gated NVFP4 checkpoint. The Orin command uses a community AWQ checkpoint that does not require authentication. If you see “Free memory on device … is less than desired GPU memory utilization”, lower --gpu-memory-utilization in the Advanced options.
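Once the container above is serving, vLLM exposes an OpenAI-compatible HTTP API (port 8000 by default). A minimal sketch of a chat-completion request body for this checkpoint — the prompt text is illustrative:

```python
import json

# Request body for POST /v1/chat/completions on the local vLLM server.
payload = {
    "model": "stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ",
    "messages": [
        {"role": "user", "content": "Summarize MoE routing in one sentence."}
    ],
    "max_tokens": 256,
}
body = json.dumps(payload)
```

Send it with any HTTP client, e.g. `curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d "$body"`, or point the OpenAI Python client at `http://localhost:8000/v1`.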
Architecture
The model employs a hybrid Mixture-of-Experts (MoE) architecture:
- 23 Mamba-2 and MoE layers
- 6 Attention layers
- 128 experts + 1 shared expert per MoE layer
- 6 experts activated per token
- 3.5B active parameters / 30B total parameters
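To make the expert counts above concrete, here is a generic top-k softmax routing sketch — not NVIDIA's actual implementation, just the standard MoE pattern with the numbers from this spec (128 routed experts, 6 activated per token; the shared expert runs for every token in addition to the routed ones):

```python
import math

def route(gate_logits, k=6):
    """Pick the top-k experts by gate logit and softmax-normalize their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

# 128 routed experts; a handful get non-zero gate logits for this token.
logits = [0.0] * 128
logits[3], logits[17], logits[42], logits[64], logits[90], logits[127] = 5, 4, 3, 2, 1, 0.5
experts, weights = route(logits)
# Each token's output is the weight-combined output of these 6 experts
# plus the always-on shared expert.
```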
Inputs and Outputs
Input: Text
Output: Text
Intended Use Cases
- AI Agent Systems: Build autonomous agents with strong reasoning capabilities
- Chatbots: General purpose conversational AI
- RAG Systems: Retrieval-augmented generation applications
- Reasoning Tasks: Complex problem-solving with configurable reasoning traces
- Instruction Following: General instruction-following tasks
Supported Languages
English, Spanish, French, German, Japanese, Italian, and coding languages.
Reasoning Configuration
The model’s reasoning capabilities can be configured through a flag in the chat template:
- With reasoning traces: Higher-quality solutions for complex queries
- Without reasoning traces: Faster responses with slight accuracy trade-off for simpler tasks
Skipping reasoning (minimize TTFT)
For low-latency or single-token tasks (e.g. picking a number for a pre-scripted response), disable reasoning so the model does not generate a <think> block first:
- Per request: Pass `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` in your chat completion call, and use `max_tokens=1` (or 2) if you only need one token.
- Server default: Add `--default-chat-template-kwargs '{"enable_thinking": false}'` to the `vllm serve` command so all requests skip reasoning by default and TTFT stays minimal.
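A sketch of the per-request override on the wire: with the OpenAI Python client, `chat_template_kwargs` is passed via `extra_body` and ends up as a top-level field of the request JSON (field names as documented by vLLM; the prompt is illustrative):

```python
import json

# The extra_body dict the OpenAI client merges into the request JSON.
extra_body = {"chat_template_kwargs": {"enable_thinking": False}}

request = {
    "model": "stelterlab/NVIDIA-Nemotron-3-Nano-30B-A3B-AWQ",
    "messages": [{"role": "user", "content": "Pick a number from 1 to 5."}],
    "max_tokens": 1,  # single-token answer; no <think> block is generated first
    **extra_body,
}
wire = json.dumps(request)
```

With thinking disabled, the first generated token is the answer itself, so `max_tokens=1` works and TTFT stays minimal.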