Nemotron 3 Nano Omni
NVIDIA's multimodal reasoning model with language, vision, audio, and video understanding: 30B total / 3B active MoE, available in NVFP4, FP8, and BF16.
Serve the model
Start server
Start the server with one of the engine-specific commands under Model Details below (vLLM, SGLang, llama.cpp, Ollama, or TensorRT Edge-LLM), then call it over the Web API as shown in the next section.
Call the model over Web API
Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.
vLLM server (OpenAI-compatible API)
After vllm serve is running (see Running with vLLM below), call it from another machine on the LAN (set ${JETSON_HOST}). The default port is 8000.
curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
SGLang server (OpenAI-compatible API)
After the SGLang server is running (see Running with SGLang below), call it from another machine on the LAN (set ${JETSON_HOST}). The default port is 30000.
curl -s http://${JETSON_HOST}:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
llama.cpp server (OpenAI-compatible API)
After llama-server is running with --network host, call it from another machine on the LAN (set ${JETSON_HOST}). The default port is 8080 unless you set --port.
curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my_model",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Ollama server (OpenAI-compatible API)
With ollama serve running on the Jetson, call it from another host (set ${JETSON_HOST}). Match the model name to what you pulled on the device.
curl -s http://${JETSON_HOST}:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4",
"messages": [{"role": "user", "content": "Why is the sky blue?"}]
}'
Ollama server (native API)
The same Ollama server also exposes its native /api/generate endpoint:
curl -s http://${JETSON_HOST}:11434/api/generate -d '{
"model": "Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4",
"prompt": "Why is the sky blue?",
"stream": false
}'
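Because the model also accepts images, the same OpenAI-compatible endpoint can take multimodal messages. The request below is a minimal sketch against the vLLM endpoint on port 8000, assuming the serving engine supports the OpenAI-style image_url content part (vLLM's OpenAI-compatible server does for multimodal models); the image URL is only a placeholder.
curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
{"type": "text", "text": "Describe this image."}
]
}]
}'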
Benchmark
Nemotron 3 Nano Omni · vLLM · FP8 / NVFP4 · ISL 2048 / OSL 128
C = concurrent requests. Results will vary with the software image, clock settings, and workload.
Model Details
Nemotron 3 Nano Omni is NVIDIA's multimodal reasoning model combining language, vision, audio, and video understanding. It uses a Mixture-of-Experts architecture with 30B total parameters and 3B active per forward pass, delivering strong multimodal reasoning with efficient inference on Jetson platforms.
Inputs and Outputs
Input: Text, image, audio, and video
Output: Text
Intended Use Cases
- Multimodal Assistants: Answering questions about images, audio clips, and video segments
- Voice and Vision Interfaces: Edge AI applications combining speech and visual understanding
- Agentic Workflows: Function calling with chain-of-thought reasoning for autonomous task execution
- Document Understanding: OCR, chart analysis, and visual document Q&A
- Audio Transcription and Analysis: Processing short audio clips with context awareness
Supported Platforms
- Jetson Thor
Running with vLLM
sudo docker run -it --rm --pull always \
--runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:v0.20.0-ubuntu2404 \
bash -c "pip install -q 'vllm[audio]' && vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
--trust-remote-code \
--gpu-memory-utilization 0.65 \
--max-model-len 32768 \
--reasoning-parser nemotron_v3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder"
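Once the server is running, query it with (vLLM's OpenAI-compatible server listens on port 8000 by default):
curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4",
"messages": [{"role": "user", "content": "Hello!"}]
}'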
Running with SGLang
sudo docker run -it --rm --pull always \
--runtime=nvidia --network host \
--shm-size 16g \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
--entrypoint bash \
lmsysorg/sglang:latest \
-c "pip install -q librosa && sglang serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
--trust-remote-code \
--mem-fraction-static 0.80 \
--max-running-requests 8 \
--reasoning-parser nemotron_3 \
--tool-call-parser qwen3_coder \
--disable-cuda-graph"
Note: SGLang uses --reasoning-parser nemotron_3 (distinct from vLLM's nemotron_v3). --disable-cuda-graph is required for this model. librosa is installed at runtime for audio encoder support.
Once the server is running, query it with:
curl -s http://${JETSON_HOST}:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Running with llama.cpp
sudo docker run -it --rm --pull always \
--runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/data/models/huggingface \
-v $HOME/.cache/llama.cpp:/data/models/llama.cpp \
ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-thor \
bash -lc 'MODEL=$(huggingface-downloader ggml-org/Nemotron-3-Nano-Omni-GGUF/nemotron-3-nano-omni-ga_v1.0-Q4_K_M.gguf) && \
MMPROJ=$(huggingface-downloader ggml-org/Nemotron-3-Nano-Omni-GGUF/mmproj-nemotron-3-nano-omni-ga_v1.0.gguf) && \
llama-server -m "$MODEL" -mm "$MMPROJ" \
--ctx-size 8192 \
--alias my_model \
--n-gpu-layers 999 \
--host 0.0.0.0 \
--port 8080'
Once the server is running, query it with:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my_model",
"messages": [
{"role": "user", "content": "Hello!"}
],
"max_tokens": 256,
"chat_template_kwargs": {"enable_thinking": true}
}'
Note: huggingface-downloader fetches the GGUF model and multimodal projector from ggml-org/Nemotron-3-Nano-Omni-GGUF. -mm "$MMPROJ" loads the multimodal projector for vision support. --n-gpu-layers 999 offloads all layers to GPU. --alias my_model sets the model name used in API requests. chat_template_kwargs: {"enable_thinking": true} activates chain-of-thought reasoning.
Running with Ollama
Ollama runs the Q4_K_M GGUF directly on the GPU and works on both Jetson Thor and Jetson AGX Orin 64GB.
ollama run nemotron3:33b-q4_K_M
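The curl examples in the Web API section above assume the Ollama API on port 11434 is reachable from another host. A minimal sketch of exposing it, assuming a default install where you start the server manually (OLLAMA_HOST is Ollama's bind-address setting):
# Bind the Ollama API to all interfaces so ${JETSON_HOST}:11434 is reachable over the LAN
OLLAMA_HOST=0.0.0.0 ollama serve
# In a second terminal on the Jetson, pull and run the model
ollama run nemotron3:33b-q4_K_M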
Running with TensorRT Edge-LLM
TensorRT Edge-LLM support for this model is currently Jetson Thor only. It requires exporting the model to ONNX and building TensorRT engines before running inference. See the TensorRT Edge-LLM GitHub repository and the export/build quick start for the full setup flow.
Steps for TensorRT-Edge-LLM on Jetson Thor
Run these steps on Jetson Thor. Set paths for the TensorRT Edge-LLM checkout, model checkpoint, ONNX export directory, and engine output directory:
export TRT_EDGE_LLM_REPO=$HOME/tensorrt-edge-llm
export CHECKPOINT_DIR=/path/to/nemotron-nano-3-omni-checkpoint
export WORKSPACE=$HOME/tensorrt-edgellm-workspace/nemotron-nano-3-omni
export ONNX=$WORKSPACE/onnx
export ENGINE=$WORKSPACE/engines
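The ONNX and engine directories can be created up front; CHECKPOINT_DIR should point at the locally downloaded model checkpoint (the download source itself is not covered here):
# Create the output directories used by the export and build steps below
mkdir -p "$ONNX" "$ENGINE"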
Build TensorRT Edge-LLM
cd $HOME
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git tensorrt-edge-llm
cd tensorrt-edge-llm
git submodule update --init 3rdParty/nlohmannJson 3rdParty/NVTX
mkdir -p build
cd build
export PATH=/usr/local/cuda/bin:$PATH
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DENABLE_CUTE_DSL=ALL \
-DTRT_PACKAGE_DIR=/usr \
-DCUDA_CTK_VERSION=13.0
make -j$(nproc)
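As a quick sanity check that the build produced what the later steps expect, list the plugin library and example binaries from the build directory (these are the paths referenced in the engine-build and inference steps below):
# Run from the build directory created above
ls -l libNvInfer_edgellm_plugin.so \
examples/llm/llm_build examples/llm/llm_inference \
examples/multimodal/visual_build examples/multimodal/audio_build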
Export ONNX
export PYTHONPATH=$TRT_EDGE_LLM_REPO/experimental:$PYTHONPATH
python3 -m venv $HOME/trt-edgellm-venv
source $HOME/trt-edgellm-venv/bin/activate
cd $TRT_EDGE_LLM_REPO/experimental/llm_loader
pip3 install -r requirements.txt
python3 -m llm_loader.export_all_cli \
$CHECKPOINT_DIR \
$ONNX
This creates $ONNX/llm, $ONNX/visual, and $ONNX/audio.
Build Engines
export BUILD=$HOME/tensorrt-edge-llm/build
export EDGELLM_PLUGIN_PATH=$BUILD/libNvInfer_edgellm_plugin.so
export LD_PRELOAD=$EDGELLM_PLUGIN_PATH
$BUILD/examples/llm/llm_build \
--onnxDir $ONNX/llm \
--engineDir $ENGINE/llm
$BUILD/examples/multimodal/visual_build \
--onnxDir $ONNX/visual \
--engineDir $ENGINE/visual
$BUILD/examples/multimodal/audio_build \
--onnxDir $ONNX/audio \
--engineDir $ENGINE/audio
# Some builder versions emit nested modality directories. Flatten them if needed.
[ -d "$ENGINE/visual/visual" ] && mv $ENGINE/visual/visual/* $ENGINE/visual/ && rmdir $ENGINE/visual/visual
[ -d "$ENGINE/audio/audio" ] && mv $ENGINE/audio/audio/* $ENGINE/audio/ && rmdir $ENGINE/audio/audio
Run Inference
Use the same inference command for text, vision, and audio by changing the input and output JSON files:
$BUILD/examples/llm/llm_inference \
--engineDir $ENGINE/llm \
--multimodalEngineDir $ENGINE \
--inputFile <input.json> \
--outputFile <output.json> \
--dumpOutput
Text input:
{
"requests": [
{
"messages": [
{"role": "user", "content": "What is 2+2?"}
],
"max_generate_length": 50,
"temperature": 0.0
}
]
}
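For example, saving the request above as text_input.json (a file name chosen here for illustration) and running the inference binary writes the response to text_output.json:
$BUILD/examples/llm/llm_inference \
--engineDir $ENGINE/llm \
--multimodalEngineDir $ENGINE \
--inputFile text_input.json \
--outputFile text_output.json \
--dumpOutput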
Vision input:
{
"requests": [
{
"messages": [
{
"role": "user",
"content": [
{"type": "image", "image": "/path/to/image.jpg"},
{"type": "text", "text": "What do you see in this image?"}
]
}
],
"max_generate_length": 100,
"temperature": 0.0
}
]
}
Audio input uses a pre-computed mel-spectrogram .safetensors file shaped [1, time_steps, 128] with float16 values:
{
"requests": [
{
"messages": [
{
"role": "user",
"content": [
{"type": "audio", "audio": "/path/to/audio_mel.safetensors"},
{"type": "text", "text": "What did you hear in this audio?"}
]
}
],
"max_generate_length": 100,
"temperature": 0.0
}
]
}
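Producing the mel-spectrogram is outside the scope of this page, but before running inference you can confirm a prepared file matches the expected [1, time_steps, 128] float16 layout. A small sketch that only checks shape and dtype (requires the safetensors and numpy Python packages; the tensor key name inside the file is not assumed):
python3 -c "from safetensors.numpy import load_file; import sys; \
tensors = load_file(sys.argv[1]); \
[print(name, t.shape, t.dtype) for name, t in tensors.items()]" \
/path/to/audio_mel.safetensors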