New Multimodal

Nemotron 3 Nano Omni

NVIDIA's multimodal reasoning model with language, vision, audio, and video understanding: 30B total / 3B active MoE, available in NVFP4, FP8, and BF16.

Parameters 30B total / 3B active
Modalities
Text Image Audio Video
Context Length 256K
License NVIDIA Open Model License
Precision
NVFP4 FP8 BF16 Q4_K_M GGUF

Serve the model

Start the model with one of the serve commands in the "Running with vLLM", "Running with SGLang", "Running with llama.cpp", or "Running with Ollama" sections below, then call it over the Web API.
Call the model over Web API

Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.

vLLM (default port 8000):

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

SGLang (default port 30000):

curl -s http://${JETSON_HOST}:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
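Because the model accepts images, a multimodal request can go through the same OpenAI-compatible endpoint by embedding the image as a base64 data URL. This is a sketch assuming the server accepts standard OpenAI `image_url` content parts; the placeholder JPEG and the commented curl target are stand-ins for your own image and server.

```shell
# Build an OpenAI-style multimodal request body with a base64-encoded image.
# A tiny placeholder JPEG stands in for a real image; swap in your own file.
printf '\377\330\377\331' > /tmp/demo.jpg
B64=$(base64 -w0 /tmp/demo.jpg)

cat > /tmp/req.json <<EOF
{
  "model": "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${B64}"}},
      {"type": "text", "text": "Describe this image."}
    ]
  }]
}
EOF

# Validate the payload locally, then POST it to the running server:
python3 -m json.tool /tmp/req.json > /dev/null && echo "payload OK"
# curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
#   -H "Content-Type: application/json" -d @/tmp/req.json
```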

llama.cpp server (OpenAI-compatible API)

After llama-server is running (the container is started with --network host, so it is reachable on the LAN), call it from another machine (set ${JETSON_HOST} or use the field). llama-server listens on port 8080 by default unless you change --port.

curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

With ollama serve on the Jetson, call its OpenAI-compatible endpoint from another host (set ${JETSON_HOST} or use the field). Match the model name to what you pulled on device.

curl -s http://${JETSON_HOST}:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron3:33b-q4_K_M",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'

Alternatively, call Ollama's native /api/generate endpoint from another host (set ${JETSON_HOST} or use the field). Match the model name to what you pulled on device.

curl -s http://${JETSON_HOST}:11434/api/generate -d '{
  "model": "nemotron3:33b-q4_K_M",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
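With "stream": false, the native endpoint returns a single JSON object whose response field holds the generated text. A minimal extraction sketch (the sample JSON stands in for a live reply; assumes python3 on the client):

```shell
# Extract the generated text from a non-streaming /api/generate reply.
RESP='{"model":"nemotron3:33b-q4_K_M","response":"Rayleigh scattering.","done":true}'
echo "$RESP" | python3 -c 'import json,sys; print(json.load(sys.stdin)["response"])'
# → Rayleigh scattering.
```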

Benchmark

Nemotron 3 Nano Omni · vLLM · FP8 / NVFP4 · ISL 2048 / OSL 128

[Interactive benchmark chart: throughput by engine and concurrency (C = concurrent requests). Results will vary with image, clocks, and workload.]

Nemotron 3 Nano Omni is NVIDIA's multimodal reasoning model combining language, vision, audio, and video understanding. It uses a Mixture-of-Experts architecture with 30B total parameters and 3B active per forward pass, delivering strong multimodal reasoning with efficient inference on Jetson platforms.
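The efficiency claim follows directly from the MoE routing: each token activates roughly a tenth of the weights, so per-token compute scales with the 3B active parameters rather than the full 30B (though all 30B must still fit in memory). The arithmetic:

```shell
# Per-token active fraction for the 30B-total / 3B-active MoE.
python3 -c 'print(f"{3e9/30e9:.0%} of parameters active per token")'
# → 10% of parameters active per token
```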

Inputs and Outputs

Input: Text, image, audio, and video

Output: Text

Intended Use Cases

  • Multimodal Assistants: Answering questions about images, audio clips, and video segments
  • Voice and Vision Interfaces: Edge AI applications combining speech and visual understanding
  • Agentic Workflows: Function calling with chain-of-thought reasoning for autonomous task execution
  • Document Understanding: OCR, chart analysis, and visual document Q&A
  • Audio Transcription and Analysis: Processing short audio clips with context awareness

Supported Platforms

  • Jetson Thor

Running with vLLM

sudo docker run -it --rm --pull always \
  --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.20.0-ubuntu2404 \
  bash -c "pip install -q 'vllm[audio]' && vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
    --trust-remote-code \
    --gpu-memory-utilization 0.65 \
    --max-model-len 32768 \
    --reasoning-parser nemotron_v3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder"
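Model load can take several minutes on first start. Rather than retrying curl by hand, client scripts can poll until the endpoint answers. A sketch (the function name and retry defaults are my own; /v1/models is the standard OpenAI-compatible model-listing route):

```shell
# Return 0 once the OpenAI-compatible endpoint responds, 1 after <tries> failures.
wait_for_server() {  # usage: wait_for_server <base_url> [tries] [delay_seconds]
  url=$1; tries=${2:-60}; delay=${3:-5}; i=0
  while [ "$i" -lt "$tries" ]; do
    curl -sf "${url}/v1/models" > /dev/null && return 0
    i=$((i + 1)); sleep "$delay"
  done
  return 1
}

# wait_for_server "http://${JETSON_HOST}:8000" && echo "server is up"
```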

Running with SGLang

sudo docker run -it --rm --pull always \
  --runtime=nvidia --network host \
  --shm-size 16g \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint bash \
  lmsysorg/sglang:latest \
  -c "pip install -q librosa && sglang serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
    --trust-remote-code \
    --mem-fraction-static 0.80 \
    --max-running-requests 8 \
    --reasoning-parser nemotron_3 \
    --tool-call-parser qwen3_coder \
    --disable-cuda-graph"

Note: SGLang uses --reasoning-parser nemotron_3 (distinct from vLLM’s nemotron_v3). --disable-cuda-graph is required for this model. librosa is installed at runtime for audio encoder support.

Once the server is running, query it with:

curl -s http://${JETSON_HOST}:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Running with llama.cpp

sudo docker run -it --rm --pull always \
  --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/data/models/huggingface \
  -v $HOME/.cache/llama.cpp:/data/models/llama.cpp \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-thor \
  bash -lc 'MODEL=$(huggingface-downloader ggml-org/Nemotron-3-Nano-Omni-GGUF/nemotron-3-nano-omni-ga_v1.0-Q4_K_M.gguf) && \
    MMPROJ=$(huggingface-downloader ggml-org/Nemotron-3-Nano-Omni-GGUF/mmproj-nemotron-3-nano-omni-ga_v1.0.gguf) && \
    llama-server -m "$MODEL" -mm "$MMPROJ" \
      --ctx-size 8192 \
      --alias my_model \
      --n-gpu-layers 999 \
      --host 0.0.0.0 \
      --port 8080'

Once the server is running, query it with:

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 256,
    "chat_template_kwargs": {"enable_thinking": true}
  }'

Note: huggingface-downloader fetches the GGUF model and multimodal projector from ggml-org/Nemotron-3-Nano-Omni-GGUF. -mm "$MMPROJ" loads the multimodal projector for vision support. --n-gpu-layers 999 offloads all layers to GPU. --alias my_model sets the model name used in API requests. chat_template_kwargs: {"enable_thinking": true} activates chain-of-thought reasoning.
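Depending on the chat template and server flags, the reasoning trace may arrive inline in the message content wrapped in <think> tags rather than in a separate field; that wrapping is an assumption here. A sketch that strips it for display:

```shell
# Drop an inline <think>…</think> span from a reply (tag format is an assumption).
OUT='<think>User sent a greeting; reply briefly.</think>Hello! How can I help?'
printf '%s\n' "$OUT" | sed -E 's|<think>.*</think>||'
# → Hello! How can I help?
```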

Running with Ollama

Ollama runs the Q4_K_M GGUF directly on the GPU and works on both Jetson Thor and Jetson AGX Orin 64GB.

ollama run nemotron3:33b-q4_K_M

Running with TensorRT Edge-LLM

TensorRT Edge-LLM support for this model is currently Jetson Thor only. It requires exporting the model to ONNX and building TensorRT engines before running inference. See the TensorRT Edge-LLM GitHub repository and the export/build quick start for the full setup flow.

Steps for TensorRT-Edge-LLM on Jetson Thor

Run these steps on Jetson Thor. Set paths for the TensorRT Edge-LLM checkout, model checkpoint, ONNX export directory, and engine output directory:

export TRT_EDGE_LLM_REPO=$HOME/tensorrt-edge-llm
export CHECKPOINT_DIR=/path/to/nemotron-nano-3-omni-checkpoint
export WORKSPACE=$HOME/tensorrt-edgellm-workspace/nemotron-nano-3-omni
export ONNX=$WORKSPACE/onnx
export ENGINE=$WORKSPACE/engines
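Since every later step depends on these variables, a quick guard catches an unset path early. A sketch (the helper function is my own; the variable list mirrors the exports above):

```shell
# Fail fast if any required path variable is unset or empty.
check_paths() {
  for v in TRT_EDGE_LLM_REPO CHECKPOINT_DIR WORKSPACE ONNX ENGINE; do
    eval "val=\${$v}"
    [ -n "$val" ] || { echo "unset: $v" >&2; return 1; }
  done
  echo "paths configured"
}
# check_paths   # prints "paths configured" when everything is set
```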

Build TensorRT Edge-LLM

cd $HOME
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git tensorrt-edge-llm
cd tensorrt-edge-llm
git submodule update --init 3rdParty/nlohmannJson 3rdParty/NVTX

mkdir -p build
cd build
export PATH=/usr/local/cuda/bin:$PATH

cmake .. -DCMAKE_BUILD_TYPE=Release \
  -DENABLE_CUTE_DSL=ALL \
  -DTRT_PACKAGE_DIR=/usr \
  -DCUDA_CTK_VERSION=13.0

make -j$(nproc)

Export ONNX

export PYTHONPATH=$TRT_EDGE_LLM_REPO/experimental:$PYTHONPATH
python3 -m venv $HOME/trt-edgellm-venv
source $HOME/trt-edgellm-venv/bin/activate

cd $TRT_EDGE_LLM_REPO/experimental/llm_loader
pip3 install -r requirements.txt

python3 -m llm_loader.export_all_cli \
  $CHECKPOINT_DIR \
  $ONNX

This creates $ONNX/llm, $ONNX/visual, and $ONNX/audio.
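Before building engines, it is worth confirming the export completed; a missing directory usually means the export step failed partway. A small check (the helper function is my own; the directory names match the layout above):

```shell
# Confirm the ONNX export produced all three modality directories.
check_onnx_layout() {
  for d in "$1/llm" "$1/visual" "$1/audio"; do
    [ -d "$d" ] || { echo "missing: $d" >&2; return 1; }
  done
  echo "ONNX export layout OK"
}
# check_onnx_layout "$ONNX"
```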

Build Engines

export BUILD=$HOME/tensorrt-edge-llm/build
export EDGELLM_PLUGIN_PATH=$BUILD/libNvInfer_edgellm_plugin.so
export LD_PRELOAD=$EDGELLM_PLUGIN_PATH

$BUILD/examples/llm/llm_build \
  --onnxDir $ONNX/llm \
  --engineDir $ENGINE/llm

$BUILD/examples/multimodal/visual_build \
  --onnxDir $ONNX/visual \
  --engineDir $ENGINE/visual

$BUILD/examples/multimodal/audio_build \
  --onnxDir $ONNX/audio \
  --engineDir $ENGINE/audio

# Some builder versions emit nested modality directories. Flatten them if needed.
[ -d "$ENGINE/visual/visual" ] && mv $ENGINE/visual/visual/* $ENGINE/visual/ && rmdir $ENGINE/visual/visual
[ -d "$ENGINE/audio/audio" ] && mv $ENGINE/audio/audio/* $ENGINE/audio/ && rmdir $ENGINE/audio/audio

Run Inference

Use the same inference command for text, vision, and audio by changing the input and output JSON files:

$BUILD/examples/llm/llm_inference \
  --engineDir $ENGINE/llm \
  --multimodalEngineDir $ENGINE \
  --inputFile <input.json> \
  --outputFile <output.json> \
  --dumpOutput

Text input:

{
  "requests": [
    {
      "messages": [
        {"role": "user", "content": "What is 2+2?"}
      ],
      "max_generate_length": 50,
      "temperature": 0.0
    }
  ]
}

Vision input:

{
  "requests": [
    {
      "messages": [
        {
          "role": "user",
          "content": [
            {"type": "image", "image": "/path/to/image.jpg"},
            {"type": "text", "text": "What do you see in this image?"}
          ]
        }
      ],
      "max_generate_length": 100,
      "temperature": 0.0
    }
  ]
}

Audio input uses a pre-computed mel-spectrogram .safetensors file shaped [1, time_steps, 128] with float16 values:

{
  "requests": [
    {
      "messages": [
        {
          "role": "user",
          "content": [
            {"type": "audio", "audio": "/path/to/audio_mel.safetensors"},
            {"type": "text", "text": "What did you hear in this audio?"}
          ]
        }
      ],
      "max_generate_length": 100,
      "temperature": 0.0
    }
  ]
}
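The mel-spectrogram shape contract above can be checked before running inference, which catches a transposed or wrongly typed tensor early. A sketch (a zero tensor stands in for the loaded data; the safetensors key for a real file is not specified here and would need to match your export):

```shell
# Verify a mel tensor matches the expected [1, time_steps, 128] float16 layout.
python3 - <<'EOF'
import numpy as np

mel = np.zeros((1, 400, 128), dtype=np.float16)  # stand-in for the loaded tensor
assert mel.ndim == 3 and mel.shape[0] == 1 and mel.shape[2] == 128, mel.shape
assert mel.dtype == np.float16, mel.dtype
print("mel input OK:", mel.shape)
EOF
```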