New Multimodal

Cosmos3 Nano

NVIDIA's compact vision-language reasoning model (16B) with chain-of-thought over text, image, and video — NVFP4 for Blackwell/Thor.

Parameters 16B
Modalities
Text Image Video
Context Length 256K
License NVIDIA Open Model License
Precision
NVFP4

Benchmark

Cosmos3 Nano  · vLLM  · NVFP4* · ISL 2048 / OSL 128

Engine
Concurrency

C = concurrent requests. Results will vary with image, clocks, and workload.

Model Details

Cosmos3 Nano is a compact (16B) vision-language reasoning model from the NVIDIA Cosmos family. It performs chain-of-thought reasoning over text, images, and video, producing text output. This page covers the NVFP4 checkpoint, which runs natively on Jetson Thor (Blackwell, sm_110) for efficient 4-bit inference.

Key Capabilities

  • Multimodal Reasoning: Chain-of-thought over combined image/video + text input
  • Spatial & Scene Understanding: Reasoning about objects and relationships in a scene
  • Video Understanding: Temporal reasoning across video frames
  • NVFP4 on Blackwell: 4-bit (E2M1 with FP8 block scales) weights for high throughput on Thor

Running with vLLM (NVFP4)

The NVFP4 checkpoint is published on NGC and downloaded via the NGC CLI.

Step 1: Install and Configure the NGC CLI

wget -O ngccli_arm64.zip https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.20.1/files/ngccli_arm64.zip
unzip ngccli_arm64.zip && chmod u+x ngc-cli/ngc
export PATH="$PATH:$(pwd)/ngc-cli"
ngc config set

You will need an NGC account with access to the model and a valid API key.

Step 2: Download the NVFP4 Model

mkdir -p ~/cosmos3-ngc
ngc registry model download-version \
  "nim/nvidia/cosmos3-nano-reasoner:modelopt-nvfp4-full-quantize-final_format_fix" \
  --dest ~/cosmos3-ngc
export MODEL_PATH=$(find ~/cosmos3-ngc -maxdepth 2 -name config.json -exec dirname {} \; | head -1)

Step 3: Serve on Jetson Thor

sudo docker run -it --rm --runtime=nvidia --network host \
  -v $MODEL_PATH:/model:ro \
  --entrypoint "" \
  vllm/vllm-openai:v0.23.0-aarch64-ubuntu2404 \
  vllm serve /model \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.8 \
    --trust-remote-code \
    --limit-mm-per-prompt '{"image": 1, "video": 0}'

Send an OpenAI-style chat request with an image_url (data URI or http URL) plus a text prompt to exercise the multimodal path.

Additional Resources