# Cosmos Reason 2 8B

NVIDIA's 8B-parameter vision-language model with advanced chain-of-thought reasoning capabilities

## Model Details
NVIDIA Cosmos Reason 2 8B is the larger variant in the Cosmos Reason 2 family. With 8 billion parameters, it delivers stronger chain-of-thought reasoning than the 2B variant, making it suitable for more demanding vision-language tasks on Jetson.
## Key Capabilities
- **Enhanced Reasoning**: Stronger chain-of-thought reasoning compared to the 2B variant
- **Spatial Reasoning**: Advanced understanding of spatial relationships between objects
- **Anomaly Detection**: Identifies unusual patterns and anomalies in visual data
- **Scene Analysis**: Comprehensive and detailed analysis of complex visual scenes
- **Video Understanding**: Supports video frame analysis for temporal reasoning
## Running with vLLM
The vLLM path uses an FP8-quantized checkpoint from NGC, downloaded with the NGC CLI.
### Step 1: Install and Configure the NGC CLI
```bash
wget -O ngccli_arm64.zip https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.13.0/files/ngccli_arm64.zip
unzip ngccli_arm64.zip && chmod u+x ngc-cli/ngc
export PATH="$PATH:$(pwd)/ngc-cli"
ngc config set
```
You will need an NGC account with access to the `nim` org and a valid NGC API key.
### Step 2: Download the FP8 Model
```bash
ngc registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" \
  --dest ~/.cache/huggingface/hub
export MODEL_PATH="${HOME}/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8"
```
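Before serving, it can help to confirm the download actually completed. This is an optional sanity check; the `config.json` file name is an assumption based on the usual Hugging Face checkpoint layout, not something the NGC docs guarantee:

```bash
# Optional sanity check: verify the checkpoint directory exists and
# looks populated before pointing vLLM at it.
# Assumption: the checkpoint follows the usual Hugging Face layout
# and has a config.json at its root.
check_model_dir() {
  local dir="$1"
  [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
  [ -f "$dir/config.json" ] || { echo "no config.json in $dir (incomplete download?)" >&2; return 1; }
  echo "OK: $(du -sh "$dir" | cut -f1) in $dir"
}
```

Run `check_model_dir "$MODEL_PATH"` after the download finishes; a non-zero exit status means the directory is missing or incomplete.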
### Step 3: Serve
The second volume mount, `-v ${HOME}/.cache/vllm:/root/.cache/vllm`, persists vLLM's `torch.compile` cache on the host: the first run compiles kernels and writes them there, and later runs reuse the cache and start faster. The `mkdir -p` below creates that directory, and `vm.drop_caches=3` drops the kernel's page and inode caches to free host memory before loading the model.
```bash
mkdir -p ~/.cache/vllm
sudo sysctl -w vm.drop_caches=3
sudo docker run -it --rm --runtime=nvidia --network host \
  -v $MODEL_PATH:/models/cosmos-reason2-8b:ro \
  -v ${HOME}/.cache/vllm:/root/.cache/vllm \
  ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \
  vllm serve /models/cosmos-reason2-8b \
    --served-model-name nvidia/cosmos-reason2-8b-fp8 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.7 \
    --reasoning-parser qwen3 \
    --media-io-kwargs '{"video": {"num_frames": -1}}' \
    --enable-prefix-caching \
    --port 8010
```
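Once up, the server speaks vLLM's OpenAI-compatible API on port 8010. A minimal request sketch — the image URL is a placeholder, and the request is wrapped in a function so it only fires when you call it:

```bash
# Minimal request against vLLM's OpenAI-compatible endpoint.
# The image URL below is a placeholder, not a real asset.
PAYLOAD='{
  "model": "nvidia/cosmos-reason2-8b-fp8",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},
      {"type": "text", "text": "Is anything unusual happening in this scene? Explain step by step."}
    ]
  }],
  "max_tokens": 512
}'

ask_cosmos() {
  curl -s http://localhost:8010/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD"
}
```

Call `ask_cosmos` once `vllm serve` reports it is ready. Because `--reasoning-parser qwen3` is set, the chain-of-thought is returned in the message's `reasoning_content` field, separate from the final answer.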
## Running with llama.cpp (Recommended for Orin Nano)
```bash
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-thor \
  llama-server -hf Kbenkhaled/Cosmos-Reason2-8B-GGUF:Q4_K_M -c 8192
```
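`llama-server` also exposes an OpenAI-compatible API, by default on port 8080 (pass `--port` to change it). A text-only smoke test, with the payload kept in a shell variable so it can be inspected before sending:

```bash
# Text-only smoke test for llama-server's OpenAI-compatible API.
# Port 8080 is llama-server's default; adjust if you passed --port.
SMOKE='{"messages":[{"role":"user","content":"In one sentence, what is spatial reasoning?"}],"max_tokens":128}'

smoke_test() {
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$SMOKE"
}
```

Run `smoke_test` after the model finishes loading; a JSON response with a `choices` array confirms the server is healthy.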
## Cosmos Reason 2 Family
| Model | Parameters | Memory | Best For |
|---|---|---|---|
| Cosmos Reason 2 2B | 2B | 8GB RAM | Lightweight edge deployment |
| Cosmos Reason 2 8B | 8B | 18GB RAM | Higher accuracy, demanding tasks |
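The memory figures above cover more than the weights alone. As a rough rule of thumb, weight memory is parameters × bytes per parameter; the bytes-per-parameter values below are approximations (Q4_K_M averages roughly 4.8 bits per weight), and KV cache plus runtime overhead come on top:

```bash
# Back-of-envelope weight memory: billions of params x bytes/param.
# Bytes-per-param values are approximate; KV cache and runtime
# overhead are extra.
est_weights_gb() { # est_weights_gb <params_in_billions> <bytes_per_param>
  awk -v p="$1" -v b="$2" 'BEGIN{printf "%.1f", p*b}'
}

echo "fp16:   ~$(est_weights_gb 8 2.0) GB"   # ~16.0 GB
echo "fp8:    ~$(est_weights_gb 8 1.0) GB"   # ~8.0 GB
echo "Q4_K_M: ~$(est_weights_gb 8 0.6) GB"   # ~4.8 GB
```

This is why the FP8 vLLM path needs a larger Jetson, while the Q4_K_M GGUF fits within an Orin Nano's 8GB.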
## Additional Resources
- NGC FP8 Checkpoint - FP8 quantized model for vLLM
- Live VLM WebUI - real-time webcam-to-VLM interface