# Cosmos Reason 2 8B

NVIDIA's 8B-parameter vision-language model with advanced chain-of-thought reasoning capabilities

## Model Details
NVIDIA Cosmos Reason 2 8B is the larger variant in the Cosmos Reason 2 family. With 8 billion parameters, it delivers stronger chain-of-thought reasoning than the 2B variant, making it suitable for more demanding vision-language tasks on Jetson.
## Key Capabilities
- **Enhanced Reasoning**: Stronger chain-of-thought reasoning compared to the 2B variant
- **Spatial Reasoning**: Advanced understanding of spatial relationships between objects
- **Anomaly Detection**: Identifies unusual patterns and anomalies in visual data
- **Scene Analysis**: Comprehensive and detailed analysis of complex visual scenes
- **Video Understanding**: Supports video frame analysis for temporal reasoning
## Running with vLLM
The vLLM path uses an FP8-quantized checkpoint from NGC, downloaded with the NGC CLI.
### Step 1: Install and Configure the NGC CLI
```bash
wget -O ngccli_arm64.zip https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.13.0/files/ngccli_arm64.zip
unzip ngccli_arm64.zip && chmod u+x ngc-cli/ngc
export PATH="$PATH:$(pwd)/ngc-cli"
ngc config set
```
You will need an NGC account with access to the `nim` org and a valid NGC API key.
### Step 2: Download the FP8 Model
```bash
ngc registry model download-version "nim/nvidia/cosmos-reason2-8b:1208-fp8-static-kv8" \
  --dest ~/.cache/huggingface/hub
export MODEL_PATH="${HOME}/.cache/huggingface/hub/cosmos-reason2-8b_v1208-fp8-static-kv8"
```
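Before serving, it can help to confirm the download actually completed. This is an optional sanity check; the `config.json` file name is an assumption based on the usual Hugging Face checkpoint layout, not something the NGC docs guarantee:

```bash
# Optional sanity check: verify the checkpoint directory exists and
# looks populated before pointing vLLM at it.
# Assumption: the checkpoint follows the usual Hugging Face layout
# and has a config.json at its root.
check_model_dir() {
  local dir="$1"
  [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
  [ -f "$dir/config.json" ] || { echo "no config.json in $dir (incomplete download?)" >&2; return 1; }
  echo "OK: $(du -sh "$dir" | cut -f1) in $dir"
}
```

Run `check_model_dir "$MODEL_PATH"` after the download finishes; a non-zero exit status means the directory is missing or incomplete.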
### Step 3: Serve
The second volume mount, `-v ${HOME}/.cache/vllm:/root/.cache/vllm`, persists vLLM's `torch.compile` cache on the host: the first run compiles kernels and writes them there, and later runs reuse the cache and start faster. The `mkdir -p` below creates that directory, and `vm.drop_caches=3` drops the kernel's page and inode caches to free host memory before loading the model.
```bash
mkdir -p ~/.cache/vllm
sudo sysctl -w vm.drop_caches=3
sudo docker run -it --rm --runtime=nvidia --network host \
  -v $MODEL_PATH:/models/cosmos-reason2-8b:ro \
  -v ${HOME}/.cache/vllm:/root/.cache/vllm \
  ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \
  vllm serve /models/cosmos-reason2-8b \
    --served-model-name nvidia/cosmos-reason2-8b-fp8 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.7 \
    --reasoning-parser qwen3 \
    --media-io-kwargs '{"video": {"num_frames": -1}}' \
    --enable-prefix-caching \
    --port 8010
```
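Once up, the server speaks vLLM's OpenAI-compatible API on port 8010. A minimal request sketch — the image URL is a placeholder, and the request is wrapped in a function so it only fires when you call it:

```bash
# Minimal request against vLLM's OpenAI-compatible endpoint.
# The image URL below is a placeholder, not a real asset.
PAYLOAD='{
  "model": "nvidia/cosmos-reason2-8b-fp8",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},
      {"type": "text", "text": "Is anything unusual happening in this scene? Explain step by step."}
    ]
  }],
  "max_tokens": 512
}'

ask_cosmos() {
  curl -s http://localhost:8010/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD"
}
```

Call `ask_cosmos` once `vllm serve` reports it is ready. Because `--reasoning-parser qwen3` is set, the chain-of-thought is returned in the message's `reasoning_content` field, separate from the final answer.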
## Running with llama.cpp (Recommended for Orin Nano)
```bash
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-thor \
  llama-server -hf Kbenkhaled/Cosmos-Reason2-8B-GGUF:Q4_K_M -c 8192
```
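`llama-server` also exposes an OpenAI-compatible API, by default on port 8080 (pass `--port` to change it). A text-only smoke test, with the payload kept in a shell variable so it can be inspected before sending:

```bash
# Text-only smoke test for llama-server's OpenAI-compatible API.
# Port 8080 is llama-server's default; adjust if you passed --port.
SMOKE='{"messages":[{"role":"user","content":"In one sentence, what is spatial reasoning?"}],"max_tokens":128}'

smoke_test() {
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$SMOKE"
}
```

Run `smoke_test` after the model finishes loading; a JSON response with a `choices` array confirms the server is healthy.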
## Cosmos Reason 2 Family
| Model | Parameters | Memory | Best For |
|---|---|---|---|
| Cosmos Reason 2 2B | 2B | 8GB RAM | Lightweight edge deployment |
| Cosmos Reason 2 8B | 8B | 18GB RAM | Higher accuracy, demanding tasks |
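The memory figures above cover more than the weights alone. As a rough rule of thumb, weight memory is parameters × bytes per parameter; the bytes-per-parameter values below are approximations (Q4_K_M averages roughly 4.8 bits per weight), and KV cache plus runtime overhead come on top:

```bash
# Back-of-envelope weight memory: billions of params x bytes/param.
# Bytes-per-param values are approximate; KV cache and runtime
# overhead are extra.
est_weights_gb() { # est_weights_gb <params_in_billions> <bytes_per_param>
  awk -v p="$1" -v b="$2" 'BEGIN{printf "%.1f", p*b}'
}

echo "fp16:   ~$(est_weights_gb 8 2.0) GB"   # ~16.0 GB
echo "fp8:    ~$(est_weights_gb 8 1.0) GB"   # ~8.0 GB
echo "Q4_K_M: ~$(est_weights_gb 8 0.6) GB"   # ~4.8 GB
```

This is why the FP8 vLLM path needs a larger Jetson, while the Q4_K_M GGUF fits within an Orin Nano's 8GB.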
## Additional Resources
- NGC FP8 Checkpoint - FP8 quantized model for vLLM
- Live VLM WebUI - real-time webcam-to-VLM interface