Cosmos Reason 2 2B
NVIDIA's compact 2B parameter vision-language model with built-in chain-of-thought reasoning for edge deployment
Model Details
NVIDIA Cosmos Reason 2 2B is a compact vision-language model with built-in chain-of-thought reasoning capabilities. Despite its small 2B parameter size, it can perform spatial reasoning, anomaly detection, and detailed scene analysis, making it well-suited for edge deployment on Jetson.
Key Capabilities
- Spatial Reasoning: Understands spatial relationships between objects in scenes
- Anomaly Detection: Identifies unusual patterns or objects in visual data
- Scene Analysis: Provides detailed descriptions and analysis of visual content
- Chain-of-thought Reasoning: Generates reasoning traces before concluding with a final response
Inputs and Outputs
Input:
- Text prompts and images
- Supports video frame analysis via --media-io-kwargs
Output:
- Generated text with chain-of-thought reasoning traces
- Spatial analysis, anomaly detection results, and scene descriptions
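The reasoning trace can be separated from the final answer before it reaches downstream logic. The sketch below assumes the Qwen3-style convention of wrapping the chain of thought in <think>...</think> tags (the vLLM command in this guide uses --reasoning-parser qwen3); verify the delimiters against your runtime's actual output.

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning_trace, final_answer),
    assuming the chain of thought is wrapped in <think>...</think> tags."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()  # model emitted no trace
    return m.group(1).strip(), text[m.end():].strip()

# Illustrative output string, not a real model response
raw = "<think>The cone sits between the robot and the door.</think>Yes, the path is blocked."
trace, answer = split_reasoning(raw)
print(trace)   # -> The cone sits between the robot and the door.
print(answer)  # -> Yes, the path is blocked.
```

If no tags are present, the whole string is treated as the answer, so the same helper also works for models or parsers that strip the trace server-side.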
Running with vLLM
The vLLM path uses an FP8-quantized checkpoint from NGC, downloaded with the NGC CLI.
Step 1: Install and Configure the NGC CLI
wget -O ngccli_arm64.zip https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.13.0/files/ngccli_arm64.zip
unzip ngccli_arm64.zip && chmod u+x ngc-cli/ngc
export PATH="$PATH:$(pwd)/ngc-cli"
ngc config set
You will need an NGC account with access to the nim org and a valid API key.
Step 2: Download the FP8 Model
ngc registry model download-version "nim/nvidia/cosmos-reason2-2b:1208-fp8-static-kv8" \
--dest ~/.cache/huggingface/hub
MODEL_PATH="${HOME}/.cache/huggingface/hub/cosmos-reason2-2b_v1208-fp8-static-kv8"
Step 3: Serve
The second volume mount, -v ${HOME}/.cache/vllm:/root/.cache/vllm, persists vLLM's torch.compile cache on the host: the first run compiles kernels and writes them there, and later runs reuse the cache and start faster.
mkdir -p ~/.cache/vllm
sudo sysctl -w vm.drop_caches=3  # free the page cache so more unified memory is available to the GPU
sudo docker run -it --rm --runtime=nvidia --network host \
-v $MODEL_PATH:/models/cosmos-reason2-2b:ro \
-v ${HOME}/.cache/vllm:/root/.cache/vllm \
ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \
vllm serve /models/cosmos-reason2-2b \
--served-model-name nvidia/cosmos-reason2-2b-fp8 \
--max-model-len 8192 \
--gpu-memory-utilization 0.8 \
--reasoning-parser qwen3 \
--media-io-kwargs '{"video": {"num_frames": -1}}' \
--enable-prefix-caching \
--port 8010
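Once the server is up, it exposes vLLM's OpenAI-compatible chat API on port 8010. Below is a minimal Python sketch that builds a multimodal request with the image embedded as a base64 data URL; the fake JPEG bytes and the prompt are placeholders, and the model name matches the --served-model-name flag above.

```python
import base64
import json
import urllib.request

ENDPOINT = "http://localhost:8010/v1/chat/completions"  # matches --port 8010

def build_request(image_bytes: bytes, prompt: str) -> dict:
    """Build an OpenAI-style chat request with one image attached
    as a base64 data URL, the format vLLM's multimodal chat API accepts."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": "nvidia/cosmos-reason2-2b-fp8",  # from --served-model-name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "max_tokens": 512,
    }

def query(payload: dict) -> dict:
    """POST the payload to the running server; call this only once
    the container above is serving."""
    req = urllib.request.Request(
        ENDPOINT, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Placeholder bytes stand in for a real JPEG read from disk
payload = build_request(b"\xff\xd8", "What objects are in this scene?")
print(payload["messages"][0]["content"][0]["text"])  # -> What objects are in this scene?
```

In practice, read the image with `open(path, "rb").read()` and pass the result to build_request; the response from query follows the standard chat-completions schema.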
Running with llama.cpp (Recommended for Orin Nano)
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-thor \
llama-server -hf Kbenkhaled/Cosmos-Reason2-2B-GGUF:Q8_0 -c 8192
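Both servers speak the OpenAI chat-completions schema, but the reasoning trace can land in different places: vLLM's --reasoning-parser moves it into a reasoning_content field on the message, while llama-server may leave it inline in content wrapped in <think> tags (an assumption about this GGUF's chat template; check your actual responses). A small helper can normalize both cases:

```python
def extract_answer(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from an OpenAI-style chat response,
    handling both a structured reasoning_content field and an inline
    <think>...</think> trace in the content string."""
    msg = response["choices"][0]["message"]
    content = msg.get("content") or ""
    reasoning = msg.get("reasoning_content")
    if reasoning is None and "</think>" in content:
        reasoning, _, content = content.partition("</think>")
        reasoning = reasoning.replace("<think>", "").strip()
        content = content.strip()
    return reasoning or "", content

# Hand-written sample responses, not real server output
vllm_resp = {"choices": [{"message": {
    "content": "It is on the left.", "reasoning_content": "Compare x-coords."}}]}
llama_resp = {"choices": [{"message": {
    "content": "<think>Check the edges.</think>No anomaly."}}]}
print(extract_answer(vllm_resp))   # -> ('Compare x-coords.', 'It is on the left.')
print(extract_answer(llama_resp))  # -> ('Check the edges.', 'No anomaly.')
```

With this in place, client code can switch between the vLLM and llama.cpp backends without changing how it consumes the result.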
Cosmos Reason 2 Family
| Model | Parameters | Memory | Best For |
|---|---|---|---|
| Cosmos Reason 2 2B | 2B | 8GB RAM | Lightweight edge deployment |
| Cosmos Reason 2 8B | 8B | 18GB RAM | Higher accuracy, demanding tasks |
Additional Resources
- NGC FP8 Checkpoint - FP8 quantized model for vLLM
- Live VLM WebUI - real-time webcam-to-VLM interface