Multimodal

Cosmos Reason 2 2B

NVIDIA's compact 2B parameter vision-language model with built-in chain-of-thought reasoning for edge deployment

  • Memory Requirement: 8GB RAM
  • Precision: FP8
  • Size: 5GB

Model Details

NVIDIA Cosmos Reason 2 2B is a compact vision-language model with built-in chain-of-thought reasoning. Despite its small 2B-parameter size, it can perform spatial reasoning, anomaly detection, and detailed scene analysis, making it well-suited for edge deployment on Jetson.

Key Capabilities

  • Spatial Reasoning: Understands spatial relationships between objects in scenes
  • Anomaly Detection: Identifies unusual patterns or objects in visual data
  • Scene Analysis: Provides detailed descriptions and analysis of visual content
  • Chain-of-thought Reasoning: Generates reasoning traces before concluding with a final response
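
The chain-of-thought trace can be separated from the final answer client-side. A minimal sketch, assuming the trace is wrapped in <think>…</think> tags (the delimiter handled by the qwen3 reasoning parser used in the vLLM serve command below):

```python
def split_reasoning(text: str):
    """Split a model response into (reasoning_trace, final_answer).

    Assumes the chain-of-thought is delimited by <think>...</think>;
    adjust the tags if your deployment emits a different delimiter.
    """
    open_tag, close_tag = "<think>", "</think>"
    if close_tag in text:
        before, _, after = text.partition(close_tag)
        return before.replace(open_tag, "").strip(), after.strip()
    # No trace found: treat the whole response as the final answer.
    return "", text.strip()

trace, answer = split_reasoning(
    "<think>The pallet blocks the aisle, which is unusual.</think>Anomaly: blocked aisle."
)
```

When served with --reasoning-parser qwen3, vLLM can also return the trace in a separate field of the chat response, so a helper like this is mainly useful for raw completions.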

Inputs and Outputs

Input:

  • Text prompts and images
  • Supports video frame analysis via --media-io-kwargs

Output:

  • Generated text with chain-of-thought reasoning traces
  • Spatial analysis, anomaly detection results, and scene descriptions
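
For video inputs, frame sampling is controlled by the JSON object passed to --media-io-kwargs. A small sketch of building that value, assuming num_frames: -1 leaves the frame count uncapped (check your vLLM version's documentation for the exact semantics):

```python
import json

# Build the JSON value for vLLM's --media-io-kwargs flag.
# num_frames = -1 is assumed here to mean "no cap on sampled frames".
media_io_kwargs = {"video": {"num_frames": -1}}
flag_value = json.dumps(media_io_kwargs)
# Pass on the command line as: --media-io-kwargs '{"video": {"num_frames": -1}}'
```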

Running with vLLM

The vLLM path uses an FP8 quantized checkpoint from NGC downloaded via the NGC CLI.

Step 1: Install and Configure the NGC CLI

wget -O ngccli_arm64.zip https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.13.0/files/ngccli_arm64.zip
unzip ngccli_arm64.zip && chmod u+x ngc-cli/ngc
export PATH="$PATH:$(pwd)/ngc-cli"
ngc config set

You will need an NGC account with access to the nim org and a valid API key.

Step 2: Download the FP8 Model

ngc registry model download-version "nim/nvidia/cosmos-reason2-2b:1208-fp8-static-kv8" \
  --dest ~/.cache/huggingface/hub
MODEL_PATH="$HOME/.cache/huggingface/hub/cosmos-reason2-2b_v1208-fp8-static-kv8"

Step 3: Serve

The bind mount -v ${HOME}/.cache/vllm:/root/.cache/vllm persists vLLM's torch.compile cache on the host. The first run compiles kernels and writes them there; later runs reuse the cache and start faster.

mkdir -p ~/.cache/vllm
sync && sudo sysctl -w vm.drop_caches=3  # flush dirty pages, then drop filesystem caches to free RAM before loading the model

sudo docker run -it --rm --runtime=nvidia --network host \
  -v $MODEL_PATH:/models/cosmos-reason2-2b:ro \
  -v ${HOME}/.cache/vllm:/root/.cache/vllm \
  ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \
  vllm serve /models/cosmos-reason2-2b \
    --served-model-name nvidia/cosmos-reason2-2b-fp8 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.8 \
    --reasoning-parser qwen3 \
    --media-io-kwargs '{"video": {"num_frames": -1}}' \
    --enable-prefix-caching \
    --port 8010
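
Once running, the server exposes an OpenAI-compatible API on port 8010. A minimal sketch of the request body, assuming the standard /v1/chat/completions schema; the prompt and image URL are placeholders:

```python
import json

# Chat request for the vLLM server started above; the model name matches
# --served-model-name, and the endpoint would be
# http://localhost:8010/v1/chat/completions.
payload = {
    "model": "nvidia/cosmos-reason2-2b-fp8",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe any anomalies in this scene."},
                # Placeholder image URL; base64 data URLs also work.
                {"type": "image_url", "image_url": {"url": "https://example.com/frame.jpg"}},
            ],
        }
    ],
    "max_tokens": 512,
}
body = json.dumps(payload)
# POST `body` to the endpoint with curl or any HTTP client.
```
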

Running with llama.cpp

Alternatively, serve a Q8_0 GGUF build of the model with llama.cpp:

sudo docker run -it --rm --pull always --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-thor \
  llama-server -hf Kbenkhaled/Cosmos-Reason2-2B-GGUF:Q8_0 -c 8192

Cosmos Reason 2 Family

Model                 Parameters   Memory     Best For
Cosmos Reason 2 2B    2B           8GB RAM    Lightweight edge deployment
Cosmos Reason 2 8B    8B           18GB RAM   Higher accuracy, demanding tasks

Additional Resources