Multimodal

Nemotron Nano 12B VL

NVIDIA's vision-language model for image understanding and multimodal reasoning


Parameters: 12B

Modalities: Text, Image, Video

Context Length: 128K

License: NVIDIA Open Model License

Precision: NVFP4-QAD

Serve the model

Start server

Choose module, then engine and optional parameters on the left, then copy the serve command by clicking the button on the right.


Call the model over Web API

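Use either client below to send a Web API request to the model you just served: run the curl command in a terminal, or run the Python snippet with the openai package installed. In both, JETSON_HOST stands for the hostname or IP address of your Jetson.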

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

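import os

from openai import OpenAI

# Python does not expand shell-style ${JETSON_HOST}, so read it from the environment.
jetson_host = os.environ["JETSON_HOST"]

client = OpenAI(
    base_url=f"http://{jetson_host}:8000/v1",
    api_key="not-needed",  # vLLM / llama.cpp typically do not enforce a key
)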

completion = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
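
The requests above send text only. Since this is a vision-language model, a request that pairs a question with an image may be more representative. The following is a minimal sketch, not a confirmed recipe for this page's serve command: it assumes the server accepts OpenAI-style image_url content parts (as vLLM's OpenAI-compatible server generally does for vision models). The helper names image_to_data_url, build_image_messages, and ask_about_image are hypothetical, and the localhost fallback for JETSON_HOST is an assumption.

```python
import base64
import os


def image_to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_messages(question: str, image_url: str) -> list:
    """Build a single-turn chat message pairing a text question with one image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


def ask_about_image(question: str, image_path: str) -> str:
    """Send the question and image to the served model and return the reply text."""
    # Imported here so the two helpers above can be used without the openai package.
    from openai import OpenAI

    # Assumption: fall back to localhost when JETSON_HOST is not set.
    host = os.environ.get("JETSON_HOST", "localhost")
    client = OpenAI(base_url=f"http://{host}:8000/v1", api_key="not-needed")

    with open(image_path, "rb") as f:
        url = image_to_data_url(f.read())

    completion = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD",
        messages=build_image_messages(question, url),
    )
    return completion.choices[0].message.content
```

With the server running, a call such as ask_about_image("Extract all text in this image.", "receipt.png") would return the model's reply as a string; receipt.png is a placeholder for any local image file, and the MIME type should match the file's actual format.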

Model Details


NVIDIA Nemotron Nano 12B VL is a vision-language model capable of understanding images and text, with support for chain-of-thought reasoning across multimodal inputs.

Inputs and Outputs

Input: Image, Text

Output: Text

Intended Use Cases

Image Summarization: Generate detailed descriptions of images
Text-Image Analysis: Analyze relationships between text and visual content
Optical Character Recognition (OCR): Extract text from images
Interactive Q&A on Images: Answer questions about image content
Chain-of-Thought Reasoning: Complex visual reasoning tasks

Supported Languages

English, German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese.