Multimodal

Nemotron Nano 12B VL

NVIDIA's vision-language model for image understanding and multimodal reasoning

Parameters: 12B
Modalities: Text, Image, Video
Context Length: 128K
License: NVIDIA Open Model License
Precision: NVFP4-QAD

Serve the model

Start server

Choose a module, then an engine and any optional parameters on the left, then click the button on the right to copy the serve command.

Call the model over Web API

Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
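
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server exposes an OpenAI-compatible `/v1/chat/completions` endpoint on port 8000, as the curl command implies. The helper names `build_chat_request`, `extract_reply`, and `chat` are illustrative, and the response shape (`choices[0].message.content`) is the usual OpenAI-style layout rather than something this page confirms.

```python
import json
import urllib.request

# Model identifier copied from the curl command above.
MODEL = "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD"


def build_chat_request(host: str, prompt: str, port: int = 8000) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl command."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response (assumed shape)."""
    return response_body["choices"][0]["message"]["content"]


def chat(host: str, prompt: str) -> str:
    """Send the request and return the reply text; requires a running server."""
    with urllib.request.urlopen(build_chat_request(host, prompt), timeout=60) as resp:
        return extract_reply(json.load(resp))
```

With the server running, calling `chat` with the same host you set in `JETSON_HOST` and the prompt `"Hello!"` should return the model's reply as a string.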

Model Details

NVIDIA Nemotron Nano 12B VL is a vision-language model capable of understanding images and text, with support for chain-of-thought reasoning across multimodal inputs.

Inputs and Outputs

Input: Image, Text

Output: Text
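
Because the model accepts an image alongside text, a single user message can carry both. The following is a minimal sketch, assuming the server accepts OpenAI-style content parts (`"type": "text"` and `"type": "image_url"`) with the image embedded as a base64 data URL. This page does not confirm that format, and `build_image_message` is a hypothetical helper.

```python
import base64
import mimetypes


def build_image_message(image_bytes: bytes, question: str, filename: str = "image.png") -> dict:
    """Return a user message holding a text question and an inline image.

    Assumption: the server understands OpenAI-style multimodal content parts.
    """
    # Guess the MIME type from the file name; fall back to a generic type.
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }
```

The returned dictionary can take the place of the plain-text entry in the `messages` list of the request body used in the curl command.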

Intended Use Cases

  • Image Summarization: Generate detailed descriptions of images
  • Text-Image Analysis: Analyze relationships between text and visual content
  • Optical Character Recognition (OCR): Extract text from images
  • Interactive Q&A on Images: Answer questions about image content
  • Chain-of-Thought Reasoning: Work through complex visual reasoning tasks step by step

Supported Languages

English, German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese.