Multimodal

Nemotron Nano 12B VL

NVIDIA's vision-language model for image understanding and multimodal reasoning

Memory Requirement 16GB RAM
Precision NVFP4-QAD
Size 8GB

Jetson Inference - Supported Inference Engines

🚀
Container

This model is not supported on this platform.

Model Details

NVIDIA Nemotron Nano 12B VL is a vision-language model capable of understanding images and text, with support for chain-of-thought reasoning across multimodal inputs.

Inputs and Outputs

Input: Image, Text

Output: Text

Intended Use Cases

  • Image Summarization: Generate detailed descriptions of images
  • Text-Image Analysis: Analyze relationships between text and visual content
  • Optical Character Recognition (OCR): Extract text from images
  • Interactive Q&A on Images: Answer questions about image content
  • Chain-of-Thought Reasoning: Complex visual reasoning tasks

Supported Languages

English, German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese.