Multimodal
Nemotron Nano 12B VL
NVIDIA's vision-language model for image understanding and multimodal reasoning
Memory Requirement: 16GB RAM
Precision: NVFP4-QAD
Size: 8GB
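The listed size can be sanity-checked with back-of-envelope arithmetic. The sketch below makes several assumptions. It takes the nominal 12B parameter count from the model name. It treats NVFP4 as a 4-bit E2M1 value plus one FP8 scale shared by each block of 16 values, which works out to 4.5 bits per weight. It ignores activations, KV cache, and runtime overhead. Under these assumptions the weights come to roughly 6.75 GB, below the listed 8GB. The gap may come from components kept at higher precision, but the card does not say which ones.

```python
# Back-of-envelope weight-size estimate for a ~12B-parameter model in NVFP4.
# Assumptions (not stated on this card): every parameter is quantized, each
# weight is a 4-bit value, and one 8-bit scale is shared per block of 16
# weights. The small per-tensor FP32 scale is ignored.

PARAMS = 12e9      # nominal parameter count, taken from the model name
VALUE_BITS = 4     # bits per quantized weight (FP4 E2M1)
SCALE_BITS = 8     # bits per block scale (FP8)
BLOCK_SIZE = 16    # weights that share one block scale


def nvfp4_weight_bytes(params: float) -> float:
    """Approximate storage in bytes when all weights use NVFP4."""
    bits_per_param = VALUE_BITS + SCALE_BITS / BLOCK_SIZE  # 4.5 bits
    return params * bits_per_param / 8


def bf16_weight_bytes(params: float) -> float:
    """Storage in bytes for the same weights kept in 16-bit BF16."""
    return params * 2


if __name__ == "__main__":
    gb = 1e9
    print(f"NVFP4 estimate: {nvfp4_weight_bytes(PARAMS) / gb:.2f} GB")
    print(f"BF16 estimate:  {bf16_weight_bytes(PARAMS) / gb:.2f} GB")
```

Running the script prints about 6.75 GB for NVFP4 and 24.00 GB for BF16. The comparison shows why 4-bit quantization is what lets a 12B-parameter model fit within a 16GB RAM budget, with the remaining memory available for the runtime and KV cache.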
Jetson Inference - Supported Inference Engines
Container: This model is not supported on this platform.
Model Details
NVIDIA Nemotron Nano 12B VL is a vision-language model capable of understanding images and text, with support for chain-of-thought reasoning across multimodal inputs.
Inputs and Outputs
Input: Image, Text
Output: Text
Intended Use Cases
- Image Summarization: Generate detailed descriptions of images
- Text-Image Analysis: Analyze relationships between text and visual content
- Optical Character Recognition (OCR): Extract text from images
- Interactive Q&A on Images: Answer questions about image content
- Chain-of-Thought Reasoning: Work through complex visual reasoning tasks step by step
Supported Languages
English, German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese.