NanoVLM - Efficient Multimodal Pipeline

We saw in the previous LLaVA tutorial how to run vision-language models through tools like text-generation-webui and llama.cpp. In a similar vein to the SLM page on Small Language Models, here we'll explore optimizing VLMs for reduced memory usage and higher performance that reaches interactive levels (like in Live Llava). These optimizations help the models fit on Orin Nano and increase the framerate.

There are 3 model families currently supported: Llava, VILA, and Obsidian (mini VLM).

VLM Benchmarks

The FPS figures measure end-to-end pipeline performance for continuous streaming, as with Live Llava (on a yes/no question).

Multimodal Chat

What you need

One of the following Jetson devices:

- Jetson AGX Orin (64GB)
- Jetson AGX Orin (32GB)
- Jetson Orin NX (16GB)
- Jetson Orin Nano (8GB) ⚠️
Running one of the following versions of JetPack:

JetPack 6 (L4T r36)
NVMe SSD highly recommended for storage speed and space
- 22GB for nano_llm container image
- Space for models (>10GB)
Supported VLM models in NanoLLM:
- liuhaotian/llava-v1.5-7b, liuhaotian/llava-v1.5-13b, liuhaotian/llava-v1.6-vicuna-7b, liuhaotian/llava-v1.6-vicuna-13b
- Efficient-Large-Model/VILA-2.7b, Efficient-Large-Model/VILA-7b, Efficient-Large-Model/VILA-13b
- Efficient-Large-Model/VILA1.5-3b, Efficient-Large-Model/Llama-3-VILA1.5-8B, Efficient-Large-Model/VILA1.5-13b
- VILA-2.7b, VILA1.5-3b, VILA-7b, Llava-7b, and Obsidian-3B can run on Orin Nano 8GB

The optimized NanoLLM library uses MLC/TVM for quantization and inference, which provides the highest performance. It efficiently manages the CLIP embeddings and KV cache. You can find Python code for the chat program used in this example here.

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32

This starts an interactive console-based chat with the model. On the first run, the model is automatically downloaded from HuggingFace and quantized using MLC with W4A16 precision (4-bit weights, 16-bit activations), which can take some time. See here for command-line options.
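As a rough illustration of why W4A16 quantization matters on an 8GB board, the sketch below estimates the size of the model weights when they are stored in 4 bits instead of 16. The parameter counts are nominal values read off the model names (3B, 7B, 13B) and are an assumption. The estimate also ignores quantization scales, the vision encoder, the KV cache, and runtime overhead, so actual memory usage will be higher.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts inferred from the model names (assumption).
NOMINAL_PARAMS = {"3B": 3e9, "7B": 7e9, "13B": 13e9}


def footprint_table() -> dict:
    """Map each nominal size to its (FP16, 4-bit) weight footprint in GB."""
    return {
        name: (weight_footprint_gb(n, 16), weight_footprint_gb(n, 4))
        for name, n in NOMINAL_PARAMS.items()
    }


if __name__ == "__main__":
    for name, (fp16, w4) in footprint_table().items():
        print(f"{name}: FP16 ~{fp16:.1f} GB, 4-bit ~{w4:.1f} GB")
```

Under these assumptions a nominal 7B model drops from about 14 GB of weights at 16 bits to about 3.5 GB at 4 bits, which is consistent with the 7B models being listed as runnable on Orin Nano 8GB.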

You'll end up at a >> PROMPT: prompt, where you can enter the path or URL of an image file, followed by your question about the image. You can follow up with multiple questions about the same image. Llava does not understand multiple images in the same chat, so when changing images, first reset the chat history by entering clear or reset as the prompt. VILA supports multiple images (an area of active research).

Automated Prompts

During testing, you can specify prompts on the command-line that will run sequentially:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32 \
    --prompt '/data/images/hoover.jpg' \
    --prompt 'what does the road sign say?' \
    --prompt 'what kind of environment is it?' \
    --prompt 'reset' \
    --prompt '/data/images/lake.jpg' \
    --prompt 'please describe the scene.' \
    --prompt 'are there any hazards to be aware of?'
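During testing it can be convenient to generate the repeated --prompt flags from a list rather than typing them by hand. The sketch below builds the argument list for the nano_llm.chat command shown above, using only flags that appear in this tutorial. The helper name build_chat_command is hypothetical and not part of NanoLLM. It covers only the part that runs inside the container, so the jetson-containers run $(autotag nano_llm) prefix still has to be added in the shell.

```python
import shlex


def build_chat_command(model: str, prompts: list[str],
                       max_context_len: int = 256,
                       max_new_tokens: int = 32) -> list[str]:
    """Return the argv for nano_llm.chat with one --prompt flag per entry."""
    argv = [
        "python3", "-m", "nano_llm.chat", "--api=mlc",
        "--model", model,
        "--max-context-len", str(max_context_len),
        "--max-new-tokens", str(max_new_tokens),
    ]
    # Prompts run sequentially, so their order in the list is preserved.
    for prompt in prompts:
        argv += ["--prompt", prompt]
    return argv


if __name__ == "__main__":
    argv = build_chat_command(
        "Efficient-Large-Model/VILA1.5-3b",
        ["/data/images/hoover.jpg", "what does the road sign say?", "reset"],
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```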

You can also use --prompt /data/prompts/images.json to run the test sequence; the results are in the table below.

Results

• The model responses were generated with 4-bit quantization enabled and are truncated to 128 tokens for brevity.
• These chat questions and images are from /data/prompts/images.json (found in jetson-containers)

JSON

When prompted, these models can also output in constrained JSON formats (which the LLaVA authors cover in their LLaVA-1.5 paper), and can be used to programmatically query information about the image:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model liuhaotian/llava-v1.5-13b \
    --prompt '/data/images/hoover.jpg' \
    --prompt 'extract any text from the image as json'

{
  "sign": "Hoover Dam",
  "exit": "2",
  "distance": "1 1/2 mile"
}
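To consume a reply like the one above from a program, the text has to be parsed into a data structure. The following is a minimal sketch, assuming the reply contains a single JSON object that may be surrounded by extra text. The helper name extract_json is hypothetical, and since model output is not guaranteed to be valid JSON, callers should be prepared for the ValueError.

```python
import json


def extract_json(reply: str) -> dict:
    """Parse the span from the first '{' to the last '}' in a model reply.

    Raises ValueError if no such span exists or if it is not valid JSON
    (json.JSONDecodeError is a subclass of ValueError).
    """
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])


if __name__ == "__main__":
    reply = """{
      "sign": "Hoover Dam",
      "exit": "2",
      "distance": "1 1/2 mile"
    }"""
    info = extract_json(reply)
    print(info["sign"], info["exit"])
```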

Web UI

To use this through a web browser instead, see the llamaspeak tutorial.

Live Streaming

These models can also be used with the Live Llava agent for continuous streaming - just substitute the desired model name below:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32 \
    --video-input /dev/video0 \
    --video-output webrtc://@:8554/output

After launching it with your camera, navigate your browser to https://<IP_ADDRESS>:8050. Using Chrome or Chromium is recommended for a stable WebRTC connection, with chrome://flags#enable-webrtc-hide-local-ips-with-mdns disabled.

The Live Llava tutorial shows how to enable additional features like vector database integration, image tagging, and RAG.

Video Sequences

The VILA-1.5 family of models can understand multiple images per query, enabling video search/summarization, action & behavior analysis, change detection, and other temporal-based vision functions. By manipulating the KV cache and dropping off the last frame from the chat history, we can keep the stream rolling continuously beyond the maximum context length of the model. The vision/video.py example shows how to use this:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.vision.video \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-images 8 \
    --max-new-tokens 48 \
    --video-input /data/my_video.mp4 \
    --video-output /data/my_output.mp4 \
    --prompt 'What changes occurred in the video?'
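The rolling behavior described above can be pictured as a fixed-size window over the incoming frames. The sketch below shows only that bookkeeping, assuming the frame that gets dropped is the oldest one in the history. The FrameWindow class is hypothetical and not part of NanoLLM, and it does not touch the KV cache, which the real example also manipulates.

```python
from collections import deque


class FrameWindow:
    """Keep only the most recent `max_images` frames for each query."""

    def __init__(self, max_images: int = 8):
        if max_images < 1:
            raise ValueError("max_images must be at least 1")
        self.frames = deque(maxlen=max_images)

    def add(self, frame):
        """Append a frame; return the evicted oldest frame, or None if there was room."""
        full = len(self.frames) == self.frames.maxlen
        evicted = self.frames[0] if full else None
        self.frames.append(frame)  # deque drops the oldest entry when full
        return evicted

    def current(self) -> list:
        """Frames that would accompany the next query, oldest first."""
        return list(self.frames)


if __name__ == "__main__":
    window = FrameWindow(max_images=3)
    for frame_id in range(5):
        window.add(frame_id)
    print(window.current())
```

Returning the evicted frame gives the caller a hook for discarding whatever state is associated with it, so the total context stays bounded however long the stream runs.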

Python Code

For a simplified code example of doing live VLM streaming from Python, see here in the NanoLLM docs.

You can use this to implement customized prompting techniques and integrate with other vision pipelines. This code applies the same set of prompts to the latest image from the video feed. See here for the version that does multi-image queries on video sequences.
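The control flow described above, applying the same set of prompts to each incoming image, can be sketched independently of any particular model. In the example below, ask is a hypothetical callable standing in for the actual VLM call and frames is any iterable of images; neither is the NanoLLM API, for which the linked docs remain the reference.

```python
from typing import Callable, Iterable, Iterator


def query_stream(frames: Iterable, prompts: list[str],
                 ask: Callable[[object, str], str]) -> Iterator[tuple]:
    """For each frame, ask every prompt and yield (frame_index, prompt, reply)."""
    for index, frame in enumerate(frames):
        for prompt in prompts:
            yield index, prompt, ask(frame, prompt)


if __name__ == "__main__":
    # Stand-in for a real VLM call (hypothetical); it just echoes its inputs.
    def fake_ask(frame, prompt):
        return f"[{frame}] {prompt}"

    results = query_stream(["frame0", "frame1"], ["Describe the image."], fake_ask)
    for index, prompt, reply in results:
        print(index, reply)
```

Because query_stream is a generator, replies can be consumed as they arrive, and a custom prompting technique only needs to change the ask callable or the prompt list.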