NanoVLM - Efficient Multimodal Pipeline

We saw in the previous LLaVA tutorial how to run vision-language models through tools like text-generation-webui and llama.cpp. In a similar vein to the SLM page on Small Language Models, here we'll explore optimizing VLMs for reduced memory usage and higher performance that reaches interactive levels (like in Live Llava). These optimizations help the models fit on Orin Nano and increase the framerate.

There are three model families currently supported: Llava, VILA, and Obsidian (mini VLM).

VLM Benchmarks

The benchmark FPS measures end-to-end pipeline performance for continuous streaming, as with Live Llava (on a yes/no question).

•   These models all use CLIP ViT-L/14@336px for the vision encoder.
•   Jetson Orin Nano 8GB runs out of memory trying to run Llava-13B.
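
As a rough illustration of what the shared vision encoder implies, the arithmetic below counts the image patches a ViT with 14-pixel patches produces from a 336px square input. The helper name `vit_patch_count` is hypothetical and not part of NanoLLM. How many of these patch embeddings actually reach the language model depends on each model's projector, so treat the result as an upper-level estimate rather than a measured token count.

```python
def vit_patch_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# CLIP ViT-L/14@336px: 336 / 14 = 24 patches per side
print(vit_patch_count(336, 14))  # 576
```

If every patch became one position in the language model's context (an assumption that does not hold for models that downsample or tile the image), a single image would take up 576 positions before any text is added.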

Multimodal Chat

What you need

  1. One of the following Jetson devices:

    • Jetson AGX Orin (64GB)
    • Jetson AGX Orin (32GB)
    • Jetson Orin NX (16GB)
    • Jetson Orin Nano (8GB)⚠️

  2. Running one of the following versions of JetPack:

    JetPack 6 (L4T r36)

  3. Sufficient storage space (preferably with NVMe SSD).

    • 22GB for nano_llm container image
    • Space for models (>10GB)
  4. Supported VLM models in NanoLLM:

The optimized NanoLLM library uses MLC/TVM for quantization and inference, which provides the highest performance. It efficiently manages the CLIP embeddings and KV cache. You can find Python code for the chat program used in this example here.

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model liuhaotian/llava-v1.6-vicuna-7b \
    --max-context-len 768 \
    --max-new-tokens 128

This starts an interactive console-based chat with Llava, and on the first run the model will automatically be downloaded from HuggingFace and quantized using MLC and W4A16 precision (which can take some time). See here for command-line options.
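
To get a feel for why 4-bit weights matter on memory-constrained boards, here is a back-of-the-envelope sketch. W4A16 means 4-bit weights with 16-bit activations, so the weights take roughly half a byte per parameter. The function name and the nominal parameter counts (7e9, 13e9) are assumptions for illustration. The estimate ignores quantization scales, the vision encoder, the KV cache, and runtime overhead, so real usage will be higher.

```python
def quantized_weight_gb(num_params: float, bits_per_weight: int = 4) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts; actual models differ slightly.
for name, params in [("7B", 7e9), ("13B", 13e9)]:
    w4 = quantized_weight_gb(params, bits_per_weight=4)
    w16 = quantized_weight_gb(params, bits_per_weight=16)
    print(f"{name}: ~{w4:.1f} GB at 4-bit vs ~{w16:.1f} GB at 16-bit")
```

Under these assumptions a 13B model needs about 6.5 GB for weights alone, which leaves little headroom on an 8GB device once everything else is loaded.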

You'll end up at a >> PROMPT: line, where you can enter the path or URL of an image file, followed by your question about the image. You can follow up with multiple questions about the same image. Llava does not understand multiple images in the same chat, so when changing images, first reset the chat history by entering clear or reset as the prompt. VILA supports multiple images (an area of active research).
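
The reset rule above is easy to forget when scripting a long session. The sketch below is a hypothetical pre-flight check, not NanoLLM code: it scans a list of prompts and reports any image that is loaded while a previous image is still in the chat history. It assumes images can be recognized by a .jpg, .jpeg, or .png suffix, which is a simplification.

```python
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")  # assumed set, for illustration
RESET_COMMANDS = {"clear", "reset"}


def images_missing_reset(prompts):
    """Return indices of image prompts not preceded by a clear/reset."""
    missing = []
    image_loaded = False
    for index, prompt in enumerate(prompts):
        text = prompt.strip().lower()
        if text in RESET_COMMANDS:
            image_loaded = False
        elif text.endswith(IMAGE_EXTENSIONS):
            if image_loaded:
                missing.append(index)
            image_loaded = True
    return missing


session = [
    "/data/images/hoover.jpg",
    "what does the road sign say?",
    "/data/images/lake.jpg",  # no reset before switching images
    "please describe the scene.",
]
print(images_missing_reset(session))  # [2]
```

For a model that accepts multiple images, such as VILA, a check like this would simply be skipped.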

Automated Prompts

During testing, you can specify prompts on the command-line that will run sequentially:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model liuhaotian/llava-v1.6-vicuna-7b \
    --max-context-len 768 \
    --max-new-tokens 128 \
    --prompt '/data/images/hoover.jpg' \
    --prompt 'what does the road sign say?' \
    --prompt 'what kind of environment is it?' \
    --prompt 'reset' \
    --prompt '/data/images/lake.jpg' \
    --prompt 'please describe the scene.' \
    --prompt 'are there any hazards to be aware of?'
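
When the prompt list grows, assembling the repeated --prompt flags by hand gets error-prone. The following is a minimal sketch of a helper that builds the argument list for the command shown above; the function name `chat_args` is hypothetical, and only flags that already appear in this tutorial are used. It builds the inner python3 invocation only, leaving the jetson-containers wrapper to the shell.

```python
import shlex


def chat_args(model, prompts, max_context_len=768, max_new_tokens=128):
    """Build the nano_llm.chat argument list with one --prompt per entry."""
    args = [
        "python3", "-m", "nano_llm.chat", "--api=mlc",
        "--model", model,
        "--max-context-len", str(max_context_len),
        "--max-new-tokens", str(max_new_tokens),
    ]
    for prompt in prompts:
        args += ["--prompt", prompt]
    return args


args = chat_args(
    "liuhaotian/llava-v1.6-vicuna-7b",
    ["/data/images/hoover.jpg", "what does the road sign say?", "reset"],
)
# shlex.join quotes arguments that contain spaces, so the line can be pasted
# after `jetson-containers run $(autotag nano_llm)`.
print(shlex.join(args))
```

Keeping the arguments as a list until the last step avoids quoting mistakes with prompts that contain spaces or question marks.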

You can also use --prompt /data/prompts/images.json to run the test sequence; the results are in the table below.

Results

•   The model responses were generated with 4-bit quantization enabled, and are truncated to 128 tokens for brevity.
•   These chat questions and images are from /data/prompts/images.json (found in jetson-containers)

JSON

When prompted, these models can also output in constrained JSON formats (which the LLaVA authors cover in their LLaVA-1.5 paper), and can be used to programmatically query information about the image:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model liuhaotian/llava-v1.5-13b \
    --prompt '/data/images/hoover.jpg' \
    --prompt 'extract any text from the image as json'

{
  "sign": "Hoover Dam",
  "exit": "2",
  "distance": "1 1/2 mile"
}
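
To consume a reply like this from a script, one option is to pull the first JSON object out of the model's text. The sketch below uses only the Python standard library; the function name `first_json_object` is hypothetical, and the assumption that the reply may carry extra text around the JSON is a precaution rather than documented behavior. Model output is not guaranteed to be valid JSON, so callers should be ready for the error case.

```python
import json


def first_json_object(reply: str) -> dict:
    """Parse the first {...} object found in a model reply."""
    start = reply.find("{")
    if start < 0:
        raise ValueError("no JSON object found in reply")
    # raw_decode parses one value and ignores any trailing text.
    obj, _end = json.JSONDecoder().raw_decode(reply[start:])
    return obj


reply = '''Here is the text:
{
  "sign": "Hoover Dam",
  "exit": "2",
  "distance": "1 1/2 mile"
}'''
fields = first_json_object(reply)
print(fields["sign"])  # Hoover Dam
```

Malformed output raises json.JSONDecodeError, which is a subclass of ValueError, so a single except ValueError clause covers both failure modes.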

Web UI

To use this through a web browser instead, see the llamaspeak tutorial.

Live Streaming

These models can also be used with the Live Llava agent for continuous streaming - just substitute the desired model name below:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
    --model Efficient-Large-Model/VILA-2.7b \
    --max-context-len 768 \
    --max-new-tokens 32 \
    --video-input /dev/video0 \
    --video-output webrtc://@:8554/output

Once the agent is running with your camera, navigate your browser to https://<IP_ADDRESS>:8050. Using Chrome or Chromium is recommended for a stable WebRTC connection, with chrome://flags#enable-webrtc-hide-local-ips-with-mdns disabled.

The Live Llava tutorial shows how to enable additional features like vector database integration, image tagging, and RAG.