
Tutorial - Live LLaVA

Recommended

Follow the chat-based LLaVA and NanoVLM tutorials first to familiarize yourself with vision/language models and to test the models.

This multimodal agent runs a vision-language model on a live camera feed or video stream, repeatedly applying the same prompts to it.

It uses vision/language models like LLaVA or VILA (based on Llama and CLIP), quantized to 4-bit precision so they can be deployed on Jetson Orin. The agent runs an optimized multimodal pipeline from the NanoLLM library, including event filters, alerts, and multimodal RAG.

For benchmarks and further discussion about multimodal optimizations, see the NanoVLM page.

Running the Live Llava Demo

What you need

  1. One of the following Jetson devices:

    • Jetson AGX Orin (64GB)
    • Jetson AGX Orin (32GB)
    • Jetson Orin NX (16GB)
    • Jetson Orin Nano (8GB)⚠️

  2. Running one of the following versions of JetPack:

    • JetPack 6 (L4T r36.x)

  3. Sufficient storage space (preferably with NVMe SSD).

    • 22GB for nano_llm container image
    • Space for models (>10GB)
  4. Follow the chat-based LLaVA and NanoVLM tutorials first.

  5. One of the supported vision/language models (the examples below use VILA1.5-3b).

The VideoQuery agent applies prompts to the incoming video feed with the VLM. Navigate your browser to https://<IP_ADDRESS>:8050 after launching it with your camera (Chrome is recommended, with chrome://flags#enable-webrtc-hide-local-ips-with-mdns disabled).

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32 \
    --video-input /dev/video0 \
    --video-output webrtc://@:8554/output

This uses jetson_utils for video I/O; for options related to protocols and file formats, see Camera Streaming and Multimedia. In the example above, it captures a V4L2 USB webcam connected to the Jetson (under the device /dev/video0) and outputs a WebRTC stream.
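To see what that I/O layer looks like on its own, here is a minimal Python sketch using the jetson_utils videoSource/videoOutput bindings, independent of the VLM. The camera device and WebRTC output mirror the command above; this is illustrative only, and the same URI strings also accept video files and network streams.

from jetson_utils import videoSource, videoOutput

# same URIs as the command above; file paths or rtsp:// / rtp:// URLs
# can be substituted for either the input or the output
camera = videoSource("/dev/video0")
output = videoOutput("webrtc://@:8554/output")

while True:
    img = camera.Capture()    # CUDA image, or None on timeout
    if img is None:
        continue
    output.Render(img)        # forward the frame to the WebRTC stream
    if not camera.IsStreaming() or not output.IsStreaming():
        break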

Processing a Video File or Stream

The example above was running on a live camera, but you can also read and write a video file or network stream by substituting a path or URL for the --video-input and --video-output command-line arguments, like this:

jetson-containers run \
  -v /path/to/your/videos:/mount \
  $(autotag nano_llm) \
    python3 -m nano_llm.agents.video_query --api=mlc \
      --model Efficient-Large-Model/VILA1.5-3b \
      --max-context-len 256 \
      --max-new-tokens 32 \
      --video-input /mount/my_video.mp4 \
      --video-output /mount/output.mp4 \
      --prompt "What does the weather look like?"

This example processes a pre-recorded video (in MP4, MKV, AVI, or FLV format with H.264/H.265 encoding), but it can also input/output live network streams like RTP, RTSP, and WebRTC using Jetson's hardware-accelerated video codecs.

NanoDB Integration

If you launch the VideoQuery agent with the --nanodb flag along with a path to your NanoDB database, it will perform reverse-image search on the incoming feed against the database by re-using the CLIP embeddings generated by the VLM.

To enable this mode, first follow the NanoDB tutorial to download, index, and test the database. Then launch VideoQuery like this:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32 \
    --video-input /dev/video0 \
    --video-output webrtc://@:8554/output \
    --nanodb /data/nanodb/coco/2017

You can also tag incoming images and add them to the database using the panel in the web UI.

Python Code

For a simplified code example of live VLM streaming from Python, see the NanoLLM documentation.

You can use this to implement customized prompting techniques and integrate with other vision pipelines.
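As a starting point, below is a condensed sketch along those lines (not the exact example from the NanoLLM docs): it captures frames with jetson_utils and queries the VLM with a fixed prompt on each frame. The model name, prompt, and generation settings follow the command-line examples above; the exact NanoLLM API may vary between releases.

from nano_llm import NanoLLM, ChatHistory
from jetson_utils import videoSource, cudaMemcpy

# load the quantized VLM (same model as the CLI examples above)
model = NanoLLM.from_pretrained(
    "Efficient-Large-Model/VILA1.5-3b",
    api='mlc',
    max_context_len=256,
)
assert model.has_vision

chat_history = ChatHistory(model)
camera = videoSource("/dev/video0")   # or a file / network stream URL

while True:
    img = camera.Capture()
    if img is None:                   # timeout, keep waiting for frames
        continue

    # copy the frame so it isn't overwritten by the next capture,
    # then apply the same prompt that the CLI example used
    chat_history.append('user', image=cudaMemcpy(img))
    chat_history.append('user', 'What does the weather look like?')
    embedding, _ = chat_history.embed_chat()

    reply = model.generate(
        embedding,
        kv_cache=chat_history.kv_cache,
        max_new_tokens=32,
        streaming=False,
    )
    print(reply)

    chat_history.reset()              # re-apply the prompt to the next frame

    if not camera.IsStreaming():
        break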