Tutorial - Live LLaVA

Recommended

Follow the chat-based LLaVA and NanoVLM tutorials to familiarize yourself with vision/language models and test the models first.

This multimodal agent runs a vision-language model on a live camera feed or video stream, repeatedly applying the same prompts to it.

It uses models like LLaVA or VILA that have been quantized to 4-bit precision, and runs an optimized multimodal pipeline from the NanoLLM library, including the CLIP/SigLIP vision encoder in TensorRT, event filters and alerts, and multimodal RAG (see the NanoVLM page for benchmarks).

Running the Live LLaVA Demo

What you need

  1. One of the following Jetson devices:

    • Jetson AGX Orin (64GB)
    • Jetson AGX Orin (32GB)
    • Jetson Orin NX (16GB)
    • Jetson Orin Nano (8GB)⚠️

  2. Running one of the following versions of JetPack:

    JetPack 6 (L4T r36.x)

  3. Sufficient storage space (preferably with NVMe SSD).

    • 22GB for nano_llm container image
    • Space for models (>10GB)
  4. Follow the chat-based LLaVA and NanoVLM tutorials first.

  5. One of the supported vision/language models, such as LLaVA or VILA.

The VideoQuery agent applies prompts to the incoming video feed with the VLM. After launching it with your camera, navigate your browser to https://<IP_ADDRESS>:8050 (Chrome is recommended, with chrome://flags#enable-webrtc-hide-local-ips-with-mdns disabled).

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32 \
    --video-input /dev/video0 \
    --video-output webrtc://@:8554/output

The agent uses jetson_utils for video I/O; for options related to protocols and file formats, see Camera Streaming and Multimedia. The example above captures a V4L2 USB webcam connected to the Jetson (under the device /dev/video0) and outputs a WebRTC stream.
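
Because the video I/O goes through jetson_utils, you can sanity-check a camera or stream URI from Python before layering the VLM on top. The following is a minimal sketch using the jetson_utils Python bindings (it is not part of the tutorial's code); the URIs are just examples mirroring the command above.

from jetson_utils import videoSource, videoOutput

# The same URIs accepted by --video-input / --video-output work here,
# e.g. /dev/video0, csi://0, an rtsp:// URL, a file path, or webrtc://@:8554/output
source = videoSource("/dev/video0")
output = videoOutput("webrtc://@:8554/output")

while source.IsStreaming() and output.IsStreaming():
    img = source.Capture()   # returns a cudaImage, or None on timeout
    if img is None:
        continue
    output.Render(img)       # pass the frame through unmodified

If frames show up at the WebRTC address in your browser, the same URIs should work with the VideoQuery agent.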

Processing a Video File or Stream

The example above was running on a live camera, but you can also read and write a video file or network stream by substituting the path or URL into the --video-input and --video-output command-line arguments, like this:

jetson-containers run \
  -v /path/to/your/videos:/mount \
  $(autotag nano_llm) \
    python3 -m nano_llm.agents.video_query --api=mlc \
      --model Efficient-Large-Model/VILA1.5-3b \
      --max-context-len 256 \
      --max-new-tokens 32 \
      --video-input /mount/my_video.mp4 \
      --video-output /mount/output.mp4 \
      --prompt "What does the weather look like?"

This example processes a pre-recorded video (in MP4, MKV, AVI, or FLV format with H.264/H.265 encoding), but it can also input/output live network streams like RTP, RTSP, and WebRTC using the Jetson's hardware-accelerated video codecs.

NanoDB Integration

If you launch the VideoQuery agent with the --nanodb flag along with a path to your NanoDB database, it will perform reverse-image search on the incoming feed against the database by re-using the CLIP embeddings generated by the VLM.

To enable this mode, first follow the NanoDB tutorial to download, index, and test the database. Then launch VideoQuery like this:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-context-len 256 \
    --max-new-tokens 32 \
    --video-input /dev/video0 \
    --video-output webrtc://@:8554/output \
    --nanodb /data/nanodb/coco/2017

You can also tag incoming images and add them to the database using the web UI, for one-shot recognition tasks.

Video VILA

The VILA-1.5 family of models can understand multiple images per query, enabling video search/summarization, action & behavior analysis, change detection, and other temporal-based vision functions. The vision/video.py example keeps a rolling history of frames:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.vision.video \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-images 8 \
    --max-new-tokens 48 \
    --video-input /data/my_video.mp4 \
    --video-output /data/my_output.mp4 \
    --prompt 'What changes occurred in the video?'

Note: support for continuous multi-image queries on video sequences will be added to the web UI.

Python Code

For a simplified code example of doing live VLM streaming from Python, see here in the NanoLLM docs.

You can use this to implement customized prompting techniques and integrate with other vision pipelines. This code applies the same set of prompts to the latest image from the video feed. See here for the version that does multi-image queries on video sequences.
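
For reference, the streaming loop in that example boils down to roughly the following. This is only a sketch, assuming the NanoLLM Python API (NanoLLM.from_pretrained, ChatHistory) and jetson_utils for capture; the model path, camera URI, and prompt are just the examples used earlier on this page, and exact argument names may differ between library versions, so treat the linked example in the NanoLLM docs as authoritative.

from nano_llm import NanoLLM, ChatHistory
from jetson_utils import videoSource

# load the quantized VLM (same model as the commands above)
model = NanoLLM.from_pretrained(
    "Efficient-Large-Model/VILA1.5-3b",
    api="mlc",                           # assumes the MLC backend used throughout this tutorial
)

chat = ChatHistory(model)
camera = videoSource("/dev/video0")      # or a file / RTSP / WebRTC URI
prompt = "What does the weather look like?"

while camera.IsStreaming():
    img = camera.Capture()
    if img is None:
        continue

    # embed the latest frame together with the prompt, then generate a reply
    chat.append("user", image=img)
    chat.append("user", prompt)
    embedding, _ = chat.embed_chat()

    reply = model.generate(
        embedding,
        kv_cache=chat.kv_cache,
        max_new_tokens=32,
        streaming=False,
    )
    print(reply)

    chat.reset()  # clear the history so each frame is queried independently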