Tutorial - Live LLaVA

Recommended

Follow the chat-based LLaVA tutorial first and see the local_llm documentation to familiarize yourself with VLMs and make sure the models are working.

This multimodal agent runs a vision-language model on a live camera feed or video stream, repeatedly applying the same prompts to it.

This example uses the popular LLaVA model (based on Llama and CLIP), quantized to 4-bit precision for deployment on Jetson Orin. It uses an optimized multimodal pipeline from the local_llm package and the MLC/TVM inferencing runtime, and serves as a building block for always-on edge applications that can trigger user-promptable alerts and actions with the flexibility of VLMs.

Clone and set up jetson-containers

git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers
sudo apt update; sudo apt install -y python3-pip
pip3 install -r requirements.txt
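
As an optional sanity check before launching the demo, you can run the autotag script by itself; it prints the local_llm container image it resolves for your version of JetPack (and should offer to pull or build one if it isn't found locally):

./autotag local_llm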

Running the Live Llava Demo

What you need

  1. One of the following Jetson devices:

    • Jetson AGX Orin (64GB)
    • Jetson AGX Orin (32GB)
    • Jetson Orin NX (16GB)

  2. Running one of the following versions of JetPack:

    • JetPack 5 (L4T r35.x)
    • JetPack 6 (L4T r36.x)

  3. Sufficient storage space (preferably with NVMe SSD).

    • 25GB for local_llm container image
    • Space for models
      • CLIP model: 1.7GB
      • llava-1.5-7b model: 10.5GB
  4. Follow the chat-based LLaVA tutorial first and see the local_llm documentation.

The VideoQuery agent applies the prompts to an incoming camera or video feed in a closed loop with LLaVA.

./run.sh \
  -e SSL_KEY=/data/key.pem -e SSL_CERT=/data/cert.pem \
  $(./autotag local_llm) \
    python3 -m local_llm.agents.video_query --api=mlc --verbose \
      --model liuhaotian/llava-v1.5-7b \
      --max-new-tokens 32 \
      --video-input /dev/video0 \
      --video-output webrtc://@:8554/output \
      --prompt "How many fingers am I holding up?"

Refer to Enabling HTTPS/SSL to generate self-signed SSL certificates for enabling client-side browser webcams.
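
As a sketch (the exact steps are on the Enabling HTTPS/SSL page), a self-signed key and certificate can be generated with openssl and placed in jetson-containers/data/, which run.sh mounts inside the container at /data so they match the SSL_KEY and SSL_CERT paths used above:

cd jetson-containers/data
openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 -keyout key.pem -out cert.pem

Your browser will warn about the self-signed certificate the first time you connect; you can proceed after accepting it.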

This uses jetson_utils for video I/O; see Camera Streaming and Multimedia for options related to protocols and file formats. In the example above, it captures a V4L2 USB webcam connected to the Jetson (/dev/video0) and outputs a WebRTC stream that can be viewed from a browser at https://HOSTNAME:8554. When HTTPS/SSL is enabled, it can also capture from the browser's webcam.

Changing the Prompt

The --prompt can be specified multiple times, and changed at runtime by pressing the number of the prompt followed by enter on the terminal's keyboard (for example, 1 + Enter for the first prompt). These are the default prompts when no --prompt is specified:

  1. Describe the image concisely.
  2. How many fingers is the person holding up?
  3. What does the text in the image say?
  4. There is a question asked in the image. What is the answer?

Future versions of this demo will have the prompts dynamically editable from the web UI.
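
For example, here is a trimmed-down sketch of supplying your own set of prompts (the second prompt is just an arbitrary placeholder) that you can then cycle through with the number keys:

./run.sh $(./autotag local_llm) \
  python3 -m local_llm.agents.video_query --api=mlc --verbose \
    --model liuhaotian/llava-v1.5-7b \
    --max-new-tokens 32 \
    --video-input /dev/video0 \
    --video-output webrtc://@:8554/output \
    --prompt "Describe the image concisely." \
    --prompt "Is anyone wearing a red shirt?"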

Processing a Video File or Stream

The example above was running on a live camera, but you can also read and write a video file or stream by passing its path or URL as the --video-input and --video-output command-line arguments, like this:

./run.sh \
  -v /path/to/your/videos:/mount \
  $(./autotag local_llm) \
    python3 -m local_llm.agents.video_query --api=mlc --verbose \
      --model liuhaotian/llava-v1.5-7b \
      --max-new-tokens 32 \
      --video-input /mount/my_video.mp4 \
      --video-output /mount/output.mp4 \
      --prompt "What does the weather look like?"

This example processes a pre-recorded video (in MP4, MKV, AVI, or FLV format with H.264/H.265 encoding), but it can also input/output live network streams like RTP, RTSP, and WebRTC using Jetson's hardware-accelerated video codecs.
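
For instance, here is a sketch of reading from an RTSP network camera instead of a file (the stream URL below is a placeholder for your own camera), while still serving the annotated output over WebRTC for viewing in a browser:

./run.sh \
  $(./autotag local_llm) \
    python3 -m local_llm.agents.video_query --api=mlc --verbose \
      --model liuhaotian/llava-v1.5-7b \
      --max-new-tokens 32 \
      --video-input rtsp://192.168.1.2:554/stream \
      --video-output webrtc://@:8554/output \
      --prompt "Describe the image concisely."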