Tutorial - Live LLaVA
Recommended: First follow the chat-based LLaVA and NanoVLM tutorials to familiarize yourself with vision/language models and test the models on their own.
This multimodal agent runs a vision-language model on a live camera feed or video stream, repeatedly applying the same prompts to it:
It uses models like LLaVA or VILA (based on Llama and CLIP) that have been quantized to 4-bit precision for deployment on Jetson Orin. The agent runs an optimized multimodal pipeline from the NanoLLM library, including event filters, alerts, and multimodal RAG:
For benchmarks and further discussion about multimodal optimizations, see the NanoVLM page.
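At their core, the event filters and alerts mentioned above scan each VLM reply for user-defined trigger phrases. A minimal pure-Python sketch of that idea (the function name and trigger keywords here are illustrative, not NanoLLM's actual implementation):

```python
# Illustrative sketch of a keyword-based event filter for VLM replies.
# This is NOT the NanoLLM implementation - just the underlying idea:
# each frame's generated description is checked against trigger phrases.

def check_alerts(reply: str, triggers: list[str]) -> list[str]:
    """Return the trigger phrases found in a VLM reply (case-insensitive)."""
    reply_lower = reply.lower()
    return [t for t in triggers if t.lower() in reply_lower]

# Hypothetical replies the VLM might produce for successive frames:
triggers = ["fire", "smoke", "person"]
print(check_alerts("A person is walking a dog in the park.", triggers))  # ['person']
print(check_alerts("An empty street at night.", triggers))               # []
```

In the real agent, a hit on a filter like this is what raises an alert in the web UI rather than just printing to the console.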
Running the Live Llava Demo
What you need
- One of the following Jetson devices:
    - Jetson AGX Orin (64GB)
    - Jetson AGX Orin (32GB)
    - Jetson Orin NX (16GB)
    - Jetson Orin Nano (8GB)⚠️
- Running one of the following versions of JetPack:
    - JetPack 6 (L4T r36.x)
- Sufficient storage space (preferably with NVMe SSD):
    - 22GB for nano_llm container image
    - Space for models (>10GB)
- Supported vision/language models:
    - liuhaotian/llava-v1.5-7b, liuhaotian/llava-v1.5-13b, liuhaotian/llava-v1.6-vicuna-7b, liuhaotian/llava-v1.6-vicuna-13b
    - Efficient-Large-Model/VILA-2.7b, Efficient-Large-Model/VILA-7b, Efficient-Large-Model/VILA-13b
    - Efficient-Large-Model/VILA1.5-3b, Efficient-Large-Model/Llama-3-VILA1.5-8B, Efficient-Large-Model/VILA1.5-13b
    - VILA-2.7b, VILA1.5-3b, VILA-7b, Llava-7b, and Obsidian-3B can run on Orin Nano 8GB
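To verify up front that your drive meets the storage requirements above (22GB for the container image plus >10GB for models), a quick check with Python's standard library is enough. The mount path below is an assumption; point it at wherever your Docker data root and model cache actually live:

```python
import shutil

# Rough storage requirements from the list above
CONTAINER_GB = 22   # nano_llm container image
MODELS_GB = 10      # model downloads (lower bound)

# Path is an assumption - use the mount holding your Docker data root / models
total, used, free = shutil.disk_usage("/")
free_gb = free / (1024 ** 3)

needed_gb = CONTAINER_GB + MODELS_GB
print(f"free: {free_gb:.1f} GB, needed: ~{needed_gb} GB")
if free_gb < needed_gb:
    print("warning: not enough free space for the nano_llm container and models")
```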
The VideoQuery agent applies prompts to the incoming video feed with the VLM. After launching it with your camera connected, navigate your browser to https://<IP_ADDRESS>:8050 (Chrome is recommended, with chrome://flags#enable-webrtc-hide-local-ips-with-mdns disabled):
jetson-containers run $(autotag nano_llm) \
python3 -m nano_llm.agents.video_query --api=mlc \
--model Efficient-Large-Model/VILA1.5-3b \
--max-context-len 256 \
--max-new-tokens 32 \
--video-input /dev/video0 \
--video-output webrtc://@:8554/output
This uses jetson_utils for video I/O; for options related to protocols and file formats, see Camera Streaming and Multimedia. In the example above, it captures a V4L2 USB webcam connected to the Jetson (under the device /dev/video0) and outputs a WebRTC stream.
Processing a Video File or Stream
The example above was running on a live camera, but you can also read and write a video file or network stream by substituting the path or URL into the --video-input and --video-output command-line arguments like this:
jetson-containers run \
-v /path/to/your/videos:/mount \
$(autotag nano_llm) \
python3 -m nano_llm.agents.video_query --api=mlc \
--model Efficient-Large-Model/VILA1.5-3b \
--max-context-len 256 \
--max-new-tokens 32 \
--video-input /mount/my_video.mp4 \
--video-output /mount/output.mp4 \
--prompt "What does the weather look like?"
This example processes a pre-recorded video (in MP4, MKV, AVI, or FLV format with H.264/H.265 encoding), but it can also input/output live network streams like RTP, RTSP, and WebRTC using Jetson's hardware-accelerated video codecs.
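The source or sink type is inferred from the shape of the --video-input / --video-output string. As a rough illustration of the kinds of strings the tutorial uses (this mirrors, but is not, jetson_utils' actual parsing logic):

```python
# Illustrative classification of --video-input / --video-output strings.
# Covers the source kinds this tutorial mentions; it is NOT the real
# detection code inside jetson_utils, just a sketch of the convention.

def stream_kind(uri: str) -> str:
    if uri.startswith("/dev/video"):
        return "v4l2-camera"                     # local USB webcam
    for scheme in ("rtp://", "rtsp://", "webrtc://"):
        if uri.startswith(scheme):
            return scheme.rstrip(":/")           # live network stream
    if uri.rsplit(".", 1)[-1].lower() in ("mp4", "mkv", "avi", "flv"):
        return "video-file"                      # pre-recorded file
    return "unknown"

print(stream_kind("/dev/video0"))               # v4l2-camera
print(stream_kind("webrtc://@:8554/output"))    # webrtc
print(stream_kind("/mount/my_video.mp4"))       # video-file
```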
NanoDB Integration
If you launch the VideoQuery agent with the --nanodb flag along with a path to your NanoDB database, it will perform reverse-image search on the incoming feed against the database by re-using the CLIP embeddings generated by the VLM.
To enable this mode, first follow the NanoDB tutorial to download, index, and test the database. Then launch VideoQuery like this:
jetson-containers run $(autotag nano_llm) \
python3 -m nano_llm.agents.video_query --api=mlc \
--model Efficient-Large-Model/VILA1.5-3b \
--max-context-len 256 \
--max-new-tokens 32 \
--video-input /dev/video0 \
--video-output webrtc://@:8554/output \
--nanodb /data/nanodb/coco/2017
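Conceptually, the reverse-image search reduces to a nearest-neighbor lookup over CLIP embeddings: the frame's embedding is compared against every indexed image's embedding and the most similar ones are returned. A dependency-free sketch of that idea (the toy 3-dimensional vectors stand in for real high-dimensional CLIP embeddings, and this brute-force loop is not NanoDB's actual index):

```python
import math

# Toy stand-ins for CLIP embeddings. Real embeddings are high-dimensional
# vectors from the VLM's vision encoder; NanoDB indexes them for fast
# similarity search - this brute-force scan just shows the concept.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

database = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.8, 0.3, 0.1],
    "car.jpg": [0.0, 0.2, 0.9],
}

query = [0.85, 0.15, 0.05]   # embedding of the incoming camera frame
best = max(database, key=lambda name: cosine(query, database[name]))
print(best)                  # dog.jpg
```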
You can also tag incoming images and add them to the database using the panel in the web UI.
Python Code
For a simplified code example of doing live VLM streaming from Python, see here in the NanoLLM docs.
You can use this to implement customized prompting techniques and integrate with other vision pipelines.