Voice + Vision with Multi-modal AI Studio
Build speech AI pipelines with Multi-modal AI Studio on Jetson.
In this chapter, you’ll explore voice-driven AI interactions using Multi-modal AI Studio on Jetson Thor.
📍 Run on Jetson
All commands in this lab should be run in your Jetson terminal (SSH session), not on your client PC.
Conversational AI Pipeline
A complete conversational AI pipeline consists of three core components:
| Component | Function | Example Models |
|---|---|---|
| ASR (Automatic Speech Recognition) | Detects speech activity (VAD) and converts spoken audio to text | NVIDIA Parakeet, Whisper |
| LLM/VLM (Language/Vision Model) | Generates intelligent responses with optional visual understanding | Llama, Qwen, Cosmos-Reason |
| TTS (Text-to-Speech) | Converts text responses to natural speech | NVIDIA Magpie, Kokoro TTS, Piper TTS |
Multi-modal AI Studio
Multi-modal AI Studio is a reference application for building and evaluating speech and vision-enabled AI pipelines on Jetson. It’s designed to help you:
- Evaluate different backend models — The pipeline is modular and built on standard web APIs (OpenAI-compatible, Riva’s gRPC, etc.), so you can swap ASR, LLM, and TTS backends independently to compare models side by side
- Configure the audio pipeline visually — A GUI lets you easily adjust pipeline parameters like VAD sensitivity, ASR settings, LLM prompts, and TTS voice — no code changes needed
- Analyze and minimize latency — Built-in timeline visualization and per-stage latency metrics help you identify bottlenecks and evaluate strategies to reduce end-to-end response time
Architectural Setup
Step 1: Start NVIDIA Riva (ASR + TTS)
NVIDIA Riva provides the ASR and TTS services for the voice pipeline. It runs as a Docker container exposing a gRPC endpoint on port 50051.
cd ~/.cache/riva/riva_quickstart_arm64_v2.24.0
bash riva_start.sh config.sh -s
Wait for the server to be ready. You can monitor the logs:
docker logs -f riva-speech
Look for:
Riva server listening on 0.0.0.0:50051
All models loaded successfully
This may take 2–5 minutes on first startup as models are loaded into GPU memory.
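As an extra sanity check beyond the logs, you can confirm the container is running and the gRPC port is open (a quick sketch; assumes the standard ss tool from iproute2, which ships with JetPack):

# Confirm the Riva container is running
docker ps --filter name=riva-speech --format '{{.Names}}: {{.Status}}'
# Confirm the gRPC endpoint is listening on port 50051
ss -tln | grep 50051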
📦 Pre-configured for This Workshop
The Riva quickstart bundle is already downloaded, configured, and initialized on your Jetson Thor. You only need to run riva_start.sh.
On your own Jetson, the full setup involves:
- Install NGC CLI and configure credentials
- Download the Riva ARM64 quickstart (ngc registry resource download-version nvidia/riva/riva_quickstart_arm64:2.24.0)
- Edit config.sh to select ASR/TTS models and the Jetson platform
- Run riva_init.sh to download Docker images and models (~15–45 min)
- Run riva_start.sh
See the NVIDIA Riva documentation for the full setup guide.
Step 2: Start vLLM with Cosmos-Reason2
Next, start the LLM backend. We’ll use Cosmos-Reason2 on vLLM again, but on a different port — Riva’s container occupies ports 8000–8002, so we use port 8010.
First, drop the filesystem page cache so Jetson's unified memory is free for the model weights:

sudo sysctl -w vm.drop_caches=3

Then start the vLLM container:
sudo docker run -it --rm --runtime=nvidia --network host \
-v ~/models/cosmos-reason2-8b:/models/cosmos-reason2-8b:ro \
-v ${HOME}/.cache/vllm:/root/.cache/vllm \
ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \
vllm serve /models/cosmos-reason2-8b \
--served-model-name nvidia/cosmos-reason2-8b-fp8 \
--max-model-len 8192 \
--gpu-memory-utilization 0.7 \
--reasoning-parser qwen3 \
--media-io-kwargs '{"video": {"num_frames": -1}}' \
--enable-prefix-caching \
--port 8010
Wait for vLLM to be ready:
INFO: Uvicorn running on http://0.0.0.0:8010
⚠️ Port Conflict with Riva
The Riva container exposes ports 8000–8002 (and 8888, 50051). Always use a different port for vLLM when running alongside Riva. We use --port 8010 here.
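Before configuring the UI, you can verify the endpoint responds. vLLM exposes the standard OpenAI-compatible REST API, so plain curl works (a minimal sketch, run on the Jetson):

# List served models; the response should include nvidia/cosmos-reason2-8b-fp8
curl -s http://localhost:8010/v1/models
# Send a one-shot chat completion to confirm inference works end to end
curl -s http://localhost:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/cosmos-reason2-8b-fp8",
       "messages": [{"role": "user", "content": "Reply with one short sentence."}],
       "max_tokens": 32}'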
Step 3: Start Multi-modal AI Studio
In a new terminal (keep Riva and vLLM running), clone the Multi-modal AI Studio repository and install it in a virtual environment:
git clone https://github.com/NVIDIA-AI-IOT/multi_modal_ai_studio.git ~/multi_modal_ai_studio
cd ~/multi_modal_ai_studio
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install in development mode
pip install -e .
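If the install succeeded, the CLI entry point should now be on your PATH inside the virtual environment. A quick check (assuming the tool follows the usual --help convention):

multi-modal-ai-studio --help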
Then launch Multi-modal AI Studio:
multi-modal-ai-studio --port 8092 \
--asr-server localhost:50051 \
--tts-server localhost:50051 \
--llm-api-base http://localhost:8010/v1 \
--llm-model nvidia/cosmos-reason2-8b-fp8
Access the Interface
On your client PC browser, navigate to:
https://<JETSON_IP>:8092
Accept the self-signed SSL certificate (same process as Live VLM WebUI — click Advanced → Proceed).
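If the page does not load, you can confirm the web server is up from the Jetson itself (a quick sketch; the -k flag tells curl to skip verification of the self-signed certificate):

curl -sk -o /dev/null -w '%{http_code}\n' https://localhost:8092

Any HTTP status code (e.g., 200) means the server is reachable and the problem lies on the network path or in the browser.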
Configure the Pipeline
Click “New Voice Chat” to activate the configuration panel. Walk through each tab:
1. ASR Tab → Select “NVIDIA Riva”
| Setting | Value |
|---|---|
| Server Address | localhost:50051 |
| ASR Language | en-US |
| ASR Model | parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer |
The Silero VAD variant provides better voice activity detection — it detects when you start and stop speaking, so the system knows when to begin transcription and when your turn is over.
2. LLM Tab
| Setting | Value |
|---|---|
| API Base URL | http://localhost:8010/v1 |
| Model | nvidia/cosmos-reason2-8b-fp8 |
| Utility Model | nvidia/cosmos-reason2-8b-fp8 |
| Enable Streaming Responses | ✅ Checked |
| Include Conversation History | ✅ Checked |
| Enable Vision (VLM) | Video Input |
| System Prompt | See below |
Suggested system prompt for concise vision responses:
You are a vision assistant. Give ONE short sentence answers only. Be direct. No explanations. Use plain text only — no markdown or formatting.
📝 Why these settings?
- Streaming Responses lets TTS start speaking before the full LLM response is generated, reducing perceived latency
- Conversation History gives the LLM context from previous turns, enabling follow-up questions
- Vision (Video Input) captures frames from the camera and includes them in the LLM prompt
- System Prompt shapes the AI’s behavior — shorter responses mean faster TTS and a more conversational feel
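To see what streaming buys you, hit the vLLM endpoint directly with streaming enabled: tokens arrive incrementally as server-sent events rather than as one final payload (a minimal sketch against the standard OpenAI-compatible API; curl's -N flag disables output buffering):

# Each "data:" line carries a small delta of the response as it is generated
curl -sN http://localhost:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/cosmos-reason2-8b-fp8",
       "stream": true,
       "messages": [{"role": "user", "content": "Describe the sky in one sentence."}]}'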
3. TTS Tab
| Setting | Value |
|---|---|
| Riva Server | localhost:50051 |
| TTS Model | magpie_tts_ensemble_Magpie-Multilingual |
| Language | English (US) |
| Sample Rate (Hz) | 22050 |
| Quality | High (Better) |
| Start speaking before LLM finishes | ✅ Checked |
| Words before first speech | 10 |
“Start speaking before LLM finishes” is key for low latency — TTS begins synthesizing after the first 10 words arrive from the LLM, rather than waiting for the complete response.
4. Devices Tab
| Setting | Value |
|---|---|
| Camera Device | Default (browser) |
| Microphone Device | Default (browser) |
| Speaker Device | Default (browser) |
These use your client PC’s browser devices via WebRTC.
5. App Tab
| Setting | Value |
|---|---|
| Start sessions with microphone muted | ❌ Unchecked |
| Barge-in | ❌ Unchecked |
| Session Directory | Default (sessions) |
Start a Session
- Press “Start Session”
- Start speaking — watch the timeline at the bottom as it visualizes each stage:
- 🔵 ASR transcribing your speech
- 🟠 LLM generating a response
- 🔴 TTS synthesizing audio
- When done, click the red stop button to end the session
- Review your session by clicking it in the session history sidebar — you’ll see the full transcript, timeline, and latency metrics
🚑 Troubleshooting
bash: multi-modal-ai-studio: command not found
You need to activate the Python virtual environment first:
cd ~/multi_modal_ai_studio
source .venv/bin/activate

Then re-run the multi-modal-ai-studio command.
ASR transcription does not start after muting for a while
The ASR engine times out when it stops receiving audio data for an extended period. Unmuting won’t resume transcription in the current session. Click the red stop button to end the session, then press “Start Session” again to begin a fresh one.
OSError: [Errno 98] error while attempting to bind on address ('0.0.0.0', 8092): address already in use
The application is already running on port 8092. Kill the existing process first:
fuser -k 8092/tcp

Then restart the application.
GPU memory not released after stopping vLLM
Even after stopping the vLLM container, GPU memory may remain allocated. Run:
sudo sysctl -w vm.drop_caches=3

More Things to Try
- Change the system prompt — Try a fun personality like: “You are a cat. You are the smartest cat in the world and can assist the user with anything, but you do it in a playful feline manner while behaving like a cute lovable cat.”
- Enable Barge-in (App tab) — Interrupt the AI mid-speech and see how the pipeline handles it
- Press f to make the video preview full screen (h for help with keyboard shortcuts)
- Turn off “Start speaking before LLM finishes” — Compare the timeline to see how much latency this feature saves
- Try the server camera — Hook up a USB camera to Jetson and select it under Devices
- Save a preset — Once you find a configuration you like, save it as a preset for quick recall