Voice + Vision with Multi-modal AI Studio
Build speech AI pipelines with Multi-modal AI Studio on Jetson.
In this chapter, you’ll explore voice-driven AI interactions using Multi-modal AI Studio on Jetson Thor.
📍 Run on Jetson
All commands in this lab should be run in your Jetson terminal (SSH session), not on your client PC.
Conversational AI Pipeline
A complete conversational AI pipeline consists of three core components:
| Component | Function | Example Models |
|---|---|---|
| ASR (Automatic Speech Recognition) | Detects speech activity (VAD) and converts spoken audio to text | NVIDIA Parakeet, Whisper |
| LLM/VLM (Language/Vision Model) | Generates intelligent responses with optional visual understanding | Llama, Qwen, Cosmos-Reason |
| TTS (Text-to-Speech) | Converts text responses to natural speech | NVIDIA Magpie, Kokoro TTS, Piper TTS |
Multi-modal AI Studio
Multi-modal AI Studio is a reference application for building and evaluating speech and vision-enabled AI pipelines on Jetson. It’s designed to help you:
- Evaluate different backend models — The pipeline is modular and built on standard web APIs (OpenAI-compatible, Riva’s gRPC, etc.), so you can swap ASR, LLM, and TTS backends independently to compare models side by side
- Configure the audio pipeline visually — A GUI lets you easily adjust pipeline parameters like VAD sensitivity, ASR settings, LLM prompts, and TTS voice — no code changes needed
- Analyze and minimize latency — Built-in timeline visualization and per-stage latency metrics help you identify bottlenecks and evaluate strategies to reduce end-to-end response time
Architectural Setup
Step 1: Start NVIDIA Riva (ASR + TTS)
NVIDIA Riva provides the ASR and TTS services for the voice pipeline. It runs as a Docker container exposing a gRPC endpoint on port 50051.
cd ~/.cache/riva/riva_quickstart_arm64_v2.24.0
bash riva_start.sh config.sh -s
Wait for the server to be ready. You can monitor the logs:
docker logs -f riva-speech
Look for:
Riva server listening on 0.0.0.0:50051
All models loaded successfully
This may take 2–5 minutes on first startup as models are loaded into GPU memory.
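As an extra sanity check beyond the logs, you can confirm the container is running and the gRPC port is open (a quick sketch; assumes the standard ss tool from iproute2, which ships with JetPack):

# Confirm the Riva container is running
docker ps --filter name=riva-speech --format '{{.Names}}: {{.Status}}'
# Confirm the gRPC endpoint is listening on port 50051
ss -tln | grep 50051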
📦 Pre-configured for This Workshop
The Riva quickstart bundle is already downloaded, configured, and initialized on your Jetson Thor. You only need to run riva_start.sh.
On your own Jetson, the full setup involves:
- Install NGC CLI and configure credentials
- Download the Riva ARM64 quickstart (ngc registry resource download-version nvidia/riva/riva_quickstart_arm64:2.24.0)
- Edit config.sh to select ASR/TTS models and the Jetson platform
- Run riva_init.sh to download Docker images and models (~15–45 min)
- Run riva_start.sh
See the NVIDIA Riva documentation for the full setup guide.
Step 2: Start vLLM with Cosmos-Reason2
Next, start the LLM backend. We’ll use Cosmos-Reason2 on vLLM again, but on a different port — Riva’s container occupies ports 8000–8002, so we use port 8010.
First, drop the filesystem page cache so Jetson's unified memory is free for the model weights:

sudo sysctl -w vm.drop_caches=3

Then start the vLLM container:
sudo docker run -it --rm --runtime=nvidia --network host \
-v ~/models/cosmos-reason2-8b:/models/cosmos-reason2-8b:ro \
-v ${HOME}/.cache/vllm:/root/.cache/vllm \
ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \
vllm serve /models/cosmos-reason2-8b \
--served-model-name nvidia/cosmos-reason2-8b-fp8 \
--max-model-len 8192 \
--gpu-memory-utilization 0.7 \
--reasoning-parser qwen3 \
--media-io-kwargs '{"video": {"num_frames": -1}}' \
--enable-prefix-caching \
--port 8010
Wait for vLLM to be ready:
INFO: Uvicorn running on http://0.0.0.0:8010
⚠️ Port Conflict with Riva
The Riva container exposes ports 8000–8002 (and 8888, 50051). Always use a different port for vLLM when running alongside Riva. We use --port 8010 here.
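Before configuring the UI, you can verify the endpoint responds. vLLM exposes the standard OpenAI-compatible REST API, so plain curl works (a minimal sketch, run on the Jetson):

# List served models; the response should include nvidia/cosmos-reason2-8b-fp8
curl -s http://localhost:8010/v1/models
# Send a one-shot chat completion to confirm inference works end to end
curl -s http://localhost:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/cosmos-reason2-8b-fp8",
       "messages": [{"role": "user", "content": "Reply with one short sentence."}],
       "max_tokens": 32}'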
Step 3: Start Multi-modal AI Studio
In a new terminal (keep Riva and vLLM running), clone the Multi-modal AI Studio repository and install it in a virtual environment:
git clone https://github.com/NVIDIA-AI-IOT/multi_modal_ai_studio.git ~/multi_modal_ai_studio
cd ~/multi_modal_ai_studio
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install in development mode
pip install -e .
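If the install succeeded, the CLI entry point should now be on your PATH inside the virtual environment. A quick check (assuming the tool follows the usual --help convention):

multi-modal-ai-studio --help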
Then launch Multi-modal AI Studio:
multi-modal-ai-studio --port 8092 \
--asr-server localhost:50051 \
--tts-server localhost:50051 \
--llm-api-base http://localhost:8010/v1 \
--llm-model nvidia/cosmos-reason2-8b-fp8
Access the Interface
On your client PC browser, navigate to:
https://<JETSON_IP>:8092
Accept the self-signed SSL certificate (same process as Live VLM WebUI — click Advanced → Proceed).
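If the page does not load, you can confirm the web server is up from the Jetson itself (a quick sketch; the -k flag tells curl to skip verification of the self-signed certificate):

curl -sk -o /dev/null -w '%{http_code}\n' https://localhost:8092

Any HTTP status code (e.g., 200) means the server is reachable and the problem lies on the network path or in the browser.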
Configure the Pipeline
Click “New Voice Chat” to activate the configuration panel. Walk through each tab:
1. ASR Tab → Select “NVIDIA Riva”
| Setting | Value |
|---|---|
| Server Address | localhost:50051 |
| ASR Language | en-US |
| ASR Model | parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer |
The Silero VAD variant provides better voice activity detection — it detects when you start and stop speaking, so the system knows when to begin transcription and when your turn is over.
2. LLM Tab
| Setting | Value |
|---|---|
| API Base URL | http://localhost:8010/v1 |
| Model | nvidia/cosmos-reason2-8b-fp8 |
| Utility Model | nvidia/cosmos-reason2-8b-fp8 |
| Enable Streaming Responses | ✅ Checked |
| Include Conversation History | ✅ Checked |
| Enable Vision (VLM) | Video Input |
| System Prompt | See below |
Suggested system prompt for concise vision responses:
You are a vision assistant. Give ONE short sentence answers only. Be direct. No explanations. Use plain text only — no markdown or formatting.
📝 Why these settings?
- Streaming Responses lets TTS start speaking before the full LLM response is generated, reducing perceived latency
- Conversation History gives the LLM context from previous turns, enabling follow-up questions
- Vision (Video Input) captures frames from the camera and includes them in the LLM prompt
- System Prompt shapes the AI’s behavior — shorter responses mean faster TTS and a more conversational feel
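To see what streaming buys you, hit the vLLM endpoint directly with streaming enabled: tokens arrive incrementally as server-sent events rather than as one final payload (a minimal sketch against the standard OpenAI-compatible API; curl's -N flag disables output buffering):

# Each "data:" line carries a small delta of the response as it is generated
curl -sN http://localhost:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/cosmos-reason2-8b-fp8",
       "stream": true,
       "messages": [{"role": "user", "content": "Describe the sky in one sentence."}]}'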
3. TTS Tab
| Setting | Value |
|---|---|
| Riva Server | localhost:50051 |
| TTS Model | magpie_tts_ensemble_Magpie-Multilingual |
| Language | English (US) |
| Sample Rate (Hz) | 22050 |
| Quality | High (Better) |
| Start speaking before LLM finishes | ✅ Checked |
| Words before first speech | 10 |
“Start speaking before LLM finishes” is key for low latency — TTS begins synthesizing after the first 10 words arrive from the LLM, rather than waiting for the complete response.
4. Devices Tab
| Setting | Value |
|---|---|
| Camera Device | Default (browser) |
| Microphone Device | Default (browser) |
| Speaker Device | Default (browser) |
These use your client PC’s browser devices via WebRTC.
5. App Tab
| Setting | Value |
|---|---|
| Start sessions with microphone muted | ❌ Unchecked |
| Barge-in | ❌ Unchecked |
| Session Directory | Default (sessions) |
Start a Session
- Press “Start Session”
- Start speaking — watch the timeline at the bottom as it visualizes each stage:
- 🔵 ASR transcribing your speech
- 🟠 LLM generating a response
- 🔴 TTS synthesizing audio
- When done, click the red stop button to end the session
- Review your session by clicking it in the session history sidebar — you’ll see the full transcript, timeline, and latency metrics
🚑 Troubleshooting
bash: multi-modal-ai-studio: command not found
You need to activate the Python virtual environment first:
cd ~/multi_modal_ai_studio
source .venv/bin/activate

Then re-run the multi-modal-ai-studio command.
ASR transcription does not start after muting for a while
The ASR engine times out when it stops receiving audio data for an extended period. Unmuting won’t resume transcription in the current session. Click the red stop button to end the session, then press “Start Session” again to begin a fresh one.
OSError: [Errno 98] error while attempting to bind on address ('0.0.0.0', 8092): address already in use
The application is already running on port 8092. Kill the existing process first:
fuser -k 8092/tcp

Then restart the application.
GPU memory not released after stopping vLLM
Even after stopping the vLLM container, GPU memory may remain allocated. Run:
sudo sysctl -w vm.drop_caches=3

More Things to Try
- Change the system prompt — Try a fun personality like: “You are a cat. You are the smartest cat in the world and can assist the user with anything, but you do it in a playful feline manner while behaving like a cute lovable cat.”
- Enable Barge-in (App tab) — Interrupt the AI mid-speech and see how the pipeline handles it
- Press f to make the video preview full screen (h for help with keyboard shortcuts)
- Turn off “Start speaking before LLM finishes” — Compare the timeline to see how much latency this feature saves
- Try the server camera — Hook up a USB camera to Jetson and select it under Devices
- Save a preset — Once you find a configuration you like, save it as a preset for quick recall