GTC 2026 Workshop • Chapter 3 of 3
~20 min Hands-On

Voice + Vision with Multi-modal AI Studio

Build speech AI pipelines with Multi-modal AI Studio on Jetson.

In this chapter, you’ll explore voice-driven AI interactions using Multi-modal AI Studio on Jetson Thor.

📍 Run on Jetson

All commands in this lab should be run in your Jetson terminal (SSH session), not on your client PC.

Conversational AI Pipeline

A complete conversational AI pipeline consists of three core components:

| Component | Function | Example Models |
|---|---|---|
| ASR (Automatic Speech Recognition) | Detects speech activity (VAD) and converts spoken audio to text | NVIDIA Parakeet, Whisper |
| LLM/VLM (Language/Vision Model) | Generates intelligent responses with optional visual understanding | Llama, Qwen, Cosmos-Reason |
| TTS (Text-to-Speech) | Converts text responses to natural speech | NVIDIA Magpie, Kokoro TTS, Piper TTS |
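The three stages chain together, each feeding its output to the next. A minimal sketch of that data flow (the function names here are illustrative stand-ins, not the Multi-modal AI Studio API):

```python
# Toy sketch of the conversational loop: audio -> text -> text -> audio.
# Each function is a placeholder for the real backend service.

def asr(audio: bytes) -> str:
    """Stand-in for Riva ASR: spoken audio in, transcript out."""
    return "what do you see"          # placeholder transcript

def llm(transcript: str) -> str:
    """Stand-in for the LLM/VLM: text (plus optional frames) in, reply out."""
    return f"You asked: {transcript}"

def tts(reply: str) -> bytes:
    """Stand-in for Riva TTS: text in, synthesized audio out."""
    return reply.encode("utf-8")      # placeholder audio bytes

def pipeline(audio: bytes) -> bytes:
    # ASR -> LLM -> TTS, each stage feeding the next
    return tts(llm(asr(audio)))

print(pipeline(b"\x00\x01"))
```

Because each stage only depends on the previous stage's output, any one backend can be swapped without touching the other two — which is exactly what Multi-modal AI Studio exploits.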

Multi-modal AI Studio

Multi-modal AI Studio is a reference application for building and evaluating speech and vision-enabled AI pipelines on Jetson. It’s designed to help you:

  1. Evaluate different backend models — The pipeline is modular and built on standard web APIs (OpenAI-compatible, Riva’s gRPC, etc.), so you can swap ASR, LLM, and TTS backends independently to compare models side by side
  2. Configure the audio pipeline visually — A GUI lets you easily adjust pipeline parameters like VAD sensitivity, ASR settings, LLM prompts, and TTS voice — no code changes needed
  3. Analyze and minimize latency — Built-in timeline visualization and per-stage latency metrics help you identify bottlenecks and evaluate strategies to reduce end-to-end response time

Architectural Setup

Step 1: Start NVIDIA Riva (ASR + TTS)

NVIDIA Riva provides the ASR and TTS services for the voice pipeline. It runs as a Docker container exposing a gRPC endpoint on port 50051.

cd ~/.cache/riva/riva_quickstart_arm64_v2.24.0
bash riva_start.sh config.sh -s

Wait for the server to be ready. You can monitor the logs:

docker logs -f riva-speech

Look for:

Riva server listening on 0.0.0.0:50051
All models loaded successfully

This may take 2–5 minutes on first startup as models are loaded into GPU memory.
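If you want to script the wait instead of watching the logs by hand, a small (hypothetical) helper can scan `docker logs riva-speech` output for the two readiness markers quoted above:

```python
# Illustrative readiness check: scan Riva's log output for the two
# marker lines quoted above. `riva_ready` is a helper written for this
# lab, not part of the Riva toolkit.

READY_MARKERS = (
    "Riva server listening on 0.0.0.0:50051",
    "All models loaded successfully",
)

def riva_ready(log_text: str) -> bool:
    """True once both readiness markers appear in the log output."""
    return all(marker in log_text for marker in READY_MARKERS)

sample = (
    "Loading models...\n"
    "All models loaded successfully\n"
    "Riva server listening on 0.0.0.0:50051\n"
)
print(riva_ready(sample))
```

In practice you would feed this the output of `docker logs riva-speech` in a polling loop and move on once it returns `True`.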

📦 Pre-configured for This Workshop

The Riva quickstart bundle is already downloaded, configured, and initialized on your Jetson Thor. You only need to run riva_start.sh.

On your own Jetson, the full setup involves:

  1. Install NGC CLI and configure credentials
  2. Download the Riva ARM64 quickstart (ngc registry resource download-version nvidia/riva/riva_quickstart_arm64:2.24.0)
  3. Edit config.sh to select ASR/TTS models and Jetson platform
  4. Run riva_init.sh to download Docker images and models (~15–45 min)
  5. Run riva_start.sh

See the NVIDIA Riva documentation for the full setup guide.

Step 2: Start vLLM with Cosmos-Reason2

Next, start the LLM backend. We’ll use Cosmos-Reason2 on vLLM again, but on a different port — Riva’s container occupies ports 8000–8002, so we use port 8010.

sudo sysctl -w vm.drop_caches=3

sudo docker run -it --rm --runtime=nvidia --network host \
  -v ~/models/cosmos-reason2-8b:/models/cosmos-reason2-8b:ro \
  -v ${HOME}/.cache/vllm:/root/.cache/vllm \
  ghcr.io/nvidia-ai-iot/vllm:0.14.0-r38.3-arm64-sbsa-cu130-24.04 \
  vllm serve /models/cosmos-reason2-8b \
    --served-model-name nvidia/cosmos-reason2-8b-fp8 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.7 \
    --reasoning-parser qwen3 \
    --media-io-kwargs '{"video": {"num_frames": -1}}' \
    --enable-prefix-caching \
    --port 8010

Wait for vLLM to be ready:

INFO:     Uvicorn running on http://0.0.0.0:8010

⚠️ Port Conflict with Riva

The Riva container exposes ports 8000–8002 (and 8888, 50051). Always use a different port for vLLM when running alongside Riva. We use --port 8010 here.
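If you're unsure whether a port is safe to hand to vLLM, you can check it locally before launching. This `port_is_free` helper is written for illustration (it is not part of vLLM or Riva):

```python
import socket

# Riva occupies 8000-8002 (plus 8888 and 50051). Before launching vLLM,
# confirm your candidate port is outside that set and nothing is
# already bound to it.

RIVA_PORTS = {8000, 8001, 8002, 8888, 50051}

def port_is_free(port: int) -> bool:
    """Try to bind the port; success means nothing is listening there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("0.0.0.0", port))
            return True
        except OSError:
            return False

candidate = 8010
assert candidate not in RIVA_PORTS
print(port_is_free(candidate))
```

The same bind-failure is what produces the "address already in use" error covered in Troubleshooting below.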

Step 3: Start Multi-modal AI Studio

In a new terminal (keep Riva and vLLM running), clone the Multi-modal AI Studio repository, set up, and start the application:

git clone https://github.com/NVIDIA-AI-IOT/multi_modal_ai_studio.git ~/multi_modal_ai_studio
cd ~/multi_modal_ai_studio

# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install in development mode
pip install -e .

Then launch Multi-modal AI Studio:

multi-modal-ai-studio --port 8092 \
  --asr-server localhost:50051 \
  --tts-server localhost:50051 \
  --llm-api-base http://localhost:8010/v1 \
  --llm-model nvidia/cosmos-reason2-8b-fp8

Access the Interface

On your client PC browser, navigate to:

https://<JETSON_IP>:8092

Accept the self-signed SSL certificate (same process as Live VLM WebUI — click Advanced → Proceed).

Configure the Pipeline

Click “New Voice Chat” to activate the configuration panel. Walk through each tab:

1. ASR Tab → Select “NVIDIA Riva”

| Setting | Value |
|---|---|
| Server Address | localhost:50051 |
| ASR Language | en-US |
| ASR Model | parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer |

This model variant includes Silero-based voice activity detection — it detects when you start and stop speaking, so the system knows when to begin transcription and when your turn is over.
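The real Silero VAD is a neural model, but the turn-taking logic it enables can be illustrated with a toy state machine over per-frame speech/silence decisions: open the turn on the first speech frame, close it after a run of trailing silence.

```python
# Toy illustration of VAD-driven turn detection. Real VADs emit a
# per-frame speech probability; here we take booleans as given and
# show only the start/stop bookkeeping.

def detect_turns(speech_flags, end_silence_frames=3):
    """speech_flags: per-frame booleans (True = speech).
    Returns (start, end) frame indices (end exclusive) per utterance."""
    turns, start, silence = [], None, 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            if start is None:
                start = i              # turn opens on first speech frame
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= end_silence_frames:
                # enough trailing silence: the user's turn is over
                turns.append((start, i - end_silence_frames + 1))
                start, silence = None, 0
    if start is not None:              # utterance ran to end of stream
        turns.append((start, len(speech_flags)))
    return turns

flags = [False, True, True, True, False, False, False, True, True]
print(detect_turns(flags))
```

The `end_silence_frames` threshold is the same trade-off the GUI's VAD sensitivity settings expose: too short and the system cuts you off mid-sentence, too long and responses feel sluggish.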

2. LLM Tab

| Setting | Value |
|---|---|
| API Base URL | http://localhost:8010/v1 |
| Model | nvidia/cosmos-reason2-8b-fp8 |
| Utility Model | nvidia/cosmos-reason2-8b-fp8 |
| Enable Streaming Responses | ✅ Checked |
| Include Conversation History | ✅ Checked |
| Enable Vision (VLM) | Video Input |
| System Prompt | See below |

Suggested system prompt for concise vision responses:

You are a vision assistant. Give ONE short sentence answers only. Be direct. No explanations. Use plain text only — no markdown or formatting.

📝 Why these settings?

  • Streaming Responses lets TTS start speaking before the full LLM response is generated, reducing perceived latency
  • Conversation History gives the LLM context from previous turns, enabling follow-up questions
  • Vision (Video Input) captures frames from the camera and includes them in the LLM prompt
  • System Prompt shapes the AI’s behavior — shorter responses mean faster TTS and a more conversational feel
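These settings map directly onto the request the studio sends to the LLM. A hedged sketch of a multi-turn, streaming, vision-enabled payload (field layout follows the OpenAI chat-completions convention; the frame URL is a placeholder):

```python
# Sketch of how the LLM tab settings shape each request body.
# SYSTEM_PROMPT is abbreviated from the suggested prompt above.

SYSTEM_PROMPT = "You are a vision assistant. Give ONE short sentence answers only."

history = []  # grows across turns when "Include Conversation History" is on

def build_request(user_text, frame_url=None, stream=True):
    content = [{"type": "text", "text": user_text}]
    if frame_url:  # "Enable Vision" attaches the current camera frame
        content.append({"type": "image_url", "image_url": {"url": frame_url}})
    return {
        "model": "nvidia/cosmos-reason2-8b-fp8",
        "messages": [{"role": "system", "content": SYSTEM_PROMPT}]
                    + history
                    + [{"role": "user", "content": content}],
        "stream": stream,  # lets TTS start before the reply is complete
    }

req = build_request("What do you see?", frame_url="data:image/jpeg;base64,...")
print(len(req["messages"]))
```

Each completed turn would append the user message and assistant reply to `history`, which is what gives the model context for follow-up questions.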

3. TTS Tab

| Setting | Value |
|---|---|
| Riva Server | localhost:50051 |
| TTS Model | magpie_tts_ensemble_Magpie-Multilingual |
| Language | English (US) |
| Sample Rate (Hz) | 22050 |
| Quality | High (Better) |
| Start speaking before LLM finishes | ✅ Checked |
| Words before first speech | 10 |

“Start speaking before LLM finishes” is key for low latency — TTS begins synthesizing after the first 10 words arrive from the LLM, rather than waiting for the complete response.
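The buffering behind this setting can be sketched in a few lines: accumulate the LLM's streamed tokens and hand the first chunk to TTS as soon as 10 words have arrived, then send the remainder. (This is an illustration of the idea, not the studio's actual implementation.)

```python
# Sketch of "start speaking before LLM finishes": flush an early chunk
# to TTS after the first N words, instead of waiting for the full reply.

WORDS_BEFORE_FIRST_SPEECH = 10

def chunk_for_tts(token_stream, first_chunk_words=WORDS_BEFORE_FIRST_SPEECH):
    """Yield an early first chunk, then the remainder of the reply."""
    words = []
    for token in token_stream:
        words.extend(token.split())
        if len(words) >= first_chunk_words:
            yield " ".join(words)          # TTS can start speaking now
            words = []
            break
    rest = [w for token in token_stream for w in token.split()]
    if rest or words:
        yield " ".join(words + rest)       # rest of the reply (or a short reply)

stream = iter("the quick brown fox jumps over the lazy dog near the river bank".split())
print(list(chunk_for_tts(stream)))
```

Lowering the word threshold shaves more latency but risks the first spoken chunk ending at an unnatural break; 10 words is the default trade-off used here.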

4. Devices Tab

| Setting | Value |
|---|---|
| Camera Device | Default (browser) |
| Microphone Device | Default (browser) |
| Speaker Device | Default (browser) |

These use your client PC’s browser devices via WebRTC.

5. App Tab

| Setting | Value |
|---|---|
| Start sessions with microphone muted | ❌ Unchecked |
| Barge-in | ❌ Unchecked |
| Session Directory | Default (sessions) |

Start a Session

  1. Press “Start Session”
  2. Start speaking — watch the timeline at the bottom as it visualizes each stage:
    • 🔵 ASR transcribing your speech
    • 🟠 LLM generating a response
    • 🔴 TTS synthesizing audio
  3. When done, click the red stop button to end the session
  4. Review your session by clicking it in the session history sidebar — you’ll see the full transcript, timeline, and latency metrics
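The per-stage latency metrics in the session view boil down to simple differences between stage timestamps. A sketch with illustrative numbers (timestamps in seconds from the end of your speech):

```python
# Illustrative per-stage latency math behind the timeline view.
# Timestamps are made-up example values, not measured results.

events = {
    "speech_end": 0.00,   # user stops talking (VAD end-of-turn)
    "asr_final":  0.25,   # final transcript available
    "llm_first":  0.80,   # first LLM token arrives
    "tts_audio":  1.10,   # first synthesized audio plays
}

stages = {
    "ASR": events["asr_final"] - events["speech_end"],
    "LLM (first token)": events["llm_first"] - events["asr_final"],
    "TTS (first audio)": events["tts_audio"] - events["llm_first"],
}

for name, dt in stages.items():
    print(f"{name:18s} {dt * 1000:5.0f} ms")

# Perceived end-to-end latency: the silence between the user's last
# word and the assistant's first audible word.
e2e = events["tts_audio"] - events["speech_end"]
print(f"{'End-to-end':18s} {e2e * 1000:5.0f} ms")
```

Whichever stage dominates this breakdown is where tuning pays off — e.g. streaming responses and early TTS attack the LLM and TTS terms.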

🚑 Troubleshooting

bash: multi-modal-ai-studio: command not found

You need to activate the Python virtual environment first:

cd ~/multi_modal_ai_studio
source .venv/bin/activate

Then re-run the multi-modal-ai-studio command.

ASR transcription does not start after muting for a while

The ASR engine times out when it stops receiving audio data for an extended period. Unmuting won’t resume transcription in the current session. Click the red stop button to end the session, then press “Start Session” again to begin a fresh one.

OSError: [Errno 98] error while attempting to bind on address ('0.0.0.0', 8092): address already in use

The application is already running on port 8092. Kill the existing process first:

fuser -k 8092/tcp

Then restart the application.

GPU memory not released after stopping vLLM

Even after stopping the vLLM container, GPU memory may remain allocated. Run:

sudo sysctl -w vm.drop_caches=3

More Things to Try

  • Change the system prompt — Try a fun personality like: “You are a cat. You are the smartest cat in the world and can assist the user with anything, but you do it in a playful feline manner while behaving like a cute lovable cat.”
  • Enable Barge-in (App tab) — Interrupt the AI mid-speech and see how the pipeline handles it
  • Press f to make the video preview full screen (h for help with keyboard shortcuts)
  • Turn off “Start speaking before LLM finishes” — Compare the timeline to see how much latency this feature saves
  • Try the server camera — Hook up a USB camera to Jetson and select it under Devices
  • Save a preset — Once you find a configuration you like, save it as a preset for quick recall