Understanding LLMs and AI Microservices
Learn how to run and integrate AI microservices on NVIDIA Jetson for edge deployment.
In this chapter, you’ll learn about AI microservices architecture and how to deploy inference services on Jetson Thor using Ollama with an OpenAI-compatible API.
📍 Run on Jetson
All commands in this lab should be run in your Jetson terminal (SSH session), not on your client PC.
What are AI Microservices?
AI microservices package AI models as standalone services with standardized APIs. Instead of embedding models directly into applications, you deploy them as independent services that applications can call.
Why Microservices for Edge AI?
| Traditional Approach | Microservices Approach |
|---|---|
| Model embedded in app | Model runs as separate service |
| Rebuild app to update model | Update model independently |
| One app = one model | Many apps share one model |
| Custom API per model | Standardized API (OpenAI-compatible) |
Ollama as an AI Microservice
Ollama makes it easy to run LLMs and VLMs locally as microservices with an OpenAI-compatible API.
Why Ollama on Jetson Thor?
- Simple deployment: One command to download and run models
- OpenAI API: Drop-in replacement for cloud APIs on port 11434
- Model management: Easy to pull, list, and switch models
- Low overhead: Lightweight server optimized for edge devices
Step 1: Verify Ollama is Running
Ollama is pre-installed on your Jetson Thor. Let’s verify it’s working.
ollama --version
💡 If Ollama is not installed
Linux / macOS:
curl -fsSL https://ollama.com/install.sh | sh
Windows (PowerShell):
irm https://ollama.com/install.ps1 | iex
List Available Models
ollama list
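Applications can run the same check programmatically by probing the OpenAI-compatible models endpoint. Below is a minimal sketch using only the Python standard library; the `ollama_is_up` helper name is ours, not part of any API:

```python
import json
import urllib.request
import urllib.error

def ollama_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an Ollama server answers at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            data = json.load(resp)
            # OpenAI-style list response: {"object": "list", "data": [...]}
            return isinstance(data.get("data"), list)
    except (urllib.error.URLError, OSError, ValueError):
        return False

if __name__ == "__main__":
    print("Ollama running:", ollama_is_up())
```

The function fails closed: any connection error, timeout, or malformed response simply returns False, which makes it safe to call from a startup script.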
Step 2: Run a Model Interactively
Start a chat session with Nemotron-3-Nano (pre-installed on your Jetson Thor):
ollama run nemotron-3-nano:latest
Try a prompt:
>>> Explain Physical AI in one sentence.
Press Ctrl+D or type /bye to exit.
Step 3: Test the API
Ollama runs an OpenAI-compatible API server on port 11434.
List Available Models via API
curl http://localhost:11434/v1/models
Chat Completion Request
curl -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:9b",
    "messages": [
      {"role": "user", "content": "Explain Physical AI in one sentence."}
    ],
    "max_tokens": 100
  }'
You’ll receive a JSON response with the model’s completion.
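In Python, the reply text can be pulled out of that JSON in a few lines. The sample below is an abridged, illustrative response (real responses also carry usage and timing fields); the field layout follows the OpenAI chat-completions schema:

```python
import json

# Illustrative, abridged response body -- not captured from a real server.
sample = '''
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "qwen3.5:9b",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Physical AI is ..."},
      "finish_reason": "stop"
    }
  ]
}
'''

response = json.loads(sample)
# The assistant's text lives at choices[0].message.content.
reply = response["choices"][0]["message"]["content"]
print(reply)
```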
💡 OpenAI-Compatible
This is the same API format used by OpenAI’s GPT models — see the OpenAI Chat Completions API reference. Any application built for OpenAI can work with your local Ollama server!
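Because the request format matches OpenAI's, pointing an existing client at Ollama usually means changing only the base URL. Here is a minimal sketch using only the Python standard library; it assumes the server is on localhost:11434 and that the model from the curl example has been pulled (the `chat` helper name is ours, not part of any API):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

# Same payload shape as the curl example -- and as OpenAI's own API.
payload = {
    "model": "qwen3.5:9b",
    "messages": [
        {"role": "user", "content": "Explain Physical AI in one sentence."}
    ],
    "max_tokens": 100,
}

def chat(url=OLLAMA_URL, body=payload):
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        out = json.load(resp)
    return out["choices"][0]["message"]["content"]

# Calling chat() requires a running Ollama server with the model pulled:
#   print(chat())
```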
Architecture Overview
Here’s how AI microservices work — the model runs as a server that exposes an API endpoint, and any application can interact with it through standard HTTP requests:
Multiple applications can share the same inference service, reducing resource usage and simplifying deployment.
Step 4: Open WebUI — Chat Interface
Now let’s connect a web-based chat interface to your Ollama microservice. This demonstrates how any application can use the OpenAI-compatible API.
Launch Open WebUI
docker run -d --network=host \
  -v ${HOME}/open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
Access the Interface
Open a browser and navigate to:
http://<JETSON_IP>:8080
Quick Setup
- Click “Get started” on the welcome screen
- Create an account (e.g., jetson@jetson.com / jetson)
- Select qwen3.5:9b from the model dropdown
- Start chatting with your local model!
🔒 Privacy
All account information stays local — nothing is verified or stored externally. Your data never leaves your Jetson.
💡 Why Open WebUI?
This shows the power of microservices: Ollama runs the model, Open WebUI provides the interface, and they communicate via the standard OpenAI API. You can swap either component independently.
Bonus Challenge: Try a Vision Language Model (VLM)
Ollama’s model library includes many vision-capable models — these are Vision Language Models (VLMs) that can understand both text and images.
Challenge yourself to quickly get a VLM running and use it through Open WebUI:
- Pull a vision-capable model:
  ollama pull gemma3:4b
- In Open WebUI, select gemma3:4b from the model dropdown
- Attach an image file to your message and ask the model to describe it — you’re running multimodal AI locally on your Jetson!
📝 Gemma3 Vision Support
gemma3 supports vision input at 4B and above (4B, 12B, 27B). The 1B variant does not support image input.
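If you'd rather skip the browser, the same kind of multimodal request can be built against the OpenAI-compatible endpoint by embedding the image as a base64 data URL. This is a sketch under assumptions: the `image_message` helper and the placeholder JPEG bytes are ours for illustration, and it assumes your Ollama version accepts OpenAI-style image_url content parts and that gemma3:4b has been pulled:

```python
import base64
import json

def image_message(image_bytes, prompt):
    """Build an OpenAI-style multimodal user message (illustrative helper)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }

# Placeholder bytes stand in for a real JPEG read with open(path, "rb").
payload = {
    "model": "gemma3:4b",
    "messages": [image_message(b"\xff\xd8\xff\xe0", "Describe this image.")],
    "max_tokens": 200,
}

# POST this payload to http://localhost:11434/v1/chat/completions,
# exactly like the text-only request earlier in this chapter.
print(json.dumps(payload)[:60])
```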
Notice that Open WebUI only lets you upload static images one at a time. What if you wanted to continuously feed a live camera stream into a VLM — for real-time video understanding? There’s no built-in way to do that here.
That’s exactly what we’ll tackle in the next chapter!
Before Moving to Chapter 2
Stop Open WebUI (Optional)
In the next chapter, we don’t use Open WebUI, so you can stop the container:
docker stop open-webui
docker rm open-webui
Before starting the next chapter (which uses vLLM), stop Ollama to free up GPU memory:
sudo systemctl stop ollama
Verify it’s stopped:
nvidia-smi
You should see no Ollama processes using GPU memory.
💡 GPU Memory
For best GPU utilization, stop Ollama before starting vLLM in the next lab. Running both simultaneously splits VRAM between them.