Gemma 4 on Jetson
Run Google Gemma 4 models on Jetson with vLLM or llama.cpp. Covers E2B, E4B, 26B-A4B, and 31B on Orin and Thor, including reasoning, tool calling, and runtime selection.
Gemma 4 was released in four practical variants for Jetson: E2B, E4B, 26B-A4B, and 31B. The E2B and E4B models support audio, text, and image input with text output. 26B-A4B is the MoE model, and 31B is the larger dense model.
The full family is supported on Jetson through both vLLM and llama.cpp. All of the models run on Orin and Thor, but available memory decides what makes sense on each device. E2B is the best fit on Orin Nano. On Orin NX, E2B and E4B are the natural choices. On AGX Orin, both small models fit comfortably and perform well, and the larger models start to become realistic. On Thor, the whole family is the intended path.
💡 Runtime choice
In practice, vLLM tends to deliver better serving performance, while llama.cpp remains a good option if you want the GGUF path.
Prerequisites
| Requirement | Details |
|---|---|
| Devices | Jetson Orin Nano, Orin NX, AGX Orin, Jetson Thor |
| JetPack | JP 6 (L4T r36.x) for Orin, JP 7 (L4T r38.x) for Thor |
| Storage | NVMe SSD strongly recommended for model downloads and container caches |
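You can confirm which L4T release a device is running by reading the Tegra release file, a standard location on JetPack systems (the exact revision string varies by release):

cat /etc/nv_tegra_release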
What fits where
| Device | Best Gemma 4 choices |
|---|---|
| Orin Nano | E2B |
| Orin NX | E2B, E4B |
| AGX Orin | Full Gemma 4 family |
| Jetson Thor | Full Gemma 4 family |
Loading Gemma 4 with vLLM
If you are on Orin NX, AGX Orin, or Thor, this is the cleanest place to start. The flow is the same for the whole family. You mainly change the container image for your device and the model ID for the variant you want.
For Orin devices:

sudo docker run -it --rm --pull always --runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-orin \
vllm serve MODEL_ID \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4

For Thor:

sudo docker run -it --rm --pull always --runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-thor \
vllm serve MODEL_ID \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4

Use these model IDs:
| Model | Orin MODEL_ID | Thor MODEL_ID |
|---|---|---|
| E2B | google/gemma-4-E2B-it | google/gemma-4-E2B-it |
| E4B | google/gemma-4-E4B-it | google/gemma-4-E4B-it |
| 26B-A4B | cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 |
| 31B | cyankiwi/gemma-4-31B-it-AWQ-4bit | nvidia/Gemma-4-31B-IT-NVFP4 |
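As a concrete example, this serves E4B on AGX Orin; only the model ID changes from the template above:

sudo docker run -it --rm --pull always --runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-orin \
vllm serve google/gemma-4-E4B-it \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4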
🔊 Audio on E2B and E4B
E2B and E4B support audio input by default; you do not need to enable it separately.
🖼️ 26B-A4B and 31B
26B-A4B and 31B are the larger Gemma 4 models and take text and image input only; they are not the audio-capable part of the family.
🧠 Reasoning and tool calling
The important flags are --enable-auto-tool-choice, --reasoning-parser gemma4, and --tool-call-parser gemma4. If you want the Gemma 4 reasoning and tool-calling path ready from the start, keep those in the launch command.
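As a sketch of what tool calling looks like once the server is up, here is a minimal OpenAI-style request; the get_weather tool name and schema are placeholders for illustration, not part of Gemma 4:

curl -s http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
  "model": "google/gemma-4-E4B-it",
  "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "tool_choice": "auto"
}'

With --enable-auto-tool-choice and the gemma4 tool-call parser active, the response should come back as a structured tool_calls entry rather than raw text.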
Orin Nano
If you are on Orin Nano, E2B is a great fit and llama.cpp is the straightforward path.
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
llama-server -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_S
Then open http://localhost:8080 in a browser to use the built-in web UI.
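llama-server also exposes an OpenAI-compatible API on the same port, so you can test from the command line as well; a minimal sketch:

curl -s http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"messages": [{"role": "user", "content": "hi"}]}'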
ℹ️ Ollama on Orin Nano
Gemma 4 does not currently work with Ollama on Orin Nano. Ollama remains an option on the other devices if you prefer that workflow.
If you want the same GGUF-style llama.cpp flow on bigger Jetson devices, the pattern stays the same and you mainly swap the container image and checkpoint.
| Model | GGUF checkpoint |
|---|---|
| E2B | unsloth/gemma-4-E2B-it-GGUF:Q4_K_S |
| E4B | ggml-org/gemma-4-E4B-it-GGUF:Q4_K_M |
| 26B-A4B | ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M |
| 31B | ggml-org/gemma-4-31B-it-GGUF:Q4_K_M |
On Thor, use ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-thor instead.
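For example, serving the 31B checkpoint on Thor follows the same pattern with the Thor image and the GGUF from the table:

sudo docker run -it --rm --pull always --runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-thor \
llama-server -hf ggml-org/gemma-4-31B-it-GGUF:Q4_K_M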
⚠️ E2B audio on llama.cpp for Orin
There is currently an audio issue with E2B under llama.cpp. If audio is important for your setup, use the small Gemma 4 models through vLLM.
Reasoning and Tool Calling
Gemma 4 supports reasoning and tool calling, but reasoning is not enabled by default at request time. To turn it on for a request, pass enable_thinking through chat_template_kwargs:
{
"chat_template_kwargs": {
"enable_thinking": true
}
}
Here is a minimal request example:
curl -sN http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "google/gemma-4-E2B-it",
"messages": [{"role": "user", "content": "hi"}],
"chat_template_kwargs": {"enable_thinking": true},
"stream": true
}'
📝 Important
Even if you launched the server with the Gemma 4 parser flags, you still need chat_template_kwargs.enable_thinking=true in the request if you want thinking mode.
Audio Support
| Model | Input mode |
|---|---|
| E2B | text, image, audio |
| E4B | text, image, audio |
| 26B-A4B | text, image |
| 31B | text, image |
For Jetson, the practical takeaway is simple: use vLLM if you want the small-model audio path, and keep in mind that there is currently an audio issue for E2B under llama.cpp.
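As a sketch of what an audio request can look like against a running E2B vLLM server, assuming vLLM's audio_url multimodal content type and using a placeholder WAV URL:

curl -s http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
  "model": "google/gemma-4-E2B-it",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "audio_url", "audio_url": {"url": "https://example.com/sample.wav"}},
      {"type": "text", "text": "Transcribe this audio."}
    ]
  }]
}'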
Things to Watch Out For with vLLM
The main thing to watch for with Gemma 4 on vLLM is reasoning output handling. Streaming requests are generally fine. With non-streaming requests, watch for cases where the model's thought text leaks into content instead of being cleanly separated from the final answer.
If you are testing with non-streaming requests, try adding this to the request body:
{
"skip_special_tokens": false
}
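Put together, a non-streaming test request might look like this (skip_special_tokens is a vLLM-specific request parameter):

curl -s http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
  "model": "google/gemma-4-E2B-it",
  "messages": [{"role": "user", "content": "hi"}],
  "chat_template_kwargs": {"enable_thinking": true},
  "skip_special_tokens": false,
  "stream": false
}'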
Do not mix formats: use the listed vLLM checkpoints with the Gemma 4 vLLM containers, and the listed GGUF checkpoints with llama.cpp.
For 26B-A4B and 31B, startup problems are often memory-related rather than model-related.
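Before launching one of the large models, confirm how much memory is actually free; free and tegrastats are the usual tools on Jetson:

free -h          # overall memory and swap usage
sudo tegrastats  # live RAM and GPU utilization; Ctrl+C to stop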
Troubleshooting
If you are retrying a large model launch, clear the page cache first:
sudo sysctl -w vm.drop_caches=3
Before launching another model, make sure the previous server or container is no longer holding memory. If a model hangs during load or fails to start, free memory, clear caches, and retry with the exact command for your device.
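To check for and stop a leftover container (the container ID comes from whatever docker ps reports on your system):

sudo docker ps
sudo docker stop <container_id>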
Next Steps
- Browse the Supported Models page for copy/paste commands by device
- Read Introduction to GenAI on Jetson: How to Run LLMs and VLMs for the broader runtime picture
- Use Ollama on Jetson if you want a simpler local LLM workflow