Gemma 4 on Jetson

Run Google Gemma 4 models on Jetson with vLLM or llama.cpp. Covers E2B, E4B, 26B-A4B, and 31B on Orin and Thor, including reasoning, tool calling, and runtime selection.

Gemma 4 was released in four practical variants for Jetson: E2B, E4B, 26B-A4B, and 31B. The E2B and E4B models support audio, text, and image input with text output. 26B-A4B is the MoE model, and 31B is the larger dense model.

The full family is supported on Jetson through both vLLM and llama.cpp. All of the models run on Orin and Thor, but memory is what decides which variants make sense. On Orin Nano, E2B is the one that fits best. On Orin NX, E2B and E4B are the natural choices. On AGX Orin, both small models fit well and perform strongly for different use cases, and that is also where the larger models start to become realistic. On Thor, the whole family is the intended path.

💡 Runtime choice

In practice, vLLM tends to deliver better serving performance, while llama.cpp remains a good option if you want the GGUF path.

Prerequisites

| Requirement | Details |
| --- | --- |
| Devices | Jetson Orin Nano, Orin NX, AGX Orin, Jetson Thor |
| JetPack | JP 6 (L4T r36.x) for Orin, JP 7 (L4T r38.x) for Thor |
| Storage | NVMe SSD strongly recommended for model downloads and container caches |

What fits where

| Device | Best Gemma 4 choices |
| --- | --- |
| Orin Nano | E2B |
| Orin NX | E2B, E4B |
| AGX Orin | Full Gemma 4 family |
| Jetson Thor | Full Gemma 4 family |
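As a quick sketch, the table above can be encoded as a lookup. The device keys below are informal labels chosen for this example, not official identifiers:

```python
# Informal encoding of the "what fits where" table; device keys are
# illustrative labels, and the choices mirror the table exactly.
RECOMMENDED = {
    "orin-nano": ["E2B"],
    "orin-nx": ["E2B", "E4B"],
    "agx-orin": ["E2B", "E4B", "26B-A4B", "31B"],
    "thor": ["E2B", "E4B", "26B-A4B", "31B"],
}

def gemma4_choices(device: str) -> list:
    """Return the Gemma 4 variants that make sense on a given Jetson device."""
    try:
        return RECOMMENDED[device.lower()]
    except KeyError:
        raise ValueError(f"unknown device: {device!r}") from None

print(gemma4_choices("orin-nx"))  # ['E2B', 'E4B']
```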

Loading Gemma 4 with vLLM

If you are on Orin NX, AGX Orin, or Thor, this is the cleanest place to start. The flow is the same for the whole family. You mainly change the container image for your device and the model ID for the variant you want.

# Orin (Orin NX / AGX Orin)
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-orin \
  vllm serve MODEL_ID \
    --enable-auto-tool-choice \
    --reasoning-parser gemma4 \
    --tool-call-parser gemma4

# Thor
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-thor \
  vllm serve MODEL_ID \
    --enable-auto-tool-choice \
    --reasoning-parser gemma4 \
    --tool-call-parser gemma4

Use these model IDs:

| Model | Orin MODEL_ID | Thor MODEL_ID |
| --- | --- | --- |
| E2B | google/gemma-4-E2B-it | google/gemma-4-E2B-it |
| E4B | google/gemma-4-E4B-it | google/gemma-4-E4B-it |
| 26B-A4B | cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 |
| 31B | cyankiwi/gemma-4-31B-it-AWQ-4bit | nvidia/Gemma-4-31B-IT-NVFP4 |
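Since the command only varies by image tag and model ID, it can be assembled programmatically. This build_vllm_command helper is purely illustrative; the image tags, model IDs, and flags are the ones listed above:

```python
# Illustrative helper: compose the vLLM docker launch command from the
# tables above. Not an official tool; it only stitches together the
# image tags and model IDs documented on this page.
MODEL_IDS = {
    # (variant, device_family) -> model ID
    ("E2B", "orin"): "google/gemma-4-E2B-it",
    ("E2B", "thor"): "google/gemma-4-E2B-it",
    ("E4B", "orin"): "google/gemma-4-E4B-it",
    ("E4B", "thor"): "google/gemma-4-E4B-it",
    ("26B-A4B", "orin"): "cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    ("26B-A4B", "thor"): "bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4",
    ("31B", "orin"): "cyankiwi/gemma-4-31B-it-AWQ-4bit",
    ("31B", "thor"): "nvidia/Gemma-4-31B-IT-NVFP4",
}

def build_vllm_command(variant: str, device_family: str) -> str:
    model_id = MODEL_IDS[(variant, device_family)]
    image = f"ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-{device_family}"
    return (
        "sudo docker run -it --rm --pull always --runtime=nvidia --network host "
        "-v $HOME/.cache/huggingface:/root/.cache/huggingface "
        f"{image} vllm serve {model_id} "
        "--enable-auto-tool-choice --reasoning-parser gemma4 --tool-call-parser gemma4"
    )

print(build_vllm_command("26B-A4B", "thor"))
```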

🔊 Audio on E2B and E4B

If you are on E2B or E4B, you do not need to enable audio separately. It is supported by default on those models.

🖼️ 26B-A4B and 31B

26B-A4B and 31B are the larger text-and-image Gemma 4 models. They are not the audio-capable part of the family.

🧠 Reasoning and tool calling

The important flags are --enable-auto-tool-choice, --reasoning-parser gemma4, and --tool-call-parser gemma4. If you want the Gemma 4 reasoning and tool-calling path ready from the start, keep those in the launch command.

Orin Nano

If you are on Orin Nano, E2B is a great fit and llama.cpp is the straightforward path.

sudo docker run -it --rm --pull always --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
  llama-server -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_S

Then open http://localhost:8080 in a browser to use the built-in web UI.
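Beyond the web UI, llama-server also exposes an OpenAI-compatible API on the same port. A minimal sketch of building such a request follows; the network call is left commented out so nothing is sent until the server is actually running:

```python
import json

def build_chat_request(prompt: str, host: str = "http://localhost:8080"):
    """Build a minimal OpenAI-style chat request for llama-server."""
    url = f"{host}/v1/chat/completions"
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return url, json.dumps(payload).encode()

url, body = build_chat_request("Describe the Jetson Orin Nano in one sentence.")
# With the server running, send it with the standard library:
# import urllib.request
# req = urllib.request.Request(url, data=body,
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```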

ℹ️ Ollama on Orin Nano

Gemma 4 does not currently work with Ollama on Orin Nano. The Ollama path still works on the other devices if that interests you.

If you want the same GGUF-style llama.cpp flow on bigger Jetson devices, the pattern stays the same and you mainly swap the container image and checkpoint.

| Model | GGUF checkpoint |
| --- | --- |
| E2B | unsloth/gemma-4-E2B-it-GGUF:Q4_K_S |
| E4B | ggml-org/gemma-4-E4B-it-GGUF:Q4_K_M |
| 26B-A4B | ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M |
| 31B | ggml-org/gemma-4-31B-it-GGUF:Q4_K_M |

On Thor, use ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-thor instead.
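The image-and-checkpoint swap can be captured in a small helper. This build_llamacpp_command function is illustrative only; the checkpoint names come from the table and the device-to-image mapping from the note above:

```python
# Illustrative helper: pair each GGUF checkpoint from the table with the
# llama.cpp container image for the target device family.
GGUF = {
    "E2B": "unsloth/gemma-4-E2B-it-GGUF:Q4_K_S",
    "E4B": "ggml-org/gemma-4-E4B-it-GGUF:Q4_K_M",
    "26B-A4B": "ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M",
    "31B": "ggml-org/gemma-4-31B-it-GGUF:Q4_K_M",
}

def build_llamacpp_command(variant: str, device_family: str = "orin") -> str:
    image = f"ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-{device_family}"
    return (
        "sudo docker run -it --rm --pull always --runtime=nvidia --network host "
        "-v $HOME/.cache/huggingface:/root/.cache/huggingface "
        f"{image} llama-server -hf {GGUF[variant]}"
    )

print(build_llamacpp_command("E4B", "thor"))
```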

⚠️ E2B audio on llama.cpp for Orin

There is currently an audio issue with E2B under llama.cpp. If audio is important for your setup, use the small Gemma 4 models through vLLM.

Reasoning and Tool Calling

Gemma 4 supports reasoning and tool calling, but reasoning is not enabled by default at request time. Enable it per request by passing enable_thinking through chat_template_kwargs:

{
  "chat_template_kwargs": {
    "enable_thinking": true
  }
}

Here is a minimal request example:

curl -sN http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "google/gemma-4-E2B-it",
    "messages": [{"role": "user", "content": "hi"}],
    "chat_template_kwargs": {"enable_thinking": true},
    "stream": true
  }'
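On the client side, a streamed response can be split into thought text and answer text. This sketch assumes the reasoning parser emits reasoning_content alongside content in each streamed delta, as in vLLM's OpenAI-compatible API:

```python
import json

def split_stream(sse_lines):
    """Separate reasoning text from answer text in streamed SSE deltas."""
    reasoning, answer = [], []
    for line in sse_lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        delta = json.loads(line[len("data: "):])["choices"][0]["delta"]
        if delta.get("reasoning_content"):
            reasoning.append(delta["reasoning_content"])
        if delta.get("content"):
            answer.append(delta["content"])
    return "".join(reasoning), "".join(answer)

# Sample lines in the shape the streaming endpoint returns:
sample = [
    'data: {"choices":[{"delta":{"reasoning_content":"thinking..."}}]}',
    'data: {"choices":[{"delta":{"content":"Hello!"}}]}',
    "data: [DONE]",
]
print(split_stream(sample))  # ('thinking...', 'Hello!')
```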

📝 Important

Even if you launched the server with the Gemma 4 parser flags, you still need chat_template_kwargs.enable_thinking=true in the request if you want thinking mode.

Audio Support

| Model | Input mode |
| --- | --- |
| E2B | text, image, audio |
| E4B | text, image, audio |
| 26B-A4B | text, image |
| 31B | text, image |
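For the audio-capable models, a request body can be sketched along these lines, assuming the server accepts OpenAI-style input_audio content parts; the payload builder and audio bytes here are placeholders:

```python
import base64
import json

def build_audio_message(wav_bytes: bytes, question: str) -> dict:
    """Build an OpenAI-style multimodal message with an inline WAV clip."""
    return {
        "role": "user",
        "content": [
            {"type": "input_audio",
             "input_audio": {"data": base64.b64encode(wav_bytes).decode(),
                             "format": "wav"}},
            {"type": "text", "text": question},
        ],
    }

msg = build_audio_message(b"\x00\x01", "Transcribe this clip.")
payload = {"model": "google/gemma-4-E2B-it", "messages": [msg]}
print(json.dumps(payload)[:80])
```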

For Jetson, the practical takeaway is simple: use vLLM if you want the small-model audio path, and keep in mind that there is currently an audio issue for E2B under llama.cpp.

Things to Watch Out For with vLLM

If you are using Gemma 4 through vLLM, the main thing to watch out for involves non-streaming requests. With streaming, you are generally fine. Without it, watch for cases where the model's thought text leaks into content instead of being cleanly separated from the final answer.

If you are testing with non-streaming requests, try adding this to the request body; keeping special tokens in the output makes it easier to see where the thought text ends:

{
  "skip_special_tokens": false
}
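If thought text does leak into content, a client-side cleanup can help. The think-tag delimiters below are an assumption made for this sketch; inspect what your server actually emits (with skip_special_tokens set to false) before relying on them:

```python
import re

def strip_thoughts(content: str) -> str:
    """Remove leaked thought text from a non-streaming response.

    Assumes <think>...</think> delimiters, which may differ on your
    server; adjust the pattern to the special tokens you actually see.
    """
    return re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()

print(strip_thoughts("<think>reasoning here</think>The answer is 4."))
# The answer is 4.
```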

Do not mix formats casually. Use the listed vLLM checkpoints with the Gemma 4 vLLM containers, and use the listed GGUF checkpoints with llama.cpp.

For 26B-A4B and 31B, startup problems are often memory-related rather than model-related.
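A rough pre-flight check can catch memory-related failures before a launch. The 20 GiB threshold below is an illustrative guess for the larger variants, not an official requirement:

```python
import os

def available_gib(meminfo_text: str) -> float:
    """Parse MemAvailable (in kB) out of /proc/meminfo text, as GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / (1024 * 1024)
    raise ValueError("MemAvailable not found")

if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        free = available_gib(f.read())
    print(f"{free:.1f} GiB available")
    if free < 20:  # illustrative threshold for the 26B-A4B / 31B variants
        print("Consider freeing memory or dropping caches before launching.")
```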

Troubleshooting

If you are retrying a large model launch, flush dirty pages and clear the caches first:

sync
sudo sysctl -w vm.drop_caches=3

Before launching another model, make sure the previous server or container is no longer holding memory. If a model hangs during load or fails to start, free memory, clear caches, and retry with the exact command for your device.

Next Steps