Gemma 4 on Jetson
Run Google Gemma 4 models on Jetson with vLLM or llama.cpp. Covers E2B, E4B, 26B-A4B, and 31B on Orin and Thor, including reasoning, tool calling, and runtime selection.
Gemma 4 was released in four practical variants for Jetson: E2B, E4B, 26B-A4B, and 31B. The E2B and E4B models support audio, text, and image input with text output. 26B-A4B is the MoE model, and 31B is the larger dense model.
The full family is supported on Jetson through both vLLM and llama.cpp. All of the models run on Orin and Thor, but available memory decides what makes sense on each device. E2B is the best fit on Orin Nano. On Orin NX, E2B and E4B are the natural choices. On AGX Orin, both small models fit comfortably and perform well, and the larger models start to become realistic. On Thor, the whole family is the intended path.
💡 Runtime choice
In practice, vLLM tends to deliver better serving performance, while llama.cpp remains a good option if you want the GGUF path.
Prerequisites
| Requirement | Details |
|---|---|
| Devices | Jetson Orin Nano, Orin NX, AGX Orin, Jetson Thor |
| JetPack | JP 6 (L4T r36.x) for Orin, JP 7 (L4T r38.x) for Thor |
| Storage | NVMe SSD strongly recommended for model downloads and container caches |
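You can confirm which L4T release a device is running by reading the Tegra release file, a standard location on JetPack systems (the exact revision string varies by release):

cat /etc/nv_tegra_release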
What fits where
| Device | Best Gemma 4 choices |
|---|---|
| Orin Nano | E2B |
| Orin NX | E2B, E4B |
| AGX Orin | Full Gemma 4 family |
| Jetson Thor | Full Gemma 4 family |
Loading Gemma 4 with vLLM
If you are on Orin NX, AGX Orin, or Thor, this is the cleanest place to start. The flow is the same for the whole family. You mainly change the container image for your device and the model ID for the variant you want.
For Orin devices:

sudo docker run -it --rm --pull always --runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-orin \
vllm serve MODEL_ID \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4

For Thor:

sudo docker run -it --rm --pull always --runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-thor \
vllm serve MODEL_ID \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4

Use these model IDs:
| Model | Orin MODEL_ID | Thor MODEL_ID |
|---|---|---|
| E2B | google/gemma-4-E2B-it | google/gemma-4-E2B-it |
| E4B | google/gemma-4-E4B-it | google/gemma-4-E4B-it |
| 26B-A4B | cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 |
| 31B | cyankiwi/gemma-4-31B-it-AWQ-4bit | nvidia/Gemma-4-31B-IT-NVFP4 |
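As a concrete example, this serves E4B on AGX Orin; only the model ID changes from the template above:

sudo docker run -it --rm --pull always --runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-orin \
vllm serve google/gemma-4-E4B-it \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4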
🔊 Audio on E2B and E4B
E2B and E4B support audio input by default; you do not need to enable it separately.
🖼️ 26B-A4B and 31B
26B-A4B and 31B are the larger Gemma 4 models and take text and image input only; they are not the audio-capable part of the family.
🧠 Reasoning and tool calling
The important flags are --enable-auto-tool-choice, --reasoning-parser gemma4, and --tool-call-parser gemma4. If you want the Gemma 4 reasoning and tool-calling path ready from the start, keep those in the launch command.
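As a sketch of what tool calling looks like once the server is up, here is a minimal OpenAI-style request; the get_weather tool name and schema are placeholders for illustration, not part of Gemma 4:

curl -s http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
  "model": "google/gemma-4-E4B-it",
  "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "tool_choice": "auto"
}'

With --enable-auto-tool-choice and the gemma4 tool-call parser active, the response should come back as a structured tool_calls entry rather than raw text.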
Orin Nano
If you are on Orin Nano, E2B is a great fit and llama.cpp is the straightforward path.
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
llama-server -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_S
Then open http://localhost:8080 in a browser to use the built-in web UI.
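llama-server also exposes an OpenAI-compatible API on the same port, so you can test from the command line as well; a minimal sketch:

curl -s http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"messages": [{"role": "user", "content": "hi"}]}'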
ℹ️ Ollama on Orin Nano
Gemma 4 does not currently work with Ollama on Orin Nano. Ollama remains an option on the other devices if you prefer that workflow.
If you want the same GGUF-style llama.cpp flow on bigger Jetson devices, the pattern stays the same and you mainly swap the container image and checkpoint.
| Model | GGUF checkpoint |
|---|---|
| E2B | unsloth/gemma-4-E2B-it-GGUF:Q4_K_S |
| E4B | ggml-org/gemma-4-E4B-it-GGUF:Q4_K_M |
| 26B-A4B | ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M |
| 31B | ggml-org/gemma-4-31B-it-GGUF:Q4_K_M |
On Thor, use ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-thor instead.
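For example, serving the 31B checkpoint on Thor follows the same pattern with the Thor image and the GGUF from the table:

sudo docker run -it --rm --pull always --runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-thor \
llama-server -hf ggml-org/gemma-4-31B-it-GGUF:Q4_K_M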
⚠️ E2B audio on llama.cpp for Orin
There is currently an audio issue with E2B under llama.cpp. If audio is important for your setup, use the small Gemma 4 models through vLLM.
Reasoning and Tool Calling
Gemma 4 supports reasoning and tool calling, but reasoning is not enabled by default at request time. To turn it on for a request, pass enable_thinking through chat_template_kwargs:
{
"chat_template_kwargs": {
"enable_thinking": true
}
}
Here is a minimal request example:
curl -sN http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "google/gemma-4-E2B-it",
"messages": [{"role": "user", "content": "hi"}],
"chat_template_kwargs": {"enable_thinking": true},
"stream": true
}'
📝 Important
Even if you launched the server with the Gemma 4 parser flags, you still need chat_template_kwargs.enable_thinking=true in the request if you want thinking mode.
Audio Support
| Model | Input mode |
|---|---|
| E2B | text, image, audio |
| E4B | text, image, audio |
| 26B-A4B | text, image |
| 31B | text, image |
For Jetson, the practical takeaway is simple: use vLLM if you want the small-model audio path, and keep in mind that there is currently an audio issue for E2B under llama.cpp.
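As a sketch of what an audio request can look like against a running E2B vLLM server, assuming vLLM's audio_url multimodal content type and using a placeholder WAV URL:

curl -s http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
  "model": "google/gemma-4-E2B-it",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "audio_url", "audio_url": {"url": "https://example.com/sample.wav"}},
      {"type": "text", "text": "Transcribe this audio."}
    ]
  }]
}'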
Things to Watch Out For with vLLM
The main thing to watch for with Gemma 4 on vLLM is reasoning output handling. Streaming requests are generally fine. With non-streaming requests, watch for cases where the model's thought text leaks into content instead of being cleanly separated from the final answer.
If you are testing with non-streaming requests, try adding this to the request body:
{
"skip_special_tokens": false
}
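Put together, a non-streaming test request might look like this (skip_special_tokens is a vLLM-specific request parameter):

curl -s http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
  "model": "google/gemma-4-E2B-it",
  "messages": [{"role": "user", "content": "hi"}],
  "chat_template_kwargs": {"enable_thinking": true},
  "skip_special_tokens": false,
  "stream": false
}'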
Do not mix formats: use the listed vLLM checkpoints with the Gemma 4 vLLM containers, and the listed GGUF checkpoints with llama.cpp.
For 26B-A4B and 31B, startup problems are often memory-related rather than model-related.
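Before launching one of the large models, confirm how much memory is actually free; free and tegrastats are the usual tools on Jetson:

free -h          # overall memory and swap usage
sudo tegrastats  # live RAM and GPU utilization; Ctrl+C to stop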
Troubleshooting
If you are retrying a large model launch, clear the page cache first:
sudo sysctl -w vm.drop_caches=3
Before launching another model, make sure the previous server or container is no longer holding memory. If a model hangs during load or fails to start, free memory, clear caches, and retry with the exact command for your device.
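To check for and stop a leftover container (the container ID comes from whatever docker ps reports on your system):

sudo docker ps
sudo docker stop <container_id>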
Next Steps
- Browse the Supported Models page for copy/paste commands by device
- Read Introduction to GenAI on Jetson: How to Run LLMs and VLMs for the broader runtime picture
- Use Ollama on Jetson if you want a simpler local LLM workflow