Multimodal

Gemma 4 26B-A4B

Google's 26B MoE frontier Gemma 4 model for fast high-end reasoning and multimodal workflows

Parameters: 3.8B active (25.8B total, MoE)

Modalities: Text, Image

Context Length: 256K

License: Apache 2.0

Precision: Q4_K_M GGUF

Serve the model

Call the model over Web API

Use one of the client examples below (curl from a terminal, or the OpenAI Python client) to send a Web API request to the model you just served.

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-4-26B-A4B-it-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

import os

from openai import OpenAI

# Python does not expand shell-style ${JETSON_HOST}, so read the variable explicitly.
JETSON_HOST = os.environ["JETSON_HOST"]

client = OpenAI(
    base_url=f"http://{JETSON_HOST}:8000/v1",
    api_key="not-needed",  # vLLM / llama.cpp typically do not enforce a key
)

completion = client.chat.completions.create(
    model="ggml-org/gemma-4-26B-A4B-it-GGUF",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)

llama.cpp server (OpenAI-compatible API)

Once llama-server is running in a container started with --network host, you can call it from another machine on the LAN (set JETSON_HOST to the Jetson's hostname or IP address). llama-server listens on port 8080 by default unless you set --port.

curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

import os

from openai import OpenAI

# Python does not expand shell-style ${JETSON_HOST}, so read the variable explicitly.
JETSON_HOST = os.environ["JETSON_HOST"]

client = OpenAI(
    base_url=f"http://{JETSON_HOST}:8080/v1",
    api_key="not-needed",  # vLLM / llama.cpp typically do not enforce a key
)

completion = client.chat.completions.create(
    model="my_model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
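
The model field in the requests above has to name a model the server actually knows about. The following is a minimal sketch for looking that name up, assuming the server exposes the OpenAI-style GET /v1/models listing; the helper names (models_url, model_ids, fetch_model_ids) are hypothetical and not part of any library.

```python
import json
import urllib.request


def models_url(host: str, port: int = 8080) -> str:
    """Build the URL of the OpenAI-style model listing endpoint."""
    return f"http://{host}:{port}/v1/models"


def model_ids(listing: dict) -> list[str]:
    """Extract model IDs from a parsed /v1/models response body."""
    return [entry["id"] for entry in listing.get("data", [])]


def fetch_model_ids(host: str, port: int = 8080, timeout: float = 5.0) -> list[str]:
    """Query the server (assumed reachable on the LAN) and return its model IDs."""
    with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

Calling fetch_model_ids(os.environ["JETSON_HOST"]) from a machine on the same LAN should return the names the server accepts. Pass port=8000 for a server started on that port.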

Benchmark

Gemma 4 26B-A4B · vLLM · NVFP4 / AWQ · ISL 2048 / OSL 128 (input / output sequence length, in tokens)

[Interactive chart: results by engine and concurrency]

C = concurrent requests. Results will vary with image, clocks, and workload.

Model Details

To learn more about the Gemma 4 family and the different ways to run it on Jetson, see the Gemma 4 on Jetson tutorial.

Gemma 4 26B-A4B is a larger Gemma 4 variant that can be served on Jetson with llama.cpp. Google presents this model as the latency-optimized high-end option in the family: a Mixture-of-Experts model that targets much better throughput than a dense model of similar total size.

Typical use cases:

Long-context agents with tool use
Local coding copilots and repository Q&A on higher-memory Jetson systems
Document and chart understanding workloads
Research-style assistants that need stronger reasoning than the edge-sized models

Inputs and Outputs

Input: Text and image

Output: Text
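
The text-plus-image input can be sketched as an OpenAI-style chat message that carries the image inline as a base64 data URL. This is a sketch under assumptions: the image_url content part follows the OpenAI chat completions format, whether the served model accepts it depends on the engine and how it was started, and the helper names (image_data_url, image_question) are hypothetical.

```python
import base64


def image_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URL usable in an image_url part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_question(prompt: str, image_bytes: bytes,
                   mime_type: str = "image/png") -> list[dict]:
    """Build a one-turn messages list with a text part and an image part."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": image_data_url(image_bytes, mime_type)}},
        ],
    }]
```

The returned list can be passed as the messages argument of client.chat.completions.create in place of the plain-text message used in the earlier examples. The response is still text only.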

Supported Platforms

Jetson AGX Orin
Jetson Thor

Inference Engine

This model is configured to run on Jetson with vLLM and llama.cpp.

Official Highlights

Google’s model card describes 26B-A4B as a Mixture-of-Experts model with 25.2B total parameters and 3.8B active parameters during inference.
It supports 256K context, text/image input, native function calling, and the same long-context reasoning features shared by the rest of Gemma 4.
Google explicitly notes that the model runs much faster than its total parameter count suggests because only a subset of experts is active per token.
In Google’s benchmark table, 26B-A4B tracks close to 31B dense on many reasoning and coding tasks while keeping a stronger latency profile.
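
The function calling mentioned in the highlights above can be exercised through the OpenAI-style tools field. This is a rough sketch, assuming the serving engine passes tool definitions through to the model and returns tool_calls in the OpenAI response shape. The get_weather tool and both helper names are made up for illustration.

```python
import json


def weather_tool() -> dict:
    """Describe a hypothetical get_weather function in the OpenAI tools format."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }


def parse_tool_calls(message: dict) -> list[tuple[str, dict]]:
    """Return (function name, decoded arguments) for each tool call in a message."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # Arguments arrive as a JSON-encoded string, not as a nested object.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls
```

Send tools=[weather_tool()] along with the messages, then run parse_tool_calls on the assistant message from the parsed JSON response to see which function the model asked for and with what arguments.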