Multimodal

Gemma 4 26B-A4B

Google's 26B MoE frontier Gemma 4 model for fast high-end reasoning and multimodal workflows

Parameters: 3.8B active (25.8B total, MoE)
Modalities: Text, Image
Context Length: 256K
License: Apache 2.0
Precision: Q4_K_M GGUF

Serve the model

Start server

Choose a module, then an engine and any optional parameters on the left, then click the button on the right to copy the serve command.

Call the model over Web API

Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-4-26B-A4B-it-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
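
Some servers reject a request whose "model" value does not match an ID they report. As a minimal sketch, assuming the server exposes the OpenAI-style GET /v1/models endpoint, the helpers below print the IDs it advertises. The function names are illustrative, and the grep/sed extraction is a rough stand-in for a real JSON parser.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of any server or CLI.

# Pull every "id" value out of a /v1/models JSON response read from stdin.
# Nested objects that carry their own "id" field are listed as well.
extract_model_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' | sed 's/.*"\([^"]*\)"$/\1/'
}

# Query the server and print one ID per line.
# The port argument defaults to 8000 to match the request above.
list_model_ids() {
  curl -s "http://${JETSON_HOST}:${1:-8000}/v1/models" | extract_model_ids
}
```

Running list_model_ids (or list_model_ids 8080 for a server on another port) should print a value you can paste into the "model" field.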

llama.cpp server (OpenAI-compatible API)

Once llama-server is running in a container started with --network host, you can call it from another machine on the LAN (set ${JETSON_HOST} to the Jetson's hostname or IP address, or use the field). llama-server listens on port 8080 by default unless you pass --port.

curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
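
Because the model accepts image input, a request can also carry a picture. The sketch below assumes the server accepts OpenAI-style "image_url" content parts with a base64 data URI and that it was started with image support enabled. The function names and the image/jpeg MIME type are illustrative choices, not something this page specifies.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; adjust the MIME type to your file.

# Print a JSON request body for the given image file and prompt.
# The prompt must not contain double quotes or backslashes,
# because no JSON escaping is performed here.
build_image_request() {
  local image_path="$1" prompt="$2" model="${3:-my_model}"
  local b64
  # Encode the image and strip line wraps so it fits in one JSON string.
  b64="$(base64 < "$image_path" | tr -d '\n')"
  printf '{"model":"%s","messages":[{"role":"user","content":[{"type":"text","text":"%s"},{"type":"image_url","image_url":{"url":"data:image/jpeg;base64,%s"}}]}]}' \
    "$model" "$prompt" "$b64"
}

# POST the request body to the chat completions endpoint (body read from stdin).
send_image_request() {
  build_image_request "$@" |
    curl -s "http://${JETSON_HOST}:8080/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d @-
}
```

For example, send_image_request photo.jpg "Describe this image." posts a hypothetical photo.jpg from the current directory.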

Benchmark

Gemma 4 26B-A4B · vLLM · NVFP4 / AWQ · ISL 2048 / OSL 128

[Interactive chart: benchmark results, selectable by engine and concurrency]

C = concurrent requests. Results will vary with image, clocks, and workload.
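
To make the concurrency setting concrete, the sketch below fires C requests at once and waits for all of them. It is an illustration only, not the harness behind the published numbers: the helper names are hypothetical, the prompt is far shorter than 2048 tokens, and "max_tokens" caps the output at 128 tokens without guaranteeing that length.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Run the given command C times in parallel and wait for every copy to finish.
run_concurrent() {
  local c="$1" i=0
  shift
  while [ "$i" -lt "$c" ]; do
    "$@" &
    i=$((i + 1))
  done
  wait
}

# One chat request capped at 128 output tokens; the response body is discarded.
one_request() {
  curl -s -o /dev/null "http://${JETSON_HOST}:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "ggml-org/gemma-4-26B-A4B-it-GGUF", "max_tokens": 128,
         "messages": [{"role": "user", "content": "Hello!"}]}'
}
```

In bash, time run_concurrent 8 one_request reports the wall-clock time for C = 8.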

Model Details

To learn more about the Gemma 4 family and the different ways to run it on Jetson, see the Gemma 4 on Jetson tutorial.

Gemma 4 26B-A4B is a larger Gemma 4 variant that can be served on Jetson with llama.cpp. Google presents this model as the latency-optimized high-end option in the family: a Mixture-of-Experts model that targets much better throughput than a dense model of similar total size. Workloads it suits include:

  • Long-context agents with tool use
  • Local coding copilots and repository Q&A on higher-memory Jetson systems
  • Document and chart understanding workloads
  • Research-style assistants that need stronger reasoning than the edge-sized models

Inputs and Outputs

Input: Text and image

Output: Text

Supported Platforms

  • Jetson AGX Orin
  • Jetson Thor

Inference Engine

This model is configured to run on Jetson with vLLM and llama.cpp.

Official Highlights

  • Google’s model card describes 26B-A4B as a Mixture-of-Experts model with 25.2B total parameters and 3.8B active parameters during inference.
  • It supports 256K context, text/image input, native function calling, and the same long-context reasoning features shared by the rest of Gemma 4.
  • Google explicitly notes that the model runs much faster than its total parameter count suggests because only a subset of experts is active per token.
  • In Google’s benchmark table, 26B-A4B tracks close to 31B dense on many reasoning and coding tasks while keeping a stronger latency profile.
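
As a sketch of what a function-calling request can look like, the helpers below send an OpenAI-style "tools" array alongside the chat messages. Everything specific here is an assumption: get_weather is a made-up tool, the helper names are illustrative, and whether the server honors "tools" depends on the engine and how it was launched.

```shell
#!/usr/bin/env bash
# Sketch only: get_weather is a hypothetical tool, not part of Gemma 4 or any server.

# Print a chat request body that offers the model one callable tool.
build_tool_request() {
  local model="${1:-my_model}"
  cat <<EOF
{
  "model": "${model}",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Look up the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}
EOF
}

# POST the request body to the chat completions endpoint (body read from stdin).
send_tool_request() {
  build_tool_request "$@" |
    curl -s "http://${JETSON_HOST}:8080/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d @-
}
```

If the model decides to call the tool, an OpenAI-compatible server typically returns the call in the response message's "tool_calls" field instead of plain text.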