Gemma 4 26B-A4B
Google's 26B MoE frontier Gemma 4 model for fast high-end reasoning and multimodal workflows
Serve the model
Start server
Choose module, then engine and optional parameters on the left, then copy the serve command by clicking the button on the right.
Call the model over Web API
Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.
curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-4-26B-A4B-it-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
llama.cpp server (OpenAI-compatible API)
After llama-server is running in a container started with --network host, call it from another machine on the LAN by setting ${JETSON_HOST} to the Jetson's hostname or IP address. llama-server listens on port 8080 by default unless you pass --port.
curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Benchmark
Gemma 4 26B-A4B · vLLM · NVFP4 / AWQ · ISL 2048 / OSL 128
ISL / OSL = input / output sequence length in tokens. C = concurrent requests. Results will vary with container image, clocks, and workload.
Model Details
If you want to learn more about the Gemma 4 family and the different ways to run it on Jetson, check out the Gemma 4 on Jetson tutorial.
Gemma 4 26B-A4B is a larger Gemma 4 variant that can be served on Jetson with llama.cpp. Google presents this model as the latency-optimized high-end option in the family: a Mixture-of-Experts model that targets much better throughput than a dense model of similar total size.
- Long-context agents with tool use
- Local coding copilots and repository Q&A on higher-memory Jetson systems
- Document and chart understanding workloads
- Research-style assistants that need stronger reasoning than the edge-sized models
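For the tool-use scenario in the list above, a request to the OpenAI-compatible endpoint might be built roughly as follows. This is a minimal sketch, not the page's own client: the tool `get_board_temperature` and both helper functions are hypothetical names introduced for illustration, and it assumes the server was started with tool-call parsing enabled (how to do that depends on the engine, so check its documentation).

```python
import json


def build_tool_call_payload(model: str, question: str) -> dict:
    """Build an OpenAI-style chat payload that advertises one callable tool."""
    # Hypothetical tool for illustration only; it is not part of Gemma or Jetson.
    get_board_temperature = {
        "type": "function",
        "function": {
            "name": "get_board_temperature",
            "description": "Read the current temperature of a named thermal zone.",
            "parameters": {
                "type": "object",
                "properties": {
                    "zone": {"type": "string", "description": "Thermal zone name"},
                },
                "required": ["zone"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [get_board_temperature],
    }


def extract_tool_calls(reply: dict) -> list:
    """Return (function name, parsed arguments) pairs from a chat completion reply."""
    message = reply["choices"][0]["message"]
    calls = []
    # "tool_calls" is absent or null when the model answers in plain text.
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string in the OpenAI schema.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

The payload can be sent with the same curl pattern shown earlier (pass `json.dumps(payload)` as the request body). If `extract_tool_calls` returns an empty list, the model replied with ordinary text instead of requesting a tool.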
Inputs and Outputs
Input: Text and image
Output: Text
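As a sketch of what combined text and image input can look like over the chat completions endpoint used in the curl commands above, the helper below packs an image as a base64 data URL in an OpenAI-style `image_url` content part. The function names, the file name `chart.png`, and the timeout are assumptions made for illustration. Whether the served model actually accepts images depends on the engine and on how the server was started.

```python
import base64
import json
import urllib.request


def build_image_chat_payload(model: str, prompt: str, image_bytes: bytes,
                             mime_type: str = "image/png") -> dict:
    """Build an OpenAI-style chat payload with one text part and one image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # The image travels inline as a data URL, so no file hosting is needed.
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
            ],
        }],
    }


def post_chat(base_url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload to /v1/chat/completions and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (needs a running server; chart.png is a hypothetical local file):
# with open("chart.png", "rb") as f:
#     payload = build_image_chat_payload(
#         "ggml-org/gemma-4-26B-A4B-it-GGUF", "Describe this chart.", f.read())
# reply = post_chat("http://<jetson-host>:8000", payload)
# print(reply["choices"][0]["message"]["content"])
```

The output is plain text in either case, so the reply is read from `choices[0].message.content` exactly as for a text-only request.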
Supported Platforms
- Jetson AGX Orin
- Jetson Thor
Inference Engine
This model is configured to run on Jetson with vLLM and llama.cpp.
Official Highlights
- Google's model card describes 26B-A4B as a Mixture-of-Experts model with 25.2B total parameters and 3.8B active parameters during inference.
- It supports 256K context, text/image input, native function calling, and the same long-context reasoning features shared by the rest of Gemma 4.
- Google explicitly notes that the model runs much faster than its total parameter count suggests because only a subset of experts is active per token.
- In Google's benchmark table, 26B-A4B tracks close to 31B dense on many reasoning and coding tasks while keeping a stronger latency profile.
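The parameter counts in the highlights above can be turned into two quick back-of-the-envelope figures: the share of weights active per token (about 15%), and a weights-only memory estimate at a given precision. The sketch below is illustrative arithmetic only. The helper names are made up, and the footprint ignores quantization scales, the KV cache, and activations, so real memory use will be higher. It also assumes every expert stays resident in memory even though only some are active per token, which is the usual arrangement for Mixture-of-Experts serving.

```python
TOTAL_PARAMS = 25.2e9   # total parameters, from the model card figure quoted above
ACTIVE_PARAMS = 3.8e9   # parameters active per token, same source


def active_fraction(active: float, total: float) -> float:
    """Share of the weights that participate in each token's forward pass."""
    return active / total


def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Weights-only size in GB (10^9 bytes); excludes scales, KV cache, activations."""
    return params * bits_per_weight / 8 / 1e9


print(f"active share:   {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
print(f"4-bit weights:  {weight_footprint_gb(TOTAL_PARAMS, 4):.1f} GB")
print(f"16-bit weights: {weight_footprint_gb(TOTAL_PARAMS, 16):.1f} GB")
```

Per-token compute scales with the active share, while memory capacity is driven by the total parameter count, which is why the model is aimed at the higher-memory Jetson systems.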