Gemma 4 26B-A4B
Google's 26B MoE frontier Gemma 4 model for fast high-end reasoning and multimodal workflows
Memory Requirement: 24GB RAM
Precision: Q4_K_M GGUF
Size: 16.8GB
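As a quick sanity check, the quantized file size implies an average bit width per weight. The sketch below assumes 16.8GB is decimal gigabytes and uses the 25.2B total parameter count stated in the model card; both the units and the cost model are assumptions, not measurements.

```python
# Rough consistency check between the stated GGUF size and Q4_K_M
# quantization. Figures come from the model card; treating 16.8GB
# as decimal gigabytes is an assumption.
TOTAL_PARAMS = 25.2e9       # total parameters (MoE, all experts stored)
FILE_SIZE_BYTES = 16.8e9    # reported GGUF file size

bits_per_weight = FILE_SIZE_BYTES * 8 / TOTAL_PARAMS
print(f"average bits per weight: {bits_per_weight:.2f}")
# → average bits per weight: 5.33
```

An average noticeably above 4.0 is expected here, since Q4_K_M keeps some tensors at higher precision than 4 bits.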
Supported Inference Engines

Container Run Command:

```shell
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/llama_cpp:gemma4-jetson-orin \
  llama-server -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M
```

Model Details
Gemma 4 26B-A4B is a larger Gemma 4 variant that can be served on Jetson with llama.cpp. Google presents this model as the latency-optimized high-end option in the family: a Mixture-of-Experts model that targets much better throughput than a dense model of similar total size.

Use Cases
- Long-context agents with tool use
- Local coding copilots and repository Q&A on higher-memory Jetson systems
- Document and chart understanding workloads
- Research-style assistants that need stronger reasoning than the edge-sized models
Inputs and Outputs
Input: Text and image
Output: Text
Supported Platforms
- Jetson AGX Orin
- Jetson Thor
Inference Engine
This model is configured to run on Jetson with llama.cpp.
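Once the container above is running, llama-server exposes an OpenAI-compatible HTTP API. The sketch below is a minimal client, assuming the default llama-server port 8080 and the `/v1/chat/completions` endpoint; the `model` name in the payload is informational for a single-model server.

```python
import json
import urllib.request

# Assumed endpoint: llama-server listens on port 8080 by default and
# serves an OpenAI-compatible chat completions route.
LLAMA_SERVER_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Assemble the JSON body for a single-turn chat completion."""
    return {
        "model": "gemma-4-26B-A4B-it",  # informational for llama-server
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        LLAMA_SERVER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Example usage (with the container running): `print(ask("Summarize this repository's README."))`.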
Official Highlights
- Google’s model card describes 26B-A4B as a Mixture-of-Experts model with 25.2B total parameters and 3.8B active parameters during inference.
- It supports 256K context, text/image input, native function calling, and the same long-context reasoning features shared by the rest of Gemma 4.
- Google explicitly notes that the model runs much faster than its total parameter count suggests, because only a small subset of experts is active for each token.
- In Google’s benchmark table, 26B-A4B tracks close to the dense 31B model on many reasoning and coding tasks while keeping a stronger latency profile.
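The active-parameter claim above can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the parameter counts from the model card and the standard rough approximation that decode cost per token scales with active parameters (about 2 FLOPs per active parameter); this is an illustration, not a measured speedup.

```python
# Why an MoE decodes faster than its total parameter count suggests.
# Parameter counts are from the model card; the cost model
# (decode FLOPs ≈ 2 × active params per token) is a rough standard
# approximation, not a benchmark.
TOTAL_PARAMS = 25.2e9    # all experts, stored in memory
ACTIVE_PARAMS = 3.8e9    # experts actually used per token

flops_per_token_moe = 2 * ACTIVE_PARAMS
flops_per_token_dense = 2 * TOTAL_PARAMS  # hypothetical dense model of equal size

speedup = flops_per_token_dense / flops_per_token_moe
print(f"~{speedup:.1f}x fewer decode FLOPs per token")
# → ~6.6x fewer decode FLOPs per token
```

Note the asymmetry this implies: memory footprint follows the 25.2B total (all experts must be resident), while per-token compute follows the 3.8B active count.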