Gemma 4 E2B

Google's compact frontier Gemma 4 model for efficient multimodal and agentic workloads

Memory Requirement 8GB RAM
Precision Q8_0 GGUF
Size 5.0GB

Jetson Inference - Supported Inference Engines

Container
# Run Command
sudo docker run -it --rm --pull always \
  --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/llama_cpp:gemma4-jetson-orin \
  llama-server -hf ggml-org/gemma-4-E2B-it-GGUF:Q8_0
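Once the container is up, llama-server exposes an OpenAI-compatible chat endpoint. A minimal client sketch, assuming the default port (8080) and a model name of `gemma-4-E2B-it` (both assumptions; adjust for your deployment):

```python
import json
import urllib.request

# Chat request in the OpenAI-compatible format that llama-server accepts.
# The model name here is an assumption -- llama-server will also serve
# whatever model it was launched with if the field is omitted.
payload = {
    "model": "gemma-4-E2B-it",
    "messages": [
        {"role": "user", "content": "Summarize what a Jetson Orin is in one sentence."}
    ],
    "max_tokens": 128,
}

def query_server(body, url="http://127.0.0.1:8080/v1/chat/completions"):
    """POST a chat payload to a running llama-server instance and return the parsed reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the container from the run command above running:
# reply = query_server(payload)
# print(reply["choices"][0]["message"]["content"])
```

The request/response shapes follow the OpenAI chat-completions convention, so existing OpenAI client libraries can also be pointed at the server by overriding the base URL.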

Model Details

Gemma 4 E2B is the smallest variant in the Gemma 4 family. Google positions E2B as an edge-first model for low-latency, low-memory deployments where efficiency matters more than peak capability. Typical use cases include:

  • Offline voice assistants and smart home controllers
  • Robotics copilots that combine speech and image understanding
  • Lightweight OCR and document QA on constrained Jetson devices
  • Local agent pipelines that need structured tool calling with a small footprint

Inputs and Outputs

Input: Text, image, and audio

Output: Text

Supported Platforms

  • Jetson Orin
  • Jetson Thor

Inference Engine

This model is configured to run on Jetson with llama.cpp.

Official Highlights

  • Google’s model card describes E2B as a dense multimodal model with 2.3B effective parameters and 5.1B parameters including embeddings.
  • It supports 128K context, text/image/audio input, and native function calling for agentic workflows.
  • The official Gemma 4 launch notes that E2B was engineered for offline mobile and IoT use, including devices like Jetson Orin Nano.
  • Google also documents built-in ASR and speech translation support on E2B, with audio clips up to 30 seconds.
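The native function calling mentioned above is typically driven through an OpenAI-style "tools" schema when the model is served via llama-server. A minimal sketch of such a request body, with a hypothetical `get_temperature` tool invented for illustration:

```python
import json

# Hypothetical tool definition in the OpenAI-style "tools" schema.
# The tool name, description, and parameters are invented for this example.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_temperature",
            "description": "Read the current temperature from a room sensor.",
            "parameters": {
                "type": "object",
                "properties": {
                    "room": {"type": "string", "description": "Room identifier"},
                },
                "required": ["room"],
            },
        },
    }
]

# The request body pairs the tool list with an ordinary chat message;
# a tool-capable model replies with a structured call (tool name plus
# JSON-encoded arguments) instead of free-form text when appropriate.
request_body = {
    "model": "gemma-4-E2B-it",  # assumed model name
    "messages": [{"role": "user", "content": "How warm is the kitchen?"}],
    "tools": tools,
}

print(json.dumps(request_body, indent=2))
```

This keeps the agent loop on the host: the application executes the returned tool call locally and feeds the result back as a follow-up message.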