Gemma 4 E2B

Google's compact frontier Gemma 4 model for efficient multimodal and agentic workloads

Memory Requirement 8GB RAM
Precision Q8_0 GGUF
Size 5.0GB

Jetson Inference - Supported Inference Engines

Container
# Run Command
sudo docker run -it --rm --pull always \
  --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/llama_cpp:gemma4-jetson-orin \
  llama-server -hf ggml-org/gemma-4-E2B-it-GGUF:Q8_0
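Once the container is up, llama-server exposes an OpenAI-compatible chat endpoint. A minimal client sketch, assuming the default port (8080) and a model name of `gemma-4-E2B-it` (both assumptions; adjust for your deployment):

```python
import json
import urllib.request

# Chat request in the OpenAI-compatible format that llama-server accepts.
# The model name here is an assumption -- llama-server will also serve
# whatever model it was launched with if the field is omitted.
payload = {
    "model": "gemma-4-E2B-it",
    "messages": [
        {"role": "user", "content": "Summarize what a Jetson Orin is in one sentence."}
    ],
    "max_tokens": 128,
}

def query_server(body, url="http://127.0.0.1:8080/v1/chat/completions"):
    """POST a chat payload to a running llama-server instance and return the parsed reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the container from the run command above running:
# reply = query_server(payload)
# print(reply["choices"][0]["message"]["content"])
```

The request/response shapes follow the OpenAI chat-completions convention, so existing OpenAI client libraries can also be pointed at the server by overriding the base URL.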

Model Details

Gemma 4 E2B is the smallest variant in the Gemma 4 family. Google positions E2B as an edge-first model for low-latency, low-memory deployments where efficiency matters more than peak capability. Typical use cases include:

  • Offline voice assistants and smart home controllers
  • Robotics copilots that combine speech and image understanding
  • Lightweight OCR and document QA on constrained Jetson devices
  • Local agent pipelines that need structured tool calling with a small footprint

Inputs and Outputs

Input: Text, image, and audio

Output: Text

Supported Platforms

  • Jetson Orin
  • Jetson Thor

Inference Engine

This model is configured to run on Jetson with llama.cpp.

Official Highlights

  • Google’s model card describes E2B as a dense multimodal model with 2.3B effective parameters and 5.1B parameters including embeddings.
  • It supports 128K context, text/image/audio input, and native function calling for agentic workflows.
  • The official Gemma 4 launch notes that E2B was engineered for offline mobile and IoT use, including devices like Jetson Orin Nano.
  • Google also documents built-in ASR and speech translation support on E2B, with audio clips up to 30 seconds.
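The native function calling mentioned above is typically driven through an OpenAI-style "tools" schema when the model is served via llama-server. A minimal sketch of such a request body, with a hypothetical `get_temperature` tool invented for illustration:

```python
import json

# Hypothetical tool definition in the OpenAI-style "tools" schema.
# The tool name, description, and parameters are invented for this example.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_temperature",
            "description": "Read the current temperature from a room sensor.",
            "parameters": {
                "type": "object",
                "properties": {
                    "room": {"type": "string", "description": "Room identifier"},
                },
                "required": ["room"],
            },
        },
    }
]

# The request body pairs the tool list with an ordinary chat message;
# a tool-capable model replies with a structured call (tool name plus
# JSON-encoded arguments) instead of free-form text when appropriate.
request_body = {
    "model": "gemma-4-E2B-it",  # assumed model name
    "messages": [{"role": "user", "content": "How warm is the kitchen?"}],
    "tools": tools,
}

print(json.dumps(request_body, indent=2))
```

This keeps the agent loop on the host: the application executes the returned tool call locally and feeds the result back as a follow-up message.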