Multimodal

Gemma 4 E2B

Google's compact frontier Gemma 4 model for efficient multimodal and agentic workloads

Parameters: 2.3B effective (5.1B with embeddings)
Modalities: Text, Image, Audio
Context Length: 128K
License: Apache 2.0
Precision: Q4_K_S GGUF

Serve the model

Start server

Choose a module, then an engine and any optional parameters on the left, then click the button on the right to copy the serve command.


Call the model over Web API

Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-4-E2B-it-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

llama.cpp server (OpenAI-compatible API)

Once llama-server is running in a container started with --network host, you can call it from another machine on the LAN (set ${JETSON_HOST} to the Jetson's hostname or IP address). llama-server listens on port 8080 by default unless you pass --port.

curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
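The same request can be made from a script instead of curl. The following is a minimal sketch using only the Python standard library; the helper names `build_chat_request` and `send_chat_request` are illustrative and not part of any library, and the response parsing assumes the server returns the usual OpenAI-style `choices[0].message.content` field.

```python
import json
import urllib.request


def build_chat_request(host, port, model, prompt):
    """Return (url, body) for an OpenAI-compatible chat completion call."""
    url = f"http://{host}:{port}/v1/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, body


def send_chat_request(url, body, timeout=60):
    """POST the JSON body and return the text of the first choice.

    Assumes an OpenAI-style response: choices[0].message.content.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


# Example (requires a running server; host and model name are placeholders):
#   url, body = build_chat_request("192.168.1.50", 8080, "my_model", "Hello!")
#   print(send_chat_request(url, body))
```

Keeping payload construction separate from the network call makes it easy to inspect or log the request before sending it.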

Benchmark

Gemma 4 E2B · vLLM · ISL 2048 / OSL 128 (input / output sequence length, in tokens)


C = concurrent requests. Results will vary with image, clocks, and workload.

Model Details

To learn more about the Gemma 4 family and the different ways to run it on Jetson, see the Gemma 4 on Jetson tutorial.

Gemma 4 E2B is the smallest variant in the Gemma 4 family. Google positions E2B as an edge-first model for low-latency, low-memory deployments where efficiency matters more than absolute model size.

Typical use cases:

  • Offline voice assistants and smart home controllers
  • Robotics copilots that combine speech and image understanding
  • Lightweight OCR and document QA on constrained Jetson devices
  • Local agent pipelines that need structured tool calling with a small footprint

Inputs and Outputs

Input: Text, image, and audio

Output: Text
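As an illustration of image input, the sketch below builds a user message that carries a prompt plus an inline image, using the OpenAI-style `image_url` content part with a base64 data URL. It assumes the serving engine was started with multimodal support enabled and accepts this message shape; `build_image_message` is a hypothetical helper, and audio input is not covered here.

```python
import base64


def build_image_message(prompt, image_bytes, mime_type="image/png"):
    """Build a chat message containing text and one inline image.

    The image is embedded as a base64 data URL (an assumption about
    what the server accepts; some setups may expect an http(s) URL).
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


# Example (file name is a placeholder):
#   with open("label.png", "rb") as f:
#       message = build_image_message("Read the text in this image.", f.read())
#   body = {"model": "my_model", "messages": [message]}
```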

Supported Platforms

  • Jetson Orin
  • Jetson Thor

Inference Engine

This model is configured to run on Jetson with vLLM and llama.cpp.

Official Highlights

  • Google’s model card describes E2B as a dense multimodal model with 2.3B effective parameters and 5.1B parameters including embeddings.
  • It supports 128K context, text/image/audio input, and native function calling for agentic workflows.
  • The official Gemma 4 launch announcement notes that E2B was engineered for offline mobile and IoT use, including devices like Jetson Orin Nano.
  • Google also documents built-in ASR and speech translation support on E2B, with audio clips up to 30 seconds.
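
To make the function-calling highlight concrete, here is a rough sketch of a tool-calling request and response handler in the OpenAI-compatible format. It assumes the serving engine has tool calling enabled (the required server options differ by engine and are not covered here). The `set_light` tool and both helper functions are hypothetical and exist only for illustration.

```python
import json

# Hypothetical tool definition, described with a JSON Schema for its arguments.
SET_LIGHT_TOOL = {
    "type": "function",
    "function": {
        "name": "set_light",
        "description": "Turn a light in a given room on or off.",
        "parameters": {
            "type": "object",
            "properties": {
                "room": {"type": "string"},
                "on": {"type": "boolean"},
            },
            "required": ["room", "on"],
        },
    },
}


def build_tool_request(model, prompt, tools):
    """Build a chat completion body that advertises the given tools."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
    }


def extract_tool_calls(reply):
    """Return a list of (name, arguments) pairs from an OpenAI-style reply.

    Arguments arrive as a JSON string and are decoded into a dict.
    Returns an empty list when the model answered with plain text.
    """
    message = reply["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

The application, not the model, executes the returned calls, so arguments should be validated before acting on them.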