Multimodal

Gemma 4 E2B

Google's compact frontier Gemma 4 model for efficient multimodal and agentic workloads

Parameters 2.3B effective (5.1B with embeddings)

Modalities Text, Image, Audio

Context Length 128K

License Apache 2.0

Precision Q4_K_S GGUF

Serve the model

Start server

Choose module, then engine and optional parameters on the left, then copy the serve command by clicking the button on the right.

Call the model over Web API

Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-4-E2B-it-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

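import os

from openai import OpenAI

# Python does not expand ${JETSON_HOST}, so read it from the environment.
# Falling back to "localhost" is an assumption for running on the Jetson itself.
host = os.environ.get("JETSON_HOST", "localhost")

client = OpenAI(
    base_url=f"http://{host}:8000/v1",
    api_key="not-needed",  # vLLM / llama.cpp typically do not enforce a key
)

completion = client.chat.completions.create(
    model="ggml-org/gemma-4-E2B-it-GGUF",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)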

llama.cpp server (OpenAI-compatible API)

After llama-server is running with --network host, call it from another machine on the LAN by setting ${JETSON_HOST} to the Jetson's hostname or IP address. llama-server listens on port 8080 by default unless you set --port.

curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

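import os

from openai import OpenAI

# Python does not expand ${JETSON_HOST}, so read it from the environment.
# Falling back to "localhost" is an assumption for running on the Jetson itself.
host = os.environ.get("JETSON_HOST", "localhost")

client = OpenAI(
    base_url=f"http://{host}:8080/v1",
    api_key="not-needed",  # vLLM / llama.cpp typically do not enforce a key
)

completion = client.chat.completions.create(
    model="my_model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)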
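
The "my_model" value in the requests above is a placeholder. OpenAI-compatible servers generally list the model ids they accept at GET /v1/models, so a quick lookup can tell you what to put in the "model" field. The sketch below assumes that endpoint is enabled on your server and that JETSON_HOST is set in the environment; the helper names are hypothetical, not part of any library.

```python
import json
import os
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def default_base_url(port: int = 8080) -> str:
    """Build the API base URL from JETSON_HOST (assumed env var, else localhost)."""
    host = os.environ.get("JETSON_HOST", "localhost")
    return f"http://{host}:{port}/v1"


def fetch_model_ids(base_url: str, timeout: float = 10.0) -> list[str]:
    """GET {base_url}/models and return the ids the server advertises."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

Calling fetch_model_ids(default_base_url()) against a running server returns the ids to use in place of "my_model"; pass port=8000 to default_base_url for a server on port 8000.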

Benchmark

Gemma 4 E2B · vLLM · NVFP4 / BF16 · ISL 2048 / OSL 128

C = concurrent requests; ISL / OSL = input / output sequence length in tokens. Results will vary with image, clocks, and workload.
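
To make the concurrency setting concrete, here is a minimal sketch of how a fixed number of in-flight requests could be driven and summarized. It is not the harness behind the published numbers. The `send` argument is a hypothetical callable that you supply: it should issue one request to your server and return the number of output tokens it received.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_at_concurrency(send: Callable[[], int], concurrency: int, total_requests: int) -> dict:
    """Issue total_requests calls with at most `concurrency` in flight at once.

    `send` performs one request and returns the number of output tokens.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send) for _ in range(total_requests)]
        tokens = [future.result() for future in futures]
    elapsed = time.perf_counter() - start
    total_tokens = sum(tokens)
    return {
        "requests": len(tokens),
        "output_tokens": total_tokens,
        "elapsed_s": elapsed,
        # Aggregate output throughput across all concurrent requests.
        "tokens_per_s": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }
```

Running the same `send` at several values of C and comparing tokens_per_s gives a rough picture of how throughput scales on your own device and settings.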

Model Details

To learn more about the Gemma 4 family and the different ways to run it on Jetson, check out the Gemma 4 on Jetson tutorial.

Gemma 4 E2B is the smallest variant in the Gemma 4 family. Google positions E2B as an edge-first model for low-latency, low-memory deployments where efficiency matters more than absolute model size.

Example use cases:
Offline voice assistants and smart home controllers
Robotics copilots that combine speech and image understanding
Lightweight OCR and document QA on constrained Jetson devices
Local agent pipelines that need structured tool calling with a small footprint
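
As an illustration of the tool-calling use case, the sketch below builds a request in the OpenAI "tools" format and dispatches any tool calls found in the assistant's reply. It assumes the server was started with tool calling enabled, which depends on the engine and its launch options. The set_light tool and all helper names are hypothetical.

```python
import json

# Hypothetical tool for a smart home controller, in the OpenAI "tools" schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "set_light",
        "description": "Turn a light on or off in a given room.",
        "parameters": {
            "type": "object",
            "properties": {
                "room": {"type": "string"},
                "on": {"type": "boolean"},
            },
            "required": ["room", "on"],
        },
    },
}]


def set_light(room: str, on: bool) -> str:
    """Stand-in for real device control; returns a description of the action."""
    return f"{room} light {'on' if on else 'off'}"


HANDLERS = {"set_light": set_light}


def build_request(model: str, user_text: str) -> dict:
    """Chat completions request body that advertises the tools to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": TOOLS,
    }


def dispatch_tool_calls(message: dict) -> list[str]:
    """Run each tool call in an assistant message (dict form) and collect results."""
    results = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
        results.append(HANDLERS[function["name"]](**arguments))
    return results
```

The body from build_request can be sent with the same curl or client calls shown earlier; the assistant message in the response is what dispatch_tool_calls expects.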

Inputs and Outputs

Input: Text, image, and audio

Output: Text
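
For image input, OpenAI-compatible chat endpoints commonly accept a message whose content mixes text parts with image_url parts, where the image can be an inline base64 data URL. The sketch below builds such a message. Whether your server accepts it depends on how it was launched (multimodal support may need extra options or files), so treat this as an assumption to verify; image_message is a hypothetical helper.

```python
import base64


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a user message carrying text plus one inline base64-encoded image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }
```

The returned dict can be used as an entry in the "messages" list of a chat completions request.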

Supported Platforms

Jetson Orin
Jetson Thor

Inference Engine

This model is configured to run on Jetson with vLLM and llama.cpp.

Official Highlights

Google’s model card describes E2B as a dense multimodal model with 2.3B effective parameters and 5.1B parameters including embeddings.
It supports 128K context, text/image/audio input, and native function calling for agentic workflows.
The official Gemma 4 launch notes that E2B was engineered for offline mobile and IoT use, including devices like Jetson Orin Nano.
Google also documents built-in ASR and speech translation support on E2B, with audio clips up to 30 seconds.
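
Given the 30-second limit on audio clips noted above, it can help to check a clip's length before sending it. The following sketch uses only Python's standard wave module and assumes uncompressed WAV input; the function names are hypothetical, and make_silence exists only to generate demonstration data.

```python
import io
import wave

MAX_AUDIO_SECONDS = 30.0  # clip limit quoted above for E2B audio input


def wav_duration_seconds(data: bytes) -> float:
    """Return the duration of WAV data in seconds (frames / sample rate)."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_audio_limit(data: bytes, limit: float = MAX_AUDIO_SECONDS) -> bool:
    """True if the clip is no longer than the limit."""
    return wav_duration_seconds(data) <= limit


def make_silence(seconds: float, rate: int = 16000) -> bytes:
    """Generate mono 16-bit silent WAV data, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()
```

Longer recordings would need to be split into segments that each pass fits_audio_limit before being sent.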