Gemma 4 E2B
Google's compact frontier Gemma 4 model for efficient multimodal and agentic workloads
Call the model over Web API
Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.
curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ggml-org/gemma-4-E2B-it-GGUF",
"messages": [{"role": "user", "content": "Hello!"}]
}'
llama.cpp server (OpenAI-compatible API)
Once llama-server is running in a container started with --network host, you can call it from another machine on the LAN by setting ${JETSON_HOST} to the Jetson's hostname or IP address. llama-server listens on port 8080 by default unless you pass --port.
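The same request can also be sent from Python instead of curl. The following is a minimal sketch using only the standard library, assuming the server exposes the OpenAI-compatible /v1/chat/completions route used in this section. The helper names `build_chat_request` and `chat` are illustrative and not part of any library, and the host, port, and model name are placeholders to replace with your own values.

```python
import json
import urllib.request


def build_chat_request(host, port, model, prompt):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(host, port, model, prompt, timeout=60):
    """Send one chat request and return the assistant's reply text."""
    request = build_chat_request(host, port, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

For example, `chat("jetson.local", 8080, "my_model", "Hello!")` would mirror the llama.cpp curl command, where `jetson.local` stands in for your own `${JETSON_HOST}` value.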
curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my_model",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Benchmark
Gemma 4 E2B · vLLM · ISL 2048 / OSL 128
ISL and OSL are the input and output sequence lengths in tokens. C = concurrent requests. Results will vary with container image, clocks, and workload.
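To make the concurrency figure concrete, here is a rough sketch of how an aggregate output-token rate could be measured for a given C. It is not the harness behind the published numbers. It assumes every request generates exactly the configured output length (128 tokens for OSL 128), and `run_concurrent` and `send_request` are hypothetical names: `send_request` is any zero-argument callable that performs one complete request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrent(send_request, concurrency, output_tokens_per_request):
    """Issue `concurrency` requests at once and report aggregate output tokens/s.

    Assumes each request produces exactly `output_tokens_per_request` tokens.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send_request) for _ in range(concurrency)]
        for future in futures:
            future.result()  # re-raises any error from the request
    elapsed = time.perf_counter() - start
    total_tokens = concurrency * output_tokens_per_request
    return {
        "elapsed_s": elapsed,
        "output_tokens_per_s": total_tokens / elapsed,
    }
```

Because the wall-clock time covers all C requests together, the result is an aggregate rate across requests, not a per-request rate.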
Model Details
To learn more about the Gemma 4 family and the different ways to run it on Jetson, see the Gemma 4 on Jetson tutorial.
Gemma 4 E2B is the smallest variant in the Gemma 4 family. Google positions E2B as an edge-first model for low-latency, low-memory deployments where efficiency matters more than absolute model size. Typical use cases include:
- Offline voice assistants and smart home controllers
- Robotics copilots that combine speech and image understanding
- Lightweight OCR and document QA on constrained Jetson devices
- Local agent pipelines that need structured tool calling with a small footprint
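For the tool-calling use case, OpenAI-compatible servers generally accept a `tools` array in the chat completions request and return any calls under `tool_calls`. The sketch below assumes that convention and that tool calling is enabled on the server you started, which may require extra server options. The `get_temperature` tool and both helper functions are hypothetical and exist only for illustration.

```python
import json

# Hypothetical tool definition in the OpenAI-style function schema.
GET_TEMPERATURE_TOOL = {
    "type": "function",
    "function": {
        "name": "get_temperature",
        "description": "Read the temperature of a named room in degrees Celsius.",
        "parameters": {
            "type": "object",
            "properties": {"room": {"type": "string"}},
            "required": ["room"],
        },
    },
}


def build_tool_payload(model, prompt, tools):
    """Build a chat completions payload that offers `tools` to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
    }


def extract_tool_calls(response_body):
    """Return (name, arguments) pairs from an OpenAI-style response body."""
    message = response_body["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string, so decode them here.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

The payload can be posted to the same /v1/chat/completions endpoint as a plain chat request. An empty list from `extract_tool_calls` means the model answered in text instead of calling a tool.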
Inputs and Outputs
Input: Text, image, and audio
Output: Text
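As an illustration of image input, OpenAI-compatible chat endpoints commonly accept a message whose content is a list of parts, with the image passed as a base64 data URI. The sketch below assumes the server you are running was started with multimodal support and follows that convention. `image_message` is an illustrative helper, not a library function.

```python
import base64


def image_message(prompt, image_bytes, mime_type="image/jpeg"):
    """Build a user message that pairs a text prompt with one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime_type};base64,{encoded}"
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }
```

The returned dictionary goes into the `messages` list of a chat completions request in place of a plain text message.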
Supported Platforms
- Jetson Orin
- Jetson Thor
Inference Engine
This model is configured to run on Jetson with vLLM and llama.cpp.
Official Highlights
- Google's model card describes E2B as a dense multimodal model with 2.3B effective parameters and 5.1B parameters including embeddings.
- It supports 128K context, text/image/audio input, and native function calling for agentic workflows.
- The official Gemma 4 launch notes that E2B was engineered for offline mobile and IoT use, including devices like Jetson Orin Nano.
- Google also documents built-in ASR and speech translation support on E2B, with audio clips up to 30 seconds.
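Given the 30-second audio limit, it can help to check clip length before sending audio to the model. This is a small sketch for uncompressed WAV input using Python's standard `wave` module. The helper names are illustrative, and how the audio is then passed to the server depends on the engine and is not shown here.

```python
import wave

MAX_CLIP_SECONDS = 30.0  # limit quoted in the highlights above


def wav_duration_seconds(source):
    """Return the duration of a WAV file given a path or binary file object."""
    with wave.open(source, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def fits_audio_limit(source, limit=MAX_CLIP_SECONDS):
    """True if the clip is no longer than `limit` seconds."""
    return wav_duration_seconds(source) <= limit
```

Longer recordings would need to be split into chunks of at most 30 seconds before use.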