Gemma 4 E4B
Google's Gemma 4 E4B variant with Q4_K_M GGUF support on Jetson through llama.cpp
Serve the model
Start server
Choose module, then engine and optional parameters on the left, then copy the serve command by clicking the button on the right.
Call the model over Web API
Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.
curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/gemma-4-E4B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
llama.cpp server (OpenAI-compatible API)
After llama-server is running (with the container started using --network host), call it from another machine on the LAN by setting ${JETSON_HOST} to the Jetson's hostname or IP address. llama-server listens on port 8080 by default unless you pass --port.
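Before sending a chat request, it can help to confirm that the server is reachable from the client machine and has finished loading the model. The following is a minimal Python sketch, assuming llama-server's /health endpoint and the default port 8080; the helper names `base_url` and `server_ready`, and the fallback to localhost when JETSON_HOST is unset, are illustrative choices rather than part of llama.cpp.

```python
import os
import urllib.error
import urllib.request


def base_url(host: str, port: int = 8080) -> str:
    """Build the server root URL (8080 is assumed as the llama-server default)."""
    return f"http://{host}:{port}"


def server_ready(host: str, port: int = 8080, timeout: float = 3.0) -> bool:
    """Return True only if GET /health answers with HTTP 200."""
    url = base_url(host, port) + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx replies
        # (for example a 503 while the model is still loading).
        return False


if __name__ == "__main__":
    # JETSON_HOST mirrors the variable used in the curl commands on this page.
    host = os.environ.get("JETSON_HOST", "localhost")
    print("ready" if server_ready(host) else "not ready")
```

If the check reports "not ready" from a remote machine but succeeds on the Jetson itself, the port or the container's network mode is the likely place to look.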
curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Model Details
To learn more about the Gemma 4 family and the different ways to run it on Jetson, see the Gemma 4 on Jetson tutorial.
Gemma 4 E4B is a lightweight Gemma 4 model that can be served locally on Jetson with llama.cpp. In Google’s launch material, E4B is framed as the stronger edge-focused sibling to E2B, combining on-device efficiency with materially better coding, reasoning, and multimodal performance.
- Local coding assistants on Orin NX, AGX Orin, or Thor
- Multimodal document and screen understanding with optional voice input
- Tool-using assistants that need better reasoning than E2B
- A balanced default for edge AI demos or products that need better quality without moving to the larger models
Inputs and Outputs
Input: Text, image, and audio
Output: Text
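Image input is typically passed through the same chat-completions route, as a content part alongside the text. The sketch below builds such a message in Python using the OpenAI-style `image_url` part with a base64 data URL. It assumes the server was started with multimodal support enabled (for llama-server this usually means loading a projector file with --mmproj, which this page does not show); the helper names `image_message` and `chat_payload` and the model name `my_model` are illustrative.

```python
import base64
import json


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build one user message holding a text part and an inline image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime};base64,{encoded}"},
            },
        ],
    }


def chat_payload(model: str, messages: list) -> str:
    """Serialize the request body for POST /v1/chat/completions."""
    return json.dumps({"model": model, "messages": messages})


if __name__ == "__main__":
    # Placeholder bytes stand in for a real image file read from disk.
    fake_image = b"\x89PNG placeholder bytes"
    body = chat_payload("my_model", [image_message("Describe this image.", fake_image)])
    print(body[:120])
```

The resulting string can be sent as the -d body of the curl commands above. Audio input is not covered here because the request shape for it may differ between engines.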
Supported Platforms
- Jetson Orin
- Jetson Thor
Inference Engine
This model is configured to run on Jetson with vLLM and llama.cpp.
Official Highlights
- Google’s model card describes E4B as a dense multimodal model with 4.5B effective parameters and 8B parameters including embeddings.
- It supports 128K context, text/image/audio input, function calling, and configurable thinking mode.
- In Google’s published benchmark table, E4B lands well above E2B on reasoning, coding, and vision tasks, making it the better general-purpose edge choice when memory allows.
- Like E2B, E4B includes official support for automatic speech recognition and speech translation on short audio clips.
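The function-calling support listed above is usually exercised through the `tools` field of an OpenAI-compatible chat request. The following Python sketch shows how such a request could be built and how tool calls could be read back from a response. It assumes the serving engine has tool calling enabled (depending on the llama.cpp version this may require a chat-template option such as --jinja); the `get_weather` tool, the helper names, and the sample response are hypothetical and hand-written for illustration, not real model output.

```python
import json

# Hypothetical tool definition in the OpenAI "tools" schema.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(model: str, prompt: str, tools: list) -> dict:
    """Build a chat-completions body that offers the given tools to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
    }


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from the first choice, or an empty list."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    # In the OpenAI format, arguments arrive as a JSON-encoded string.
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]


if __name__ == "__main__":
    request = build_tool_request("my_model", "Weather in Oslo?", [GET_WEATHER_TOOL])
    print(json.dumps(request, indent=2))

    # Hand-written sample response used only to demonstrate the parsing step.
    sample_response = {
        "choices": [{
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [{
                    "id": "call_0",
                    "type": "function",
                    "function": {"name": "get_weather",
                                 "arguments": "{\"city\": \"Oslo\"}"},
                }],
            }
        }]
    }
    print(extract_tool_calls(sample_response))
```

The application is responsible for running the named function and returning its result to the model in a follow-up message with the `tool` role.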