Multimodal

Gemma 4 E4B

Google's Gemma 4 E4B, served on Jetson through llama.cpp as a Q4_K_M GGUF quantization

Parameters: 4.5B effective (8B with embeddings)

Modalities: Text, Image, Audio

Context Length: 128K

License: Apache 2.0

Precision: Q4_K_M GGUF

Serve the model

Start server

Choose a module, then an engine and any optional parameters on the left, then copy the serve command with the button on the right.

Call the model over Web API

Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/gemma-4-E4B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

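import os

from openai import OpenAI

# Python does not expand ${JETSON_HOST}, so read it from the environment.
# The "localhost" fallback is an assumption for running on the Jetson itself.
JETSON_HOST = os.environ.get("JETSON_HOST", "localhost")

client = OpenAI(
    base_url=f"http://{JETSON_HOST}:8000/v1",
    api_key="not-needed",  # vLLM / llama.cpp typically do not enforce a key
)

completion = client.chat.completions.create(
    model="unsloth/gemma-4-E4B",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)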

llama.cpp server (OpenAI-compatible API)

Once llama-server is running in a container started with --network host, you can call it from another machine on the LAN. Set ${JETSON_HOST} to the Jetson's hostname or IP address. llama-server listens on port 8080 by default unless you pass --port.

curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

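import os

from openai import OpenAI

# Python does not expand ${JETSON_HOST}, so read it from the environment.
# The "localhost" fallback is an assumption for running on the Jetson itself.
JETSON_HOST = os.environ.get("JETSON_HOST", "localhost")

client = OpenAI(
    base_url=f"http://{JETSON_HOST}:8080/v1",
    api_key="not-needed",  # vLLM / llama.cpp typically do not enforce a key
)

completion = client.chat.completions.create(
    model="my_model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)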
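
Because "my_model" above is a placeholder, it can help to ask the server which model ids it reports before sending a chat request. The sketch below assumes the server exposes the usual OpenAI-style GET /v1/models endpoint; the helper names (base_url, parse_model_ids, list_model_ids) are illustrative and not part of llama.cpp or this page.

```python
import json
import urllib.request


def base_url(host: str, port: int = 8080) -> str:
    """Build the OpenAI-compatible base URL for a server on the LAN."""
    return f"http://{host}:{port}/v1"


def parse_model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_model_ids(host: str, port: int = 8080, timeout: float = 5.0) -> list:
    """Query GET /v1/models and return the ids the server reports."""
    url = f"{base_url(host, port)}/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

Calling list_model_ids(os.environ["JETSON_HOST"]) from another machine should return a list of ids if the server is reachable; one of those ids can then be passed as the "model" field in the requests above.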

Benchmark

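Gemma 4 E4B · vLLM · NVFP4 / W4A16 · ISL 2048 / OSL 128

(Interactive chart: results selectable by engine and concurrency.)

C = concurrent requests. ISL and OSL are the input and output sequence lengths in tokens. Results will vary with image, clocks, and workload.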
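
To make the concurrency setting concrete, here is a minimal sketch of measuring aggregate output-token throughput with C requests in flight at once. It is not the harness behind the published numbers; run_concurrent and fake_send are hypothetical names, and fake_send stands in for a real HTTP request that would return the number of generated tokens.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_concurrent(send: Callable[[], int], concurrency: int) -> dict:
    """Issue `concurrency` requests at once; `send` returns generated-token count."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send) for _ in range(concurrency)]
        tokens = [f.result() for f in futures]  # wait for every request
    elapsed = time.perf_counter() - start
    total = sum(tokens)
    return {
        "concurrency": concurrency,
        "output_tokens": total,
        "elapsed_s": elapsed,
        "tokens_per_s": total / elapsed if elapsed > 0 else 0.0,
    }


def fake_send() -> int:
    """Stand-in for a real request: sleeps briefly, reports 128 output tokens (OSL 128)."""
    time.sleep(0.01)
    return 128
```

Replacing fake_send with a function that posts an ISL-2048 prompt to the server and returns the completion token count would give a rough throughput figure for a chosen C on your own device.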

Model Details


To learn more about the Gemma 4 family and the different ways to run it on Jetson, see the Gemma 4 on Jetson tutorial.

Gemma 4 E4B is a lightweight Gemma 4 model that can be served locally on Jetson with llama.cpp. In Google’s launch material, E4B is framed as the stronger edge-focused sibling to E2B, combining on-device efficiency with materially better coding, reasoning, and multimodal performance.

Typical use cases:

Local coding assistants on Orin NX, AGX Orin, or Thor
Multimodal document and screen understanding with optional voice input
Tool-using assistants that need better reasoning than E2B
A balanced default for edge AI demos or products that need better quality without moving to the larger models

Inputs and Outputs

Input: Text, image, and audio

Output: Text
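
As an illustration of image input, the sketch below builds a chat message in the OpenAI-style content-parts format, with the image embedded as a base64 data URL. It assumes the server was started with image support enabled and accepts this format; image_message is a hypothetical helper, not an API of llama.cpp or vLLM.

```python
import base64


def image_message(image_bytes: bytes, prompt: str, mime: str = "image/png") -> dict:
    """Build a user message carrying a text prompt plus one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime};base64,{encoded}"},
            },
        ],
    }
```

The returned dict can be placed in the "messages" list of a chat completion request in place of the plain-text "Hello!" message, for example image_message(open("screenshot.png", "rb").read(), "Describe this screen.").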

Supported Platforms

Jetson Orin
Jetson Thor

Inference Engine

This model is configured to run on Jetson with vLLM and llama.cpp.

Official Highlights

Google’s model card describes E4B as a dense multimodal model with 4.5B effective parameters and 8B parameters including embeddings.
It supports 128K context, text/image/audio input, function calling, and configurable thinking mode.
In Google’s published benchmark table, E4B lands well above E2B on reasoning, coding, and vision tasks, making it the better general-purpose edge choice when memory allows.
Like E2B, E4B includes official support for automatic speech recognition and speech translation on short audio clips.
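
Function calling, mentioned in the highlights above, can be sketched with the OpenAI-style "tools" field. This assumes the serving engine has tool calling enabled and returns tool calls in the OpenAI response shape, which depends on server configuration. The get_weather tool and the helper names are hypothetical and exist only for illustration.

```python
import json


def weather_tool() -> dict:
    """Describe a hypothetical get_weather function as a JSON-schema tool."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }


def build_request(model: str, user_text: str, tools: list) -> dict:
    """Assemble a chat completion request body that offers tools to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": tools,
    }


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from an OpenAI-style response body."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]
```

The request body from build_request can be posted to /v1/chat/completions; if the model decides to call the tool, extract_tool_calls yields the function name and its parsed arguments so the application can run the function and send the result back in a follow-up message.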