Multimodal

Gemma 4 E4B

Google's Gemma 4 E4B variant with Q4_K_M GGUF support on Jetson through llama.cpp

Parameters: 4.5B effective (8B with embeddings)
Modalities: Text, Image, Audio
Context Length: 128K
License: Apache 2.0
Precision: Q4_K_M GGUF

Serve the model

Start server

Choose a module, then an engine and any optional parameters on the left, then copy the generated serve command with the button on the right.

Call the model over Web API

Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/gemma-4-E4B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

llama.cpp server (OpenAI-compatible API)

After llama-server is running in a container started with --network host, call it from another machine on the LAN (set ${JETSON_HOST} to the Jetson's hostname or IP address). llama-server listens on port 8080 by default unless you pass --port.

curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
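Both commands above differ only in host, port, and model name, so it can help to build the base URL in one place and ask the server which model names it reports. The following is a minimal sketch, not an official client: `api_base` and `list_models` are hypothetical helper names, the `localhost` fallback is an assumption for running directly on the Jetson, and it relies on the server exposing the OpenAI-style `/v1/models` route.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of llama.cpp or vLLM).

# Print the base URL of the OpenAI-compatible API.
# $1: port (defaults to 8080, the llama-server default mentioned above)
api_base() {
  local port="${1:-8080}"
  local host="${JETSON_HOST:-localhost}"  # assumption: fall back to localhost
  printf 'http://%s:%s/v1' "$host" "$port"
}

# Ask the server which model names it reports (needs a running server).
# $1: port, passed through to api_base
list_models() {
  curl -s "$(api_base "$1")/models"
}

# Example, left commented out because it requires a live server:
#   JETSON_HOST=192.168.1.50 list_models 8080
```

The name returned by `list_models` can then be used in the "model" field of the chat completion request in place of the placeholder.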

Model Details

If you want to learn more about the Gemma 4 family and the different ways to run it on Jetson, check out the Gemma 4 on Jetson tutorial.

Gemma 4 E4B is a lightweight Gemma 4 model that can be served locally on Jetson with llama.cpp. In Google’s launch material, E4B is framed as the stronger edge-focused sibling to E2B, combining on-device efficiency with materially better coding, reasoning, and multimodal performance. Typical use cases include:

  • Local coding assistants on Orin NX, AGX Orin, or Thor
  • Multimodal document and screen understanding with optional voice input
  • Tool-using assistants that need better reasoning than E2B
  • A balanced default for edge AI demos or products that need better quality without moving to the larger models

Inputs and Outputs

Input: Text, image, and audio

Output: Text
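
Image input can be sent through the same chat completions endpoint using OpenAI-style content parts. The sketch below only builds the request body. It assumes the server was started with image support enabled, that the image is a PNG, and that the prompt contains no characters needing JSON escaping. `image_payload` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a chat completion body with one text part and
# one image part encoded as a base64 data URI.
# $1: model name, $2: prompt (no quotes or backslashes), $3: path to a PNG
image_payload() {
  local model="$1" prompt="$2" image_path="$3"
  local b64
  # Encode the file and strip line wraps so the data URI stays on one line.
  b64="$(base64 < "$image_path" | tr -d '\n')"
  printf '{"model":"%s","messages":[{"role":"user","content":[' "$model"
  printf '{"type":"text","text":"%s"},' "$prompt"
  printf '{"type":"image_url","image_url":{"url":"data:image/png;base64,%s"}}' "$b64"
  printf ']}]}\n'
}

# Example, left commented out because it requires a live server:
#   image_payload my_model "Describe this image." photo.png |
#     curl -s "http://${JETSON_HOST}:8080/v1/chat/completions" \
#       -H "Content-Type: application/json" -d @-
```

Piping the body to `curl -d @-` avoids putting a large base64 string on the command line.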

Supported Platforms

  • Jetson Orin
  • Jetson Thor

Inference Engine

This model is configured to run on Jetson with vLLM and llama.cpp.

Official Highlights

  • Google’s model card describes E4B as a dense multimodal model with 4.5B effective parameters and 8B parameters including embeddings.
  • It supports 128K context, text/image/audio input, function calling, and configurable thinking mode.
  • In Google’s published benchmark table, E4B lands well above E2B on reasoning, coding, and vision tasks, making it the better general-purpose edge choice when memory allows.
  • Like E2B, E4B includes official support for automatic speech recognition and speech translation on short audio clips.
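
To illustrate the function-calling highlight, here is a sketch of a request body that offers the model one tool, using the OpenAI-style `tools` field. The `get_weather` tool and the `tool_payload` helper are hypothetical. Whether the server honors `tools` depends on how it was started, so treat this as an assumption to verify against your engine's documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a chat completion body that declares one tool.
# $1: model name
tool_payload() {
  local model="$1"
  cat <<EOF
{
  "model": "${model}",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Look up the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}
EOF
}

# Example, left commented out because it requires a live server:
#   tool_payload my_model |
#     curl -s "http://${JETSON_HOST}:8080/v1/chat/completions" \
#       -H "Content-Type: application/json" -d @-
```

If the model decides to call the tool, an OpenAI-compatible server returns the call in the response message instead of plain text. The client then runs the tool itself and sends the result back in a follow-up message.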