Multimodal

Gemma 4 31B

Google's Gemma 4 31B variant with Q4_K_M GGUF support on Jetson through llama.cpp

Parameters: 31B
Modalities: Text, Image
Context Length: 256K
License: Apache 2.0
Precision: Q4_K_M GGUF
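As a rough sizing aid, the weight footprint of a quantized checkpoint can be estimated from the parameter count and the average number of bits per weight. The sketch below assumes about 4.8 bits per weight for Q4_K_M, which is an approximate figure that varies with the tensor mix, and a parameter count of 30.7e9. It ignores the KV cache, any vision projector, and runtime buffers, so treat the result as a lower bound. The function name is hypothetical.

```python
def estimate_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


if __name__ == "__main__":
    # 30.7e9 parameters; 4.8 bits per weight is an assumed average for
    # Q4_K_M, not a measured value for this checkpoint.
    print(f"{estimate_weight_gib(30.7e9, 4.8):.1f} GiB")
```

Long contexts add KV-cache memory on top of this figure, so the usable context on a given device may be well below the model's maximum.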

Serve the model

Start an OpenAI-compatible server for the model on the Jetson before sending requests. The exact serve command depends on the engine and the optional parameters you choose.

Call the model over Web API

Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-4-31B-it-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

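import os

from openai import OpenAI

# ${JETSON_HOST} is shell syntax and is not expanded inside a Python string,
# so read the host from the environment (falls back to localhost).
jetson_host = os.environ.get("JETSON_HOST", "localhost")

client = OpenAI(
    base_url=f"http://{jetson_host}:8000/v1",
    api_key="not-needed",  # vLLM / llama.cpp typically do not enforce a key
)

completion = client.chat.completions.create(
    model="ggml-org/gemma-4-31B-it-GGUF",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)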

llama.cpp server (OpenAI-compatible API)

After llama-server is running in a container started with --network host, you can call it from another machine on the LAN. Set JETSON_HOST to the Jetson's hostname or IP address first. llama-server listens on port 8080 by default unless you pass --port.

curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

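import os

from openai import OpenAI

# ${JETSON_HOST} is shell syntax and is not expanded inside a Python string,
# so read the host from the environment (falls back to localhost).
jetson_host = os.environ.get("JETSON_HOST", "localhost")

client = OpenAI(
    base_url=f"http://{jetson_host}:8080/v1",
    api_key="not-needed",  # vLLM / llama.cpp typically do not enforce a key
)

completion = client.chat.completions.create(
    model="my_model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)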
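If the openai package is not installed on the client machine, the same request can be made with the Python standard library alone. This is a minimal sketch: the helper names build_chat_request and send are hypothetical, and the host, port, and model name are placeholders matching the example above.

```python
import json
import urllib.request


def build_chat_request(host: str, port: int, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server), with the host taken from JETSON_HOST:
#   import os
#   req = build_chat_request(os.environ["JETSON_HOST"], 8080, "my_model", "Hello!")
#   print(send(req))
```

Separating request construction from sending makes it easy to inspect the URL and JSON body before anything goes over the network.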

Benchmark

[Benchmark chart: Gemma 4 31B · vLLM · NVFP4 / AWQ · ISL 2048 / OSL 128, selectable by engine and concurrency]

C = concurrent requests. Results will vary with image, clocks, and workload.
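To make the concurrency setting concrete, the sketch below keeps at most C requests in flight and reports aggregate generated tokens per second. It is not the harness behind the published numbers. The names measure_throughput and fake_request are hypothetical, and fake_request is a stand-in that sleeps instead of calling a server, returning 128 tokens to mirror the OSL above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_throughput(send_one: Callable[[], int], n_requests: int, concurrency: int) -> float:
    """Run n_requests calls with at most `concurrency` in flight.

    send_one() performs one request and returns the number of generated
    tokens. The return value is aggregate generated tokens per second.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send_one) for _ in range(n_requests)]
        total_tokens = sum(f.result() for f in futures)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_request() -> int:
    """Stand-in for a real API call: sleeps briefly, reports 128 tokens."""
    time.sleep(0.05)
    return 128


if __name__ == "__main__":
    for c in (1, 4, 8):
        print(f"C={c}: {measure_throughput(fake_request, 8, c):.0f} tokens/s")
```

For a real measurement, replace fake_request with a function that sends one chat completion and returns the completion token count reported by the server. Real servers do not scale linearly with C the way the sleeping stub does.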

Model Details

To learn more about the Gemma 4 family and the different ways to run it on Jetson, see the Gemma 4 on Jetson tutorial.

Gemma 4 31B is the largest model in the Gemma 4 set listed here, and it can be served on Jetson with llama.cpp. Google’s launch post describes 31B as the flagship dense model in the family, aimed at the best possible raw quality for local reasoning, coding, and agentic workflows.

Highest-quality local reasoning and coding on Jetson Thor or well-provisioned AGX Orin setups
Long-context assistants over large documents or repositories
Multimodal analysis of screenshots, charts, forms, and PDFs
Advanced agent systems where answer quality matters more than minimum latency

Inputs and Outputs

Input: Text and image

Output: Text
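For image input, OpenAI-compatible chat endpoints generally accept a user message whose content is a list of parts, with the image passed as a base64 data URL. The sketch below only builds such a message. It assumes the server was started with multimodal support enabled, which this page does not confirm for a given deployment, and build_image_message is a hypothetical helper name.

```python
import base64


def build_image_message(prompt: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Return one user message carrying a text part and an inline image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


# Usage with an OpenAI client (requires a running multimodal server):
#   with open("chart.png", "rb") as f:
#       message = build_image_message("Describe this chart.", f.read())
#   client.chat.completions.create(model="my_model", messages=[message])
```

The response is plain text, as with a text-only request.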

Supported Platforms

Jetson AGX Orin
Jetson Thor

Inference Engine

This model is configured to run on Jetson with vLLM and llama.cpp.

Official Highlights

Google’s model card describes 31B as a dense multimodal model with 30.7B parameters, 256K context, and text/image input.
The Gemma 4 launch post positions 31B as the top-quality model in the family and states that it ranked #3 among open models on the Arena AI text leaderboard at launch.
In Google’s published benchmark table, 31B is the strongest Gemma 4 variant across the major reasoning, coding, and multimodal rows shown in the card.
Google also calls out 31B as a strong foundation for fine-tuning when quality matters more than latency.