Multimodal

Gemma 4 31B

Google's Gemma 4 31B variant with Q4_K_M GGUF support on Jetson through llama.cpp

Parameters: 31B
Modalities: Text, Image
Context Length: 256K
License: Apache 2.0
Precision: Q4_K_M GGUF

Serve the model

Start server

Choose a module, then an engine and any optional parameters, and copy the resulting serve command.


Call the model over Web API

Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-4-31B-it-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

llama.cpp server (OpenAI-compatible API)

Once llama-server is running with --network host, you can call it from another machine on the LAN by setting ${JETSON_HOST} to the Jetson's hostname or IP address. llama-server listens on port 8080 by default unless you pass --port.

curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
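Because the model accepts images as well as text, the same endpoint can carry an image in the OpenAI-style `image_url` content part. The following is a minimal sketch, not a command taken from this page: `build_image_payload` is a hypothetical helper, it assumes a PNG input and a prompt with no double quotes or backslashes, and it assumes the server was started with the model's multimodal projector loaded.

```shell
# Print a chat-completions JSON body that pairs a text prompt with one PNG image.
# Hypothetical helper; no JSON escaping is applied to the prompt.
build_image_payload() {
  model="$1"; prompt="$2"; image="$3"
  # Some base64 implementations wrap long output, so strip newlines.
  b64=$(base64 < "$image" | tr -d '\n')
  printf '{"model": "%s", "messages": [{"role": "user", "content": [' "$model"
  printf '{"type": "text", "text": "%s"}, ' "$prompt"
  printf '{"type": "image_url", "image_url": {"url": "data:image/png;base64,%s"}}' "$b64"
  printf ']}]}\n'
}
```

To send it, pipe the body into curl, for example: `build_image_payload my_model "Describe this chart." chart.png | curl -s http://${JETSON_HOST}:8080/v1/chat/completions -H "Content-Type: application/json" -d @-`.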

Benchmark

Gemma 4 31B · vLLM · NVFP4 / AWQ · ISL 2048 / OSL 128

[Benchmark chart by engine and concurrency]

C = concurrent requests. Results will vary with image, clocks, and workload.
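To get a feel for what a concurrency level means in practice, the sketch below fires C identical requests in parallel and counts how many succeed. It is not the harness behind the benchmark on this page: `run_concurrent` and `one_request` are hypothetical names, the request only caps output at 128 tokens (it does not reproduce a 2048-token input), and the port and model name are copied from the first curl example above.

```shell
# Run the same command C times in parallel and report how many succeeded.
# Usage: run_concurrent C command [args...]
run_concurrent() {
  c="$1"; shift
  pids=""
  i=0
  while [ "$i" -lt "$c" ]; do
    "$@" >/dev/null 2>&1 &
    pids="$pids $!"
    i=$((i + 1))
  done
  ok=0
  for pid in $pids; do
    # wait returns the exit status of that background job.
    if wait "$pid"; then ok=$((ok + 1)); fi
  done
  echo "$ok/$c requests succeeded"
}

# One chat request with output capped at 128 tokens (assumed endpoint and model name).
one_request() {
  curl -sf "http://${JETSON_HOST}:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "ggml-org/gemma-4-31B-it-GGUF", "max_tokens": 128,
         "messages": [{"role": "user", "content": "Hello!"}]}'
}
```

Running `time run_concurrent 8 one_request` gives a rough wall-clock figure for eight simultaneous requests; as noted above, results will vary with image, clocks, and workload.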

Model Details

To learn more about the Gemma 4 family and the different ways to run it on Jetson, see the Gemma 4 on Jetson tutorial.

Gemma 4 31B is the largest of the Gemma 4 models listed here, and it can be served on Jetson with llama.cpp. Google’s launch post presents 31B as the flagship dense model in the family, aimed at the best possible raw quality for local reasoning, coding, and agentic workflows. Typical uses include:

  • Highest-quality local reasoning and coding on Jetson Thor or well-provisioned AGX Orin setups
  • Long-context assistants over large documents or repositories
  • Multimodal analysis of screenshots, charts, forms, and PDFs
  • Advanced agent systems where answer quality matters more than minimum latency

Inputs and Outputs

Input: Text and image

Output: Text

Supported Platforms

  • Jetson AGX Orin
  • Jetson Thor

Inference Engine

This model is configured to run on Jetson with vLLM and llama.cpp.

Official Highlights

  • Google’s model card describes 31B as a dense multimodal model with 30.7B parameters, 256K context, and text/image input.
  • The Gemma 4 launch post positions 31B as the top-quality model in the family and states that it ranked #3 among open models on the Arena AI text leaderboard at launch.
  • In Google’s published benchmark table, 31B is the strongest Gemma 4 variant across the major reasoning, coding, and multimodal rows shown in the card.
  • Google also calls out 31B as a strong foundation for fine-tuning when quality matters more than latency.