MiniMax M2.7
MiniMax's 230B agentic MoE flagship for software engineering and self-evolving agent harnesses, served with llama.cpp at 4-bit
Serve the model
Start server
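The catalog entry does not include a ready-made serve command for this module and engine. A minimal llama-server invocation for the UD-IQ4_XS GGUF is sketched below; the model path, context size, and GPU-layer count are assumptions, so adjust them to your download location and memory budget.

# Minimal sketch: serve the UD-IQ4_XS GGUF with llama-server.
# The model path is hypothetical; point -m at the first shard of your local copy.
llama-server \
-m /models/MiniMax-M2.7-GGUF/UD-IQ4_XS/MiniMax-M2.7-UD-IQ4_XS-00001-of-00003.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 8192 \
-ngl 99 \
--jinja

Once the weights finish loading, the OpenAI-compatible endpoints described in the next section are available on the chosen port.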
Call the model over Web API
Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.
llama.cpp server (OpenAI-compatible API)
Once llama-server is running and reachable on the LAN (for example, in a container started with --network host), call it from another machine by setting ${JETSON_HOST} to the Jetson's address. The default port is 8080 unless you pass --port.
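To confirm the server is reachable before sending a chat request, you can hit llama-server's /health endpoint first (the port below assumes the default):

curl -s http://${JETSON_HOST}:8080/health

It reports whether the model has finished loading.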
curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my_model",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Benchmark
MiniMax M2.7 · llama.cpp · UD-IQ4_XS GGUF · ISL 2048 / OSL 128
C = concurrent requests; ISL / OSL = input / output sequence length in tokens. Results will vary with image, clocks, and workload.
Model Details
MiniMax M2.7 is MiniMax's flagship agentic Mixture-of-Experts model, designed to build complex agent harnesses and complete highly elaborate productivity tasks. M2.7 is the first MiniMax model that deeply participates in its own evolution: during development the model autonomously updated its own memory, built dozens of complex skills for RL experiments, and improved its own learning process based on experiment results.
This page describes serving the Unsloth dynamic 4-bit GGUF (UD-IQ4_XS, 100.96 GiB) on Jetson AGX Thor T5000 with llama.cpp.
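If the quantized weights are not on the device yet, one way to fetch only the UD-IQ4_XS shards is with huggingface-cli; the repository id and target directory below are assumptions based on the resources listed at the end of this page, so adjust them to your setup.

# Hypothetical download sketch: fetch only the UD-IQ4_XS shards.
huggingface-cli download unsloth/MiniMax-M2.7-GGUF \
--include "UD-IQ4_XS/*" \
--local-dir /models/MiniMax-M2.7-GGUF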
Inputs and Outputs
Input: Text
Output: Text (with optional reasoning traces between <think>...</think>)
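Because the reasoning trace is returned inline in the message content, a client that only wants the final answer may need to strip it. A minimal sketch, assuming the chat-completion response was saved to a hypothetical response.json:

# Extract the assistant message and drop the optional <think>...</think> block.
jq -r '.choices[0].message.content' response.json \
| python3 -c 'import re,sys; print(re.sub(r"<think>.*?</think>\s*", "", sys.stdin.read(), flags=re.S))'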
Highlights
- 229B total / 10B active sparse MoE (minimax-m2 arch), 196K context.
- Strong real-world software engineering and agentic tool use.
- Self-evolving training loop: M2.7 helped optimize its own programming scaffold during RL.
Intended Use Cases
- Coding agents: bug triage, refactors, code review, security analysis, and SRE-style root-cause investigations (a tool-call request sketch follows this list)
- Long-running productivity agents: document and spreadsheet automation with multi-turn tool use
- Agent harness research: as a strong open-source backbone for tool-using and self-improving agent loops
- On-device RAG / repo Q&A at the edge, when very large parameter counts matter more than minimum latency
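The coding-agent and productivity-agent cases above rely on tool calling through the OpenAI-compatible API. A minimal request that declares a single tool is sketched below; the run_shell tool is hypothetical, and whether the model actually returns a tool_calls message depends on the chat template and the llama.cpp build.

curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "my_model",
  "messages": [{"role": "user", "content": "List the Python files in this repo."}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "run_shell",
      "description": "Run a shell command in the repository checkout",
      "parameters": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"]
      }
    }
  }]
}'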
Additional Resources
- Unsloth MiniMax-M2.7-GGUF on Hugging Face: quantized weights (this page uses UD-IQ4_XS)
- MiniMaxAI/MiniMax-M2.7: original BF16 weights and model card