
Qwen3.6 35B-A3B (MoE)

Alibaba's latest Mixture-of-Experts model with 35B total / 3B active parameters, featuring native tool calling and MTP speculative decoding


Parameters: 24GB

Modalities: Text

Precision: NVFP4, AWQ-4bit

Serve the model

Start the server with one of the docker commands in the Running with vLLM section, then return here to send a request.

Call the model over Web API

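With the server running, send a request using one of the clients below: a curl command first, then the equivalent Python using the openai package. Set JETSON_HOST to the Jetson's hostname or IP address, and make sure the "model" value matches the name the server was started with (by default, the checkpoint ID passed to vllm serve).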

curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

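import os

from openai import OpenAI

# Python does not expand ${JETSON_HOST}, so read it from the environment instead.
jetson_host = os.environ.get("JETSON_HOST", "localhost")

client = OpenAI(
    base_url=f"http://{jetson_host}:8000/v1",
    api_key="not-needed",  # vLLM / llama.cpp typically do not enforce a key
)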

completion = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)

Model Details


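Qwen3.6 35B-A3B is a Mixture-of-Experts (MoE) model from Alibaba Cloud's Qwen3.6 family. It has 35 billion total parameters, of which only about 3 billion are active for each token during inference. Because only the selected experts run for each token, per-token compute is far lower than for a dense 35B model, although the full set of weights must still fit in memory. This trade-off suits edge devices.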

Inputs and Outputs

Input: Text

Output: Text

Intended Use Cases

Reasoning: Advanced logical and analytical reasoning with chain-of-thought
Function Calling: Native support for tool use and function calling
Multilingual Instruction Following: Following instructions across 100+ languages
Code Generation: Programming assistance in multiple languages
Translation: High-quality translation between supported languages
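
To make the function-calling use case above more concrete, here is a minimal sketch of a tool-calling round trip against the OpenAI-compatible endpoint. It assumes the server was started with the --enable-auto-tool-choice and --tool-call-parser flags shown on this page. The get_weather tool, its schema, and the helper names are hypothetical and exist only for illustration. The script contacts the server only when JETSON_HOST is set.

```python
import json
import os
import urllib.request


# Hypothetical tool for illustration; not part of the model or of vLLM.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"


# Tool schema in the OpenAI chat-completions "tools" format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Maps tool names the model may request to local Python callables.
LOCAL_TOOLS = {"get_weather": get_weather}


def build_request(prompt: str, model: str) -> dict:
    """Build a chat-completions payload that advertises the tools."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": TOOLS,
    }


def run_tool_call(tool_call: dict) -> str:
    """Execute one tool call; arguments arrive as a JSON-encoded string."""
    fn = tool_call["function"]
    args = json.loads(fn["arguments"])
    return LOCAL_TOOLS[fn["name"]](**args)


def post_chat(host: str, payload: dict) -> dict:
    """POST the payload to the chat-completions endpoint and decode the reply."""
    req = urllib.request.Request(
        f"http://{host}:8000/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


if __name__ == "__main__" and os.environ.get("JETSON_HOST"):
    # Assumed model name; replace it with the name your server reports.
    payload = build_request("What is the weather in Paris?", "Qwen/Qwen3.6-35B-A3B")
    message = post_chat(os.environ["JETSON_HOST"], payload)["choices"][0]["message"]
    for call in message.get("tool_calls") or []:
        print(call["function"]["name"], "->", run_tool_call(call))
```

In a full agent loop, each tool result would be appended to the conversation as a "tool" role message and sent back so the model can produce its final answer. That step is omitted here to keep the sketch short.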

Running with vLLM

Jetson Orin (AWQ 4-bit checkpoint):

sudo docker run -it --rm --pull always --runtime=nvidia --network host \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit \
    --gpu-memory-utilization 0.8 --enable-prefix-caching \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice --tool-call-parser qwen3_coder \
    --max-model-len 4096

Jetson Thor (NVFP4 checkpoint):

sudo docker run -it --rm --pull always --runtime=nvidia --network host \
  vllm/vllm-openai:nightly-aarch64 \
  bash -c "pip install -q 'vllm[audio]' && vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
    --gpu-memory-utilization 0.8 --enable-prefix-caching \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice --tool-call-parser qwen3_coder"
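
Once a container is up, it can help to confirm which model name the server actually exposes, since client requests must use that exact name. The sketch below queries the /v1/models route of the OpenAI-compatible API using only the standard library. The helper names are made up for this example, and port 8000 is assumed from the client commands on this page. It contacts the server only when JETSON_HOST is set.

```python
import json
import os
import urllib.request


def models_url(host: str, port: int = 8000) -> str:
    """URL of the OpenAI-compatible model listing route."""
    return f"http://{host}:{port}/v1/models"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from a /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def served_models(host: str, port: int = 8000) -> list[str]:
    """Ask the running server which model names it serves."""
    with urllib.request.urlopen(models_url(host, port), timeout=10) as resp:
        return parse_model_ids(json.load(resp))


if __name__ == "__main__" and os.environ.get("JETSON_HOST"):
    # Use one of the printed names as the "model" field in client requests.
    print(served_models(os.environ["JETSON_HOST"]))
```

If the request fails with a connection error, the server is probably still loading weights. Retrying after a short wait is usually enough.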

Speculative Decoding with MTP

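This model supports Multi-Token Prediction (MTP) speculative decoding: the model drafts several tokens ahead and then verifies them in a single forward pass, which can significantly improve generation throughput when most drafted tokens are accepted. To enable it, add the following flag to your vllm serve command: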

--speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'
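
As a rough guide to how num_speculative_tokens affects throughput, the sketch below estimates how many tokens each verification step yields on average. It uses the standard simplifying assumption that every drafted token is accepted independently with the same probability. The acceptance rates are illustrative values, not measurements for this model, and the estimate ignores the cost of drafting, so real speedups will be smaller.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per verification step.

    With k drafted tokens and independent acceptance probability a, the
    expectation is 1 + a + a^2 + ... + a^k: the verifying pass always
    yields one token, plus one more for each consecutive accepted draft.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if num_speculative_tokens < 0:
        raise ValueError("num_speculative_tokens must be non-negative")
    return sum(acceptance_rate ** i for i in range(num_speculative_tokens + 1))


if __name__ == "__main__":
    # Hypothetical acceptance rates, for illustration only.
    for rate in (0.5, 0.7, 0.9):
        tokens = expected_tokens_per_step(rate, 4)
        print(f"acceptance {rate:.1f}: {tokens:.2f} tokens per step")
```

The sum flattens out quickly at low acceptance rates, so raising num_speculative_tokens beyond a small value mainly pays off when the drafts are usually correct.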

Qwen3.6 Family

Model	Parameters	Active Params	Type	Best For
Qwen3.6 35B-A3B	35B	3B	MoE	Efficient high-performance inference
Qwen3.6 27B	27B	27B	Dense	Maximum accuracy on demanding tasks
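
The trade-off in the table can be made concrete with some back-of-the-envelope arithmetic. The sketch below estimates weight storage at 4 bits per parameter and compares active parameters per token. It is an approximation only: it ignores quantization scales, any layers kept at higher precision, the KV cache, and activations, so actual memory use will be higher.

```python
def weight_gigabytes(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters * bits / 8."""
    return params_billion * bits_per_weight / 8


# Parameter counts taken from the family table above (in billions).
FAMILY = {
    "Qwen3.6 35B-A3B": {"total_b": 35, "active_b": 3},
    "Qwen3.6 27B": {"total_b": 27, "active_b": 27},
}

if __name__ == "__main__":
    for name, params in FAMILY.items():
        size = weight_gigabytes(params["total_b"], 4)
        print(f"{name}: ~{size:.1f} GB of weights at 4-bit, "
              f"{params['active_b']}B parameters active per token")
```

Under these assumptions the MoE model needs more memory for weights (about 17.5 GB versus 13.5 GB) but touches roughly nine times fewer parameters per generated token than the dense model.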

Additional Resources

Hugging Face Model - Original model weights
NVFP4 Checkpoint (Thor) - Quantized for Jetson Thor
AWQ Checkpoint (Orin) - Quantized for Jetson Orin