
Qwen3.5 35B-A3B (MoE)

Alibaba's latest Mixture-of-Experts model with 35B total / 3B active parameters, featuring native tool calling and MTP speculative decoding

Parameters: 35B total / 3B active
Modalities: Text
Context Length: 256K
License: Apache 2.0
Precision: NVFP4 / W4A16

Serve the model

Start server

Use the vLLM serve command from the Running with vLLM section below to start an OpenAI-compatible server on the device.

Call the model over Web API

Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.

# JETSON_HOST is the hostname or IP address of the Jetson running the server
curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-35B-A3B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
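
Because the server below is started with --enable-auto-tool-choice, the same endpoint also accepts OpenAI-style tool definitions. A minimal sketch of a tool-calling request, using a hypothetical get_weather function (not part of the model or server; substitute your own schema):

# get_weather is a hypothetical example tool, not a built-in
curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'

If the model decides to use the tool, the response message carries a tool_calls array instead of plain content.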

Benchmark

Qwen3.5-35B-A3B  · vLLM  · NVFP4 / W4A16 · ISL 2048 / OSL 128

(Interactive benchmark chart: results by engine and concurrency level.)

C = concurrent requests. Results will vary with container image, clocks, and workload.

Model Details

Qwen3.5 35B-A3B is a Mixture-of-Experts (MoE) model from Alibaba Cloud's Qwen3.5 family. It features 35 billion total parameters with only 3 billion active during inference, delivering strong performance with excellent efficiency on edge devices.

Inputs and Outputs

Input: Text

Output: Text

Intended Use Cases

  • Reasoning: Advanced logical and analytical reasoning with chain-of-thought
  • Function Calling: Native support for tool use and function calling
  • Multilingual Instruction Following: Following instructions across 100+ languages
  • Code Generation: Programming assistance in multiple languages
  • Translation: High-quality translation between supported languages

Running with vLLM

sudo docker run -it --rm --pull always --runtime=nvidia --network host \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16 \
    --gpu-memory-utilization 0.8 --enable-prefix-caching \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice --tool-call-parser qwen3_coder
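
Once the server is up, a quick sanity check (a sketch assuming the same ${JETSON_HOST} as in the client examples above) is to list the registered models. And because --reasoning-parser qwen3 is enabled, chat responses return the chain-of-thought in a separate reasoning_content field alongside the final answer:

# Confirm the server is running and the exact model name it serves
curl -s http://${JETSON_HOST}:8000/v1/models

# Print the reasoning trace and the final answer separately
curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}]
  }' | python3 -c "import json,sys; m=json.load(sys.stdin)['choices'][0]['message']; print(m.get('reasoning_content')); print(m['content'])"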

Speculative Decoding with MTP

This model supports Multi-Token Prediction (MTP) speculative decoding, which can significantly improve generation throughput. To enable it, add the following flag to your vllm serve command:

--speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'
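
A minimal sketch of the serve invocation with MTP enabled (same model and container as above; other flags omitted for brevity):

vllm serve Kbenkhaled/Qwen3.5-35B-A3B-quantized.w4a16 \
  --gpu-memory-utilization 0.8 --enable-prefix-caching \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'

num_speculative_tokens controls how many draft tokens are proposed per step; larger values can raise throughput on long generations, but acceptance rates vary by workload, so it is worth benchmarking a few settings.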

Qwen3.5 Family

Model            Parameters  Active Params  Type   Best For
Qwen3.5 35B-A3B  35B         3B             MoE    Efficient high-performance inference
Qwen3.5 27B      27B         27B            Dense  Maximum accuracy on demanding tasks

Additional Resources