Qwen3 4B
Alibaba's efficient 4-billion-parameter instruction-tuned language model
Memory Requirement: 4GB RAM
Precision: W4A16
Size: 2.5GB
Jetson Inference - Supported Inference Engines
Run Command (vLLM container):

sudo docker run -it --rm --pull always --runtime=nvidia --network host ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin vllm serve RedHatAI/Qwen3-4B-quantized.w4a16

Performance on Jetson Orin with vLLM
- 42.15 output tokens/sec with a single request (concurrency=1)
- 193.83 output tokens/sec with 8 parallel requests (concurrency=8)
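The concurrency=8 figure is aggregate throughput across all parallel requests, which is why it scales well beyond the single-request number. A minimal sketch of how such a measurement can be reproduced against the server started above, assuming the vLLM default port 8000 and the standard `usage.completion_tokens` field in OpenAI-style responses (the prompt and token budget here are illustrative):

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"  # vLLM's default serve port
MODEL = "RedHatAI/Qwen3-4B-quantized.w4a16"

def one_request(prompt: str, max_tokens: int = 128) -> int:
    """Send one chat completion; return its completion-token count."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["usage"]["completion_tokens"]

def aggregate_tokens_per_sec(token_counts, elapsed_s):
    """Aggregate output throughput across parallel requests."""
    return sum(token_counts) / elapsed_s

def benchmark(concurrency: int, prompt: str = "Tell me a short story."):
    """Fire `concurrency` identical requests in parallel and time them."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        counts = list(pool.map(one_request, [prompt] * concurrency))
    return aggregate_tokens_per_sec(counts, time.monotonic() - start)

# With the container running:
# benchmark(1)  # single-request throughput
# benchmark(8)  # aggregate throughput at concurrency=8
```

Dividing total output tokens by wall-clock time, rather than averaging per-request rates, matches how aggregate tokens/sec is usually reported.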
Model Details
Qwen3 is Alibaba Cloud’s latest generation of large language models, offering state-of-the-art performance across a wide range of tasks. The Qwen3 4B model provides an excellent balance of capability and efficiency for edge deployment.
Inputs and Outputs
Input: Text
Output: Text
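Since `vllm serve` exposes an OpenAI-compatible API (by default on port 8000), sending text in and getting text out can be sketched as follows; the helper names here are illustrative, not part of vLLM:

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed default port
MODEL = "RedHatAI/Qwen3-4B-quantized.w4a16"

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """OpenAI-style chat-completion body understood by vLLM."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt: str) -> str:
    """POST the request to the local server and return the reply text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# With the container running:
# print(ask("Explain W4A16 quantization in one sentence."))
```

Any OpenAI-compatible client library can be pointed at the same endpoint by overriding its base URL.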
Intended Use Cases
- Reasoning: Advanced logical and analytical reasoning tasks
- Function Calling: Native support for tool use and function calling
- Subject Matter Experts: Fine-tuning for domain-specific expertise
- Multilingual Instruction Following: Following instructions across 100+ languages
- Translation: High-quality translation between supported languages
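For the function-calling use case, tools are declared in the standard OpenAI `tools` schema that the OpenAI-compatible endpoint accepts. A hedged sketch of the request payload, where `get_weather` and its schema are purely illustrative:

```python
def make_tool(name: str, description: str, parameters: dict) -> dict:
    """Wrap a JSON-schema parameter spec in the OpenAI tool format."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": parameters,
        },
    }

# Hypothetical tool definition for illustration only.
weather_tool = make_tool(
    "get_weather",
    "Look up current weather for a city.",
    {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
)

request_body = {
    "model": "RedHatAI/Qwen3-4B-quantized.w4a16",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [weather_tool],
}
```

When the model decides to call a tool, the response carries a `tool_calls` entry with the function name and JSON arguments instead of plain text; note that vLLM may need tool-calling enabled via its server flags for this to be parsed automatically.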