- **Memory Requirement:** 16GB RAM
- **Precision:** W4A16
- **Size:** 12GB
## Model Details
GPT OSS 20B is OpenAI's open-weight, 20-billion-parameter language model. It requires the tiktoken encoding files to be downloaded before serving (Step 1 below).
## Running with vLLM
### Step 1: Download Tiktoken Encodings
```bash
mkdir -p $HOME/.cache/tiktoken
wget -q https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken \
  -O $HOME/.cache/tiktoken/cl100k_base.tiktoken
wget -q https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken \
  -O $HOME/.cache/tiktoken/o200k_base.tiktoken
```
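Before starting the container, it is worth confirming that both encoding files actually downloaded (with `wget -q`, a failed download is silent):

```bash
# Both files should be present and non-empty
ls -lh $HOME/.cache/tiktoken/cl100k_base.tiktoken \
       $HOME/.cache/tiktoken/o200k_base.tiktoken
```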
### Step 2: Serve
```bash
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  -v $HOME/.cache/tiktoken:/etc/encodings \
  -e TIKTOKEN_ENCODINGS_BASE=/etc/encodings \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve openai/gpt-oss-20b --gpu-memory-utilization 0.8
```
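Because the container runs with `--network host`, the server is reachable directly on the host. vLLM exposes an OpenAI-compatible API, by default on port 8000; a minimal smoke test with curl (assuming the default port):

```bash
# Send one chat completion request to the local vLLM server
# (vLLM listens on port 8000 by default; adjust if you pass --port to vllm serve)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "Hello from Jetson!"}],
        "max_tokens": 64
      }'
```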
## GPT OSS Family
| Model | Parameters | Memory | Minimum Jetson |
|---|---|---|---|
| GPT OSS 20B | 20B | 16GB RAM | AGX Orin |
| GPT OSS 120B | 120B | 64GB RAM | Thor |
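Jetson boards use unified memory shared by the CPU and GPU, so total system RAM is what counts against the requirements above. A quick way to check what your board has (a simple sketch; `tegrastats` ships with JetPack/L4T):

```bash
# Total unified memory (CPU and GPU share this pool on Jetson)
free -h

# Live memory and GPU utilization on Jetson boards
sudo tegrastats
```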