GPT OSS 20B
OpenAI's open-source 20-billion-parameter language model
Serve the model
Start server
Choose module, then engine and optional parameters on the left, then copy the serve command by clicking the button on the right.
Call the model over Web API
Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.
curl -s http://${JETSON_HOST}:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-oss-20b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Benchmark
GPT-OSS-20B · vLLM · MXFP4 / W4A16 · ISL 2048 / OSL 128
C = concurrent requests; ISL and OSL = input and output sequence length in tokens. Results will vary with image, clocks, and workload.
Model Details
GPT OSS 20B is OpenAI's open-source 20-billion-parameter language model. Its tiktoken encodings must be downloaded before serving.
Running with vLLM
Step 1: Download Tiktoken Encodings
mkdir -p $HOME/.cache/tiktoken
wget -q https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken \
-O $HOME/.cache/tiktoken/cl100k_base.tiktoken
wget -q https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken \
-O $HOME/.cache/tiktoken/o200k_base.tiktoken
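Because `wget -q` prints nothing, a failed download is easy to miss and can leave an empty file behind. The following is a minimal sketch of a pre-flight check, assuming the two file names and the cache directory used above; the function name `check_encodings` is made up for illustration and is not part of any tool.

```shell
# Hypothetical helper: verify that both tiktoken files exist and are non-empty.
# Argument 1 (optional): directory to check; defaults to the cache path used above.
check_encodings() {
  local dir="${1:-$HOME/.cache/tiktoken}"
  local name
  local missing=0
  for name in cl100k_base o200k_base; do
    # -s is true only if the file exists and has a size greater than zero
    if [ ! -s "$dir/$name.tiktoken" ]; then
      echo "missing or empty: $dir/$name.tiktoken" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (run before Step 2):
#   check_encodings && echo "encodings ready"
```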
Step 2: Serve
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
-v $HOME/.cache/tiktoken:/etc/encodings \
-e TIKTOKEN_ENCODINGS_BASE=/etc/encodings \
ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
vllm serve openai/gpt-oss-20b --gpu-memory-utilization 0.8
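Loading the model can take a while after the container starts, so a script that calls the API may want to wait until the server responds. The sketch below is one way to do that and is not part of the official instructions: it polls the OpenAI-compatible `/v1/models` endpoint, then pulls the reply text out of a chat completions response. The helper names `wait_for_server` and `extract_reply` are hypothetical, the 600-second default is an arbitrary choice, and the sketch assumes `curl` and `python3` are installed on the machine issuing the request.

```shell
# Hypothetical helper: poll a URL until it answers with a success status.
# Argument 1: URL to probe. Argument 2 (optional): approximate timeout in seconds.
wait_for_server() {
  local url="$1"
  local timeout="${2:-600}"   # assumed default, adjust to your setup
  local elapsed=0
  until curl -sf -o /dev/null --max-time 5 "$url"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "server not ready after ${timeout}s: $url" >&2
      return 1
    fi
    sleep 2
    # Counts only the sleep time, so the real wait can be somewhat longer
    elapsed=$((elapsed + 2))
  done
}

# Hypothetical helper: read a chat completions JSON response on stdin
# and print the text of the first choice.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage, with JETSON_HOST set as in the client command:
#   wait_for_server "http://${JETSON_HOST}:8000/v1/models" &&
#   curl -s "http://${JETSON_HOST}:8000/v1/chat/completions" \
#     -H "Content-Type: application/json" \
#     -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Hello!"}]}' \
#     | extract_reply
```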
GPT OSS Family
| Model | Parameters | Memory | Minimum Jetson |
|---|---|---|---|
| GPT OSS 20B | 20B | 16 GB RAM | AGX Orin |
| GPT OSS 120B | 120B | 64 GB RAM | Thor |