- **Memory Requirement:** 16GB RAM
- **Precision:** W4A16
- **Size:** 12GB
## Model Details
GPT OSS 20B is OpenAI's open-weight, 20-billion-parameter language model. It requires the tiktoken encoding files to be downloaded before serving (Step 1 below).
## Running with vLLM
### Step 1: Download Tiktoken Encodings
```bash
mkdir -p $HOME/.cache/tiktoken
wget -q https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken \
  -O $HOME/.cache/tiktoken/cl100k_base.tiktoken
wget -q https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken \
  -O $HOME/.cache/tiktoken/o200k_base.tiktoken
```
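Before starting the container, it is worth confirming that both encoding files actually downloaded (with `wget -q`, a failed download is silent):

```bash
# Both files should be present and non-empty
ls -lh $HOME/.cache/tiktoken/cl100k_base.tiktoken \
       $HOME/.cache/tiktoken/o200k_base.tiktoken
```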
### Step 2: Serve
```bash
sudo docker run -it --rm --pull always --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  -v $HOME/.cache/tiktoken:/etc/encodings \
  -e TIKTOKEN_ENCODINGS_BASE=/etc/encodings \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve openai/gpt-oss-20b --gpu-memory-utilization 0.8
```
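Because the container runs with `--network host`, the server is reachable directly on the host. vLLM exposes an OpenAI-compatible API, by default on port 8000; a minimal smoke test with curl (assuming the default port):

```bash
# Send one chat completion request to the local vLLM server
# (vLLM listens on port 8000 by default; adjust if you pass --port to vllm serve)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "Hello from Jetson!"}],
        "max_tokens": 64
      }'
```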
## GPT OSS Family
| Model | Parameters | Memory | Minimum Jetson |
|---|---|---|---|
| GPT OSS 20B | 20B | 16GB RAM | AGX Orin |
| GPT OSS 120B | 120B | 64GB RAM | Thor |
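Jetson boards use unified memory shared by the CPU and GPU, so total system RAM is what counts against the requirements above. A quick way to check what your board has (a simple sketch; `tegrastats` ships with JetPack/L4T):

```bash
# Total unified memory (CPU and GPU share this pool on Jetson)
free -h

# Live memory and GPU utilization on Jetson boards
sudo tegrastats
```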