TensorRT Edge-LLM on Jetson

Use NVIDIA TensorRT Edge-LLM with two example models: Cosmos Reason2 8B (VLM) on Jetson Thor and Qwen3-4B-Instruct (LLM) on Jetson Orin Nano. Covers quantization, ONNX export, TensorRT engine builds, and pure C++ on-device inference.

TensorRT Edge-LLM is NVIDIA’s high-performance C++ inference runtime for LLMs and VLMs on embedded platforms. The workflow compiles trained models into optimized TensorRT engines; at run time, a small native binary loads those engines and serves requests with no Python interpreter in the inference path. Quantization (INT4, NVFP4, FP8) reduces weight footprint so larger models remain practical on memory-constrained devices. The SDK supports a wide range of models; see the full Supported Models list.

Overview

Edge-LLM supports a wide range of LLMs and VLMs across the entire Jetson family, from Orin Nano to Thor. See the full Supported Models list. In this tutorial, we walk through two examples that showcase the spectrum:

| Model | Type | Parameters | Quantization | Target Device |
| --- | --- | --- | --- | --- |
| Cosmos-Reason2-8B | VLM | 8B | NVFP4 | Jetson Thor |
| Qwen3-4B-Instruct | LLM | 4B | INT4 AWQ | Jetson Orin Nano 8 GB |

The workflow:

  1. Step 1: Export models (Python, x86 or Thor). Quantize and convert HuggingFace models to portable ONNX files. Transfer them to your target Jetson(s).
  2. Step 2: Build the C++ runtime (each Jetson). Clone the repo, compile the C++ engine builder and inference binary. TensorRT engines are hardware-specific and must be built on the device that will run them.
  3. Step 3: Cosmos Reason2 8B on Thor (NVFP4). Build engines and run VLM inference on Jetson Thor.
  4. Step 4: Qwen3-4B-Instruct on Orin Nano (INT4 AWQ). Build engines and run LLM inference on Jetson Orin Nano 8 GB.

Prerequisites

x86 Host / Jetson Thor (for Step 1: Model Export)

| Requirement | Details |
| --- | --- |
| OS | Ubuntu 22.04 or 24.04 |
| GPU | NVIDIA GPU with Compute Capability 8.0+ (Ampere or newer) |
| GPU VRAM | 24 GB+ recommended (48 GB+ for FP8 export of 8B models) |
| CUDA | 12.x or 13.x |
| Python | 3.10+ |
| Docker | Optional but recommended |

Jetson Target Device (for Step 2: Build and Inference)

| Requirement | Jetson Orin (AGX Orin / Orin NX / Orin Nano) | Thor |
| --- | --- | --- |
| JetPack | 6.2.x | 7.1 |
| CUDA | 12.6 (included) | 13.x (included) |
| TensorRT | 10.x+ (included) | 10.x+ (included) |
| Storage | 20–50 GB free (ONNX + engines) | 20–50 GB free |

Quantization and Platform Compatibility

| Precision | Memory savings (vs FP16) | Jetson Orin CC 8.7 (sm_87) | Jetson Thor (sm_110) |
| --- | --- | --- | --- |
| FP16 | Baseline | Supported | Supported |
| FP8 | 2x reduction | Not available | Supported |
| INT4 AWQ | 4x reduction | Supported | Supported |
| NVFP4 | 4x reduction | Not available | Supported |
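The savings column can be sanity-checked with back-of-envelope arithmetic (weights only; activations and KV cache add to this):

```shell
# Approximate weight footprint = bytes-per-parameter x parameter count.
# FP16 = 2 bytes, FP8 = 1 byte, INT4/NVFP4 ~ 0.5 byte (ignoring scale metadata).
params_b=8   # model size in billions of parameters (e.g. an 8B model)
echo "FP16: $((params_b * 2)) GB  FP8: $((params_b * 1)) GB  INT4/NVFP4: ~$((params_b / 2)) GB"
```

For the 8B model this gives roughly 16 GB at FP16 versus ~4 GB at NVFP4 or INT4, which is why the quantized variants fit on memory-constrained Jetson devices.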

Step 1: Export Models (x86 Host or Jetson Thor)

This step converts HuggingFace models to quantized ONNX files. It requires significant GPU memory and Python, so it runs on either an x86 workstation or Jetson Thor, not on Orin devices.

  • x86 workstation: use this if you have a Linux PC or cloud GPU. After export, transfer the ONNX files to your Jetson.
  • Jetson Thor: run the export directly on Thor using the NVIDIA PyTorch container. No separate PC needed.

1.1 Set Up the Environment

On Thor, use the NVIDIA PyTorch container which ships with PyTorch, CUDA, TensorRT, and ModelOpt pre-installed for Jetson’s aarch64/SBSA architecture.

docker pull nvcr.io/nvidia/pytorch:25.12-py3

docker run -it --runtime nvidia \
    --name edgellm-export \
    -v $(pwd):/workspace \
    -w /workspace \
    nvcr.io/nvidia/pytorch:25.12-py3 \
    bash

Inside the container, clone the repository and install. The --system-site-packages flag lets the venv inherit the container’s NVIDIA-built PyTorch. We install Edge-LLM with --no-deps to prevent pip from replacing torch, then install the remaining dependencies separately while filtering out the torch lines.

git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive

python3 -m venv --system-site-packages venv
source venv/bin/activate

pip3 install --no-deps .
sed '/^torch/d' requirements.txt > /tmp/reqs.txt
pip3 install -r /tmp/reqs.txt

If you are tempted to run a normal pip install instead, see Troubleshooting (first item: why plain pip breaks on Jetson).

Set the workspace directory to /workspace/ so exported files land directly on the host via the volume mount:

export WORKSPACE_DIR=/workspace/tensorrt-edgellm-workspace

Use /workspace, not $HOME

Inside the container $HOME is /root, which is not on the mounted volume. Always use /workspace/… so your ONNX files appear on the host automatically. When you exit the container, the workspace will be at ~/tensorrt-edgellm-workspace (or wherever you ran docker run from).

On an x86 host with Docker, pull and launch the NVIDIA PyTorch container:

docker pull nvcr.io/nvidia/pytorch:25.12-py3

docker run --gpus all -it \
    --name edgellm-export \
    -v $(pwd):/workspace \
    -w /workspace \
    nvcr.io/nvidia/pytorch:25.12-py3 \
    bash

Inside the container, clone the repository and install:

git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive

python3 -m venv venv
source venv/bin/activate
pip3 install .

If you prefer not to use Docker, set up a virtual environment directly on your x86 host:

git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive

python3 -m venv venv
source venv/bin/activate
pip3 install .

1.2 Verify Installation

tensorrt-edgellm-export-llm --help
tensorrt-edgellm-quantize-llm --help

Both commands should print their usage information without errors.

1.3 Log in to HuggingFace

Cosmos-Reason2-8B is a gated model: you must accept the license and authenticate before downloading.

  1. Visit nvidia/Cosmos-Reason2-8B on HuggingFace and click “Agree and access repository”.
  2. Generate a token at HuggingFace Settings - Tokens (read access is sufficient).
  3. Log in from the terminal:
huggingface-cli login

Paste your token when prompted. This persists across sessions inside the container.
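For non-interactive setups (scripts, CI), huggingface_hub also honors the HF_TOKEN environment variable, so the interactive login can be skipped. The token value below is a placeholder, not a real token:

```shell
# Export a read-scoped token instead of running huggingface-cli login.
# hf_placeholder_token is NOT a real token; substitute your own.
export HF_TOKEN=hf_placeholder_token
env | grep -q '^HF_TOKEN=' && echo "HF_TOKEN is set"
```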

1.4 Export Cosmos-Reason2-8B (VLM)

Cosmos Reason2 is a vision-language model, so you need to export two components: the language model and the visual encoder.

# Thor Docker users: WORKSPACE_DIR was already set to /workspace/tensorrt-edgellm-workspace in Step 1.1
# x86 / venv users: set it now
export WORKSPACE_DIR=${WORKSPACE_DIR:-$HOME/tensorrt-edgellm-workspace}
export MODEL_NAME=Cosmos-Reason2-8B
mkdir -p $WORKSPACE_DIR && cd $WORKSPACE_DIR

Quantize the Language Model

Choose one of the following quantization formats. NVFP4 is the recommended precision for Thor (SM110+), offering 4x memory reduction with native hardware support.

tensorrt-edgellm-quantize-llm \
    --model_dir nvidia/Cosmos-Reason2-8B \
    --output_dir $MODEL_NAME/quantized \
    --quantization nvfp4

INT4 AWQ reduces the 8B language-model weights to roughly 4 GB. Use it when deploying on Orin devices (SM87) or Thor.

tensorrt-edgellm-quantize-llm \
    --model_dir nvidia/Cosmos-Reason2-8B \
    --output_dir $MODEL_NAME/quantized \
    --quantization int4_awq

FP8 requires SM89+ at build time. Use this if your target is Thor or an Ada Lovelace+ dev GPU.

tensorrt-edgellm-quantize-llm \
    --model_dir nvidia/Cosmos-Reason2-8B \
    --output_dir $MODEL_NAME/quantized \
    --quantization fp8

Export Language Model to ONNX

tensorrt-edgellm-export-llm \
    --model_dir $MODEL_NAME/quantized \
    --output_dir $MODEL_NAME/onnx/llm

Export Visual Encoder to ONNX

The visual encoder is exported from the original (unquantized) model directory:

tensorrt-edgellm-export-visual \
    --model_dir nvidia/Cosmos-Reason2-8B \
    --output_dir $MODEL_NAME/onnx/visual

1.5 Export Qwen3-4B-Instruct (LLM)

Qwen3-4B-Instruct is a text-only LLM. It supports INT4 AWQ, which brings the 4B model down to ~2 GB of weights, a comfortable fit for Orin Nano’s 8 GB unified memory. No HuggingFace login is needed (Apache 2.0 license).

# Thor Docker users: WORKSPACE_DIR was already set to /workspace/tensorrt-edgellm-workspace in Step 1.1
# x86 / venv users: set it now
export WORKSPACE_DIR=${WORKSPACE_DIR:-$HOME/tensorrt-edgellm-workspace}
export MODEL_NAME=Qwen3-4B-Instruct
mkdir -p $WORKSPACE_DIR && cd $WORKSPACE_DIR

Quantize and Export

INT4 AWQ reduces the 4B model to ~2 GB of weights, leaving plenty of headroom for KV cache on Orin Nano.

tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen3-4B-Instruct-2507 \
    --output_dir $MODEL_NAME/quantized \
    --quantization int4_awq

tensorrt-edgellm-export-llm \
    --model_dir $MODEL_NAME/quantized \
    --output_dir $MODEL_NAME/onnx

Alternatively, quantize with FP8 (requires SM89+ at build time; supported on Thor, not on Orin):

tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen3-4B-Instruct-2507 \
    --output_dir $MODEL_NAME/quantized \
    --quantization fp8

tensorrt-edgellm-export-llm \
    --model_dir $MODEL_NAME/quantized \
    --output_dir $MODEL_NAME/onnx

Or with NVFP4 (Thor only):

tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen3-4B-Instruct-2507 \
    --output_dir $MODEL_NAME/quantized \
    --quantization nvfp4

tensorrt-edgellm-export-llm \
    --model_dir $MODEL_NAME/quantized \
    --output_dir $MODEL_NAME/onnx

1.6 Transfer ONNX Files to Jetson

The ONNX files from Step 1 sit on whatever machine ran the export (your x86 workstation or Jetson Thor). Copy each model’s ONNX folder only onto the Jetson that will build engines for that model: Cosmos Reason2 8B to Thor, Qwen3-4B-Instruct to Orin Nano.

💡 Exported on Jetson Thor?

Both models’ ONNX files are already on your Thor host (they landed in tensorrt-edgellm-workspace/ via the Docker volume mount). Skip the Thor scp below — you only need to copy Qwen3-4B-Instruct ONNX to the Orin Nano.

Exported on x86? Use both scp blocks below. Exported on Thor? Use only the Orin Nano block for Qwen3.

To Jetson Thor (Cosmos-Reason2-8B ONNX):

scp -r Cosmos-Reason2-8B/onnx <user>@<thor-ip>:~/tensorrt-edgellm-workspace/Cosmos-Reason2-8B/

To Jetson Orin Nano (Qwen3-4B-Instruct ONNX):

scp -r Qwen3-4B-Instruct/onnx <user>@<orin-nano-ip>:~/tensorrt-edgellm-workspace/Qwen3-4B-Instruct/

Create the target directories first if they do not exist:

ssh <user>@<thor-ip> "mkdir -p ~/tensorrt-edgellm-workspace/Cosmos-Reason2-8B"
ssh <user>@<orin-nano-ip> "mkdir -p ~/tensorrt-edgellm-workspace/Qwen3-4B-Instruct"

If you followed this tutorial’s Docker instructions (which set WORKSPACE_DIR under /workspace/), the ONNX files are already on the host and no extra copy step is needed.

Step 2: Build the C++ Runtime on Your Jetson

Everything from here forward runs on the target Jetson device and is pure C++; no Python is needed. Run steps 2.1–2.5 on your target device, then follow the section for your device (Step 3 for Thor, Step 4 for Orin Nano).

Why must this run on the target device?

TensorRT compiles ONNX graphs into engine binaries that are optimized for the exact GPU they run on: kernel selection, memory layout, and fused operations are all hardware-specific. An engine built on Thor (SM110) will not load on Orin Nano (SM87), and vice versa. Unlike the ONNX files from Step 1 (which are portable), engines must be built on the same device that will execute them.

💡 Thor users who exported on-device

Exit the Docker container (exit). Because WORKSPACE_DIR was set to /workspace/tensorrt-edgellm-workspace, the ONNX files are already on the host in the directory where you ran docker run. Fix root-owned permissions, then proceed:

sudo chown -R $(whoami):$(whoami) tensorrt-edgellm-workspace

2.1 Install Build Dependencies

On Jetson Thor (JetPack 7, CUDA 13):

sudo apt update
sudo apt install -y cmake build-essential git \
    cuda-toolkit-13-0 \
    libnvinfer-headers-dev libnvinfer-dev libnvonnxparsers-dev
export PATH=/usr/local/cuda/bin:$PATH

On Jetson Orin (JetPack 6, CUDA 12.6):

sudo apt update
sudo apt install -y cmake build-essential git \
    cuda-toolkit-12-6 \
    libnvinfer-headers-dev libnvinfer-dev libnvonnxparsers-dev
export PATH=/usr/local/cuda/bin:$PATH

Verify nvcc is available:

nvcc --version

Do not install Ubuntu nvidia-cuda-toolkit

Use the cuda-toolkit-* package from NVIDIA’s repo (as in the commands above), not the Ubuntu nvidia-cuda-toolkit package, which conflicts with JetPack CUDA libraries.

2.2 Clone the Repository

cd ~
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive

2.3 Configure and Build

If you previously ran Docker with a volume mount into this repo, fix file ownership first:

sudo chown -R $(whoami):$(whoami) ~/TensorRT-Edge-LLM

On Jetson Thor:

cd ~/TensorRT-Edge-LLM
rm -rf build
mkdir build && cd build

cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DTRT_PACKAGE_DIR=/usr \
    -DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
    -DEMBEDDED_TARGET=jetson-thor

make -j$(nproc)

On Jetson Orin:

cd ~/TensorRT-Edge-LLM
rm -rf build
mkdir build && cd build

cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DTRT_PACKAGE_DIR=/usr \
    -DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
    -DEMBEDDED_TARGET=jetson-orin

make -j$(nproc)

2.4 Verify the Build

cd ~/TensorRT-Edge-LLM
./build/examples/llm/llm_build --help
./build/examples/llm/llm_inference --help

2.5 Set Up Environment Variables

The EDGELLM_PLUGIN_PATH variable tells the runtime where to find the Edge-LLM custom TensorRT plugins (AttentionPlugin, Int4GemmPlugin, etc.):

cd ~/TensorRT-Edge-LLM
export EDGELLM_PLUGIN_PATH=$(pwd)/build/libNvInfer_edgellm_plugin.so
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
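These exports only last for the current shell. To persist them across reboots and new terminals, append them to ~/.bashrc; the paths below assume the default clone and workspace locations used in this tutorial:

```shell
# Persist the plugin path and workspace dir for future shells.
# Paths assume the tutorial's defaults (~/TensorRT-Edge-LLM, ~/tensorrt-edgellm-workspace).
cat >> ~/.bashrc << 'EOF'
export EDGELLM_PLUGIN_PATH=$HOME/TensorRT-Edge-LLM/build/libNvInfer_edgellm_plugin.so
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
EOF
```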

Choose Your Deployment Path

After completing Step 2, follow the section that matches your device. Steps 3 and 4 below are worked examples, but the same workflow applies to any Jetson (AGX Orin, Orin NX, etc.) with any supported model, as long as the model fits in memory and you use a quantization format your GPU supports (see the precision table above).

Step 3: Cosmos Reason2 8B on Jetson Thor (NVFP4)

🟢 Jetson Thor: 8B VLM with NVFP4 quantization

Cosmos Reason2 8B is an 8B vision-language model (LLM + visual encoder). NVFP4 is a Thor-exclusive precision (SM110+) that reduces weights to ~4 GB. This section runs entirely on Jetson Thor. If you only have an Orin Nano, skip to Step 4.

3.1 Build the Language Model Engine

export MODEL_NAME=Cosmos-Reason2-8B

./build/examples/llm/llm_build \
    --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/llm \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engine/llm \
    --maxBatchSize 1 \
    --maxInputLen 1024 \
    --maxKVCacheCapacity 4096
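The --maxKVCacheCapacity limit bounds KV-cache memory roughly as 2 (K and V) x layers x KV heads x head dim x capacity x bytes per element. The dimensions below are illustrative placeholders for the arithmetic, not the actual Cosmos-Reason2-8B configuration:

```shell
# Back-of-envelope KV-cache size. These dims are example values only,
# not this model's real layer/head configuration.
layers=36; kv_heads=8; head_dim=128; capacity=4096; bytes=2
echo "$((2 * layers * kv_heads * head_dim * capacity * bytes / 1024 / 1024)) MiB"
```

Larger --maxKVCacheCapacity values grow this linearly, which is why the Orin Nano example later in this tutorial uses a smaller capacity.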

3.2 Build the Visual Encoder Engine

./build/examples/multimodal/visual_build \
    --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/visual \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engine

The visual engine is saved to $WORKSPACE_DIR/$MODEL_NAME/engine/visual/.

3.3 Create an Input File

Save the following as $WORKSPACE_DIR/input_vlm.json. Use an absolute path for the image:

cat > $WORKSPACE_DIR/input_vlm.json << 'EOF'
{
    "batch_size": 1,
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 50,
    "max_generate_length": 128,
    "requests": [
        {
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "image": "IMAGE_PATH_PLACEHOLDER"
                        },
                        {
                            "type": "text",
                            "text": "Describe what you see in this image."
                        }
                    ]
                }
            ]
        }
    ]
}
EOF

Then replace the image path placeholder with a real image (the repo ships sample images):

sed -i "s|IMAGE_PATH_PLACEHOLDER|$(pwd)/examples/multimodal/pics/red_panda.jpeg|" \
    $WORKSPACE_DIR/input_vlm.json
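After the sed substitution it is worth confirming the file is still valid JSON (a stray quote in the image path breaks parsing). python3 -m json.tool exits non-zero on malformed input; shown here against a minimal stand-in file, so substitute your real input_vlm.json path:

```shell
# Validate a request file before running inference.
cat > /tmp/input_check.json << 'EOF'
{"batch_size": 1, "requests": []}
EOF
python3 -m json.tool /tmp/input_check.json > /dev/null && echo "valid JSON"
```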

💡 Sample images

The repo ships test images at ~/TensorRT-Edge-LLM/examples/multimodal/pics/ including red_panda.jpeg, giant_panda.jpeg, woman_and_dog.jpeg, and database_er.jpeg.

3.4 Run Inference

./build/examples/llm/llm_inference \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engine/llm \
    --multimodalEngineDir $WORKSPACE_DIR/$MODEL_NAME/engine \
    --inputFile $WORKSPACE_DIR/input_vlm.json \
    --outputFile $WORKSPACE_DIR/output_vlm.json \
    --dumpOutput

3.5 Verify Output

cat $WORKSPACE_DIR/output_vlm.json

You should see a JSON response with the model’s description of the image. Example output:

“A red panda rests its head on a wooden surface, its fur a rich reddish-brown with white accents on its ears and face, while its dark eyes and black nose stand out against the soft, fluffy texture of its coat.”

Step 4: Qwen3-4B-Instruct on Jetson Orin Nano 8 GB (INT4 AWQ)

🟠 Jetson Orin Nano 8 GB: 4B LLM with INT4 AWQ quantization

INT4 AWQ reduces Qwen3-4B-Instruct to ~2 GB of weights, leaving ample room for the KV cache and OS within Orin Nano’s 8 GB unified memory. This section runs entirely on Jetson Orin Nano. Ensure you completed Step 2 on your Orin Nano first.

4.1 Build the Engine

The memory-optimized parameters below are tuned for Orin Nano 8 GB. If you hit CUDA out of memory during the build, reduce the limits further (e.g. --maxInputLen 256 --maxKVCacheCapacity 512) and free system memory first:

sync && sudo sysctl -w vm.drop_caches=3

export MODEL_NAME=Qwen3-4B-Instruct

./build/examples/llm/llm_build \
    --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engine \
    --maxBatchSize 1 \
    --maxInputLen 512 \
    --maxKVCacheCapacity 1024

4.2 Create an Input File

cat > $WORKSPACE_DIR/input_qwen.json << 'EOF'
{
    "batch_size": 1,
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 50,
    "max_generate_length": 512,
    "requests": [
        {
            "messages": [
                {
                    "role": "user",
                    "content": "What are the benefits of running AI models on edge devices like NVIDIA Jetson?"
                }
            ]
        }
    ]
}
EOF

4.3 Run Inference

./build/examples/llm/llm_inference \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engine \
    --inputFile $WORKSPACE_DIR/input_qwen.json \
    --outputFile $WORKSPACE_DIR/output_qwen.json \
    --dumpOutput

4.4 Verify Output

cat $WORKSPACE_DIR/output_qwen.json

Example output from Qwen3-4B-Instruct INT4 on Orin Nano 8 GB:

Running AI models on edge devices like NVIDIA Jetson offers several key benefits, making them ideal for real-time, decentralized, and privacy-sensitive applications. The main advantages include:

1. Low Latency and Real-Time Processing: Edge devices like NVIDIA Jetson process data locally, eliminating the need to send data to the cloud. This results in near-instant inference, which is critical for time-sensitive applications such as autonomous vehicles, industrial automation, and robotics.

2. Improved Privacy and Data Security: Sensitive data (e.g., video, audio, or images) is processed on the device itself, reducing the risk of data exposure, breaches, or unauthorized access.

3. Reduced Bandwidth Usage: Since raw data doesn’t need to be transmitted to a central server, bandwidth consumption is significantly reduced. This is cost-effective and beneficial in remote or low-connectivity areas.

4. Reliability and Resilience: Edge AI enables continuous operation even during network outages or connectivity issues. Devices can function autonomously, ensuring uninterrupted service in critical applications like smart cities or remote monitoring.

5. Compliance with Regulatory Requirements: Processing data locally helps organizations meet data sovereignty and privacy regulations.

Integrating Edge-LLM in Your C++ Application

The llm_inference binary used above is a reference application. For production use (robotics, camera apps, industrial inspection, kiosks), you integrate Edge-LLM directly via the C++ API. The API surface is three calls: create a runtime, capture CUDA graphs, then call handleRequest() per query. See the C++ runtime headers and example application on GitHub.

Troubleshooting

Why not use plain pip install on Jetson (Thor container)?

The generic torch-2.10.0 wheel from PyPI does not work on Jetson; it can raise AttributeError: module 'torch._C' has no attribute '_dlpack_exchange_api'. The NVIDIA PyTorch container includes a Jetson-built torch (for example torch 2.10.0a0+…nv25.12). The setup in this tutorial therefore uses --system-site-packages on the venv so that build is visible, pip3 install --no-deps . so pip does not overwrite torch, and a filtered requirements.txt (with the torch lines removed) to pull in the remaining packages (transformers, datasets, onnx, etc.) without replacing torch or torchvision.

Export fails with out-of-memory on x86 host

FP8 ONNX export can require up to 6x the model size in GPU VRAM and 20x in CPU RAM for 8B models. Use INT4 AWQ quantization instead, which is less memory-intensive, or add --shm-size=16g to the docker run command.

Slow build or make crashes on Orin Nano

Orin Nano has limited RAM. Reduce parallelism: make -j4 instead of make -j$(nproc), or run make without the -j flag for a sequential build.
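When tuning the -j value, it helps to watch memory from a second terminal while the build runs. tegrastats (JetPack-specific) shows GPU detail; free, shown below, works anywhere:

```shell
# Snapshot of system memory; run under 'watch -n 1' during the build to see
# whether make is approaching the 8 GB limit.
free -h
```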

References