TensorRT Edge-LLM on Jetson

Use NVIDIA TensorRT Edge-LLM with two example models: Cosmos Reason2 8B (VLM) on Jetson Thor and Qwen3-4B-Instruct (LLM) on Jetson Orin Nano. Covers quantization, ONNX export, TensorRT engine builds, and pure C++ on-device inference.

TensorRT Edge-LLM is NVIDIA’s high-performance C++ inference runtime for LLMs and VLMs on embedded platforms. The workflow compiles trained models into optimized TensorRT engines; at run time, a small native binary loads those engines and serves requests with no Python interpreter in the inference path. Quantization (INT4, NVFP4, FP8) reduces weight footprint so larger models remain practical on memory-constrained devices. The SDK supports a wide range of models; see the full Supported Models list.

Overview

Edge-LLM supports a wide range of LLMs and VLMs across the entire Jetson family, from Orin Nano to Thor. See the full Supported Models list. In this tutorial, we walk through two examples that showcase the spectrum:

| Model | Type | Parameters | Quantization | Target Device |
| --- | --- | --- | --- | --- |
| Cosmos-Reason2-8B | VLM | 8B | NVFP4 | Jetson Thor |
| Qwen3-4B-Instruct | LLM | 4B | INT4 AWQ | Jetson Orin Nano 8 GB |

The workflow:

  1. Step 1: Export models (Python, x86 or Thor). Quantize and convert HuggingFace models to portable ONNX files. Transfer them to your target Jetson(s).
  2. Step 2: Build the C++ runtime (each Jetson). Clone the repo, compile the C++ engine builder and inference binary. TensorRT engines are hardware-specific and must be built on the device that will run them.
  3. Step 3: Cosmos Reason2 8B on Thor (NVFP4). Build engines and run VLM inference on Jetson Thor.
  4. Step 4: Qwen3-4B-Instruct on Orin Nano (INT4 AWQ). Build engines and run LLM inference on Jetson Orin Nano 8 GB.

Prerequisites

x86 Host / Jetson Thor (for Step 1: Model Export)

| Requirement | Details |
| --- | --- |
| OS | Ubuntu 22.04 or 24.04 |
| GPU | NVIDIA GPU with Compute Capability 8.0+ (Ampere or newer) |
| GPU VRAM | 24 GB+ recommended (48 GB+ for FP8 export of 8B models) |
| CUDA | 12.x or 13.x |
| Python | 3.10+ |
| Docker | Optional but recommended |

Jetson Target Device (for Step 2: Build and Inference)

| Requirement | Jetson Orin (AGX Orin / Orin NX / Orin Nano) | Thor |
| --- | --- | --- |
| JetPack | 6.2.x | 7.1 |
| CUDA | 12.6 (included) | 13.x (included) |
| TensorRT | 10.x+ (included) | 10.x+ (included) |
| Storage | 20–50 GB free (ONNX + engines) | 20–50 GB free |

Quantization and Platform Compatibility

| Precision | Memory savings (vs FP16) | Jetson Orin CC 8.7 (sm_87) | Jetson Thor (sm_110) |
| --- | --- | --- | --- |
| FP16 | Baseline | Supported | Supported |
| FP8 | 2x reduction | Not available | Supported |
| INT4 AWQ | 4x reduction | Supported | Supported |
| NVFP4 | 4x reduction | Not available | Supported |
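The savings column can be sanity-checked with back-of-envelope arithmetic (weights only; activations and KV cache add to this):

```shell
# Approximate weight footprint = bytes-per-parameter x parameter count.
# FP16 = 2 bytes, FP8 = 1 byte, INT4/NVFP4 ~ 0.5 byte (ignoring scale metadata).
params_b=8   # model size in billions of parameters (e.g. an 8B model)
echo "FP16: $((params_b * 2)) GB  FP8: $((params_b * 1)) GB  INT4/NVFP4: ~$((params_b / 2)) GB"
```

For the 8B model this gives roughly 16 GB at FP16 versus ~4 GB at NVFP4 or INT4, which is why the quantized variants fit on memory-constrained Jetson devices.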

Step 1: Export Models (x86 Host or Jetson Thor)

This step converts HuggingFace models to quantized ONNX files. It requires significant GPU memory and Python, so it runs on either an x86 workstation or Jetson Thor, not on Orin devices.

  • x86 workstation: use this if you have a Linux PC or cloud GPU. After export, transfer the ONNX files to your Jetson.
  • Jetson Thor: run the export directly on Thor using the NVIDIA PyTorch container. No separate PC needed.

1.1 Set Up the Environment

On Thor, use the NVIDIA PyTorch container which ships with PyTorch, CUDA, TensorRT, and ModelOpt pre-installed for Jetson’s aarch64/SBSA architecture.

docker pull nvcr.io/nvidia/pytorch:25.12-py3

docker run -it --runtime nvidia \
    --name edgellm-export \
    -v $(pwd):/workspace \
    -w /workspace \
    nvcr.io/nvidia/pytorch:25.12-py3 \
    bash

Inside the container, clone the repository and install. The --system-site-packages flag lets the venv inherit the container’s NVIDIA-built PyTorch. We install Edge-LLM with --no-deps to prevent pip from replacing torch, then install the remaining dependencies separately while filtering out the torch lines.

git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive

python3 -m venv --system-site-packages venv
source venv/bin/activate

pip3 install --no-deps .
sed '/^torch/d' requirements.txt > /tmp/reqs.txt
pip3 install -r /tmp/reqs.txt

If you are tempted to run a normal pip install instead, see Troubleshooting (first item: why plain pip breaks on Jetson).

Set the workspace directory to /workspace/ so exported files land directly on the host via the volume mount:

export WORKSPACE_DIR=/workspace/tensorrt-edgellm-workspace

Use /workspace, not $HOME

Inside the container $HOME is /root, which is not on the mounted volume. Always use /workspace/… so your ONNX files appear on the host automatically. When you exit the container, the workspace will be at ~/tensorrt-edgellm-workspace (or wherever you ran docker run from).

On an x86 host with Docker, pull and launch the NVIDIA PyTorch container:

docker pull nvcr.io/nvidia/pytorch:25.12-py3

docker run --gpus all -it \
    --name edgellm-export \
    -v $(pwd):/workspace \
    -w /workspace \
    nvcr.io/nvidia/pytorch:25.12-py3 \
    bash

Inside the container, clone the repository and install:

git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive

python3 -m venv venv
source venv/bin/activate
pip3 install .

If you prefer not to use Docker, set up a virtual environment directly on your x86 host:

git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive

python3 -m venv venv
source venv/bin/activate
pip3 install .

1.2 Verify Installation

tensorrt-edgellm-export-llm --help
tensorrt-edgellm-quantize-llm --help

Both commands should print their usage information without errors.

1.3 Log in to HuggingFace

Cosmos-Reason2-8B is a gated model: you must accept the license and authenticate before downloading.

  1. Visit nvidia/Cosmos-Reason2-8B on HuggingFace and click “Agree and access repository”.
  2. Generate a token at HuggingFace Settings - Tokens (read access is sufficient).
  3. Log in from the terminal:
huggingface-cli login

Paste your token when prompted. This persists across sessions inside the container.
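For non-interactive setups (scripts, CI), huggingface_hub also honors the HF_TOKEN environment variable, so the interactive login can be skipped. The token value below is a placeholder, not a real token:

```shell
# Export a read-scoped token instead of running huggingface-cli login.
# hf_placeholder_token is NOT a real token; substitute your own.
export HF_TOKEN=hf_placeholder_token
env | grep -q '^HF_TOKEN=' && echo "HF_TOKEN is set"
```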

1.4 Export Cosmos-Reason2-8B (VLM)

Cosmos Reason2 is a vision-language model, so you need to export two components: the language model and the visual encoder.

# Thor Docker users: WORKSPACE_DIR was already set to /workspace/tensorrt-edgellm-workspace in Step 1.1
# x86 / venv users: set it now
export WORKSPACE_DIR=${WORKSPACE_DIR:-$HOME/tensorrt-edgellm-workspace}
export MODEL_NAME=Cosmos-Reason2-8B
mkdir -p $WORKSPACE_DIR && cd $WORKSPACE_DIR

Quantize the Language Model

Choose one of the following quantization formats. NVFP4 is the recommended precision for Thor (SM110+), offering 4x memory reduction with native hardware support.

tensorrt-edgellm-quantize-llm \
    --model_dir nvidia/Cosmos-Reason2-8B \
    --output_dir $MODEL_NAME/quantized \
    --quantization nvfp4

INT4 AWQ reduces the 8B language-model weights to roughly 4 GB. Use it when deploying on Orin devices (SM87) or Thor.

tensorrt-edgellm-quantize-llm \
    --model_dir nvidia/Cosmos-Reason2-8B \
    --output_dir $MODEL_NAME/quantized \
    --quantization int4_awq

FP8 requires SM89+ at build time. Use this if your target is Thor or an Ada Lovelace+ dev GPU.

tensorrt-edgellm-quantize-llm \
    --model_dir nvidia/Cosmos-Reason2-8B \
    --output_dir $MODEL_NAME/quantized \
    --quantization fp8

Export Language Model to ONNX

tensorrt-edgellm-export-llm \
    --model_dir $MODEL_NAME/quantized \
    --output_dir $MODEL_NAME/onnx/llm

Export Visual Encoder to ONNX

The visual encoder is exported from the original (unquantized) model directory:

tensorrt-edgellm-export-visual \
    --model_dir nvidia/Cosmos-Reason2-8B \
    --output_dir $MODEL_NAME/onnx/visual

1.5 Export Qwen3-4B-Instruct (LLM)

Qwen3-4B-Instruct is a text-only LLM. It supports INT4 AWQ, which brings the 4B model down to ~2 GB of weights, a comfortable fit for Orin Nano’s 8 GB unified memory. No HuggingFace login is needed (Apache 2.0 license).

# Thor Docker users: WORKSPACE_DIR was already set to /workspace/tensorrt-edgellm-workspace in Step 1.1
# x86 / venv users: set it now
export WORKSPACE_DIR=${WORKSPACE_DIR:-$HOME/tensorrt-edgellm-workspace}
export MODEL_NAME=Qwen3-4B-Instruct
mkdir -p $WORKSPACE_DIR && cd $WORKSPACE_DIR

Quantize and Export

INT4 AWQ reduces the 4B model to ~2 GB of weights, leaving plenty of headroom for KV cache on Orin Nano.

tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen3-4B-Instruct-2507 \
    --output_dir $MODEL_NAME/quantized \
    --quantization int4_awq

tensorrt-edgellm-export-llm \
    --model_dir $MODEL_NAME/quantized \
    --output_dir $MODEL_NAME/onnx

Alternatively, quantize with FP8 (requires SM89+ at build time; supported on Thor, not on Orin):

tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen3-4B-Instruct-2507 \
    --output_dir $MODEL_NAME/quantized \
    --quantization fp8

tensorrt-edgellm-export-llm \
    --model_dir $MODEL_NAME/quantized \
    --output_dir $MODEL_NAME/onnx

Or with NVFP4 (Thor only):

tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen3-4B-Instruct-2507 \
    --output_dir $MODEL_NAME/quantized \
    --quantization nvfp4

tensorrt-edgellm-export-llm \
    --model_dir $MODEL_NAME/quantized \
    --output_dir $MODEL_NAME/onnx

1.6 Transfer ONNX Files to Jetson

The ONNX files from Step 1 sit on whatever machine ran the export (your x86 workstation or Jetson Thor). Copy each model’s ONNX folder only onto the Jetson that will build engines for that model: Cosmos Reason2 8B to Thor, Qwen3-4B-Instruct to Orin Nano.

💡 Exported on Jetson Thor?

Both models’ ONNX files are already on your Thor host (they landed in tensorrt-edgellm-workspace/ via the Docker volume mount). Skip the Thor scp below — you only need to copy Qwen3-4B-Instruct ONNX to the Orin Nano.

Exported on x86? Use both scp blocks below. Exported on Thor? Use only the Orin Nano block for Qwen3.

To Jetson Thor (Cosmos-Reason2-8B ONNX):

scp -r Cosmos-Reason2-8B/onnx <user>@<thor-ip>:~/tensorrt-edgellm-workspace/Cosmos-Reason2-8B/

To Jetson Orin Nano (Qwen3-4B-Instruct ONNX):

scp -r Qwen3-4B-Instruct/onnx <user>@<orin-nano-ip>:~/tensorrt-edgellm-workspace/Qwen3-4B-Instruct/

Create the target directories first if they do not exist:

ssh <user>@<thor-ip> "mkdir -p ~/tensorrt-edgellm-workspace/Cosmos-Reason2-8B"
ssh <user>@<orin-nano-ip> "mkdir -p ~/tensorrt-edgellm-workspace/Qwen3-4B-Instruct"

If you followed this tutorial’s Docker instructions (which set WORKSPACE_DIR under /workspace/), the ONNX files are already on the host and no extra copy step is needed.

Step 2: Build the C++ Runtime on Your Jetson

Everything from here forward runs on the target Jetson device and is pure C++; no Python is needed. Run steps 2.1–2.5 on your target device, then follow the section for your device (Step 3 for Thor, Step 4 for Orin Nano).

Why must this run on the target device?

TensorRT compiles ONNX graphs into engine binaries that are optimized for the exact GPU they run on: kernel selection, memory layout, and fused operations are all hardware-specific. An engine built on Thor (SM110) will not load on Orin Nano (SM87), and vice versa. Unlike the ONNX files from Step 1 (which are portable), engines must be built on the same device that will execute them.

💡 Thor users who exported on-device

Exit the Docker container (exit). Because WORKSPACE_DIR was set to /workspace/tensorrt-edgellm-workspace, the ONNX files are already on the host in the directory where you ran docker run. Fix root-owned permissions, then proceed:

sudo chown -R $(whoami):$(whoami) tensorrt-edgellm-workspace

2.1 Install Build Dependencies

On Jetson Thor (JetPack 7, CUDA 13):

sudo apt update
sudo apt install -y cmake build-essential git \
    cuda-toolkit-13-0 \
    libnvinfer-headers-dev libnvinfer-dev libnvonnxparsers-dev
export PATH=/usr/local/cuda/bin:$PATH

On Jetson Orin (JetPack 6, CUDA 12.6):

sudo apt update
sudo apt install -y cmake build-essential git \
    cuda-toolkit-12-6 \
    libnvinfer-headers-dev libnvinfer-dev libnvonnxparsers-dev
export PATH=/usr/local/cuda/bin:$PATH

Verify nvcc is available:

nvcc --version

Do not install Ubuntu nvidia-cuda-toolkit

Use the cuda-toolkit-* package from NVIDIA’s repo (as in the commands above), not the Ubuntu nvidia-cuda-toolkit package, which conflicts with JetPack CUDA libraries.

2.2 Clone the Repository

cd ~
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive

2.3 Configure and Build

If you previously ran Docker with a volume mount into this repo, fix file ownership first:

sudo chown -R $(whoami):$(whoami) ~/TensorRT-Edge-LLM

On Jetson Thor:

cd ~/TensorRT-Edge-LLM
rm -rf build
mkdir build && cd build

cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DTRT_PACKAGE_DIR=/usr \
    -DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
    -DEMBEDDED_TARGET=jetson-thor

make -j$(nproc)

On Jetson Orin:

cd ~/TensorRT-Edge-LLM
rm -rf build
mkdir build && cd build

cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DTRT_PACKAGE_DIR=/usr \
    -DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
    -DEMBEDDED_TARGET=jetson-orin

make -j$(nproc)

2.4 Verify the Build

cd ~/TensorRT-Edge-LLM
./build/examples/llm/llm_build --help
./build/examples/llm/llm_inference --help

2.5 Set Up Environment Variables

The EDGELLM_PLUGIN_PATH variable tells the runtime where to find the Edge-LLM custom TensorRT plugins (AttentionPlugin, Int4GemmPlugin, etc.):

cd ~/TensorRT-Edge-LLM
export EDGELLM_PLUGIN_PATH=$(pwd)/build/libNvInfer_edgellm_plugin.so
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
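These exports only last for the current shell. To persist them across reboots and new terminals, append them to ~/.bashrc; the paths below assume the default clone and workspace locations used in this tutorial:

```shell
# Persist the plugin path and workspace dir for future shells.
# Paths assume the tutorial's defaults (~/TensorRT-Edge-LLM, ~/tensorrt-edgellm-workspace).
cat >> ~/.bashrc << 'EOF'
export EDGELLM_PLUGIN_PATH=$HOME/TensorRT-Edge-LLM/build/libNvInfer_edgellm_plugin.so
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
EOF
```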

Choose Your Deployment Path

After completing Step 2, follow the section that matches your device. Steps 3 and 4 below are worked examples, but the same workflow applies to any Jetson (AGX Orin, Orin NX, etc.) with any supported model, as long as the model fits in memory and you use a quantization format your GPU supports (see the precision table above).

Step 3: Cosmos Reason2 8B on Jetson Thor (NVFP4)

🟢 Jetson Thor: 8B VLM with NVFP4 quantization

Cosmos Reason2 8B is an 8B vision-language model (LLM + visual encoder). NVFP4 is a Thor-exclusive precision (SM110+) that reduces weights to ~4 GB. This section runs entirely on Jetson Thor. If you only have an Orin Nano, skip to Step 4.

3.1 Build the Language Model Engine

export MODEL_NAME=Cosmos-Reason2-8B

./build/examples/llm/llm_build \
    --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/llm \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engine/llm \
    --maxBatchSize 1 \
    --maxInputLen 1024 \
    --maxKVCacheCapacity 4096
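The --maxKVCacheCapacity limit bounds KV-cache memory roughly as 2 (K and V) x layers x KV heads x head dim x capacity x bytes per element. The dimensions below are illustrative placeholders for the arithmetic, not the actual Cosmos-Reason2-8B configuration:

```shell
# Back-of-envelope KV-cache size. These dims are example values only,
# not this model's real layer/head configuration.
layers=36; kv_heads=8; head_dim=128; capacity=4096; bytes=2
echo "$((2 * layers * kv_heads * head_dim * capacity * bytes / 1024 / 1024)) MiB"
```

Larger --maxKVCacheCapacity values grow this linearly, which is why the Orin Nano example later in this tutorial uses a smaller capacity.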

3.2 Build the Visual Encoder Engine

./build/examples/multimodal/visual_build \
    --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/visual \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engine

The visual engine is saved to $WORKSPACE_DIR/$MODEL_NAME/engine/visual/.

3.3 Create an Input File

Save the following as $WORKSPACE_DIR/input_vlm.json. Use an absolute path for the image:

cat > $WORKSPACE_DIR/input_vlm.json << 'EOF'
{
    "batch_size": 1,
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 50,
    "max_generate_length": 128,
    "requests": [
        {
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "image": "IMAGE_PATH_PLACEHOLDER"
                        },
                        {
                            "type": "text",
                            "text": "Describe what you see in this image."
                        }
                    ]
                }
            ]
        }
    ]
}
EOF

Then replace the image path placeholder with a real image (the repo ships sample images):

sed -i "s|IMAGE_PATH_PLACEHOLDER|$(pwd)/examples/multimodal/pics/red_panda.jpeg|" \
    $WORKSPACE_DIR/input_vlm.json
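After the sed substitution it is worth confirming the file is still valid JSON (a stray quote in the image path breaks parsing). python3 -m json.tool exits non-zero on malformed input; shown here against a minimal stand-in file, so substitute your real input_vlm.json path:

```shell
# Validate a request file before running inference.
cat > /tmp/input_check.json << 'EOF'
{"batch_size": 1, "requests": []}
EOF
python3 -m json.tool /tmp/input_check.json > /dev/null && echo "valid JSON"
```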

💡 Sample images

The repo ships test images at ~/TensorRT-Edge-LLM/examples/multimodal/pics/ including red_panda.jpeg, giant_panda.jpeg, woman_and_dog.jpeg, and database_er.jpeg.

3.4 Run Inference

./build/examples/llm/llm_inference \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engine/llm \
    --multimodalEngineDir $WORKSPACE_DIR/$MODEL_NAME/engine \
    --inputFile $WORKSPACE_DIR/input_vlm.json \
    --outputFile $WORKSPACE_DIR/output_vlm.json \
    --dumpOutput

3.5 Verify Output

cat $WORKSPACE_DIR/output_vlm.json

You should see a JSON response with the model’s description of the image. Example output:

“A red panda rests its head on a wooden surface, its fur a rich reddish-brown with white accents on its ears and face, while its dark eyes and black nose stand out against the soft, fluffy texture of its coat.”

Step 4: Qwen3-4B-Instruct on Jetson Orin Nano 8 GB (INT4 AWQ)

🟠 Jetson Orin Nano 8 GB: 4B LLM with INT4 AWQ quantization

INT4 AWQ reduces Qwen3-4B-Instruct to ~2 GB of weights, leaving ample room for the KV cache and OS within Orin Nano’s 8 GB unified memory. This section runs entirely on Jetson Orin Nano. Ensure you completed Step 2 on your Orin Nano first.

4.1 Build the Engine

The memory-optimized parameters below are tuned for Orin Nano 8 GB. If you hit CUDA out of memory during the build, reduce the limits further (e.g. --maxInputLen 256 --maxKVCacheCapacity 512) and free system memory first:

sync && sudo sysctl -w vm.drop_caches=3

export MODEL_NAME=Qwen3-4B-Instruct

./build/examples/llm/llm_build \
    --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engine \
    --maxBatchSize 1 \
    --maxInputLen 512 \
    --maxKVCacheCapacity 1024

4.2 Create an Input File

cat > $WORKSPACE_DIR/input_qwen.json << 'EOF'
{
    "batch_size": 1,
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 50,
    "max_generate_length": 512,
    "requests": [
        {
            "messages": [
                {
                    "role": "user",
                    "content": "What are the benefits of running AI models on edge devices like NVIDIA Jetson?"
                }
            ]
        }
    ]
}
EOF

4.3 Run Inference

./build/examples/llm/llm_inference \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engine \
    --inputFile $WORKSPACE_DIR/input_qwen.json \
    --outputFile $WORKSPACE_DIR/output_qwen.json \
    --dumpOutput

4.4 Verify Output

cat $WORKSPACE_DIR/output_qwen.json

Example output from Qwen3-4B-Instruct INT4 on Orin Nano 8 GB:

Running AI models on edge devices like NVIDIA Jetson offers several key benefits, making them ideal for real-time, decentralized, and privacy-sensitive applications. The main advantages include:

1. Low Latency and Real-Time Processing: Edge devices like NVIDIA Jetson process data locally, eliminating the need to send data to the cloud. This results in near-instant inference, which is critical for time-sensitive applications such as autonomous vehicles, industrial automation, and robotics.

2. Improved Privacy and Data Security: Sensitive data (e.g., video, audio, or images) is processed on the device itself, reducing the risk of data exposure, breaches, or unauthorized access.

3. Reduced Bandwidth Usage: Since raw data doesn’t need to be transmitted to a central server, bandwidth consumption is significantly reduced. This is cost-effective and beneficial in remote or low-connectivity areas.

4. Reliability and Resilience: Edge AI enables continuous operation even during network outages or connectivity issues. Devices can function autonomously, ensuring uninterrupted service in critical applications like smart cities or remote monitoring.

5. Compliance with Regulatory Requirements: Processing data locally helps organizations meet data sovereignty and privacy regulations.

Integrating Edge-LLM in Your C++ Application

The llm_inference binary used above is a reference application. For production use (robotics, camera apps, industrial inspection, kiosks), you integrate Edge-LLM directly via the C++ API. The API surface is three calls: create a runtime, capture CUDA graphs, then call handleRequest() per query. See the C++ runtime headers and example application on GitHub.

Troubleshooting

Why not use plain pip install on Jetson (Thor container)?

The generic torch-2.10.0 wheel from PyPI does not work on Jetson; it can raise AttributeError: module 'torch._C' has no attribute '_dlpack_exchange_api'. The NVIDIA PyTorch container includes a Jetson-built torch (for example torch 2.10.0a0+…nv25.12). The setup in this tutorial therefore uses --system-site-packages on the venv so that build is visible, pip3 install --no-deps . so pip does not overwrite torch, and a filtered requirements.txt (with the torch lines removed) to pull in the remaining packages (transformers, datasets, onnx, etc.) without replacing torch or torchvision.

Export fails with out-of-memory on x86 host

FP8 ONNX export can require up to 6x the model size in GPU VRAM and 20x in CPU RAM for 8B models. Use INT4 AWQ quantization instead, which is less memory-intensive, or add --shm-size=16g to the docker run command.

Slow build or make crashes on Orin Nano

Orin Nano has limited RAM. Reduce parallelism: make -j4 instead of make -j$(nproc), or run make without the -j flag for a sequential build.
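When tuning the -j value, it helps to watch memory from a second terminal while the build runs. tegrastats (JetPack-specific) shows GPU detail; free, shown below, works anywhere:

```shell
# Snapshot of system memory; run under 'watch -n 1' during the build to see
# whether make is approaching the 8 GB limit.
free -h
```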

References