Fine-tune LLMs on Jetson

Learn how to fine-tune large language models directly on Jetson using PyTorch and Hugging Face TRL. Covers Full SFT (4B), LoRA (9B), and QLoRA (27B).

Fine-tuning lets you customize a pre-trained LLM on your own data so it becomes better at a specific task, like following domain-specific instructions, answering questions in a particular style, or understanding specialized vocabulary.

Prerequisites

| Requirement | Details |
|---|---|
| Device | Jetson AGX Thor (128 GB) |
| Software | JetPack 7.x (R38) or later; Docker with NVIDIA runtime |
| Account | Hugging Face (free); Qwen models are openly available, no license acceptance required |

⚠️ Jetson Thor Unified Memory

Jetson Thor shares its 128 GB memory between CPU and GPU. The OS, desktop environment, and background processes typically consume 4–6 GB, leaving ~115 GB available for training. All default configurations in this tutorial are tuned to stay well within this limit. If you increase batch size, sequence length, or use a larger model, monitor memory with tegrastats or jtop to avoid hitting the system OOM killer.
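
Besides tegrastats and jtop, you can also check headroom from Python before launching a run. This is an illustrative snippet (not part of the tutorial scripts) that reads the standard Linux /proc/meminfo format:

```python
import os

def parse_mem_available_gb(meminfo_text: str) -> float:
    """Return the MemAvailable value from /proc/meminfo contents, in GB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kb = int(line.split()[1])  # value is reported in kB
            return kb / (1024 ** 2)
    raise ValueError("MemAvailable not found")

# Only attempt the read on a Linux system where /proc/meminfo exists
if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        print(f"Available memory: {parse_mem_available_gb(f.read()):.1f} GB")
```

If the reported figure is far below ~115 GB before you start, close desktop apps first.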

Which Method Should I Use?

| Method | Model | Measured Memory | Training Time | Best For |
|---|---|---|---|---|
| Full SFT | Qwen3.5 4B | ~42 GB | ~5 min (1 epoch, 500 samples) | Maximum quality |
| LoRA | Qwen3.5 9B | ~50 GB | ~3.5 min (1 epoch, 512 samples) | Good balance of quality and efficiency |
| QLoRA (4-bit) | Qwen3.5 27B | ~28 GB | ~10 min (1 epoch, 512 samples) | Largest models with least memory |

These times are rough guides for the defaults in each section below; a larger --dataset_size or higher --num_epochs will take proportionally longer.

💡 How to choose

  • Full SFT updates every parameter: maximum expressiveness, but only practical for models up to ~4–9B on Thor.
  • LoRA freezes the base model and trains small adapter matrices (~1–2% of parameters). Uses bf16 weights; works well up to ~14B.
  • QLoRA combines 4-bit quantization with LoRA: it loads the model in 4-bit to save memory and trains the same small adapters.

All three methods produce models in standard Hugging Face SafeTensors format that can be deployed with vLLM, Ollama, llama.cpp, or TensorRT-LLM.

Environment Setup

Step 1: Pull the PyTorch Container

docker pull nvcr.io/nvidia/pytorch:25.11-py3

Step 2: Launch the Container

Navigate to the desired directory first, then launch the container ($(pwd) mounts your current directory as /workspace):

cd ~/Desktop/train/finetune

docker run --runtime nvidia -it --rm --ipc=host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd):/workspace \
  -w /workspace \
  nvcr.io/nvidia/pytorch:25.11-py3

Step 3: Install Dependencies

Inside the container:

pip install trl datasets accelerate peft bitsandbytes

Step 4: Authenticate with Hugging Face (Optional)

Qwen models are openly available and do not require authentication. This step is only needed if you use a gated model (e.g. Meta Llama) or experience rate-limiting during large downloads (the 27B QLoRA model is ~43 GB).

export HF_TOKEN="hf_your_token_here"

Replace hf_your_token_here with your actual token from huggingface.co/settings/tokens.
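
If downloads fail unexpectedly, a quick sanity check can confirm the token is at least present and well-formed. This is a hypothetical helper (not part of the downloaded scripts); it relies only on the convention that Hugging Face user access tokens start with the "hf_" prefix:

```python
import os
from typing import Optional

def hf_token_looks_valid(token: Optional[str]) -> bool:
    """Heuristic check: token is set and follows the conventional hf_ prefix."""
    return bool(token) and token.startswith("hf_") and len(token) > 10

if __name__ == "__main__":
    if hf_token_looks_valid(os.environ.get("HF_TOKEN")):
        print("HF_TOKEN is set.")
    else:
        print("HF_TOKEN missing or malformed; gated or rate-limited downloads may fail.")
```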

Step 5: Download the Scripts

wget https://www.jetson-ai-lab.com/code-samples/finetune/full_sft_finetuning.py
wget https://www.jetson-ai-lab.com/code-samples/finetune/lora_finetuning.py
wget https://www.jetson-ai-lab.com/code-samples/finetune/qlora_finetuning.py

Training Dataset

All three scripts use the tatsu-lab/alpaca dataset by default, a collection of ~52,000 instruction-following examples in this format:

| Field | Description | Example |
|---|---|---|
| instruction | The task to perform | "Summarize the following paragraph." |
| input | Optional context | (the paragraph text) |
| output | The expected response | (the summary) |

Each example is formatted into a prompt template during training:

Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
### Instruction: {instruction}

### Input: {input}

### Response: {output}

The model learns to generate the ### Response: portion. By default, only a subset is used (500–512 samples) to keep training fast for demonstration.
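
To make this concrete, here is how one example is assembled into a single training string. The template constant and EOS placeholder below are assumptions modeled on the format shown above; the real scripts append the tokenizer's actual EOS token:

```python
# Assumption: this template mirrors the prompt format shown above.
ALPACA_PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context.\n"
    "Write a response that appropriately completes the request.\n"
    "### Instruction: {instruction}\n\n"
    "### Input: {input}\n\n"
    "### Response: {output}"
)

example = {
    "instruction": "Summarize the following paragraph.",
    "input": "Jetson Thor shares 128 GB of unified memory between CPU and GPU.",
    "output": "Jetson Thor has 128 GB of memory shared by CPU and GPU.",
}

eos_token = "<eos>"  # placeholder; use tokenizer.eos_token in practice
text = ALPACA_PROMPT_TEMPLATE.format(**example) + eos_token
print(text)
```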

Using Your Own Dataset

To fine-tune on your own data, prepare a JSON Lines (.jsonl) or JSON file with the same three fields (instruction, input, and output):

[
  {
    "instruction": "Classify the sentiment of this review.",
    "input": "The battery life is amazing and the screen is crystal clear.",
    "output": "Positive"
  },
  {
    "instruction": "Extract the part number from this text.",
    "input": "Please ship 50 units of PN-4820-X to warehouse B.",
    "output": "PN-4820-X"
  },
  {
    "instruction": "Translate the following to Spanish.",
    "input": "The system is operating normally.",
    "output": "El sistema está funcionando normalmente."
  }
]

Then modify the get_alpaca_dataset() function in whichever script you're using to load your file instead:

def get_custom_dataset(eos_token, data_path):
    # Load a local JSON/JSONL file and shuffle for training
    dataset = load_dataset("json", data_files=data_path, split="train").shuffle(seed=42)

    def preprocess(x):
        # Fill the named placeholders of the prompt template shown above
        texts = [
            ALPACA_PROMPT_TEMPLATE.format(instruction=instruction, input=inp, output=output) + eos_token
            for instruction, inp, output in zip(x["instruction"], x["input"], x["output"])
        ]
        return {"text": texts}

    return dataset.map(preprocess, remove_columns=dataset.column_names, batched=True)
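
Before training, a quick standalone check (illustrative, not part of the scripts) can catch malformed records early. Load your JSON file with json.load() and pass the resulting list in:

```python
REQUIRED_FIELDS = {"instruction", "input", "output"}

def validate_records(records):
    """Return (index, problem) pairs for malformed Alpaca-style records."""
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            problems.append((i, "missing fields: " + ", ".join(sorted(missing))))
        elif not rec["instruction"].strip() or not rec["output"].strip():
            problems.append((i, "empty instruction or output"))
    return problems

records = [
    {"instruction": "Classify the sentiment.", "input": "Great battery.", "output": "Positive"},
    {"instruction": "", "input": "", "output": "Positive"},  # bad: empty instruction
]
print(validate_records(records))  # flags record 1
```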

Tips for your own dataset

  • Quality over quantity: 500 high-quality examples often outperform 5,000 noisy ones
  • Leave input empty ("") for tasks that don't need additional context
  • Be consistent: use the same output style across all examples (e.g. always JSON, always one sentence, etc.)
  • Match your use case: if your task is classification, every example should be a classification task

Option 1: Full SFT (Qwen3.5 4B)

Full Supervised Fine-Tuning updates every parameter in the model. This gives maximum flexibility but uses more memory since the optimizer must store states for all parameters.
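
The downloaded script handles all of this for you; as a rough sketch of what such a script does with TRL's SFTTrainer (argument and field names vary across TRL versions, and train_dataset is assumed to be a prepared dataset with a "text" column, as built in the dataset section above):

```python
# Sketch only: assumes TRL's SFTTrainer / SFTConfig API; verify names against
# your installed TRL version and the downloaded script.
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="./finetuned_model",
    per_device_train_batch_size=4,   # --batch_size
    gradient_accumulation_steps=2,   # --gradient_accumulation_steps
    num_train_epochs=1,              # --num_epochs
    learning_rate=5e-5,              # --learning_rate
    bf16=True,                       # bfloat16 training on Thor
    gradient_checkpointing=True,     # --gradient_checkpointing on
)

trainer = SFTTrainer(
    model="Qwen/Qwen3.5-4B",         # TRL accepts a model ID string here
    args=config,
    train_dataset=train_dataset,     # assumed: dataset with a "text" column
)
trainer.train()
trainer.save_model("./finetuned_model")
```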

python full_sft_finetuning.py --output_dir ./finetuned_model

You should see output like:

============================================================
TRAINING COMPLETED
============================================================
Training runtime: 298.18 seconds
Samples per second: 1.68
Steps per second: 0.21
Train loss: 1.1620
============================================================

Saving model to ./finetuned_model...
Writing model shards: 100%|████████████████████| 1/1 [00:09<00:00,  9.94s/it]
Model saved successfully!

Full list of configuration options

| Parameter | Default | Description |
|---|---|---|
| --model_name | Qwen/Qwen3.5-4B | Model to fine-tune |
| --batch_size | 4 | Per-device batch size |
| --gradient_accumulation_steps | 2 | Gradient accumulation |
| --seq_length | 2048 | Max sequence length |
| --num_epochs | 1 | Training epochs |
| --learning_rate | 5e-5 | Learning rate |
| --dataset_size | 500 | Samples to use |
| --gradient_checkpointing | on | Save memory at cost of speed |
| --use_torch_compile | off | torch.compile (adds warmup time) |
| --output_dir | (none) | Where to save the model |

💡 Understanding Batch Size and Memory

Two parameters control how many samples the model processes per optimizer update:

  • --batch_size: samples processed at once on the GPU (directly affects memory usage)
  • --gradient_accumulation_steps: how many mini-batches to accumulate before updating weights

The effective batch size = batch_size × gradient_accumulation_steps. Training quality depends on the effective batch size, not the per-device batch size. So you can lower --batch_size to save memory and raise --gradient_accumulation_steps to compensate: the model learns identically, it just processes fewer samples per forward pass.

| batch_size | accum_steps | Effective Batch | Memory (Qwen3.5 4B Full SFT) |
|---|---|---|---|
| 8 | 1 | 8 | ~87 GB (may OOM with desktop running) |
| 4 | 2 | 8 | ~42 GB (default, safe) |
| 2 | 4 | 8 | ~25–30 GB (conservative) |
| 1 | 8 | 8 | ~18–22 GB (minimum) |

To override the defaults, pass both flags together:

python full_sft_finetuning.py --batch_size 2 --gradient_accumulation_steps 4 --output_dir ./finetuned_model

Option 2: LoRA (Qwen3.5 9B)

LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices injected into the attention and MLP layers. Only ~1–2% of parameters are trainable, which dramatically reduces memory usage and makes it possible to fine-tune larger models on Jetson.
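
Internally, LoRA-style scripts build an adapter configuration along these lines. This sketch uses peft's LoraConfig with this section's defaults (rank 8, alpha 16); the target_modules list is an assumption for Qwen-style transformer blocks and may differ from the downloaded script:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,             # --lora_rank: rank of the low-rank update matrices
    lora_alpha=16,   # --lora_alpha: scaling factor (update scaled by alpha / r)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Assumption: attention + MLP projections of Qwen-style blocks
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```

A higher rank trains more parameters and can capture more task-specific behavior, at the cost of memory and adapter size.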

python lora_finetuning.py --output_dir ./lora_adapter

The output will show how few parameters are actually trained:

============================================================
TRAINING COMPLETED
============================================================
Training runtime: 215.79 seconds
Samples per second: 2.37
Steps per second: 0.30
Train loss: 0.9587
============================================================

Saving LoRA adapter to ./lora_adapter...
LoRA adapter saved successfully!

Full list of configuration options

| Parameter | Default | Description |
|---|---|---|
| --model_name | Qwen/Qwen3.5-9B | Model to fine-tune |
| --batch_size | 4 | Per-device batch size |
| --gradient_accumulation_steps | 2 | Gradient accumulation |
| --seq_length | 2048 | Max sequence length |
| --num_epochs | 1 | Training epochs |
| --learning_rate | 1e-4 | Learning rate |
| --lora_rank | 8 | LoRA rank (higher = more params) |
| --lora_alpha | 16 | LoRA scaling factor |
| --dataset_size | 512 | Samples to use |
| --gradient_checkpointing | off | Save memory at cost of speed |
| --use_torch_compile | off | torch.compile (adds warmup time) |
| --output_dir | (none) | Where to save the LoRA adapter |

Option 3: QLoRA (Qwen3.5 27B)

QLoRA (Quantized LoRA) loads the base model in 4-bit precision and trains LoRA adapters on top. This dramatically reduces memory: fine-tuning a 27B model uses less memory than Full SFT on a 4B model.
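
Under the hood, QLoRA setups typically combine transformers' 4-bit loading with a peft adapter, roughly like this sketch (standard API names, matching this section's defaults of rank 16 and alpha 32, but the downloaded script may differ in detail):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store weights in 4-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-27B", quantization_config=bnb_config, device_map="auto"
)
# Attach trainable LoRA adapters on top of the frozen 4-bit base
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```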

python qlora_finetuning.py --output_dir ./qlora_adapter

You should see output like:

============================================================
QLoRA FINE-TUNING CONFIGURATION
============================================================
Model: Qwen/Qwen3.5-27B
Training mode: QLoRA (4-bit, rank=16, alpha=32)
Batch size: 2
Gradient accumulation: 4
Effective batch size: 8
.....
============================================================
TRAINING COMPLETED
============================================================
Training runtime: 618.94 seconds
Samples per second: 0.83
Steps per second: 0.10
Train loss: 0.8245
============================================================

Saving QLoRA adapter to ./qlora_adapter...
QLoRA adapter saved successfully!

Full list of configuration options

| Parameter | Default | Description |
|---|---|---|
| --model_name | Qwen/Qwen3.5-27B | Model to fine-tune |
| --batch_size | 2 | Per-device batch size |
| --gradient_accumulation_steps | 4 | Gradient accumulation |
| --seq_length | 2048 | Max sequence length |
| --num_epochs | 1 | Training epochs |
| --learning_rate | 2e-4 | Learning rate |
| --lora_rank | 16 | LoRA rank (higher = more params) |
| --lora_alpha | 32 | LoRA scaling factor |
| --dataset_size | 512 | Samples to use |
| --gradient_checkpointing | on | Save memory at cost of speed |
| --output_dir | (none) | Where to save the QLoRA adapter |

Deploying Your Fine-tuned Model

After fine-tuning, you have model weights saved in Hugging Face SafeTensors format.

If you used Full SFT, your ./finetuned_model is already a complete model, ready to deploy as-is.

If you used LoRA or QLoRA, the output directory contains only adapter weights. Merge the adapter into the base model to produce a standalone model:

# For LoRA (Qwen3.5 9B)
python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen3.5-9B', torch_dtype='auto', trust_remote_code=True)
model = PeftModel.from_pretrained(base, './lora_adapter')
merged = model.merge_and_unload()
merged.save_pretrained('./merged_model')
AutoTokenizer.from_pretrained('Qwen/Qwen3.5-9B').save_pretrained('./merged_model')
print('Merged model saved to ./merged_model')
"

# For QLoRA (Qwen3.5 27B)
python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen3.5-27B', torch_dtype='auto', trust_remote_code=True)
model = PeftModel.from_pretrained(base, './qlora_adapter')
merged = model.merge_and_unload()
merged.save_pretrained('./merged_model_27b')
AutoTokenizer.from_pretrained('Qwen/Qwen3.5-27B').save_pretrained('./merged_model_27b')
print('Merged model saved to ./merged_model_27b')
"

Once you have a complete model, you can serve it with vLLM, Ollama, llama.cpp, or TensorRT-LLM. For a full walkthrough on deploying and serving models on Jetson, see the Introduction to GenAI on Jetson: How to Run LLMs and VLMs tutorial.

Troubleshooting

Out of memory (CUDA OOM)

Jetson Thor uses unified memory: the 128 GB is shared between CPU and GPU. If the training process plus the OS exceed available memory, the Linux OOM killer will terminate processes (including your editor or the training run itself).

Default memory usage: Full SFT ~42 GB, LoRA ~50 GB, QLoRA ~28 GB. If you still hit OOM:

  1. Reduce --batch_size (to 2 or 1)
  2. Increase --gradient_accumulation_steps proportionally to keep the effective batch size
  3. Reduce --seq_length to 1024 or 512
  4. Switch to a more memory-efficient method: Full SFT β†’ LoRA β†’ QLoRA
  5. Close memory-heavy desktop apps (browsers, IDEs) before training
  6. For QLoRA, models larger than ~27B may OOM during the weight-loading phase (the bf16→4-bit conversion requires temporarily holding the original weights in RAM)

Hugging Face authentication errors

Qwen models are openly available; no gating or license acceptance is required. If you still encounter download issues:

  1. Create a Hugging Face account
  2. Create an access token
  3. Set the token: export HF_TOKEN="hf_your_token_here"

Fine-tuned model shows little improvement

If the fine-tuned model responses look similar to the base model:

  • Use more training data: --dataset_size 2000 or higher; this has the biggest impact
  • Train for more epochs: --num_epochs 3 to 5 with a larger dataset
  • Lower the learning rate if training loss spikes: try --learning_rate 1e-5
  • Use your own domain-specific dataset rather than Alpaca; these models have likely already seen similar instruction-following data during pre-training
