Fine-tune LLMs on Jetson
Learn how to fine-tune large language models directly on Jetson using PyTorch and Hugging Face TRL. Covers Full SFT (4B), LoRA (9B), and QLoRA (27B).
Aditya Sahu
Fine-tuning lets you customize a pre-trained LLM on your own data so it becomes better at a specific task, such as following domain-specific instructions, answering questions in a particular style, or understanding specialized vocabulary.
Prerequisites
| Requirement | Details |
|---|---|
| Device | Jetson AGX Thor (128GB) |
| Software | JetPack 7.x (R38) or later, Docker with NVIDIA runtime |
| Account | Hugging Face (free); Qwen models are openly available, no license acceptance required |
⚠️ Jetson Thor Unified Memory
Jetson Thor shares its 128 GB of memory between the CPU and GPU. The OS, desktop environment, and background processes typically consume 4–6 GB, leaving ~115 GB available for training. All default configurations in this tutorial are tuned to stay well within this limit. If you increase the batch size or sequence length, or use a larger model, monitor memory with tegrastats or jtop to avoid triggering the system OOM killer.
Which Method Should I Use?
| Method | Model | Measured Memory | Training Time | Best For |
|---|---|---|---|---|
| Full SFT | Qwen3.5 4B | ~42 GB | ~5 min (1 epoch, 500 samples) | Maximum quality |
| LoRA | Qwen3.5 9B | ~50 GB | ~3.5 min (1 epoch, 512 samples) | Good balance of quality and efficiency |
| QLoRA (4-bit) | Qwen3.5 27B | ~28 GB | ~10 min (1 epoch, 512 samples) | Largest models with least memory |
These times are rough guides for the defaults in each section below; a larger --dataset_size or more --num_epochs will take proportionally longer.
💡 How to choose
- Full SFT updates every parameter for maximum expressiveness, but is only practical for models up to roughly 4–9B on Thor.
- LoRA freezes the base model and trains small adapter matrices (~1–2% of parameters). It uses bf16 weights and works well up to ~14B.
- QLoRA combines 4-bit quantization with LoRA: the model is loaded in 4-bit to save memory, and the same small adapters are trained on top.
All three methods produce models in standard Hugging Face SafeTensors format that can be deployed with vLLM, Ollama, llama.cpp, or TensorRT-LLM.
Environment Setup
Step 1: Pull the PyTorch Container
docker pull nvcr.io/nvidia/pytorch:25.11-py3
Step 2: Launch the Container
Navigate to the desired directory first, then launch the container; $(pwd) mounts your current directory as /workspace:
cd ~/Desktop/train/finetune
docker run --runtime nvidia -it --rm --ipc=host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
-v $(pwd):/workspace \
-w /workspace \
nvcr.io/nvidia/pytorch:25.11-py3
Step 3: Install Dependencies
Inside the container:
pip install trl datasets accelerate peft bitsandbytes
Step 4: Authenticate with Hugging Face (Optional)
Qwen models are openly available and do not require authentication. This step is only needed if you use a gated model (e.g. Meta Llama) or experience rate-limiting during large downloads (the 27B QLoRA model is ~43 GB).
export HF_TOKEN="hf_your_token_here"
Replace hf_your_token_here with your actual token from huggingface.co/settings/tokens.
Step 5: Download the Scripts
wget https://www.jetson-ai-lab.com/code-samples/finetune/full_sft_finetuning.py
wget https://www.jetson-ai-lab.com/code-samples/finetune/lora_finetuning.py
wget https://www.jetson-ai-lab.com/code-samples/finetune/qlora_finetuning.py
Training Dataset
All three scripts use the tatsu-lab/alpaca dataset by default: a collection of ~52,000 instruction-following examples in this format:
| Field | Description | Example |
|---|---|---|
| instruction | The task to perform | "Summarize the following paragraph." |
| input | Optional context | (the paragraph text) |
| output | The expected response | (the summary) |
Each example is formatted into a prompt template during training:
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
### Instruction: {instruction}
### Input: {input}
### Response: {output}
The model learns to generate the ### Response: portion. By default, only a subset is used (500β512 samples) to keep training fast for demonstration.
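As a sanity check, the template above can be reproduced in a few lines of Python. This is an illustrative reconstruction (the constant name mirrors ALPACA_PROMPT_TEMPLATE used in the scripts; the example record and exact whitespace are assumptions):

```python
# Illustrative reconstruction of the Alpaca prompt template shown above.
ALPACA_PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context.\n"
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction: {instruction}\n"
    "### Input: {input}\n"
    "### Response: {output}"
)

# Hypothetical example record in the Alpaca format
example = {
    "instruction": "Summarize the following paragraph.",
    "input": "Jetson Thor is an edge AI platform...",
    "output": "Jetson Thor targets edge AI workloads.",
}

text = ALPACA_PROMPT_TEMPLATE.format(**example)
print(text)
```

During training the full formatted string (plus the tokenizer's EOS token) becomes one training sample, and the model learns to produce everything after "### Response:".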
Using Your Own Dataset
To fine-tune on your own data, prepare a JSON Lines (.jsonl) or JSON file with the same three fields (instruction, input, and output):
[
  {
    "instruction": "Classify the sentiment of this review.",
    "input": "The battery life is amazing and the screen is crystal clear.",
    "output": "Positive"
  },
  {
    "instruction": "Extract the part number from this text.",
    "input": "Please ship 50 units of PN-4820-X to warehouse B.",
    "output": "PN-4820-X"
  },
  {
    "instruction": "Translate the following to Spanish.",
    "input": "The system is operating normally.",
    "output": "El sistema está funcionando normalmente."
  }
]
Then modify the get_alpaca_dataset() function in whichever script you're using to load your file instead:
def get_custom_dataset(eos_token, data_path):
    # Load a local JSON/JSONL file and shuffle it deterministically
    dataset = load_dataset("json", data_files=data_path, split="train").shuffle(seed=42)

    def preprocess(x):
        # Fill the template's named placeholders and append the EOS token
        texts = [
            ALPACA_PROMPT_TEMPLATE.format(instruction=instruction, input=inp, output=output) + eos_token
            for instruction, inp, output in zip(x["instruction"], x["input"], x["output"])
        ]
        return {"text": texts}

    return dataset.map(preprocess, remove_columns=dataset.column_names, batched=True)
Tips for your own dataset
- Quality over quantity: 500 high-quality examples often outperform 5,000 noisy ones
- Leave input empty ("") for tasks that don't need additional context
- Be consistent: use the same output style across all examples (e.g. always JSON, always one sentence)
- Match your use case: if your task is classification, every example should be a classification task
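Malformed records are easier to catch before a training run than after one. This is a minimal validation sketch, not part of the provided scripts; the function name and checks are illustrative:

```python
import json

REQUIRED_FIELDS = {"instruction", "input", "output"}

def validate_records(records):
    """Return a list of (index, problem) tuples for malformed examples."""
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            problems.append((i, f"missing fields: {sorted(missing)}"))
        elif not rec["instruction"].strip() or not rec["output"].strip():
            problems.append((i, "empty instruction or output"))
    return problems

# Illustrative data: the second record has an empty output
records = json.loads("""[
  {"instruction": "Classify the sentiment.", "input": "Great screen.", "output": "Positive"},
  {"instruction": "Translate to Spanish.", "input": "Hello.", "output": ""}
]""")
print(validate_records(records))
```

Run it over your file before training; an empty list means every record has the three fields with non-empty instruction and output.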
Option 1: Full SFT (Qwen3.5 4B)
Full Supervised Fine-Tuning updates every parameter in the model. This gives maximum flexibility but uses more memory since the optimizer must store states for all parameters.
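The memory cost of full fine-tuning can be estimated with back-of-envelope arithmetic: bf16 weights and gradients (2 bytes each per parameter) plus fp32 Adam moments (8 bytes per parameter). The byte counts below are typical defaults, not measured values for these scripts:

```python
# Back-of-envelope memory estimate for full fine-tuning (illustrative; real
# usage also depends on activations, buffers, and the optimizer implementation).
def full_sft_bytes_per_param(weight_bytes=2, grad_bytes=2, adam_state_bytes=8):
    # bf16 weights + bf16 gradients + fp32 Adam moments (m and v, 4 bytes each)
    return weight_bytes + grad_bytes + adam_state_bytes

params = 4e9  # a 4B-parameter model
gib = params * full_sft_bytes_per_param() / 1024**3
print(f"~{gib:.0f} GiB before activations")
```

For a 4B model this lands in the mid-40s of GiB, the same ballpark as the ~42 GB measured for the default configuration; the exact figure depends on gradient checkpointing and activation sizes.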
python full_sft_finetuning.py --output_dir ./finetuned_model
You should see output like:
============================================================
TRAINING COMPLETED
============================================================
Training runtime: 298.18 seconds
Samples per second: 1.68
Steps per second: 0.21
Train loss: 1.1620
============================================================
Saving model to ./finetuned_model...
Writing model shards: 100%|██████████| 1/1 [00:09<00:00, 9.94s/it]
Model saved successfully!
Full list of configuration options
| Parameter | Default | Description |
|---|---|---|
| --model_name | Qwen/Qwen3.5-4B | Model to fine-tune |
| --batch_size | 4 | Per-device batch size |
| --gradient_accumulation_steps | 2 | Gradient accumulation |
| --seq_length | 2048 | Max sequence length |
| --num_epochs | 1 | Training epochs |
| --learning_rate | 5e-5 | Learning rate |
| --dataset_size | 500 | Samples to use |
| --gradient_checkpointing | on | Save memory at cost of speed |
| --use_torch_compile | off | torch.compile (adds warmup time) |
| --output_dir | (required) | Where to save model |
💡 Understanding Batch Size and Memory
Two parameters control how many samples the model processes per optimizer update:
- --batch_size: samples processed at once on the GPU (directly affects memory usage)
- --gradient_accumulation_steps: how many mini-batches to accumulate before updating weights
The effective batch size = batch_size × gradient_accumulation_steps. Training quality depends on the effective batch size, not the per-device batch size, so you can lower --batch_size to save memory and raise --gradient_accumulation_steps to compensate: the model learns identically, it just processes fewer samples per forward pass.
| batch_size | accum_steps | Effective Batch | Memory (Qwen3.5 4B Full SFT) |
|---|---|---|---|
| 8 | 1 | 8 | ~87 GB (may OOM with desktop running) |
| 4 | 2 | 8 | ~42 GB (default, safe) |
| 2 | 4 | 8 | ~25β30 GB (conservative) |
| 1 | 8 | 8 | ~18β22 GB (minimum) |
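The equivalence of the rows above can be checked with simple arithmetic: every configuration performs the same number of optimizer updates over the same data. A small sketch (the function name is illustrative):

```python
def schedule(num_samples, batch_size, accum_steps):
    """How many forward passes (mini-batches) and optimizer updates a run performs."""
    num_batches = num_samples // batch_size          # forward/backward passes
    updates = num_batches // accum_steps             # weight updates
    return num_batches, updates

# Same effective batch of 8, different per-pass memory footprints:
for bs, accum in [(8, 1), (4, 2), (2, 4), (1, 8)]:
    batches, updates = schedule(512, bs, accum)
    print(f"batch_size={bs} accum={accum} effective={bs * accum} updates={updates}")
```

All four rows produce the same number of weight updates; only the number of samples held in memory per forward pass changes.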
To override the defaults, pass both flags together:
python full_sft_finetuning.py --batch_size 2 --gradient_accumulation_steps 4 --output_dir ./finetuned_model
Option 2: LoRA (Qwen3.5 9B)
LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices injected into the attention and MLP layers. Only ~1–2% of parameters are trainable, which dramatically reduces memory usage and makes it possible to fine-tune larger models on Jetson.
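The small trainable fraction follows from the adapter shapes: for a d_out × d_in weight matrix, LoRA trains two rank-r factors instead of the full matrix. A quick sketch of the per-layer fraction (the 4096 × 4096 projection size is illustrative; the model-wide trainable fraction depends on which layers are targeted and the rank):

```python
def lora_param_fraction(d_in, d_out, rank):
    """Fraction of a weight matrix's parameters that a LoRA adapter trains."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)  # A: (rank, d_in), B: (d_out, rank)
    return adapter / full

# e.g. a hypothetical 4096x4096 attention projection with rank 8
frac = lora_param_fraction(4096, 4096, 8)
print(f"{frac:.2%} of the layer's parameters are trainable")
```

At rank 8 a square 4096-dim projection trains well under 1% of its weights; raising --lora_rank increases this fraction (and adapter capacity) linearly.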
python lora_finetuning.py --output_dir ./lora_adapter
The output will show how few parameters are actually trained:
============================================================
TRAINING COMPLETED
============================================================
Training runtime: 215.79 seconds
Samples per second: 2.37
Steps per second: 0.30
Train loss: 0.9587
============================================================
Saving LoRA adapter to ./lora_adapter...
LoRA adapter saved successfully!
Full list of configuration options
| Parameter | Default | Description |
|---|---|---|
| --model_name | Qwen/Qwen3.5-9B | Model to fine-tune |
| --batch_size | 4 | Per-device batch size |
| --gradient_accumulation_steps | 2 | Gradient accumulation |
| --seq_length | 2048 | Max sequence length |
| --num_epochs | 1 | Training epochs |
| --learning_rate | 1e-4 | Learning rate |
| --lora_rank | 8 | LoRA rank (higher = more params) |
| --lora_alpha | 16 | LoRA scaling factor |
| --dataset_size | 512 | Samples to use |
| --gradient_checkpointing | off | Save memory at cost of speed |
| --use_torch_compile | off | torch.compile (adds warmup time) |
| --output_dir | (required) | Where to save LoRA adapter |
Option 3: QLoRA (Qwen3.5 27B)
QLoRA (Quantized LoRA) loads the base model in 4-bit precision and trains LoRA adapters on top. This dramatically reduces memory: fine-tuning a 27B model uses less memory than Full SFT on a 4B model.
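The savings come almost entirely from the weights: at 4 bits per parameter the 27B base model needs roughly a quarter of its bf16 footprint, and the LoRA adapters, activations, and quantization metadata add only a few GB on top. A rough estimate (ignores quantization overhead such as scales and block constants):

```python
def quantized_weight_gib(params, bits):
    """Approximate memory for model weights at a given quantization width."""
    return params * bits / 8 / 1024**3

base_4bit = quantized_weight_gib(27e9, 4)    # 4-bit base weights
base_bf16 = quantized_weight_gib(27e9, 16)   # same model in bf16
print(f"4-bit: ~{base_4bit:.1f} GiB vs bf16: ~{base_bf16:.1f} GiB")
```

The ~12.6 GiB of 4-bit weights plus adapters, optimizer states, and activations is consistent with the ~28 GB measured for the default QLoRA configuration.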
python qlora_finetuning.py --output_dir ./qlora_adapter
You should see output like:
============================================================
QLoRA FINE-TUNING CONFIGURATION
============================================================
Model: Qwen/Qwen3.5-27B
Training mode: QLoRA (4-bit, rank=16, alpha=32)
Batch size: 2
Gradient accumulation: 4
Effective batch size: 8
.....
============================================================
TRAINING COMPLETED
============================================================
Training runtime: 618.94 seconds
Samples per second: 0.83
Steps per second: 0.10
Train loss: 0.8245
============================================================
Saving QLoRA adapter to ./qlora_adapter...
QLoRA adapter saved successfully!
Full list of configuration options
| Parameter | Default | Description |
|---|---|---|
| --model_name | Qwen/Qwen3.5-27B | Model to fine-tune |
| --batch_size | 2 | Per-device batch size |
| --gradient_accumulation_steps | 4 | Gradient accumulation |
| --seq_length | 2048 | Max sequence length |
| --num_epochs | 1 | Training epochs |
| --learning_rate | 2e-4 | Learning rate |
| --lora_rank | 16 | LoRA rank (higher = more params) |
| --lora_alpha | 32 | LoRA scaling factor |
| --dataset_size | 512 | Samples to use |
| --gradient_checkpointing | on | Save memory at cost of speed |
| --output_dir | (required) | Where to save QLoRA adapter |
Deploying Your Fine-tuned Model
After fine-tuning, you have model weights saved in Hugging Face SafeTensors format.
If you used Full SFT, your ./finetuned_model is already a complete model β ready to deploy as-is.
If you used LoRA or QLoRA, the output directory contains only adapter weights. Merge the adapter into the base model to produce a standalone model:
# For LoRA (Qwen3.5 9B)
python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen3.5-9B', torch_dtype='auto', trust_remote_code=True)
model = PeftModel.from_pretrained(base, './lora_adapter')
merged = model.merge_and_unload()
merged.save_pretrained('./merged_model')
AutoTokenizer.from_pretrained('Qwen/Qwen3.5-9B').save_pretrained('./merged_model')
print('Merged model saved to ./merged_model')
"
# For QLoRA (Qwen3.5 27B)
python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen3.5-27B', torch_dtype='auto', trust_remote_code=True)
model = PeftModel.from_pretrained(base, './qlora_adapter')
merged = model.merge_and_unload()
merged.save_pretrained('./merged_model_27b')
AutoTokenizer.from_pretrained('Qwen/Qwen3.5-27B').save_pretrained('./merged_model_27b')
print('Merged model saved to ./merged_model_27b')
"
Once you have a complete model, you can serve it with vLLM, Ollama, llama.cpp, or TensorRT-LLM. For a full walkthrough on deploying and serving models on Jetson, see the Introduction to GenAI on Jetson: How to Run LLMs and VLMs tutorial.
Troubleshooting
Out of memory (CUDA OOM)
Jetson Thor uses unified memory: the 128 GB is shared between CPU and GPU. If the training process plus the OS exceed available memory, the Linux OOM killer will terminate processes (including your IDE or the training run itself).
Default memory usage: Full SFT ~42 GB, LoRA ~50 GB, QLoRA ~28 GB. If you still hit OOM:
- Reduce --batch_size (to 2 or 1)
- Increase --gradient_accumulation_steps proportionally to keep the same effective batch size
- Reduce --seq_length to 1024 or 512
- Switch to a more memory-efficient method: Full SFT → LoRA → QLoRA
- Close memory-heavy desktop apps (browsers, IDEs) before training
- For QLoRA, models larger than ~27B may OOM during the weight-loading phase (the bf16 → 4-bit conversion requires temporarily holding the original weights in RAM)
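Before launching a run, you can check how much memory is actually free. A minimal Linux-only sketch that parses /proc/meminfo (on Jetson's unified memory, what is available to the CPU is also what is available to the GPU); the function name is illustrative:

```python
from pathlib import Path

def mem_available_gib(meminfo_path="/proc/meminfo"):
    """Return MemAvailable in GiB, or None if the file can't be read (non-Linux)."""
    path = Path(meminfo_path)
    if not path.exists():
        return None
    for line in path.read_text().splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # value is reported in kB (KiB)
            return kib / 1024**2
    return None

print(mem_available_gib())
```

Compare the result against the measured usage for your chosen method before starting, and keep watching with tegrastats or jtop during training.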
Hugging Face authentication errors
Qwen models are openly available; no gating or license acceptance is required. If you still encounter download issues:
- Create a Hugging Face account
- Create an access token
- Set the token:
export HF_TOKEN="hf_your_token_here"
Fine-tuned model shows little improvement
If the fine-tuned model's responses look similar to the base model's:
- Use more training data: --dataset_size 2000 or higher; this has the biggest impact
- Train for more epochs: --num_epochs 3 to 5 with a larger dataset
- Lower the learning rate if the training loss spikes: try --learning_rate 1e-5
- Use your own domain-specific dataset rather than Alpaca; these models have likely already seen similar instruction-following data during pre-training