
Tutorial - Ollama

Ollama is a popular LLM tool that's easy to get started with. It includes a built-in model library of pre-quantized weights that are downloaded automatically and run with llama.cpp underneath for inference. The ollama container is compiled with CUDA support.

Ollama Server

What you need

  1. One of the following Jetson devices:

    • Jetson AGX Orin (64GB)
    • Jetson AGX Orin (32GB)
    • Jetson Orin NX (16GB)
    • Jetson Orin Nano (8GB)

  2. Running one of the following versions of JetPack:

    • JetPack 5 (L4T r35.x)
    • JetPack 6 (L4T r36.x)

  3. Sufficient storage space (preferably with NVMe SSD).

    • 7GB for ollama container image
    • Space for models (>5GB)
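
The storage requirement can be checked before pulling the image. The following is only a minimal sketch: the helper names `free_gb` and `has_free_gb` are hypothetical, and the 12GB threshold simply adds the 7GB image and 5GB model figures above, assuming both end up on the same filesystem.

```shell
# Hypothetical helpers (not part of jetson-containers or Ollama).

# Print the free space, in whole GB, of the filesystem holding the given path.
free_gb() {
  # df -P prints one line per filesystem; -k reports 1024-byte blocks
  df -Pk "$1" | awk 'NR == 2 { printf "%d", $4 / 1024 / 1024 }'
}

# Succeed if the filesystem holding path $1 has at least $2 GB free.
has_free_gb() {
  [ "$(free_gb "$1")" -ge "$2" ]
}

# Assumption: image and models share one filesystem, so 7GB + 5GB = 12GB.
target="${HOME:-/}"
if has_free_gb "$target" 12; then
  echo "enough free space under $target for the ollama image and a model"
else
  echo "less than 12GB free under $target"
fi
```

If Docker's data directory or the model cache lives on a separate NVMe SSD, pass that mount point to `has_free_gb` instead.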

Start the Ollama server container with one of the following commands:

# models cached under jetson-containers/data
jetson-containers run --name ollama $(autotag ollama)

# models cached under your user's home directory
docker run --runtime nvidia --rm --network=host -v ~/ollama:/ollama -e OLLAMA_MODELS=/ollama dustynv/ollama:r36.2.0

Running either of these commands starts the local Ollama server as a daemon in the background. The server saves the models it downloads under your mounted jetson-containers/data/models/ollama directory, or under another directory that you set with OLLAMA_MODELS (~/ollama on the host in the docker run example above).
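
To confirm that the server came up, one option is to query its REST API. This is a sketch under stated assumptions: the server listens on Ollama's default port 11434 on localhost, `curl` is installed, and the helper names `ollama_base_url` and `ollama_is_up` are hypothetical.

```shell
# Hypothetical helpers; the host and port are assumptions about your setup.

# Build the server's base URL (defaults: localhost, port 11434).
ollama_base_url() {
  printf 'http://%s:%s' "${1:-localhost}" "${2:-11434}"
}

# Succeed if the server answers on /api/tags, which lists downloaded models.
ollama_is_up() {
  curl -fsS --max-time 5 "$(ollama_base_url "$@")/api/tags" >/dev/null 2>&1
}

# Example:
#   ollama_is_up && echo "Ollama server is running"
```

Because both containers above use `--network=host`, the same check should work from the Jetson host itself, not only from inside the container.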

Ollama Client

Start the Ollama command-line chat client with your desired model (for example: llama3, phi3, mistral):

# if running inside the same container as launched above
/bin/ollama run phi3

# if launching a new container for the client in another terminal
jetson-containers run $(autotag ollama) /bin/ollama run phi3

Or you can install Ollama's binaries for arm64 outside of the container (without CUDA, which only the server needs):

# download the latest ollama release for arm64 into /bin
sudo wget$(git ls-remote --refs --sort="version:refname" --tags | cut -d/ -f3- | sed 's/-rc.*//g' | tail -n1)/ollama-linux-arm64 -O /bin/ollama
sudo chmod +x /bin/ollama

# use the client like normal outside container
/bin/ollama run phi3
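
Besides the interactive client, the running server can also be queried from scripts over its REST API. The sketch below assumes the server is reachable at localhost:11434 and that the model has already been pulled. The helper names (`json_escape`, `ollama_generate_payload`, `ollama_generate`) are hypothetical, and the JSON escaping is deliberately simplified.

```shell
# Hypothetical helpers; the endpoint and fields follow Ollama's REST API.

# Escape backslashes and double quotes so a string can sit inside JSON.
# Simplification: newlines and control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body for a single, non-streamed completion.
ollama_generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Send the prompt to a server on the default port (assumed: localhost:11434).
ollama_generate() {
  curl -fsS http://localhost:11434/api/generate \
    -d "$(ollama_generate_payload "$1" "$2")"
}

# Example (requires a running server with the model already pulled):
#   ollama_generate phi3 'Why is the sky blue?'
```

Setting `"stream": false` asks for one JSON object instead of a stream of partial responses, which is easier to handle in a shell script.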

Open WebUI

To run an Open WebUI server for client browsers to connect to, use the open-webui container:

docker run -it --rm --network=host --add-host=host.docker.internal:host-gateway

You can then navigate your browser to http://JETSON_IP:8080 and create an account to log in (any credentials will do, since they are stored only locally).

Ollama uses llama.cpp for inference; benchmarks and comparisons of the various APIs are provided on the Llava page. It reaches roughly half the peak performance of faster APIs like NanoLLM, but is generally considered fast enough for text chat.
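
To get a rough generation speed on your own device, one approach is to use the timing fields that Ollama's REST API includes in a non-streamed /api/generate response: `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). The helper below is a hypothetical sketch that only does the arithmetic; extracting the two fields from the JSON response is left to your tool of choice.

```shell
# Hypothetical helper: tokens per second from the eval_count and
# eval_duration (nanoseconds) fields of a /api/generate response.
tokens_per_second() {
  awk -v count="$1" -v ns="$2" 'BEGIN {
    # Guard against a missing or zero duration
    if (ns + 0 <= 0) { print "n/a"; exit }
    printf "%.2f\n", count * 1e9 / ns
  }'
}

# Example: 100 tokens generated in 2 seconds (2000000000 ns)
tokens_per_second 100 2000000000   # prints 50.00
```

A single run is noisy, so averaging several prompts gives a more meaningful figure than one measurement.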