Tutorial - Small Language Models (SLM)

Small Language Models (SLMs) represent a growing class of language models that have <7B parameters - for example StableLM, Phi-2, and Gemma-2B. Their smaller memory footprint and faster performance make them good candidates for deploying on Jetson Orin Nano. Having been trained on high-quality curated datasets, some are very capable, with abilities at a similar level to the larger models.

This tutorial shows how to run optimized SLMs with quantization using the NanoLLM library and MLC/TVM backend. You can also run these models through tools like text-generation-webui and llama.cpp, just not as fast - and since the focus of SLMs is reduced computational and memory requirements, here we'll use the most optimized path available. The models shown below have been profiled:

SLM Benchmarks

•   The HuggingFace Open LLM Leaderboard is a collection of multitask benchmarks including reasoning & comprehension, math, coding, history, geography, etc.
•   The model's memory footprint includes 4-bit weights and KV cache at full context length (factor in extra for process overhead, library code, etc. - see the rough estimate below)
•   The Chat Model is the instruction-tuned variant used for chatting in the commands below, as opposed to the base completion model.
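
For a rough sense of where that footprint comes from, the sketch below estimates the 4-bit weight size for a 3B-parameter model. This is illustrative arithmetic only - the figures in the table were measured, and additionally include the KV cache and runtime overhead:

# Back-of-envelope estimate of 4-bit weight memory for an SLM
# (illustrative only - measured footprints also include the KV cache,
# which grows with context length, plus process/library overhead)

def weight_footprint_gb(num_params, bits_per_param=4):
    """Approximate weight memory in GB for a quantized model."""
    return num_params * bits_per_param / 8 / 1e9

# a 3B-parameter model quantized to 4 bits -> roughly 1.5GB of weights
print(f"{weight_footprint_gb(3e9):.1f} GB")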

Based on user interactions, the recommended models to try are stabilityai/stablelm-zephyr-3b and princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT, as they have output quality on par with Llama-2-7B and well-optimized neural architectures. These models have also been used as the base for various fine-tunes (for example Nous-Capybara-3B-V1.9) and mini VLMs. Others may not be particularly coherent.

Chatting with SLMs

What you need

  1. One of the following Jetson devices:

    • Jetson AGX Orin (64GB)
    • Jetson AGX Orin (32GB)
    • Jetson Orin NX (16GB)
    • Jetson Orin Nano (8GB)

  2. Running one of the following versions of JetPack:

    JetPack 6 (L4T r36.x)

  3. Sufficient storage space (preferably with NVMe SSD).

    • 22GB for nano_llm container image
    • Space for models (>5GB)
  4. Clone and set up jetson-containers:

    git clone https://github.com/dusty-nv/jetson-containers
    bash jetson-containers/install.sh
    

The nano_llm.chat program will automatically download and quantize models from HuggingFace like those listed in the table above:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT

•   For models requiring authentication, use --env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN>
•   Press Ctrl+C twice in succession to exit (once will interrupt bot output)

This enters an interactive mode where you chat back and forth using the keyboard (entering reset will clear the chat history).
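
If you want to script the model directly rather than use the CLI, NanoLLM also exposes a Python API. The following is a minimal sketch based on the library's NanoLLM.from_pretrained() interface (run it inside the nano_llm container; the quantization preset shown may need adjusting for your model):

# minimal sketch of loading and running an SLM with the NanoLLM Python API
from nano_llm import NanoLLM

model = NanoLLM.from_pretrained(
    "princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT",  # downloaded/quantized on first run
    api='mlc',                 # the optimized MLC/TVM backend used in this tutorial
    quantization='q4f16_ft',   # 4-bit weights with fp16 activations
)

# generate() streams output tokens as they are produced
response = model.generate("Once upon a time,", max_new_tokens=128)

for token in response:
    print(token, end='', flush=True)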

Automated Prompts

During testing, you can specify prompts on the command-line that will run sequentially:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model stabilityai/stablelm-zephyr-3b \
    --max-new-tokens 512 \
    --prompt 'hi, how are you?' \
    --prompt 'whats the square root of 900?' \
    --prompt 'can I get a recipie for french onion soup?'

You can also load JSON files containing prompt sequences, for example with --prompt /data/prompts/qa.json (the output of which is shown below).
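
The actual qa.json ships with jetson-containers; purely for illustration, a prompt-sequence file of this kind can be as simple as a JSON array of prompt strings (the questions here are hypothetical placeholders, not the contents of qa.json):

[
    "What is the capital of France?",
    "How many planets are in the solar system?"
]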

Results

•   The model responses were generated with 4-bit quantization, and are truncated to 256 tokens for brevity.
•   These chat questions are from /data/prompts/qa.json (found in jetson-containers)