Tutorial - text-generation-webui
Interact with a local AI assistant by running a LLM with oobabooga's
text-generaton-webui on NVIDIA Jetson!
What you need
One of the following Jetson:
Jetson AGX Orin 64GB Jetson AGX Orin (32GB) Jetson Orin Nano Orin (8GB)⚠️1
Running one of the following JetPack.5x
JetPack 5.1.2 (L4T r35.4.1) JetPack 5.1.1 (L4T r35.3.1) JetPack 5.1 (L4T r35.2.1)
Sufficient storage space (preferably with NVMe SSD).
6.2GBfor container image
- Spaces for models
Set up a container for text-generation-webui
The jetson-containers project provides pre-built Docker images for
text-generation-webui along with all of the loader API's built with CUDA enabled (llama.cpp, ExLlama, AutoGPTQ, Transformers, ect). You can clone the repo to use its utilities that will automatically pull/start the correct container for you, or you can do it manually.
git clone --depth=1 https://github.com/dusty-nv/jetson-containers cd jetson-containers sudo apt update; sudo apt install -y python3-pip pip3 install -r requirements.txt
JetsonHacks provides an informative walkthrough video on
jetson-containers, showcasing the usage of both the
text-generation-webui containers. You can find the complete article with detailed instructions here.
How to start
autotag script to automatically pull or build a compatible container image:
cd jetson-containers ./run.sh $(./autotag text-generation-webui)
The container has a default run command (
CMD) that will automatically start the webserver like this:
cd /opt/text-generation-webui && python3 server.py \ --model-dir=/data/models/text-generation-webui \ --chat \ --listen
Open your browser and access
Download a model on web UI
./run.sh --workdir=/opt/text-generation-webui $(./autotag text-generation-webui) /bin/bash -c \ 'python3 download-model.py --output=/data/models/text-generation-webui TheBloke/Llama-2-7b-Chat-GPTQ'
From within the web UI, select Model tab and navigate to "Download model or LoRA" section.
You can find text generation models on Hugging Face Hub, then enter the Hugging Face username/model path (which you can have copied to your clipboard from the Hub). Then click the Download button.
The fastest oobabooga model loader to use is currently llama.cpp with 4-bit quantized GGUF models.
You can download a single model file for a particular quantization, like
*.Q4_K_M.bin. Input the file name and hit Download button.
Model selection for Jetson Orin Nano
Jetson Orin Nano Developer Kit has only 8GB RAM for both CPU (system) and GPU, so you need to pick a model that fits in the RAM size - see the Model Size section below. The 7B models with 4-bit quantization are the ones to use on Jetson Orin Nano. Make sure you go through the RAM optimization steps before attempting to load such model on Jetson Orin Nano.
Load a model
After you have downloaded a model, click the 🔄 button to refresh your model list, and select the model you want to use.
For a GGUF model, remember to
8if you using Llama-2-70B (on Jetson AGX Orin 64GB)
Then click the Load button.
If you're using a Llama model fine-tuned for chat, like the models listed above (except for
LLaMA-30b), you need to use the oobabooga Instruct mode and set the template. On the Parameters tab, go to the Instruction Template sub-tab, then select
Llama-v2 from the Instruction Template drop-down (or Vicuna, Guanaco, ect if you are using one of those models)
For the base text completion models (like
LLaMA-30b), use the Default or Notebook tab.
Selecting the right chat template will make sure the model is being prompted correctly - you can also change the system prompt in the Context box to alter the agent's personality and behavior. There are a lot of other settings under the Generation tab, like the maximum length it should output per reply, and token sampling parameters like
top_p for controlling randomness.
Then change back to the Chat tab, and under the mode section, make sure Instruct is selected (confusingly, not chat mode). Then you can start chatting with the LLM!
Things to do with your LLM
Here are some common test prompts for coding, math, history ect. You can also ask it about geography, travel, nature, recipies, fixing things, general life advice, and practically everything else. Also Llama-2 is quite playful and likes to play games to test its logic abilities!
>> What games do you like to play? I'm a large language model, so I can play text-based games and answer questions on a wide variety of topics. Here are some game ideas: - 20 Questions: You or I can think of an object, and the other player tries to guess what it is by asking up to 20 yes-or-no questions. The goal is to guess the object in as few questions as possible. - Hangman: I can think of a word or phrase and you try to guess it by suggesting letters. Each incorrect guess will bring the hangman one step closer to being complete! - Word association: I'll give you a word and you have to come up with a word that is related to it. We can keep playing back and forth as long as you'd like! - Trivia: I can provide questions on a wide range of topics, from history to science to pop culture. See how many questions you can answer correctly! - Storytelling: I can start telling a story and then stop at a cliffhanger. You can then try to guess what happens next or even take over the storytelling and continue it in your own - Jokes: I love to tell jokes and make people laugh with my "Dad humor"! Knock knock! *giggles*
Model size tested
With llama.cpp, GGUF model, 4-bit quantization:
|Model size||Jetson AGX Orin 64GB||Jetson AGX Orin 32GB||Jetson Orin Nano 8GB|
Want to explore using Python APIs to run LLMs directly?
jetson-containers for its LLM related packages and containers.
Limited to 7B model (4-bit quantized). ↩