NanoLLM - Optimized LLM Inference

NanoLLM is a lightweight, high-performance library that uses optimized inference APIs for quantized LLMs, multimodality, speech services, vector databases with RAG, and web frontends. It's used to build many of the responsive, low-latency agents featured on this site.

It provides similar APIs to HuggingFace, backed by highly optimized inference libraries and quantization tools (see the NanoLLM Reference Documentation for the full API):

from nano_llm import NanoLLM

model = NanoLLM.from_pretrained(
   "meta-llama/Meta-Llama-3-8B-Instruct",  # HuggingFace repo/model name, or path to HF model checkpoint
   api='mlc',                              # supported APIs are: mlc, awq, hf
   api_token='hf_abc123def',               # HuggingFace API key for authenticated models ($HUGGINGFACE_TOKEN)
   quantization='q4f16_ft'                 # q4f16_ft, q4f16_1, q8f16_0 for MLC, or path to AWQ weights
)

response = model.generate("Once upon a time,", max_new_tokens=128)

for token in response:
    print(token, end='', flush=True)
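
The snippet above runs a single completion. For multi-turn chat, NanoLLM also provides a ChatHistory class that applies the model's chat template and carries the KV cache between turns. The loop below is a condensed sketch of that pattern; treat the exact keyword names and response attributes (append, embed_chat, the reply's text/kv_cache) as assumptions based on the reference documentation, and consult the docs for your version:

from nano_llm import NanoLLM, ChatHistory

model = NanoLLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    api='mlc',
    quantization='q4f16_ft'
)

# ChatHistory applies the model's chat template and tracks the KV cache
chat_history = ChatHistory(model, system_prompt="You are a helpful and friendly AI assistant.")

while True:
    # read the next user query from the terminal
    prompt = input('>> ')

    # add the user turn and embed the chat so far
    chat_history.append(role='user', msg=prompt)
    embedding, position = chat_history.embed_chat()

    # generate the reply, reusing the cached KV state from prior turns
    reply = model.generate(
        embedding,
        streaming=True,
        kv_cache=chat_history.kv_cache,
        stop_tokens=chat_history.template.stop,
        max_new_tokens=256,
    )

    # stream the output tokens as they're generated
    for token in reply:
        print(token, end='', flush=True)
    print('')

    # store the bot reply and updated KV cache for the next turn
    chat_history.append(role='bot', text=reply.text)
    chat_history.kv_cache = reply.kv_cache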

Containers

To test a chat session with Llama from the command line, install jetson-containers and run NanoLLM like this:

git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh
jetson-containers run \
  --env HUGGINGFACE_TOKEN=hf_abc123def \
  $(autotag nano_llm) \
  python3 -m nano_llm.chat --api mlc \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --prompt "Can you tell me a joke about llamas?"

If you haven't already, request access to the Llama models on HuggingFace and substitute your account's API token above.
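
If you'd rather not paste the token inline, you can export it once in your shell. This sketch assumes docker's standard --env pass-through behavior (an --env flag with no value forwards the variable from the host environment), which jetson-containers run relays to docker run:

export HUGGINGFACE_TOKEN=hf_abc123def   # your token from https://huggingface.co/settings/tokens

jetson-containers run \
  --env HUGGINGFACE_TOKEN \
  $(autotag nano_llm) \
  python3 -m nano_llm.chat --api mlc \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --prompt "Can you tell me a joke about llamas?"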

Resources

Here's an index of the various tutorials & examples using NanoLLM on Jetson AI Lab:

Benchmarks: Benchmarking results for LLMs, SLMs, and VLMs using the MLC/TVM backend.
API Examples: Python code examples for chat, completion, and multimodal.
Documentation: Reference documentation for the NanoLLM model and agent APIs.
Llamaspeak: Talk verbally with LLMs using low-latency ASR/TTS speech models.
Small LLM (SLM): Focus on language models with a reduced footprint (7B params and below).
Live LLaVA: Real-time live-streaming vision/language models on recurring prompts.
Nano VLM: Efficient multimodal pipeline with one-shot image tagging and RAG support.