NanoLLM - Optimized LLM Inference

NanoLLM is a lightweight, high-performance library that provides optimized inference APIs for quantized LLMs, multimodality, speech services, vector databases with RAG, and web frontends. It's used to build many of the responsive, low-latency agents featured on this site.

Its APIs are similar to HuggingFace's, and are backed by highly optimized inference libraries and quantization tools:

NanoLLM Reference Documentation
from nano_llm import NanoLLM

model = NanoLLM.from_pretrained(
   "meta-llama/Meta-Llama-3-8B-Instruct",  # HuggingFace repo/model name, or path to HF model checkpoint
   api='mlc',                              # supported APIs are: mlc, awq, hf
   api_token='hf_abc123def',               # HuggingFace API key for authenticated models ($HUGGINGFACE_TOKEN)
   quantization='q4f16_ft'                 # q4f16_ft, q4f16_1, q8f16_0 for MLC, or path to AWQ weights
)

response = model.generate("Once upon a time,", max_new_tokens=128)

for token in response:
   print(token, end='', flush=True)
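
The loop above prints tokens as they arrive. When the full text is also needed afterwards (for logging or post-processing), the stream can be accumulated while it is printed. The following is a minimal sketch: `stream_and_collect` is a hypothetical helper, not part of NanoLLM, and it assumes only that the response is an iterable of string tokens, as the loop above suggests. A plain list stands in for the result of `model.generate(...)` so the snippet runs without a model.

```python
import sys
from typing import Iterable


def stream_and_collect(tokens: Iterable[str], out=sys.stdout) -> str:
    """Write each token as it arrives and return the concatenated text.

    Hypothetical helper (not a NanoLLM API); assumes `tokens` yields strings.
    """
    pieces = []
    for token in tokens:
        out.write(token)   # show the token immediately
        out.flush()
        pieces.append(token)
    return "".join(pieces)


# Stand-in for: model.generate("Once upon a time,", max_new_tokens=128)
fake_response = [" there", " was", " a", " llama", "."]
text = stream_and_collect(fake_response)
```

With a real model, the list would be replaced by the object returned from `model.generate(...)`, assuming it yields text tokens as in the example above.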


To test a chat session with Llama from the command line, install jetson-containers and run NanoLLM like this:

git clone
bash jetson-containers/
jetson-containers run \
  --env HUGGINGFACE_TOKEN=hf_abc123def \
  $(autotag nano_llm) \
  python3 -m --api mlc \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --prompt "Can you tell me a joke about llamas?"

If you haven't already, request access to the Llama models on HuggingFace and substitute your account's API token above.
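
To keep the token out of source files, a Python script can read it from the same `HUGGINGFACE_TOKEN` environment variable that the command above passes into the container. This is a minimal sketch using only the standard library; `get_hf_token` is a hypothetical helper, not part of NanoLLM.

```python
import os


def get_hf_token(env=os.environ, var="HUGGINGFACE_TOKEN") -> str:
    """Return the HuggingFace token from the environment, or raise if unset.

    Hypothetical helper (not a NanoLLM API).
    """
    token = env.get(var, "").strip()
    if not token:
        raise RuntimeError(f"{var} is not set; export your HuggingFace API token first")
    return token
```

The result could then be supplied as the `api_token` argument to `NanoLLM.from_pretrained` in place of a hard-coded string.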


Here's an index of the various tutorials & examples using NanoLLM on Jetson AI Lab:

