Nemotron3 Nano 4B

NVIDIA's compact 4B Nano model with day-0 llama.cpp support on Jetson Orin and Thor

Parameters: 4B
Modalities: Text
Context Length: 256K
License: NVIDIA Nemotron Open Model License
Precision: Q4_K_M GGUF

Serve the model

Start server

Choose a module, then an engine and any optional parameters on the left, then click the button on the right to copy the serve command.

Call the model over Web API

Copy a client command below and paste it into your terminal to make a Web API request to the model you just served.

llama.cpp server (OpenAI-compatible API)

Once llama-server is running in a container started with --network host, you can call it from another machine on the LAN by setting ${JETSON_HOST} to the Jetson's hostname or IP address. The server listens on port 8080 unless you set --port.

curl -s http://${JETSON_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

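import os

from openai import OpenAI

# Python does not expand ${JETSON_HOST} inside a string, so read it from the environment.
jetson_host = os.environ["JETSON_HOST"]

client = OpenAI(
    base_url=f"http://{jetson_host}:8080/v1",
    api_key="not-needed",  # vLLM / llama.cpp typically do not enforce a key
)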

completion = client.chat.completions.create(
    model="my_model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
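
Before sending chat requests, it can help to confirm that the server is reachable and that it reports the alias you expect. The sketch below queries the OpenAI-compatible /v1/models endpoint using only the Python standard library. The helper names (models_url, parse_model_ids, served_model_ids) are illustrative, and port 8080 is assumed to match the commands above.

```python
import json
import urllib.request


def models_url(host: str, port: int = 8080) -> str:
    """Build the URL of the OpenAI-compatible model listing endpoint."""
    return f"http://{host}:{port}/v1/models"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from a /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def served_model_ids(host: str, port: int = 8080, timeout: float = 5.0) -> list[str]:
    """Ask the server which model ids it serves (raises URLError if unreachable)."""
    with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

Calling served_model_ids(os.environ["JETSON_HOST"]) from the client machine should return a list that includes the alias passed to the server; a connection error usually means the host, port, or network mode needs another look.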

Benchmark

Nemotron3 Nano 4B · vLLM · W4A16 / BF16 · ISL 2048 / OSL 128

[Benchmark chart: selectable by engine and concurrency]

C = concurrent requests. Results will vary with image, clocks, and workload.
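
To make the benchmark labels concrete, here is a small sketch of how aggregate throughput relates to concurrency. It assumes ISL and OSL mean input and output sequence length in tokens (the usual expansion; the page does not define them), and the BenchmarkRun class and the timing in the example are hypothetical, not measured results.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    concurrency: int     # C: requests in flight at once
    isl: int             # assumed: input tokens per request
    osl: int             # assumed: output tokens per request
    wall_seconds: float  # end-to-end time for all C requests to finish

    def output_tokens_per_second(self) -> float:
        """Aggregate generated tokens per second across all requests."""
        return self.concurrency * self.osl / self.wall_seconds

    def per_request_tokens_per_second(self) -> float:
        """Generated tokens per second seen by a single request."""
        return self.osl / self.wall_seconds


# Hypothetical timing, for illustration only.
run = BenchmarkRun(concurrency=8, isl=2048, osl=128, wall_seconds=4.0)
print(run.output_tokens_per_second())       # 256.0
print(run.per_request_tokens_per_second())  # 32.0
```

Because wall_seconds covers prompt processing as well as generation, both figures fall as ISL grows even when OSL stays fixed.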

Model Details

Nemotron3 Nano 4B is a compact NVIDIA language model that can be served locally on Jetson with llama.cpp, giving Jetson Orin and Jetson Thor day-0 support through a simple OpenAI-compatible llama-server workflow.

Inputs and Outputs

Input: Text

Output: Text

Supported Platforms

Jetson Orin
Jetson Thor

Inference Engine

This model is currently configured for llama.cpp using the GGUF checkpoint NVIDIA-Nemotron3-Nano-4B-Q4_K_M.gguf.

Notes

The provided command uses --alias my_model; you can change that alias to match your application if needed.
--n-gpu-layers 999 requests more layers than the model has, so every layer is offloaded to the GPU when memory allows, which gives the best performance.
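
As a rough illustration of how these two flags fit into a launch command, the sketch below assembles a llama-server argument list in Python. It is not the page's provided command: the helper name llama_server_args, the local model path, and the example alias are assumptions, and a real deployment may add container options or other flags.

```python
import shlex


def llama_server_args(
    model_path: str,
    alias: str = "my_model",
    n_gpu_layers: int = 999,
    port: int = 8080,
) -> list[str]:
    """Assemble llama-server arguments (assumed layout, not the official command)."""
    return [
        "llama-server",
        "--model", model_path,
        "--alias", alias,                     # name clients send in the "model" field
        "--n-gpu-layers", str(n_gpu_layers),  # 999 exceeds the layer count, so all layers go to GPU
        "--port", str(port),
    ]


# Assumes the GGUF file is in the current directory; "nemotron-nano" is an example alias.
args = llama_server_args("NVIDIA-Nemotron3-Nano-4B-Q4_K_M.gguf", alias="nemotron-nano")
print(shlex.join(args))
```

If you change the alias this way, update the "model" value in the client requests to match.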