Llama 3.1 70B
Meta's flagship 70 billion parameter model delivering state-of-the-art performance on Jetson Thor
Memory Requirement 48GB RAM
Precision W4A16
Size 40GB
Jetson Inference - Supported Inference Engines
Container Run Command

sudo docker run -it --rm --pull always --runtime=nvidia --network host ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin vllm serve RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16

Performance on Jetson Orin with vLLM
2.93 output tokens/sec for a single request (concurrency=1)
7.38 output tokens/sec for 8 parallel requests (concurrency=8)
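Once the container is running, `vllm serve` exposes an OpenAI-compatible HTTP API (port 8000 by default; adjust the URL if your container maps it differently). A minimal client sketch using only the Python standard library, assuming the server is reachable at localhost:

```python
import json
from urllib import request

# Assumptions: the vLLM container above is running and listening on the
# default port 8000; the model name matches the one passed to `vllm serve`.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16"

def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """POST the prompt to the running vLLM server and return the reply text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = request.Request(
        BASE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Usage (with the server up):
#   chat("Summarize the Llama 3.1 family in one sentence.")
```

Because the endpoint follows the OpenAI schema, any OpenAI-compatible client library can be pointed at the same base URL instead.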
Model Details
Meta's Llama 3.1 70B Instruct is the flagship model in the Llama 3.1 family, featuring 70 billion parameters for state-of-the-art performance. This quantized version (W4A16) enables deployment on Jetson Thor.
Ideal for complex reasoning tasks, detailed content generation, and applications requiring the highest quality outputs.
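The listed size and memory figures follow from the quantization scheme: W4A16 stores weights in 4 bits while computing activations in 16-bit. A rough back-of-envelope check (approximate, assuming all 70 billion parameters are weight-quantized):

```python
# Approximate weight footprint of a 70B model under 4-bit quantization.
params = 70e9            # parameter count
bits_per_weight = 4      # W4A16: 4-bit weights, 16-bit activations
weight_bytes = params * bits_per_weight / 8
weight_gb = weight_bytes / 1e9
print(f"{weight_gb:.0f} GB")  # ≈ 35 GB of raw quantized weights
```

Quantization scales/zero-points plus any layers kept at higher precision push the on-disk size toward the listed 40GB, and KV cache and runtime overhead account for the 48GB RAM requirement.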