
Llama 3.1 8B

Meta's efficient 8 billion parameter instruction-tuned language model optimized for Jetson

Memory requirement: 8GB RAM
Precision: W4A16 (4-bit weights, 16-bit activations)
Size: 4.5GB
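A back-of-the-envelope check shows how the 4.5GB download follows from W4A16 quantization, using the 8 billion parameter count from the card (the exact breakdown of the extra ~0.5GB is an assumption, not from the card):

```python
# Rough weight footprint for W4A16 quantization:
# weights are stored at 4 bits = 0.5 bytes per parameter.
params = 8e9           # 8 billion parameters (from the model card)
bytes_per_param = 0.5  # 4-bit weights
weight_bytes = params * bytes_per_param
print(f"{weight_bytes / 1e9:.1f} GB")  # 4.0 GB of raw quantized weights
```

The remaining ~0.5GB of the 4.5GB artifact plausibly covers tensors kept at higher precision (e.g. embeddings and quantization scales), which is why the download is larger than the raw 4-bit weight estimate.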

Jetson Inference - Supported Inference Engines

Container
# Run Command
sudo docker run -it --rm --pull always --runtime=nvidia --network host ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin vllm serve RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16
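Once the container is serving, vLLM exposes an OpenAI-compatible HTTP API (port 8000 by default). A minimal sketch of querying it from Python, assuming the default host and port; the model name must match the one passed to `vllm serve`:

```python
import json
import urllib.request

# Must match the model argument given to `vllm serve` in the run command.
MODEL = "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"

def chat_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build a request body for vLLM's /v1/chat/completions endpoint."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send(body: dict, host: str = "http://localhost:8000") -> dict:
    """POST the request body to the running vLLM server and parse the reply."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires the container above to be running):
# reply = send(chat_request("What is CUDA?"))
# print(reply["choices"][0]["message"]["content"])
```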

Performance on Jetson Orin with vLLM

Single request (concurrency=1): 28.14 output tokens/sec
8 parallel requests (concurrency=8): 112.33 output tokens/sec
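The two figures above imply near-linear aggregate scaling from batching; a quick check of what they mean per request:

```python
single = 28.14    # output tokens/sec at concurrency=1 (from the card)
batched = 112.33  # output tokens/sec at concurrency=8 (from the card)

scaling = batched / single  # aggregate speedup from batching 8 requests
per_request = batched / 8   # throughput each concurrent request sees

print(f"{scaling:.2f}x aggregate speedup")       # 3.99x
print(f"{per_request:.2f} tokens/sec per request")  # 14.04
```

In other words, serving 8 requests at once roughly quadruples total throughput, while each individual request generates at about half the single-request rate.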

Meta's Llama 3.1 8B Instruct is an instruction-tuned language model with 8 billion parameters. This quantized build (W4A16) cuts the memory footprint enough to run comfortably on Jetson devices, making it well suited to edge deployment.

The model excels at following instructions, answering questions, and generating coherent text across a wide range of tasks.