Quantizing Llama 3 8B to W8A8: A Complete Guide with LLMCompressor and vLLM

Running large language models in production is expensive. A single Llama 3 8B instance in FP16 consumes around 16 GB of GPU memory for the weights alone, limiting how many concurrent requests you can serve. Quantization changes this equation dramatically. In this tutorial, I’ll walk you through quantizing Meta’s Llama 3 8B Instruct model to W8A8 (INT8 weights + INT8 activations) using LLMCompressor, then serving it with vLLM for production inference.

Table of Contents

- What is W8A8 Quantization?
- Why Quantize?
- Prerequisites
- Project Setup
- The Quantization Recipe
- Quantization Implementation
- Serving with vLLM
- Testing Your Deployment
- Production Considerations
- Conclusion

What is W8A8 Quantization?

W8A8 stands for 8-bit Weights and 8-bit Activations. Instead of storing model parameters as 16-bit or 32-bit floating-point numbers, we compress them to 8-bit integers. ...
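To make the arithmetic concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in PyTorch. The helper names and the toy 4096×4096 matrix are illustrative only, not LLMCompressor's API; production W8A8 schemes typically use finer granularity (per-channel scales for weights, per-token scales for activations) plus calibration, but the core round-and-rescale idea is the same.

```python
import torch

def int8_symmetric_quantize(x: torch.Tensor):
    """Map a float tensor onto the signed 8-bit range [-127, 127] with one scale."""
    scale = x.abs().max().float() / 127.0
    q = torch.clamp(torch.round(x.float() / scale), -127, 127).to(torch.int8)
    return q, scale

def int8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original floating-point values."""
    return q.float() * scale

# Toy weight matrix, roughly the shape of one Llama 3 8B self-attention projection.
w = torch.randn(4096, 4096, dtype=torch.float16)
q, scale = int8_symmetric_quantize(w)
w_hat = int8_dequantize(q, scale)

print(f"FP16: {w.nelement() * w.element_size() / 1e6:.1f} MB")  # ~33.6 MB
print(f"INT8: {q.nelement() * q.element_size() / 1e6:.1f} MB")  # ~16.8 MB
print(f"max abs rounding error: {(w.float() - w_hat).abs().max().item():.4f}")
```

Running this shows the 2x weight-memory saving that takes the 8B model's weights from roughly 16 GB in FP16 toward roughly 8 GB in INT8, at the cost of a small per-element rounding error.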

January 29, 2026 · Necati Demir