Quantizing Llama 3 8B to W8A8: A Complete Guide with LLMCompressor and vLLM

Running large language models in production is expensive. A single Llama 3 8B instance in FP16 consumes around 16GB of GPU memory, limiting how many concurrent requests you can serve. Quantization changes this equation dramatically. In this tutorial, I’ll walk you through quantizing Meta’s Llama 3 8B Instruct model to W8A8 (INT8 weights + INT8 activations) using LLMCompressor, then serving it with vLLM for production inference.

Table of Contents

- What is W8A8 Quantization?
- Why Quantize?
- Prerequisites
- Project Setup
- The Quantization Recipe
- Quantization Implementation
- Serving with vLLM
- Testing Your Deployment
- Production Considerations
- Conclusion

What is W8A8 Quantization?

W8A8 stands for 8-bit Weights and 8-bit Activations. Instead of storing model parameters as 16-bit or 32-bit floating point numbers, we compress them to 8-bit integers. ...
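The core idea behind INT8 weight quantization can be sketched with a minimal symmetric per-tensor scheme in plain NumPy. This is illustrative only; the scale factor and function names here are my own, and a real LLMCompressor W8A8 recipe is more involved (per-channel weight scales, activation calibration, and so on):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats into [-127, 127]."""
    scale = np.abs(weights).max() / 127.0   # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)   # stand-in for a weight matrix
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Each INT8 value takes one byte instead of the two bytes of an FP16 value, which is where the roughly 2x memory saving comes from; the rounding step bounds the per-element error by half the scale.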

January 29, 2026 · 9 min · 1783 words · Necati Demir

Meta's Code World Models: Understanding Code Execution, Not Just Syntax

I want to talk about an exciting research paper that has been on my list since its release last month. I finally had the opportunity to dive deep into it, and I believe it represents a fundamental shift in how AI understands code.

What Are Code World Models?

Let’s start with the basics and some impressive numbers. Meta’s Code World Model is an open-weights large language model with 32 billion parameters. It features a dense, decoder-only architecture with a 131K-token context window. While these specifications are noteworthy, the truly exciting aspect lies in what this model actually does. ...

October 29, 2025 · 5 min · 946 words · Necati Demir