Running large language models in production is expensive. A single Llama 3 8B instance in FP16 consumes around 16GB of GPU memory, limiting how many concurrent requests you can serve. Quantization changes this equation dramatically.

In this tutorial, I’ll walk you through quantizing Meta’s Llama 3 8B Instruct model to W8A8 (INT8 weights + INT8 activations) using LLMCompressor, then serving it with vLLM for production inference.

Table of Contents

  1. What is W8A8 Quantization?
  2. Why Quantize?
  3. Prerequisites
  4. Project Setup
  5. The Quantization Recipe
  6. Quantization Implementation
  7. Serving with vLLM
  8. Testing Your Deployment
  9. Production Considerations
  10. Conclusion

What is W8A8 Quantization?

W8A8 stands for 8-bit Weights and 8-bit Activations. Instead of storing model parameters as 16-bit or 32-bit floating point numbers, we compress them to 8-bit integers.

FP16 Model:  16 bits per weight  →  ~16GB for 8B parameters
W8A8 Model:   8 bits per weight  →  ~8GB for 8B parameters

The challenge with activation quantization is that activations (the intermediate values computed during inference) often have outliers that are difficult to represent in low precision. This is where techniques like SmoothQuant become essential.
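To make the outlier problem concrete, here is a minimal, self-contained sketch of symmetric per-tensor INT8 quantization (illustrative only, not LLMCompressor's implementation; the tensor and the outlier value are made up). A single large activation inflates the scale, wasting most of the 256 available levels on values that never occur:

import torch

def int8_quantize(t: torch.Tensor):
    # Symmetric per-tensor quantization: map [-max|t|, max|t|] onto [-127, 127]
    scale = t.abs().max() / 127
    q = torch.clamp(torch.round(t / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

acts = torch.randn(1000)          # well-behaved activations
acts[0] = 60.0                    # one simulated outlier
q, scale = int8_quantize(acts)
error = (dequantize(q, scale) - acts).abs().mean()
print(f"scale={scale.item():.4f}, mean abs error={error.item():.4f}")

Re-run this without the outlier and the mean error drops sharply, because the scale shrinks to fit the typical values. That is exactly the effect SmoothQuant exploits below.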

Why Quantize?

Three compelling reasons to quantize your LLMs:

1. Memory Reduction

A 50% reduction in memory footprint means you can either run larger models on the same hardware or serve more concurrent requests with the same model.

2. Faster Inference

Modern GPUs have specialized INT8 tensor cores that execute integer operations faster than floating-point equivalents. Combined with reduced memory bandwidth requirements, you get lower latency.

3. Cost Efficiency

Smaller memory footprint = smaller (cheaper) GPUs. Faster inference = more requests per dollar. For production workloads, this translates directly to infrastructure cost savings.

Prerequisites

Before starting, ensure you have:

  • Hardware: CUDA-capable GPU with 24GB+ VRAM for quantization
  • Software: Python 3.11, CUDA toolkit, uv package manager
  • Accounts: HuggingFace account with access to Llama 3

Create a .env file with your HuggingFace token:

HF_TOKEN=hf_your_token_here

Project Setup

Initialize the project with uv and install dependencies:

# Create project directory
mkdir llama3-quantization && cd llama3-quantization

# Initialize with uv
uv init

Here’s the pyproject.toml with pinned dependencies:

[project]
name = "llama3-quantization"
version = "0.1.0"
requires-python = ">=3.11,<3.12"
dependencies = [
    "accelerate==0.33.0",
    "compressed-tensors==0.8.1",
    "datasets==2.20.0",
    "llmcompressor==0.3.1",
    "python-dotenv==1.0.1",
    "torch==2.3.1+cu118",
    "transformers==4.57.6",
    "vllm==0.5.3",
]

[tool.uv.sources]
torch = { index = "pytorch-cu118" }

[[tool.uv.index]]
name = "pytorch-cu118"
url = "https://download.pytorch.org/whl/cu118"

Install everything:

uv sync

The Quantization Recipe

We use a two-stage recipe combining SmoothQuant and GPTQ:

Stage 1: SmoothQuant

LLM activations often contain outliers: values far larger than the rest of the distribution, which are hard to represent in INT8. SmoothQuant addresses this by mathematically migrating the quantization difficulty from activations to weights.

The key insight: weights are static and can tolerate more aggressive quantization, while activations vary per input. By “smoothing” the activation distribution, we make activation quantization feasible.
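A minimal sketch of that idea, assuming per-input-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) as in the SmoothQuant paper (function names and shapes here are illustrative, not LLMCompressor's API):

import torch

def smoothquant_scales(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.8):
    # act_absmax: per-input-channel max |activation| from calibration, shape (in_features,)
    # weight: Linear weight, shape (out_features, in_features)
    w_absmax = weight.abs().amax(dim=0)
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)
    return s.clamp(min=1e-5)

def apply_smoothing(x: torch.Tensor, weight: torch.Tensor, s: torch.Tensor):
    # Dividing activations by s and multiplying weights by s leaves x @ weight.T unchanged,
    # but flattens activation outliers so they quantize to INT8 with less error.
    return x / s, weight * s

In practice the activation-side division is folded into the preceding normalization or linear layer so it costs nothing at inference time; LLMCompressor's SmoothQuantModifier handles that folding for you.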

Stage 2: GPTQ (Generative Pre-trained Transformer Quantization)

GPTQ is a post-training quantization method that processes layers sequentially. For each layer, it:

  1. Runs calibration data through the layer
  2. Computes the quantization that minimizes output error
  3. Updates subsequent layer inputs to account for quantization error
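For intuition, here is a heavily simplified sketch of that quantize-and-compensate loop for a single Linear layer: no lazy block updates, per-output-channel symmetric scales, and illustrative names throughout (this is not the LLMCompressor code path):

import torch

def gptq_quantize_layer(weight: torch.Tensor, calib_inputs: torch.Tensor, percdamp: float = 0.01):
    # weight: (out_features, in_features); calib_inputs: (n_tokens, in_features)
    X = calib_inputs.float()
    H = X.T @ X                                               # Hessian approximation from calibration data
    H += percdamp * torch.diag(H).mean() * torch.eye(H.shape[0], device=H.device)
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    Hinv = torch.linalg.cholesky(Hinv, upper=True)            # upper Cholesky factor, as in the GPTQ paper
    W = weight.float().clone()
    Q = torch.zeros_like(W)
    scale = W.abs().amax(dim=1) / 127                         # per-output-channel symmetric INT8 scales
    for j in range(W.shape[1]):                               # quantize one input column at a time
        w = W[:, j]
        q = torch.clamp(torch.round(w / scale), -128, 127) * scale
        Q[:, j] = q
        err = (w - q) / Hinv[j, j]
        W[:, j:] -= torch.outer(err, Hinv[j, j:])             # push the error onto columns not yet quantized
    return Q

Real implementations also process columns in blocks and batch the trailing updates for speed, but the error-compensation step above is the core of the algorithm.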

The combination of SmoothQuant (handling activation outliers) followed by GPTQ (optimizing weight quantization) produces high-quality W8A8 models.

Quantization Implementation

Here’s the complete quantization script:

import os
from dotenv import load_dotenv
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

import torch
from llmcompressor.transformers.finetune import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.utils.layer_compressor import LayerCompressor

load_dotenv()

MODEL_ID = os.getenv("MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct")
SAVE_DIR = os.getenv("SAVE_DIR", f"{MODEL_ID.split('/')[-1]}-W8A8")
HF_TOKEN = os.getenv("HF_TOKEN")

NUM_CALIBRATION_SAMPLES = int(os.getenv("NUM_CALIBRATION_SAMPLES", "512"))
MAX_SEQUENCE_LENGTH = int(os.getenv("MAX_SEQUENCE_LENGTH", "2048"))


# Compatibility fix for GPTQ layer calibration
_orig_calibrate_layer = LayerCompressor.calibrate_layer


def _calibrate_layer_wrap_output(self, intermediates):
    outputs = _orig_calibrate_layer(self, intermediates)
    fixed = []
    for output, kwargs in outputs:
        if isinstance(output, torch.Tensor):
            output = (output,)
        fixed.append((output, kwargs))
    return fixed


LayerCompressor.calibrate_layer = _calibrate_layer_wrap_output


# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=HF_TOKEN)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    token=HF_TOKEN,
)

# Load calibration data (UltraChat resembles production prompts)
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def preprocess(example):
    if "messages" in example:
        text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    else:
        text = example["text"]
    return {"text": text}


ds = ds.map(preprocess, remove_columns=ds.column_names)


def tokenize_fn(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        add_special_tokens=False,
    )


ds = ds.map(tokenize_fn, batched=True, remove_columns=ds.column_names)

# Define quantization recipe: SmoothQuant + GPTQ W8A8
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# Run one-shot quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the quantized model
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

print(f"Saved INT8 W8A8 checkpoint to: {SAVE_DIR}")

Key Implementation Details

Calibration Data: We use UltraChat, a diverse conversational dataset. The calibration samples should resemble your production data distribution for best results.

SmoothQuant Strength: The smoothing_strength=0.8 parameter controls how much quantization difficulty is transferred from activations to weights. Higher values make activations easier to quantize but put more pressure on weight quantization.

Ignored Layers: We exclude lm_head from quantization because quantizing the output projection layer disproportionately degrades generation quality.

Run the quantization:

uv run python quantize.py

This produces a Meta-Llama-3-8B-Instruct-W8A8/ directory containing the quantized model.
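A quick sanity check before serving is to confirm the checkpoint carries quantization metadata. The exact fields depend on your llmcompressor and compressed-tensors versions, but config.json should contain a quantization_config block:

import json
from pathlib import Path

cfg = json.loads(Path("Meta-Llama-3-8B-Instruct-W8A8/config.json").read_text())
# Expect a compressed-tensors block describing the W8A8 scheme
print(json.dumps(cfg.get("quantization_config", "missing!"), indent=2))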

Serving with vLLM

vLLM provides high-throughput inference with KV cache optimization and continuous batching. Create a serving script:

#!/usr/bin/env bash
set -euo pipefail

MODEL_DIR=${MODEL_DIR:-"./Meta-Llama-3-8B-Instruct-W8A8"}
SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"llama3-w8a8"}
HOST=${HOST:-"0.0.0.0"}
PORT=${PORT:-"8000"}
GPU_MEM_UTIL=${GPU_MEM_UTIL:-"0.90"}
MAX_MODEL_LEN=${MAX_MODEL_LEN:-"8192"}
MAX_BATCHED_TOKENS=${MAX_BATCHED_TOKENS:-"16384"}
MAX_NUM_SEQS=${MAX_NUM_SEQS:-"256"}
# Optional extra flags passed straight to vllm serve (see "Quantization Format" below)
EXTRA_ARGS=${EXTRA_ARGS:-""}

exec uv run vllm serve "${MODEL_DIR}" \
  --served-model-name "${SERVED_MODEL_NAME}" \
  --host "${HOST}" \
  --port "${PORT}" \
  --dtype auto \
  --gpu-memory-utilization "${GPU_MEM_UTIL}" \
  --max-model-len "${MAX_MODEL_LEN}" \
  --max-num-batched-tokens "${MAX_BATCHED_TOKENS}" \
  --max-num-seqs "${MAX_NUM_SEQS}" \
  ${EXTRA_ARGS}

vLLM Configuration Explained

Parameter                  Default   Description
gpu-memory-utilization     0.90      Fraction of GPU memory vLLM may use (weights, activations, and KV cache)
max-model-len              8192      Maximum sequence length
max-num-batched-tokens     16384     Maximum tokens processed per batch
max-num-seqs               256       Maximum concurrent sequences

Start the server:

./scripts/serve_vllm.sh

The server exposes an OpenAI-compatible API:

# List models
curl http://localhost:8000/v1/models

# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-w8a8",
    "messages": [
      {"role":"system","content":"You are a helpful assistant."},
      {"role":"user","content":"Explain KV cache in one paragraph."}
    ],
    "temperature": 0,
    "max_tokens": 128
  }'
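Because the API is OpenAI-compatible, you can also call it with the official openai Python client (not in the pinned dependencies above, so install it separately; the api_key value is arbitrary since vLLM does not check it by default):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="llama3-w8a8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain KV cache in one paragraph."},
    ],
    temperature=0,
    max_tokens=128,
)
print(resp.choices[0].message.content)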

Testing Your Deployment

Smoke Test

Verify basic functionality:

#!/usr/bin/env bash
set -euo pipefail

HOST=${HOST:-"localhost"}
PORT=${PORT:-"8000"}
MODEL=${MODEL:-"llama3-w8a8"}

base_url="http://${HOST}:${PORT}"

# Check model is loaded
curl -sS "${base_url}/v1/models"

# Test chat completion
curl -sS "${base_url}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"${MODEL}\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],\"max_tokens\":32}"

Concurrency Test

Validate batching behavior under load:

import asyncio
import time
import httpx

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "llama3-w8a8"

NUM_REQUESTS = 64
MAX_TOKENS = 64
TIMEOUT_S = 120

payloads = [
    {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": f"Give me one tip to reduce LLM latency. #{i}",
            }
        ],
        "temperature": 0,
        "max_tokens": MAX_TOKENS,
    }
    for i in range(NUM_REQUESTS)
]


async def one(client, payload):
    start = time.perf_counter()
    resp = await client.post(URL, json=payload, timeout=TIMEOUT_S)
    resp.raise_for_status()
    elapsed_ms = (time.perf_counter() - start) * 1000
    content = resp.json()["choices"][0]["message"]["content"]
    return elapsed_ms, content


async def main():
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        results = await asyncio.gather(*(one(client, p) for p in payloads))
        total_ms = (time.perf_counter() - start) * 1000

    latencies = sorted(dt for dt, _ in results)
    p50 = latencies[len(latencies) // 2]
    p95 = latencies[int(len(latencies) * 0.95) - 1]

    print(f"Requests: {len(latencies)}")
    print(f"Total wall time: {total_ms:.1f} ms")
    print(f"p50 latency: {p50:.1f} ms")
    print(f"p95 latency: {p95:.1f} ms")


if __name__ == "__main__":
    asyncio.run(main())

Run with:

uv run python scripts/concurrency_test.py

The key insight: if 64 concurrent requests complete in a few seconds total, that demonstrates effective batching. Sequential processing would take significantly longer.

Production Considerations

Memory Tuning

If you’re hitting OOM errors, adjust these parameters:

GPU_MEM_UTIL=0.85 \
MAX_MODEL_LEN=4096 \
MAX_NUM_SEQS=128 \
./scripts/serve_vllm.sh

Quantization Format

The quantized model uses compressed-tensors metadata in its config.json, and vLLM detects this format automatically. If you need to specify it explicitly, pass the flag through the serving script's EXTRA_ARGS hook:

EXTRA_ARGS="--quantization compressed-tensors" ./scripts/serve_vllm.sh

Monitoring

For production deployments, monitor:

  • GPU memory utilization
  • Request latency (p50, p95, p99)
  • Throughput (requests/second)
  • Queue depth (pending requests)

vLLM exposes Prometheus metrics at /metrics for integration with your monitoring stack.
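As a starting point, you can scrape that endpoint yourself and filter for vLLM's metrics (the vllm: name prefix is what current versions use; verify against your deployment):

import httpx

metrics = httpx.get("http://localhost:8000/metrics", timeout=10).text
for line in metrics.splitlines():
    # Skip Prometheus HELP/TYPE comments and unrelated process metrics
    if line.startswith("vllm:"):
        print(line)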

Conclusion

We’ve walked through the complete pipeline for quantizing Llama 3 8B to W8A8:

  1. Setup: Environment configuration with pinned dependencies
  2. Quantization: SmoothQuant + GPTQ for high-quality INT8 compression
  3. Serving: vLLM with KV cache and continuous batching
  4. Testing: Smoke tests and concurrency validation

The result is a model that uses roughly half the memory while maintaining generation quality suitable for production workloads.

For most production scenarios, W8A8 quantization hits the sweet spot between compression and quality. If you need even smaller models, explore 4-bit weight-only schemes such as W4A16 (via GPTQ or AWQ), though expect some quality degradation.

The complete source code for this tutorial is available on GitHub: https://github.com/ndemir/meta-llama-Meta-Llama-3-8B-Instruct-quantization


Frequently Asked Questions (FAQ)

What is W8A8 quantization for LLMs?

W8A8 quantization compresses a large language model by converting both weights and activations from FP16/FP32 to INT8 (8-bit integers). This reduces memory footprint by approximately 50% compared to FP16 while maintaining most of the model’s accuracy, enabling faster inference and serving more concurrent users on the same hardware.

Why should I quantize Llama 3 8B to W8A8?

Quantizing to W8A8 reduces memory usage by ~50%, enabling the 8B model to run on smaller GPUs or serve more concurrent requests. INT8 operations are faster on modern GPUs with tensor cores. The combination of GPTQ and SmoothQuant techniques preserves model quality while delivering significant performance gains for production inference.

What is the difference between GPTQ and SmoothQuant?

GPTQ is a post-training quantization method that processes layers sequentially using calibration data to minimize quantization error. SmoothQuant addresses activation outliers by mathematically ‘smoothing’ the activation distribution, transferring quantization difficulty from activations to weights. Using both together (SmoothQuant followed by GPTQ) yields better W8A8 results.

How much calibration data do I need for quantization?

Typically 256-512 calibration samples are sufficient for good quantization results. The samples should resemble your production data distribution. This tutorial uses 512 samples from the UltraChat dataset, which provides diverse conversational examples suitable for instruction-tuned models like Llama 3 Instruct.

What is vLLM and why use it for serving quantized models?

vLLM is a high-throughput LLM inference engine featuring PagedAttention for efficient KV cache management, continuous batching for maximizing GPU utilization, and native support for quantized models via compressed-tensors format. It provides an OpenAI-compatible API, making it easy to integrate quantized models into existing applications.

What is continuous batching in vLLM?

Continuous batching dynamically groups incoming requests and processes them together, maximizing GPU utilization. Unlike static batching that waits for a fixed batch size, continuous batching adds new requests as soon as any sequence completes, significantly improving throughput for variable-length LLM outputs.

What hardware do I need to quantize Llama 3 8B?

You need a CUDA-capable GPU with at least 24GB VRAM (e.g., RTX 3090, RTX 4090, A10, or A100) for the quantization process. The quantized W8A8 model requires approximately 8GB VRAM for inference, allowing it to run on smaller GPUs like RTX 3080 or T4.

How do I test if my quantized model is working correctly?

After quantization, run inference tests comparing outputs to the original model. Use smoke tests to verify the vLLM server responds correctly, then run concurrency tests to validate batching behavior under load. Monitor latency metrics (p50, p95) and verify output quality hasn’t degraded significantly.