Quantizing Llama 3 8B to W8A8: A Complete Guide with LLMCompressor and vLLM

Running large language models in production is expensive. A single Llama 3 8B instance in FP16 consumes around 16GB of GPU memory, limiting how many concurrent requests you can serve. Quantization changes this equation dramatically. In this tutorial, I’ll walk you through quantizing Meta’s Llama 3 8B Instruct model to W8A8 (INT8 weights + INT8 activations) using LLMCompressor, then serving it with vLLM for production inference.

Table of Contents

- What is W8A8 Quantization?
- Why Quantize?
- Prerequisites
- Project Setup
- The Quantization Recipe
- Quantization Implementation
- Serving with vLLM
- Testing Your Deployment
- Production Considerations
- Conclusion

What is W8A8 Quantization?

W8A8 stands for 8-bit Weights and 8-bit Activations. Instead of storing model parameters as 16-bit or 32-bit floating point numbers, we compress them to 8-bit integers. ...
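The full recipe lives in the post itself, but a minimal sketch of the workflow the excerpt describes — calibrate and quantize with LLMCompressor's `oneshot` API, then load the result in vLLM — might look roughly like this. Module paths, the calibration dataset, and the output directory name are assumptions (they vary between llmcompressor releases), so treat this as an illustration rather than the article's exact code.

```python
# Sketch: W8A8 (INT8 weight + activation) quantization with LLMCompressor.
# Assumes a recent llmcompressor release; import paths differ across versions.
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = [
    # Smooth activation outliers before quantization.
    SmoothQuantModifier(smoothing_strength=0.8),
    # Quantize all Linear layers to INT8 weights/activations, keep lm_head in full precision.
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dataset="open_platypus",          # hypothetical calibration set choice
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Meta-Llama-3-8B-Instruct-W8A8",
)

# Serving the quantized checkpoint with vLLM's offline API:
from vllm import LLM, SamplingParams

llm = LLM(model="Meta-Llama-3-8B-Instruct-W8A8")
out = llm.generate(["Explain W8A8 quantization in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```

In production you would more likely run `vllm serve` against the output directory and hit the OpenAI-compatible endpoint, as the post's serving section covers.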

January 29, 2026 · 9 min · 1783 words · Necati Demir

Understanding SIMD Performance: A Developer's Introduction with Real Benchmarks

Table of Contents

- 1. Introduction
- 2. SIMD Implementation Fundamentals
  - 2.1 AVX2 and 256-bit Registers
  - 2.2 Memory Alignment Considerations
    - 2.2.1 Unaligned Memory Access (Inefficient)
    - 2.2.2 Aligned Memory Access (Efficient)
  - 2.3 Loop Unrolling Technique
  - 2.4 Compilation Requirements
- 3. SIMD in Practice: Dot Product Case Study
  - 3.1 Four Implementation Approaches
    - 3.1.1 Scalar Implementation (Baseline)
    - 3.1.2 Basic SIMD Implementation
    - 3.1.3 Unrolled SIMD Implementation
    - 3.1.4 Aligned SIMD Implementation
- 4. Performance Analysis: Benchmark Results
  - 4.1 Compilation Methodology
  - 4.2 Results with -O3 Optimization
  - 4.3 Results without -O3 Optimization
  - 4.4 Compiler Optimization vs Manual SIMD: Key Insights
    - 4.4.1 Compiler Auto-Vectorization is Remarkably Effective
    - 4.4.2 Manual Optimization Value Depends on Context
    - 4.4.3 Memory Hierarchy Effects Persist Regardless
- 5. Conclusion & Practical Takeaways
  - 5.1 Key Practical Takeaways

1. Introduction

Do you keep hearing about SIMD but don’t know what it is all about? Here is an article for you. SIMD is the go-to technique for squeezing every ounce of performance out of modern CPUs. The promise of SIMD is simple: process 8 floating-point numbers simultaneously instead of one and, in theory, get an 8x speedup. In practice it is rarely that clean, because other factors also shape the results. ...
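To make the scalar-vs-SIMD comparison concrete, here is a rough C sketch of the kind of dot-product pair the article benchmarks: a plain scalar loop against a basic AVX2/FMA version that processes 8 floats per iteration. This is not the article's exact code, and the array sizes and compile flags are illustrative assumptions.

```c
// Sketch: scalar vs. basic AVX2 dot product.
// Compile with something like: gcc -O3 -mavx2 -mfma dot.c -o dot
#include <immintrin.h>
#include <stdio.h>

// Baseline: one multiply-add per loop iteration.
static float dot_scalar(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

// Basic SIMD: accumulate 8 products at a time in a 256-bit register.
static float dot_avx2(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // unaligned load of 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);  // acc += va * vb (fused multiply-add)
    }
    // Horizontal reduction of the 8 partial sums.
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3]
              + lanes[4] + lanes[5] + lanes[6] + lanes[7];
    // Scalar tail for lengths that are not multiples of 8.
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

int main(void) {
    float a[1024], b[1024];
    for (int i = 0; i < 1024; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    printf("scalar: %f  avx2: %f\n",
           dot_scalar(a, b, 1024), dot_avx2(a, b, 1024));
    return 0;
}
```

The article's unrolled and aligned variants build on the same idea: multiple accumulator registers to hide FMA latency, and `_mm256_load_ps` on 32-byte-aligned buffers instead of the unaligned loads used here.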

July 16, 2025 · 15 min · 3146 words · Necati Demir