Performance

Quantizing Llama 3 8B to W8A8: A Complete Guide with LLMCompressor and vLLM

Running large language models in production is expensive. A single Llama 3 8B instance in FP16 consumes around 16 GB of GPU memory, limiting how many concurrent requests you can serve. Quantization changes this equation dramatically. In this tutorial, I’ll walk you through quantizing Meta’s Llama 3 8B Instruct model to W8A8 (INT8 weights + INT8 activations) using LLMCompressor, then serving it with vLLM for production inference.

Table of Contents
What is W8A8 Quantization?
Why Quantize?
Prerequisites
Project Setup
The Quantization Recipe
Quantization Implementation
Serving with vLLM
Testing Your Deployment
Production Considerations
Conclusion

What is W8A8 Quantization?

W8A8 stands for 8-bit Weights and 8-bit Activations. Instead of storing model parameters as 16-bit or 32-bit floating point numbers, we compress them to 8-bit integers. ...
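To make the idea of compressing floating-point values to 8-bit integers concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration only, not LLMCompressor's actual algorithm: the function names, the sample weights, and the round figure of 8 billion parameters are all assumptions made for this example.

```python
# Illustrative symmetric per-tensor INT8 quantization.
# This is NOT how LLMCompressor is implemented; it only shows the basic idea.

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for the demonstration.
weights = [0.02, -0.5, 0.31, 1.27, -1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize_int8(q_weights, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Rough weight-memory estimate, assuming about 8 billion parameters.
PARAMS = 8_000_000_000
fp16_gb = PARAMS * 2 / 1e9  # 2 bytes per FP16 weight
int8_gb = PARAMS * 1 / 1e9  # 1 byte per INT8 weight

print("quantized:", q_weights, "scale:", scale)
print("max round-trip error:", max_error)
print("FP16 weights (GB):", fp16_gb, "INT8 weights (GB):", int8_gb)
```

The round-trip error stays within half a quantization step, and the weight storage is halved relative to FP16. Real W8A8 pipelines also quantize activations and typically use finer-grained scales, which this sketch leaves out.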

An Introduction to StarRocks: My Experience Using It (Plus My Docker Setup)

I’ve been working on projects that use StarRocks for the last couple of years. When I first used it, I was shocked by how lightning-fast the queries were, so I decided to write an article about StarRocks. This article will also serve as my own future notes ;) Let’s dive in.

Table of Contents
1. What is StarRocks?
2. My Journey with StarRocks
2.1. Key Features That Make StarRocks StarRocks
3. Why Choose StarRocks?
3.1. Performance That Matters
3.2. Cost-Effective Scaling
3.3. Real-World Use Cases
4. Architecture Deep Dive
4.1. Frontend (FE) Nodes
4.2. Backend (BE) Nodes
5. Getting Started with StarRocks Using Docker
5.1. Quick Setup Guide
5.2. Repository Features
6. Your First StarRocks Query
7. Advanced Features to Explore
7.1. Materialized Views for Acceleration
7.2. Real-Time Data Ingestion
7.3. Window Functions for Advanced Analytics
8. Performance Optimization Tips
9. Monitoring and Operations
9.1. Key Metrics to Track
9.2. Backup and Recovery
10. Community
11. Conclusion

1. What is StarRocks?

StarRocks is a high-performance analytical database designed for online analytical processing (OLAP) workloads. It was forked from Apache Doris, but it has since evolved into a powerful solution. ...

Understanding SIMD Performance: A Developer's Introduction with Real Benchmarks

Table of Contents
1. Introduction
2. SIMD Implementation Fundamentals
2.1 AVX2 and 256-bit Registers
2.2 Memory Alignment Considerations
2.2.1 Unaligned Memory Access (Inefficient)
2.2.2 Aligned Memory Access (Efficient)
2.3 Loop Unrolling Technique
2.4 Compilation Requirements
3. SIMD in Practice: Dot Product Case Study
3.1 Four Implementation Approaches
3.1.1 Scalar Implementation (Baseline)
3.1.2 Basic SIMD Implementation
3.1.3 Unrolled SIMD Implementation
3.1.4 Aligned SIMD Implementation
4. Performance Analysis: Benchmark Results
4.1 Compilation Methodology
4.2 Results with -O3 Optimization
4.3 Results without -O3 Optimization
4.4 Compiler Optimization vs Manual SIMD: Key Insights
4.4.1 Compiler Auto-Vectorization is Remarkably Effective
4.4.2 Manual Optimization Value Depends on Context
4.4.3 Memory Hierarchy Effects Persist Regardless
5. Conclusion & Practical Takeaways
5.1 Key Practical Takeaways

1. Introduction

Do you keep hearing about SIMD without knowing what it is all about? This article is for you. SIMD is the go-to technique for squeezing every ounce of performance out of modern CPUs. The promise of SIMD is to process 8 floating-point numbers simultaneously instead of one, which in theory yields an 8x speedup. That is only theory, though, because other factors affect the real results. ...
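The "8 numbers at a time" promise can be modeled in a few lines of plain Python. This is a conceptual sketch only, not real SIMD and not the article's benchmark code: the inner lane loop stands in for what a single vector instruction would do, and `LANES = 8` assumes 32-bit floats in a 256-bit AVX2 register. The function names and the 20-element input are made up for the illustration.

```python
# Conceptual model of a SIMD dot product. Pure Python does not execute
# vector instructions; the inner lane loop stands in for ONE vector operation.

LANES = 8  # assumption: 256-bit register / 32-bit float = 8 lanes


def dot_scalar(a, b):
    """Baseline: one multiply-add per loop step."""
    total = 0.0
    steps = 0
    for x, y in zip(a, b):
        total += x * y
        steps += 1
    return total, steps


def dot_lanes(a, b, lanes=LANES):
    """Lane-wise model: each main-loop step handles `lanes` elements."""
    acc = [0.0] * lanes  # models one vector accumulator register
    steps = 0
    full = len(a) - len(a) % lanes
    for i in range(0, full, lanes):
        for lane in range(lanes):  # conceptually a single vector multiply-add
            acc[lane] += a[i + lane] * b[i + lane]
        steps += 1
    total = sum(acc)  # horizontal reduction of the accumulator
    for i in range(full, len(a)):  # scalar tail for leftover elements
        total += a[i] * b[i]
        steps += 1
    return total, steps


a = [float(i) for i in range(20)]
b = [2.0] * 20
scalar_total, scalar_steps = dot_scalar(a, b)
simd_total, simd_steps = dot_lanes(a, b)
print("scalar:", scalar_total, "steps:", scalar_steps)
print("lanes: ", simd_total, "steps:", simd_steps)
```

Both versions compute the same result, but the step count does not shrink by a clean factor of 8: the leftover elements still run one at a time, and the accumulator needs a final reduction. Overheads of this kind are one reason the 8x figure is a ceiling rather than a guarantee.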