Quantizing Llama 3 8B to W8A8: A Complete Guide with LLMCompressor and vLLM

Running large language models in production is expensive. A single Llama 3 8B instance in FP16 consumes around 16GB of GPU memory, limiting how many concurrent requests you can serve. Quantization changes this equation dramatically. In this tutorial, I’ll walk you through quantizing Meta’s Llama 3 8B Instruct model to W8A8 (INT8 weights + INT8 activations) using LLMCompressor, then serving it with vLLM for production inference.

Table of Contents: What is W8A8 Quantization? · Why Quantize? · Prerequisites · Project Setup · The Quantization Recipe · Quantization Implementation · Serving with vLLM · Testing Your Deployment · Production Considerations · Conclusion

What is W8A8 Quantization?

W8A8 stands for 8-bit Weights and 8-bit Activations. Instead of storing model parameters as 16-bit or 32-bit floating point numbers, we compress them to 8-bit integers. ...
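To make the idea of compressing floating-point values to 8-bit integers concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python, plus the memory arithmetic behind the "around 16GB" figure. It is only an illustration under simple assumptions: the function names are hypothetical, the parameter count is rounded to 8 billion, and this is not how LLMCompressor or vLLM implement W8A8 internally.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


def weight_memory_gb(num_params, bytes_per_param):
    """Rough weight-only memory estimate in GB (ignores activations and KV cache)."""
    return num_params * bytes_per_param / 1e9


weights = [0.5, -1.27, 0.02, 1.0]  # toy values, not real model weights
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

print(q)
print(restored)
print(weight_memory_gb(8e9, 2))  # FP16: 2 bytes per parameter
print(weight_memory_gb(8e9, 1))  # INT8: 1 byte per parameter
```

Each restored value differs from the original by at most half a quantization step, which is the precision given up in exchange for halving the bytes stored per weight.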

January 29, 2026 · 9 min · 1783 words · Necati Demir

The AI vs Teen Driver Comparison Is Wrong (Here's What Everyone Forgets)

The Popular Comparison

A teenager learns to drive in 10 hours, but an AI system needs millions of simulations and millions of hours of simulated data. This comparison appears frequently in AI discussions, but once you look closely, it’s not fair. Here’s why.

Where This Example Comes From

I recently saw this example in an interview with Ilya Sutskever, co-founder of OpenAI, who later started his own AI startup with significant investment backing. ...

December 3, 2025 · 4 min · 729 words · Necati Demir

StarRocks Vector Search: Production-Ready Setup in 6 Steps

When StarRocks introduced vector search, I had to test it and see for myself: is it a toy feature bolted onto an OLAP database, or not? It turns out this thing is actually good enough that you can fold your approximate nearest-neighbor (ANN) workloads into the same cluster that’s already powering your analytics dashboards. In this article, I decided to skip the benchmarking details and focus on what you actually need to get vector search running. ...

September 22, 2025 · 6 min · 1177 words · Necati Demir

GPT-5 Fast vs Thinking vs Pro: How They Actually Work

OpenAI recently released the GPT-5 family, introducing three distinct options for Pro users: GPT-5 Fast, GPT-5 Thinking, and GPT-5 Pro. While it’s common knowledge that GPT-5 Fast handles simple tasks and GPT-5 Pro tackles complex ones, the underlying mechanisms remain unclear to many users. This post provides a concise explanation of how each variant operates and when to use each one effectively.

GPT-5 Fast

For the context of this article, we will use GPT-5 Fast as our base model. Think of it as a black box optimized for speed: a good old LLM. When you submit a query, it processes the request and delivers an answer quickly without extensive deliberation. ...

August 25, 2025 · 3 min · 454 words · Necati Demir

Building an End-to-End Chat Bot with ONNX Runtime and Rust

Table of Contents: Introduction · Prerequisites · Project Setup · Architecture Overview · Exporting Models to ONNX · Loading an ONNX Model · Text Generation Pipeline · Building the CLI Chat Interface · Going Further · Conversation Memory · Temperature & Top-p Sampling · Streaming Tokens · Performance Optimizations · Testing · Deployment Considerations · Conclusion

TLDR ...

July 6, 2025 · 8 min · 1684 words · Necati Demir