Table of Contents

  1. Introduction
  2. SIMD Implementation Fundamentals
  3. SIMD in Practice: Dot Product Case Study
  4. Performance Analysis: Benchmark Results
  5. Conclusion & Practical Takeaways

1. Introduction

Do you keep hearing about SIMD but aren’t sure what it actually is? This article is for you. SIMD is the go-to technique for squeezing every ounce of performance from modern CPUs. The promise is simple: process 8 floating-point numbers simultaneously instead of one, for a theoretical 8x speedup. In practice, other factors determine how much of that speedup you actually get.

Let’s start by understanding what SIMD is. SIMD stands for “Single Instruction, Multiple Data”. It is a parallel computing technique in which a single CPU instruction operates on multiple data elements simultaneously rather than processing them one at a time. Processors that support Intel’s AVX2 extension can operate on 8 single-precision (32-bit) floating-point numbers in a single instruction by using 256-bit wide registers. This approach is effective for mathematical operations on arrays, such as vector arithmetic.

That should give you an idea of why you keep hearing about it more and more: SIMD accelerates vector operations, which are the fundamental operations most machine learning algorithms rely on.

In this article, we’ll explore SIMD through a case study, and along the way we’ll run benchmarks to see how much SIMD speeds up vector operations.

2. SIMD Implementation Fundamentals

Before diving into our case study, let’s understand the essential building blocks of SIMD programming with Intel’s AVX2 instruction set.

2.1 AVX2 and 256-bit Registers

AVX2 provides 256-bit wide registers that can hold 8 single-precision floating-point numbers. The key intrinsics (C functions mapping to assembly instructions) we’ll use:

#include <immintrin.h>

// Initialize vector to all zeros
__m256 sum = _mm256_setzero_ps();

// Load 8 floats from memory (unaligned)
__m256 data = _mm256_loadu_ps(&array[i]);

// Load 8 floats from aligned memory (faster)
__m256 data = _mm256_load_ps(&aligned_array[i]);

// Add two vectors element-wise
__m256 result = _mm256_add_ps(vec_a, vec_b);

// Multiply two vectors element-wise
__m256 result = _mm256_mul_ps(vec_a, vec_b);

// Fused multiply-add: result = (vec_a * vec_b) + vec_c
__m256 result = _mm256_fmadd_ps(vec_a, vec_b, vec_c);

// Store vector back to memory
_mm256_storeu_ps(&result_array[i], result);
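
To see these intrinsics working together, here is a tiny self-contained example (illustrative, not taken from the case study code) that computes c = a * b + c for 8 floats and prints the result:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8] = {0};

    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_loadu_ps(c);
    vc = _mm256_fmadd_ps(va, vb, vc);      // c = a * b + c, 8 lanes in one instruction
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++) printf("%.1f ", c[i]);  // 8.0 14.0 18.0 20.0 20.0 18.0 14.0 8.0
    printf("\n");
    return 0;
}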

2.2 Memory Alignment Considerations

I wanted to devote a separate section to aligned vs. unaligned memory. As you can see above, there are two intrinsics for loading an array from memory: _mm256_loadu_ps and _mm256_load_ps.

SIMD operations perform best with aligned memory. Aligned memory means the data starts at an address that is a multiple of the vector size (32 bytes for AVX2). This matters because the CPU can then load an entire vector in a single efficient operation rather than with multiple slower partial loads.
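
If you want to verify alignment at runtime, a small illustrative helper (not part of the benchmark code) is enough:

#include <stdint.h>

// Returns 1 if the pointer sits on a 32-byte boundary, 0 otherwise
static int is_32_byte_aligned(const void* p) {
    return ((uintptr_t)p % 32) == 0;
}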

2.2.1 Unaligned Memory Access (Inefficient)

[Diagram: a 32-byte vector load starting at address 12 straddles two 32-byte memory blocks, so the CPU performs two partial loads (bytes 12-31 and bytes 32-43) and then merges them, extra work that makes the access slower.]

2.2.2 Aligned Memory Access (Efficient)

[Diagram: a 32-byte vector load starting at address 0 falls entirely within a single 32-byte memory block, so one load operation completes the vector, the optimal case.]

You can allocate aligned memory like this:

// Allocate memory aligned to 32-byte boundary (required for AVX2)
float* data = (float*)aligned_alloc(32, size * sizeof(float));

Aligned loads guarantee that you never pay the penalty of a misaligned access; on recent Intel/AMD cores a _mm256_loadu_ps on an already-aligned address runs just as fast as _mm256_load_ps, so the cost comes from the data actually being misaligned, not from which intrinsic you choose. Unaligned loads provide more flexibility, allowing you to process data starting at any memory address without special memory allocation or data restructuring. This means you can work with existing data structures, arrays allocated by other code, or data that doesn’t fit neatly into 32-byte boundaries.
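
As an illustration of that flexibility, here is a small sketch (not from the benchmark code) that vectorizes over a window starting at an arbitrary offset inside an existing array; _mm256_set1_ps broadcasts the scale factor into all 8 lanes:

#include <immintrin.h>
#include <stddef.h>

// Scale len elements starting at data[start] in place; any offset works
// because the unaligned load/store intrinsics accept any address
void scale_window(float* data, size_t start, size_t len, float factor) {
    __m256 vf = _mm256_set1_ps(factor);           // broadcast factor to 8 lanes
    size_t i = start;
    size_t end = start + len;
    for (; i + 8 <= end; i += 8) {
        __m256 v = _mm256_loadu_ps(&data[i]);     // no alignment requirement
        _mm256_storeu_ps(&data[i], _mm256_mul_ps(v, vf));
    }
    for (; i < end; i++) {
        data[i] *= factor;                        // scalar tail
    }
}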

2.3 Loop Unrolling Technique

Loop unrolling processes multiple vector operations per iteration to reduce loop overhead and increase instruction-level parallelism. Instead of processing 8 elements per loop iteration, you process 32 elements (4 vectors) at once.

// Basic SIMD: 8 elements per iteration
for (size_t i = 0; i < simd_end; i += 8) {
    __m256 va = _mm256_loadu_ps(&a[i]);
    __m256 vb = _mm256_loadu_ps(&b[i]);
    // ... process one vector
}

// Unrolled SIMD: 32 elements per iteration (4x unrolled)
for (size_t i = 0; i < simd_end; i += 32) {
    __m256 va0 = _mm256_loadu_ps(&a[i]);
    __m256 va1 = _mm256_loadu_ps(&a[i + 8]);
    __m256 va2 = _mm256_loadu_ps(&a[i + 16]);
    __m256 va3 = _mm256_loadu_ps(&a[i + 24]);
    // ... process four vectors in parallel
}

2.4 Compilation Requirements

To use AVX2 intrinsics, you need to compile with appropriate flags:

# Basic compilation with AVX2 support
gcc -march=native -mavx2 -mfma -O3 -o program source.c

# Alternative: specify exact architecture
gcc -march=haswell -mavx2 -O3 -o program source.c

# Minimum required: -mavx2 and -mfma for fused multiply-add
gcc -mavx2 -mfma -O3 -o program source.c

The -march flag is not mandatory; -mavx2 alone is sufficient to enable AVX2 intrinsics. However, -march=native enables all instructions supported by your CPU and often provides better overall optimization. This flag tells GCC to detect your specific processor and use all of its available instruction sets (AVX2, SSE4.2, BMI, etc.), not just the ones you explicitly specify. Because the code uses _mm256_fmadd_ps, add -mfma (or pick an -march such as haswell/skylake that implies it) in addition to -mavx2.
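
A simple way to catch a missing flag at compile time (a small sketch; __AVX2__ and __FMA__ are macros GCC and Clang predefine when the corresponding instruction sets are enabled):

// Fail the build early if the required instruction sets are not enabled
#if !defined(__AVX2__) || !defined(__FMA__)
#error "Compile with -mavx2 -mfma (or an -march that implies them, e.g. -march=native)"
#endif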

3. SIMD in Practice: Dot Product Case Study

We will compare the performance of the dot-product operation with and without SIMD. The dot product is ideal for demonstrating SIMD benefits: it’s computationally intensive, easily parallelizable, and appears in many real-world applications from machine learning to graphics.

3.1 Four Implementation Approaches

We will experiment with four different implementation methods.

3.1.1 Scalar Implementation (Baseline)

double dot_product_scalar(const float* a, const float* b, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += (double)a[i] * (double)b[i];
    }
    return sum;
}

This is a straightforward implementation that processes one element at a time, accumulating in double to limit rounding error. It will serve as our performance baseline.

3.1.2 Basic SIMD Implementation

double dot_product_simd_basic(const float* a, const float* b, size_t n) {
    __m256 sum = _mm256_setzero_ps();
    size_t simd_end = n - (n % 8);
    
    // Process 8 floats at a time
    for (size_t i = 0; i < simd_end; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 prod = _mm256_mul_ps(va, vb);
        sum = _mm256_add_ps(sum, prod);
    }
    
    // Sum the 8 floats in the vector
    float result[8];
    _mm256_storeu_ps(result, sum);
    double total = 0.0;
    for (int i = 0; i < 8; i++) {
        total += result[i];
    }
    
    // Handle remaining elements
    for (size_t i = simd_end; i < n; i++) {
        total += (double)a[i] * (double)b[i];
    }
    
    return total;
}

This implementation uses AVX2 to process 8 elements simultaneously, theoretically providing up to 8x speedup.
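
The horizontal sum above stores the vector to a temporary array and adds the lanes in a scalar loop, which is easy to read. A common alternative (a sketch, not part of the benchmarked code) reduces the 256-bit register with 128-bit instructions instead:

#include <immintrin.h>

// Sum the 8 float lanes of a __m256 into a single float
static inline float hsum256_ps(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);       // lower 4 lanes
    __m128 hi = _mm256_extractf128_ps(v, 1);     // upper 4 lanes
    __m128 s  = _mm_add_ps(lo, hi);              // 4 partial sums
    s = _mm_hadd_ps(s, s);                       // 2 partial sums
    s = _mm_hadd_ps(s, s);                       // final sum in lane 0
    return _mm_cvtss_f32(s);
}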

3.1.3 Unrolled SIMD Implementation

double dot_product_simd_unrolled(const float* a, const float* b, size_t n) {
    __m256 sum0 = _mm256_setzero_ps();
    __m256 sum1 = _mm256_setzero_ps();
    __m256 sum2 = _mm256_setzero_ps();
    __m256 sum3 = _mm256_setzero_ps();
    
    size_t simd_end = n - (n % 32);
    
    // Process 32 floats at a time (4x unrolled)
    for (size_t i = 0; i < simd_end; i += 32) {
        __m256 va0 = _mm256_loadu_ps(&a[i]);
        __m256 vb0 = _mm256_loadu_ps(&b[i]);
        __m256 va1 = _mm256_loadu_ps(&a[i + 8]);
        __m256 vb1 = _mm256_loadu_ps(&b[i + 8]);
        __m256 va2 = _mm256_loadu_ps(&a[i + 16]);
        __m256 vb2 = _mm256_loadu_ps(&b[i + 16]);
        __m256 va3 = _mm256_loadu_ps(&a[i + 24]);
        __m256 vb3 = _mm256_loadu_ps(&b[i + 24]);
        
        sum0 = _mm256_fmadd_ps(va0, vb0, sum0);  // Fused multiply-add
        sum1 = _mm256_fmadd_ps(va1, vb1, sum1);
        sum2 = _mm256_fmadd_ps(va2, vb2, sum2);
        sum3 = _mm256_fmadd_ps(va3, vb3, sum3);
    }
    
    // Combine the four sums
    sum0 = _mm256_add_ps(sum0, sum1);
    sum2 = _mm256_add_ps(sum2, sum3);
    sum0 = _mm256_add_ps(sum0, sum2);
    
    // Extract and sum (same as basic version)
    // ... remainder handling
}

This approach processes 32 elements per iteration, which reduces loop overhead and keeps four independent accumulators in flight so the CPU can overlap the fused multiply-adds.
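
For reference, here is a sketch of the elided tail, mirroring the basic version: reduce the combined sum0 vector to a scalar, then finish the last n % 32 elements with a plain loop.

    // Reduce the combined vector to a scalar (same idea as the basic version)
    float partial[8];
    _mm256_storeu_ps(partial, sum0);
    double total = 0.0;
    for (int i = 0; i < 8; i++) {
        total += partial[i];
    }

    // Handle the remaining n % 32 elements
    for (size_t i = simd_end; i < n; i++) {
        total += (double)a[i] * (double)b[i];
    }

    return total;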

3.1.4 Aligned SIMD Implementation

double dot_product_simd_aligned(const float* a, const float* b, size_t n) {
    __m256 sum = _mm256_setzero_ps();
    size_t simd_end = n - (n % 8);
    
    for (size_t i = 0; i < simd_end; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]);   // Aligned load
        __m256 vb = _mm256_load_ps(&b[i]);
        sum = _mm256_fmadd_ps(va, vb, sum);  // Fused multiply-add
    }
    
    // ... same extraction and remainder handling
}

This version assumes 32-byte-aligned input and uses _mm256_load_ps for the loads; note that _mm256_load_ps faults if the address is not actually aligned, so it must only be used with buffers allocated accordingly.
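
A hypothetical usage sketch (array size and values chosen only for illustration): allocate both inputs with aligned_alloc so the aligned variant is safe to call, and release them with free.

#include <stdio.h>
#include <stdlib.h>

double dot_product_simd_aligned(const float* a, const float* b, size_t n);  // defined above

int main(void) {
    size_t n = 1000000;                                   // 4 MB per array, a multiple of 32 bytes
    float* a = aligned_alloc(32, n * sizeof(float));      // 32-byte aligned buffers
    float* b = aligned_alloc(32, n * sizeof(float));
    if (!a || !b) return 1;

    for (size_t i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    double d = dot_product_simd_aligned(a, b, n);
    printf("dot = %.1f (expected %.1f)\n", d, 2.0 * (double)n);

    free(a);
    free(b);
    return 0;
}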

4. Performance Analysis: Benchmark Results

In the previous section, we familiarized ourselves with the four implementations we will benchmark: one non-SIMD and three SIMD approaches. Now let’s see how they perform in practice. The full code can be found as a gist on GitHub.

I benchmarked all four implementations on arrays ranging from 1 million to 1 billion elements. We count a fused multiply-add as two floating-point operations (mul + add), following the usual HPC convention, even though it is one instruction.

Benchmarking Methodology:

  • Memory allocation: 32-byte aligned memory using aligned_alloc() to ensure optimal SIMD performance
  • Timing precision: High-resolution clock_gettime(CLOCK_MONOTONIC) for nanosecond-accurate measurements
  • Statistical accuracy: 10 warmup runs + 100 timed runs per implementation to minimize measurement noise
  • Compiler optimization prevention: volatile sink to prevent dead code elimination of results
  • Result verification: All implementations validated to produce identical results within floating-point tolerance
  • Performance metrics: GFLOPS calculated from the actual floating-point operation count (2 ops per element: multiply + add); a minimal sketch of the timing harness follows this list
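
Here is a minimal sketch of such a harness for one implementation (simplified; the warmup/run counts match the methodology above, while the function under test and output format are illustrative):

#include <stdio.h>
#include <time.h>

double dot_product_scalar(const float* a, const float* b, size_t n);  // from section 3

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

volatile double sink;  // keeps the compiler from discarding the results

void benchmark_scalar(const float* a, const float* b, size_t n) {
    for (int i = 0; i < 10; i++) sink = dot_product_scalar(a, b, n);    // warmup runs
    double start = now_sec();
    const int runs = 100;
    for (int i = 0; i < runs; i++) sink = dot_product_scalar(a, b, n);  // timed runs
    double avg = (now_sec() - start) / runs;
    double gflops = (2.0 * (double)n) / avg / 1e9;                      // 2 ops per element
    printf("avg %.3f ms, %.2f GFLOPS\n", avg * 1e3, gflops);
}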

Array sizes tested:

  • 1 million elements (~4MB per array)
  • 10 million elements (~40MB per array)
  • 100 million elements (~400MB per array)
  • 1 billion elements (~4GB per array)

4.1 Compilation Methodology

I tested two compilation scenarios:

# With aggressive optimization
gcc -march=native -mavx2 -O3 -o dot_product_optimized dot_product_benchmark.c

# Without optimization (for comparison)
gcc -march=native -mavx2 -o dot_product_basic dot_product_benchmark.c

4.2 Results with -O3 Optimization

Here are the performance results across different array sizes with -O3 optimization:

Array Size      Implementation    Performance (GFLOPS)    Speedup vs Scalar
1M elements     Scalar            3.71                    1.00x (baseline)
                SIMD Basic        14.67                   3.95x
                SIMD Unrolled     14.74                   3.97x
                SIMD Aligned      14.16                   3.81x
10M elements    Scalar            3.35                    1.00x (baseline)
                SIMD Basic        5.83                    1.74x
                SIMD Unrolled     4.31                    1.29x
                SIMD Aligned      4.83                    1.44x
100M elements   Scalar            3.02                    1.00x (baseline)
                SIMD Basic        4.38                    1.45x
                SIMD Unrolled     3.99                    1.32x
                SIMD Aligned      4.33                    1.44x
1B elements     Scalar            2.29                    1.00x (baseline)
                SIMD Basic        4.56                    1.99x
                SIMD Unrolled     4.76                    2.08x
                SIMD Aligned      4.57                    2.00x

Key Observations When Compiling with -O3

  • SIMD benefits decrease with array size - the 1M-element arrays show dramatic 3.8-4.0x speedups, while the larger arrays drop to roughly 1.3-2.1x
  • Memory hierarchy effects dominate - performance drops significantly as data exceeds cache sizes
  • Manual SIMD provides substantial benefits even with compiler optimization (1.99x to 2.08x for 1B elements)
  • Cache-friendly workloads favor SIMD - 1M element arrays fit in cache and show maximum SIMD advantage

4.3 Results without -O3 Optimization

The same benchmark without compiler optimization tells a dramatically different story:

Array Size      Implementation    Performance (GFLOPS)    Speedup vs Scalar
1M elements     Scalar            0.93                    1.00x (baseline)
                SIMD Basic        3.60                    3.87x
                SIMD Unrolled     6.15                    6.62x
                SIMD Aligned      3.60                    3.87x
10M elements    Scalar            0.93                    1.00x (baseline)
                SIMD Basic        3.38                    3.64x
                SIMD Unrolled     4.67                    5.04x
                SIMD Aligned      3.30                    3.56x
100M elements   Scalar            0.65                    1.00x (baseline)
                SIMD Basic        2.31                    3.55x
                SIMD Unrolled     3.24                    4.99x
                SIMD Aligned      2.28                    3.52x
1B elements     Scalar            0.65                    1.00x (baseline)
                SIMD Basic        2.11                    3.23x
                SIMD Unrolled     3.27                    5.01x
                SIMD Aligned      2.25                    3.44x

Key Observations When Compiling without -O3

  • Loop unrolling delivers consistent 5-6x speedups - shows clear benefits across all array sizes (5.01x to 6.62x)
  • Basic SIMD provides reliable 3.5x improvement - consistent 3.23x to 3.87x speedups regardless of data size
  • Aligned memory shows modest gains - performs similarly to basic SIMD (3.44x to 3.87x speedups)
  • Manual SIMD becomes essential - without compiler optimization, manual techniques are the only path to performance
  • Scalar code is severely handicapped - consistently low performance (0.65-0.93 GFLOPS) shows compiler dependency

4.4 Compiler Optimization vs Manual SIMD: Key Insights

Comparing the results from sections 4.2 and 4.3 reveals fundamental insights about the relationship between compiler intelligence and manual optimization:

4.4.1 Compiler Auto-Vectorization is Remarkably Effective

The performance gap between optimized (-O3) and non-optimized builds demonstrates how sophisticated modern compilers have become:

  • Scalar performance boost: roughly 3.5-4.7x improvement with -O3 (from 0.65-0.93 to 2.29-3.71 GFLOPS)
  • Automatic SIMD generation: Compilers identify dot product patterns and generate efficient vectorized code
  • Gap reduction: Manual SIMD advantages shrink from 5-6x to 2-4x when compilers optimize

4.4.2 Manual Optimization Value Depends on Context

The value proposition of manual SIMD changes dramatically based on compiler optimization:

With -O3 optimization:

  • Manual SIMD provides modest but meaningful gains (roughly 1.3-4.0x speedups depending on array size)
  • Loop unrolling shows mixed results (sometimes slower due to memory pressure)
  • Aligned loads perform on par with unaligned loads, so explicit alignment is not the deciding factor

Without -O3 optimization:

  • Manual SIMD becomes absolutely critical (3.2-6.6x speedups)
  • Loop unrolling shows consistent value (5.0-6.6x across all sizes)
  • Basic SIMD provides reliable baseline improvement (3.2-3.9x speedups)

4.4.3 Memory Hierarchy Effects Persist Regardless

Both scenarios show that memory hierarchy dominates at scale:

  • Cache-friendly workloads (1M elements) achieve maximum SIMD benefits
  • Large arrays (1B elements) show reduced speedups due to memory bandwidth limitations
  • Performance drops significantly as data exceeds cache capacity, regardless of optimization level

5. Conclusion & Practical Takeaways

In this article, my goal was to use the dot product as an example to start writing about SIMD, but I ended up writing benchmarking code to show how much of a difference SIMD makes. Modern compilers are remarkably effective at automatic vectorization, but there are still scenarios where implementing manual SIMD helps.

5.1 Key Practical Takeaways

1. Compiler Optimization Changes Everything

  • Modern compilers excel at automatic vectorization for regular patterns
  • Always benchmark with -O3 to understand your baseline performance
  • Manual SIMD becomes less critical but still valuable when compilers optimize

2. Context Determines Value

  • Manual SIMD is essential when compiler optimization isn’t available
  • With compiler optimization, manual techniques provide incremental but meaningful gains
  • Loop unrolling shows particular value in both scenarios

3. Memory Hierarchy Dominates at Scale

  • Cache-friendly workloads see maximum SIMD benefits regardless of approach
  • Large datasets are limited by memory bandwidth, not computational throughput
  • Consider your data size relative to cache capacity when optimizing

4. Measure to Make Informed Decisions

  • Benchmark both optimized and non-optimized builds to understand the full picture
  • Test with realistic data sizes that match your actual use cases
  • Focus on sustained performance rather than peak theoretical speedups

5. When Manual SIMD Still Matters

  • Complex algorithms that compilers struggle to vectorize automatically
  • Scenarios where you need predictable performance across different compilers
  • Performance-critical applications where incremental gains are valuable
  • Legacy codebases or environments without modern compiler optimization

6. Simple Benchmarking Approach

# Test both scenarios to understand the complete picture
gcc -march=native -mavx2 -O3 -o optimized source.c
gcc -march=native -mavx2 -o basic source.c

Performance measurements conducted on Linux 6.11.0 with GCC 13.3.0, Intel CPU with AVX2 support. Results may vary across different hardware and compiler versions.


Frequently Asked Questions (FAQ)

What is SIMD and how does it improve performance?

SIMD (Single Instruction, Multiple Data) is a parallel computing technique where a single CPU instruction operates on multiple data elements simultaneously. With Intel’s AVX2, you can process 8 floating-point numbers in one instruction using 256-bit registers, theoretically providing up to 8x performance improvement for mathematical operations on arrays.

What's the difference between AVX2 and regular CPU instructions?

Regular CPU instructions process one data element at a time (scalar), while AVX2 instructions use 256-bit wide registers to process 8 single-precision floating-point numbers simultaneously. AVX2 includes specialized functions like _mm256_load_ps() for loading data and _mm256_fmadd_ps() for fused multiply-add operations.

How do I compile SIMD code with GCC?

Use GCC with AVX2 flags: ‘gcc -march=native -mavx2 -mfma -O3 -o program source.c’. The -mavx2 flag enables AVX2 intrinsics, -mfma enables fused multiply-add, and -march=native optimizes for your specific CPU. The -O3 flag enables aggressive compiler optimizations including auto-vectorization.

Why does memory alignment matter for SIMD performance?

SIMD operations perform best with 32-byte aligned memory for AVX2. Aligned memory allows the CPU to load entire 256-bit vectors in a single efficient operation, while unaligned access may require multiple partial loads and combining operations, reducing performance significantly.

What's the difference between aligned and unaligned memory access?

Aligned access (_mm256_load_ps) requires data to start at 32-byte boundaries but is faster, while unaligned access (_mm256_loadu_ps) works with any memory address but may be slower. However, on modern CPUs, unaligned loads on already-aligned addresses perform just as well as aligned loads.

How does loop unrolling improve SIMD performance?

Loop unrolling processes multiple vectors per iteration (e.g., 32 elements instead of 8), reducing loop overhead and increasing instruction-level parallelism. This technique showed consistent 5-6x speedups in benchmarks, especially when compiler optimization is disabled.

Why do compiler optimizations affect SIMD speedups?

Modern compilers with -O3 can automatically vectorize simple loops, reducing the advantage of manual SIMD. With optimization, manual SIMD provides 2-4x speedups, but without optimization, manual SIMD becomes essential, providing 3-6x improvements over scalar code.

What are the practical performance gains from manual SIMD?

Performance gains depend on data size and compiler optimization. For cache-friendly workloads (1M elements), manual SIMD shows 3.8-4.0x speedups with compiler optimization. For larger datasets (1B elements), gains are 2.0-2.1x due to memory bandwidth limitations.

When should I use manual SIMD vs compiler auto-vectorization?

Use manual SIMD when: compiler optimization isn’t available, you need predictable performance across compilers, working with complex algorithms compilers can’t vectorize, or in performance-critical applications where incremental gains matter. Modern compilers excel at auto-vectorization for regular patterns.

How does cache size affect SIMD performance?

Cache-friendly workloads see maximum SIMD benefits regardless of optimization approach. As data size exceeds cache capacity, performance drops significantly due to memory bandwidth limitations rather than computational bottlenecks, affecting both scalar and SIMD implementations.

What compilation flags are required for SIMD development?

Essential flags: -mavx2 (enables AVX2 intrinsics), -mfma (enables fused multiply-add). Recommended: -march=native (optimizes for your CPU), -O3 (enables auto-vectorization). Example: ‘gcc -march=native -mavx2 -mfma -O3 source.c’

How do I measure SIMD performance improvements accurately?

Use high-resolution timing (clock_gettime), multiple warmup runs, statistical averaging over many iterations, aligned memory allocation, and volatile sinks to prevent dead code elimination. Measure GFLOPS (billion floating-point operations per second) for meaningful comparisons across different implementations.