001 Notes

Cache-SIMD-DSA

0. Abstract

This is a standalone Cache-SIMD-DSA benchmark article. It takes 10 common DSA problems and measures four real C++ implementations for each one:

VariantMeaning
naivescalar baseline over the less cache-friendly layout
cache-awarescalar algorithm/layout chosen to reduce wasted cache traffic
SIMDAVX2 version over the less cache-friendly layout, usually gather-heavy
cache-aware + SIMDAVX2 version over the cache-aware layout

The main result:

Cache-aware + SIMD wins 8 of 10 questions at the 64 MiB row.
Cache-aware scalar wins rotated search and BFS.
SIMD over the bad layout is often worse than scalar.

Each question below includes the problem, an ASCII diagram, the measured hot-path C++ code, then the result table and graphs. Full source snapshots and CSVs are also published:

Reading the diagrams:

AoS record     [hot | cold cold cold ...]       one useful value inside a wide record
SoA values     [hot hot hot hot hot hot ...]    dense useful values
gather         lane0 <- addr0, lane1 <- addr1   SIMD lanes pull scattered addresses
packed load    [lane0 lane1 ... lane7]          SIMD lanes read one contiguous vector
CSR graph      offsets[] + edges[]              compact adjacency layout
tile/block     small matrix region              working set chosen for cache reuse

1. Shared Code

The snippets use these shared benchmark definitions. The production benchmark file also contains dataset builders, CPU feature checks, registration, and result plumbing; those are outside the hot path and are linked above.

#include <algorithm>
#include <array>
#include <bit>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <immintrin.h>
#include <limits>
#include <numeric>
#include <vector>

constexpr std::size_t kBatchLanes = 8;
constexpr std::size_t kColdFieldCount = 15;
constexpr std::size_t kWindowWidth = 32;
constexpr std::size_t kTransposeBlock = 32;
constexpr std::size_t kAvxWidthFloats = 8;

#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#define NOVECTOR __attribute__((optimize("no-tree-vectorize")))
#else
#define NOINLINE
#define NOVECTOR
#endif

template <typename T>
struct WideRecord {
    T value{};
    std::array<T, kColdFieldCount> cold{};
};

struct Sort012Record {
    std::uint32_t value = 0;
    std::array<std::uint32_t, kColdFieldCount> cold{};
};

struct GraphEdgeRecord {
    std::uint32_t dst = 0;
    std::uint32_t cold0 = 0;
    std::uint32_t cold1 = 0;
    std::uint32_t cold2 = 0;
};

struct GraphAdjList {
    std::vector<std::vector<GraphEdgeRecord>> edges;
};

struct GraphCsr {
    std::vector<std::uint32_t> offsets;
    std::vector<std::uint32_t> edges;
};

template <typename T>
std::uint64_t sample_checksum(const T* data, std::size_t elements) {
    if (elements == 0) {
        return 0;
    }
    const std::size_t step = std::max<std::size_t>(1, elements / 8);
    std::uint64_t sum = 0;
    for (std::size_t index = 0; index < elements; index += step) {
        sum += static_cast<std::uint64_t>(data[index]);
    }
    sum += static_cast<std::uint64_t>(data[elements - 1]);
    return sum;
}

float horizontal_sum_ps(__m256 value) {
    alignas(32) std::array<float, kBatchLanes> lanes{};
    _mm256_store_ps(lanes.data(), value);
    return std::accumulate(lanes.begin(), lanes.end(), 0.0f);
}

__m256i cmp_lt_epi32(__m256i lhs, __m256i rhs) {
    return _mm256_cmpgt_epi32(rhs, lhs);
}

__m256i cmp_le_epi32(__m256i lhs, __m256i rhs) {
    return _mm256_or_si256(_mm256_cmpgt_epi32(rhs, lhs), _mm256_cmpeq_epi32(lhs, rhs));
}

int mask_bits(__m256i mask) {
    return _mm256_movemask_ps(_mm256_castsi256_ps(mask));
}

// Gather one hot field from eight wide AoS records. Each record is 16 u32/f32 words.
__m256 gather_hot_values(const WideRecord<float>* aos, std::size_t length, std::size_t index) {
    const float* base = reinterpret_cast<const float*>(aos);
    return _mm256_i32gather_ps(
        base,
        _mm256_set_epi32(static_cast<int>(((7 * length + index) * 16)),
                         static_cast<int>(((6 * length + index) * 16)),
                         static_cast<int>(((5 * length + index) * 16)),
                         static_cast<int>(((4 * length + index) * 16)),
                         static_cast<int>(((3 * length + index) * 16)),
                         static_cast<int>(((2 * length + index) * 16)),
                         static_cast<int>(((1 * length + index) * 16)),
                         static_cast<int>(((0 * length + index) * 16))),
        4);
}

__m256i gather_hot_u32_values(const WideRecord<std::uint32_t>* aos,
                              std::size_t length,
                              std::size_t index) {
    const int* base = reinterpret_cast<const int*>(aos);
    return _mm256_i32gather_epi32(
        base,
        _mm256_set_epi32(static_cast<int>(((7 * length + index) * 16)),
                         static_cast<int>(((6 * length + index) * 16)),
                         static_cast<int>(((5 * length + index) * 16)),
                         static_cast<int>(((4 * length + index) * 16)),
                         static_cast<int>(((3 * length + index) * 16)),
                         static_cast<int>(((2 * length + index) * 16)),
                         static_cast<int>(((1 * length + index) * 16)),
                         static_cast<int>(((0 * length + index) * 16))),
        4);
}

2. Measurement

FieldValue
Host.122 / lenovo
CPUIntel(R) Core(TM) i7-4702MQ CPU @ 2.20GHz
ArchitectureHaswell, x86_64, AVX2
OSLinux 6.17.0-23-generic
Compilerg++ 13.3.0
Builddirect g++, release, -O3 -DNDEBUG -std=c++20 -pthread -mavx2
Cores4 physical, 8 logical
CPU masktaskset -c 0,2,4,6
Timing3 measured trials per row, median reported
Perfperf stat -x, -r 1 for every variant and size

Size ladder:

4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K,
1M, 2M, 4M, 8M, 16M, 32M, 64M

Data:

Perf events captured:

cycles, instructions, cache-references, cache-misses,
branches, branch-misses, task-clock, context-switches,
cpu-migrations, page-faults, L1-dcache-loads,
L1-dcache-load-misses, dTLB-loads, dTLB-load-misses

3. Verdict Table

The table uses the 64 MiB requested-size row. For element/query problems, lower ns is better. For matrix throughput rows, higher GiB/s is better.

QuestionWinner at 64 MiBEvidence summary
Stock profitcache-aware + SIMD0.6901 ns vs naive 10.3663 ns; timing strongly supports the verdict.
Product except selfcache-aware + SIMD3.5731 ns vs naive 20.2096 ns; timing strongly supports the verdict.
Maximum subarraycache-aware + SIMD1.1407 ns vs naive 10.7273 ns; timing strongly supports the verdict.
Trapping rain watercache-aware + SIMD2.4624 ns vs naive 21.4591 ns; timing strongly supports the verdict.
K-sized window maximumcache-aware + SIMD3.9415 ns vs naive 64.3484 ns; timing strongly supports the verdict.
Rotated searchcache-aware scalar264.8689 ns; SIMD variants lose. Evidence says branchy search did not benefit from SIMD here.
Sort 0/1/2cache-aware + SIMD1.9068 ns vs naive 35.8583 ns; timing strongly supports the verdict.
BFScache-aware scalar54.1225 ns; CSR scalar beats both SIMD forms. Evidence says layout mattered more than SIMD.
Matrix transposecache-aware + SIMD2.4765 GiB/s vs naive 0.2023 GiB/s; blocking and SIMD both matter.
Matrix multiplycache-aware + SIMD0.0094 GiB/s at true 64 MiB requested size; blocked AVX2 wins.

4. Question-by-Question Evidence

4.1 Best Time to Buy and Sell Stock

Problem:

Given daily prices, find the maximum profit from one buy followed by one sell.

Diagram:

naive:
  memory: [price|cold...][price|cold...][price|cold...]
            use 1 float    skip 15 floats  use 1 float
  work:   scan one lane at a time and update min/best

cache-aware:
  memory: prices[] = [p0 p1 p2 p3 p4 p5 ...]
  work:   same min/best scan, but every cache line is mostly useful prices

SIMD:
  lanes:  [lane0 lane1 lane2 lane3 lane4 lane5 lane6 lane7]
  load:   gather 8 prices from 8 wide AoS records
  issue:  SIMD arithmetic is wide, but memory still jumps around

cache-aware + SIMD:
  lanes:  packed load [p0 p1 p2 p3 p4 p5 p6 p7]
  work:   update 8 independent min/best states with one dense vector load

Code:

// Naive: one scalar pass per lane over 64-byte AoS records.
// Major difference: every useful price lives beside 15 cold fields.
NOINLINE NOVECTOR float stock_naive_aos(const WideRecord<float>* aos, std::size_t length) {
    float total = 0.0f;
    for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
        float min_price = std::numeric_limits<float>::infinity();
        float best = 0.0f;
        for (std::size_t index = 0; index < length; ++index) {
            // This load is the cost center: one cache-line-sized record is touched
            // even though the algorithm only needs the value field.
            const float price = aos[lane * length + index].value;
            min_price = std::min(min_price, price);
            best = std::max(best, price - min_price);
        }
        total += best;
    }
    return total;
}

// Cache-aware: same algorithm, but hot prices are contiguous.
// Major difference: the data layout changes, not the stock-profit logic.
NOINLINE NOVECTOR float stock_cache_soa(const float* hot, std::size_t length) {
    float total = 0.0f;
    for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
        // values points at a dense lane of prices; no cold fields are carried along.
        const float* values = hot + lane * length;
        float min_price = std::numeric_limits<float>::infinity();
        float best = 0.0f;
        for (std::size_t index = 0; index < length; ++index) {
            // This is the cache-aware win: values[index] advances linearly.
            min_price = std::min(min_price, values[index]);
            best = std::max(best, values[index] - min_price);
        }
        total += best;
    }
    return total;
}

// SIMD: eight independent stock streams, but each load is a gather from AoS.
// Major difference from naive: arithmetic is eight-wide, memory is still scattered.
NOINLINE float stock_simd_aos(const WideRecord<float>* aos, std::size_t length) {
    __m256 min_values = _mm256_set1_ps(std::numeric_limits<float>::infinity());
    __m256 best_values = _mm256_setzero_ps();
    for (std::size_t index = 0; index < length; ++index) {
        // The gather is the limiting part: AVX2 has to fetch one hot field from
        // each wide record instead of reading one contiguous 32-byte vector.
        const __m256 prices = gather_hot_values(aos, length, index);
        min_values = _mm256_min_ps(min_values, prices);
        best_values = _mm256_max_ps(best_values, _mm256_sub_ps(prices, min_values));
    }
    return horizontal_sum_ps(best_values);
}

// Cache-aware + SIMD: eight independent streams loaded as one packed AVX2 vector.
// Major difference from plain SIMD: this replaces gather with a dense vector load.
NOINLINE float stock_cache_simd_soa(const float* hot, std::size_t length) {
    __m256 min_values = _mm256_set1_ps(std::numeric_limits<float>::infinity());
    __m256 best_values = _mm256_setzero_ps();
    for (std::size_t index = 0; index < length; ++index) {
        // One aligned load gives all eight lane prices. This is the measured winner.
        const __m256 prices = _mm256_load_ps(hot + index * kBatchLanes);
        min_values = _mm256_min_ps(min_values, prices);
        best_values = _mm256_max_ps(best_values, _mm256_sub_ps(prices, min_values));
    }
    return horizontal_sum_ps(best_values);
}

Result:

Stock profit timing

Stock profit perf

Variant64 MiB timingIPCcache miss %L1 miss %dTLB miss %
naive10.3663 ns0.84439.2631.620.21
cache-aware5.1578 ns0.86431.4930.810.19
SIMD9.8042 ns0.83822.9431.990.24
cache-aware + SIMD0.6901 ns0.83822.9431.360.32

Evidence says cache-aware + SIMD wins. I expected the SoA layout to matter because AoS stores one hot price inside a 64-byte record. The timing confirms that expectation. Perf also shows the naive/cache-aware split reducing generic cache-miss pressure, while SIMD mainly improves useful work per loaded hot data.

4.2 Product of Array Except Self

Problem:

For every index, return the product of all other elements without division.

Diagram:

naive:
  pass 1 prefix: [x|cold...][x|cold...][x|cold...] -> work[]
  pass 2 suffix: [x|cold...][x|cold...][x|cold...] -> work[]
  issue: two passes repeatedly drag cold record fields through cache

cache-aware:
  pass 1 prefix: x[] -> work[]
  pass 2 suffix: x[] -> work[]
  change: same math, dense input stream

SIMD:
  prefix/suffix are vector lanes, but each x comes from gather(AoS)

cache-aware + SIMD:
  prefix/suffix are vector lanes and each x comes from packed load(SoA)

Code:

// Naive: prefix/suffix over wide records. Each pass pulls cold fields into cache.
// Major difference: the algorithm scans the input twice, so AoS waste is paid twice.
NOINLINE NOVECTOR std::uint64_t product_naive_aos(const WideRecord<std::uint32_t>* aos,
                                                  std::size_t length) {
    std::vector<std::uint32_t> work(length);
    std::uint64_t total = 0;
    for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
        const WideRecord<std::uint32_t>* values = aos + lane * length;
        std::uint32_t prefix = 1U;
        for (std::size_t index = 0; index < length; ++index) {
            work[index] = prefix;
            // Cost center: values[index].value is useful, but the wide record
            // brings cold fields into the cache on the prefix pass.
            prefix *= values[index].value;
        }
        std::uint32_t suffix = 1U;
        for (std::size_t index = length; index-- > 0;) {
            work[index] *= suffix;
            // Same wide-record penalty again on the reverse suffix pass.
            suffix *= values[index].value;
        }
        total += sample_checksum(work.data(), length);
    }
    return total;
}

// Cache-aware: same prefix/suffix math over dense values.
// Major difference: only the representation changes; prefix/suffix semantics match naive.
NOINLINE NOVECTOR std::uint64_t product_cache_soa(const std::uint32_t* hot,
                                                  std::size_t length) {
    std::vector<std::uint32_t> work(length);
    std::uint64_t total = 0;
    for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
        const std::uint32_t* values = hot + lane * length;
        std::uint32_t prefix = 1U;
        for (std::size_t index = 0; index < length; ++index) {
            work[index] = prefix;
            // Dense linear load: each cache line contains many useful values.
            prefix *= values[index];
        }
        std::uint32_t suffix = 1U;
        for (std::size_t index = length; index-- > 0;) {
            work[index] *= suffix;
            // Reverse scan is still linear over compact data.
            suffix *= values[index];
        }
        total += sample_checksum(work.data(), length);
    }
    return total;
}

// SIMD: vectorized prefix/suffix, but values are gathered from wide records.
// Major difference from cache-aware + SIMD: gather keeps the AoS memory penalty.
NOINLINE std::uint64_t product_simd_aos(const WideRecord<std::uint32_t>* aos,
                                        std::size_t length,
                                        std::uint32_t* temp) {
    __m256i prefix = _mm256_set1_epi32(1);
    for (std::size_t index = 0; index < length; ++index) {
        _mm256_store_si256(reinterpret_cast<__m256i*>(temp + index * kBatchLanes), prefix);
        // Eight lanes advance together, but every lane fetches from a different wide record.
        prefix = _mm256_mullo_epi32(prefix, gather_hot_u32_values(aos, length, index));
    }

    __m256i suffix = _mm256_set1_epi32(1);
    for (std::size_t index = length; index-- > 0;) {
        const __m256i prefix_values =
            _mm256_load_si256(reinterpret_cast<const __m256i*>(temp + index * kBatchLanes));
        _mm256_store_si256(reinterpret_cast<__m256i*>(temp + index * kBatchLanes),
                           _mm256_mullo_epi32(prefix_values, suffix));
        // Reverse pass repeats the gather-heavy access pattern.
        suffix = _mm256_mullo_epi32(suffix, gather_hot_u32_values(aos, length, index));
    }
    return sample_checksum(temp, length * kBatchLanes);
}

// Cache-aware + SIMD: vectorized prefix/suffix over packed lanes.
// Major difference from plain SIMD: hot values are already interleaved for AVX2.
NOINLINE std::uint64_t product_cache_simd_soa(const std::uint32_t* hot,
                                              std::size_t length,
                                              std::uint32_t* temp) {
    __m256i prefix = _mm256_set1_epi32(1);
    for (std::size_t index = 0; index < length; ++index) {
        _mm256_store_si256(reinterpret_cast<__m256i*>(temp + index * kBatchLanes), prefix);
        // Packed load: one 32-byte read supplies the next value for all eight lanes.
        prefix = _mm256_mullo_epi32(
            prefix,
            _mm256_load_si256(reinterpret_cast<const __m256i*>(hot + index * kBatchLanes)));
    }

    __m256i suffix = _mm256_set1_epi32(1);
    for (std::size_t index = length; index-- > 0;) {
        const __m256i prefix_values =
            _mm256_load_si256(reinterpret_cast<const __m256i*>(temp + index * kBatchLanes));
        _mm256_store_si256(reinterpret_cast<__m256i*>(temp + index * kBatchLanes),
                           _mm256_mullo_epi32(prefix_values, suffix));
        // The suffix pass stays packed too, so SIMD and cache layout reinforce each other.
        suffix = _mm256_mullo_epi32(
            suffix,
            _mm256_load_si256(reinterpret_cast<const __m256i*>(hot + index * kBatchLanes)));
    }
    return sample_checksum(temp, length * kBatchLanes);
}

Result:

Product except self timing

Product except self perf

Variant64 MiB timingIPCcache miss %L1 miss %dTLB miss %
naive20.2096 ns0.65536.0247.530.22
cache-aware8.7122 ns0.66431.0844.560.18
SIMD21.1712 ns0.64431.8147.800.24
cache-aware + SIMD3.5731 ns0.63031.6146.600.17

Evidence says cache-aware + SIMD wins. I expected the prefix/suffix passes to punish the wide AoS record because every pass rereads cold payload. Timing supports that. Perf supports the layout part through lower generic cache-miss pressure for cache-aware rows; the SIMD win is mostly throughput from doing more arithmetic per pass.

4.3 Maximum Subarray

Problem:

Find the contiguous subarray with the largest sum.

Diagram:

naive:
  lane0 AoS: [v|cold...][v|cold...][v|cold...] -> current,best
  issue: Kadane is simple, but each scalar load touches a wide record

cache-aware:
  lane0 SoA: [v v v v v v ...] -> current,best
  change: same dependency chain, less wasted cache traffic

SIMD:
  8 independent chains:
    current = max(v, current + v)
  issue: values arrive through gathers from AoS

cache-aware + SIMD:
  packed values feed 8 chains:
    [v0 v1 v2 v3 v4 v5 v6 v7] -> current/best vectors

Code:

// Naive: Kadane on each AoS lane.
// Major difference: the math is O(n), but the hot value is embedded in a cold record.
NOINLINE NOVECTOR float max_subarray_naive_aos(const WideRecord<float>* aos, std::size_t length) {
    float total = 0.0f;
    for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
        float current = aos[lane * length].value;
        float best = current;
        for (std::size_t index = 1; index < length; ++index) {
            // This is the same Kadane recurrence as every other scalar version.
            // The waste is the AoS load, not the recurrence.
            const float value = aos[lane * length + index].value;
            current = std::max(value, current + value);
            best = std::max(best, current);
        }
        total += best;
    }
    return total;
}

// Cache-aware: Kadane over dense values.
// Major difference from naive: values[index] is a compact linear stream.
NOINLINE NOVECTOR float max_subarray_cache_soa(const float* hot, std::size_t length) {
    float total = 0.0f;
    for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
        const float* values = hot + lane * length;
        float current = values[0];
        float best = current;
        for (std::size_t index = 1; index < length; ++index) {
            // Same dependency chain, but the input cache line is all useful data.
            current = std::max(values[index], current + values[index]);
            best = std::max(best, current);
        }
        total += best;
    }
    return total;
}

// SIMD: eight independent Kadane chains, with AoS gathers.
// Major difference: this does not vectorize one chain; it runs eight chains in parallel.
NOINLINE float max_subarray_simd_aos(const WideRecord<float>* aos, std::size_t length) {
    __m256 current = gather_hot_values(aos, length, 0);
    __m256 best = current;
    for (std::size_t index = 1; index < length; ++index) {
        // Gather supplies one value per chain, but memory is still scattered.
        const __m256 values = gather_hot_values(aos, length, index);
        current = _mm256_max_ps(values, _mm256_add_ps(current, values));
        best = _mm256_max_ps(best, current);
    }
    return horizontal_sum_ps(best);
}

// Cache-aware + SIMD: eight chains with one packed load per step.
// Major difference from SIMD AoS: the recurrence is the same, the load is packed.
NOINLINE float max_subarray_cache_simd_soa(const float* hot, std::size_t length) {
    __m256 current = _mm256_load_ps(hot);
    __m256 best = current;
    for (std::size_t index = 1; index < length; ++index) {
        // This load is why the vector version finally pays off: no gather, no cold fields.
        const __m256 values = _mm256_load_ps(hot + index * kBatchLanes);
        current = _mm256_max_ps(values, _mm256_add_ps(current, values));
        best = _mm256_max_ps(best, current);
    }
    return horizontal_sum_ps(best);
}

Result:

Maximum subarray timing

Maximum subarray perf

Variant64 MiB timingIPCcache miss %L1 miss %dTLB miss %
naive10.7273 ns0.82133.5432.800.24
cache-aware8.9412 ns0.84427.2631.600.22
SIMD9.8056 ns0.84131.8033.000.21
cache-aware + SIMD1.1407 ns0.84426.5031.980.23

Evidence says cache-aware + SIMD wins. I expected a single Kadane chain to be dependency-heavy, so the measured SIMD version uses independent lanes. The timing confirms that independent lanes plus SoA are the decisive combination.

4.4 Trapping Rain Water

Problem:

Given bar heights, compute how much water is trapped.

Diagram:

naive:
  left pass:  [h|cold...][h|cold...][h|cold...] -> left_max[]
  right pass: [h|cold...][h|cold...][h|cold...] -> water sum
  issue: both passes reread wide records

cache-aware:
  left pass:  h[] -> left_max[]
  right pass: h[] -> water sum
  change: same two-pass method over dense heights

SIMD:
  vector left/right maxima, but h values come from gathers

cache-aware + SIMD:
  packed h vectors -> packed left_max[] -> packed water sum

Code:

// Naive: build left maxima and scan right maxima over wide records.
// Major difference: the algorithm touches heights twice, so AoS waste is repeated.
NOINLINE NOVECTOR float rain_naive_aos(const WideRecord<float>* aos, std::size_t length) {
    std::vector<float> left(length);
    float total = 0.0f;
    for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
        const WideRecord<float>* values = aos + lane * length;
        float left_max = values[0].value;
        for (std::size_t index = 0; index < length; ++index) {
            // Forward pass: each height load also pulls cold fields into cache.
            left_max = std::max(left_max, values[index].value);
            left[index] = left_max;
        }
        float right_max = values[length - 1].value;
        for (std::size_t index = length; index-- > 0;) {
            // Reverse pass: the same wide-record pattern is paid again.
            right_max = std::max(right_max, values[index].value);
            total += std::max(0.0f, std::min(left[index], right_max) - values[index].value);
        }
    }
    return total;
}

// Cache-aware: same two-pass idea over dense heights.
// Major difference from naive: left/right scans are linear reads of hot heights.
NOINLINE NOVECTOR float rain_cache_soa(const float* hot, std::size_t length) {
    std::vector<float> left(length);
    float total = 0.0f;
    for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
        const float* values = hot + lane * length;
        float left_max = values[0];
        for (std::size_t index = 0; index < length; ++index) {
            // Dense forward pass: left[] still costs memory, but input no longer wastes cache.
            left_max = std::max(left_max, values[index]);
            left[index] = left_max;
        }
        float right_max = values[length - 1];
        for (std::size_t index = length; index-- > 0;) {
            // Dense reverse pass: right_max and water use contiguous heights.
            right_max = std::max(right_max, values[index]);
            total += std::max(0.0f, std::min(left[index], right_max) - values[index]);
        }
    }
    return total;
}

// SIMD: vectorized left/right maxima, but heights are gathered from AoS.
// Major difference: vector math is wide, but both passes still gather from wide records.
NOINLINE float rain_simd_aos(const WideRecord<float>* aos, std::size_t length, float* left_temp) {
    __m256 left_max = _mm256_set1_ps(std::numeric_limits<float>::lowest());
    for (std::size_t index = 0; index < length; ++index) {
        // Gathered heights feed eight left-max streams.
        left_max = _mm256_max_ps(left_max, gather_hot_values(aos, length, index));
        _mm256_store_ps(left_temp + index * kBatchLanes, left_max);
    }

    __m256 right_max = _mm256_set1_ps(std::numeric_limits<float>::lowest());
    __m256 total = _mm256_setzero_ps();
    for (std::size_t index = length; index-- > 0;) {
        // Reverse pass gathers again, so SIMD cannot hide the poor layout.
        const __m256 values = gather_hot_values(aos, length, index);
        right_max = _mm256_max_ps(right_max, values);
        const __m256 left_values = _mm256_load_ps(left_temp + index * kBatchLanes);
        const __m256 water = _mm256_sub_ps(_mm256_min_ps(left_values, right_max), values);
        total = _mm256_add_ps(total, _mm256_max_ps(water, _mm256_setzero_ps()));
    }
    return horizontal_sum_ps(total);
}

// Cache-aware + SIMD: vectorized two-pass scan over packed heights.
// Major difference from SIMD AoS: every height vector is a packed load.
NOINLINE float rain_cache_simd_soa(const float* hot, std::size_t length, float* left_temp) {
    __m256 left_max = _mm256_set1_ps(std::numeric_limits<float>::lowest());
    for (std::size_t index = 0; index < length; ++index) {
        // Packed forward pass: one vector load updates eight left maxima.
        const __m256 values = _mm256_load_ps(hot + index * kBatchLanes);
        left_max = _mm256_max_ps(left_max, values);
        _mm256_store_ps(left_temp + index * kBatchLanes, left_max);
    }

    __m256 right_max = _mm256_set1_ps(std::numeric_limits<float>::lowest());
    __m256 total = _mm256_setzero_ps();
    for (std::size_t index = length; index-- > 0;) {
        // Packed reverse pass: this is where cache-aware layout and SIMD combine.
        const __m256 values = _mm256_load_ps(hot + index * kBatchLanes);
        right_max = _mm256_max_ps(right_max, values);
        const __m256 left_values = _mm256_load_ps(left_temp + index * kBatchLanes);
        const __m256 water = _mm256_sub_ps(_mm256_min_ps(left_values, right_max), values);
        total = _mm256_add_ps(total, _mm256_max_ps(water, _mm256_setzero_ps()));
    }
    return horizontal_sum_ps(total);
}

Result:

Trapping rain timing

Trapping rain perf

Variant64 MiB timingIPCcache miss %L1 miss %dTLB miss %
naive21.4591 ns0.85939.1933.230.19
cache-aware10.4751 ns0.86931.4231.390.21
SIMD27.2468 ns0.83424.9833.450.24
cache-aware + SIMD2.4624 ns0.85228.1831.570.20

Evidence says cache-aware + SIMD wins. I expected multiple memory passes to reward contiguous hot arrays. Timing supports that. Perf supports the layout expectation for the scalar rows; the SIMD AoS timing shows that vectorizing gathers over wide records is not enough.

4.5 K-Sized Subarray Maximum

Problem:

For every fixed-width window, report the maximum value.

Diagram:

naive:
  window 0: scan 32 values from AoS records
  window 1: scan mostly the same 32 values again
  issue: repeated work + cold fields

cache-aware:
  dense h[] + deque:
    push new index, pop expired/smaller indexes, front is max
  change: O(n * window) becomes O(n) per lane

SIMD:
  8 lanes scan each 32-wide window, but every value is gathered

cache-aware + SIMD:
  block left maxima + block right maxima over packed lanes
  max(window) = max(right[start], left[start + width - 1])

Code:

// Naive: recompute each window maximum over wide records.
// Major difference: this does repeated O(window) work for every output window.
NOINLINE NOVECTOR float window_max_naive_aos(const WideRecord<float>* aos, std::size_t length) {
    float total = 0.0f;
    for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
        const WideRecord<float>* values = aos + lane * length;
        for (std::size_t start = 0; start + kWindowWidth <= length; ++start) {
            // Every new start rereads most of the previous window.
            float window_max = values[start].value;
            for (std::size_t offset = 1; offset < kWindowWidth; ++offset) {
                // AoS penalty: each comparison touches a wide record for one hot value.
                window_max = std::max(window_max, values[start + offset].value);
            }
            total += window_max;
        }
    }
    return total;
}

// Cache-aware: monotonic deque over dense values, O(n) per lane.
// Major difference from naive: algorithmic work drops from repeated scans to one pass.
NOINLINE NOVECTOR float window_max_cache_soa(const float* hot, std::size_t length) {
    float total = 0.0f;
    for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
        const float* values = hot + lane * length;
        std::deque<std::size_t> deque;
        for (std::size_t index = 0; index < length; ++index) {
            // Remove indexes that are no longer inside the window.
            while (!deque.empty() && deque.front() + kWindowWidth <= index) {
                deque.pop_front();
            }
            // Maintain decreasing values. This is the section that avoids rescanning
            // the whole window for every output.
            while (!deque.empty() && values[deque.back()] <= values[index]) {
                deque.pop_back();
            }
            deque.push_back(index);
            if (index + 1 >= kWindowWidth) {
                total += values[deque.front()];
            }
        }
    }
    return total;
}

// SIMD: recompute each window maximum for eight lanes with AoS gathers.
// Major difference: SIMD widens the repeated scan, but it does not fix repeated work.
NOINLINE float window_max_simd_aos(const WideRecord<float>* aos, std::size_t length) {
    __m256 total = _mm256_setzero_ps();
    for (std::size_t start = 0; start + kWindowWidth <= length; ++start) {
        __m256 window_max = gather_hot_values(aos, length, start);
        for (std::size_t offset = 1; offset < kWindowWidth; ++offset) {
            // This inner loop performs 32 gathered loads per output window.
            window_max = _mm256_max_ps(window_max, gather_hot_values(aos, length, start + offset));
        }
        total = _mm256_add_ps(total, window_max);
    }
    return horizontal_sum_ps(total);
}

// Cache-aware + SIMD: prefix/suffix block maxima over packed lanes.
// Major difference from SIMD AoS: no repeated 32-load window scan.
NOINLINE float window_max_cache_simd_soa(const float* hot,
                                         std::size_t length,
                                         float* left_temp,
                                         float* right_temp) {
    __m256 block_left = _mm256_setzero_ps();
    for (std::size_t index = 0; index < length; ++index) {
        // Build left maxima inside each kWindowWidth block using packed loads.
        const __m256 values = _mm256_load_ps(hot + index * kBatchLanes);
        block_left = (index % kWindowWidth == 0) ? values : _mm256_max_ps(block_left, values);
        _mm256_store_ps(left_temp + index * kBatchLanes, block_left);
    }

    __m256 block_right = _mm256_setzero_ps();
    for (std::size_t reverse = length; reverse-- > 0;) {
        // Build right maxima from the other direction, also packed.
        const __m256 values = _mm256_load_ps(hot + reverse * kBatchLanes);
        block_right = (reverse == length - 1 || ((reverse + 1) % kWindowWidth == 0))
                          ? values
                          : _mm256_max_ps(block_right, values);
        _mm256_store_ps(right_temp + reverse * kBatchLanes, block_right);
    }

    __m256 total = _mm256_setzero_ps();
    for (std::size_t start = 0; start + kWindowWidth <= length; ++start) {
        // Window max is composed from two precomputed block maxima.
        // This is the major speed difference versus rescanning each window.
        const __m256 left_value = _mm256_load_ps(left_temp + (start + kWindowWidth - 1) * kBatchLanes);
        const __m256 right_value = _mm256_load_ps(right_temp + start * kBatchLanes);
        total = _mm256_add_ps(total, _mm256_max_ps(left_value, right_value));
    }
    return horizontal_sum_ps(total);
}

Result:

Sliding window timing

Sliding window perf

Variant64 MiB timingIPCcache miss %L1 miss %dTLB miss %
naive64.3484 ns1.05334.0924.490.18
cache-aware20.6639 ns0.92332.1728.870.17
SIMD112.4712 ns0.87029.6024.890.17
cache-aware + SIMD3.9415 ns0.84833.9831.840.22

Evidence says cache-aware + SIMD wins. The perf counters do not reduce the whole story to cache-miss rate; the winner does not have the lowest generic cache-miss percentage. I had expected this case to be about algorithmic structure plus layout: the block-based cache-aware SIMD version does less repeated window work and uses dense loads. Timing verifies that verdict.

4.6 Search in Rotated Sorted Array

Problem:

Search target values in a sorted array that has been rotated.

Diagram:

naive:
  query q:
    left ---- mid ---- right
        branch decides which half survives
  issue: each probe reads values from wide records

cache-aware:
  same binary-search decisions, but probes read dense values[]

SIMD:
  lane0: left -> right half
  lane1: right -> left half
  lane2: found
  lane3: still active
  issue: lanes diverge, masks update per lane, gathers still happen

cache-aware + SIMD:
  dense/interleaved values reduce memory waste, but branchy lane divergence remains

Code:

// Scalar helper used by both scalar layouts.
// Major behavior: every iteration depends on branch decisions from previous probes.
template <typename LoadAt>
NOINLINE NOVECTOR int rotated_search_scalar_value(std::size_t length,
                                                  std::uint32_t target,
                                                  LoadAt&& load_at) {
    int left = 0;
    int right = static_cast<int>(length) - 1;
    while (left <= right) {
        // The branch state is sequential: mid, left, and right change after each decision.
        const int mid = left + (right - left) / 2;
        const std::uint32_t left_value = load_at(left);
        const std::uint32_t mid_value = load_at(mid);
        const std::uint32_t right_value = load_at(right);
        if (mid_value == target) {
            return mid;
        }
        if (left_value <= mid_value) {
            if (left_value <= target && target < mid_value) {
                right = mid - 1;
            } else {
                left = mid + 1;
            }
        } else if (mid_value < target && target <= right_value) {
            left = mid + 1;
        } else {
            right = mid - 1;
        }
    }
    return -1;
}

// Naive: binary search over wide records for each lane/query.
// Major difference: scalar control flow is already hard to predict, and AoS adds
// wasted memory traffic to every left/mid/right probe.
NOINLINE NOVECTOR std::uint64_t rotated_search_naive_aos(const WideRecord<std::uint32_t>* aos,
                                                         const std::uint32_t* queries,
                                                         std::size_t length,
                                                         std::size_t query_count) {
    std::uint64_t sum = 0;
    for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
        const WideRecord<std::uint32_t>* values = aos + lane * length;
        const std::uint32_t* lane_queries = queries + lane * query_count;
        for (std::size_t query = 0; query < query_count; ++query) {
            // The load lambda exposes the AoS penalty at every binary-search probe.
            sum += static_cast<std::uint32_t>(
                rotated_search_scalar_value(length, lane_queries[query],
                                            [values](int i) { return values[i].value; }) + 1);
        }
    }
    return sum;
}

// Cache-aware: same control flow, but array and queries are dense.
// Major difference from naive: same branches, lower memory waste.
NOINLINE NOVECTOR std::uint64_t rotated_search_cache_soa(const std::uint32_t* hot,
                                                         const std::uint32_t* queries,
                                                         std::size_t length,
                                                         std::size_t query_count) {
    std::uint64_t sum = 0;
    for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
        const std::uint32_t* values = hot + lane * length;
        const std::uint32_t* lane_queries = queries + lane * query_count;
        for (std::size_t query = 0; query < query_count; ++query) {
            // The measured winner: scalar branch behavior remains, but probes are compact.
            sum += static_cast<std::uint32_t>(
                rotated_search_scalar_value(length, lane_queries[query],
                                            [values](int i) { return values[i]; }) + 1);
        }
    }
    return sum;
}

// SIMD helper: eight searches advance together, but lanes diverge quickly.
// Major difference: SIMD is fighting control flow. Each lane may choose a different half.
template <typename GatherFn>
std::uint64_t rotated_search_simd_common(const std::uint32_t* interleaved_queries,
                                         std::size_t length,
                                         std::size_t query_count,
                                         GatherFn&& gather_values) {
    const __m256i zero = _mm256_setzero_si256();
    const __m256i one = _mm256_set1_epi32(1);
    const __m256i all_bits = _mm256_set1_epi32(-1);
    const __m256i max_index = _mm256_set1_epi32(static_cast<int>(length - 1));
    const std::size_t max_iterations = std::bit_width(length) + 2;
    std::uint64_t sum = 0;
    alignas(32) std::array<std::int32_t, kBatchLanes> result_lanes{};

    for (std::size_t query = 0; query < query_count; ++query) {
        const __m256i target =
            _mm256_load_si256(reinterpret_cast<const __m256i*>(interleaved_queries + query * kBatchLanes));
        __m256i left = zero;
        __m256i right = max_index;
        __m256i result = _mm256_set1_epi32(-1);
        __m256i active = all_bits;

        for (std::size_t iteration = 0; iteration < max_iterations && mask_bits(active) != 0; ++iteration) {
            // Inactive lanes are masked out, but the vector loop still carries their state.
            const __m256i safe_left = _mm256_blendv_epi8(zero, left, active);
            const __m256i safe_right = _mm256_blendv_epi8(zero, right, active);
            const __m256i mid = _mm256_srli_epi32(_mm256_add_epi32(safe_left, safe_right), 1);
            // This probe may be a gather in both SIMD variants because every lane has
            // a different mid after a few iterations.
            const __m256i mid_values = gather_values(mid);
            const __m256i equal_mask = _mm256_and_si256(active, _mm256_cmpeq_epi32(mid_values, target));
            result = _mm256_blendv_epi8(result, mid, equal_mask);
            active = _mm256_andnot_si256(equal_mask, active);
            if (mask_bits(active) == 0) {
                break;
            }

            const __m256i left_values = gather_values(safe_left);
            const __m256i right_values = gather_values(safe_right);
            const __m256i left_sorted = cmp_le_epi32(left_values, mid_values);
            // These masks implement the scalar if/else tree. This is the major
            // reason SIMD loses here: decisions are per lane, not one shared branch.
            const __m256i target_in_left =
                _mm256_and_si256(cmp_le_epi32(left_values, target), cmp_lt_epi32(target, mid_values));
            const __m256i target_in_right =
                _mm256_and_si256(cmp_lt_epi32(mid_values, target), cmp_le_epi32(target, right_values));
            const __m256i go_right = _mm256_and_si256(
                active,
                _mm256_or_si256(
                    _mm256_and_si256(left_sorted, _mm256_andnot_si256(target_in_left, all_bits)),
                    _mm256_andnot_si256(left_sorted, target_in_right)));
            const __m256i go_left = _mm256_and_si256(active, _mm256_andnot_si256(go_right, all_bits));

            left = _mm256_blendv_epi8(left, _mm256_add_epi32(mid, one), go_right);
            right = _mm256_blendv_epi8(right, _mm256_sub_epi32(mid, one), go_left);
            active = _mm256_and_si256(active, cmp_le_epi32(left, right));
        }

        _mm256_store_si256(reinterpret_cast<__m256i*>(result_lanes.data()), result);
        for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
            sum += static_cast<std::uint32_t>(result_lanes[lane] + 1);
        }
    }
    return sum;
}

// SIMD: vectorized search with AoS gathers.
// Major difference from scalar AoS: eight queries run together, but all probes are gathered.
NOINLINE std::uint64_t rotated_search_simd_aos(const WideRecord<std::uint32_t>* aos,
                                               const std::uint32_t* interleaved_queries,
                                               std::size_t length,
                                               std::size_t query_count) {
    return rotated_search_simd_common(
        interleaved_queries,
        length,
        query_count,
        [aos, length](__m256i logical_indices) {
            alignas(32) std::array<std::int32_t, kBatchLanes> indices{};
            _mm256_store_si256(reinterpret_cast<__m256i*>(indices.data()), logical_indices);
            for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
                // Convert logical lane indexes into AoS byte/word positions.
                indices[lane] = indices[lane] * 16 + static_cast<std::int32_t>(lane * length * 16);
            }
            // Gather from wide records. This combines branch divergence with scattered memory.
            return _mm256_i32gather_epi32(reinterpret_cast<const int*>(aos),
                                          _mm256_load_si256(reinterpret_cast<const __m256i*>(indices.data())),
                                          4);
        });
}

// Cache-aware + SIMD: vectorized search over interleaved dense lanes.
// Major difference from SIMD AoS: compact data, but still one scattered probe per lane.
NOINLINE std::uint64_t rotated_search_cache_simd_soa(const std::uint32_t* interleaved_hot,
                                                     const std::uint32_t* interleaved_queries,
                                                     std::size_t length,
                                                     std::size_t query_count) {
    return rotated_search_simd_common(
        interleaved_queries,
        length,
        query_count,
        [interleaved_hot](__m256i logical_indices) {
            alignas(32) std::array<std::int32_t, kBatchLanes> indices{};
            _mm256_store_si256(reinterpret_cast<__m256i*>(indices.data()), logical_indices);
            for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
                // Interleaving reduces stride versus AoS, but per-lane mids still differ.
                indices[lane] = indices[lane] * static_cast<std::int32_t>(kBatchLanes) + static_cast<std::int32_t>(lane);
            }
            // This is cache-friendlier than AoS gather, but not as good as scalar SoA here.
            return _mm256_i32gather_epi32(reinterpret_cast<const int*>(interleaved_hot),
                                          _mm256_load_si256(reinterpret_cast<const __m256i*>(indices.data())),
                                          4);
        });
}

Result:

Rotated search timing

Rotated search perf

Variant64 MiB timingIPCcache miss %L1 miss %dTLB miss %
naive384.8632 ns1.19464.9921.680.77
cache-aware264.8689 ns1.30339.589.430.19
SIMD1313.2236 ns0.60622.7730.678.14
cache-aware + SIMD464.8578 ns0.99739.927.960.17

Evidence says cache-aware scalar wins. I expected branchy per-query control flow to limit SIMD usefulness, and the timing verifies that. Perf supports the scalar SoA verdict: it has better IPC than naive and much lower generic cache-miss and dTLB-miss rates than naive.

4.7 Sort 0s, 1s, and 2s

Problem:

Sort an array whose values are only 0, 1, and 2.

Diagram:

naive:
  swap records:
    [2|cold...] <-> [0|cold...]
  issue: sorting one small value moves whole records

cache-aware:
  count dense values:
    input:  [0 2 1 0 2 1 ...]
    counts: c0,c1,c2
    output: [0 0 ... 1 1 ... 2 2 ...]

SIMD:
  gather 8 AoS values, compare to 0/1/2, count masks

cache-aware + SIMD:
  packed load 8 dense values, compare to 0/1/2, count masks

Code:

// Naive: Dutch-national-flag swaps whole 64-byte records.
// Major difference: values are tiny, but every swap moves cold fields too.
NOINLINE NOVECTOR void sort012_naive_aos(Sort012Record* values, std::size_t length) {
    std::size_t low = 0;
    std::size_t mid = 0;
    std::size_t high = length - 1;
    while (mid <= high) {
        if (values[mid].value == 0U) {
            // Cost center: this swap moves the whole record, not just the 0/1/2 value.
            std::swap(values[low++], values[mid++]);
        } else if (values[mid].value == 1U) {
            ++mid;
        } else {
            // Same full-record movement for values that should go to the end.
            std::swap(values[mid], values[high--]);
        }
    }
}

// Cache-aware: count dense values, then materialize dense output.
// Major difference from naive: no record swaps, just count a tiny value domain.
NOINLINE NOVECTOR std::uint64_t sort012_cache_soa(const std::uint32_t* input,
                                                  std::uint32_t* output,
                                                  std::size_t length) {
    std::array<std::size_t, 3> counts{};
    for (std::size_t index = 0; index < length; ++index) {
        // Dense input makes the counting pass stream through useful data only.
        ++counts[input[index]];
    }
    // Materialization is three linear fills; this avoids data-dependent swaps.
    std::fill_n(output, counts[0], 0U);
    std::fill_n(output + counts[0], counts[1], 1U);
    std::fill_n(output + counts[0] + counts[1], counts[2], 2U);
    return sample_checksum(output, length);
}

// SIMD: count eight gathered AoS values at a time.
// Major difference: SIMD compares are cheap; gathered AoS loads are the weak part.
NOINLINE std::uint64_t sort012_simd_aos(const Sort012Record* input,
                                        std::uint32_t* output,
                                        std::size_t length) {
    constexpr std::int32_t kStrideWords = static_cast<std::int32_t>(sizeof(Sort012Record) / sizeof(std::uint32_t));
    const int* base = reinterpret_cast<const int*>(input);
    const __m256i zero = _mm256_setzero_si256();
    const __m256i one = _mm256_set1_epi32(1);
    const __m256i two = _mm256_set1_epi32(2);
    std::array<std::size_t, 3> counts{};
    std::size_t index = 0;
    for (; index + kBatchLanes <= length; index += kBatchLanes) {
        // These indexes jump record-by-record to fetch only the value field.
        const __m256i gather_indices = _mm256_setr_epi32(static_cast<int>((index + 0) * kStrideWords),
                                                         static_cast<int>((index + 1) * kStrideWords),
                                                         static_cast<int>((index + 2) * kStrideWords),
                                                         static_cast<int>((index + 3) * kStrideWords),
                                                         static_cast<int>((index + 4) * kStrideWords),
                                                         static_cast<int>((index + 5) * kStrideWords),
                                                         static_cast<int>((index + 6) * kStrideWords),
                                                         static_cast<int>((index + 7) * kStrideWords));
        const __m256i values = _mm256_i32gather_epi32(base, gather_indices, sizeof(std::uint32_t));
        // Mask popcount turns eight SIMD comparisons into scalar bucket counts.
        counts[0] += static_cast<std::size_t>(std::popcount(static_cast<unsigned>(mask_bits(_mm256_cmpeq_epi32(values, zero)))));
        counts[1] += static_cast<std::size_t>(std::popcount(static_cast<unsigned>(mask_bits(_mm256_cmpeq_epi32(values, one)))));
        counts[2] += static_cast<std::size_t>(std::popcount(static_cast<unsigned>(mask_bits(_mm256_cmpeq_epi32(values, two)))));
    }
    for (; index < length; ++index) {
        ++counts[input[index].value];
    }
    std::fill_n(output, counts[0], 0U);
    std::fill_n(output + counts[0], counts[1], 1U);
    std::fill_n(output + counts[0] + counts[1], counts[2], 2U);
    return sample_checksum(output, length);
}

// Cache-aware + SIMD: count eight dense values per load.
// Major difference from SIMD AoS: same mask-count idea, but one packed load.
NOINLINE std::uint64_t sort012_cache_simd_soa(const std::uint32_t* input,
                                              std::uint32_t* output,
                                              std::size_t length) {
    const __m256i zero = _mm256_setzero_si256();
    const __m256i one = _mm256_set1_epi32(1);
    const __m256i two = _mm256_set1_epi32(2);
    std::array<std::size_t, 3> counts{};
    std::size_t index = 0;
    for (; index + kBatchLanes <= length; index += kBatchLanes) {
        // This is the important load: eight values are adjacent in memory.
        const __m256i values = _mm256_load_si256(reinterpret_cast<const __m256i*>(input + index));
        // Compare against each possible value and accumulate the masks.
        counts[0] += static_cast<std::size_t>(std::popcount(static_cast<unsigned>(mask_bits(_mm256_cmpeq_epi32(values, zero)))));
        counts[1] += static_cast<std::size_t>(std::popcount(static_cast<unsigned>(mask_bits(_mm256_cmpeq_epi32(values, one)))));
        counts[2] += static_cast<std::size_t>(std::popcount(static_cast<unsigned>(mask_bits(_mm256_cmpeq_epi32(values, two)))));
    }
    for (; index < length; ++index) {
        ++counts[input[index]];
    }
    std::fill_n(output, counts[0], 0U);
    std::fill_n(output + counts[0], counts[1], 1U);
    std::fill_n(output + counts[0] + counts[1], counts[2], 2U);
    return sample_checksum(output, length);
}

Result:

Sort 0/1/2 timing

Sort 0/1/2 perf

Variant64 MiB timingIPCcache miss %L1 miss %dTLB miss %
naive35.8583 ns1.17324.7210.890.27
cache-aware4.3231 ns1.35137.996.530.23
SIMD11.2804 ns1.24252.8213.010.37
cache-aware + SIMD1.9068 ns1.32042.888.430.25

Evidence says cache-aware + SIMD wins. I had expected tiny-domain counting over contiguous values to dominate a wide-record Dutch-flag style baseline. Timing confirms the winner. Perf is mixed on generic cache-miss percentage, so I would not claim cache misses alone prove the verdict; the evidence supports a combined layout plus fewer useful operations explanation.

4.8 BFS of Graph

Problem:

Traverse graph vertices in breadth-first order.

Diagram:

naive adjacency list:
  vertex u -> [(dst|cold cold cold), (dst|cold cold cold), ...]
  issue: each neighbor visit carries unused edge fields

cache-aware CSR:
  offsets: [0, 8, 16, 24, ...]
  edges:   [dst dst dst dst dst dst ...]
  change: neighbors are compact and contiguous

SIMD:
  gather dst values + gather visited[dst]
  issue: enqueue/visited updates still happen lane by lane

cache-aware + SIMD:
  load contiguous CSR dst values, but visited checks still serialize

Code:

// Naive: adjacency-list records carry cold fields with each destination.
// Major difference: each edge record has more data than BFS needs.
NOINLINE NOVECTOR std::uint64_t bfs_naive_adjlist(const GraphAdjList& graph, std::size_t passes) {
    const std::size_t vertices = graph.edges.size();
    std::vector<std::uint32_t> queue(vertices);
    std::vector<int> visited(vertices, 0);
    std::uint64_t total = 0;
    for (std::size_t pass = 0; pass < passes; ++pass) {
        std::fill(visited.begin(), visited.end(), 0);
        const std::uint32_t source = static_cast<std::uint32_t>((pass * 17) % vertices);
        std::size_t head = 0;
        std::size_t tail = 0;
        visited[source] = 1;
        queue[tail++] = source;
        while (head < tail) {
            const std::uint32_t node = queue[head++];
            total += node;
            for (const GraphEdgeRecord& edge : graph.edges[node]) {
                // BFS only needs edge.dst; cold0/cold1/cold2 still traveled with it.
                if (visited[edge.dst] == 0) {
                    // This visited check is data dependent and controls queue growth.
                    visited[edge.dst] = 1;
                    queue[tail++] = edge.dst;
                }
            }
        }
        total += tail;
    }
    return total;
}

// Cache-aware: CSR stores only destination IDs in one dense edge array.
// Major difference from naive: adjacency data is offsets[] + compact edges[].
NOINLINE NOVECTOR std::uint64_t bfs_cache_csr(const GraphCsr& graph, std::size_t passes) {
    const std::size_t vertices = graph.offsets.size() - 1;
    std::vector<std::uint32_t> queue(vertices);
    std::vector<int> visited(vertices, 0);
    std::uint64_t total = 0;
    for (std::size_t pass = 0; pass < passes; ++pass) {
        std::fill(visited.begin(), visited.end(), 0);
        const std::uint32_t source = static_cast<std::uint32_t>((pass * 17) % vertices);
        std::size_t head = 0;
        std::size_t tail = 0;
        visited[source] = 1;
        queue[tail++] = source;
        while (head < tail) {
            const std::uint32_t node = queue[head++];
            total += node;
            // The neighbor range is contiguous in graph.edges.
            for (std::uint32_t edge = graph.offsets[node]; edge < graph.offsets[node + 1]; ++edge) {
                const std::uint32_t next = graph.edges[edge];
                if (visited[next] == 0) {
                    // Still branchy, but now the edge stream is cache-friendly.
                    visited[next] = 1;
                    queue[tail++] = next;
                }
            }
        }
        total += tail;
    }
    return total;
}

// SIMD: gather destination and visited state for adjacency-list records.
// Major difference: tries to batch neighbor loads, but cannot batch queue mutation cleanly.
NOINLINE std::uint64_t bfs_simd_adjlist(const GraphAdjList& graph, std::size_t passes) {
    const std::size_t vertices = graph.edges.size();
    std::vector<std::uint32_t> queue(vertices);
    std::vector<int> visited(vertices, 0);
    std::uint64_t total = 0;
    alignas(32) std::array<std::uint32_t, kBatchLanes> dst_lanes{};
    constexpr std::int32_t kStrideWords = static_cast<std::int32_t>(sizeof(GraphEdgeRecord) / sizeof(std::uint32_t));
    const __m256i zero = _mm256_setzero_si256();

    for (std::size_t pass = 0; pass < passes; ++pass) {
        std::fill(visited.begin(), visited.end(), 0);
        const std::uint32_t source = static_cast<std::uint32_t>((pass * 17) % vertices);
        std::size_t head = 0;
        std::size_t tail = 0;
        visited[source] = 1;
        queue[tail++] = source;
        while (head < tail) {
            const std::uint32_t node = queue[head++];
            total += node;
            const auto& edges = graph.edges[node];
            const int* base = reinterpret_cast<const int*>(edges.data());
            std::size_t edge = 0;
            for (; edge + kBatchLanes <= edges.size(); edge += kBatchLanes) {
                // Gather destination fields from records that also contain cold fields.
                const __m256i edge_indices = _mm256_setr_epi32(static_cast<int>((edge + 0) * kStrideWords),
                                                               static_cast<int>((edge + 1) * kStrideWords),
                                                               static_cast<int>((edge + 2) * kStrideWords),
                                                               static_cast<int>((edge + 3) * kStrideWords),
                                                               static_cast<int>((edge + 4) * kStrideWords),
                                                               static_cast<int>((edge + 5) * kStrideWords),
                                                               static_cast<int>((edge + 6) * kStrideWords),
                                                               static_cast<int>((edge + 7) * kStrideWords));
                const __m256i next_nodes =
                    _mm256_i32gather_epi32(base, edge_indices, sizeof(std::uint32_t));
                // Second gather: visited[] is indexed by arbitrary destination IDs.
                const __m256i seen =
                    _mm256_i32gather_epi32(visited.data(), next_nodes, static_cast<int>(sizeof(int)));
                _mm256_store_si256(reinterpret_cast<__m256i*>(dst_lanes.data()), next_nodes);
                const int unseen_mask = mask_bits(_mm256_cmpeq_epi32(seen, zero));
                // Major limiter: queue insertion and duplicate checks remain lane-by-lane.
                for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
                    if (((unseen_mask >> lane) & 1) != 0 && visited[dst_lanes[lane]] == 0) {
                        visited[dst_lanes[lane]] = 1;
                        queue[tail++] = dst_lanes[lane];
                    }
                }
            }
            for (; edge < edges.size(); ++edge) {
                const std::uint32_t next = edges[edge].dst;
                if (visited[next] == 0) {
                    visited[next] = 1;
                    queue[tail++] = next;
                }
            }
        }
        total += tail;
    }
    return total;
}

// Cache-aware + SIMD: contiguous CSR edge loads, but visited checks still serialize.
// Major difference from SIMD adjacency list: destination load is dense, visited gather remains.
NOINLINE std::uint64_t bfs_cache_simd_csr(const GraphCsr& graph, std::size_t passes) {
    const std::size_t vertices = graph.offsets.size() - 1;
    std::vector<std::uint32_t> queue(vertices);
    std::vector<int> visited(vertices, 0);
    std::uint64_t total = 0;
    alignas(32) std::array<std::uint32_t, kBatchLanes> dst_lanes{};
    const __m256i zero = _mm256_setzero_si256();

    for (std::size_t pass = 0; pass < passes; ++pass) {
        std::fill(visited.begin(), visited.end(), 0);
        const std::uint32_t source = static_cast<std::uint32_t>((pass * 17) % vertices);
        std::size_t head = 0;
        std::size_t tail = 0;
        visited[source] = 1;
        queue[tail++] = source;
        while (head < tail) {
            const std::uint32_t node = queue[head++];
            total += node;
            std::uint32_t edge = graph.offsets[node];
            const std::uint32_t edge_limit = graph.offsets[node + 1];
            for (; edge + kBatchLanes <= edge_limit; edge += static_cast<std::uint32_t>(kBatchLanes)) {
                // CSR gives a real packed load for destinations.
                const __m256i next_nodes =
                    _mm256_loadu_si256(reinterpret_cast<const __m256i*>(graph.edges.data() + edge));
                // But visited state is still scattered by node id.
                const __m256i seen =
                    _mm256_i32gather_epi32(visited.data(), next_nodes, static_cast<int>(sizeof(int)));
                _mm256_store_si256(reinterpret_cast<__m256i*>(dst_lanes.data()), next_nodes);
                const int unseen_mask = mask_bits(_mm256_cmpeq_epi32(seen, zero));
                // This scalar enqueue loop is why cache-aware scalar can still beat SIMD.
                for (std::size_t lane = 0; lane < kBatchLanes; ++lane) {
                    if (((unseen_mask >> lane) & 1) != 0 && visited[dst_lanes[lane]] == 0) {
                        visited[dst_lanes[lane]] = 1;
                        queue[tail++] = dst_lanes[lane];
                    }
                }
            }
            for (; edge < edge_limit; ++edge) {
                const std::uint32_t next = graph.edges[edge];
                if (visited[next] == 0) {
                    visited[next] = 1;
                    queue[tail++] = next;
                }
            }
        }
        total += tail;
    }
    return total;
}

Result:

BFS timing

BFS perf

Variant64 MiB timingIPCcache miss %L1 miss %dTLB miss %
naive74.2690 ns1.48451.805.120.12
cache-aware54.1225 ns1.62535.834.760.08
SIMD144.3282 ns1.29952.785.380.11
cache-aware + SIMD65.9072 ns1.43832.944.230.08

Evidence says cache-aware scalar CSR wins. I expected BFS to be dominated by frontier/control-flow and visited-state dependencies, and timing confirms SIMD is not automatically helpful. Perf supports the CSR layout: cache-aware scalar has better IPC and lower cache-miss pressure than naive adjacency-list records.

4.9 Transpose of Matrix

Problem:

Transpose a dense row-major matrix.

Diagram:

naive:
  read src row:     a b c d ...
  write dst column: a
                    b
                    c
                    d
  issue: stores stride by dim

cache-aware:
  process tile:
    src tile 32x32 -> dst tile 32x32
  change: read/write working set stays closer to cache

SIMD:
  transpose 4x4 chunks, but still walks the full matrix without tile locality

cache-aware + SIMD:
  tile 32x32, and inside each tile transpose 4x4 SIMD chunks

Code:

// Naive: row-major reads, column-major writes.
// Major difference: the store is strided by dim, so writes miss badly as dim grows.
NOINLINE NOVECTOR void transpose_naive(const float* src, float* dst, std::size_t dim) {
    for (std::size_t row = 0; row < dim; ++row) {
        for (std::size_t col = 0; col < dim; ++col) {
            // src is contiguous across col; dst jumps by dim for each next col.
            dst[col * dim + row] = src[row * dim + col];
        }
    }
}

// Cache-aware: tile the transpose so read/write working sets stay smaller.
// Major difference from naive: operate on a small row/column tile before moving on.
NOINLINE NOVECTOR void transpose_cache_blocked(const float* src, float* dst, std::size_t dim) {
    for (std::size_t row_block = 0; row_block < dim; row_block += kTransposeBlock) {
        for (std::size_t col_block = 0; col_block < dim; col_block += kTransposeBlock) {
            const std::size_t row_limit = std::min(dim, row_block + kTransposeBlock);
            const std::size_t col_limit = std::min(dim, col_block + kTransposeBlock);
            for (std::size_t row = row_block; row < row_limit; ++row) {
                for (std::size_t col = col_block; col < col_limit; ++col) {
                    // Same assignment as naive, but now repeated writes land inside a tile.
                    dst[col * dim + row] = src[row * dim + col];
                }
            }
        }
    }
}

void transpose4x4_simd(const float* src, std::size_t src_stride, float* dst, std::size_t dst_stride) {
    // Four row loads become four transposed row stores.
    // This is the SIMD primitive used by both SIMD variants.
    __m128 row0 = _mm_loadu_ps(src + 0 * src_stride);
    __m128 row1 = _mm_loadu_ps(src + 1 * src_stride);
    __m128 row2 = _mm_loadu_ps(src + 2 * src_stride);
    __m128 row3 = _mm_loadu_ps(src + 3 * src_stride);
    _MM_TRANSPOSE4_PS(row0, row1, row2, row3);
    _mm_storeu_ps(dst + 0 * dst_stride, row0);
    _mm_storeu_ps(dst + 1 * dst_stride, row1);
    _mm_storeu_ps(dst + 2 * dst_stride, row2);
    _mm_storeu_ps(dst + 3 * dst_stride, row3);
}

// SIMD: transpose 4x4 blocks, but still walks the whole matrix naively.
// Major difference from cache-aware + SIMD: no outer cache tile.
NOINLINE void transpose_simd_naive(const float* src, float* dst, std::size_t dim) {
    const std::size_t simd_dim = dim - (dim % 4);
    for (std::size_t row = 0; row < simd_dim; row += 4) {
        for (std::size_t col = 0; col < simd_dim; col += 4) {
            // SIMD fixes the 4x4 register shuffle, but the global traversal is still broad.
            transpose4x4_simd(src + row * dim + col, dim, dst + col * dim + row, dim);
        }
        for (std::size_t col = simd_dim; col < dim; ++col) {
            for (std::size_t lane = 0; lane < 4 && row + lane < dim; ++lane) {
                dst[col * dim + (row + lane)] = src[(row + lane) * dim + col];
            }
        }
    }
    for (std::size_t row = simd_dim; row < dim; ++row) {
        for (std::size_t col = 0; col < dim; ++col) {
            dst[col * dim + row] = src[row * dim + col];
        }
    }
}

// Cache-aware + SIMD: block the matrix and transpose 4x4 chunks inside each tile.
// Major difference: tile locality and SIMD register transpose are used together.
NOINLINE void transpose_cache_simd_blocked(const float* src, float* dst, std::size_t dim) {
    const std::size_t simd_dim = dim - (dim % 4);
    for (std::size_t row_block = 0; row_block < dim; row_block += kTransposeBlock) {
        for (std::size_t col_block = 0; col_block < dim; col_block += kTransposeBlock) {
            const std::size_t row_limit = std::min(dim, row_block + kTransposeBlock);
            const std::size_t col_limit = std::min(dim, col_block + kTransposeBlock);
            std::size_t row = row_block;
            for (; row + 4 <= row_limit && row < simd_dim; row += 4) {
                std::size_t col = col_block;
                for (; col + 4 <= col_limit && col < simd_dim; col += 4) {
                    // This is the main win: 4x4 SIMD transpose inside a cache-sized tile.
                    transpose4x4_simd(src + row * dim + col, dim, dst + col * dim + row, dim);
                }
                for (; col < col_limit; ++col) {
                    for (std::size_t lane = 0; lane < 4; ++lane) {
                        dst[col * dim + (row + lane)] = src[(row + lane) * dim + col];
                    }
                }
            }
            for (; row < row_limit; ++row) {
                for (std::size_t col = col_block; col < col_limit; ++col) {
                    dst[col * dim + row] = src[row * dim + col];
                }
            }
        }
    }
}

Result:

Transpose timing

Transpose perf

Variant64 MiB timingIPCcache miss %L1 miss %dTLB miss %
naive0.2023 GiB/s0.40319.7558.740.29
cache-aware1.5820 GiB/s1.43981.5210.190.22
SIMD0.7610 GiB/s0.69139.4126.430.32
cache-aware + SIMD2.4765 GiB/s1.09083.3511.020.28

Evidence says cache-aware + SIMD wins. I expected blocking to matter because naive transpose writes down columns. Timing confirms that. Perf supports the blocking story through much lower L1 miss percentage and higher IPC for the blocked scalar row; the blocked SIMD row then adds vectorized movement.

4.10 Multiply Two Matrices

Problem:

Multiply two dense square matrices.

Diagram:

naive:
  C[i][j] += A[i][k] * B[k][j]
             A row is linear, B column jumps by n
  issue: poor B locality dominates

cache-aware:
  build BT[j][k] = B[k][j]
  C[i][j] = dot(A row, BT row)
  change: both inputs in the dot product are linear

SIMD:
  load 8 A values, gather 8 B column values
  issue: vector math is wide, B still jumps

cache-aware + SIMD:
  block i/k/j:
    keep A scalar reused, stream B row segment, update C row segment with AVX2

Code:

// Naive: inner loop reads B down a column, creating poor locality.
// Major difference: this is the textbook loop order, but it is cache-hostile for B.
NOINLINE NOVECTOR void matmul_naive(const float* a, const float* b, float* c, std::size_t n) {
    std::fill_n(c, n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                // A[i*n+k] is linear across k, but B[k*n+j] jumps by n floats.
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// Cache-aware: pretranspose B so each dot product is two linear reads.
// Major difference from naive: pay one transpose so the hot inner loop streams rows.
void transpose_square_matrix(const float* src, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            dst[j * n + i] = src[i * n + j];
        }
    }
}

NOINLINE NOVECTOR void matmul_cache_transposed(const float* a, const float* bt, float* c, std::size_t n) {
    std::fill_n(c, n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        const float* row_a = a + i * n;
        for (std::size_t j = 0; j < n; ++j) {
            // row_bt is the old B column laid out as a row.
            const float* row_bt = bt + j * n;
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                // Both row_a[k] and row_bt[k] advance linearly.
                sum += row_a[k] * row_bt[k];
            }
            c[i * n + j] = sum;
        }
    }
}

// SIMD: vectorize the dot product, but gather B's column.
// Major difference: this widens arithmetic without fixing B's column access.
NOINLINE void matmul_simd_gather(const float* a, const float* b, float* c, std::size_t n) {
    std::fill_n(c, n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        const float* row_a = a + i * n;
        for (std::size_t j = 0; j < n; ++j) {
            __m256 acc = _mm256_setzero_ps();
            std::size_t k = 0;
            for (; k + kAvxWidthFloats <= n; k += kAvxWidthFloats) {
                // A is a clean packed load.
                const __m256 avec = _mm256_load_ps(row_a + k);
                // B is still a column, so the eight B addresses are strided by n.
                const __m256i bindex = _mm256_setr_epi32(static_cast<int>((k + 0) * n + j),
                                                         static_cast<int>((k + 1) * n + j),
                                                         static_cast<int>((k + 2) * n + j),
                                                         static_cast<int>((k + 3) * n + j),
                                                         static_cast<int>((k + 4) * n + j),
                                                         static_cast<int>((k + 5) * n + j),
                                                         static_cast<int>((k + 6) * n + j),
                                                         static_cast<int>((k + 7) * n + j));
                // This gather is the major reason plain SIMD does not win here.
                const __m256 bcol = _mm256_i32gather_ps(b, bindex, sizeof(float));
                acc = _mm256_add_ps(acc, _mm256_mul_ps(avec, bcol));
            }
            float sum = horizontal_sum_ps(acc);
            for (; k < n; ++k) {
                sum += row_a[k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// Cache-aware + SIMD: blocked i/k/j loop reuses A and streams B/C across rows.
// Major difference from plain SIMD: B and C are accessed as row segments, not columns.
NOINLINE void matmul_cache_simd_blocked(const float* a, const float* b, float* c, std::size_t n) {
    constexpr std::size_t kBlockI = 32;
    constexpr std::size_t kBlockJ = 64;
    constexpr std::size_t kBlockK = 64;

    std::fill_n(c, n * n, 0.0f);
    for (std::size_t ii = 0; ii < n; ii += kBlockI) {
        // Blocks keep the active A/B/C working set smaller.
        const std::size_t i_end = std::min(ii + kBlockI, n);
        for (std::size_t kk = 0; kk < n; kk += kBlockK) {
            const std::size_t k_end = std::min(kk + kBlockK, n);
            for (std::size_t jj = 0; jj < n; jj += kBlockJ) {
                const std::size_t j_end = std::min(jj + kBlockJ, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        // Reuse one A value across a vector of B/C columns.
                        const __m256 aval = _mm256_set1_ps(a[i * n + k]);
                        std::size_t j = jj;
                        for (; j + kAvxWidthFloats <= j_end; j += kAvxWidthFloats) {
                            // B and C are contiguous here. This is the section making
                            // the major measured difference versus gathered B columns.
                            const __m256 cvec = _mm256_loadu_ps(c + i * n + j);
                            const __m256 bvec = _mm256_loadu_ps(b + k * n + j);
                            _mm256_storeu_ps(c + i * n + j,
                                             _mm256_add_ps(cvec, _mm256_mul_ps(aval, bvec)));
                        }
                        for (; j < j_end; ++j) {
                            c[i * n + j] += a[i * n + k] * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}

Result:

Matrix multiply timing

Matrix multiply perf

Variant64 MiB timingIPCcache miss %L1 miss %dTLB miss %
naive0.0001 GiB/s0.27815.2466.4448.51
cache-aware0.0012 GiB/s2.31055.953.230.00
SIMD0.0001 GiB/s0.16116.75104.3176.22
cache-aware + SIMD0.0094 GiB/s2.88360.282.520.04

Evidence says cache-aware + SIMD wins. I expected naive and gather variants to suffer from poor reuse and dependent memory traffic. Timing verifies that expectation. Perf supports the blocked verdict through much higher IPC and far lower L1/dTLB miss pressure than the naive and gather rows. Some derived miss ratios exceed intuitive bounds for the gather row, so I treat the exact ratio as advisory, not as a standalone proof.

5. Reproduction

The public runner/source snapshot uses DSA labels:

run_dsa_evidence.py

The equivalent command used on .122:

python3 scripts/run_dsa_evidence.py \
  --exe build_evidence/cache_bench \
  --results-dir results/cache_simd_dsa_haswell_20260609_full_evidence \
  --iterations 3 \
  --threads 4 \
  --taskset 0,2,4,6 \
  --perf-repeats 1 \
  --timing-timeout 900 \
  --matmul-timeout 1800

The full raw archive is kept in the internal/raw repository. The public article carries only the curated charts, aggregated CSVs, metadata, source snapshots, and inline benchmark code.