001 Notes

Cache Lines Before Big-O: Optimizing Search Algorithms by Changing Memory Access

0. Abstract

This note studies search performance as a memory-access problem.

The benchmark suite compares scalar C++20 implementations across:

  • raw memory patterns
  • linked-list layouts
  • binary-search layouts
  • hash-map collision strategies
  • batching and prefetching
  • multithreaded cache-line ownership

The central claim is:

Search performance is memory-pattern performance.

Absolute timings are specific to one measured x86_64 Linux system. The useful claim is the relative pattern: which code shapes stream through cache lines, which waste cache-line payload, which chase dependent addresses, which expose independent memory requests, and which create cache-line ownership traffic.

1. Research Questions

IDQuestion
RQ1When does memory locality dominate comparison count?
RQ2Which code shapes reduce cache-line, TLB, branch, or dependency cost?
RQ3When does memory-level parallelism compensate for poor locality?
RQ4Which optimizations move cost from lookup to build, mutation, or memory footprint?
RQ5Which claims are measured, which are partial, and which remain blocked by tooling?

2. Measurement Method

RulePurpose
Fixed seed dataRepeatable input generation.
Correctness before timingInvalid variants do not support timing claims.
Warmup before timingReduce first-touch effects.
Median of repetitionsReduce one-run noise.
Printed checksumPrevent dead-code elimination.
Scalar C++20 onlyKeep this version free of SIMD and assembly effects.
Standalone binariesKeep construction, lookup, mutation, and scaling behavior visible.
Representative perf stat rowsAdd counter evidence where available.

Article snippets are simplified code shapes. They omit CLI parsing, warmup, repetition control, correctness checks, result formatting, and checksum plumbing unless those details are the topic.

3. Measured System And Boundaries

FieldValue
System classsingle measured x86_64 Linux system
CPUIntel(R) Core(TM) i7-4702MQ CPU @ 2.20GHz
Topology4 cores, 8 logical CPUs
Cache line64 bytes
Page size4096 bytes
Compilerg++ 13.3.0
Build flags-O3 -march=native -std=c++20 -Wall -Wextra -Wshadow -Wconversion
Pinningsingle-thread rows pinned to one CPU; multithread rows pinned to the available CPU set
SMTenabled
Frequencynot fixed; governor remained powersave
Counter toolingrepresentative Linux perf stat rows available; perf c2c unavailable

Boundaries:

  • No benchmark number is universal.
  • Relative patterns are the intended claim.
  • Counter evidence is partial.
  • Rows blocked by correctness or tooling are labeled as blocked.
  • Missing measurements are not inferred.

4. ASCII Legend

All diagrams use cache-line-centered notation.

[CL] = one 64-byte cache line
*    = useful value read
.    = fetched but unused value
P    = explicit prefetch target
?    = unpredictable address
!    = conflict, contention, or ownership risk
->   = access order

Cache-line cost model:

good search path:
CPU -> [CL0: * * * * * * * *] -> [CL1: * * * * * * * *]
       many useful values per fetched line

bad search path:
CPU -> [CL0: * . . . . . . .] -> [CL97: * . . . . . . .]
       one useful value per fetched line

dependent search path:
CPU -> [? node] -> [? node] -> [? node]
       next address is unknown until the current load returns

5. Raw Memory Code Shapes

Family question:

What does the CPU pay for when the operation is only "read memory"?

5.1 Linear Scan

Code Shape

std::uint64_t sum = 0;

for (std::size_t i = 0; i < values.size(); ++i) {
    sum += static_cast<std::uint64_t>(values[i]);
}

return sum;

Cache-Line Access Visual

int32 payload, 16 values per 64B line:

CPU -> [CL0: * * * * * * * * * * * * * * * *]
       [CL1: * * * * * * * * * * * * * * * *]

reference point:
nearly every fetched slot is useful

Interpretation

FieldNotes
ExpectationSequential scan should be the best raw-memory baseline because addresses are predictable and cache-line payload is fully used.
Measured resultn = 1,000,000, 5 repetitions: detailed rows below.
What happenedWider records cost more per logical element because each element consumes more cache-line payload.
TakeawayKeep hot scan fields narrow. Do not drag cold payload through a search loop.
Evidence statusSupported by timing.

Detailed rows

n = 1,000,000, 5 repetitions:

VariantElement sizens/elementVisual
int32_t4 B0.743#
int64_t8 B1.027##
struct1616 B2.808#####
struct6464 B10.170####################

5.2 Stride Scan

Code Shape

std::uint64_t sum = 0;

for (std::size_t i = 0; i < values.size(); i += stride) {
    sum += static_cast<std::uint64_t>(values[i]);
}

return sum;

Cache-Line Access Visual

stride = 16 over int32:

CPU -> [CL0: * . . . . . . . . . . . . . . .]
       [CL1: * . . . . . . . . . . . . . . .]

reference point:
access is predictable, but most fetched bytes are unused

Interpretation

FieldNotes
ExpectationLarge strides should be slower than stride 1 because each cache line contributes fewer useful values. Power-of-two strides may also create cache-set pressure.
Measured resultSelected rows, n = 1,000,000, 5 repetitions: detailed rows below.
What happenedStride 1 was fastest. Several plus-one strides beat nearby power-of-two strides.
TakeawayDo not only ask whether the working set fits in cache. Also ask whether the addresses map evenly across cache sets.
Evidence statusSupported by timing.

Detailed rows

Selected rows, n = 1,000,000, 5 repetitions:

Stridens/accessVisual
10.686#
166.352#########
647.921############
657.430###########
2567.292###########
2575.000#######
10247.586###########
10253.784######

5.3 Random Access

Code Shape

std::uint64_t sum = 0;

for (std::size_t i = 0; i < indexes.size(); ++i) {
    sum += static_cast<std::uint64_t>(values[indexes[i]]);
}

return sum;

Cache-Line Access Visual

index stream:
17 -> 900231 -> 42 -> 700001

CPU -> [? CL17:  * . . . . . . .]
       [? CL900: * . . . . . . .]
       [? CL42:  * . . . . . . .]

reference point:
loads are independent, but the next line is hard to predict

Interpretation

FieldNotes
ExpectationRandom access should be tolerable for small windows and expensive when the working set exceeds cache.
Measured resultn = 1,000,000, queries = 1,000,000, 5 repetitions: detailed rows below.
What happenedThe small-window row was much cheaper than full-array random access.
TakeawayConstraining randomness into a smaller active window can be a real optimization.
Evidence statusSupported by timing.

Detailed rows

n = 1,000,000, queries = 1,000,000, 5 repetitions:

Patternns/accessVisual
random small window4.355####
random within page10.172##########
shuffled permutation11.144###########
uniform random12.394############

5.4 Pointer Chase

Code Shape

std::uint64_t sum = 0;

for (PtrNode* node = head; node != nullptr; node = node->next) {
    sum += node->value;
}

return sum;

Cache-Line Access Visual

sequential arena:
CPU -> [CL0: node0 node1 node2 node3] -> [CL1: node4 node5 ...]

random heap or random arena:
CPU -> [? node A] -> [? node Q] -> [? node D] -> null

reference point:
next address is stored in the current node

Interpretation

FieldNotes
ExpectationArena allocation should help, but random traversal should remain expensive because each next address depends on the previous load.
Measured resultn = 500,000, 5 repetitions: detailed rows below.
What happenedArena allocation helped, but traversal order mattered more. Random arena traversal was still much slower than sequential arena traversal.
TakeawayContiguous storage is not enough. Traversal order controls whether cache-line locality is usable.
Evidence statusSupported by timing.

Detailed rows

n = 500,000, 5 repetitions:

Variantns/nodeRelative
arena sequential5.8451.0x
heap sequential9.2941.6x
arena randomized112.30519.2x
heap randomized150.63225.8x

5.5 Dependent Pointer Chase

Code Shape

std::size_t index = 0;
std::uint64_t sum = 0;

for (std::size_t step = 0; step < steps; ++step) {
    index = next[index];
    sum += index;
}

return sum;

Cache-Line Access Visual

idx0 -> next[idx0] = idx1
        idx1 -> next[idx1] = idx2
                idx2 -> next[idx2] = idx3

CPU -> [? CL idx0] -> [? CL idx1] -> [? CL idx2]

reference point:
the next load cannot be issued until the current load returns

Interpretation

FieldNotes
ExpectationRandom dependent chains should expose load-to-use latency and scale poorly with working-set size.
Measured resultn = 1,000,000, steps = 5,000,000, 5 repetitions: detailed rows below.
What happenedRandom and page-spread cycles were about 14x slower than the sequential cycle.
TakeawayDependent memory chains are a latency problem, not just a cache-capacity problem.
Evidence statusSupported by timing.

Detailed rows

n = 1,000,000, steps = 5,000,000, 5 repetitions:

Variantns/stepRelative
sequential cycle6.6741.0x
random permutation cycle92.51313.9x
page-spread cycle96.17314.4x

5.6 Prefetch Pointer Chase

Code Shape

std::uint64_t sum = 0;

for (PtrNode* node = head; node != nullptr; node = node->next) {
    if (node->next != nullptr) {
        prefetchRead(node->next);
    }

    sum += node->value;
}

return sum;

Cache-Line Access Visual

CPU reads [CL?: node A | next -> node B]
  P -> [CL?: node B]
CPU then uses node B soon after

reference point:
prefetch is often too late because node B is discovered at node A

Interpretation

FieldNotes
ExpectationSimple next-prefetch may not help pointer chasing because address discovery and use are too close together.
Measured resultn = 500,000, 5 repetitions: detailed rows below.
What happenedPointer next-prefetch did not improve the measured rows. Longer stride prefetch distances also did not improve this pass.
TakeawayPrefetch is a distance problem. It helps only when the future address is known early enough and the prefetched data is likely to be used.
Evidence statusSupported for these rows; broader prefetch policy remains workload-specific.

Detailed rows

n = 500,000, 5 repetitions:

Variantns/node
no prefetch97.438
prefetch next100.348
prefetch next-next134.773

Prefetch distance sweep for stride 16:

Distancens/accessVisual
010.293##################
110.426##################
210.265##################
410.204##################
810.221##################
1611.400####################
3211.560####################
6411.427####################
12811.491####################

Prefetch distance sweep

5.7 MLP Pointer Chase

Code Shape

std::vector<std::size_t> index(lanes, 0);
std::uint64_t sum = 0;

for (std::size_t step = 0; step < steps; ++step) {
    for (int lane = 0; lane < lanes; ++lane) {
        index[lane] = chains[lane][index[lane]];
        sum += index[lane];
    }
}

return sum;

Cache-Line Access Visual

lane 0: CPU -> [? CL] -> [? CL] -> [? CL]
lane 1: CPU -> [? CL] -> [? CL] -> [? CL]
lane 2: CPU -> [? CL] -> [? CL] -> [? CL]
lane 3: CPU -> [? CL] -> [? CL] -> [? CL]

reference point:
each lane is dependent, but lanes are independent of each other

Interpretation

FieldNotes
ExpectationMultiple independent chains should expose memory-level parallelism and lower per-chain cost until the core or memory system saturates.
Measured resultworking_set = 8,000,000 bytes, steps = 1,000,000, 5 repetitions: detailed rows below.
What happenedPer-chain time dropped from 107.056 ns to about 17 ns near 16 lanes.
TakeawayIndependent work can hide latency even when each individual chain has poor locality.
Evidence statusSupported by timing.

Detailed rows

working_set = 8,000,000 bytes, steps = 1,000,000, 5 repetitions:

Lanesns/outer stepns/chain-stepVisual per chain
1107.056107.056####################
2108.37354.186##########
4114.80028.700#####
8152.17019.021####
16274.04417.128###
32557.56017.424###

5.8 Associativity Conflict

Code Shape

for (std::size_t i = 0; i < values.size(); i += stride) {
    sum += values[i];
}

Cache-Line Access Visual

stride 256:
CPU -> [CL set A: *] -> [CL set A: *] -> [CL set A: *]
                         !

stride 257:
CPU -> [CL set A: *] -> [CL set B: *] -> [CL set C: *]

reference point:
similar strides can map differently across cache sets

Interpretation

FieldNotes
ExpectationPower-of-two strides can concentrate accesses into fewer cache sets; plus-one variants may spread mapping differently.
Measured resultn = 4,000,000, 5 repetitions: detailed rows below.
What happenedThe plus-one stride was faster in every paired row in this pass.
TakeawayCache-set placement belongs in the experiment matrix. Similar-looking access patterns can differ.
Evidence statusSupported by timing; counter evidence not included.

Detailed rows

n = 4,000,000, 5 repetitions:

Stridens/accessCompared rowVisual
647.862power of two############
657.040plus one###########
1287.970power of two############
1297.288plus one###########
2568.441power of two#############
2577.434plus one###########
5127.507power of two###########
5136.371plus one##########
10247.595power of two###########
10254.920plus one#######

5.9 TLB Page Walk

Code Shape

for (std::size_t page : page_order) {
    sum += values[(page * page_size) / sizeof(values[0])];
}

Cache-Line Access Visual

one access per page:

CPU -> [page 0: CL0: * . . . . . . .]
       [page 1: CL0: * . . . . . . .]
       [page 2: CL0: * . . . . . . .]

reference point:
one useful load per page stresses address translation

Interpretation

FieldNotes
ExpectationPage-walk patterns should be much slower than dense sequential scans because they use fewer values per page and per cache line.
Measured resultn = 4,000,000, 3906 pages, 5 repetitions: detailed rows below.
What happenedThe page-walk rows were much slower than dense sequential scan.
TakeawayGood search layout should reuse both cache lines and pages.
Evidence statusSupported by timing; direct TLB counter rows are not shown per variant.

Detailed rows

n = 4,000,000, 3906 pages, 5 repetitions:

Patternns/access
dense sequential scan4.410
sequential page walk31.813
one access per 4 KB page32.018
random page walk32.817

Completion pass:

Patternns/accessVisual
one access per 4 KB page33.580##################
sequential page walk33.277##################
random page walk36.540####################
dense sequential scan4.390##

5.10 Working-Set Ladder

Code Shape

for (std::size_t footprint : footprints) {
    buildInput(footprint);
    runSameAccessShape();
}

Cache-Line Access Visual

32 KiB  -> likely cache-resident
256 KiB -> larger private-cache pressure
1 MiB   -> beyond small private caches
8 MiB   -> shared-cache and memory pressure
64 MiB  -> memory-system pressure

reference point:
same loop shape, larger footprint

Interpretation

FieldNotes
ExpectationSequential access should stay relatively stable; random and dependent access should rise sharply with footprint.
Measured resultDetailed rows below; benchmark numbers unchanged.
What happenedSequential access stayed cheap. Random, pointer, and dependent rows rose sharply at larger footprints. MLP stayed lower than a single random dependent chain.
TakeawayFootprint sweeps are required before interpreting one benchmark row.
Evidence statusSupported by timing.

Detailed rows

PatternVariant32 KiB ns/op256 KiB ns/op1 MiB ns/op8 MiB ns/op64 MiB ns/op
linear scanint32 sequential0.6000.5270.5340.8270.867
stride scanstride 162.2014.9486.1149.25810.034
random accessuniform random3.9115.0008.29818.92445.038
pointer chasearena random9.36529.73353.748109.177170.508
dependent chaserandom cycle6.39619.60845.818116.110160.778
prefetch chaseprefetch next9.84724.48150.856103.324187.059
MLP chase8 lanes1.2613.7886.93217.45042.083
TLB page walkrandom page19.3757.8287.07028.46438.044

Working-set ladder

5.11 Cold/Warm Scan, Write Traffic, And Huge Pages

Code Shape

touchEvictionBuffer();
runScan();      // cold-style pass
runScan();      // warm repeated pass

for (std::size_t i = 0; i < values.size(); ++i) {
    values[i] = values[i] + 1;
}

Cache-Line Access Visual

cold-style:
CPU -> [eviction buffer] -> [CL0: * * * *] -> [CL1: * * * *]

read-modify-write:
CPU -> [CL0: load * * * *] -> [CL0: writable !] -> store

transparent huge-page probe:
4 KiB pages -> many translations
2 MiB pages -> fewer translations, if available

Interpretation

FieldNotes
ExpectationWarm repeated scans may be faster. Writes should cost more than reads because cache lines must become writable. Huge pages may help TLB-sensitive access if available.
Measured resultCold versus warm, 32 MiB: detailed rows below.
What happenedThe warm scan was only slightly faster. Writes were slower than reads. Transparent huge pages were slightly faster in this row, while explicit huge pages were unavailable.
TakeawayRead/write ownership and page size are real variables, but this article has timing evidence only, not direct RFO or per-row TLB counter proof.
Evidence statusPartially supported. Explicit huge pages and direct RFO counters remain blocked or unavailable.

Detailed rows

Cold versus warm, 32 MiB:

ItemsBytesCold ns/eltWarm ns/eltCold/WarmVisual
8,388,60832 MiB0.8860.8651.02x##

Write and RFO-style timing proxy:

Variantns/opVisual
read_only1.273#############
write_stream1.918####################
read_modify_write1.884###################

Huge page probe, 64 MiB allocation:

ModeStatusPage sizens/valueNote
default pagesmeasured40961.289normal anonymous mapping
explicit huge pagesunavailable2097152n/ahuge-page allocation unavailable
transparent huge pagesmeasured20971521.253advised huge-page path

6. Linked-List Code Shapes

Family question:

Can list-like search be made less expensive by changing how much useful work each pointer jump buys?

6.1 Heap List

Code Shape

bool searchList(PtrNode* head, int key) {
    for (PtrNode* node = head; node != nullptr; node = node->next) {
        if (node->value == key) {
            return true;
        }
    }

    return false;
}

Cache-Line Access Visual

CPU -> [? heap node A: * next] -> [? heap node Q: * next]
       -> [? heap node D: * next] -> null

reference point:
one useful value per dependent pointer jump

Interpretation

FieldNotes
ExpectationHeap allocation should be the poor-locality baseline when traversal order is scattered.
Measured resultPointer-chase row: detailed rows below.
What happenedHeap randomized traversal was the expensive baseline.
TakeawayHeap nodes plus random traversal are a worst-case shape for cache-line reuse.
Evidence statusSupported by timing.

Detailed rows

Pointer-chase row:

Variantns/nodeRelative
heap sequential9.2941.6x
heap randomized150.63225.8x

Linked-list compare-all row:

Variantns/queryRelative
heap randomized3,646,39728.1x

6.2 Arena List

Code Shape

std::vector<PtrNode> nodes(n);

for (std::size_t i = 0; i + 1U < nodes.size(); ++i) {
    nodes[i].value = static_cast<std::uint64_t>(i);
    nodes[i].next = &nodes[i + 1U];
}

nodes.back().next = nullptr;

Cache-Line Access Visual

arena memory:
[CL0: node0 node1 node2 node3] [CL1: node4 node5 node6 node7]

CPU -> node0 -> node1 -> node2 -> node3 -> node4

reference point:
allocation and traversal are both compact

Interpretation

FieldNotes
ExpectationArena allocation should improve locality but cannot remove the pointer dependency.
Measured resultDetailed rows below; benchmark numbers unchanged.
What happenedArena sequential was much faster than heap randomized, but slower than chunked list search in the compare-all row.
TakeawayArena allocation helps, but one pointer jump per value is still expensive for late-hit and miss-heavy search.
Evidence statusSupported by timing.

Detailed rows

Variantns/nodeRelative
arena sequential5.8451.0x

Compare-all:

Variantns/queryRelative
arena sequential333,7082.6x

6.3 Random Arena List

Code Shape

std::vector<std::size_t> order = makeRandomPermutation(n, seed);

for (std::size_t i = 0; i + 1U < order.size(); ++i) {
    nodes[order[i]].next = &nodes[order[i + 1U]];
}

nodes[order.back()].next = nullptr;

Cache-Line Access Visual

arena memory:
[CL0: node0 node1 node2 node3] [CL1: node4 node5 node6 node7]

CPU -> node3 -> node0 -> node7 -> node2 -> node5

reference point:
storage is compact, but traversal is not

Interpretation

FieldNotes
ExpectationRandom traversal should be slow even when allocation is compact.
Measured resultDetailed rows below; benchmark numbers unchanged.
What happenedRandom arena traversal was much slower than arena sequential traversal.
TakeawayThe CPU follows access order, not allocation intent.
Evidence statusSupported by timing.

Detailed rows

Variantns/nodeRelative
arena randomized112.30519.2x

6.4 Prefetch List

Code Shape

for (PtrNode* node = head; node != nullptr; node = node->next) {
    if (node->next != nullptr) {
        prefetchRead(node->next);
    }

    sum += node->value;
}

Cache-Line Access Visual

CPU reads [CL?: node A | next -> node B]
  P -> [CL?: node B]
CPU immediately advances to node B

reference point:
prefetch target is discovered late

Interpretation

FieldNotes
ExpectationNaive pointer-list prefetch should be weak because there is little work between prefetch and use.
Measured resultDetailed rows below; benchmark numbers unchanged.
What happenedNaive prefetch did not beat arena sequential in these rows.
TakeawayLinked-list prefetch needs enough independent work between address discovery and use.
Evidence statusSupported for this implementation and query mix.

Detailed rows

Variantns/queryRelative
prefetch next on arena list484,0953.7x
arena sequential333,7082.6x

Completion compare-all:

Variantns/queryVisual
prefetch next963,053##
arena sequential667,221#

6.5 Chunked List

Code Shape

template <int ChunkSize>
bool searchChunked(const ChunkNode<ChunkSize>* head, int key) {
    for (auto* node = head; node != nullptr; node = node->next) {
        for (int i = 0; i < node->count; ++i) {
            if (node->values[i] == key) {
                return true;
            }
        }
    }

    return false;
}

Cache-Line Access Visual

normal list:
CPU -> [CL?: value next] -> [CL?: value next] -> [CL?: value next]

chunked list:
CPU -> [CL0: * * * * * * * * | chunk metadata]
       [CL1: * * * * * * * * | next part of chunk]
       -> next chunk

reference point:
one pointer jump buys many value checks

Interpretation

FieldNotes
ExpectationChunking should improve late-hit and miss-heavy list search because it reduces pointer jumps per searchable value.
Measured resultCompare-all: detailed rows below.
What happenedChunking improved the expensive late-hit and miss rows compared with pointer-per-value traversal.
TakeawayIf a list must be scanned, make each pointer jump buy a block of local work.
Evidence statusSupported by timing.

Detailed rows

Compare-all:

Variantns/queryRelative
chunked 64129,6291.0x
arena sequential333,7082.6x
heap randomized3,646,39728.1x

Hit-position distribution, chunked 32:

PositionOpsns/queryVisual
early hit1,000,0005.68#
middle hit500315,212##########
late hit500633,347####################
miss500630,506####################

Linked-list miss cost

6.6 Chunked Prefetch

Code Shape

template <int ChunkSize>
bool searchChunkedPrefetch(const ChunkNode<ChunkSize>* head, int key) {
    for (auto* node = head; node != nullptr; node = node->next) {
        if (node->next != nullptr) {
            prefetchRead(node->next);
        }

        for (int i = 0; i < node->count; ++i) {
            if (node->values[i] == key) {
                return true;
            }
        }
    }

    return false;
}

Cache-Line Access Visual

CPU reads [chunk A | next -> chunk B]
  P -> [chunk B]
CPU scans [CL A0: * * * *] [CL A1: * * * *]
CPU then uses [chunk B]

reference point:
chunk scan creates prefetch distance

Interpretation

FieldNotes
ExpectationChunked prefetch should help more than normal list prefetch because scanning the current chunk creates lead time.
Measured resultn = 100,000, queries = 5,000, 5 repetitions: detailed rows below.
What happenedPrefetch was not a broad win. The best no-prefetch and prefetch rows were both chunk size 16.
TakeawayPrefetch distance is necessary but not sufficient. Chunk size, query mix, and cache state decide whether it pays.
Evidence statusMixed timing evidence.

Detailed rows

n = 100,000, queries = 5,000, 5 repetitions:

Chunk sizeno prefetch ns/queryprefetch ns/query
4149,268165,376
898,464108,059
1694,23398,170
32183,450228,932
64189,332187,497
128101,391105,723
256134,423133,957

6.7 Indexed List

Code Shape

struct IndexNode {
    int value = 0;
    int next = -1;
};

bool searchIndexed(const std::vector<IndexNode>& nodes, int key) {
    for (int index = 0; index != -1; index = nodes[static_cast<std::size_t>(index)].next) {
        if (nodes[static_cast<std::size_t>(index)].value == key) {
            return true;
        }
    }

    return false;
}

Cache-Line Access Visual

nodes vector:
[CL0: idx0 value,next | idx1 value,next | idx2 value,next | idx3 value,next]

CPU -> nodes[0] -> nodes[4] -> nodes[1]

reference point:
link is a compact integer, but traversal is still dependent

Interpretation

FieldNotes
ExpectationIndexed nodes should reduce hot-node footprint, but still behave like a dependent chase.
Measured resultn = 200,000, queries = 500, 3 repetitions: detailed rows below.
What happenedIndexed nodes cut footprint but were slower than arena pointer nodes in the listed high-level query shape.
TakeawayCompact representation helps memory footprint, but it does not remove address dependency.
Evidence statusSupported by timing and byte counts.

Detailed rows

n = 200,000, queries = 500, 3 repetitions:

Variantns/queryBytes/searchable key
indexed list949,7848.0

Hit-position distribution:

PositionOpsns/queryVisual
early hit1,000,0002.56#
middle hit500771,964##########
late hit5001,545,820####################
miss5001,543,800####################

6.8 Hot/Cold List

Code Shape

struct HotNode {
    int key = 0;
    int next = -1;
};

struct ColdNode {
    int value = 0;
    std::array<char, PayloadSize> payload{};
};

bool searchHot(const std::vector<HotNode>& hot, int key) {
    for (int index = 0; index != -1; index = hot[static_cast<std::size_t>(index)].next) {
        if (hot[static_cast<std::size_t>(index)].key == key) {
            return true;
        }
    }

    return false;
}

Cache-Line Access Visual

fat node search:
CPU -> [CL: key next payload payload] -> [CL: payload payload ...]

hot/cold search:
CPU -> [CL hot: key next | key next | key next | key next]
only on match -> [cold payload]

reference point:
negative search touches metadata, not payload

Interpretation

FieldNotes
ExpectationHot/cold split should help as payload grows.
Measured resultn = 100,000, queries = 5,000, 5 repetitions: detailed rows below.
What happenedFat-node search cost grew with payload. Hot/cold search stayed nearly flat.
TakeawaySearch should touch searchable metadata first and cold payload only after a match.
Evidence statusSupported by timing.

Detailed rows

n = 100,000, queries = 5,000, 5 repetitions:

Payloadfat node ns/queryhot/cold ns/queryImprovement
16 B827,299559,8201.5x
64 B1,504,184559,6202.7x
128 B2,615,491559,8494.7x
256 B4,265,747559,8867.6x

6.9 Multi-Lane List

Code Shape

std::uint64_t found = 0;

for (std::size_t q = 0; q < queries.size(); q += static_cast<std::size_t>(lanes)) {
    for (int lane = 0; lane < lanes; ++lane) {
        found += searchList(heads[static_cast<std::size_t>(lane)], queries[q + lane]);
    }
}

Cache-Line Access Visual

lane 0: CPU -> [? node] -> [? node] -> [? node]
lane 1: CPU -> [? node] -> [? node] -> [? node]
lane 2: CPU -> [? node] -> [? node] -> [? node]
lane 3: CPU -> [? node] -> [? node] -> [? node]

reference point:
one list remains dependent; many lists expose throughput parallelism

Interpretation

FieldNotes
ExpectationInterleaving independent searches should expose memory-level parallelism.
Measured resultn = 200,000, queries = 250, 3 repetitions: detailed rows below.
What happenedPer-query time dropped sharply as lanes increased.
TakeawayMLP improves throughput across independent list searches, not the latency of one dependent traversal.
Evidence statusSupported by timing.

Detailed rows

n = 200,000, queries = 250, 3 repetitions:

Lanesns/queryVisual
17,129,608####################
23,419,579##########
41,604,336#####
8611,275##
16250,894#

6.10 Bloom Chunked List

Code Shape

bool searchBloomChunked(const BloomChunkNode<ChunkSize>* head, int key) {
    for (auto* node = head; node != nullptr; node = node->next) {
        if ((node->bloom & bloomMask(key)) == 0U) {
            continue;
        }

        for (int i = 0; i < node->count; ++i) {
            if (node->values[i] == key) {
                return true;
            }
        }
    }

    return false;
}

Cache-Line Access Visual

negative lookup:
CPU -> [CL chunk metadata: bloom bits]
       bit absent -> skip [values] -> next chunk

maybe hit:
CPU -> [CL chunk metadata: bloom bits]
       bit present -> [CL values: * * * *]

reference point:
cheap metadata can avoid scanning values

Interpretation

FieldNotes
ExpectationBloom filtering should help negative lookups by skipping local chunk scans.
Measured resultn = 200,000, queries = 2,000, 3 repetitions: detailed rows below.
What happenedThe hit-ratio sweep improved as generated hits became more common and early in the benchmark shape.
TakeawayBloom-per-chunk needs hit-ratio and hit-position separated before making a broad claim.
Evidence statusPartially supported.

Detailed rows

n = 200,000, queries = 2,000, 3 repetitions:

Hit rations/queryVisual
0%314,063####################
10%282,604##################
50%157,390##########
100%1,543#

6.11 Compare-All And Mutation Rows

Code Shape

sameQueries();
run(heapList);
run(arenaList);
run(prefetchList);
run(chunkedList);

// mutation timing is separated from search-to-position cost
insertAfter(cursor);
deleteAfter(cursor);
spliceAfter(source, dest);

Cache-Line Access Visual

compare-all:
same query stream -> heap / arena / prefetch / chunked

mutation:
[cursor] -> [new_node] -> [old_next]
[cursor] -----------> [next_after_delete]

reference point:
lookup-friendly layout may move cost to update paths

Interpretation

FieldNotes
ExpectationChunking should win lookup-heavy rows. Vector-backed mutation should avoid heap-allocation overhead.
Measured resultCompletion compare-all, n = 200,000, queries = 500, 3 repetitions: detailed rows below.
What happenedChunked lookup was strongest in compare-all. Heap allocation was expensive in mutation. Indexed mutation rows were cheapest in this pass.
TakeawayList layout decisions should be evaluated on both lookup and mutation paths.
Evidence statusSupported for measured append/insert/delete/splice shapes; arbitrary middle-position search cost is separate.

Detailed rows

Completion compare-all, n = 200,000, queries = 500, 3 repetitions:

Variantns/queryVisual
heap randomized11,208,462####################
prefetch next963,053##
arena sequential667,221#
chunked 64257,240#

Insert/delete/splice mutation, operations = 100,000, 5 repetitions:

LayoutOperationns/opBytes/nodeVisual
pointer arenainsert after, preallocated32.79616####
pointer heapinsert after, new177.69616##################
pointer arenadelete after unlink112.19916###########
pointer arenasplice after relink194.45016####################
indexedinsert after, preallocated13.7288#
indexeddelete after free list27.2538###
indexedsplice after relink39.5308####

7. Binary-Search Code Shapes

Family question:

Can fewer global jumps beat fewer comparisons?

Code Shape

bool linearContains(const int* values, std::size_t n, int key) {
    for (std::size_t i = 0; i < n; ++i) {
        if (values[i] == key) {
            return true;
        }
    }

    return false;
}

Cache-Line Access Visual

small array:
CPU -> [CL0: * * * * * * * * * * * * * * * *]

reference point:
for tiny arrays, the full scan can stay in one or a few cache lines

Interpretation

FieldNotes
ExpectationLinear search should win early-hit rows for tiny arrays; binary search should win misses as n grows.
Measured resultqueries = 1,000,000, 5 repetitions: detailed rows below.
What happenedLinear search won early hits. Binary search won misses once n grew.
TakeawaySmall local search policy depends on hit position and miss rate.
Evidence statusSupported by timing.

Detailed rows

queries = 1,000,000, 5 repetitions:

nlinear early hitbinary early hitlinear missbinary miss
43.88215.47817.71717.401
83.85225.12627.93818.645
163.87028.82248.12120.227
323.85733.97888.82323.728
643.90435.470172.84328.091

Code Shape

bool binaryContains(const std::vector<int>& values, int key) {
    std::size_t left = 0;
    std::size_t right = values.size();

    while (left < right) {
        const std::size_t mid = left + (right - left) / 2U;

        if (values[mid] == key) {
            return true;
        }

        if (values[mid] < key) {
            left = mid + 1U;
        } else {
            right = mid;
        }
    }

    return false;
}

Cache-Line Access Visual

CPU -> [CL mid: * . . .]
       -> [CL quarter: * . . .]
       -> [CL eighth: * . . .]

reference point:
few comparisons, but jumpy memory access

Interpretation

FieldNotes
ExpectationNormal binary search should be a strong comparison-count baseline but can lose to layouts with better memory locality.
Measured resultLarge-array compare-all, n = 1,000,000, queries = 1,000,000, 5 repetitions: detailed rows below.
What happenedNormal binary was slower than Eytzinger and blocked layouts in the large-array rows shown here.
TakeawayMinimizing comparisons does not necessarily minimize memory stalls.
Evidence statusSupported by timing.

Detailed rows

Large-array compare-all, n = 1,000,000, queries = 1,000,000, 5 repetitions:

Variantns/queryVisual
normal binary591.551####################
Eytzinger326.475###########

Completion compare-all:

Variantns/queryVisual
binary738.569##############
Eytzinger328.340######

7.3 Branchless Binary

Code Shape

bool branchlessContains(const std::vector<int>& values, int key) {
    std::size_t base = 0;
    std::size_t len = values.size();

    while (len > 1U) {
        const std::size_t half = len / 2U;
        const std::size_t mid = base + half;
        const bool move = values[mid] <= key;
        base = move ? mid : base;
        len -= half;
    }

    return !values.empty() && values[base] == key;
}

Cache-Line Access Visual

CPU -> [CL mid: *]
       compute move flag
       update base without branch-shaped control flow
       -> [CL next mid: *]

reference point:
branch behavior changes; memory jumps remain

Interpretation

FieldNotes
ExpectationBranchless search should help when branch misses matter, but cannot remove jumpy memory access.
Measured resultDetailed rows below; benchmark numbers unchanged.
What happenedBranchless timing beat normal binary in the displayed compare-all row. The counter row is program-level, not per variant.
TakeawayBranch behavior is material, but per-variant counter isolation is still needed.
Evidence statusPartially supported by timing and representative counters.

Detailed rows

Variantns/queryVisual
branchless338.353###########
normal binary591.551####################

Representative program-level branch counter row:

FamilyStatusBranchesBranch missesMiss rate
binary searchmeasured with Linux perf stat381,401,89644,259,94611.60%

Code Shape

bool blockedContains(const BlockedArray& array, int key) {
    auto it = std::lower_bound(array.blockMax.begin(), array.blockMax.end(), key);
    if (it == array.blockMax.end()) {
        return false;
    }

    const int block = static_cast<int>(it - array.blockMax.begin());
    const int first = block * array.blockSize;
    const int last = std::min(first + array.blockSize, static_cast<int>(array.values.size()));

    for (int i = first; i < last; ++i) {
        if (array.values[static_cast<std::size_t>(i)] == key) {
            return true;
        }
    }

    return false;
}

Cache-Line Access Visual

metadata:
CPU -> [CL blockMax: * . . .] -> choose block

local scan:
CPU -> [CL block: * * * * * * * *]
       [CL block: * * * * * * * *]

reference point:
one global decision leads to local sequential work

Interpretation

FieldNotes
ExpectationBlocked search should have a sweet spot: small blocks cost metadata search, large blocks cost local scanning.
Measured resultBlock-size sweep: detailed rows below.
What happenedBlock size 64 was best in this pass.
TakeawayThe correct block size is empirical. It balances global jumps against local scan work.
Evidence statusSupported by timing.

Detailed rows

Block-size sweep:

Block sizens/queryVisual
8568.714#############
16496.838############
32480.809###########
64472.178###########
128529.335############
256615.460##############
512846.829####################

Code Shape

for (int blockSize : {16, 32, 64}) {
    auto index = buildBlockedArray(values, blockSize);
    runQueries(index);
}

Cache-Line Access Visual

int32 values:
block 16 -> [CL0: * * * * * * * * * * * * * * * *]
block 32 -> [CL0: * * * * ...] [CL1: * * * * ...]
block 64 -> [CL0] [CL1] [CL2] [CL3]

reference point:
test one, two, and four nearby cache lines

Interpretation

FieldNotes
ExpectationOne-cache-line blocks may be good, but larger nearby blocks can win if they reduce more global jumps.
Measured resultDetailed rows below; benchmark numbers unchanged.
What happenedThe 64 value block was best in this pass.
Takeaway”One cache line” is not automatically the optimal block size.
Evidence statusSupported by timing.

Detailed rows

Block sizens/queryVisual
16499.228####################
32481.549###################
64471.386##################

Code Shape

bool eytzingerContains(const std::vector<int>& eytzinger, int key) {
    std::size_t index = 1;

    while (index < eytzinger.size()) {
        if (eytzinger[index] == key) {
            return true;
        }

        index = key < eytzinger[index] ? index * 2U : index * 2U + 1U;
    }

    return false;
}

Cache-Line Access Visual

array-backed tree:

index:  1
       / \
      2   3
     / \ / \
    4  5 6  7

CPU -> [CL: eyt[1] eyt[2] eyt[3] ...] -> child index by arithmetic

reference point:
tree shape without pointer nodes

Interpretation

FieldNotes
ExpectationEytzinger should improve large read-only search by changing physical layout while keeping compact array storage.
Measured resultLarge-array compare-all: detailed rows below.
What happenedEytzinger was the fastest large-array row in the displayed passes.
TakeawayPhysical layout can matter as much as search logic.
Evidence statusSupported by timing.

Detailed rows

Large-array compare-all:

Variantns/queryVisual
Eytzinger326.475###########
normal binary591.551####################

Completion compare-all:

Variantns/queryVisual
Eytzinger328.340######
binary738.569##############

7.7 Eytzinger Prefetch

Code Shape

if (prefetchDistance >= 1) {
    const std::size_t child = index * 2U;
    if (child < eytzinger.size()) {
        prefetchRead(&eytzinger[child]);
    }
}

Cache-Line Access Visual

CPU at eyt[i]
  P child       -> eyt[2*i]
  P grandchild  -> eyt[4*i]
CPU compares eyt[i]
CPU chooses next child

reference point:
future addresses are arithmetic, but not all prefetched paths are used

Interpretation

FieldNotes
ExpectationPrefetch may help when distance is right and hurt when it is too near, too far, or polluting.
Measured resultDetailed rows below; benchmark numbers unchanged.
What happenedChild prefetch was worse. Grandchild prefetch was slightly faster than the no-prefetch row in this pass.
TakeawayPrefetch distance is a parameter, not a feature toggle.
Evidence statusPartially supported by timing.

Detailed rows

Variantns/queryVisual
Eytzinger314.106#############
prefetch child494.878####################
prefetch grandchildren312.936#############

7.8 Implicit B-Tree

Code Shape

int childSlotLinear(const BTreeNodeShape& node, int key) {
    int slot = 0;

    while (slot < node.count && key > node.keys[static_cast<std::size_t>(slot)]) {
        ++slot;
    }

    return slot;
}

Cache-Line Access Visual

binary tree:
CPU -> [node] -> [node] -> [node] -> [node]

implicit B-tree:
CPU -> [CL node: k0 k1 k2 k3 k4 k5 k6 k7 | child offsets]
       -> child node

reference point:
more local comparisons, fewer levels

Interpretation

FieldNotes
ExpectationHigher fanout should reduce depth, but too much inside-node search can cost more.
Measured resultDetailed rows below; benchmark numbers unchanged.
What happenedLinear inside-node scan won across tested fanouts. 64 keys was the fastest linear inside-node row.
TakeawayLocal sequential work can beat branchier inside-node search.
Evidence statusSupported by timing.

Detailed rows

Node keysLinear inside-node ns/queryBinary inside-node ns/queryFaster
4662.335735.386linear
8605.165736.048linear
16511.993734.992linear
32508.562743.072linear
64498.007739.088linear

Binary-search locality sweep

7.9 Power-Of-Two Padding

Code Shape

for (std::size_t n : {pow2, pow2 + 123U, pow2 + 257U, pow2 + 4099U}) {
    auto values = makeSortedEvenKeys(n);
    runSearchVariants(values, queries);
}

Cache-Line Access Visual

n = 2^20:
CPU -> regular midpoint offsets -> possible repeated set mapping !

n = 2^20 + 257:
CPU -> shifted midpoint offsets -> different set mapping

reference point:
same algorithm, changed address placement

Interpretation

FieldNotes
ExpectationPower-of-two sizes can expose address-mapping sensitivity.
Measured resultNormal binary: detailed rows below.
What happenedThe padded 2^20 + 257 row improved normal binary substantially. Eytzinger was less sensitive in this table.
TakeawayArray size can change search speed even when the algorithm is identical.
Evidence statusSupported by timing; counter confirmation is missing.

Detailed rows

Normal binary:

nns/queryVisual
1,048,576750.393###############
1,048,699649.605#############
1,048,833628.107#############
1,052,675609.551############
2,097,1521,000.268####################
2,097,275846.635#################

Eytzinger:

nns/queryVisual
1,048,576336.239###############
1,048,699352.473###############
1,048,833347.754###############
1,052,675335.315##############
2,097,152456.035####################
2,097,275441.157###################

7.10 Query Ordering

Code Shape

std::sort(queries.begin(), queries.end());
runSearchVariants(values, queries);

Cache-Line Access Visual

random queries:
q900 -> q17 -> q700 -> q3
CPU  -> [CL far] -> [CL far] -> [CL far]

clustered queries:
q100 -> q101 -> q104 -> q110
CPU  -> [CL region A] -> [CL region A] -> [CL nearby]

reference point:
same data structure, different temporal locality

Interpretation

FieldNotes
ExpectationSorted or clustered queries should reuse nearby array/tree regions.
Measured resultn = 1,000,000, queries = 1,000,000, 5 repetitions: detailed rows below.
What happenedClustered queries made normal binary much faster than random queries.
TakeawayWork ordering can be a memory optimization without changing the data structure.
Evidence statusSupported by timing.

Detailed rows

n = 1,000,000, queries = 1,000,000, 5 repetitions:

Query orderbinary ns/queryEytzinger ns/queryblocked ns/query
random595.078328.620475.115
sorted184.750207.668179.653
clustered80.564171.971133.153
bucketed high bits185.036207.205179.505

7.11 Duplicate-Key And Lower-Bound Behavior

Code Shape

auto lower = std::lower_bound(values.begin(), values.end(), key);
auto upper = std::upper_bound(values.begin(), values.end(), key);
return std::make_pair(lower, upper);

Cache-Line Access Visual

membership:
CPU -> find any equal key

lower_bound:
CPU -> first key >= query

equal_range:
CPU -> lower bound path + upper bound path

reference point:
same sorted array, different query contract

Interpretation

FieldNotes
ExpectationMembership timing should not be assumed to represent lower-bound or range queries.
Measured resultn = 1,048,576, queries = 1,000,000, 3 repetitions: detailed rows below.
What happenedDuplicate-heavy rows were faster in this benchmark shape, but range queries still have their own cost.
TakeawayDefine the exact search contract before comparing layouts.
Evidence statusSupported by timing.

Detailed rows

n = 1,048,576, queries = 1,000,000, 3 repetitions:

Duplicate runbinary search membershiplower-bound membershiplower-bound positionequal-range count
1729.078713.280726.278743.355
4715.141706.743707.536737.183
16700.384689.846686.602724.461
64644.760632.632636.485678.741

8. Hash-Map Code Shapes

Family question:

How much does O(1) lookup depend on collision layout and metadata placement?

8.1 Chained Heap

Code Shape

struct ChainNode {
    int key;
    int value;
    ChainNode* next;
};

bool find(int key, int& out) const {
    ChainNode* node = buckets[hash(key) % bucketCount];
    while (node != nullptr) {
        if (node->key == key) {
            out = node->value;
            return true;
        }
        node = node->next;
    }
    return false;
}

Cache-Line Access Visual

CPU -> [bucket pointer]
       -> [? heap node: key value next]
       -> [? heap node: key value next]

reference point:
random bucket plus pointer chain

Interpretation

FieldNotes
ExpectationHeap chaining should be the pointer-chasing baseline.
Measured resultimplemented but not shown as a separate public table in this article.
What happenedThe displayed compare-all row uses chained arena rather than chained heap.
TakeawayHeap chaining needs a direct row before it can support a public timing claim here.
Evidence statusImplemented but not shown.

8.2 Chained Arena

Code Shape

std::vector<ChainNode> arena;
std::vector<int> bucketHead;

int index = bucketHead[bucket];
while (index != -1) {
    const ChainNode& node = arena[static_cast<std::size_t>(index)];
    if (node.key == key) {
        return true;
    }
    index = node.nextIndex;
}

Cache-Line Access Visual

CPU -> [bucket index]
       -> [CL arena: node0 node1 node2 node3]
       -> maybe next arena index

reference point:
collision chain remains, but nodes are compact

Interpretation

FieldNotes
ExpectationArena chaining should beat scattered chaining and may beat open addressing if chains stay short and compact.
Measured resultn = 524,288, queries = 500,000, 3 repetitions: detailed rows below.
What happenedChained arena was fastest in this implementation and workload.
TakeawayOpen addressing does not automatically win. Implementation shape and compactness matter.
Evidence statusSupported by timing.

Detailed rows

n = 524,288, queries = 500,000, 3 repetitions:

Variantns/queryVisual
chained arena69.551###########
Robin Hood95.566##############
std::unordered_map100.008###############
tag split 8108.137################
linear probing118.353##################
bucketized 8120.731##################

8.3 Linear Probing

Code Shape

std::size_t i = hash(key) & mask;

while (table[i].occupied) {
    if (table[i].key == key) {
        return true;
    }
    i = (i + 1U) & mask;
}

return false;

Cache-Line Access Visual

CPU -> [? bucket start]
       -> [CL: slot0 slot1 slot2 slot3]
       -> [CL next: slot4 slot5 slot6 slot7]

reference point:
random start, then local probe scan

Interpretation

FieldNotes
ExpectationLinear probing should be good at moderate load factor and degrade sharply at high load factor.
Measured resultLoad-factor sweep: detailed rows below.
What happenedProbe tails grew much faster than medians at high load. At 0.95 load, p99 probes exceeded 1200.
TakeawayHash tables need load-factor and tail-probe measurements, not only average lookup time.
Evidence statusSupported by timing and representative counters.

Detailed rows

Load-factor sweep:

Target loadInserted keysns/queryp50 probesp95 probesp99 probes
0.30157,286138.442135
0.50262,144175.6061611
0.70367,001227.82421632
0.85445,644297.676363144
0.95498,073676.31775311207

Representative perf stat row:

EventCount
cycles725,366,509
instructions1,174,526,099
cache references10,805,729
cache misses7,064,581
LLC load misses6,815,291
dTLB load misses2,542,337
branches284,527,336
branch misses9,065,424

Hash load factor probe tail

8.4 Robin Hood Hashing

Code Shape

while (slotOccupied(i)) {
    if (table[i].key == key) {
        return true;
    }
    if (probeDistance(i) < currentProbeDistance) {
        return false;
    }
    i = (i + 1U) & mask;
}

Cache-Line Access Visual

CPU -> [? bucket]
       -> [CL: slot probe=0 | slot probe=1 | slot probe=2]

reference point:
local probes remain, but distance metadata controls tails

Interpretation

FieldNotes
ExpectationRobin Hood should reduce probe-tail variance compared with basic linear probing.
Measured resultRobin Hood row: detailed rows below.
What happenedRobin Hood beat linear probing in the compare-all row and had low p99 probes in the tail row.
TakeawayProbe-tail distribution is as important as mean lookup time.
Evidence statusSupported directionally.

Detailed rows

Robin Hood row:

Variantns/queryp50 probesp95 probesp99 probes
Robin Hood134.260135

Compare-all row:

Variantns/queryVisual
Robin Hood95.566##############
linear probing118.353##################

8.5 Bucketized Hash

Code Shape

Bucket& bucket = buckets[bucketIndex(key)];

for (int slot = 0; slot < BucketSize; ++slot) {
    if (bucket.occupied[slot] && bucket.keys[slot] == key) {
        return true;
    }
}

Cache-Line Access Visual

bucket size 4:
CPU -> [CL bucket: slot0 slot1 slot2 slot3]

bucket size 16:
CPU -> [CL bucket part A] [CL bucket part B] [CL bucket part C] ...

reference point:
one bucket access tests several candidates

Interpretation

FieldNotes
ExpectationBucket size should have a sweet spot: too small means more overflow, too large means too much local scan.
Measured resultDetailed rows below; benchmark numbers unchanged.
What happenedBucket size 4 was best in this pass.
TakeawayBucket size is the hash-map version of block size: one memory access should buy enough comparisons, but not too many.
Evidence statusSupported by timing.

Detailed rows

Bucket sizens/queryVisual
491.262###########
8120.048###############
16164.193####################

8.6 Dense ID Array

Code Shape

int get(std::uint32_t denseId) {
    return values[denseId];
}

Cache-Line Access Visual

CPU -> [CL values: v0 v1 v2 v3 v4 v5 v6 v7]
       direct index

reference point:
hashing and collision handling are removed

Interpretation

FieldNotes
ExpectationDense IDs should be the upper-bound baseline when the key domain allows direct indexing.
Measured resultDetailed rows below; benchmark numbers unchanged.
What happenedDense ID lookup was much cheaper than hash-table rows.
TakeawayWhen possible, remove hashing from the hot path rather than optimizing it.
Evidence statusSupported by timing and byte accounting.

Detailed rows

Variantns/query
dense ID array11.384

Bytes-per-lookup row:

Variantns/queryAvg probesp95 probesEstimated bytes/lookup
dense ID array13.1561.00014.000

8.7 Tag Split Hash

Code Shape

auto tag = fingerprint(hash);
auto& meta = metadata[bucket];

for (int i = 0; i < 16; ++i) {
    if (meta.tags[i] == tag) {
        const Entry& entry = entries[meta.entryIndex[i]];
        if (entry.key == key) {
            return true;
        }
    }
}

return false;

Cache-Line Access Visual

CPU -> [CL tag metadata: t0 t1 t2 ... t15]
       tag mismatch -> stop
       tag match    -> [CL full entry: key value]

reference point:
metadata can reject many misses before loading payload

Interpretation

FieldNotes
ExpectationTag split should help misses and large-payload lookups, but tag width can trade false positives against metadata cost.
Measured resultTag-size sweep: detailed rows below.
What happenedTag width differences were small. Tag split did not win broadly at 0% hits, but helped in 50% and 100% rows in this implementation.
TakeawayFingerprint metadata needs hit/miss-ratio measurements. It is not automatically faster.
Evidence statusPartially supported.

Detailed rows

Tag-size sweep:

Tag bitsns/queryp50 probesp95 probesp99 probesVisual
7144.311134###################
8146.515134###################
12147.959134####################
16145.981134###################

Hit/miss ratio sweep:

VariantHit rations/queryp95 probesp99 probes
linear0%161.529813
tag80%175.518813
tag160%170.531813
linear50%174.359611
tag850%160.616611
tag1650%162.057611
linear100%117.69347
tag8100%109.96747
tag16100%106.24247

8.8 Hot/Cold Hash

Code Shape

struct EntryHot {
    std::uint64_t key;
    std::uint32_t valueIndex;
    State state;
};

if (hot[i].key == key) {
    return values[hot[i].valueIndex];
}

Cache-Line Access Visual

normal entry:
CPU -> [CL: key state value payload payload]

split entry:
CPU -> [CL hot: key state valueIndex]
only on match -> [cold payload]

reference point:
payload stays cold during failed lookup

Interpretation

FieldNotes
ExpectationHot/cold split should help once value payload is large enough.
Measured resultDetailed rows below; benchmark numbers unchanged.
What happenedSplit layout lost at 16 B and won for larger payloads.
TakeawayHot/cold split is conditional. It helps when payload is large and not always consumed.
Evidence statusSupported by timing.

Detailed rows

PayloadNormal entry ns/querySplit hot/cold ns/query
16 B177.117191.221
64 B211.278181.337
128 B222.311178.956
256 B236.144184.487

8.9 Batched Prefetch Lookup

Code Shape

for (std::size_t i = 0; i < keys.size(); i += Batch) {
    for (std::size_t j = i; j < i + Batch; ++j) {
        bucket[j - i] = table.bucketIndex(keys[j]);
        prefetchRead(&table.entries[bucket[j - i]]);
    }

    for (std::size_t j = i; j < i + Batch; ++j) {
        found += table.findFromBucket(keys[j], bucket[j - i]);
    }
}

Cache-Line Access Visual

phase 1:
CPU computes bucket0 bucket1 bucket2 bucket3
P -> [CL bucket0] [CL bucket1] [CL bucket2] [CL bucket3]

phase 2:
CPU uses bucket0 bucket1 bucket2 bucket3

reference point:
batching creates independent future addresses

Interpretation

FieldNotes
ExpectationBatching should expose memory-level parallelism. Too-large batches may add overhead or pollution.
Measured resultSingle-thread: detailed rows below.
What happenedBatch 8 was best in the single-thread row.
TakeawayBatching can improve lookup throughput, but batch size is workload-specific.
Evidence statusSupported by timing.

Detailed rows

Single-thread:

Batch sizens/queryVisual
1137.675####################
2122.769##################
4108.192################
899.972###############
16106.113###############
32110.872################

8.10 Set-Conflict Hash

Code Shape

std::size_t indexIdentity(std::uint64_t key) {
    return key & mask;
}

std::size_t indexMixed(std::uint64_t key) {
    return splitmix64(key) & mask;
}

Cache-Line Access Visual

identity + patterned keys:
CPU -> [CL set A] -> [CL set A] -> [CL set A] !

mixed hash:
CPU -> [CL set A] -> [CL set C] -> [CL set F]

reference point:
hash choice affects table placement and cache-set pressure

Interpretation

FieldNotes
ExpectationHash quality and table size should affect lookup time, especially with patterned keys.
Measured resultSequential keys: detailed rows below.
What happenedIdentity hashing was not worst on this sequential-key workload.
TakeawayConflict experiments must include realistic bad key patterns, not only sequential keys.
Evidence statusPartially supported.

Detailed rows

Sequential keys:

Table sizeHashns/queryVisual
1,048,576identity36.584###########
1,048,576splitmix6467.638###################
1,048,576multiplicative73.249####################
1,048,699identity36.638###########
1,048,699splitmix6465.814##################
1,049,585identity35.920##########
1,049,585splitmix6464.511##################

8.11 Tombstone Degradation

Code Shape

eraseSomeKeys();       // leaves tombstones
measureLookup();

rehashWithoutDeadSlots();
measureLookup();

Cache-Line Access Visual

before repair:
CPU -> [CL: filled tombstone filled tombstone empty]

after rehash:
CPU -> [CL: filled filled empty . .]

reference point:
tombstones extend probe chains

Interpretation

FieldNotes
ExpectationTombstones should degrade lookup by lengthening probe sequences. Rehash repair should reduce tails.
Measured resultOriginal benchmark status: detailed rows below.
What happenedRehash repair reduced both time and probe tails.
TakeawayOpen-addressed hash tables must account for table aging, not just fresh construction.
Evidence statusReplacement measured; original benchmark blocked by correctness.

Detailed rows

Original benchmark status:

hashmap_tombstone_degradation: blocked by correctness

Focused replacement row:

VariantTombstonesns/queryp50 probesp95 probesp99 probes
with tombstones72,447212.91311018
rehashed0156.0241611

8.12 Bytes Per Lookup

Code Shape

bytes += probes * bytesPerHotSlot;
if (keyMatch) {
    bytes += bytesPerPayload;
}

Cache-Line Access Visual

dense ID:
CPU -> [CL values: *]                 about 4 B useful payload

tag split:
CPU -> [CL tags: * * * *] -> maybe [full entry]

bucketized:
CPU -> [CL bucket with many slots]     more bytes per probe

reference point:
estimated bytes and timing may rank layouts differently

Interpretation

FieldNotes
ExpectationEstimated bytes touched should explain part of lookup time but not all of it.
Measured result50% hits: detailed rows below.
What happenedDense ID touched the fewest bytes and was fastest. Bucketized had low probe count but high estimated bytes.
TakeawayProbe count alone is incomplete. Bytes touched and cache-line footprint also matter.
Evidence statusSupported by software byte accounting and timing.

Detailed rows

50% hits:

Variantns/queryAvg probesp95 probesEstimated bytes/lookup
dense ID array13.1561.00014.000
Robin Hood108.0141.625339.001
tag split 8125.5731.277326.323
bucketized 8135.5931.0011136.074
linear probing143.1451.994647.846
hot/cold split168.5901.994651.846

Hash bytes touched vs time

9. Multithreaded Code Shapes

Family question:

What changes when cache lines become ownership units shared across cores?

9.1 False Sharing

Code Shape

for (std::uint64_t i = 0; i < steps; ++i) {
    counters[threadId].value++;
}

Cache-Line Access Visual

bad adjacent counters:
[CL0: T0 counter | T1 counter | T2 counter | T3 counter]
       ! ownership moves between cores on writes

reference point:
threads write different variables but share one cache line

Interpretation

FieldNotes
ExpectationAdjacent counters should scale poorly because writes move cache-line ownership between cores.
Measured resultsteps = 5,000,000, 3 repetitions: detailed rows below.
What happenedThe adjacent row was worse than padded at 8 threads, but local write-once was much stronger than both.
TakeawayAvoid repeated shared writes inside hot loops.
Evidence statusSupported by timing; perf c2c evidence is blocked by tooling.

Detailed rows

steps = 5,000,000, 3 repetitions:

ThreadsBad adjacent ns/incPadded ns/incLocal write-once ns/inc
17.6427.6690.047
23.9583.8260.022
42.0201.9260.015
82.3681.7370.020

Counter ownership ladder

9.2 Padded Counters

Code Shape

struct alignas(64) PaddedCounter {
    std::uint64_t value = 0;
};

counters[threadId].value++;

Cache-Line Access Visual

padded counters:
[CL0: T0 counter]
[CL1: T1 counter]
[CL2: T2 counter]
[CL3: T3 counter]

reference point:
each writer owns a separate line

Interpretation

FieldNotes
ExpectationPadding should reduce false sharing when threads repeatedly publish per-thread counters.
Measured result8-thread row: detailed rows below.
What happenedPadding improved the 8-thread counter row.
TakeawayPadding is useful when repeated writes to result slots are unavoidable.
Evidence statusSupported by timing.

Detailed rows

8-thread row:

Variantns/inc
bad adjacent counters2.368
padded counters1.737

9.3 Local Write-Once Counters

Code Shape

std::uint64_t local = 0;

for (auto query : myQueries) {
    local += search(query);
}

results[threadId].value = local;

Cache-Line Access Visual

hot loop:
T0 -> register/local stack only
T1 -> register/local stack only

end:
T0 -> [CL result0]
T1 -> [CL result1]

reference point:
shared write happens once

Interpretation

FieldNotes
ExpectationLocal accumulation should beat both adjacent and padded counters when the computation can reduce at the end.
Measured result8-thread row: detailed rows below.
What happenedLocal write-once was much cheaper than both shared-counter layouts.
TakeawayMake reads shared and writes private.
Evidence statusSupported by timing.

Detailed rows

8-thread row:

Variantns/inc
bad adjacent counters2.368
padded counters1.737
local write-once0.020

9.4 Read-Only Binary Search Scaling

Code Shape

std::size_t begin = threadId * queries.size() / threadCount;
std::size_t end = (threadId + 1U) * queries.size() / threadCount;

for (std::size_t i = begin; i < end; ++i) {
    local += contains(sharedIndex, queries[i]);
}

Cache-Line Access Visual

shared data:
all threads -> read-only [array/tree cache lines]

queries:
T0 -> query range 0
T1 -> query range 1
T2 -> query range 2

reference point:
shared reads, private counters

Interpretation

FieldNotes
ExpectationRead-only search should scale until memory latency, bandwidth, topology, or frequency effects dominate.
Measured resultn = 1,000,000, queries = 1,000,000, 3 repetitions: detailed rows below.
What happenedEytzinger stayed the fastest 8-thread row. Some speedups were above ideal, so exact speedup values remain provisional.
TakeawayUse these rows for direction, not as fixed-frequency scaling claims.
Evidence statusSupported directionally; fixed-frequency and SMT-off reruns are missing.

Detailed rows

n = 1,000,000, queries = 1,000,000, 3 repetitions:

Variant1 thread ns/query4 threads ns/query8 threads ns/query8-thread speedup
binary751.195203.478124.3096.043x
branchless570.114217.112112.6395.061x
Eytzinger324.58982.79066.6754.868x
blocked493.224124.72284.5135.836x

9.5 Read-Only Hash Scaling

Code Shape

for (std::size_t i = begin; i < end; ++i) {
    local += table.find(queries[i]);
}

results[threadId].value = local;

Cache-Line Access Visual

T0 -> [? bucket] -> local probes
T1 -> [? bucket] -> local probes
T2 -> [? bucket] -> local probes

shared table is read-only
per-thread result is private or padded

Interpretation

FieldNotes
ExpectationRead-only hash lookup should scale until memory resources saturate.
Measured resultn = 500,000, queries = 1,000,000, 3 repetitions: detailed rows below.
What happenedRobin Hood was the fastest 8-thread hash-layout row.
TakeawayThreaded lookup should be measured per layout; single-thread ranking may not directly transfer.
Evidence statusSupported directionally.

Detailed rows

n = 500,000, queries = 1,000,000, 3 repetitions:

Variant1 thread ns/query4 threads ns/query8 threads ns/query8-thread speedup
linear probing108.10527.52219.3925.575x
Robin Hood88.70223.63617.0995.187x
bucketized 8120.98930.95631.0863.892x
tag split 879.19321.79620.7223.822x
hot/cold split122.97533.37126.9474.564x

Read-only search scaling

9.6 Chunked-List Scaling

Code Shape

std::size_t firstChunk = threadId * chunks.size() / threadCount;
std::size_t lastChunk = (threadId + 1U) * chunks.size() / threadCount;

for (std::size_t c = firstChunk; c < lastChunk; ++c) {
    scanChunk(chunks[c]);
}

Cache-Line Access Visual

T0 -> chunks [0..K/T)
T1 -> chunks [K/T..2K/T)
T2 -> chunks [2K/T..3K/T)

reference point:
threads scan disjoint ranges instead of all starting at the same head

Interpretation

FieldNotes
ExpectationChunk-range partitioning should scale better than all threads scanning from the same list head.
Measured resultn = 1,000,000, chunk size 64, 3 repetitions: detailed rows below.
What happenedScaling improved through 4 threads and then got worse at 8 threads.
TakeawayPartitioning avoids duplicate work, but memory throughput and scheduling still limit scaling.
Evidence statusSupported directionally.

Detailed rows

n = 1,000,000, chunk size 64, 3 repetitions:

Threadsns/valueSpeedupEfficiency
11.2281.000x1.000
20.7081.734x0.867
40.5012.453x0.613
80.9501.293x0.162

9.7 Batched Hash Scaling

Code Shape

for (std::size_t i = begin; i < end; i += Batch) {
    prefetchBatch(i);
    local += lookupBatch(i);
}

Cache-Line Access Visual

inside each thread:
P -> [bucket0] [bucket1] [bucket2] [bucket3]
then lookup bucket0..bucket3

across threads:
T0 batch stream
T1 batch stream
T2 batch stream

reference point:
MLP inside a thread plus parallel query ranges

Interpretation

FieldNotes
ExpectationBatching should improve lookup throughput, but the best batch size may change with thread count.
Measured resultn = 500,000, queries = 1,000,000, 3 repetitions: detailed rows below.
What happenedBatch 8 was best single-thread. Batch 4 was best at 8 threads.
TakeawayBatch size is not constant across thread counts.
Evidence statusSupported by timing.

Detailed rows

n = 500,000, queries = 1,000,000, 3 repetitions:

Batch1 thread ns/query4 threads ns/query8 threads ns/query8-thread speedup
1137.08448.87737.9313.614x
2118.55032.28321.8285.431x
4104.85127.64219.5765.356x
897.75237.05227.6063.541x
16105.20527.14419.7945.315x
32107.41428.26620.2315.309x

9.8 Read-Only Stream Bandwidth

Code Shape

for (std::size_t i = begin; i < end; ++i) {
    local += values[i];
}

Cache-Line Access Visual

T0 -> [CL range 0: * * * *]
T1 -> [CL range 1: * * * *]
T2 -> [CL range 2: * * * *]
T3 -> [CL range 3: * * * *]

reference point:
disjoint read-only streaming ranges

Interpretation

FieldNotes
ExpectationStream bandwidth should rise with threads until memory bandwidth saturates.
Measured resultn = 8,388,608 uint64_t values, 5 repetitions: detailed rows below.
What happenedThe stream row peaked at 4 threads and fell at 8 threads.
TakeawayThis gives a concrete bandwidth-saturation boundary for the measured setup.
Evidence statusSupported directionally; fixed-frequency rerun missing.

Detailed rows

n = 8,388,608 uint64_t values, 5 repetitions:

Threadsns/valueGB/sSpeedupEfficiency
11.3336.0001.000x1.000
20.73510.8861.814x0.907
40.47316.9032.817x0.704
80.51915.4132.569x0.321

Read-only stream bandwidth

10. Result Map

TopicBest measured row in this passMain note
Linear scan64 MiB int32_t, 0.867 ns/elementbaseline stream
Stride scan64 MiB stride 16, 10.034 ns/accessstrided access wastes cache-line payload
Full raw working-set ladder64 MiB linear 0.867 ns/op, random 45.038, dependent 160.778all raw-memory families now have ladder rows
Huge page probedefault 4 KiB pages 1.289 ns/value; transparent huge page path 1.253; explicit huge pages unavailablehuge-page result is measured/probed
Associativitystride 1025, 5.163 ns/access among listed conflict rowsplus-one strides were faster than paired power-of-two strides
Pointer chasearena sequential, 5.845 ns/nodelayout and order matter
Dependent chasesequential cycle, 6.674 ns/steprandom dependency costs about 14x
MLP chase16 lanes, 17.128 ns/chain-stepindependent misses overlap
Linked listchunked 64, 129,629 ns/querychunking reduces pointer jumps
Linked-list mutationindexed insert 13.728 ns/op; heap insert-new 177.696 ns/opallocation and pointer rewrites dominate mutation
Hot/cold list256 B payload, 7.6x faster than fat nodekeep payload cold
Binary searchEytzinger, 328.340 ns/querylayout beats normal binary
Branch-miss counter rowbinary-search program 11.60%; hash linear probing 3.19%representative perf stat evidence exists
Query orderingclustered binary, 80.564 ns/querywork order changes locality
Hash mapchained arena, 69.551 ns/querymeasured winner in this implementation
Hash bytes touchedtag split 8 estimated 26.323 B/lookup at 50% hitsbytes and time rank layouts differently
Batched hashbatch 8 single-thread, 97.752 ns/querybatching improved MLP over batch 1
False sharing8-thread adjacent counters, 2.368 ns/inc vs padded 1.737 vs local 0.020private accumulation is the cleanest shape
Binary scalingEytzinger 8 threads, 66.675 ns/queryfastest threaded binary-search row
Hash scalingRobin Hood 8 threads, 17.099 ns/queryfastest threaded hash-layout row
Batched hash scalingbatch 4 at 8 threads, 19.576 ns/querybatching sweet spot changed with thread count
Stream bandwidth4 threads, 16.903 GB/sbandwidth flattened after 4 threads

11. Negative Results And Mixed Results

ObservationStatusInterpretation
Simple linked-list prefetch did not rescue pointer chasing.MixedThe future address is discovered too late for prefetch to hide much latency.
Arena allocation did not fix random traversal.SupportedContiguous storage is not enough when traversal order is random.
Chained arena beat the expected open-addressing winner in one hash compare-all row.MixedImplementation details and compact chains mattered on this system.
Transparent huge-page advice had a small measured delta.InconclusiveExplicit huge pages were unavailable, so the result is not a full huge-page claim.
Padding counters helped, but local accumulation was much better.SupportedAvoiding repeated shared writes is stronger than spacing them out.
Tombstone degradation original binary failed correctness.Blocked by correctnessReplacement rows are useful, but the original benchmark cannot support timing claims.
perf c2c was unavailable.Blocked by toolingFalse-sharing claims are timing-backed, not c2c/HITM-backed.

12. Tradeoff Matrix

OptimizationHelps whenHurts whenEvidence strength
Sequential layoutTraversal order is known.Queries require random access.Supported
Chunked listRead-heavy misses or late hits dominate.Mutation inside chunks is frequent.Supported
Hot/cold splitSearch touches metadata more than payload.Every lookup consumes payload immediately.Supported
Bloom chunk filterNegative lookups dominate.Hit-heavy lookup pays filter overhead.Partially supported
Branchless binaryBranch mispredicts are material.Memory stalls dominate.Partially supported
Blocked searchOne global jump can be traded for local scans.Block scans become too large.Partially supported
Eytzinger layoutStatic large arrays are searched repeatedly.Frequent updates require rebuilds.Supported
Robin Hood hashingProbe-tail control matters.Insert displacement cost matters.Partially supported
Tag split hashMisses and large payloads are common.Tags false-positive often or payloads are small.Partially supported
Batched lookupIndependent queries exist.Single-query latency is the target.Supported directionally
Padded countersThreads must publish per-thread counters.Local accumulation can write once.Supported

13. Research Claims

ClaimMechanismMeasurement evidenceCaveatVerdict
Sequential scan is the baseline to beat.Full cache-line use and predictable access.Linear scan and working-set rows.Absolute timings are system-specific.Supported
Pointer chasing is slow because the next address is data-dependent.Load-to-use dependency prevents overlap.Dependent chase and linked-list rows.Arena allocation helps compact cases.Supported
Chunking improves list search.One pointer jump covers many local comparisons.Chunked list and hit-position rows.Mutation cost changes the tradeoff.Supported
Binary-search layout can matter as much as comparison count.Fewer global jumps or better tree layout.Eytzinger, blocked, and fanout rows.No layout wins every row.Supported
Hash-map lookup quality depends on collision layout and metadata placement.Local probes, probe tails, hot metadata.Compare-all, probe-tail, tag, hot/cold, and byte rows.Chained arena beat expectations in one row.Supported
Batching can expose memory-level parallelism.Independent bucket or chain requests overlap.MLP and batched hash rows.Batch sweet spot depends on workload and thread count.Supported directionally
False sharing can distort benchmark results.Cache-line ownership transfers on adjacent writes.Adjacent, padded, and local counter rows.perf c2c unavailable.Partially supported

14. Remaining Evidence

TopicStatusNeeded next
Per-variant branch counterspartially measuredIsolate normal binary, branchless, Eytzinger, and blocked binaries under separate perf stat runs.
Direct RFO/HITM evidenceblocked by toolingRequires CPU-specific events or available perf c2c.
SMT-off scalingnot measured yetRerun threaded rows with sibling threads disabled.
Fixed-frequency scalingnot measured yetRerun with frequency pinned where permissions and hardware allow.
Cross-machine validationnot measured yetRepeat on at least one newer CPU and one different cache hierarchy.
Production hash variantsnot measured yetAdd string keys, realistic deletion churn, and production-style allocators.

15. Reproducibility Notes

ItemPublic description
Measurement runBenchmark suite run on 2026-06-02
Benchmark codeScalar C++20, no SIMD or assembly in this version
Build shape60 separate benchmark binaries
Timing methodWarmup, repeated runs, median timing, checksum-protected loops
PinningSingle-thread rows pinned to one CPU; multithread rows pinned to the available CPU set
Counter rowsRepresentative Linux perf stat rows for binary search and hash-map lookup
Blocked toolingperf c2c unavailable, so false sharing is not backed by c2c counter evidence
Invalid row policyCorrectness failures are reported as blocked and are not used for timing claims

Figures in this article:

FigureWhat it shows
Working-set ladderRaw-memory footprint effects
Prefetch distanceStride prefetch-distance sweep
Linked-list miss costLinked-list miss cost and bytes per key
Binary locality sweepBlocked search and implicit B-tree fanout
Hash probe tailsLoad factor versus p50/p95/p99 probe tails
Hash bytes touchedEstimated bytes touched versus timing
Read-only search scalingThreaded binary and hash lookup
Counter ownership ladderAdjacent, padded, and local counter writes
Stream bandwidthRead-only stream bandwidth scaling

16. Limitations

AreaLimitation
System coverageOne measured Linux x86_64 system is included in the current result tables.
CPU controlsSMT stayed enabled; frequency was not pinned; governor was powersave.
Counter coverageLinux perf stat branch/cache rows exist for representative binary-search and hash-map binaries; perf c2c was unavailable.
PortabilityAbsolute timings are machine-specific; relative trends are the intended claim.
Implementation scopeScalar C++20 only; no SIMD, assembly, cloud, or distributed-system mapping.
Data structuresSome implementations are first-pass benchmark structures, not production containers.
CompletenessThe remaining missing items are mainly controlled-system counter reruns, SMT/frequency controls, and cross-machine validation.

17. Final Takeaway

Big-O explains growth. Cache behavior explains the trip through memory.

The repeated transformation pattern is:

one value per pointer
-> many values per pointer

global jumps
-> local block scans

pointer chains
-> local probes

fat objects
-> hot metadata plus cold payload

one dependent miss
-> many independent misses

shared writes
-> private or padded writes

Search performance is memory-pattern performance.