001 Notes

Cache Lines Before Big-O: Optimizing Search Algorithms by Changing Memory Access

0. Abstract

This note studies search performance as a memory-access problem.

The benchmark suite compares scalar C++20 implementations across:

raw memory patterns
linked-list layouts
binary-search layouts
hash-map collision strategies
batching and prefetching
multithreaded cache-line ownership

The central claim is:

Search performance is memory-pattern performance.

Absolute timings are specific to one measured x86_64 Linux system. The useful claim is the relative pattern: which code shapes stream through cache lines, which waste cache-line payload, which chase dependent addresses, which expose independent memory requests, and which create cache-line ownership traffic.

1. Research Questions

ID	Question
RQ1	When does memory locality dominate comparison count?
RQ2	Which code shapes reduce cache-line, TLB, branch, or dependency cost?
RQ3	When does memory-level parallelism compensate for poor locality?
RQ4	Which optimizations move cost from lookup to build, mutation, or memory footprint?
RQ5	Which claims are measured, which are partial, and which remain blocked by tooling?

2. Measurement Method

Rule	Purpose
Fixed seed data	Repeatable input generation.
Correctness before timing	Invalid variants do not support timing claims.
Warmup before timing	Reduce first-touch effects.
Median of repetitions	Reduce one-run noise.
Printed checksum	Prevent dead-code elimination.
Scalar C++20 only	Keep this version free of SIMD and assembly effects.
Standalone binaries	Keep construction, lookup, mutation, and scaling behavior visible.
Representative `perf stat` rows	Add counter evidence where available.

Article snippets are simplified code shapes. They omit CLI parsing, warmup, repetition control, correctness checks, result formatting, and checksum plumbing unless those details are the topic.

3. Measured System And Boundaries

Field	Value
System class	single measured x86_64 Linux system
CPU	`Intel(R) Core(TM) i7-4702MQ CPU @ 2.20GHz`
Topology	4 cores, 8 logical CPUs
Cache line	64 bytes
Page size	4096 bytes
Compiler	`g++ 13.3.0`
Build flags	`-O3 -march=native -std=c++20 -Wall -Wextra -Wshadow -Wconversion`
Pinning	single-thread rows pinned to one CPU; multithread rows pinned to the available CPU set
SMT	enabled
Frequency	not fixed; governor remained `powersave`
Counter tooling	representative Linux `perf stat` rows available; `perf c2c` unavailable

Boundaries:

No benchmark number is universal.
Relative patterns are the intended claim.
Counter evidence is partial.
Rows blocked by correctness or tooling are labeled as blocked.
Missing measurements are not inferred.

4. ASCII Legend

All diagrams use cache-line-centered notation.

[CL] = one 64-byte cache line
*    = useful value read
.    = fetched but unused value
P    = explicit prefetch target
?    = unpredictable address
!    = conflict, contention, or ownership risk
->   = access order

Cache-line cost model:

good search path:
CPU -> [CL0: * * * * * * * *] -> [CL1: * * * * * * * *]
       many useful values per fetched line

bad search path:
CPU -> [CL0: * . . . . . . .] -> [CL97: * . . . . . . .]
       one useful value per fetched line

dependent search path:
CPU -> [? node] -> [? node] -> [? node]
       next address is unknown until the current load returns

5. Raw Memory Code Shapes

Family question:

What does the CPU pay for when the operation is only "read memory"?

5.1 Linear Scan

Code Shape

std::uint64_t sum = 0;

for (std::size_t i = 0; i < values.size(); ++i) {
    sum += static_cast<std::uint64_t>(values[i]);
}

return sum;

Cache-Line Access Visual

int32 payload, 16 values per 64B line:

CPU -> [CL0: * * * * * * * * * * * * * * * *]
       [CL1: * * * * * * * * * * * * * * * *]

reference point:
nearly every fetched slot is useful

Interpretation

Field	Notes
Expectation	Sequential scan should be the best raw-memory baseline because addresses are predictable and cache-line payload is fully used.
Measured result	`n = 1,000,000`, 5 repetitions: detailed rows below.
What happened	Wider records cost more per logical element because each element consumes more cache-line payload.
Takeaway	Keep hot scan fields narrow. Do not drag cold payload through a search loop.
Evidence status	Supported by timing.

Detailed rows

n = 1,000,000, 5 repetitions:

Variant	Element size	ns/element	Visual
`int32_t`	4 B	`0.743`	`#`
`int64_t`	8 B	`1.027`	`##`
`struct16`	16 B	`2.808`	`#####`
`struct64`	64 B	`10.170`	`####################`

5.2 Stride Scan

Code Shape

std::uint64_t sum = 0;

for (std::size_t i = 0; i < values.size(); i += stride) {
    sum += static_cast<std::uint64_t>(values[i]);
}

return sum;

Cache-Line Access Visual

stride = 16 over int32:

CPU -> [CL0: * . . . . . . . . . . . . . . .]
       [CL1: * . . . . . . . . . . . . . . .]

reference point:
access is predictable, but most fetched bytes are unused

Interpretation

Field	Notes
Expectation	Large strides should be slower than stride 1 because each cache line contributes fewer useful values. Power-of-two strides may also create cache-set pressure.
Measured result	Selected rows, `n = 1,000,000`, 5 repetitions: detailed rows below.
What happened	Stride 1 was fastest. Several plus-one strides beat nearby power-of-two strides.
Takeaway	Do not only ask whether the working set fits in cache. Also ask whether the addresses map evenly across cache sets.
Evidence status	Supported by timing.

Detailed rows

Selected rows, n = 1,000,000, 5 repetitions:

Stride	ns/access	Visual
`1`	`0.686`	`#`
`16`	`6.352`	`#########`
`64`	`7.921`	`############`
`65`	`7.430`	`###########`
`256`	`7.292`	`###########`
`257`	`5.000`	`#######`
`1024`	`7.586`	`###########`
`1025`	`3.784`	`######`

5.3 Random Access

Code Shape

std::uint64_t sum = 0;

for (std::size_t i = 0; i < indexes.size(); ++i) {
    sum += static_cast<std::uint64_t>(values[indexes[i]]);
}

return sum;

Cache-Line Access Visual

index stream:
17 -> 900231 -> 42 -> 700001

CPU -> [? CL17:  * . . . . . . .]
       [? CL900: * . . . . . . .]
       [? CL42:  * . . . . . . .]

reference point:
loads are independent, but the next line is hard to predict

Interpretation

Field	Notes
Expectation	Random access should be tolerable for small windows and expensive when the working set exceeds cache.
Measured result	`n = 1,000,000`, `queries = 1,000,000`, 5 repetitions: detailed rows below.
What happened	The small-window row was much cheaper than full-array random access.
Takeaway	Constraining randomness into a smaller active window can be a real optimization.
Evidence status	Supported by timing.

Detailed rows

n = 1,000,000, queries = 1,000,000, 5 repetitions:

Pattern	ns/access	Visual
random small window	`4.355`	`####`
random within page	`10.172`	`##########`
shuffled permutation	`11.144`	`###########`
uniform random	`12.394`	`############`

5.4 Pointer Chase

Code Shape

std::uint64_t sum = 0;

for (PtrNode* node = head; node != nullptr; node = node->next) {
    sum += node->value;
}

return sum;

Cache-Line Access Visual

sequential arena:
CPU -> [CL0: node0 node1 node2 node3] -> [CL1: node4 node5 ...]

random heap or random arena:
CPU -> [? node A] -> [? node Q] -> [? node D] -> null

reference point:
next address is stored in the current node

Interpretation

Field	Notes
Expectation	Arena allocation should help, but random traversal should remain expensive because each next address depends on the previous load.
Measured result	`n = 500,000`, 5 repetitions: detailed rows below.
What happened	Arena allocation helped, but traversal order mattered more. Random arena traversal was still much slower than sequential arena traversal.
Takeaway	Contiguous storage is not enough. Traversal order controls whether cache-line locality is usable.
Evidence status	Supported by timing.

Detailed rows

n = 500,000, 5 repetitions:

Variant	ns/node	Relative
arena sequential	`5.845`	`1.0x`
heap sequential	`9.294`	`1.6x`
arena randomized	`112.305`	`19.2x`
heap randomized	`150.632`	`25.8x`

5.5 Dependent Pointer Chase

Code Shape

std::size_t index = 0;
std::uint64_t sum = 0;

for (std::size_t step = 0; step < steps; ++step) {
    index = next[index];
    sum += index;
}

return sum;

Cache-Line Access Visual

idx0 -> next[idx0] = idx1
        idx1 -> next[idx1] = idx2
                idx2 -> next[idx2] = idx3

CPU -> [? CL idx0] -> [? CL idx1] -> [? CL idx2]

reference point:
the next load cannot be issued until the current load returns

Interpretation

Field	Notes
Expectation	Random dependent chains should expose load-to-use latency and scale poorly with working-set size.
Measured result	`n = 1,000,000`, `steps = 5,000,000`, 5 repetitions: detailed rows below.
What happened	Random and page-spread cycles were about 14x slower than the sequential cycle.
Takeaway	Dependent memory chains are a latency problem, not just a cache-capacity problem.
Evidence status	Supported by timing.

Detailed rows

n = 1,000,000, steps = 5,000,000, 5 repetitions:

Variant	ns/step	Relative
sequential cycle	`6.674`	`1.0x`
random permutation cycle	`92.513`	`13.9x`
page-spread cycle	`96.173`	`14.4x`

5.6 Prefetch Pointer Chase

Code Shape

std::uint64_t sum = 0;

for (PtrNode* node = head; node != nullptr; node = node->next) {
    if (node->next != nullptr) {
        prefetchRead(node->next);
    }

    sum += node->value;
}

return sum;

Cache-Line Access Visual

CPU reads [CL?: node A | next -> node B]
  P -> [CL?: node B]
CPU then uses node B soon after

reference point:
prefetch is often too late because node B is discovered at node A

Interpretation

Field	Notes
Expectation	Simple next-prefetch may not help pointer chasing because address discovery and use are too close together.
Measured result	`n = 500,000`, 5 repetitions: detailed rows below.
What happened	Pointer next-prefetch did not improve the measured rows. Longer stride prefetch distances also did not improve this pass.
Takeaway	Prefetch is a distance problem. It helps only when the future address is known early enough and the prefetched data is likely to be used.
Evidence status	Supported for these rows; broader prefetch policy remains workload-specific.

Detailed rows

n = 500,000, 5 repetitions:

Variant	ns/node
no prefetch	`97.438`
prefetch next	`100.348`
prefetch next-next	`134.773`

Prefetch distance sweep for stride 16:

Distance	ns/access	Visual
0	`10.293`	`##################`
1	`10.426`	`##################`
2	`10.265`	`##################`
4	`10.204`	`##################`
8	`10.221`	`##################`
16	`11.400`	`####################`
32	`11.560`	`####################`
64	`11.427`	`####################`
128	`11.491`	`####################`

5.7 MLP Pointer Chase

Code Shape

std::vector<std::size_t> index(lanes, 0);
std::uint64_t sum = 0;

for (std::size_t step = 0; step < steps; ++step) {
    for (int lane = 0; lane < lanes; ++lane) {
        index[lane] = chains[lane][index[lane]];
        sum += index[lane];
    }
}

return sum;

Cache-Line Access Visual

lane 0: CPU -> [? CL] -> [? CL] -> [? CL]
lane 1: CPU -> [? CL] -> [? CL] -> [? CL]
lane 2: CPU -> [? CL] -> [? CL] -> [? CL]
lane 3: CPU -> [? CL] -> [? CL] -> [? CL]

reference point:
each lane is dependent, but lanes are independent of each other

Interpretation

Field	Notes
Expectation	Multiple independent chains should expose memory-level parallelism and lower per-chain cost until the core or memory system saturates.
Measured result	`working_set = 8,000,000` bytes, `steps = 1,000,000`, 5 repetitions: detailed rows below.
What happened	Per-chain time dropped from `107.056 ns` to about `17 ns` near 16 lanes.
Takeaway	Independent work can hide latency even when each individual chain has poor locality.
Evidence status	Supported by timing.

Detailed rows

working_set = 8,000,000 bytes, steps = 1,000,000, 5 repetitions:

Lanes	ns/outer step	ns/chain-step	Visual per chain
`1`	`107.056`	`107.056`	`####################`
`2`	`108.373`	`54.186`	`##########`
`4`	`114.800`	`28.700`	`#####`
`8`	`152.170`	`19.021`	`####`
`16`	`274.044`	`17.128`	`###`
`32`	`557.560`	`17.424`	`###`

5.8 Associativity Conflict

Code Shape

for (std::size_t i = 0; i < values.size(); i += stride) {
    sum += values[i];
}

Cache-Line Access Visual

stride 256:
CPU -> [CL set A: *] -> [CL set A: *] -> [CL set A: *]
                         !

stride 257:
CPU -> [CL set A: *] -> [CL set B: *] -> [CL set C: *]

reference point:
similar strides can map differently across cache sets

Interpretation

Field	Notes
Expectation	Power-of-two strides can concentrate accesses into fewer cache sets; plus-one variants may spread mapping differently.
Measured result	`n = 4,000,000`, 5 repetitions: detailed rows below.
What happened	The plus-one stride was faster in every paired row in this pass.
Takeaway	Cache-set placement belongs in the experiment matrix. Similar-looking access patterns can differ.
Evidence status	Supported by timing; counter evidence not included.

Detailed rows

n = 4,000,000, 5 repetitions:

Stride	ns/access	Compared row	Visual
`64`	`7.862`	power of two	`############`
`65`	`7.040`	plus one	`###########`
`128`	`7.970`	power of two	`############`
`129`	`7.288`	plus one	`###########`
`256`	`8.441`	power of two	`#############`
`257`	`7.434`	plus one	`###########`
`512`	`7.507`	power of two	`###########`
`513`	`6.371`	plus one	`##########`
`1024`	`7.595`	power of two	`###########`
`1025`	`4.920`	plus one	`#######`

5.9 TLB Page Walk

Code Shape

for (std::size_t page : page_order) {
    sum += values[(page * page_size) / sizeof(values[0])];
}

Cache-Line Access Visual

one access per page:

CPU -> [page 0: CL0: * . . . . . . .]
       [page 1: CL0: * . . . . . . .]
       [page 2: CL0: * . . . . . . .]

reference point:
one useful load per page stresses address translation

Interpretation

Field	Notes
Expectation	Page-walk patterns should be much slower than dense sequential scans because they use fewer values per page and per cache line.
Measured result	`n = 4,000,000`, 3906 pages, 5 repetitions: detailed rows below.
What happened	The page-walk rows were much slower than dense sequential scan.
Takeaway	Good search layout should reuse both cache lines and pages.
Evidence status	Supported by timing; direct TLB counter rows are not shown per variant.

Detailed rows

n = 4,000,000, 3906 pages, 5 repetitions:

Pattern	ns/access
dense sequential scan	`4.410`
sequential page walk	`31.813`
one access per 4 KB page	`32.018`
random page walk	`32.817`

Completion pass:

Pattern	ns/access	Visual
one access per 4 KB page	`33.580`	`##################`
sequential page walk	`33.277`	`##################`
random page walk	`36.540`	`####################`
dense sequential scan	`4.390`	`##`

5.10 Working-Set Ladder

Code Shape

for (std::size_t footprint : footprints) {
    buildInput(footprint);
    runSameAccessShape();
}

Cache-Line Access Visual

32 KiB  -> likely cache-resident
256 KiB -> larger private-cache pressure
1 MiB   -> beyond small private caches
8 MiB   -> shared-cache and memory pressure
64 MiB  -> memory-system pressure

reference point:
same loop shape, larger footprint

Interpretation

Field	Notes
Expectation	Sequential access should stay relatively stable; random and dependent access should rise sharply with footprint.
Measured result	Detailed rows below; benchmark numbers unchanged.
What happened	Sequential access stayed cheap. Random, pointer, and dependent rows rose sharply at larger footprints. MLP stayed lower than a single random dependent chain.
Takeaway	Footprint sweeps are required before interpreting one benchmark row.
Evidence status	Supported by timing.

Detailed rows

Pattern	Variant	32 KiB ns/op	256 KiB ns/op	1 MiB ns/op	8 MiB ns/op	64 MiB ns/op
linear scan	int32 sequential	`0.600`	`0.527`	`0.534`	`0.827`	`0.867`
stride scan	stride 16	`2.201`	`4.948`	`6.114`	`9.258`	`10.034`
random access	uniform random	`3.911`	`5.000`	`8.298`	`18.924`	`45.038`
pointer chase	arena random	`9.365`	`29.733`	`53.748`	`109.177`	`170.508`
dependent chase	random cycle	`6.396`	`19.608`	`45.818`	`116.110`	`160.778`
prefetch chase	prefetch next	`9.847`	`24.481`	`50.856`	`103.324`	`187.059`
MLP chase	8 lanes	`1.261`	`3.788`	`6.932`	`17.450`	`42.083`
TLB page walk	random page	`19.375`	`7.828`	`7.070`	`28.464`	`38.044`

5.11 Cold/Warm Scan, Write Traffic, And Huge Pages

Code Shape

touchEvictionBuffer();
runScan();      // cold-style pass
runScan();      // warm repeated pass

for (std::size_t i = 0; i < values.size(); ++i) {
    values[i] = values[i] + 1;
}

Cache-Line Access Visual

cold-style:
CPU -> [eviction buffer] -> [CL0: * * * *] -> [CL1: * * * *]

read-modify-write:
CPU -> [CL0: load * * * *] -> [CL0: writable !] -> store

transparent huge-page probe:
4 KiB pages -> many translations
2 MiB pages -> fewer translations, if available

Interpretation

Field	Notes
Expectation	Warm repeated scans may be faster. Writes should cost more than reads because cache lines must become writable. Huge pages may help TLB-sensitive access if available.
Measured result	Cold versus warm, 32 MiB: detailed rows below.
What happened	The warm scan was only slightly faster. Writes were slower than reads. Transparent huge pages were slightly faster in this row, while explicit huge pages were unavailable.
Takeaway	Read/write ownership and page size are real variables, but this article has timing evidence only, not direct RFO or per-row TLB counter proof.
Evidence status	Partially supported. Explicit huge pages and direct RFO counters remain blocked or unavailable.

Detailed rows

Cold versus warm, 32 MiB:

Items	Bytes	Cold ns/elt	Warm ns/elt	Cold/Warm	Visual
8,388,608	32 MiB	`0.886`	`0.865`	`1.02x`	`##`

Write and RFO-style timing proxy:

Variant	ns/op	Visual
read_only	`1.273`	`#############`
write_stream	`1.918`	`####################`
read_modify_write	`1.884`	`###################`

Huge page probe, 64 MiB allocation:

Mode	Status	Page size	ns/value	Note
default pages	measured	`4096`	`1.289`	normal anonymous mapping
explicit huge pages	unavailable	`2097152`	n/a	huge-page allocation unavailable
transparent huge pages	measured	`2097152`	`1.253`	advised huge-page path

6. Linked-List Code Shapes

Family question:

Can list-like search be made less expensive by changing how much useful work each pointer jump buys?

6.1 Heap List

Code Shape

bool searchList(PtrNode* head, int key) {
    for (PtrNode* node = head; node != nullptr; node = node->next) {
        if (node->value == key) {
            return true;
        }
    }

    return false;
}

Cache-Line Access Visual

CPU -> [? heap node A: * next] -> [? heap node Q: * next]
       -> [? heap node D: * next] -> null

reference point:
one useful value per dependent pointer jump

Interpretation

Field	Notes
Expectation	Heap allocation should be the poor-locality baseline when traversal order is scattered.
Measured result	Pointer-chase row: detailed rows below.
What happened	Heap randomized traversal was the expensive baseline.
Takeaway	Heap nodes plus random traversal are a worst-case shape for cache-line reuse.
Evidence status	Supported by timing.

Detailed rows

Pointer-chase row:

Variant	ns/node	Relative
heap sequential	`9.294`	`1.6x`
heap randomized	`150.632`	`25.8x`

Linked-list compare-all row:

Variant	ns/query	Relative
heap randomized	`3,646,397`	`28.1x`

6.2 Arena List

Code Shape

std::vector<PtrNode> nodes(n);

for (std::size_t i = 0; i + 1U < nodes.size(); ++i) {
    nodes[i].value = static_cast<std::uint64_t>(i);
    nodes[i].next = &nodes[i + 1U];
}

nodes.back().next = nullptr;

Cache-Line Access Visual

arena memory:
[CL0: node0 node1 node2 node3] [CL1: node4 node5 node6 node7]

CPU -> node0 -> node1 -> node2 -> node3 -> node4

reference point:
allocation and traversal are both compact

Interpretation

Field	Notes
Expectation	Arena allocation should improve locality but cannot remove the pointer dependency.
Measured result	Detailed rows below; benchmark numbers unchanged.
What happened	Arena sequential was much faster than heap randomized, but slower than chunked list search in the compare-all row.
Takeaway	Arena allocation helps, but one pointer jump per value is still expensive for late-hit and miss-heavy search.
Evidence status	Supported by timing.

Detailed rows

Variant	ns/node	Relative
arena sequential	`5.845`	`1.0x`

Compare-all:

Variant	ns/query	Relative
arena sequential	`333,708`	`2.6x`

6.3 Random Arena List

Code Shape

std::vector<std::size_t> order = makeRandomPermutation(n, seed);

for (std::size_t i = 0; i + 1U < order.size(); ++i) {
    nodes[order[i]].next = &nodes[order[i + 1U]];
}

nodes[order.back()].next = nullptr;

Cache-Line Access Visual

arena memory:
[CL0: node0 node1 node2 node3] [CL1: node4 node5 node6 node7]

CPU -> node3 -> node0 -> node7 -> node2 -> node5

reference point:
storage is compact, but traversal is not

Interpretation

Field	Notes
Expectation	Random traversal should be slow even when allocation is compact.
Measured result	Detailed rows below; benchmark numbers unchanged.
What happened	Random arena traversal was much slower than arena sequential traversal.
Takeaway	The CPU follows access order, not allocation intent.
Evidence status	Supported by timing.

Detailed rows

Variant	ns/node	Relative
arena randomized	`112.305`	`19.2x`

6.4 Prefetch List

Code Shape

for (PtrNode* node = head; node != nullptr; node = node->next) {
    if (node->next != nullptr) {
        prefetchRead(node->next);
    }

    sum += node->value;
}

Cache-Line Access Visual

CPU reads [CL?: node A | next -> node B]
  P -> [CL?: node B]
CPU immediately advances to node B

reference point:
prefetch target is discovered late

Interpretation

Field	Notes
Expectation	Naive pointer-list prefetch should be weak because there is little work between prefetch and use.
Measured result	Detailed rows below; benchmark numbers unchanged.
What happened	Naive prefetch did not beat arena sequential in these rows.
Takeaway	Linked-list prefetch needs enough independent work between address discovery and use.
Evidence status	Supported for this implementation and query mix.

Detailed rows

Variant	ns/query	Relative
prefetch next on arena list	`484,095`	`3.7x`
arena sequential	`333,708`	`2.6x`

Completion compare-all:

Variant	ns/query	Visual
prefetch next	`963,053`	`##`
arena sequential	`667,221`	`#`

6.5 Chunked List

Code Shape

template <int ChunkSize>
bool searchChunked(const ChunkNode<ChunkSize>* head, int key) {
    for (auto* node = head; node != nullptr; node = node->next) {
        for (int i = 0; i < node->count; ++i) {
            if (node->values[i] == key) {
                return true;
            }
        }
    }

    return false;
}

Cache-Line Access Visual

normal list:
CPU -> [CL?: value next] -> [CL?: value next] -> [CL?: value next]

chunked list:
CPU -> [CL0: * * * * * * * * | chunk metadata]
       [CL1: * * * * * * * * | next part of chunk]
       -> next chunk

reference point:
one pointer jump buys many value checks

Interpretation

Field	Notes
Expectation	Chunking should improve late-hit and miss-heavy list search because it reduces pointer jumps per searchable value.
Measured result	Compare-all: detailed rows below.
What happened	Chunking improved the expensive late-hit and miss rows compared with pointer-per-value traversal.
Takeaway	If a list must be scanned, make each pointer jump buy a block of local work.
Evidence status	Supported by timing.

Detailed rows

Compare-all:

Variant	ns/query	Relative
chunked 64	`129,629`	`1.0x`
arena sequential	`333,708`	`2.6x`
heap randomized	`3,646,397`	`28.1x`

Hit-position distribution, chunked 32:

Position	Ops	ns/query	Visual
early hit	1,000,000	`5.68`	`#`
middle hit	500	`315,212`	`##########`
late hit	500	`633,347`	`####################`
miss	500	`630,506`	`####################`

6.6 Chunked Prefetch

Code Shape

template <int ChunkSize>
bool searchChunkedPrefetch(const ChunkNode<ChunkSize>* head, int key) {
    for (auto* node = head; node != nullptr; node = node->next) {
        if (node->next != nullptr) {
            prefetchRead(node->next);
        }

        for (int i = 0; i < node->count; ++i) {
            if (node->values[i] == key) {
                return true;
            }
        }
    }

    return false;
}

Cache-Line Access Visual

CPU reads [chunk A | next -> chunk B]
  P -> [chunk B]
CPU scans [CL A0: * * * *] [CL A1: * * * *]
CPU then uses [chunk B]

reference point:
chunk scan creates prefetch distance

Interpretation

Field	Notes
Expectation	Chunked prefetch should help more than normal list prefetch because scanning the current chunk creates lead time.
Measured result	`n = 100,000`, `queries = 5,000`, 5 repetitions: detailed rows below.
What happened	Prefetch was not a broad win. The best no-prefetch and prefetch rows were both chunk size `16`.
Takeaway	Prefetch distance is necessary but not sufficient. Chunk size, query mix, and cache state decide whether it pays.
Evidence status	Mixed timing evidence.

Detailed rows

n = 100,000, queries = 5,000, 5 repetitions:

Chunk size	no prefetch ns/query	prefetch ns/query
`4`	`149,268`	`165,376`
`8`	`98,464`	`108,059`
`16`	`94,233`	`98,170`
`32`	`183,450`	`228,932`
`64`	`189,332`	`187,497`
`128`	`101,391`	`105,723`
`256`	`134,423`	`133,957`

6.7 Indexed List

Code Shape

struct IndexNode {
    int value = 0;
    int next = -1;
};

bool searchIndexed(const std::vector<IndexNode>& nodes, int key) {
    for (int index = 0; index != -1; index = nodes[static_cast<std::size_t>(index)].next) {
        if (nodes[static_cast<std::size_t>(index)].value == key) {
            return true;
        }
    }

    return false;
}

Cache-Line Access Visual

nodes vector:
[CL0: idx0 value,next | idx1 value,next | idx2 value,next | idx3 value,next]

CPU -> nodes[0] -> nodes[4] -> nodes[1]

reference point:
link is a compact integer, but traversal is still dependent

Interpretation

Field	Notes
Expectation	Indexed nodes should reduce hot-node footprint, but still behave like a dependent chase.
Measured result	`n = 200,000`, `queries = 500`, 3 repetitions: detailed rows below.
What happened	Indexed nodes cut footprint but were slower than arena pointer nodes in the listed high-level query shape.
Takeaway	Compact representation helps memory footprint, but it does not remove address dependency.
Evidence status	Supported by timing and byte counts.

Detailed rows

n = 200,000, queries = 500, 3 repetitions:

Variant	ns/query	Bytes/searchable key
indexed list	`949,784`	`8.0`

Hit-position distribution:

Position	Ops	ns/query	Visual
early hit	1,000,000	`2.56`	`#`
middle hit	500	`771,964`	`##########`
late hit	500	`1,545,820`	`####################`
miss	500	`1,543,800`	`####################`

6.8 Hot/Cold List

Code Shape

struct HotNode {
    int key = 0;
    int next = -1;
};

struct ColdNode {
    int value = 0;
    std::array<char, PayloadSize> payload{};
};

bool searchHot(const std::vector<HotNode>& hot, int key) {
    for (int index = 0; index != -1; index = hot[static_cast<std::size_t>(index)].next) {
        if (hot[static_cast<std::size_t>(index)].key == key) {
            return true;
        }
    }

    return false;
}

Cache-Line Access Visual

fat node search:
CPU -> [CL: key next payload payload] -> [CL: payload payload ...]

hot/cold search:
CPU -> [CL hot: key next | key next | key next | key next]
only on match -> [cold payload]

reference point:
negative search touches metadata, not payload

Interpretation

Field	Notes
Expectation	Hot/cold split should help as payload grows.
Measured result	`n = 100,000`, `queries = 5,000`, 5 repetitions: detailed rows below.
What happened	Fat-node search cost grew with payload. Hot/cold search stayed nearly flat.
Takeaway	Search should touch searchable metadata first and cold payload only after a match.
Evidence status	Supported by timing.

Detailed rows

n = 100,000, queries = 5,000, 5 repetitions:

Payload	fat node ns/query	hot/cold ns/query	Improvement
`16 B`	`827,299`	`559,820`	`1.5x`
`64 B`	`1,504,184`	`559,620`	`2.7x`
`128 B`	`2,615,491`	`559,849`	`4.7x`
`256 B`	`4,265,747`	`559,886`	`7.6x`

6.9 Multi-Lane List

Code Shape

std::uint64_t found = 0;

for (std::size_t q = 0; q < queries.size(); q += static_cast<std::size_t>(lanes)) {
    for (int lane = 0; lane < lanes; ++lane) {
        found += searchList(heads[static_cast<std::size_t>(lane)], queries[q + lane]);
    }
}

Cache-Line Access Visual

lane 0: CPU -> [? node] -> [? node] -> [? node]
lane 1: CPU -> [? node] -> [? node] -> [? node]
lane 2: CPU -> [? node] -> [? node] -> [? node]
lane 3: CPU -> [? node] -> [? node] -> [? node]

reference point:
one list remains dependent; many lists expose throughput parallelism

Interpretation

Field	Notes
Expectation	Interleaving independent searches should expose memory-level parallelism.
Measured result	`n = 200,000`, `queries = 250`, 3 repetitions: detailed rows below.
What happened	Per-query time dropped sharply as lanes increased.
Takeaway	MLP improves throughput across independent list searches, not the latency of one dependent traversal.
Evidence status	Supported by timing.

Detailed rows

n = 200,000, queries = 250, 3 repetitions:

Lanes	ns/query	Visual
`1`	`7,129,608`	`####################`
`2`	`3,419,579`	`##########`
`4`	`1,604,336`	`#####`
`8`	`611,275`	`##`
`16`	`250,894`	`#`

6.10 Bloom Chunked List

Code Shape

bool searchBloomChunked(const BloomChunkNode<ChunkSize>* head, int key) {
    for (auto* node = head; node != nullptr; node = node->next) {
        if ((node->bloom & bloomMask(key)) == 0U) {
            continue;
        }

        for (int i = 0; i < node->count; ++i) {
            if (node->values[i] == key) {
                return true;
            }
        }
    }

    return false;
}

Cache-Line Access Visual

negative lookup:
CPU -> [CL chunk metadata: bloom bits]
       bit absent -> skip [values] -> next chunk

maybe hit:
CPU -> [CL chunk metadata: bloom bits]
       bit present -> [CL values: * * * *]

reference point:
cheap metadata can avoid scanning values

Interpretation

Field	Notes
Expectation	Bloom filtering should help negative lookups by skipping local chunk scans.
Measured result	`n = 200,000`, `queries = 2,000`, 3 repetitions: detailed rows below.
What happened	The hit-ratio sweep improved as generated hits became more common and early in the benchmark shape.
Takeaway	Bloom-per-chunk needs hit-ratio and hit-position separated before making a broad claim.
Evidence status	Partially supported.

Detailed rows

n = 200,000, queries = 2,000, 3 repetitions:

Hit ratio	ns/query	Visual
`0%`	`314,063`	`####################`
`10%`	`282,604`	`##################`
`50%`	`157,390`	`##########`
`100%`	`1,543`	`#`

6.11 Compare-All And Mutation Rows

Code Shape

sameQueries();
run(heapList);
run(arenaList);
run(prefetchList);
run(chunkedList);

// mutation timing is separated from search-to-position cost
insertAfter(cursor);
deleteAfter(cursor);
spliceAfter(source, dest);

Cache-Line Access Visual

compare-all:
same query stream -> heap / arena / prefetch / chunked

mutation:
[cursor] -> [new_node] -> [old_next]
[cursor] -----------> [next_after_delete]

reference point:
lookup-friendly layout may move cost to update paths

Interpretation

Field	Notes
Expectation	Chunking should win lookup-heavy rows. Vector-backed mutation should avoid heap-allocation overhead.
Measured result	Completion compare-all, `n = 200,000`, `queries = 500`, 3 repetitions: detailed rows below.
What happened	Chunked lookup was strongest in compare-all. Heap allocation was expensive in mutation. Indexed mutation rows were cheapest in this pass.
Takeaway	List layout decisions should be evaluated on both lookup and mutation paths.
Evidence status	Supported for measured append/insert/delete/splice shapes; arbitrary middle-position search cost is separate.

Detailed rows

Completion compare-all, n = 200,000, queries = 500, 3 repetitions:

Variant	ns/query	Visual
heap randomized	`11,208,462`	`####################`
prefetch next	`963,053`	`##`
arena sequential	`667,221`	`#`
chunked 64	`257,240`	`#`

Insert/delete/splice mutation, operations = 100,000, 5 repetitions:

Layout	Operation	ns/op	Bytes/node	Visual
pointer arena	insert after, preallocated	`32.796`	`16`	`####`
pointer heap	insert after, `new`	`177.696`	`16`	`##################`
pointer arena	delete after unlink	`112.199`	`16`	`###########`
pointer arena	splice after relink	`194.450`	`16`	`####################`
indexed	insert after, preallocated	`13.728`	`8`	`#`
indexed	delete after free list	`27.253`	`8`	`###`
indexed	splice after relink	`39.530`	`8`	`####`

7. Binary-Search Code Shapes

Family question:

Can fewer global jumps beat fewer comparisons?

7.1 Tiny Linear Search

Code Shape

bool linearContains(const int* values, std::size_t n, int key) {
    for (std::size_t i = 0; i < n; ++i) {
        if (values[i] == key) {
            return true;
        }
    }

    return false;
}

Cache-Line Access Visual

small array:
CPU -> [CL0: * * * * * * * * * * * * * * * *]

reference point:
for tiny arrays, the full scan can stay in one or a few cache lines

Interpretation

Field	Notes
Expectation	Linear search should win early-hit rows for tiny arrays; binary search should win misses as `n` grows.
Measured result	`queries = 1,000,000`, 5 repetitions: detailed rows below.
What happened	Linear search won early hits. Binary search won misses once `n` grew.
Takeaway	Small local search policy depends on hit position and miss rate.
Evidence status	Supported by timing.

Detailed rows

queries = 1,000,000, 5 repetitions:

n	linear early hit	binary early hit	linear miss	binary miss
`4`	`3.882`	`15.478`	`17.717`	`17.401`
`8`	`3.852`	`25.126`	`27.938`	`18.645`
`16`	`3.870`	`28.822`	`48.121`	`20.227`
`32`	`3.857`	`33.978`	`88.823`	`23.728`
`64`	`3.904`	`35.470`	`172.843`	`28.091`

7.2 Normal Binary Search

Code Shape

bool binaryContains(const std::vector<int>& values, int key) {
    std::size_t left = 0;
    std::size_t right = values.size();

    while (left < right) {
        const std::size_t mid = left + (right - left) / 2U;

        if (values[mid] == key) {
            return true;
        }

        if (values[mid] < key) {
            left = mid + 1U;
        } else {
            right = mid;
        }
    }

    return false;
}

Cache-Line Access Visual

CPU -> [CL mid: * . . .]
       -> [CL quarter: * . . .]
       -> [CL eighth: * . . .]

reference point:
few comparisons, but jumpy memory access

Interpretation

Field	Notes
Expectation	Normal binary search should be a strong comparison-count baseline but can lose to layouts with better memory locality.
Measured result	Large-array compare-all, `n = 1,000,000`, `queries = 1,000,000`, 5 repetitions: detailed rows below.
What happened	Normal binary was slower than Eytzinger and blocked layouts in the large-array rows shown here.
Takeaway	Minimizing comparisons does not necessarily minimize memory stalls.
Evidence status	Supported by timing.

Detailed rows

Large-array compare-all, n = 1,000,000, queries = 1,000,000, 5 repetitions:

Variant	ns/query	Visual
normal binary	`591.551`	`####################`
Eytzinger	`326.475`	`###########`

Completion compare-all:

Variant	ns/query	Visual
binary	`738.569`	`##############`
Eytzinger	`328.340`	`######`

7.3 Branchless Binary

Code Shape

bool branchlessContains(const std::vector<int>& values, int key) {
    std::size_t base = 0;
    std::size_t len = values.size();

    while (len > 1U) {
        const std::size_t half = len / 2U;
        const std::size_t mid = base + half;
        const bool move = values[mid] <= key;
        base = move ? mid : base;
        len -= half;
    }

    return !values.empty() && values[base] == key;
}

Cache-Line Access Visual

CPU -> [CL mid: *]
       compute move flag
       update base without branch-shaped control flow
       -> [CL next mid: *]

reference point:
branch behavior changes; memory jumps remain

Interpretation

Field	Notes
Expectation	Branchless search should help when branch misses matter, but cannot remove jumpy memory access.
Measured result	Detailed rows below; benchmark numbers unchanged.
What happened	Branchless timing beat normal binary in the displayed compare-all row. The counter row is program-level, not per variant.
Takeaway	Branch behavior is material, but per-variant counter isolation is still needed.
Evidence status	Partially supported by timing and representative counters.

Detailed rows

Variant	ns/query	Visual
branchless	`338.353`	`###########`
normal binary	`591.551`	`####################`

Representative program-level branch counter row:

Family	Status	Branches	Branch misses	Miss rate
binary search	measured with Linux `perf stat`	`381,401,896`	`44,259,946`	`11.60%`

7.4 Blocked Search

Code Shape

bool blockedContains(const BlockedArray& array, int key) {
    auto it = std::lower_bound(array.blockMax.begin(), array.blockMax.end(), key);
    if (it == array.blockMax.end()) {
        return false;
    }

    const int block = static_cast<int>(it - array.blockMax.begin());
    const int first = block * array.blockSize;
    const int last = std::min(first + array.blockSize, static_cast<int>(array.values.size()));

    for (int i = first; i < last; ++i) {
        if (array.values[static_cast<std::size_t>(i)] == key) {
            return true;
        }
    }

    return false;
}

Cache-Line Access Visual

metadata:
CPU -> [CL blockMax: * . . .] -> choose block

local scan:
CPU -> [CL block: * * * * * * * *]
       [CL block: * * * * * * * *]

reference point:
one global decision leads to local sequential work

Interpretation

Field	Notes
Expectation	Blocked search should have a sweet spot: small blocks cost metadata search, large blocks cost local scanning.
Measured result	Block-size sweep: detailed rows below.
What happened	Block size `64` was best in this pass.
Takeaway	The correct block size is empirical. It balances global jumps against local scan work.
Evidence status	Supported by timing.

Detailed rows

Block-size sweep:

Block size	ns/query	Visual
`8`	`568.714`	`#############`
`16`	`496.838`	`############`
`32`	`480.809`	`###########`
`64`	`472.178`	`###########`
`128`	`529.335`	`############`
`256`	`615.460`	`##############`
`512`	`846.829`	`####################`

7.5 Cache-Line Block Search

Code Shape

for (int blockSize : {16, 32, 64}) {
    auto index = buildBlockedArray(values, blockSize);
    runQueries(index);
}

Cache-Line Access Visual

int32 values:
block 16 -> [CL0: * * * * * * * * * * * * * * * *]
block 32 -> [CL0: * * * * ...] [CL1: * * * * ...]
block 64 -> [CL0] [CL1] [CL2] [CL3]

reference point:
test one, two, and four nearby cache lines

Interpretation

Field	Notes
Expectation	One-cache-line blocks may be good, but larger nearby blocks can win if they reduce more global jumps.
Measured result	Detailed rows below; benchmark numbers unchanged.
What happened	The `64` value block was best in this pass.
Takeaway	”One cache line” is not automatically the optimal block size.
Evidence status	Supported by timing.

Detailed rows

Block size	ns/query	Visual
`16`	`499.228`	`####################`
`32`	`481.549`	`###################`
`64`	`471.386`	`##################`

7.6 Eytzinger Search

Code Shape

bool eytzingerContains(const std::vector<int>& eytzinger, int key) {
    std::size_t index = 1;

    while (index < eytzinger.size()) {
        if (eytzinger[index] == key) {
            return true;
        }

        index = key < eytzinger[index] ? index * 2U : index * 2U + 1U;
    }

    return false;
}

Cache-Line Access Visual

array-backed tree:

index:  1
       / \
      2   3
     / \ / \
    4  5 6  7

CPU -> [CL: eyt[1] eyt[2] eyt[3] ...] -> child index by arithmetic

reference point:
tree shape without pointer nodes

Interpretation

Field	Notes
Expectation	Eytzinger should improve large read-only search by changing physical layout while keeping compact array storage.
Measured result	Large-array compare-all: detailed rows below.
What happened	Eytzinger was the fastest large-array row in the displayed passes.
Takeaway	Physical layout can matter as much as search logic.
Evidence status	Supported by timing.

Detailed rows

Large-array compare-all:

Variant	ns/query	Visual
Eytzinger	`326.475`	`###########`
normal binary	`591.551`	`####################`

Completion compare-all:

Variant	ns/query	Visual
Eytzinger	`328.340`	`######`
binary	`738.569`	`##############`

7.7 Eytzinger Prefetch

Code Shape

if (prefetchDistance >= 1) {
    const std::size_t child = index * 2U;
    if (child < eytzinger.size()) {
        prefetchRead(&eytzinger[child]);
    }
}

Cache-Line Access Visual

CPU at eyt[i]
  P child       -> eyt[2*i]
  P grandchild  -> eyt[4*i]
CPU compares eyt[i]
CPU chooses next child

reference point:
future addresses are arithmetic, but not all prefetched paths are used

Interpretation

Field	Notes
Expectation	Prefetch may help when distance is right and hurt when it is too near, too far, or polluting.
Measured result	Detailed rows below; benchmark numbers unchanged.
What happened	Child prefetch was worse. Grandchild prefetch was slightly faster than the no-prefetch row in this pass.
Takeaway	Prefetch distance is a parameter, not a feature toggle.
Evidence status	Partially supported by timing.

Detailed rows

Variant	ns/query	Visual
Eytzinger	`314.106`	`#############`
prefetch child	`494.878`	`####################`
prefetch grandchildren	`312.936`	`#############`

7.8 Implicit B-Tree

Code Shape

int childSlotLinear(const BTreeNodeShape& node, int key) {
    int slot = 0;

    while (slot < node.count && key > node.keys[static_cast<std::size_t>(slot)]) {
        ++slot;
    }

    return slot;
}

Cache-Line Access Visual

binary tree:
CPU -> [node] -> [node] -> [node] -> [node]

implicit B-tree:
CPU -> [CL node: k0 k1 k2 k3 k4 k5 k6 k7 | child offsets]
       -> child node

reference point:
more local comparisons, fewer levels

Interpretation

Field	Notes
Expectation	Higher fanout should reduce depth, but too much inside-node search can cost more.
Measured result	Detailed rows below; benchmark numbers unchanged.
What happened	Linear inside-node scan won across tested fanouts. `64` keys was the fastest linear inside-node row.
Takeaway	Local sequential work can beat branchier inside-node search.
Evidence status	Supported by timing.

Detailed rows

Node keys	Linear inside-node ns/query	Binary inside-node ns/query	Faster
`4`	`662.335`	`735.386`	linear
`8`	`605.165`	`736.048`	linear
`16`	`511.993`	`734.992`	linear
`32`	`508.562`	`743.072`	linear
`64`	`498.007`	`739.088`	linear

7.9 Power-Of-Two Padding

Code Shape

for (std::size_t n : {pow2, pow2 + 123U, pow2 + 257U, pow2 + 4099U}) {
    auto values = makeSortedEvenKeys(n);
    runSearchVariants(values, queries);
}

Cache-Line Access Visual

n = 2^20:
CPU -> regular midpoint offsets -> possible repeated set mapping !

n = 2^20 + 257:
CPU -> shifted midpoint offsets -> different set mapping

reference point:
same algorithm, changed address placement

Interpretation

Field	Notes
Expectation	Power-of-two sizes can expose address-mapping sensitivity.
Measured result	Normal binary: detailed rows below.
What happened	The padded `2^20 + 257` row improved normal binary substantially. Eytzinger was less sensitive in this table.
Takeaway	Array size can change search speed even when the algorithm is identical.
Evidence status	Supported by timing; counter confirmation is missing.

Detailed rows

Normal binary:

n	ns/query	Visual
1,048,576	`750.393`	`###############`
1,048,699	`649.605`	`#############`
1,048,833	`628.107`	`#############`
1,052,675	`609.551`	`############`
2,097,152	`1,000.268`	`####################`
2,097,275	`846.635`	`#################`

Eytzinger:

n	ns/query	Visual
1,048,576	`336.239`	`###############`
1,048,699	`352.473`	`###############`
1,048,833	`347.754`	`###############`
1,052,675	`335.315`	`##############`
2,097,152	`456.035`	`####################`
2,097,275	`441.157`	`###################`

7.10 Query Ordering

Code Shape

std::sort(queries.begin(), queries.end());
runSearchVariants(values, queries);

Cache-Line Access Visual

random queries:
q900 -> q17 -> q700 -> q3
CPU  -> [CL far] -> [CL far] -> [CL far]

clustered queries:
q100 -> q101 -> q104 -> q110
CPU  -> [CL region A] -> [CL region A] -> [CL nearby]

reference point:
same data structure, different temporal locality

Interpretation

Field	Notes
Expectation	Sorted or clustered queries should reuse nearby array/tree regions.
Measured result	`n = 1,000,000`, `queries = 1,000,000`, 5 repetitions: detailed rows below.
What happened	Clustered queries made normal binary much faster than random queries.
Takeaway	Work ordering can be a memory optimization without changing the data structure.
Evidence status	Supported by timing.

Detailed rows

n = 1,000,000, queries = 1,000,000, 5 repetitions:

Query order	binary ns/query	Eytzinger ns/query	blocked ns/query
random	`595.078`	`328.620`	`475.115`
sorted	`184.750`	`207.668`	`179.653`
clustered	`80.564`	`171.971`	`133.153`
bucketed high bits	`185.036`	`207.205`	`179.505`

7.11 Duplicate-Key And Lower-Bound Behavior

Code Shape

auto lower = std::lower_bound(values.begin(), values.end(), key);
auto upper = std::upper_bound(values.begin(), values.end(), key);
return std::make_pair(lower, upper);

Cache-Line Access Visual

membership:
CPU -> find any equal key

lower_bound:
CPU -> first key >= query

equal_range:
CPU -> lower bound path + upper bound path

reference point:
same sorted array, different query contract

Interpretation

Field	Notes
Expectation	Membership timing should not be assumed to represent lower-bound or range queries.
Measured result	`n = 1,048,576`, `queries = 1,000,000`, 3 repetitions: detailed rows below.
What happened	Duplicate-heavy rows were faster in this benchmark shape, but range queries still have their own cost.
Takeaway	Define the exact search contract before comparing layouts.
Evidence status	Supported by timing.

Detailed rows

n = 1,048,576, queries = 1,000,000, 3 repetitions:

Duplicate run	binary search membership	lower-bound membership	lower-bound position	equal-range count
`1`	`729.078`	`713.280`	`726.278`	`743.355`
`4`	`715.141`	`706.743`	`707.536`	`737.183`
`16`	`700.384`	`689.846`	`686.602`	`724.461`
`64`	`644.760`	`632.632`	`636.485`	`678.741`

8. Hash-Map Code Shapes

Family question:

How much does O(1) lookup depend on collision layout and metadata placement?

8.1 Chained Heap

Code Shape

struct ChainNode {
    int key;
    int value;
    ChainNode* next;
};

bool find(int key, int& out) const {
    ChainNode* node = buckets[hash(key) % bucketCount];
    while (node != nullptr) {
        if (node->key == key) {
            out = node->value;
            return true;
        }
        node = node->next;
    }
    return false;
}

Cache-Line Access Visual

CPU -> [bucket pointer]
       -> [? heap node: key value next]
       -> [? heap node: key value next]

reference point:
random bucket plus pointer chain

Interpretation

Field	Notes
Expectation	Heap chaining should be the pointer-chasing baseline.
Measured result	implemented but not shown as a separate public table in this article.
What happened	The displayed compare-all row uses chained arena rather than chained heap.
Takeaway	Heap chaining needs a direct row before it can support a public timing claim here.
Evidence status	Implemented but not shown.

8.2 Chained Arena

Code Shape

std::vector<ChainNode> arena;
std::vector<int> bucketHead;

int index = bucketHead[bucket];
while (index != -1) {
    const ChainNode& node = arena[static_cast<std::size_t>(index)];
    if (node.key == key) {
        return true;
    }
    index = node.nextIndex;
}

Cache-Line Access Visual

CPU -> [bucket index]
       -> [CL arena: node0 node1 node2 node3]
       -> maybe next arena index

reference point:
collision chain remains, but nodes are compact

Interpretation

Field	Notes
Expectation	Arena chaining should beat scattered chaining and may beat open addressing if chains stay short and compact.
Measured result	`n = 524,288`, `queries = 500,000`, 3 repetitions: detailed rows below.
What happened	Chained arena was fastest in this implementation and workload.
Takeaway	Open addressing does not automatically win. Implementation shape and compactness matter.
Evidence status	Supported by timing.

Detailed rows

n = 524,288, queries = 500,000, 3 repetitions:

Variant	ns/query	Visual
chained arena	`69.551`	`###########`
Robin Hood	`95.566`	`##############`
`std::unordered_map`	`100.008`	`###############`
tag split 8	`108.137`	`################`
linear probing	`118.353`	`##################`
bucketized 8	`120.731`	`##################`

8.3 Linear Probing

Code Shape

std::size_t i = hash(key) & mask;

while (table[i].occupied) {
    if (table[i].key == key) {
        return true;
    }
    i = (i + 1U) & mask;
}

return false;

Cache-Line Access Visual

CPU -> [? bucket start]
       -> [CL: slot0 slot1 slot2 slot3]
       -> [CL next: slot4 slot5 slot6 slot7]

reference point:
random start, then local probe scan

Interpretation

Field	Notes
Expectation	Linear probing should be good at moderate load factor and degrade sharply at high load factor.
Measured result	Load-factor sweep: detailed rows below.
What happened	Probe tails grew much faster than medians at high load. At `0.95` load, p99 probes exceeded `1200`.
Takeaway	Hash tables need load-factor and tail-probe measurements, not only average lookup time.
Evidence status	Supported by timing and representative counters.

Detailed rows

Load-factor sweep:

Target load	Inserted keys	ns/query	p50 probes	p95 probes	p99 probes
`0.30`	157,286	`138.442`	`1`	`3`	`5`
`0.50`	262,144	`175.606`	`1`	`6`	`11`
`0.70`	367,001	`227.824`	`2`	`16`	`32`
`0.85`	445,644	`297.676`	`3`	`63`	`144`
`0.95`	498,073	`676.317`	`7`	`531`	`1207`

Representative perf stat row:

Event	Count
cycles	`725,366,509`
instructions	`1,174,526,099`
cache references	`10,805,729`
cache misses	`7,064,581`
LLC load misses	`6,815,291`
dTLB load misses	`2,542,337`
branches	`284,527,336`
branch misses	`9,065,424`

8.4 Robin Hood Hashing

Code Shape

while (slotOccupied(i)) {
    if (table[i].key == key) {
        return true;
    }
    if (probeDistance(i) < currentProbeDistance) {
        return false;
    }
    i = (i + 1U) & mask;
}

Cache-Line Access Visual

CPU -> [? bucket]
       -> [CL: slot probe=0 | slot probe=1 | slot probe=2]

reference point:
local probes remain, but distance metadata controls tails

Interpretation

Field	Notes
Expectation	Robin Hood should reduce probe-tail variance compared with basic linear probing.
Measured result	Robin Hood row: detailed rows below.
What happened	Robin Hood beat linear probing in the compare-all row and had low p99 probes in the tail row.
Takeaway	Probe-tail distribution is as important as mean lookup time.
Evidence status	Supported directionally.

Detailed rows

Robin Hood row:

Variant	ns/query	p50 probes	p95 probes	p99 probes
Robin Hood	`134.260`	`1`	`3`	`5`

Compare-all row:

Variant	ns/query	Visual
Robin Hood	`95.566`	`##############`
linear probing	`118.353`	`##################`

8.5 Bucketized Hash

Code Shape

Bucket& bucket = buckets[bucketIndex(key)];

for (int slot = 0; slot < BucketSize; ++slot) {
    if (bucket.occupied[slot] && bucket.keys[slot] == key) {
        return true;
    }
}

Cache-Line Access Visual

bucket size 4:
CPU -> [CL bucket: slot0 slot1 slot2 slot3]

bucket size 16:
CPU -> [CL bucket part A] [CL bucket part B] [CL bucket part C] ...

reference point:
one bucket access tests several candidates

Interpretation

Field	Notes
Expectation	Bucket size should have a sweet spot: too small means more overflow, too large means too much local scan.
Measured result	Detailed rows below; benchmark numbers unchanged.
What happened	Bucket size `4` was best in this pass.
Takeaway	Bucket size is the hash-map version of block size: one memory access should buy enough comparisons, but not too many.
Evidence status	Supported by timing.

Detailed rows

Bucket size	ns/query	Visual
`4`	`91.262`	`###########`
`8`	`120.048`	`###############`
`16`	`164.193`	`####################`

8.6 Dense ID Array

Code Shape

int get(std::uint32_t denseId) {
    return values[denseId];
}

Cache-Line Access Visual

CPU -> [CL values: v0 v1 v2 v3 v4 v5 v6 v7]
       direct index

reference point:
hashing and collision handling are removed

Interpretation

Field	Notes
Expectation	Dense IDs should be the upper-bound baseline when the key domain allows direct indexing.
Measured result	Detailed rows below; benchmark numbers unchanged.
What happened	Dense ID lookup was much cheaper than hash-table rows.
Takeaway	When possible, remove hashing from the hot path rather than optimizing it.
Evidence status	Supported by timing and byte accounting.

Detailed rows

Variant	ns/query
dense ID array	`11.384`

Bytes-per-lookup row:

Variant	ns/query	Avg probes	p95 probes	Estimated bytes/lookup
dense ID array	`13.156`	`1.000`	`1`	`4.000`

8.7 Tag Split Hash

Code Shape

auto tag = fingerprint(hash);
auto& meta = metadata[bucket];

for (int i = 0; i < 16; ++i) {
    if (meta.tags[i] == tag) {
        const Entry& entry = entries[meta.entryIndex[i]];
        if (entry.key == key) {
            return true;
        }
    }
}

return false;

Cache-Line Access Visual

CPU -> [CL tag metadata: t0 t1 t2 ... t15]
       tag mismatch -> stop
       tag match    -> [CL full entry: key value]

reference point:
metadata can reject many misses before loading payload

Interpretation

Field	Notes
Expectation	Tag split should help misses and large-payload lookups, but tag width can trade false positives against metadata cost.
Measured result	Tag-size sweep: detailed rows below.
What happened	Tag width differences were small. Tag split did not win broadly at 0% hits, but helped in 50% and 100% rows in this implementation.
Takeaway	Fingerprint metadata needs hit/miss-ratio measurements. It is not automatically faster.
Evidence status	Partially supported.

Detailed rows

Tag-size sweep:

Tag bits	ns/query	p50 probes	p95 probes	p99 probes	Visual
`7`	`144.311`	`1`	`3`	`4`	`###################`
`8`	`146.515`	`1`	`3`	`4`	`###################`
`12`	`147.959`	`1`	`3`	`4`	`####################`
`16`	`145.981`	`1`	`3`	`4`	`###################`

Hit/miss ratio sweep:

Variant	Hit ratio	ns/query	p95 probes	p99 probes
linear	`0%`	`161.529`	`8`	`13`
tag8	`0%`	`175.518`	`8`	`13`
tag16	`0%`	`170.531`	`8`	`13`
linear	`50%`	`174.359`	`6`	`11`
tag8	`50%`	`160.616`	`6`	`11`
tag16	`50%`	`162.057`	`6`	`11`
linear	`100%`	`117.693`	`4`	`7`
tag8	`100%`	`109.967`	`4`	`7`
tag16	`100%`	`106.242`	`4`	`7`

8.8 Hot/Cold Hash

Code Shape

struct EntryHot {
    std::uint64_t key;
    std::uint32_t valueIndex;
    State state;
};

if (hot[i].key == key) {
    return values[hot[i].valueIndex];
}

Cache-Line Access Visual

normal entry:
CPU -> [CL: key state value payload payload]

split entry:
CPU -> [CL hot: key state valueIndex]
only on match -> [cold payload]

reference point:
payload stays cold during failed lookup

Interpretation

Field	Notes
Expectation	Hot/cold split should help once value payload is large enough.
Measured result	Detailed rows below; benchmark numbers unchanged.
What happened	Split layout lost at `16 B` and won for larger payloads.
Takeaway	Hot/cold split is conditional. It helps when payload is large and not always consumed.
Evidence status	Supported by timing.

Detailed rows

Payload	Normal entry ns/query	Split hot/cold ns/query
`16 B`	`177.117`	`191.221`
`64 B`	`211.278`	`181.337`
`128 B`	`222.311`	`178.956`
`256 B`	`236.144`	`184.487`

8.9 Batched Prefetch Lookup

Code Shape

for (std::size_t i = 0; i < keys.size(); i += Batch) {
    for (std::size_t j = i; j < i + Batch; ++j) {
        bucket[j - i] = table.bucketIndex(keys[j]);
        prefetchRead(&table.entries[bucket[j - i]]);
    }

    for (std::size_t j = i; j < i + Batch; ++j) {
        found += table.findFromBucket(keys[j], bucket[j - i]);
    }
}

Cache-Line Access Visual

phase 1:
CPU computes bucket0 bucket1 bucket2 bucket3
P -> [CL bucket0] [CL bucket1] [CL bucket2] [CL bucket3]

phase 2:
CPU uses bucket0 bucket1 bucket2 bucket3

reference point:
batching creates independent future addresses

Interpretation

Field	Notes
Expectation	Batching should expose memory-level parallelism. Too-large batches may add overhead or pollution.
Measured result	Single-thread: detailed rows below.
What happened	Batch `8` was best in the single-thread row.
Takeaway	Batching can improve lookup throughput, but batch size is workload-specific.
Evidence status	Supported by timing.

Detailed rows

Single-thread:

Batch size	ns/query	Visual
`1`	`137.675`	`####################`
`2`	`122.769`	`##################`
`4`	`108.192`	`################`
`8`	`99.972`	`###############`
`16`	`106.113`	`###############`
`32`	`110.872`	`################`

8.10 Set-Conflict Hash

Code Shape

std::size_t indexIdentity(std::uint64_t key) {
    return key & mask;
}

std::size_t indexMixed(std::uint64_t key) {
    return splitmix64(key) & mask;
}

Cache-Line Access Visual

identity + patterned keys:
CPU -> [CL set A] -> [CL set A] -> [CL set A] !

mixed hash:
CPU -> [CL set A] -> [CL set C] -> [CL set F]

reference point:
hash choice affects table placement and cache-set pressure

Interpretation

Field	Notes
Expectation	Hash quality and table size should affect lookup time, especially with patterned keys.
Measured result	Sequential keys: detailed rows below.
What happened	Identity hashing was not worst on this sequential-key workload.
Takeaway	Conflict experiments must include realistic bad key patterns, not only sequential keys.
Evidence status	Partially supported.

Detailed rows

Sequential keys:

Table size	Hash	ns/query	Visual
1,048,576	identity	`36.584`	`###########`
1,048,576	splitmix64	`67.638`	`###################`
1,048,576	multiplicative	`73.249`	`####################`
1,048,699	identity	`36.638`	`###########`
1,048,699	splitmix64	`65.814`	`##################`
1,049,585	identity	`35.920`	`##########`
1,049,585	splitmix64	`64.511`	`##################`

8.11 Tombstone Degradation

Code Shape

eraseSomeKeys();       // leaves tombstones
measureLookup();

rehashWithoutDeadSlots();
measureLookup();

Cache-Line Access Visual

before repair:
CPU -> [CL: filled tombstone filled tombstone empty]

after rehash:
CPU -> [CL: filled filled empty . .]

reference point:
tombstones extend probe chains

Interpretation

Field	Notes
Expectation	Tombstones should degrade lookup by lengthening probe sequences. Rehash repair should reduce tails.
Measured result	Original benchmark status: detailed rows below.
What happened	Rehash repair reduced both time and probe tails.
Takeaway	Open-addressed hash tables must account for table aging, not just fresh construction.
Evidence status	Replacement measured; original benchmark blocked by correctness.

Detailed rows

Original benchmark status:

hashmap_tombstone_degradation: blocked by correctness

Focused replacement row:

Variant	Tombstones	ns/query	p50 probes	p95 probes	p99 probes
with tombstones	`72,447`	`212.913`	`1`	`10`	`18`
rehashed	`0`	`156.024`	`1`	`6`	`11`

8.12 Bytes Per Lookup

Code Shape

bytes += probes * bytesPerHotSlot;
if (keyMatch) {
    bytes += bytesPerPayload;
}

Cache-Line Access Visual

dense ID:
CPU -> [CL values: *]                 about 4 B useful payload

tag split:
CPU -> [CL tags: * * * *] -> maybe [full entry]

bucketized:
CPU -> [CL bucket with many slots]     more bytes per probe

reference point:
estimated bytes and timing may rank layouts differently

Interpretation

Field	Notes
Expectation	Estimated bytes touched should explain part of lookup time but not all of it.
Measured result	`50%` hits: detailed rows below.
What happened	Dense ID touched the fewest bytes and was fastest. Bucketized had low probe count but high estimated bytes.
Takeaway	Probe count alone is incomplete. Bytes touched and cache-line footprint also matter.
Evidence status	Supported by software byte accounting and timing.

Detailed rows

50% hits:

Variant	ns/query	Avg probes	p95 probes	Estimated bytes/lookup
dense ID array	`13.156`	`1.000`	`1`	`4.000`
Robin Hood	`108.014`	`1.625`	`3`	`39.001`
tag split 8	`125.573`	`1.277`	`3`	`26.323`
bucketized 8	`135.593`	`1.001`	`1`	`136.074`
linear probing	`143.145`	`1.994`	`6`	`47.846`
hot/cold split	`168.590`	`1.994`	`6`	`51.846`

9. Multithreaded Code Shapes

Family question:

What changes when cache lines become ownership units shared across cores?

Code Shape

for (std::uint64_t i = 0; i < steps; ++i) {
    counters[threadId].value++;
}

Cache-Line Access Visual

bad adjacent counters:
[CL0: T0 counter | T1 counter | T2 counter | T3 counter]
       ! ownership moves between cores on writes

reference point:
threads write different variables but share one cache line

Interpretation

Field	Notes
Expectation	Adjacent counters should scale poorly because writes move cache-line ownership between cores.
Measured result	`steps = 5,000,000`, 3 repetitions: detailed rows below.
What happened	The adjacent row was worse than padded at 8 threads, but local write-once was much stronger than both.
Takeaway	Avoid repeated shared writes inside hot loops.
Evidence status	Supported by timing; `perf c2c` evidence is blocked by tooling.

Detailed rows

steps = 5,000,000, 3 repetitions:

Threads	Bad adjacent ns/inc	Padded ns/inc	Local write-once ns/inc
`1`	`7.642`	`7.669`	`0.047`
`2`	`3.958`	`3.826`	`0.022`
`4`	`2.020`	`1.926`	`0.015`
`8`	`2.368`	`1.737`	`0.020`

9.2 Padded Counters

Code Shape

struct alignas(64) PaddedCounter {
    std::uint64_t value = 0;
};

counters[threadId].value++;

Cache-Line Access Visual

padded counters:
[CL0: T0 counter]
[CL1: T1 counter]
[CL2: T2 counter]
[CL3: T3 counter]

reference point:
each writer owns a separate line

Interpretation

Field	Notes
Expectation	Padding should reduce false sharing when threads repeatedly publish per-thread counters.
Measured result	8-thread row: detailed rows below.
What happened	Padding improved the 8-thread counter row.
Takeaway	Padding is useful when repeated writes to result slots are unavoidable.
Evidence status	Supported by timing.

Detailed rows

8-thread row:

Variant	ns/inc
bad adjacent counters	`2.368`
padded counters	`1.737`

9.3 Local Write-Once Counters

Code Shape

std::uint64_t local = 0;

for (auto query : myQueries) {
    local += search(query);
}

results[threadId].value = local;

Cache-Line Access Visual

hot loop:
T0 -> register/local stack only
T1 -> register/local stack only

end:
T0 -> [CL result0]
T1 -> [CL result1]

reference point:
shared write happens once

Interpretation

Field	Notes
Expectation	Local accumulation should beat both adjacent and padded counters when the computation can reduce at the end.
Measured result	8-thread row: detailed rows below.
What happened	Local write-once was much cheaper than both shared-counter layouts.
Takeaway	Make reads shared and writes private.
Evidence status	Supported by timing.

Detailed rows

8-thread row:

Variant	ns/inc
bad adjacent counters	`2.368`
padded counters	`1.737`
local write-once	`0.020`

9.4 Read-Only Binary Search Scaling

Code Shape

std::size_t begin = threadId * queries.size() / threadCount;
std::size_t end = (threadId + 1U) * queries.size() / threadCount;

for (std::size_t i = begin; i < end; ++i) {
    local += contains(sharedIndex, queries[i]);
}

Cache-Line Access Visual

shared data:
all threads -> read-only [array/tree cache lines]

queries:
T0 -> query range 0
T1 -> query range 1
T2 -> query range 2

reference point:
shared reads, private counters

Interpretation

Field	Notes
Expectation	Read-only search should scale until memory latency, bandwidth, topology, or frequency effects dominate.
Measured result	`n = 1,000,000`, `queries = 1,000,000`, 3 repetitions: detailed rows below.
What happened	Eytzinger stayed the fastest 8-thread row. Some speedups were above ideal, so exact speedup values remain provisional.
Takeaway	Use these rows for direction, not as fixed-frequency scaling claims.
Evidence status	Supported directionally; fixed-frequency and SMT-off reruns are missing.

Detailed rows

n = 1,000,000, queries = 1,000,000, 3 repetitions:

Variant	1 thread ns/query	4 threads ns/query	8 threads ns/query	8-thread speedup
binary	`751.195`	`203.478`	`124.309`	`6.043x`
branchless	`570.114`	`217.112`	`112.639`	`5.061x`
Eytzinger	`324.589`	`82.790`	`66.675`	`4.868x`
blocked	`493.224`	`124.722`	`84.513`	`5.836x`

9.5 Read-Only Hash Scaling

Code Shape

for (std::size_t i = begin; i < end; ++i) {
    local += table.find(queries[i]);
}

results[threadId].value = local;

Cache-Line Access Visual

T0 -> [? bucket] -> local probes
T1 -> [? bucket] -> local probes
T2 -> [? bucket] -> local probes

shared table is read-only
per-thread result is private or padded

Interpretation

Field	Notes
Expectation	Read-only hash lookup should scale until memory resources saturate.
Measured result	`n = 500,000`, `queries = 1,000,000`, 3 repetitions: detailed rows below.
What happened	Robin Hood was the fastest 8-thread hash-layout row.
Takeaway	Threaded lookup should be measured per layout; single-thread ranking may not directly transfer.
Evidence status	Supported directionally.

Detailed rows

n = 500,000, queries = 1,000,000, 3 repetitions:

Variant	1 thread ns/query	4 threads ns/query	8 threads ns/query	8-thread speedup
linear probing	`108.105`	`27.522`	`19.392`	`5.575x`
Robin Hood	`88.702`	`23.636`	`17.099`	`5.187x`
bucketized 8	`120.989`	`30.956`	`31.086`	`3.892x`
tag split 8	`79.193`	`21.796`	`20.722`	`3.822x`
hot/cold split	`122.975`	`33.371`	`26.947`	`4.564x`

9.6 Chunked-List Scaling

Code Shape

std::size_t firstChunk = threadId * chunks.size() / threadCount;
std::size_t lastChunk = (threadId + 1U) * chunks.size() / threadCount;

for (std::size_t c = firstChunk; c < lastChunk; ++c) {
    scanChunk(chunks[c]);
}

Cache-Line Access Visual

T0 -> chunks [0..K/T)
T1 -> chunks [K/T..2K/T)
T2 -> chunks [2K/T..3K/T)

reference point:
threads scan disjoint ranges instead of all starting at the same head

Interpretation

Field	Notes
Expectation	Chunk-range partitioning should scale better than all threads scanning from the same list head.
Measured result	`n = 1,000,000`, chunk size `64`, 3 repetitions: detailed rows below.
What happened	Scaling improved through 4 threads and then got worse at 8 threads.
Takeaway	Partitioning avoids duplicate work, but memory throughput and scheduling still limit scaling.
Evidence status	Supported directionally.

Detailed rows

n = 1,000,000, chunk size 64, 3 repetitions:

Threads	ns/value	Speedup	Efficiency
`1`	`1.228`	`1.000x`	`1.000`
`2`	`0.708`	`1.734x`	`0.867`
`4`	`0.501`	`2.453x`	`0.613`
`8`	`0.950`	`1.293x`	`0.162`

9.7 Batched Hash Scaling

Code Shape

for (std::size_t i = begin; i < end; i += Batch) {
    prefetchBatch(i);
    local += lookupBatch(i);
}

Cache-Line Access Visual

inside each thread:
P -> [bucket0] [bucket1] [bucket2] [bucket3]
then lookup bucket0..bucket3

across threads:
T0 batch stream
T1 batch stream
T2 batch stream

reference point:
MLP inside a thread plus parallel query ranges

Interpretation

Field	Notes
Expectation	Batching should improve lookup throughput, but the best batch size may change with thread count.
Measured result	`n = 500,000`, `queries = 1,000,000`, 3 repetitions: detailed rows below.
What happened	Batch `8` was best single-thread. Batch `4` was best at 8 threads.
Takeaway	Batch size is not constant across thread counts.
Evidence status	Supported by timing.

Detailed rows

n = 500,000, queries = 1,000,000, 3 repetitions:

Batch	1 thread ns/query	4 threads ns/query	8 threads ns/query	8-thread speedup
`1`	`137.084`	`48.877`	`37.931`	`3.614x`
`2`	`118.550`	`32.283`	`21.828`	`5.431x`
`4`	`104.851`	`27.642`	`19.576`	`5.356x`
`8`	`97.752`	`37.052`	`27.606`	`3.541x`
`16`	`105.205`	`27.144`	`19.794`	`5.315x`
`32`	`107.414`	`28.266`	`20.231`	`5.309x`

9.8 Read-Only Stream Bandwidth

Code Shape

for (std::size_t i = begin; i < end; ++i) {
    local += values[i];
}

Cache-Line Access Visual

T0 -> [CL range 0: * * * *]
T1 -> [CL range 1: * * * *]
T2 -> [CL range 2: * * * *]
T3 -> [CL range 3: * * * *]

reference point:
disjoint read-only streaming ranges

Interpretation

Field	Notes
Expectation	Stream bandwidth should rise with threads until memory bandwidth saturates.
Measured result	`n = 8,388,608` `uint64_t` values, 5 repetitions: detailed rows below.
What happened	The stream row peaked at 4 threads and fell at 8 threads.
Takeaway	This gives a concrete bandwidth-saturation boundary for the measured setup.
Evidence status	Supported directionally; fixed-frequency rerun missing.

Detailed rows

n = 8,388,608 uint64_t values, 5 repetitions:

Threads	ns/value	GB/s	Speedup	Efficiency
`1`	`1.333`	`6.000`	`1.000x`	`1.000`
`2`	`0.735`	`10.886`	`1.814x`	`0.907`
`4`	`0.473`	`16.903`	`2.817x`	`0.704`
`8`	`0.519`	`15.413`	`2.569x`	`0.321`

10. Result Map

Topic	Best measured row in this pass	Main note
Linear scan	64 MiB `int32_t`, `0.867 ns/element`	baseline stream
Stride scan	64 MiB stride `16`, `10.034 ns/access`	strided access wastes cache-line payload
Full raw working-set ladder	64 MiB linear `0.867 ns/op`, random `45.038`, dependent `160.778`	all raw-memory families now have ladder rows
Huge page probe	default 4 KiB pages `1.289 ns/value`; transparent huge page path `1.253`; explicit huge pages unavailable	huge-page result is measured/probed
Associativity	stride `1025`, `5.163 ns/access` among listed conflict rows	plus-one strides were faster than paired power-of-two strides
Pointer chase	arena sequential, `5.845 ns/node`	layout and order matter
Dependent chase	sequential cycle, `6.674 ns/step`	random dependency costs about 14x
MLP chase	16 lanes, `17.128 ns/chain-step`	independent misses overlap
Linked list	chunked 64, `129,629 ns/query`	chunking reduces pointer jumps
Linked-list mutation	indexed insert `13.728 ns/op`; heap insert-new `177.696 ns/op`	allocation and pointer rewrites dominate mutation
Hot/cold list	256 B payload, `7.6x` faster than fat node	keep payload cold
Binary search	Eytzinger, `328.340 ns/query`	layout beats normal binary
Branch-miss counter row	binary-search program `11.60%`; hash linear probing `3.19%`	representative `perf stat` evidence exists
Query ordering	clustered binary, `80.564 ns/query`	work order changes locality
Hash map	chained arena, `69.551 ns/query`	measured winner in this implementation
Hash bytes touched	tag split 8 estimated `26.323 B/lookup` at 50% hits	bytes and time rank layouts differently
Batched hash	batch 8 single-thread, `97.752 ns/query`	batching improved MLP over batch 1
False sharing	8-thread adjacent counters, `2.368 ns/inc` vs padded `1.737` vs local `0.020`	private accumulation is the cleanest shape
Binary scaling	Eytzinger 8 threads, `66.675 ns/query`	fastest threaded binary-search row
Hash scaling	Robin Hood 8 threads, `17.099 ns/query`	fastest threaded hash-layout row
Batched hash scaling	batch 4 at 8 threads, `19.576 ns/query`	batching sweet spot changed with thread count
Stream bandwidth	4 threads, `16.903 GB/s`	bandwidth flattened after 4 threads

11. Negative Results And Mixed Results

Observation	Status	Interpretation
Simple linked-list prefetch did not rescue pointer chasing.	Mixed	The future address is discovered too late for prefetch to hide much latency.
Arena allocation did not fix random traversal.	Supported	Contiguous storage is not enough when traversal order is random.
Chained arena beat the expected open-addressing winner in one hash compare-all row.	Mixed	Implementation details and compact chains mattered on this system.
Transparent huge-page advice had a small measured delta.	Inconclusive	Explicit huge pages were unavailable, so the result is not a full huge-page claim.
Padding counters helped, but local accumulation was much better.	Supported	Avoiding repeated shared writes is stronger than spacing them out.
Tombstone degradation original binary failed correctness.	Blocked by correctness	Replacement rows are useful, but the original benchmark cannot support timing claims.
`perf c2c` was unavailable.	Blocked by tooling	False-sharing claims are timing-backed, not c2c/HITM-backed.

12. Tradeoff Matrix

Optimization	Helps when	Hurts when	Evidence strength
Sequential layout	Traversal order is known.	Queries require random access.	Supported
Chunked list	Read-heavy misses or late hits dominate.	Mutation inside chunks is frequent.	Supported
Hot/cold split	Search touches metadata more than payload.	Every lookup consumes payload immediately.	Supported
Bloom chunk filter	Negative lookups dominate.	Hit-heavy lookup pays filter overhead.	Partially supported
Branchless binary	Branch mispredicts are material.	Memory stalls dominate.	Partially supported
Blocked search	One global jump can be traded for local scans.	Block scans become too large.	Partially supported
Eytzinger layout	Static large arrays are searched repeatedly.	Frequent updates require rebuilds.	Supported
Robin Hood hashing	Probe-tail control matters.	Insert displacement cost matters.	Partially supported
Tag split hash	Misses and large payloads are common.	Tags false-positive often or payloads are small.	Partially supported
Batched lookup	Independent queries exist.	Single-query latency is the target.	Supported directionally
Padded counters	Threads must publish per-thread counters.	Local accumulation can write once.	Supported

13. Research Claims

Claim	Mechanism	Measurement evidence	Caveat	Verdict
Sequential scan is the baseline to beat.	Full cache-line use and predictable access.	Linear scan and working-set rows.	Absolute timings are system-specific.	Supported
Pointer chasing is slow because the next address is data-dependent.	Load-to-use dependency prevents overlap.	Dependent chase and linked-list rows.	Arena allocation helps compact cases.	Supported
Chunking improves list search.	One pointer jump covers many local comparisons.	Chunked list and hit-position rows.	Mutation cost changes the tradeoff.	Supported
Binary-search layout can matter as much as comparison count.	Fewer global jumps or better tree layout.	Eytzinger, blocked, and fanout rows.	No layout wins every row.	Supported
Hash-map lookup quality depends on collision layout and metadata placement.	Local probes, probe tails, hot metadata.	Compare-all, probe-tail, tag, hot/cold, and byte rows.	Chained arena beat expectations in one row.	Supported
Batching can expose memory-level parallelism.	Independent bucket or chain requests overlap.	MLP and batched hash rows.	Batch sweet spot depends on workload and thread count.	Supported directionally
False sharing can distort benchmark results.	Cache-line ownership transfers on adjacent writes.	Adjacent, padded, and local counter rows.	`perf c2c` unavailable.	Partially supported

14. Remaining Evidence

Topic	Status	Needed next
Per-variant branch counters	partially measured	Isolate normal binary, branchless, Eytzinger, and blocked binaries under separate `perf stat` runs.
Direct RFO/HITM evidence	blocked by tooling	Requires CPU-specific events or available `perf c2c`.
SMT-off scaling	not measured yet	Rerun threaded rows with sibling threads disabled.
Fixed-frequency scaling	not measured yet	Rerun with frequency pinned where permissions and hardware allow.
Cross-machine validation	not measured yet	Repeat on at least one newer CPU and one different cache hierarchy.
Production hash variants	not measured yet	Add string keys, realistic deletion churn, and production-style allocators.

15. Reproducibility Notes

Item	Public description
Measurement run	Benchmark suite run on `2026-06-02`
Benchmark code	Scalar C++20, no SIMD or assembly in this version
Build shape	60 separate benchmark binaries
Timing method	Warmup, repeated runs, median timing, checksum-protected loops
Pinning	Single-thread rows pinned to one CPU; multithread rows pinned to the available CPU set
Counter rows	Representative Linux `perf stat` rows for binary search and hash-map lookup
Blocked tooling	`perf c2c` unavailable, so false sharing is not backed by c2c counter evidence
Invalid row policy	Correctness failures are reported as blocked and are not used for timing claims

Figures in this article:

Figure	What it shows
Working-set ladder	Raw-memory footprint effects
Prefetch distance	Stride prefetch-distance sweep
Linked-list miss cost	Linked-list miss cost and bytes per key
Binary locality sweep	Blocked search and implicit B-tree fanout
Hash probe tails	Load factor versus p50/p95/p99 probe tails
Hash bytes touched	Estimated bytes touched versus timing
Read-only search scaling	Threaded binary and hash lookup
Counter ownership ladder	Adjacent, padded, and local counter writes
Stream bandwidth	Read-only stream bandwidth scaling

16. Limitations

Area	Limitation
System coverage	One measured Linux x86_64 system is included in the current result tables.
CPU controls	SMT stayed enabled; frequency was not pinned; governor was `powersave`.
Counter coverage	Linux `perf stat` branch/cache rows exist for representative binary-search and hash-map binaries; `perf c2c` was unavailable.
Portability	Absolute timings are machine-specific; relative trends are the intended claim.
Implementation scope	Scalar C++20 only; no SIMD, assembly, cloud, or distributed-system mapping.
Data structures	Some implementations are first-pass benchmark structures, not production containers.
Completeness	The remaining missing items are mainly controlled-system counter reruns, SMT/frequency controls, and cross-machine validation.

17. Final Takeaway

Big-O explains growth. Cache behavior explains the trip through memory.

The repeated transformation pattern is:

one value per pointer
-> many values per pointer

global jumps
-> local block scans

pointer chains
-> local probes

fat objects
-> hot metadata plus cold payload

one dependent miss
-> many independent misses

shared writes
-> private or padded writes

Search performance is memory-pattern performance.