0. Abstract
This note studies search performance as a memory-access problem.
The benchmark suite compares scalar C++20 implementations across:
- raw memory patterns
- linked-list layouts
- binary-search layouts
- hash-map collision strategies
- batching and prefetching
- multithreaded cache-line ownership
The central claim is:
Search performance is memory-pattern performance.
Absolute timings are specific to one measured x86_64 Linux system. The useful claim is the relative pattern: which code shapes stream through cache lines, which waste cache-line payload, which chase dependent addresses, which expose independent memory requests, and which create cache-line ownership traffic.
5. Raw Memory Code Shapes
Family question:
What does the CPU pay for when the operation is only "read memory"?
5.1 Linear Scan
Code Shape
std::uint64_t sum = 0;
for (std::size_t i = 0; i < values.size(); ++i) {
sum += static_cast<std::uint64_t>(values[i]);
}
return sum;
Cache-Line Access Visual
int32 payload, 16 values per 64B line:
CPU -> [CL0: * * * * * * * * * * * * * * * *]
[CL1: * * * * * * * * * * * * * * * *]
reference point:
nearly every fetched slot is useful
Interpretation
| Field | Notes |
|---|
| Expectation | Sequential scan should be the best raw-memory baseline because addresses are predictable and cache-line payload is fully used. |
| Measured result | n = 1,000,000, 5 repetitions: detailed rows below. |
| What happened | Wider records cost more per logical element because each element consumes more cache-line payload. |
| Takeaway | Keep hot scan fields narrow. Do not drag cold payload through a search loop. |
| Evidence status | Supported by timing. |
Detailed rows
n = 1,000,000, 5 repetitions:
| Variant | Element size | ns/element | Visual |
|---|
int32_t | 4 B | 0.743 | # |
int64_t | 8 B | 1.027 | ## |
struct16 | 16 B | 2.808 | ##### |
struct64 | 64 B | 10.170 | #################### |
5.2 Stride Scan
Code Shape
std::uint64_t sum = 0;
for (std::size_t i = 0; i < values.size(); i += stride) {
sum += static_cast<std::uint64_t>(values[i]);
}
return sum;
Cache-Line Access Visual
stride = 16 over int32:
CPU -> [CL0: * . . . . . . . . . . . . . . .]
[CL1: * . . . . . . . . . . . . . . .]
reference point:
access is predictable, but most fetched bytes are unused
Interpretation
| Field | Notes |
|---|
| Expectation | Large strides should be slower than stride 1 because each cache line contributes fewer useful values. Power-of-two strides may also create cache-set pressure. |
| Measured result | Selected rows, n = 1,000,000, 5 repetitions: detailed rows below. |
| What happened | Stride 1 was fastest. Several plus-one strides beat nearby power-of-two strides. |
| Takeaway | Do not only ask whether the working set fits in cache. Also ask whether the addresses map evenly across cache sets. |
| Evidence status | Supported by timing. |
Detailed rows
Selected rows, n = 1,000,000, 5 repetitions:
| Stride | ns/access | Visual |
|---|
1 | 0.686 | # |
16 | 6.352 | ######### |
64 | 7.921 | ############ |
65 | 7.430 | ########### |
256 | 7.292 | ########### |
257 | 5.000 | ####### |
1024 | 7.586 | ########### |
1025 | 3.784 | ###### |
5.3 Random Access
Code Shape
std::uint64_t sum = 0;
for (std::size_t i = 0; i < indexes.size(); ++i) {
sum += static_cast<std::uint64_t>(values[indexes[i]]);
}
return sum;
Cache-Line Access Visual
index stream:
17 -> 900231 -> 42 -> 700001
CPU -> [? CL17: * . . . . . . .]
[? CL900: * . . . . . . .]
[? CL42: * . . . . . . .]
reference point:
loads are independent, but the next line is hard to predict
Interpretation
| Field | Notes |
|---|
| Expectation | Random access should be tolerable for small windows and expensive when the working set exceeds cache. |
| Measured result | n = 1,000,000, queries = 1,000,000, 5 repetitions: detailed rows below. |
| What happened | The small-window row was much cheaper than full-array random access. |
| Takeaway | Constraining randomness into a smaller active window can be a real optimization. |
| Evidence status | Supported by timing. |
Detailed rows
n = 1,000,000, queries = 1,000,000, 5 repetitions:
| Pattern | ns/access | Visual |
|---|
| random small window | 4.355 | #### |
| random within page | 10.172 | ########## |
| shuffled permutation | 11.144 | ########### |
| uniform random | 12.394 | ############ |
5.4 Pointer Chase
Code Shape
std::uint64_t sum = 0;
for (PtrNode* node = head; node != nullptr; node = node->next) {
sum += node->value;
}
return sum;
Cache-Line Access Visual
sequential arena:
CPU -> [CL0: node0 node1 node2 node3] -> [CL1: node4 node5 ...]
random heap or random arena:
CPU -> [? node A] -> [? node Q] -> [? node D] -> null
reference point:
next address is stored in the current node
Interpretation
| Field | Notes |
|---|
| Expectation | Arena allocation should help, but random traversal should remain expensive because each next address depends on the previous load. |
| Measured result | n = 500,000, 5 repetitions: detailed rows below. |
| What happened | Arena allocation helped, but traversal order mattered more. Random arena traversal was still much slower than sequential arena traversal. |
| Takeaway | Contiguous storage is not enough. Traversal order controls whether cache-line locality is usable. |
| Evidence status | Supported by timing. |
Detailed rows
n = 500,000, 5 repetitions:
| Variant | ns/node | Relative |
|---|
| arena sequential | 5.845 | 1.0x |
| heap sequential | 9.294 | 1.6x |
| arena randomized | 112.305 | 19.2x |
| heap randomized | 150.632 | 25.8x |
5.5 Dependent Pointer Chase
Code Shape
std::size_t index = 0;
std::uint64_t sum = 0;
for (std::size_t step = 0; step < steps; ++step) {
index = next[index];
sum += index;
}
return sum;
Cache-Line Access Visual
idx0 -> next[idx0] = idx1
idx1 -> next[idx1] = idx2
idx2 -> next[idx2] = idx3
CPU -> [? CL idx0] -> [? CL idx1] -> [? CL idx2]
reference point:
the next load cannot be issued until the current load returns
Interpretation
| Field | Notes |
|---|
| Expectation | Random dependent chains should expose load-to-use latency and scale poorly with working-set size. |
| Measured result | n = 1,000,000, steps = 5,000,000, 5 repetitions: detailed rows below. |
| What happened | Random and page-spread cycles were about 14x slower than the sequential cycle. |
| Takeaway | Dependent memory chains are a latency problem, not just a cache-capacity problem. |
| Evidence status | Supported by timing. |
Detailed rows
n = 1,000,000, steps = 5,000,000, 5 repetitions:
| Variant | ns/step | Relative |
|---|
| sequential cycle | 6.674 | 1.0x |
| random permutation cycle | 92.513 | 13.9x |
| page-spread cycle | 96.173 | 14.4x |
5.6 Prefetch Pointer Chase
Code Shape
std::uint64_t sum = 0;
for (PtrNode* node = head; node != nullptr; node = node->next) {
if (node->next != nullptr) {
prefetchRead(node->next);
}
sum += node->value;
}
return sum;
Cache-Line Access Visual
CPU reads [CL?: node A | next -> node B]
P -> [CL?: node B]
CPU then uses node B soon after
reference point:
prefetch is often too late because node B is discovered at node A
Interpretation
| Field | Notes |
|---|
| Expectation | Simple next-prefetch may not help pointer chasing because address discovery and use are too close together. |
| Measured result | n = 500,000, 5 repetitions: detailed rows below. |
| What happened | Pointer next-prefetch did not improve the measured rows. Longer stride prefetch distances also did not improve this pass. |
| Takeaway | Prefetch is a distance problem. It helps only when the future address is known early enough and the prefetched data is likely to be used. |
| Evidence status | Supported for these rows; broader prefetch policy remains workload-specific. |
Detailed rows
n = 500,000, 5 repetitions:
| Variant | ns/node |
|---|
| no prefetch | 97.438 |
| prefetch next | 100.348 |
| prefetch next-next | 134.773 |
Prefetch distance sweep for stride 16:
| Distance | ns/access | Visual |
|---|
| 0 | 10.293 | ################## |
| 1 | 10.426 | ################## |
| 2 | 10.265 | ################## |
| 4 | 10.204 | ################## |
| 8 | 10.221 | ################## |
| 16 | 11.400 | #################### |
| 32 | 11.560 | #################### |
| 64 | 11.427 | #################### |
| 128 | 11.491 | #################### |

5.7 MLP Pointer Chase
Code Shape
std::vector<std::size_t> index(lanes, 0);
std::uint64_t sum = 0;
for (std::size_t step = 0; step < steps; ++step) {
for (int lane = 0; lane < lanes; ++lane) {
index[lane] = chains[lane][index[lane]];
sum += index[lane];
}
}
return sum;
Cache-Line Access Visual
lane 0: CPU -> [? CL] -> [? CL] -> [? CL]
lane 1: CPU -> [? CL] -> [? CL] -> [? CL]
lane 2: CPU -> [? CL] -> [? CL] -> [? CL]
lane 3: CPU -> [? CL] -> [? CL] -> [? CL]
reference point:
each lane is dependent, but lanes are independent of each other
Interpretation
| Field | Notes |
|---|
| Expectation | Multiple independent chains should expose memory-level parallelism and lower per-chain cost until the core or memory system saturates. |
| Measured result | working_set = 8,000,000 bytes, steps = 1,000,000, 5 repetitions: detailed rows below. |
| What happened | Per-chain time dropped from 107.056 ns to about 17 ns near 16 lanes. |
| Takeaway | Independent work can hide latency even when each individual chain has poor locality. |
| Evidence status | Supported by timing. |
Detailed rows
working_set = 8,000,000 bytes, steps = 1,000,000, 5 repetitions:
| Lanes | ns/outer step | ns/chain-step | Visual per chain |
|---|
1 | 107.056 | 107.056 | #################### |
2 | 108.373 | 54.186 | ########## |
4 | 114.800 | 28.700 | ##### |
8 | 152.170 | 19.021 | #### |
16 | 274.044 | 17.128 | ### |
32 | 557.560 | 17.424 | ### |
5.8 Associativity Conflict
Code Shape
for (std::size_t i = 0; i < values.size(); i += stride) {
sum += values[i];
}
Cache-Line Access Visual
stride 256:
CPU -> [CL set A: *] -> [CL set A: *] -> [CL set A: *]
!
stride 257:
CPU -> [CL set A: *] -> [CL set B: *] -> [CL set C: *]
reference point:
similar strides can map differently across cache sets
Interpretation
| Field | Notes |
|---|
| Expectation | Power-of-two strides can concentrate accesses into fewer cache sets; plus-one variants may spread mapping differently. |
| Measured result | n = 4,000,000, 5 repetitions: detailed rows below. |
| What happened | The plus-one stride was faster in every paired row in this pass. |
| Takeaway | Cache-set placement belongs in the experiment matrix. Similar-looking access patterns can differ. |
| Evidence status | Supported by timing; counter evidence not included. |
Detailed rows
n = 4,000,000, 5 repetitions:
| Stride | ns/access | Compared row | Visual |
|---|
64 | 7.862 | power of two | ############ |
65 | 7.040 | plus one | ########### |
128 | 7.970 | power of two | ############ |
129 | 7.288 | plus one | ########### |
256 | 8.441 | power of two | ############# |
257 | 7.434 | plus one | ########### |
512 | 7.507 | power of two | ########### |
513 | 6.371 | plus one | ########## |
1024 | 7.595 | power of two | ########### |
1025 | 4.920 | plus one | ####### |
5.9 TLB Page Walk
Code Shape
for (std::size_t page : page_order) {
sum += values[(page * page_size) / sizeof(values[0])];
}
Cache-Line Access Visual
one access per page:
CPU -> [page 0: CL0: * . . . . . . .]
[page 1: CL0: * . . . . . . .]
[page 2: CL0: * . . . . . . .]
reference point:
one useful load per page stresses address translation
Interpretation
| Field | Notes |
|---|
| Expectation | Page-walk patterns should be much slower than dense sequential scans because they use fewer values per page and per cache line. |
| Measured result | n = 4,000,000, 3906 pages, 5 repetitions: detailed rows below. |
| What happened | The page-walk rows were much slower than dense sequential scan. |
| Takeaway | Good search layout should reuse both cache lines and pages. |
| Evidence status | Supported by timing; direct TLB counter rows are not shown per variant. |
Detailed rows
n = 4,000,000, 3906 pages, 5 repetitions:
| Pattern | ns/access |
|---|
| dense sequential scan | 4.410 |
| sequential page walk | 31.813 |
| one access per 4 KB page | 32.018 |
| random page walk | 32.817 |
Completion pass:
| Pattern | ns/access | Visual |
|---|
| one access per 4 KB page | 33.580 | ################## |
| sequential page walk | 33.277 | ################## |
| random page walk | 36.540 | #################### |
| dense sequential scan | 4.390 | ## |
5.10 Working-Set Ladder
Code Shape
for (std::size_t footprint : footprints) {
buildInput(footprint);
runSameAccessShape();
}
Cache-Line Access Visual
32 KiB -> likely cache-resident
256 KiB -> larger private-cache pressure
1 MiB -> beyond small private caches
8 MiB -> shared-cache and memory pressure
64 MiB -> memory-system pressure
reference point:
same loop shape, larger footprint
Interpretation
| Field | Notes |
|---|
| Expectation | Sequential access should stay relatively stable; random and dependent access should rise sharply with footprint. |
| Measured result | Detailed rows below; benchmark numbers unchanged. |
| What happened | Sequential access stayed cheap. Random, pointer, and dependent rows rose sharply at larger footprints. MLP stayed lower than a single random dependent chain. |
| Takeaway | Footprint sweeps are required before interpreting one benchmark row. |
| Evidence status | Supported by timing. |
Detailed rows
| Pattern | Variant | 32 KiB ns/op | 256 KiB ns/op | 1 MiB ns/op | 8 MiB ns/op | 64 MiB ns/op |
|---|
| linear scan | int32 sequential | 0.600 | 0.527 | 0.534 | 0.827 | 0.867 |
| stride scan | stride 16 | 2.201 | 4.948 | 6.114 | 9.258 | 10.034 |
| random access | uniform random | 3.911 | 5.000 | 8.298 | 18.924 | 45.038 |
| pointer chase | arena random | 9.365 | 29.733 | 53.748 | 109.177 | 170.508 |
| dependent chase | random cycle | 6.396 | 19.608 | 45.818 | 116.110 | 160.778 |
| prefetch chase | prefetch next | 9.847 | 24.481 | 50.856 | 103.324 | 187.059 |
| MLP chase | 8 lanes | 1.261 | 3.788 | 6.932 | 17.450 | 42.083 |
| TLB page walk | random page | 19.375 | 7.828 | 7.070 | 28.464 | 38.044 |

5.11 Cold/Warm Scan, Write Traffic, And Huge Pages
Code Shape
touchEvictionBuffer();
runScan(); // cold-style pass
runScan(); // warm repeated pass
for (std::size_t i = 0; i < values.size(); ++i) {
values[i] = values[i] + 1;
}
Cache-Line Access Visual
cold-style:
CPU -> [eviction buffer] -> [CL0: * * * *] -> [CL1: * * * *]
read-modify-write:
CPU -> [CL0: load * * * *] -> [CL0: writable !] -> store
transparent huge-page probe:
4 KiB pages -> many translations
2 MiB pages -> fewer translations, if available
Interpretation
| Field | Notes |
|---|
| Expectation | Warm repeated scans may be faster. Writes should cost more than reads because cache lines must become writable. Huge pages may help TLB-sensitive access if available. |
| Measured result | Cold versus warm, 32 MiB: detailed rows below. |
| What happened | The warm scan was only slightly faster. Writes were slower than reads. Transparent huge pages were slightly faster in this row, while explicit huge pages were unavailable. |
| Takeaway | Read/write ownership and page size are real variables, but this article has timing evidence only, not direct RFO or per-row TLB counter proof. |
| Evidence status | Partially supported. Explicit huge pages and direct RFO counters remain blocked or unavailable. |
Detailed rows
Cold versus warm, 32 MiB:
| Items | Bytes | Cold ns/elt | Warm ns/elt | Cold/Warm | Visual |
|---|
| 8,388,608 | 32 MiB | 0.886 | 0.865 | 1.02x | ## |
Write and RFO-style timing proxy:
| Variant | ns/op | Visual |
|---|
| read_only | 1.273 | ############# |
| write_stream | 1.918 | #################### |
| read_modify_write | 1.884 | ################### |
Huge page probe, 64 MiB allocation:
| Mode | Status | Page size | ns/value | Note |
|---|
| default pages | measured | 4096 | 1.289 | normal anonymous mapping |
| explicit huge pages | unavailable | 2097152 | n/a | huge-page allocation unavailable |
| transparent huge pages | measured | 2097152 | 1.253 | advised huge-page path |
6. Linked-List Code Shapes
Family question:
Can list-like search be made less expensive by changing how much useful work each pointer jump buys?
6.1 Heap List
Code Shape
bool searchList(PtrNode* head, int key) {
for (PtrNode* node = head; node != nullptr; node = node->next) {
if (node->value == key) {
return true;
}
}
return false;
}
Cache-Line Access Visual
CPU -> [? heap node A: * next] -> [? heap node Q: * next]
-> [? heap node D: * next] -> null
reference point:
one useful value per dependent pointer jump
Interpretation
| Field | Notes |
|---|
| Expectation | Heap allocation should be the poor-locality baseline when traversal order is scattered. |
| Measured result | Pointer-chase row: detailed rows below. |
| What happened | Heap randomized traversal was the expensive baseline. |
| Takeaway | Heap nodes plus random traversal are a worst-case shape for cache-line reuse. |
| Evidence status | Supported by timing. |
Detailed rows
Pointer-chase row:
| Variant | ns/node | Relative |
|---|
| heap sequential | 9.294 | 1.6x |
| heap randomized | 150.632 | 25.8x |
Linked-list compare-all row:
| Variant | ns/query | Relative |
|---|
| heap randomized | 3,646,397 | 28.1x |
6.2 Arena List
Code Shape
std::vector<PtrNode> nodes(n);
for (std::size_t i = 0; i + 1U < nodes.size(); ++i) {
nodes[i].value = static_cast<std::uint64_t>(i);
nodes[i].next = &nodes[i + 1U];
}
nodes.back().next = nullptr;
Cache-Line Access Visual
arena memory:
[CL0: node0 node1 node2 node3] [CL1: node4 node5 node6 node7]
CPU -> node0 -> node1 -> node2 -> node3 -> node4
reference point:
allocation and traversal are both compact
Interpretation
| Field | Notes |
|---|
| Expectation | Arena allocation should improve locality but cannot remove the pointer dependency. |
| Measured result | Detailed rows below; benchmark numbers unchanged. |
| What happened | Arena sequential was much faster than heap randomized, but slower than chunked list search in the compare-all row. |
| Takeaway | Arena allocation helps, but one pointer jump per value is still expensive for late-hit and miss-heavy search. |
| Evidence status | Supported by timing. |
Detailed rows
| Variant | ns/node | Relative |
|---|
| arena sequential | 5.845 | 1.0x |
Compare-all:
| Variant | ns/query | Relative |
|---|
| arena sequential | 333,708 | 2.6x |
6.3 Random Arena List
Code Shape
std::vector<std::size_t> order = makeRandomPermutation(n, seed);
for (std::size_t i = 0; i + 1U < order.size(); ++i) {
nodes[order[i]].next = &nodes[order[i + 1U]];
}
nodes[order.back()].next = nullptr;
Cache-Line Access Visual
arena memory:
[CL0: node0 node1 node2 node3] [CL1: node4 node5 node6 node7]
CPU -> node3 -> node0 -> node7 -> node2 -> node5
reference point:
storage is compact, but traversal is not
Interpretation
| Field | Notes |
|---|
| Expectation | Random traversal should be slow even when allocation is compact. |
| Measured result | Detailed rows below; benchmark numbers unchanged. |
| What happened | Random arena traversal was much slower than arena sequential traversal. |
| Takeaway | The CPU follows access order, not allocation intent. |
| Evidence status | Supported by timing. |
Detailed rows
| Variant | ns/node | Relative |
|---|
| arena randomized | 112.305 | 19.2x |
6.4 Prefetch List
Code Shape
for (PtrNode* node = head; node != nullptr; node = node->next) {
if (node->next != nullptr) {
prefetchRead(node->next);
}
sum += node->value;
}
Cache-Line Access Visual
CPU reads [CL?: node A | next -> node B]
P -> [CL?: node B]
CPU immediately advances to node B
reference point:
prefetch target is discovered late
Interpretation
| Field | Notes |
|---|
| Expectation | Naive pointer-list prefetch should be weak because there is little work between prefetch and use. |
| Measured result | Detailed rows below; benchmark numbers unchanged. |
| What happened | Naive prefetch did not beat arena sequential in these rows. |
| Takeaway | Linked-list prefetch needs enough independent work between address discovery and use. |
| Evidence status | Supported for this implementation and query mix. |
Detailed rows
| Variant | ns/query | Relative |
|---|
| prefetch next on arena list | 484,095 | 3.7x |
| arena sequential | 333,708 | 2.6x |
Completion compare-all:
| Variant | ns/query | Visual |
|---|
| prefetch next | 963,053 | ## |
| arena sequential | 667,221 | # |
6.5 Chunked List
Code Shape
template <int ChunkSize>
bool searchChunked(const ChunkNode<ChunkSize>* head, int key) {
for (auto* node = head; node != nullptr; node = node->next) {
for (int i = 0; i < node->count; ++i) {
if (node->values[i] == key) {
return true;
}
}
}
return false;
}
Cache-Line Access Visual
normal list:
CPU -> [CL?: value next] -> [CL?: value next] -> [CL?: value next]
chunked list:
CPU -> [CL0: * * * * * * * * | chunk metadata]
[CL1: * * * * * * * * | next part of chunk]
-> next chunk
reference point:
one pointer jump buys many value checks
Interpretation
| Field | Notes |
|---|
| Expectation | Chunking should improve late-hit and miss-heavy list search because it reduces pointer jumps per searchable value. |
| Measured result | Compare-all: detailed rows below. |
| What happened | Chunking improved the expensive late-hit and miss rows compared with pointer-per-value traversal. |
| Takeaway | If a list must be scanned, make each pointer jump buy a block of local work. |
| Evidence status | Supported by timing. |
Detailed rows
Compare-all:
| Variant | ns/query | Relative |
|---|
| chunked 64 | 129,629 | 1.0x |
| arena sequential | 333,708 | 2.6x |
| heap randomized | 3,646,397 | 28.1x |
Hit-position distribution, chunked 32:
| Position | Ops | ns/query | Visual |
|---|
| early hit | 1,000,000 | 5.68 | # |
| middle hit | 500 | 315,212 | ########## |
| late hit | 500 | 633,347 | #################### |
| miss | 500 | 630,506 | #################### |

6.6 Chunked Prefetch
Code Shape
template <int ChunkSize>
bool searchChunkedPrefetch(const ChunkNode<ChunkSize>* head, int key) {
for (auto* node = head; node != nullptr; node = node->next) {
if (node->next != nullptr) {
prefetchRead(node->next);
}
for (int i = 0; i < node->count; ++i) {
if (node->values[i] == key) {
return true;
}
}
}
return false;
}
Cache-Line Access Visual
CPU reads [chunk A | next -> chunk B]
P -> [chunk B]
CPU scans [CL A0: * * * *] [CL A1: * * * *]
CPU then uses [chunk B]
reference point:
chunk scan creates prefetch distance
Interpretation
| Field | Notes |
|---|
| Expectation | Chunked prefetch should help more than normal list prefetch because scanning the current chunk creates lead time. |
| Measured result | n = 100,000, queries = 5,000, 5 repetitions: detailed rows below. |
| What happened | Prefetch was not a broad win. The best no-prefetch and prefetch rows were both chunk size 16. |
| Takeaway | Prefetch distance is necessary but not sufficient. Chunk size, query mix, and cache state decide whether it pays. |
| Evidence status | Mixed timing evidence. |
Detailed rows
n = 100,000, queries = 5,000, 5 repetitions:
| Chunk size | no prefetch ns/query | prefetch ns/query |
|---|
4 | 149,268 | 165,376 |
8 | 98,464 | 108,059 |
16 | 94,233 | 98,170 |
32 | 183,450 | 228,932 |
64 | 189,332 | 187,497 |
128 | 101,391 | 105,723 |
256 | 134,423 | 133,957 |
6.7 Indexed List
Code Shape
struct IndexNode {
int value = 0;
int next = -1;
};
bool searchIndexed(const std::vector<IndexNode>& nodes, int key) {
for (int index = 0; index != -1; index = nodes[static_cast<std::size_t>(index)].next) {
if (nodes[static_cast<std::size_t>(index)].value == key) {
return true;
}
}
return false;
}
Cache-Line Access Visual
nodes vector:
[CL0: idx0 value,next | idx1 value,next | idx2 value,next | idx3 value,next]
CPU -> nodes[0] -> nodes[4] -> nodes[1]
reference point:
link is a compact integer, but traversal is still dependent
Interpretation
| Field | Notes |
|---|
| Expectation | Indexed nodes should reduce hot-node footprint, but still behave like a dependent chase. |
| Measured result | n = 200,000, queries = 500, 3 repetitions: detailed rows below. |
| What happened | Indexed nodes cut footprint but were slower than arena pointer nodes in the listed high-level query shape. |
| Takeaway | Compact representation helps memory footprint, but it does not remove address dependency. |
| Evidence status | Supported by timing and byte counts. |
Detailed rows
n = 200,000, queries = 500, 3 repetitions:
| Variant | ns/query | Bytes/searchable key |
|---|
| indexed list | 949,784 | 8.0 |
Hit-position distribution:
| Position | Ops | ns/query | Visual |
|---|
| early hit | 1,000,000 | 2.56 | # |
| middle hit | 500 | 771,964 | ########## |
| late hit | 500 | 1,545,820 | #################### |
| miss | 500 | 1,543,800 | #################### |
6.8 Hot/Cold List
Code Shape
struct HotNode {
int key = 0;
int next = -1;
};
struct ColdNode {
int value = 0;
std::array<char, PayloadSize> payload{};
};
bool searchHot(const std::vector<HotNode>& hot, int key) {
for (int index = 0; index != -1; index = hot[static_cast<std::size_t>(index)].next) {
if (hot[static_cast<std::size_t>(index)].key == key) {
return true;
}
}
return false;
}
Cache-Line Access Visual
fat node search:
CPU -> [CL: key next payload payload] -> [CL: payload payload ...]
hot/cold search:
CPU -> [CL hot: key next | key next | key next | key next]
only on match -> [cold payload]
reference point:
negative search touches metadata, not payload
Interpretation
| Field | Notes |
|---|
| Expectation | Hot/cold split should help as payload grows. |
| Measured result | n = 100,000, queries = 5,000, 5 repetitions: detailed rows below. |
| What happened | Fat-node search cost grew with payload. Hot/cold search stayed nearly flat. |
| Takeaway | Search should touch searchable metadata first and cold payload only after a match. |
| Evidence status | Supported by timing. |
Detailed rows
n = 100,000, queries = 5,000, 5 repetitions:
| Payload | fat node ns/query | hot/cold ns/query | Improvement |
|---|
16 B | 827,299 | 559,820 | 1.5x |
64 B | 1,504,184 | 559,620 | 2.7x |
128 B | 2,615,491 | 559,849 | 4.7x |
256 B | 4,265,747 | 559,886 | 7.6x |
6.9 Multi-Lane List
Code Shape
std::uint64_t found = 0;
for (std::size_t q = 0; q < queries.size(); q += static_cast<std::size_t>(lanes)) {
for (int lane = 0; lane < lanes; ++lane) {
found += searchList(heads[static_cast<std::size_t>(lane)], queries[q + lane]);
}
}
Cache-Line Access Visual
lane 0: CPU -> [? node] -> [? node] -> [? node]
lane 1: CPU -> [? node] -> [? node] -> [? node]
lane 2: CPU -> [? node] -> [? node] -> [? node]
lane 3: CPU -> [? node] -> [? node] -> [? node]
reference point:
one list remains dependent; many lists expose throughput parallelism
Interpretation
| Field | Notes |
|---|
| Expectation | Interleaving independent searches should expose memory-level parallelism. |
| Measured result | n = 200,000, queries = 250, 3 repetitions: detailed rows below. |
| What happened | Per-query time dropped sharply as lanes increased. |
| Takeaway | MLP improves throughput across independent list searches, not the latency of one dependent traversal. |
| Evidence status | Supported by timing. |
Detailed rows
n = 200,000, queries = 250, 3 repetitions:
| Lanes | ns/query | Visual |
|---|
1 | 7,129,608 | #################### |
2 | 3,419,579 | ########## |
4 | 1,604,336 | ##### |
8 | 611,275 | ## |
16 | 250,894 | # |
6.10 Bloom Chunked List
Code Shape
bool searchBloomChunked(const BloomChunkNode<ChunkSize>* head, int key) {
for (auto* node = head; node != nullptr; node = node->next) {
if ((node->bloom & bloomMask(key)) == 0U) {
continue;
}
for (int i = 0; i < node->count; ++i) {
if (node->values[i] == key) {
return true;
}
}
}
return false;
}
Cache-Line Access Visual
negative lookup:
CPU -> [CL chunk metadata: bloom bits]
bit absent -> skip [values] -> next chunk
maybe hit:
CPU -> [CL chunk metadata: bloom bits]
bit present -> [CL values: * * * *]
reference point:
cheap metadata can avoid scanning values
Interpretation
| Field | Notes |
|---|
| Expectation | Bloom filtering should help negative lookups by skipping local chunk scans. |
| Measured result | n = 200,000, queries = 2,000, 3 repetitions: detailed rows below. |
| What happened | The hit-ratio sweep improved as generated hits became more common and early in the benchmark shape. |
| Takeaway | Bloom-per-chunk needs hit-ratio and hit-position separated before making a broad claim. |
| Evidence status | Partially supported. |
Detailed rows
n = 200,000, queries = 2,000, 3 repetitions:
| Hit ratio | ns/query | Visual |
|---|
0% | 314,063 | #################### |
10% | 282,604 | ################## |
50% | 157,390 | ########## |
100% | 1,543 | # |
6.11 Compare-All And Mutation Rows
Code Shape
sameQueries();
run(heapList);
run(arenaList);
run(prefetchList);
run(chunkedList);
// mutation timing is separated from search-to-position cost
insertAfter(cursor);
deleteAfter(cursor);
spliceAfter(source, dest);
Cache-Line Access Visual
compare-all:
same query stream -> heap / arena / prefetch / chunked
mutation:
[cursor] -> [new_node] -> [old_next]
[cursor] -----------> [next_after_delete]
reference point:
lookup-friendly layout may move cost to update paths
Interpretation
| Field | Notes |
|---|
| Expectation | Chunking should win lookup-heavy rows. Vector-backed mutation should avoid heap-allocation overhead. |
| Measured result | Completion compare-all, n = 200,000, queries = 500, 3 repetitions: detailed rows below. |
| What happened | Chunked lookup was strongest in compare-all. Heap allocation was expensive in mutation. Indexed mutation rows were cheapest in this pass. |
| Takeaway | List layout decisions should be evaluated on both lookup and mutation paths. |
| Evidence status | Supported for measured append/insert/delete/splice shapes; arbitrary middle-position search cost is separate. |
Detailed rows
Completion compare-all, n = 200,000, queries = 500, 3 repetitions:
| Variant | ns/query | Visual |
|---|
| heap randomized | 11,208,462 | #################### |
| prefetch next | 963,053 | ## |
| arena sequential | 667,221 | # |
| chunked 64 | 257,240 | # |
Insert/delete/splice mutation, operations = 100,000, 5 repetitions:
| Layout | Operation | ns/op | Bytes/node | Visual |
|---|
| pointer arena | insert after, preallocated | 32.796 | 16 | #### |
| pointer heap | insert after, new | 177.696 | 16 | ################## |
| pointer arena | delete after unlink | 112.199 | 16 | ########### |
| pointer arena | splice after relink | 194.450 | 16 | #################### |
| indexed | insert after, preallocated | 13.728 | 8 | # |
| indexed | delete after free list | 27.253 | 8 | ### |
| indexed | splice after relink | 39.530 | 8 | #### |
7. Binary-Search Code Shapes
Family question:
Can fewer global jumps beat fewer comparisons?
7.1 Tiny Linear Search
Code Shape
bool linearContains(const int* values, std::size_t n, int key) {
for (std::size_t i = 0; i < n; ++i) {
if (values[i] == key) {
return true;
}
}
return false;
}
Cache-Line Access Visual
small array:
CPU -> [CL0: * * * * * * * * * * * * * * * *]
reference point:
for tiny arrays, the full scan can stay in one or a few cache lines
Interpretation
| Field | Notes |
|---|
| Expectation | Linear search should win early-hit rows for tiny arrays; binary search should win misses as n grows. |
| Measured result | queries = 1,000,000, 5 repetitions: detailed rows below. |
| What happened | Linear search won early hits. Binary search won misses once n grew. |
| Takeaway | Small local search policy depends on hit position and miss rate. |
| Evidence status | Supported by timing. |
Detailed rows
queries = 1,000,000, 5 repetitions:
| n | linear early hit | binary early hit | linear miss | binary miss |
|---|
4 | 3.882 | 15.478 | 17.717 | 17.401 |
8 | 3.852 | 25.126 | 27.938 | 18.645 |
16 | 3.870 | 28.822 | 48.121 | 20.227 |
32 | 3.857 | 33.978 | 88.823 | 23.728 |
64 | 3.904 | 35.470 | 172.843 | 28.091 |
7.2 Normal Binary Search
Code Shape
bool binaryContains(const std::vector<int>& values, int key) {
std::size_t left = 0;
std::size_t right = values.size();
while (left < right) {
const std::size_t mid = left + (right - left) / 2U;
if (values[mid] == key) {
return true;
}
if (values[mid] < key) {
left = mid + 1U;
} else {
right = mid;
}
}
return false;
}
Cache-Line Access Visual
CPU -> [CL mid: * . . .]
-> [CL quarter: * . . .]
-> [CL eighth: * . . .]
reference point:
few comparisons, but jumpy memory access
Interpretation
| Field | Notes |
|---|
| Expectation | Normal binary search should be a strong comparison-count baseline but can lose to layouts with better memory locality. |
| Measured result | Large-array compare-all, n = 1,000,000, queries = 1,000,000, 5 repetitions: detailed rows below. |
| What happened | Normal binary was slower than Eytzinger and blocked layouts in the large-array rows shown here. |
| Takeaway | Minimizing comparisons does not necessarily minimize memory stalls. |
| Evidence status | Supported by timing. |
Detailed rows
Large-array compare-all, n = 1,000,000, queries = 1,000,000, 5 repetitions:
| Variant | ns/query | Visual |
|---|
| normal binary | 591.551 | #################### |
| Eytzinger | 326.475 | ########### |
Completion compare-all:
| Variant | ns/query | Visual |
|---|
| binary | 738.569 | ############## |
| Eytzinger | 328.340 | ###### |
7.3 Branchless Binary
Code Shape
bool branchlessContains(const std::vector<int>& values, int key) {
std::size_t base = 0;
std::size_t len = values.size();
while (len > 1U) {
const std::size_t half = len / 2U;
const std::size_t mid = base + half;
const bool move = values[mid] <= key;
base = move ? mid : base;
len -= half;
}
return !values.empty() && values[base] == key;
}
Cache-Line Access Visual
CPU -> [CL mid: *]
compute move flag
update base without branch-shaped control flow
-> [CL next mid: *]
reference point:
branch behavior changes; memory jumps remain
Interpretation
| Field | Notes |
|---|
| Expectation | Branchless search should help when branch misses matter, but cannot remove jumpy memory access. |
| Measured result | Detailed rows below; benchmark numbers unchanged. |
| What happened | Branchless timing beat normal binary in the displayed compare-all row. The counter row is program-level, not per variant. |
| Takeaway | Branch behavior is material, but per-variant counter isolation is still needed. |
| Evidence status | Partially supported by timing and representative counters. |
Detailed rows
| Variant | ns/query | Visual |
|---|
| branchless | 338.353 | ########### |
| normal binary | 591.551 | #################### |
Representative program-level branch counter row:
| Family | Status | Branches | Branch misses | Miss rate |
|---|
| binary search | measured with Linux perf stat | 381,401,896 | 44,259,946 | 11.60% |
7.4 Blocked Search
Code Shape
bool blockedContains(const BlockedArray& array, int key) {
auto it = std::lower_bound(array.blockMax.begin(), array.blockMax.end(), key);
if (it == array.blockMax.end()) {
return false;
}
const int block = static_cast<int>(it - array.blockMax.begin());
const int first = block * array.blockSize;
const int last = std::min(first + array.blockSize, static_cast<int>(array.values.size()));
for (int i = first; i < last; ++i) {
if (array.values[static_cast<std::size_t>(i)] == key) {
return true;
}
}
return false;
}
Cache-Line Access Visual
metadata:
CPU -> [CL blockMax: * . . .] -> choose block
local scan:
CPU -> [CL block: * * * * * * * *]
[CL block: * * * * * * * *]
reference point:
one global decision leads to local sequential work
Interpretation
| Field | Notes |
|---|
| Expectation | Blocked search should have a sweet spot: small blocks cost metadata search, large blocks cost local scanning. |
| Measured result | Block-size sweep: detailed rows below. |
| What happened | Block size 64 was best in this pass. |
| Takeaway | The correct block size is empirical. It balances global jumps against local scan work. |
| Evidence status | Supported by timing. |
Detailed rows
Block-size sweep:
| Block size | ns/query | Visual |
|---|
8 | 568.714 | ############# |
16 | 496.838 | ############ |
32 | 480.809 | ########### |
64 | 472.178 | ########### |
128 | 529.335 | ############ |
256 | 615.460 | ############## |
512 | 846.829 | #################### |
7.5 Cache-Line Block Search
Code Shape
for (int blockSize : {16, 32, 64}) {
auto index = buildBlockedArray(values, blockSize);
runQueries(index);
}
Cache-Line Access Visual
int32 values:
block 16 -> [CL0: * * * * * * * * * * * * * * * *]
block 32 -> [CL0: * * * * ...] [CL1: * * * * ...]
block 64 -> [CL0] [CL1] [CL2] [CL3]
reference point:
test one, two, and four nearby cache lines
Interpretation
| Field | Notes |
|---|
| Expectation | One-cache-line blocks may be good, but larger nearby blocks can win if they reduce more global jumps. |
| Measured result | Detailed rows below; benchmark numbers unchanged. |
| What happened | The 64 value block was best in this pass. |
| Takeaway | ”One cache line” is not automatically the optimal block size. |
| Evidence status | Supported by timing. |
Detailed rows
| Block size | ns/query | Visual |
|---|
16 | 499.228 | #################### |
32 | 481.549 | ################### |
64 | 471.386 | ################## |
7.6 Eytzinger Search
Code Shape
bool eytzingerContains(const std::vector<int>& eytzinger, int key) {
std::size_t index = 1;
while (index < eytzinger.size()) {
if (eytzinger[index] == key) {
return true;
}
index = key < eytzinger[index] ? index * 2U : index * 2U + 1U;
}
return false;
}
Cache-Line Access Visual
array-backed tree:
index: 1
/ \
2 3
/ \ / \
4 5 6 7
CPU -> [CL: eyt[1] eyt[2] eyt[3] ...] -> child index by arithmetic
reference point:
tree shape without pointer nodes
Interpretation
| Field | Notes |
|---|
| Expectation | Eytzinger should improve large read-only search by changing physical layout while keeping compact array storage. |
| Measured result | Large-array compare-all: detailed rows below. |
| What happened | Eytzinger was the fastest large-array row in the displayed passes. |
| Takeaway | Physical layout can matter as much as search logic. |
| Evidence status | Supported by timing. |
Detailed rows
Large-array compare-all:
| Variant | ns/query | Visual |
|---|
| Eytzinger | 326.475 | ########### |
| normal binary | 591.551 | #################### |
Completion compare-all:
| Variant | ns/query | Visual |
|---|
| Eytzinger | 328.340 | ###### |
| binary | 738.569 | ############## |
7.7 Eytzinger Prefetch
Code Shape
if (prefetchDistance >= 1) {
const std::size_t child = index * 2U;
if (child < eytzinger.size()) {
prefetchRead(&eytzinger[child]);
}
}
Cache-Line Access Visual
CPU at eyt[i]
P child -> eyt[2*i]
P grandchild -> eyt[4*i]
CPU compares eyt[i]
CPU chooses next child
reference point:
future addresses are arithmetic, but not all prefetched paths are used
Interpretation
| Field | Notes |
|---|
| Expectation | Prefetch may help when distance is right and hurt when it is too near, too far, or polluting. |
| Measured result | Detailed rows below; benchmark numbers unchanged. |
| What happened | Child prefetch was worse. Grandchild prefetch was slightly faster than the no-prefetch row in this pass. |
| Takeaway | Prefetch distance is a parameter, not a feature toggle. |
| Evidence status | Partially supported by timing. |
Detailed rows
| Variant | ns/query | Visual |
|---|
| Eytzinger | 314.106 | ############# |
| prefetch child | 494.878 | #################### |
| prefetch grandchildren | 312.936 | ############# |
7.8 Implicit B-Tree
Code Shape
int childSlotLinear(const BTreeNodeShape& node, int key) {
int slot = 0;
while (slot < node.count && key > node.keys[static_cast<std::size_t>(slot)]) {
++slot;
}
return slot;
}
Cache-Line Access Visual
binary tree:
CPU -> [node] -> [node] -> [node] -> [node]
implicit B-tree:
CPU -> [CL node: k0 k1 k2 k3 k4 k5 k6 k7 | child offsets]
-> child node
reference point:
more local comparisons, fewer levels
Interpretation
| Field | Notes |
|---|
| Expectation | Higher fanout should reduce depth, but too much inside-node search can cost more. |
| Measured result | Detailed rows below; benchmark numbers unchanged. |
| What happened | Linear inside-node scan won across tested fanouts. 64 keys was the fastest linear inside-node row. |
| Takeaway | Local sequential work can beat branchier inside-node search. |
| Evidence status | Supported by timing. |
Detailed rows
| Node keys | Linear inside-node ns/query | Binary inside-node ns/query | Faster |
|---|
4 | 662.335 | 735.386 | linear |
8 | 605.165 | 736.048 | linear |
16 | 511.993 | 734.992 | linear |
32 | 508.562 | 743.072 | linear |
64 | 498.007 | 739.088 | linear |

7.9 Power-Of-Two Padding
Code Shape
for (std::size_t n : {pow2, pow2 + 123U, pow2 + 257U, pow2 + 4099U}) {
auto values = makeSortedEvenKeys(n);
runSearchVariants(values, queries);
}
Cache-Line Access Visual
n = 2^20:
CPU -> regular midpoint offsets -> possible repeated set mapping !
n = 2^20 + 257:
CPU -> shifted midpoint offsets -> different set mapping
reference point:
same algorithm, changed address placement
Interpretation
| Field | Notes |
|---|
| Expectation | Power-of-two sizes can expose address-mapping sensitivity. |
| Measured result | Normal binary: detailed rows below. |
| What happened | The padded 2^20 + 257 row improved normal binary substantially. Eytzinger was less sensitive in this table. |
| Takeaway | Array size can change search speed even when the algorithm is identical. |
| Evidence status | Supported by timing; counter confirmation is missing. |
Detailed rows
Normal binary:
| n | ns/query | Visual |
|---|
| 1,048,576 | 750.393 | ############### |
| 1,048,699 | 649.605 | ############# |
| 1,048,833 | 628.107 | ############# |
| 1,052,675 | 609.551 | ############ |
| 2,097,152 | 1,000.268 | #################### |
| 2,097,275 | 846.635 | ################# |
Eytzinger:
| n | ns/query | Visual |
|---|
| 1,048,576 | 336.239 | ############### |
| 1,048,699 | 352.473 | ############### |
| 1,048,833 | 347.754 | ############### |
| 1,052,675 | 335.315 | ############## |
| 2,097,152 | 456.035 | #################### |
| 2,097,275 | 441.157 | ################### |
7.10 Query Ordering
Code Shape
std::sort(queries.begin(), queries.end());
runSearchVariants(values, queries);
Cache-Line Access Visual
random queries:
q900 -> q17 -> q700 -> q3
CPU -> [CL far] -> [CL far] -> [CL far]
clustered queries:
q100 -> q101 -> q104 -> q110
CPU -> [CL region A] -> [CL region A] -> [CL nearby]
reference point:
same data structure, different temporal locality
Interpretation
| Field | Notes |
|---|
| Expectation | Sorted or clustered queries should reuse nearby array/tree regions. |
| Measured result | n = 1,000,000, queries = 1,000,000, 5 repetitions: detailed rows below. |
| What happened | Clustered queries made normal binary much faster than random queries. |
| Takeaway | Work ordering can be a memory optimization without changing the data structure. |
| Evidence status | Supported by timing. |
Detailed rows
n = 1,000,000, queries = 1,000,000, 5 repetitions:
| Query order | binary ns/query | Eytzinger ns/query | blocked ns/query |
|---|
| random | 595.078 | 328.620 | 475.115 |
| sorted | 184.750 | 207.668 | 179.653 |
| clustered | 80.564 | 171.971 | 133.153 |
| bucketed high bits | 185.036 | 207.205 | 179.505 |
7.11 Duplicate-Key And Lower-Bound Behavior
Code Shape
auto lower = std::lower_bound(values.begin(), values.end(), key);
auto upper = std::upper_bound(values.begin(), values.end(), key);
return std::make_pair(lower, upper);
Cache-Line Access Visual
membership:
CPU -> find any equal key
lower_bound:
CPU -> first key >= query
equal_range:
CPU -> lower bound path + upper bound path
reference point:
same sorted array, different query contract
Interpretation
| Field | Notes |
|---|
| Expectation | Membership timing should not be assumed to represent lower-bound or range queries. |
| Measured result | n = 1,048,576, queries = 1,000,000, 3 repetitions: detailed rows below. |
| What happened | Duplicate-heavy rows were faster in this benchmark shape, but range queries still have their own cost. |
| Takeaway | Define the exact search contract before comparing layouts. |
| Evidence status | Supported by timing. |
Detailed rows
n = 1,048,576, queries = 1,000,000, 3 repetitions:
| Duplicate run | binary search membership | lower-bound membership | lower-bound position | equal-range count |
|---|
1 | 729.078 | 713.280 | 726.278 | 743.355 |
4 | 715.141 | 706.743 | 707.536 | 737.183 |
16 | 700.384 | 689.846 | 686.602 | 724.461 |
64 | 644.760 | 632.632 | 636.485 | 678.741 |
8. Hash-Map Code Shapes
Family question:
How much does O(1) lookup depend on collision layout and metadata placement?
8.1 Chained Heap
Code Shape
struct ChainNode {
int key;
int value;
ChainNode* next;
};
bool find(int key, int& out) const {
ChainNode* node = buckets[hash(key) % bucketCount];
while (node != nullptr) {
if (node->key == key) {
out = node->value;
return true;
}
node = node->next;
}
return false;
}
Cache-Line Access Visual
CPU -> [bucket pointer]
-> [? heap node: key value next]
-> [? heap node: key value next]
reference point:
random bucket plus pointer chain
Interpretation
| Field | Notes |
|---|
| Expectation | Heap chaining should be the pointer-chasing baseline. |
| Measured result | implemented but not shown as a separate public table in this article. |
| What happened | The displayed compare-all row uses chained arena rather than chained heap. |
| Takeaway | Heap chaining needs a direct row before it can support a public timing claim here. |
| Evidence status | Implemented but not shown. |
8.2 Chained Arena
Code Shape
std::vector<ChainNode> arena;
std::vector<int> bucketHead;
int index = bucketHead[bucket];
while (index != -1) {
const ChainNode& node = arena[static_cast<std::size_t>(index)];
if (node.key == key) {
return true;
}
index = node.nextIndex;
}
Cache-Line Access Visual
CPU -> [bucket index]
-> [CL arena: node0 node1 node2 node3]
-> maybe next arena index
reference point:
collision chain remains, but nodes are compact
Interpretation
| Field | Notes |
|---|
| Expectation | Arena chaining should beat scattered chaining and may beat open addressing if chains stay short and compact. |
| Measured result | n = 524,288, queries = 500,000, 3 repetitions: detailed rows below. |
| What happened | Chained arena was fastest in this implementation and workload. |
| Takeaway | Open addressing does not automatically win. Implementation shape and compactness matter. |
| Evidence status | Supported by timing. |
Detailed rows
n = 524,288, queries = 500,000, 3 repetitions:
| Variant | ns/query | Visual |
|---|
| chained arena | 69.551 | ########### |
| Robin Hood | 95.566 | ############## |
std::unordered_map | 100.008 | ############### |
| tag split 8 | 108.137 | ################ |
| linear probing | 118.353 | ################## |
| bucketized 8 | 120.731 | ################## |
8.3 Linear Probing
Code Shape
std::size_t i = hash(key) & mask;
while (table[i].occupied) {
if (table[i].key == key) {
return true;
}
i = (i + 1U) & mask;
}
return false;
Cache-Line Access Visual
CPU -> [? bucket start]
-> [CL: slot0 slot1 slot2 slot3]
-> [CL next: slot4 slot5 slot6 slot7]
reference point:
random start, then local probe scan
Interpretation
| Field | Notes |
|---|
| Expectation | Linear probing should be good at moderate load factor and degrade sharply at high load factor. |
| Measured result | Load-factor sweep: detailed rows below. |
| What happened | Probe tails grew much faster than medians at high load. At 0.95 load, p99 probes exceeded 1200. |
| Takeaway | Hash tables need load-factor and tail-probe measurements, not only average lookup time. |
| Evidence status | Supported by timing and representative counters. |
Detailed rows
Load-factor sweep:
| Target load | Inserted keys | ns/query | p50 probes | p95 probes | p99 probes |
|---|
0.30 | 157,286 | 138.442 | 1 | 3 | 5 |
0.50 | 262,144 | 175.606 | 1 | 6 | 11 |
0.70 | 367,001 | 227.824 | 2 | 16 | 32 |
0.85 | 445,644 | 297.676 | 3 | 63 | 144 |
0.95 | 498,073 | 676.317 | 7 | 531 | 1207 |
Representative perf stat row:
| Event | Count |
|---|
| cycles | 725,366,509 |
| instructions | 1,174,526,099 |
| cache references | 10,805,729 |
| cache misses | 7,064,581 |
| LLC load misses | 6,815,291 |
| dTLB load misses | 2,542,337 |
| branches | 284,527,336 |
| branch misses | 9,065,424 |

8.4 Robin Hood Hashing
Code Shape
while (slotOccupied(i)) {
if (table[i].key == key) {
return true;
}
if (probeDistance(i) < currentProbeDistance) {
return false;
}
i = (i + 1U) & mask;
}
Cache-Line Access Visual
CPU -> [? bucket]
-> [CL: slot probe=0 | slot probe=1 | slot probe=2]
reference point:
local probes remain, but distance metadata controls tails
Interpretation
| Field | Notes |
|---|
| Expectation | Robin Hood should reduce probe-tail variance compared with basic linear probing. |
| Measured result | Robin Hood row: detailed rows below. |
| What happened | Robin Hood beat linear probing in the compare-all row and had low p99 probes in the tail row. |
| Takeaway | Probe-tail distribution is as important as mean lookup time. |
| Evidence status | Supported directionally. |
Detailed rows
Robin Hood row:
| Variant | ns/query | p50 probes | p95 probes | p99 probes |
|---|
| Robin Hood | 134.260 | 1 | 3 | 5 |
Compare-all row:
| Variant | ns/query | Visual |
|---|
| Robin Hood | 95.566 | ############## |
| linear probing | 118.353 | ################## |
8.5 Bucketized Hash
Code Shape
Bucket& bucket = buckets[bucketIndex(key)];
for (int slot = 0; slot < BucketSize; ++slot) {
if (bucket.occupied[slot] && bucket.keys[slot] == key) {
return true;
}
}
Cache-Line Access Visual
bucket size 4:
CPU -> [CL bucket: slot0 slot1 slot2 slot3]
bucket size 16:
CPU -> [CL bucket part A] [CL bucket part B] [CL bucket part C] ...
reference point:
one bucket access tests several candidates
Interpretation
| Field | Notes |
|---|
| Expectation | Bucket size should have a sweet spot: too small means more overflow, too large means too much local scan. |
| Measured result | Detailed rows below; benchmark numbers unchanged. |
| What happened | Bucket size 4 was best in this pass. |
| Takeaway | Bucket size is the hash-map version of block size: one memory access should buy enough comparisons, but not too many. |
| Evidence status | Supported by timing. |
Detailed rows
| Bucket size | ns/query | Visual |
|---|
4 | 91.262 | ########### |
8 | 120.048 | ############### |
16 | 164.193 | #################### |
8.6 Dense ID Array
Code Shape
int get(std::uint32_t denseId) {
return values[denseId];
}
Cache-Line Access Visual
CPU -> [CL values: v0 v1 v2 v3 v4 v5 v6 v7]
direct index
reference point:
hashing and collision handling are removed
Interpretation
| Field | Notes |
|---|
| Expectation | Dense IDs should be the upper-bound baseline when the key domain allows direct indexing. |
| Measured result | Detailed rows below; benchmark numbers unchanged. |
| What happened | Dense ID lookup was much cheaper than hash-table rows. |
| Takeaway | When possible, remove hashing from the hot path rather than optimizing it. |
| Evidence status | Supported by timing and byte accounting. |
Detailed rows
| Variant | ns/query |
|---|
| dense ID array | 11.384 |
Bytes-per-lookup row:
| Variant | ns/query | Avg probes | p95 probes | Estimated bytes/lookup |
|---|
| dense ID array | 13.156 | 1.000 | 1 | 4.000 |
8.7 Tag Split Hash
Code Shape
auto tag = fingerprint(hash);
auto& meta = metadata[bucket];
for (int i = 0; i < 16; ++i) {
if (meta.tags[i] == tag) {
const Entry& entry = entries[meta.entryIndex[i]];
if (entry.key == key) {
return true;
}
}
}
return false;
Cache-Line Access Visual
CPU -> [CL tag metadata: t0 t1 t2 ... t15]
tag mismatch -> stop
tag match -> [CL full entry: key value]
reference point:
metadata can reject many misses before loading payload
Interpretation
| Field | Notes |
|---|
| Expectation | Tag split should help misses and large-payload lookups, but tag width can trade false positives against metadata cost. |
| Measured result | Tag-size sweep: detailed rows below. |
| What happened | Tag width differences were small. Tag split did not win broadly at 0% hits, but helped in 50% and 100% rows in this implementation. |
| Takeaway | Fingerprint metadata needs hit/miss-ratio measurements. It is not automatically faster. |
| Evidence status | Partially supported. |
Detailed rows
Tag-size sweep:
| Tag bits | ns/query | p50 probes | p95 probes | p99 probes | Visual |
|---|
7 | 144.311 | 1 | 3 | 4 | ################### |
8 | 146.515 | 1 | 3 | 4 | ################### |
12 | 147.959 | 1 | 3 | 4 | #################### |
16 | 145.981 | 1 | 3 | 4 | ################### |
Hit/miss ratio sweep:
| Variant | Hit ratio | ns/query | p95 probes | p99 probes |
|---|
| linear | 0% | 161.529 | 8 | 13 |
| tag8 | 0% | 175.518 | 8 | 13 |
| tag16 | 0% | 170.531 | 8 | 13 |
| linear | 50% | 174.359 | 6 | 11 |
| tag8 | 50% | 160.616 | 6 | 11 |
| tag16 | 50% | 162.057 | 6 | 11 |
| linear | 100% | 117.693 | 4 | 7 |
| tag8 | 100% | 109.967 | 4 | 7 |
| tag16 | 100% | 106.242 | 4 | 7 |
8.8 Hot/Cold Hash
Code Shape
struct EntryHot {
std::uint64_t key;
std::uint32_t valueIndex;
State state;
};
if (hot[i].key == key) {
return values[hot[i].valueIndex];
}
Cache-Line Access Visual
normal entry:
CPU -> [CL: key state value payload payload]
split entry:
CPU -> [CL hot: key state valueIndex]
only on match -> [cold payload]
reference point:
payload stays cold during failed lookup
Interpretation
| Field | Notes |
|---|
| Expectation | Hot/cold split should help once value payload is large enough. |
| Measured result | Detailed rows below; benchmark numbers unchanged. |
| What happened | Split layout lost at 16 B and won for larger payloads. |
| Takeaway | Hot/cold split is conditional. It helps when payload is large and not always consumed. |
| Evidence status | Supported by timing. |
Detailed rows
| Payload | Normal entry ns/query | Split hot/cold ns/query |
|---|
16 B | 177.117 | 191.221 |
64 B | 211.278 | 181.337 |
128 B | 222.311 | 178.956 |
256 B | 236.144 | 184.487 |
8.9 Batched Prefetch Lookup
Code Shape
for (std::size_t i = 0; i < keys.size(); i += Batch) {
for (std::size_t j = i; j < i + Batch; ++j) {
bucket[j - i] = table.bucketIndex(keys[j]);
prefetchRead(&table.entries[bucket[j - i]]);
}
for (std::size_t j = i; j < i + Batch; ++j) {
found += table.findFromBucket(keys[j], bucket[j - i]);
}
}
Cache-Line Access Visual
phase 1:
CPU computes bucket0 bucket1 bucket2 bucket3
P -> [CL bucket0] [CL bucket1] [CL bucket2] [CL bucket3]
phase 2:
CPU uses bucket0 bucket1 bucket2 bucket3
reference point:
batching creates independent future addresses
Interpretation
| Field | Notes |
|---|
| Expectation | Batching should expose memory-level parallelism. Too-large batches may add overhead or pollution. |
| Measured result | Single-thread: detailed rows below. |
| What happened | Batch 8 was best in the single-thread row. |
| Takeaway | Batching can improve lookup throughput, but batch size is workload-specific. |
| Evidence status | Supported by timing. |
Detailed rows
Single-thread:
| Batch size | ns/query | Visual |
|---|
1 | 137.675 | #################### |
2 | 122.769 | ################## |
4 | 108.192 | ################ |
8 | 99.972 | ############### |
16 | 106.113 | ############### |
32 | 110.872 | ################ |
8.10 Set-Conflict Hash
Code Shape
std::size_t indexIdentity(std::uint64_t key) {
return key & mask;
}
std::size_t indexMixed(std::uint64_t key) {
return splitmix64(key) & mask;
}
Cache-Line Access Visual
identity + patterned keys:
CPU -> [CL set A] -> [CL set A] -> [CL set A] !
mixed hash:
CPU -> [CL set A] -> [CL set C] -> [CL set F]
reference point:
hash choice affects table placement and cache-set pressure
Interpretation
| Field | Notes |
|---|
| Expectation | Hash quality and table size should affect lookup time, especially with patterned keys. |
| Measured result | Sequential keys: detailed rows below. |
| What happened | Identity hashing was not worst on this sequential-key workload. |
| Takeaway | Conflict experiments must include realistic bad key patterns, not only sequential keys. |
| Evidence status | Partially supported. |
Detailed rows
Sequential keys:
| Table size | Hash | ns/query | Visual |
|---|
| 1,048,576 | identity | 36.584 | ########### |
| 1,048,576 | splitmix64 | 67.638 | ################### |
| 1,048,576 | multiplicative | 73.249 | #################### |
| 1,048,699 | identity | 36.638 | ########### |
| 1,048,699 | splitmix64 | 65.814 | ################## |
| 1,049,585 | identity | 35.920 | ########## |
| 1,049,585 | splitmix64 | 64.511 | ################## |
8.11 Tombstone Degradation
Code Shape
eraseSomeKeys(); // leaves tombstones
measureLookup();
rehashWithoutDeadSlots();
measureLookup();
Cache-Line Access Visual
before repair:
CPU -> [CL: filled tombstone filled tombstone empty]
after rehash:
CPU -> [CL: filled filled empty . .]
reference point:
tombstones extend probe chains
Interpretation
| Field | Notes |
|---|
| Expectation | Tombstones should degrade lookup by lengthening probe sequences. Rehash repair should reduce tails. |
| Measured result | Original benchmark status: detailed rows below. |
| What happened | Rehash repair reduced both time and probe tails. |
| Takeaway | Open-addressed hash tables must account for table aging, not just fresh construction. |
| Evidence status | Replacement measured; original benchmark blocked by correctness. |
Detailed rows
Original benchmark status:
hashmap_tombstone_degradation: blocked by correctness
Focused replacement row:
| Variant | Tombstones | ns/query | p50 probes | p95 probes | p99 probes |
|---|
| with tombstones | 72,447 | 212.913 | 1 | 10 | 18 |
| rehashed | 0 | 156.024 | 1 | 6 | 11 |
8.12 Bytes Per Lookup
Code Shape
bytes += probes * bytesPerHotSlot;
if (keyMatch) {
bytes += bytesPerPayload;
}
Cache-Line Access Visual
dense ID:
CPU -> [CL values: *] about 4 B useful payload
tag split:
CPU -> [CL tags: * * * *] -> maybe [full entry]
bucketized:
CPU -> [CL bucket with many slots] more bytes per probe
reference point:
estimated bytes and timing may rank layouts differently
Interpretation
| Field | Notes |
|---|
| Expectation | Estimated bytes touched should explain part of lookup time but not all of it. |
| Measured result | 50% hits: detailed rows below. |
| What happened | Dense ID touched the fewest bytes and was fastest. Bucketized had low probe count but high estimated bytes. |
| Takeaway | Probe count alone is incomplete. Bytes touched and cache-line footprint also matter. |
| Evidence status | Supported by software byte accounting and timing. |
Detailed rows
50% hits:
| Variant | ns/query | Avg probes | p95 probes | Estimated bytes/lookup |
|---|
| dense ID array | 13.156 | 1.000 | 1 | 4.000 |
| Robin Hood | 108.014 | 1.625 | 3 | 39.001 |
| tag split 8 | 125.573 | 1.277 | 3 | 26.323 |
| bucketized 8 | 135.593 | 1.001 | 1 | 136.074 |
| linear probing | 143.145 | 1.994 | 6 | 47.846 |
| hot/cold split | 168.590 | 1.994 | 6 | 51.846 |

9. Multithreaded Code Shapes
Family question:
What changes when cache lines become ownership units shared across cores?
9.1 False Sharing
Code Shape
for (std::uint64_t i = 0; i < steps; ++i) {
counters[threadId].value++;
}
Cache-Line Access Visual
bad adjacent counters:
[CL0: T0 counter | T1 counter | T2 counter | T3 counter]
! ownership moves between cores on writes
reference point:
threads write different variables but share one cache line
Interpretation
| Field | Notes |
|---|
| Expectation | Adjacent counters should scale poorly because writes move cache-line ownership between cores. |
| Measured result | steps = 5,000,000, 3 repetitions: detailed rows below. |
| What happened | The adjacent row was worse than padded at 8 threads, but local write-once was much stronger than both. |
| Takeaway | Avoid repeated shared writes inside hot loops. |
| Evidence status | Supported by timing; perf c2c evidence is blocked by tooling. |
Detailed rows
steps = 5,000,000, 3 repetitions:
| Threads | Bad adjacent ns/inc | Padded ns/inc | Local write-once ns/inc |
|---|
1 | 7.642 | 7.669 | 0.047 |
2 | 3.958 | 3.826 | 0.022 |
4 | 2.020 | 1.926 | 0.015 |
8 | 2.368 | 1.737 | 0.020 |

9.2 Padded Counters
Code Shape
struct alignas(64) PaddedCounter {
std::uint64_t value = 0;
};
counters[threadId].value++;
Cache-Line Access Visual
padded counters:
[CL0: T0 counter]
[CL1: T1 counter]
[CL2: T2 counter]
[CL3: T3 counter]
reference point:
each writer owns a separate line
Interpretation
| Field | Notes |
|---|
| Expectation | Padding should reduce false sharing when threads repeatedly publish per-thread counters. |
| Measured result | 8-thread row: detailed rows below. |
| What happened | Padding improved the 8-thread counter row. |
| Takeaway | Padding is useful when repeated writes to result slots are unavoidable. |
| Evidence status | Supported by timing. |
Detailed rows
8-thread row:
| Variant | ns/inc |
|---|
| bad adjacent counters | 2.368 |
| padded counters | 1.737 |
9.3 Local Write-Once Counters
Code Shape
std::uint64_t local = 0;
for (auto query : myQueries) {
local += search(query);
}
results[threadId].value = local;
Cache-Line Access Visual
hot loop:
T0 -> register/local stack only
T1 -> register/local stack only
end:
T0 -> [CL result0]
T1 -> [CL result1]
reference point:
shared write happens once
Interpretation
| Field | Notes |
|---|
| Expectation | Local accumulation should beat both adjacent and padded counters when the computation can reduce at the end. |
| Measured result | 8-thread row: detailed rows below. |
| What happened | Local write-once was much cheaper than both shared-counter layouts. |
| Takeaway | Make reads shared and writes private. |
| Evidence status | Supported by timing. |
Detailed rows
8-thread row:
| Variant | ns/inc |
|---|
| bad adjacent counters | 2.368 |
| padded counters | 1.737 |
| local write-once | 0.020 |
9.4 Read-Only Binary Search Scaling
Code Shape
std::size_t begin = threadId * queries.size() / threadCount;
std::size_t end = (threadId + 1U) * queries.size() / threadCount;
for (std::size_t i = begin; i < end; ++i) {
local += contains(sharedIndex, queries[i]);
}
Cache-Line Access Visual
shared data:
all threads -> read-only [array/tree cache lines]
queries:
T0 -> query range 0
T1 -> query range 1
T2 -> query range 2
reference point:
shared reads, private counters
Interpretation
| Field | Notes |
|---|
| Expectation | Read-only search should scale until memory latency, bandwidth, topology, or frequency effects dominate. |
| Measured result | n = 1,000,000, queries = 1,000,000, 3 repetitions: detailed rows below. |
| What happened | Eytzinger stayed the fastest 8-thread row. Some speedups were above ideal, so exact speedup values remain provisional. |
| Takeaway | Use these rows for direction, not as fixed-frequency scaling claims. |
| Evidence status | Supported directionally; fixed-frequency and SMT-off reruns are missing. |
Detailed rows
n = 1,000,000, queries = 1,000,000, 3 repetitions:
| Variant | 1 thread ns/query | 4 threads ns/query | 8 threads ns/query | 8-thread speedup |
|---|
| binary | 751.195 | 203.478 | 124.309 | 6.043x |
| branchless | 570.114 | 217.112 | 112.639 | 5.061x |
| Eytzinger | 324.589 | 82.790 | 66.675 | 4.868x |
| blocked | 493.224 | 124.722 | 84.513 | 5.836x |
9.5 Read-Only Hash Scaling
Code Shape
for (std::size_t i = begin; i < end; ++i) {
local += table.find(queries[i]);
}
results[threadId].value = local;
Cache-Line Access Visual
T0 -> [? bucket] -> local probes
T1 -> [? bucket] -> local probes
T2 -> [? bucket] -> local probes
shared table is read-only
per-thread result is private or padded
Interpretation
| Field | Notes |
|---|
| Expectation | Read-only hash lookup should scale until memory resources saturate. |
| Measured result | n = 500,000, queries = 1,000,000, 3 repetitions: detailed rows below. |
| What happened | Robin Hood was the fastest 8-thread hash-layout row. |
| Takeaway | Threaded lookup should be measured per layout; single-thread ranking may not directly transfer. |
| Evidence status | Supported directionally. |
Detailed rows
n = 500,000, queries = 1,000,000, 3 repetitions:
| Variant | 1 thread ns/query | 4 threads ns/query | 8 threads ns/query | 8-thread speedup |
|---|
| linear probing | 108.105 | 27.522 | 19.392 | 5.575x |
| Robin Hood | 88.702 | 23.636 | 17.099 | 5.187x |
| bucketized 8 | 120.989 | 30.956 | 31.086 | 3.892x |
| tag split 8 | 79.193 | 21.796 | 20.722 | 3.822x |
| hot/cold split | 122.975 | 33.371 | 26.947 | 4.564x |

9.6 Chunked-List Scaling
Code Shape
std::size_t firstChunk = threadId * chunks.size() / threadCount;
std::size_t lastChunk = (threadId + 1U) * chunks.size() / threadCount;
for (std::size_t c = firstChunk; c < lastChunk; ++c) {
scanChunk(chunks[c]);
}
Cache-Line Access Visual
T0 -> chunks [0..K/T)
T1 -> chunks [K/T..2K/T)
T2 -> chunks [2K/T..3K/T)
reference point:
threads scan disjoint ranges instead of all starting at the same head
Interpretation
| Field | Notes |
|---|
| Expectation | Chunk-range partitioning should scale better than all threads scanning from the same list head. |
| Measured result | n = 1,000,000, chunk size 64, 3 repetitions: detailed rows below. |
| What happened | Scaling improved through 4 threads and then got worse at 8 threads. |
| Takeaway | Partitioning avoids duplicate work, but memory throughput and scheduling still limit scaling. |
| Evidence status | Supported directionally. |
Detailed rows
n = 1,000,000, chunk size 64, 3 repetitions:
| Threads | ns/value | Speedup | Efficiency |
|---|
1 | 1.228 | 1.000x | 1.000 |
2 | 0.708 | 1.734x | 0.867 |
4 | 0.501 | 2.453x | 0.613 |
8 | 0.950 | 1.293x | 0.162 |
9.7 Batched Hash Scaling
Code Shape
for (std::size_t i = begin; i < end; i += Batch) {
prefetchBatch(i);
local += lookupBatch(i);
}
Cache-Line Access Visual
inside each thread:
P -> [bucket0] [bucket1] [bucket2] [bucket3]
then lookup bucket0..bucket3
across threads:
T0 batch stream
T1 batch stream
T2 batch stream
reference point:
MLP inside a thread plus parallel query ranges
Interpretation
| Field | Notes |
|---|
| Expectation | Batching should improve lookup throughput, but the best batch size may change with thread count. |
| Measured result | n = 500,000, queries = 1,000,000, 3 repetitions: detailed rows below. |
| What happened | Batch 8 was best single-thread. Batch 4 was best at 8 threads. |
| Takeaway | Batch size is not constant across thread counts. |
| Evidence status | Supported by timing. |
Detailed rows
n = 500,000, queries = 1,000,000, 3 repetitions:
| Batch | 1 thread ns/query | 4 threads ns/query | 8 threads ns/query | 8-thread speedup |
|---|
1 | 137.084 | 48.877 | 37.931 | 3.614x |
2 | 118.550 | 32.283 | 21.828 | 5.431x |
4 | 104.851 | 27.642 | 19.576 | 5.356x |
8 | 97.752 | 37.052 | 27.606 | 3.541x |
16 | 105.205 | 27.144 | 19.794 | 5.315x |
32 | 107.414 | 28.266 | 20.231 | 5.309x |
9.8 Read-Only Stream Bandwidth
Code Shape
for (std::size_t i = begin; i < end; ++i) {
local += values[i];
}
Cache-Line Access Visual
T0 -> [CL range 0: * * * *]
T1 -> [CL range 1: * * * *]
T2 -> [CL range 2: * * * *]
T3 -> [CL range 3: * * * *]
reference point:
disjoint read-only streaming ranges
Interpretation
| Field | Notes |
|---|
| Expectation | Stream bandwidth should rise with threads until memory bandwidth saturates. |
| Measured result | n = 8,388,608 uint64_t values, 5 repetitions: detailed rows below. |
| What happened | The stream row peaked at 4 threads and fell at 8 threads. |
| Takeaway | This gives a concrete bandwidth-saturation boundary for the measured setup. |
| Evidence status | Supported directionally; fixed-frequency rerun missing. |
Detailed rows
n = 8,388,608 uint64_t values, 5 repetitions:
| Threads | ns/value | GB/s | Speedup | Efficiency |
|---|
1 | 1.333 | 6.000 | 1.000x | 1.000 |
2 | 0.735 | 10.886 | 1.814x | 0.907 |
4 | 0.473 | 16.903 | 2.817x | 0.704 |
8 | 0.519 | 15.413 | 2.569x | 0.321 |
