String Structure and Matching Professional Playbook
Overview and Reading Path
String structure problems become efficient when repeated character comparisons are replaced with reusable memory.
Use this family when the problem asks about:
- finding patterns
- prefix/suffix borders
- repeated substrings
- dictionary prefixes
- many patterns at once
- palindromes
- subsequences and edit distance
- suffixes, duplicate substrings, or substring fingerprints
Read in this order:
- Identify the string relationship.
- Decide whether matches are contiguous or gapped.
- Choose the reusable memory structure.
- Define the mismatch or transition rule.
- Verify boundary handling.
Core Vocabulary
| Term | Meaning |
|---|---|
| Pattern | String being searched for or matched |
| Text | String being scanned |
| Prefix | Beginning segment of a string |
| Suffix | Ending segment of a string |
| Border | A substring that is both prefix and suffix |
| Fingerprint | Hash or encoded representation of a substring |
| Trie node | A shared prefix state for dictionary words |
| Alignment state | Pair of positions in two strings |
Beginner Mental Model
Think of searching a word in a book. The slow way restarts from scratch after every mismatch.
String structure matching asks:
What did the previous comparisons already prove?
Can I reuse that proof instead of comparing characters again?
Sometimes the reusable memory is a prefix table. Sometimes it is a hash, trie, palindrome radius, or DP table.
From Brute Force To This Pattern
Brute force compares characters again and again:
try a start position
compare characters
on mismatch, move one step
compare the same characters again
String structure matching changes the question:
What did the previous comparisons prove, and how can I reuse that proof?
The shift is from repeated character comparison to reusable string memory.
What Information Becomes Reusable
Reusable information depends on the string relationship:
- KMP stores reusable prefix/suffix borders
- Z stores prefix-match lengths at every position
- rolling hash stores substring fingerprints
- tries store shared dictionary prefixes
- failure links store fallback states for many patterns
- palindrome centers store mirror expansion facts
- two-string DP stores alignment subproblems
What This Mental Model Prevents
This model prevents:
- restarting a pattern match from scratch after every mismatch
- confusing substring with subsequence
- recomputing substring equality character by character
- searching every dictionary word independently
- using floating or ad hoc string keys when a structural index is needed
- losing endpoint correctness in substring ranges
Small Visual Model
text: a b a b a c
pattern: a b a b x
^
mismatch
reuse border:
matched "abab" has border "ab"
continue from the border instead of restarting from zero
Core Mental Model
The central question is:
What string relationship can I remember so future comparisons do less work?
String structure problems are not solved by comparing characters harder. They are solved by naming a repeated relationship and storing enough memory to avoid restarting from scratch.
String algorithms are usually built from four pieces:
unit:
character, substring, prefix, suffix, word, or alignment state
memory:
border table, hash prefix, trie node, palindrome radius, DP cell, suffix rank
fallback or transition:
what happens when the next character does or does not match
answer surface:
index, count, length, reconstructed string, set of matches, or decision
How To Read This Family Visually
Use the same symbols throughout this section:
[] active substring, matched prefix, trie path, DP cell, or suffix interval
{} reusable memory: border, hash, trie node, radius, DP table, suffix order
^ current character, center, fallback point, or alignment state
--> match extension, fallback, trie edge, or DP transition
x mismatch, impossible alignment, or discarded comparison
== proven equal string relation
... unchecked text or remaining suffix
One-minute mental model:
see repeated string comparison
-> ask what relationship can be remembered
-> store border, hash, trie path, radius, DP cell, or suffix order
-> extend on match and fall back/transition on mismatch
-> invariant: memory represents the longest/exact reusable relation so far
Read every diagram as:
shape:
what string unit is being compared?
memory:
what relation is remembered from previous characters?
action:
does the next character extend, fall back, or branch?
invariant:
what equality or alignment fact remains true?
ASCII Visual Map
String structure should feel like saving a relationship so the next mismatch does not erase all previous work.
KMP border fallback:
pattern: a b a b a c
text: a b a b a x
^
mismatch
matched prefix "ababa" has border "aba"
fallback:
keep "aba" as useful matched work
continue from that border length
rolling hash interval:
string: a b c d e f
prefix: H0 H1 H2 H3 H4 H5 H6
hash [l, r):
l r
v v
c d e
hash = H[r] - H[l] * base^(r-l)
trie shared prefixes:
root
/ \
c d
/ \
a o
/ \ \
r t g
"car" and "cat" reuse the path c -> a before splitting.
palindrome center expansion:
chars: a b a b a
^
center
compare left/right mirrors:
a b a b a
^ | ^
left center right
two-string DP grid:
t index ->
+---+---+---+---+
s index | | | | |
| +---+---+---+---+
v | |(i,j) |
+---+---+---+---+
Cell (i, j) means the answer for two prefixes or two suffixes, depending on
the chosen convention.
Unit of Comparison
Choose what one step compares.
| Problem shape | Unit |
|---|---|
| exact pattern search | pattern character against text character |
| repeated substring | prefix/suffix border |
| substring equality | interval hash or suffix rank |
| dictionary matching | trie edge or word end |
| palindrome | mirrored character pair around a center |
| edit distance | (i, j) alignment state |
| subsequence matching | next character position with gaps allowed |
| suffix indexing | suffix order and common prefix length |
Problem Shape Diagrams
exact pattern search:
text: a b c a b c a b
pattern: a b c
^ ^ ^
unit = compare pattern[j] with text[i]
invariant: matched prefix equals the current text suffix
repeated substring:
string: a b a b a b
prefix: a b a b
suffix: a b a b
border = prefix that is also suffix
invariant: border can be reused after a mismatch
substring equality:
string: a b c d a b c x
[a b c] [a b c]
l1 r1 l2 r2
unit = interval hash or suffix-rank evidence
invariant: equal interval evidence implies equal substring under the model
dictionary matching:
root
|
c
/ \
a o
/ \
t d
unit = trie edge or terminal word marker
invariant: current trie path equals the scanned prefix
palindrome:
chars: r a c e c a r
^
center
^ ^
mirror pair compare
invariant: all matched mirror pairs around the center are equal
edit distance:
target index ->
+---+---+---+
source | | | |
index +---+---+---+
| | | i,j |
v +---+---+---+
unit = alignment state (i, j)
invariant: cell (i, j) owns the answer for chosen prefixes or suffixes
subsequence matching:
text: a x b y c z d
pattern: a b c
^ ^ ^
Gaps are allowed; order must be preserved.
invariant: matched pattern characters appear in increasing text positions
suffix indexing:
string: banana
suffixes:
0 banana
1 anana
2 nana
3 ana
4 na
5 a
sorted suffix order exposes common prefixes and duplicates.
invariant: adjacent sorted suffixes reveal reusable common-prefix structure
The unit decides the state. A substring problem can be about ranges, hashes, suffixes, or DP intervals depending on what relationship is reused.
Reusable Memory
The memory structure stores a proof about earlier characters.
| Strategy | Reusable memory |
|---|---|
| KMP | longest proper prefix that is also suffix for each pattern prefix |
| Z algorithm | longest match from current position to global prefix |
| rolling hash | numeric fingerprint of every prefix |
| trie | shared prefix path across many words |
| Aho-Corasick | trie plus failure links to next valid border |
| palindrome expansion | center and radius or DP interval truth |
| two-string DP | best answer for prefixes or suffixes (i, j) |
| suffix indexing | sorted suffixes, LCPs, or hash+binary-search evidence |
Good memory answers the next comparison question directly. Bad memory stores raw substrings repeatedly and recreates the same comparisons.
Fallback or Transition
The fallback rule is what makes the algorithm professional.
match:
extend the current relationship
mismatch:
reuse the largest still-valid relationship instead of restarting blindly
complete match:
record an answer and fall back to allow overlapping matches
For KMP, fallback follows the prefix table. For Aho-Corasick, failure links move to the longest suffix that is also a trie prefix. For edit DP, mismatch chooses insert, delete, or replace transitions.
Equality Representation
String equality can be represented in several ways.
| Need | Representation |
|---|---|
| exact online pattern match | KMP or Z |
| many substring equality queries | rolling hash or suffix structure |
| many dictionary prefixes | trie |
| many patterns in one scan | Aho-Corasick |
| mirror equality | palindrome center/radius |
| gapped equality | subsequence or edit DP |
Rolling hash is a probabilistic shortcut unless collision handling or double hashing is used. KMP/Z are exact but specialized for prefix-style reuse.
Endpoint and Interval Meaning
String bugs often come from unclear interval ownership.
prefix length k:
characters [0, k)
substring start, length:
characters [start, start + length)
DP state (i, j):
usually prefixes s[0:i], t[0:j] or suffixes s[i:], t[j:]
State the convention before mixing hashes, DP tables, or substring boundaries.
Universal String Structure Template
identify the string unit and match relation
choose exact reusable memory for that relation
build or update the memory with a clear endpoint convention
scan, query, or transition through the string:
compare the current unit
extend on match
fall back, branch, or update DP on mismatch
record answer when the memory proves the requested relation
return index, count, length, reconstruction, or decision
Strategy-Specific Mental Model
| Strategy | Mental model | Main risk |
|---|---|---|
| KMP/Z | prefix relationships prevent restart | wrong fallback after full match |
| rolling hash | substring equality becomes prefix arithmetic | collision or modulo mistakes |
| trie | words share prefix paths | forgetting terminal word state |
| Aho-Corasick | many patterns share trie and fallback links | missing output links |
| palindrome | mirror equality expands around centers | even/odd center mismatch |
| two-string DP | prefixes/suffixes encode alignment choices | wrong base row/column |
| suffix indexing | suffix order exposes duplicate/distinct substrings | unclear LCP semantics |
Mini Translation Table
| Statement signal | Likely memory |
|---|---|
| ”find all occurrences of pattern” | KMP/Z |
| ”many words with shared prefixes” | trie |
| ”many patterns in one text” | Aho-Corasick |
| ”repeated/duplicate substring” | suffix structure or rolling hash |
| ”palindromic substring/subsequence” | center/radius or interval DP |
| ”minimum edits” | two-string DP |
| ”subsequence count” | alignment DP or next-position table |
Mini Traces
Trace 1: KMP fallback.
matched:
pattern: a b a b a c
text: a b a b a x
^
mismatch
step 1:
matched length = 5 ("ababa")
invariant: first 5 pattern chars matched text suffix
step 2:
fallback to border length 3 ("aba")
invariant: "aba" is still a valid matched suffix, so work is reused
Trace 2: trie dictionary scan.
words:
cat, car, dog
scan prefix "ca":
root -> c -> a
step:
next char t reaches terminal "cat"
invariant:
current trie node represents exactly the scanned prefix path
Contrast Diagrams
Trie vs KMP vs rolling hash:
trie:
many dictionary words share prefix paths
invariant: node path equals a word prefix
KMP:
one pattern reuses its own borders during text scan
invariant: border length is the longest reusable matched prefix
rolling hash:
many substring equality checks become numeric interval checks
invariant: hash interval represents substring content under the hash model
Palindrome DP vs expand-around-center:
expand center:
left <- center -> right
invariant: mirrored chars matched for this center radius
interval DP:
dp[l][r] depends on dp[l+1][r-1]
invariant: cell records whether substring [l, r] is palindromic
Misclassification Checks
Use this family when repeated textual relationships are the bottleneck. Be careful when:
| Symptom | Better family may be |
|---|---|
| string is only scanned with counts | Sequential Aggregation |
| problem is mostly graph reachability | Connectivity Discovery |
| choices over string cuts repeat | State Space Exploration |
| range queries over character counts dominate | Range Query & Indexed Aggregation |
| numeric parsing/formula is the core | Mathematical Numeric Reasoning |
The clean test is: can you name the reusable string relationship that prevents the next comparison from restarting?
What This Family Is / Is Not
Use this family when repeated string relationships are the bottleneck.
Do not use it when:
- the problem only needs a simple frequency count
- the answer is about arbitrary graph reachability
- the main relation is numeric rather than textual
- the string is just incidental input for a different pattern
Universal Template
identify the string unit
define what it means to match
build reusable memory
scan or query using that memory
apply the mismatch or transition rule
return the required match, count, length, or decision
Detailed Translation Workflow
Translate every string-structure problem in this order:
1. Is the match contiguous or gapped?
2. What exact relationship repeats?
3. What unit is being compared?
4. What memory avoids repeated comparison?
5. What is the mismatch or transition rule?
6. What endpoint convention prevents off-by-one mistakes?
7. What does the answer ask for: index, count, length, boolean, or construction?
If the match is gapped, think subsequence/DP before substring algorithms. If the match is contiguous and repeated many times, think borders, hashes, tries, or suffix indexing.
Strategy Map
| Strategy | Use when | Reusable memory |
|---|---|---|
| KMP / Z | prefix-pattern reuse matters | border table or Z-box |
| Rolling hash | many substring equality checks | prefix hash arrays |
| Trie | many words share prefixes | prefix tree nodes |
| Aho-Corasick / multi-pattern | many patterns scan one text | trie plus failure links |
| Palindrome structure | mirror equality matters | centers, radii, intervals |
| Subsequence/edit matching | gapped alignment matters | (i, j) DP state |
| Suffix indexing | duplicate/distinct substring structure matters | suffix order, LCP, hash+binary search |
Problem Translation Table
| Problem type | Unit / state | Reuse rule | Engine |
|---|---|---|---|
| Find First Occurrence | pattern index | KMP border fallback | KMP |
| Repeated Substring Pattern | prefix/suffix border | longest border test | KMP |
| Shortest Palindrome | prefix palindrome border | KMP on combined string | KMP |
| Sum of Scores of Built Strings | prefix match length | Z-box reuse | Z algorithm |
| Repeated DNA Sequences | fixed substring | rolling hash / encoded window | rolling hash |
| Longest Duplicate Substring | substring length | hash equality plus binary search | rolling hash |
| Implement Trie | node per character | shared prefix path | trie |
| Word Break | string index | dictionary prefix transitions | trie or DP |
| Word Search II | grid path and trie node | prune missing prefixes | trie + backtracking |
| Aho-Corasick Search | automaton state | failure links | multi-pattern matching |
| Longest Palindromic Substring | center or interval | expand/mirror | palindrome |
| Palindromic Substrings | center | expand around center | palindrome |
| Longest Palindromic Subsequence | i, j | shrink interval | DP |
| Longest Common Subsequence | i, j | advance one or both strings | two-string DP |
| Edit Distance | i, j | insert/delete/replace | two-string DP |
| Distinct Subsequences | i, j | match or skip source char | DP |
| Suffix Array Search | suffix rank | neighboring suffix comparisons | suffix indexing |
Canonical Examples
KMP: Pattern Search
Invariant:
lps[i] is the length of the longest proper prefix that is also a suffix ending
at pattern[i].
Skeleton:
def build_lps(pattern):
lps = [0] * len(pattern)
j = 0
for i in range(1, len(pattern)):
while j > 0 and pattern[i] != pattern[j]:
j = lps[j - 1]
if pattern[i] == pattern[j]:
j += 1
lps[i] = j
return lps
Rolling Hash: Repeated DNA
Invariant:
The rolling window fingerprint represents the current fixed-length substring.
Use direct encoding for small alphabets when possible. It avoids collision risk from modular hashing.
Trie: Prefix Search
Invariant:
Every path from root to a node represents exactly one prefix.
class TrieNode:
def __init__(self):
self.children = {}
self.end = False
Palindrome: Expand Around Center
Invariant:
While left and right match, s[left:right + 1] is a palindrome.
def longest_palindrome(s):
best = ""
def expand(left, right):
nonlocal best
while left >= 0 and right < len(s) and s[left] == s[right]:
left -= 1
right += 1
candidate = s[left + 1:right]
if len(candidate) > len(best):
best = candidate
for i in range(len(s)):
expand(i, i)
expand(i, i + 1)
return best
Edit Distance: Two-String DP
Invariant:
dp[i][j] is the minimum edits to convert the first i characters of A into the
first j characters of B.
Decision Rules
Use KMP/Z when mismatch reuse in a single pattern matters. Use rolling hash when many substring equality checks are needed. Use a trie when many words share prefixes or prefix queries are frequent. Use multi-pattern automata when one text must be scanned against many patterns. Use palindrome center/interval logic when mirror symmetry is the structure. Use two-string DP when matching is gapped or edit-like. Use suffix indexing when the problem is about duplicate, distinct, or ordered substrings at scale.
Common Failure Modes
- Restarting pattern search from scratch after a mismatch.
- Mixing inclusive and half-open substring endpoints.
- Ignoring rolling-hash collision risk.
- Forgetting terminal markers in tries.
- Not pruning trie branches in grid word search.
- Treating subsequence matching as substring matching.
- Using center expansion for a subsequence problem.
Performance Notes
KMP/Z are linear. Rolling hash is usually linear for fixed windows and
O(n log n) when paired with binary search on length. Trie memory can be large
because each node stores children. Two-string DP is usually O(mn) and can
often compress rows. Suffix arrays are powerful but implementation-heavy.
Cheat Sheet
Pattern search:
What can I reuse after mismatch?
Hashing:
What substring equality checks repeat?
Trie:
Which prefixes are shared?
Palindrome:
What mirrors what?
Two-string DP:
What do i and j mean?
Suffix indexing:
Do I need global substring order or duplicates?
Extended Translation Notes
String structure problems should start with the relationship being reused.
Step 1: Identify Match Type
| Match type | Meaning | Typical tools |
|---|---|---|
| exact substring | contiguous pattern occurrence | KMP, Z, rolling hash |
| prefix/suffix border | prefix equals suffix | KMP, Z |
| dictionary prefix | many words share starts | trie |
| many patterns | many words against one text | trie, Aho-Corasick |
| palindrome | mirrored equality | center expansion, DP, Manacher |
| subsequence | ordered but not contiguous | two-pointer, DP |
| edit alignment | insert/delete/replace | two-string DP |
| duplicate substring | repeated contiguous block | rolling hash, suffix array |
Do not choose an algorithm before deciding whether the match is contiguous.
Step 2: Define the Unit
The unit may be:
character
substring
prefix
suffix
dictionary word
pattern state
palindrome center
two-string index pair
suffix rank
Step 3: Define Reuse After Mismatch
This is the heart of string matching.
Examples:
KMP:
fallback to the longest border
Z algorithm:
reuse the current prefix-match box
Trie:
stop when a prefix edge is missing
Aho-Corasick:
follow failure links
Edit distance:
try insert, delete, replace
Step 4: Choose Endpoint Convention
Use half-open ranges when possible:
s[left:right]
left included
right excluded
This reduces off-by-one bugs in substring DP and hashing.
Strategy Invariants
KMP
Invariant:
j is the number of pattern characters currently matched.
On mismatch, KMP does not reset to zero unless no border remains.
Z Algorithm
Invariant:
The Z-box [left, right] is the farthest known interval matching the prefix.
Z is often better when every position needs a prefix-match length.
Rolling Hash
Invariant:
hash(left, right) is computed from prefix hashes in O(1).
Use double hashing or direct encoding when collision risk matters.
Trie
Invariant:
Every node represents the prefix formed by the path from root.
Terminal markers distinguish complete words from prefixes.
Palindrome
Invariant:
Expansion is valid while left and right characters mirror each other.
Interval DP uses:
s[i] == s[j] and inside interval is palindrome
Two-String DP
Invariant:
dp[i][j] describes the relationship between prefixes or suffixes of two
strings.
Define whether i and j mean prefix lengths or current suffix indexes before
coding.
Worked Problem Snapshots
Repeated Substring Pattern
State:
longest border length of the whole string
Rule:
if n % (n - border) == 0, the string repeats a smaller unit
Engine:
KMP prefix table
Shortest Palindrome
Goal:
find longest palindromic prefix
Technique:
run KMP-style border on s + "#" + reversed(s)
Trap:
Use a separator that cannot appear as a normal bridge between the two strings.
Sum of Scores of Built Strings
State:
Z[i] = prefix match length starting at i
Answer:
sum of all Z values plus full string length
Engine:
Z algorithm
Longest Duplicate Substring
State:
candidate length
Check:
does any substring of this length have a repeated hash?
Engine:
binary search on length plus rolling hash
Word Search II
State:
grid position, trie node, visited path
Invariant:
If the current trie node has no edge for a character, no word under that prefix
can be found on this branch.
Engine:
trie plus backtracking
Longest Palindromic Subsequence
State:
i, j interval
Transition:
if s[i] == s[j]: 2 + solve(i + 1, j - 1)
else max(solve(i + 1, j), solve(i, j - 1))
Engine:
interval DP
Distinct Subsequences
State:
i in source, j in target
Choices:
skip source[i]
use source[i] if it matches target[j]
Engine:
two-string counting DP
Longest Common Substring
State:
dp[i][j] = length of common suffix ending at i and j
Trap:
Substring resets to zero on mismatch; subsequence does not.
Dry Runs
Dry Run: KMP Fallback
Pattern:
ababaca
When matching ababa and seeing a mismatch after length 5:
matched prefix = ababa
fallback length = lps[4] = 3
next comparison resumes as if aba is already matched
KMP reuses the border instead of restarting at the next text position.
Dry Run: Rolling Hash Window
String:
AAAAACCCCCAAAAA
For fixed length 10:
window 0: AAAAACCCCC -> store hash
window 1: AAAACCCCCA -> roll hash
...
later window: AAAAACCCCC -> hash repeats
The repeated hash identifies a repeated fixed-length substring.
Dry Run: Palindrome Center
String:
babad
Center at a:
left b, center a, right b -> bab
next expansion fails
Candidate:
bab
Another center gives aba; both length 3 answers are valid.