Module 9: String Matching¶
Pattern Matching¶
- Search for a string (pattern) in a large body of text.
- \(T[0..n - 1]\) – The text (or haystack) being searched within.
- \(P[0..m - 1]\) – The pattern (or needle) being searched for.
- Strings over alphabet \(\Sigma\).
- Return the first \(i\) such that
  \[ P[j] = T[i + j] \quad \text{for} \quad 0 \leq j \leq m - 1. \]
  This is the first occurrence of \(P\) in \(T\).
- If \(P\) does not occur in \(T\), return FAIL.
- Applications:
- Information Retrieval (text editors, search engines)
- Bioinformatics
- Data Mining
Definitions¶
- Substring \(T[i..j]\) where \(0 \leq i \leq j < n\): a string of length \(j - i + 1\) which consists of characters \(T[i], \dots, T[j]\) in order.
- Prefix of \(T\): a substring \(T[0..i]\) of \(T\) for some \(0 \leq i < n\).
- Suffix of \(T\): a substring \(T[i..n - 1]\) of \(T\) for some \(0 \leq i \leq n - 1\).
General Idea of Algorithms¶
Pattern matching algorithms consist of guesses and checks:
- A guess is a position \(i\) such that \(P\) might start at \(T[i]\). Valid guesses (initially) are \(0 \leq i \leq n - m\).
- A check of a guess is a single position \(j\) with \(0 \leq j < m\) where we compare \(T[i + j]\) to \(P[j]\). We must perform \(m\) checks of a single correct guess, but may make (many) fewer checks of an incorrect guess.
We will diagram a single run of any pattern matching algorithm by a matrix of checks, where each row represents a single guess.
Brute-force Algorithm¶
Idea:
Check every possible guess.
**BruteforcePM(T[0..n - 1], P[0..m - 1])**
T: String of length n (text), P: String of length m (pattern)
for i ← 0 to n - m do
match ← true
j ← 0
while j < m and match do
if T[i + j] = P[j] then
j ← j + 1
else
match ← false
if match then
return i
return FAIL
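A direct Python translation of the pseudocode above (a sketch; it returns `-1` in place of FAIL, and the function name is illustrative):

```python
def brute_force_pm(T, P):
    """Return the index of the first occurrence of P in T, or -1 (FAIL)."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):          # every valid guess 0 <= i <= n - m
        j = 0
        while j < m and T[i + j] == P[j]:
            j += 1                       # check next position of this guess
        if j == m:                       # all m checks succeeded
            return i
    return -1                            # no guess matched: FAIL
```

For example, `brute_force_pm("aaab", "aab")` returns `1`, the leftmost shift at which the pattern matches.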
- What is the worst possible input? \(P = a^{m-1}b, \, T = a^n\)
- Worst case performance \(\Theta((n - m + 1)m)\)
- If \(m \leq n/2 \Rightarrow \Theta(mn)\)
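The \(\Theta((n - m + 1)m)\) bound can be checked empirically by counting character comparisons; the helper below is a sketch (its name is illustrative). On the worst-case input \(P = a^{m-1}b\), \(T = a^n\), every guess performs all \(m\) comparisons before failing:

```python
def brute_force_comparisons(T, P):
    """Count the character comparisons made by the brute-force algorithm."""
    n, m = len(T), len(P)
    count = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            count += 1                   # one check of guess i
            if T[i + j] != P[j]:
                break                    # abandon this guess on mismatch
            j += 1
    return count

# worst case: (n - m + 1) * m comparisons, e.g.
# brute_force_comparisons("a" * 20, "aaab") == (20 - 4 + 1) * 4 == 68
```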
KMP Algorithm¶
- Knuth-Morris-Pratt algorithm (1977):
  - Compares the pattern to the text in left-to-right order.
  - Shifts the pattern more intelligently than the brute-force algorithm.
  - Key question: When a mismatch occurs, what is the most we can shift the pattern while reusing knowledge from previous matches?
- KMP answer: The largest prefix of \(P[0..j]\) that is a suffix of \(P[1..j]\).
KMP Failure Array¶
- Preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
- The failure array \(F\) of size \(m\): \(F[j]\) is defined as the length of the largest prefix of \(P[0..j]\) that is also a suffix of \(P[1..j]\).
- \(F[0] = 0\)
- On a mismatch \(T[i] \neq P[j]\) with \(j > 0\), set \(j \leftarrow F[j - 1]\) and keep \(i\) unchanged; if \(j = 0\), advance \(i\).
- Example: for \(P = \texttt{abacaba}\), \(F = [0, 0, 1, 0, 1, 2, 3]\).
KMP Algorithm¶
**KMP(T, P)**
T: String of length n (text), P: String of length m (pattern)
F ← failureArray(P)
i ← 0
j ← 0
while i < n do
if T[i] = P[j] then
if j = m - 1 then
return i - j // match
else
i ← i + 1
j ← j + 1
else
if j > 0 then
j ← F[j - 1]
else
i ← i + 1
return -1 // no match
Computing the Failure Array¶
**failureArray(P)**
P: String of length m (pattern)
F[0] ← 0
i ← 1
j ← 0
while i < m do
if P[i] = P[j] then
F[i] ← j + 1
i ← i + 1
j ← j + 1
else if j > 0 then
j ← F[j - 1]
else
F[i] ← 0
i ← i + 1
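The two routines above translate directly into Python; the sketch below mirrors the pseudocode (function names are illustrative, and `-1` stands in for "no match"):

```python
def failure_array(P):
    """F[j] = length of the largest prefix of P[0..j] that is a suffix of P[1..j]."""
    m = len(P)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:
            F[i] = j + 1                 # extend the current prefix match
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]                 # fall back to a shorter prefix
        else:
            F[i] = 0
            i += 1
    return F

def kmp(T, P):
    """Return the index of the first occurrence of P in T, or -1."""
    F = failure_array(P)
    n, m = len(T), len(P)
    i = j = 0
    while i < n:
        if T[i] == P[j]:
            if j == m - 1:
                return i - j             # match found at this shift
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]                 # reuse knowledge of matched prefix
        else:
            i += 1
    return -1                            # no match
```

For example, `failure_array("abacaba")` returns `[0, 0, 1, 0, 1, 2, 3]`, matching the definition above.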
Analysis¶
failureArray
- At each iteration of the while loop, either:
- \(i\) increases by one, or
- The guess index \(i - j\) increases by at least one (\(F[j - 1] < j\)).
- There are no more than \(2m\) iterations of the while loop.
- Running time: \(\Theta(m)\).
KMP
- failureArray can be computed in \(\Theta(m)\) time.
- At each iteration of the while loop, either:
- \(i\) increases by one, or
- The guess index \(i - j\) increases by at least one (\(F[j - 1] < j\)).
- There are no more than \(2n\) iterations of the while loop.
- Running time: \(\Theta(n)\).
Boyer-Moore Algorithm¶
Based on three key ideas:
- Reverse-order searching: Compare \(P\) with a subsequence of \(T\) moving backwards.
- Bad character jumps: When a mismatch occurs at \(T[i] = c\):
  - If \(P\) contains \(c\), shift \(P\) to align the last occurrence of \(c\) in \(P\) with \(T[i]\) (never shifting backwards).
  - Otherwise, shift \(P\) so that \(P[0]\) aligns with \(T[i + 1]\).
- Good suffix jumps: If a suffix of \(P\) has already matched, but there is a mismatch:
- Shift \(P\) forward to align with the previous occurrence of that suffix (similar to the failure array in KMP).
- When a mismatch occurs, Boyer-Moore uses either the bad character or good suffix shift, choosing the one that moves the pattern further.
- Can skip large parts of \(T\).
Last-Occurrence Function¶
- Preprocess the pattern \(P\) and the alphabet \(\Sigma\).
- Build the last-occurrence function \(L\) mapping \(\Sigma\) to integers.
- \(L(c)\) is defined as:
- The largest index \(i\) such that \(P[i] = c\), or
- \(-1\) if no such index exists.
- The last-occurrence function can be computed in time \(O(m + |\Sigma|)\).
- In practice, \(L\) is stored in a size-\(|\Sigma|\) array.
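A minimal Python sketch of the last-occurrence function (stored here in a dict rather than a size-\(|\Sigma|\) array; the name is illustrative):

```python
def last_occurrence(P, alphabet):
    """L[c] = largest index i with P[i] = c, or -1 if c does not occur in P."""
    L = {c: -1 for c in alphabet}        # default: character absent from P
    for i, c in enumerate(P):
        L[c] = i                         # later occurrences overwrite earlier ones
    return L
```

For example, with \(P = \texttt{abacab}\) over \(\Sigma = \{a, b, c, d\}\), `last_occurrence` gives \(L(a) = 4\), \(L(b) = 5\), \(L(c) = 3\), \(L(d) = -1\).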
Suffix Skip Array¶
- Preprocess the pattern \(P\) to build a table.
- Suffix skip array \(S\) of size \(m\): For \(0 \leq i < m\), \(S[i]\) is the largest index \(j\) such that
  \(P[i+1..m-1] = P[j+1..j+m-1-i]\) and \(P[j] \neq P[i]\).
- Note: In this calculation, any negative indices are considered to make the given condition true (these correspond to letters that we might not have checked yet).
- Can be computed similarly to the KMP failure array, with an extra condition, in \(\Theta(m)\) time.
Boyer-Moore Algorithm¶
**boyer-moore(T, P)**
L ← last occurrence array computed from P
S ← suffix skip array computed from P
i ← m - 1, j ← m - 1
while i < n and j ≥ 0 do
if T[i] = P[j] then
i ← i - 1
j ← j - 1
else
i ← i + m - 1 - min(L[T[i]], S[j])
j ← m - 1
if j = -1 return i + 1
else return FAIL
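The full algorithm needs both preprocessence arrays; as a runnable sketch, here is a simplified variant that uses only the bad-character rule, replacing `min(L[T[i]], S[j])` with `min(L[T[i]], j - 1)`. Omitting the good-suffix rule can make shifts smaller but preserves correctness. Names are illustrative:

```python
def boyer_moore_bc(T, P):
    """Boyer-Moore with the bad-character rule only. Returns first match or -1."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    L = {}
    for idx, c in enumerate(P):          # last-occurrence function over P
        L[c] = idx
    i = j = m - 1                        # compare the window right-to-left
    while i < n:
        if T[i] == P[j]:
            if j == 0:
                return i                 # whole pattern matched
            i -= 1
            j -= 1
        else:
            last = L.get(T[i], -1)       # -1 if T[i] does not occur in P
            # min(...) ensures we never shift the pattern backwards
            i += m - 1 - min(last, j - 1)
            j = m - 1                    # restart at the right end of the window
    return -1                            # FAIL
```

For example, `boyer_moore_bc("hello world", "world")` skips over most of the text and returns `6`.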
Boyer-Moore Algorithm Conclusion¶
- Worst-case running time: \(\in O(n + |\Sigma|)\).
- This complexity is difficult to prove.
- What is the worst case?
- On typical English text, the algorithm probes approximately 25% of the characters in \(T\).
- Faster than KMP in practice on English text.
Tries of Suffixes and Suffix Trees¶
- What if we want to search for many patterns \(P\) within the same fixed text \(T\)?
- Idea: Preprocess the text \(T\) rather than the pattern \(P\).
- Observation: \(P\) is a substring of \(T\) if and only if \(P\) is a prefix of some suffix of \(T\).
- To achieve this:
- Store all suffixes of \(T\) in a trie.
- To save space:
- Use a compressed trie.
- Store suffixes implicitly via indices into \(T\).
- This is called a suffix tree.
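The key observation can be checked directly, without any trie, by a naive sketch (quadratic time, purely illustrative; the suffix tree exists to answer the same question in \(O(|P|)\) time after preprocessing):

```python
def is_substring_via_suffixes(T, P):
    """P occurs in T if and only if P is a prefix of some suffix of T."""
    return any(T[i:].startswith(P) for i in range(len(T) + 1))
```

For example, `is_substring_via_suffixes("bananas", "nan")` is `True` because `"nan"` is a prefix of the suffix `"nanas"`.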
Building Suffix Trees¶
- Text \(T\) has \(n\) characters and \(n + 1\) suffixes (including the empty suffix).
- Construction:
- Build the suffix tree by inserting each suffix of \(T\) into a compressed trie.
- This process takes time \(\Theta(n^2)\).
- There is a way to build a suffix tree of \(T\) in \(\Theta(n)\) time.
- However, this method is quite complicated and beyond the scope of the course.
Suffix Trees: String Matching¶
Assume we have a suffix tree of text \(T\).
To search for pattern \(P\) of length \(m\):
- \(T\) is stored with an end-of-string sentinel \(\texttt{\$}\); we assume \(P\) does not contain the final \(\texttt{\$}\).
- \(P\) is the prefix of some suffix of \(T\).
- In the uncompressed trie, searching for \(P\) would be straightforward:
- \(P\) exists in \(T\) if and only if the search for \(P\) reaches a node in the trie.
- In the suffix tree, search for \(P\) until one of the following occurs:
- If the search fails due to "no such child," then \(P\) is not in \(T\).
- If we reach the end of \(P\), say at node \(v\), then jump to leaf \(\ell\) in the subtree of \(v\).
- (Suffix trees typically store such shortcuts.)
- Else we reach a leaf \(\ell = v\) while characters of \(P\) are left.
- Either way, the left index at \(\ell\) gives the shift that we should check.
- Time Complexity: This takes \(O(|P|)\) time.
Pattern Matching Conclusion¶
| | Brute-Force | KMP | Boyer-Moore | Suffix Trees |
|---|---|---|---|---|
| Preprocessing | – | \(O(m)\) | \(O(m + \vert\Sigma\vert)\) | \(O(n^2)\) (\(O(n)\) possible) |
| Search time | \(O(nm)\) | \(O(n)\) | \(O(n)\) (often better) | \(O(m)\) |
| Extra space | – | \(O(m)\) | \(O(m + \vert\Sigma\vert)\) | \(O(n)\) |