
Module 9: String Matching

Pattern Matching

  • Search for a string (pattern) in a large body of text.
  • \(T[0..n - 1]\) – The text (or haystack) being searched within.
  • \(P[0..m - 1]\) – The pattern (or needle) being searched for.
  • Strings over alphabet \(\Sigma\).
  • Return the first \(i\) such that

    \[ P[j] = T[i + j] \quad \text{for} \quad 0 \leq j \leq m - 1 \]
  • This is the first occurrence of \(P\) in \(T\).

  • If \(P\) does not occur in \(T\), return FAIL.
  • Applications:
    • Information Retrieval (text editors, search engines)
    • Bioinformatics
    • Data Mining
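
As a concrete point of reference (an addition, not part of the original notes): Python's built-in `str.find` follows exactly this specification, returning the index of the first occurrence or \(-1\) in place of FAIL.

```python
T = "abracadabra"
P = "cad"

# First i such that P[j] = T[i + j] for all 0 <= j <= m - 1, or -1 in place of FAIL.
assert T.find(P) == 4          # T[4..6] = "cad"
assert T.find("xyz") == -1     # FAIL
```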

Definitions

  • Substring \(T[i..j]\) where \(0 \leq i \leq j < n\): a string of length \(j - i + 1\) which consists of characters \(T[i], \dots, T[j]\) in order.
  • Prefix of \(T\): a substring \(T[0..i]\) of \(T\) for some \(0 \leq i < n\).
  • Suffix of \(T\): a substring \(T[i..n - 1]\) of \(T\) for some \(0 \leq i \leq n - 1\).
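
These inclusive index ranges translate directly into Python slices: \(T[i..j]\) is `T[i:j+1]`. A small illustration (the string is arbitrary, not from the notes):

```python
T = "pattern"                # n = 7
assert T[1:3 + 1] == "att"   # substring T[1..3], length 3 - 1 + 1 = 3
assert T[0:3 + 1] == "patt"  # a prefix T[0..3]
assert T[3:] == "tern"       # a suffix T[3..6]
```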

General Idea of Algorithms

Pattern matching algorithms consist of guesses and checks:

  • A guess is a position \(i\) such that \(P\) might start at \(T[i]\). Valid guesses (initially) are \(0 \leq i \leq n - m\).
  • A check of a guess is a single position \(j\) with \(0 \leq j < m\) where we compare \(T[i + j]\) to \(P[j]\). We must perform \(m\) checks of a single correct guess, but may make (many) fewer checks of an incorrect guess.

We will diagram a single run of any pattern matching algorithm by a matrix of checks, where each row represents a single guess.

Brute-force Algorithm

Idea:

Check every possible guess.

**BruteforcePM(T[0..n - 1], P[0..m - 1])**
T: String of length n (text), P: String of length m (pattern)

for i ← 0 to n - m do
    match ← true
    j ← 0
    while j < m and match do
        if T[i + j] = P[j] then
            j ← j + 1
        else
            match ← false
    if match then
        return i
return FAIL

Example

(Figure: a run of the brute-force algorithm shown as a matrix of checks, one row per guess.)

  • What is the worst possible input? \(P = a^{m-1}b, \, T = a^n\)
  • Worst case performance \(\Theta((n - m + 1)m)\)
  • If \(m \leq n/2 \Rightarrow \Theta(mn)\)
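
The following is a runnable Python sketch of BruteforcePM above (added as an illustration; the function name is chosen here, and \(-1\) stands in for FAIL):

```python
def brute_force_pm(T, P):
    """Return the first index where P occurs in T, or -1 (FAIL)."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):             # every valid guess i
        j = 0
        while j < m and T[i + j] == P[j]:  # check guess i, position by position
            j += 1
        if j == m:                         # all m checks succeeded
            return i
    return -1                              # FAIL

assert brute_force_pm("abracadabra", "cad") == 4
assert brute_force_pm("aaaa", "aab") == -1
```

On the worst-case input above (\(P = a^{m-1}b\), \(T = a^n\)), every guess performs \(m\) comparisons before failing, which gives the \(\Theta((n - m + 1)m)\) bound.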

KMP Algorithm

  • Knuth-Morris-Pratt algorithm (1977):

    • Compares the pattern to the text in left-to-right order.
    • Shifts the pattern more intelligently than the brute-force algorithm.
    • Key question: When a mismatch occurs, what is the most we can shift the pattern while reusing knowledge from previous matches?
  • KMP Answer: The largest prefix of \(P[0..j]\) that is a suffix of \(P[1..j]\).

KMP Failure Array

  • Preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
  • The failure array \(F\) of size \(m\): \(F[j]\) is defined as the length of the largest prefix of \(P[0..j]\) that is also a suffix of \(P[1..j]\).
  • \(F[0] = 0\)
  • On a mismatch \(P[j] \neq T[i]\) with \(j > 0\), set \(j \leftarrow F[j - 1]\) and keep \(i\); if \(j = 0\), advance \(i\) instead.
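
A worked example from the definition (the pattern is illustrative, not from the notes): for \(P = \texttt{abacaba}\),

\[ F = [0, 0, 1, 0, 1, 2, 3]. \]

For instance, \(F[6] = 3\) because \(\texttt{aba}\) is the longest prefix of \(P[0..6]\) that is also a suffix of \(P[1..6] = \texttt{bacaba}\).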

KMP Algorithm

**KMP(T, P)**
T: String of length n (text), P: String of length m (pattern)

F ← failureArray(P)
i ← 0
j ← 0
while i < n do
    if T[i] = P[j] then
        if j = m - 1 then
            return i - j  // match
        else
            i ← i + 1
            j ← j + 1
    else
        if j > 0 then
            j ← F[j - 1]
        else
            i ← i + 1
return -1  // no match

Computing the Failure Array

**failureArray(P)**
P: String of length m (pattern)

F[0] ← 0
i ← 1
j ← 0
while i < m do
    if P[i] = P[j] then
        F[i] ← j + 1
        i ← i + 1
        j ← j + 1
    else if j > 0 then
        j ← F[j - 1]
    else
        F[i] ← 0
        i ← i + 1
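
A runnable Python sketch combining the two pieces of pseudocode above (added as an illustration; the function names are chosen here, and \(-1\) stands in for FAIL):

```python
def failure_array(P):
    """F[j] = length of the longest prefix of P[0..j] that is a suffix of P[1..j]."""
    m = len(P)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:
            F[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]        # fall back within the pattern
        else:
            F[i] = 0
            i += 1
    return F

def kmp(T, P):
    """Return the index of the first occurrence of P in T, or -1 (FAIL)."""
    n, m = len(T), len(P)
    if m == 0:
        return 0                # empty pattern: a convention not covered by the pseudocode
    F = failure_array(P)
    i = j = 0
    while i < n:
        if T[i] == P[j]:
            if j == m - 1:
                return i - j    # match
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]        # reuse the partial match; i stays put
        else:
            i += 1
    return -1                   # no match

assert failure_array("abacaba") == [0, 0, 1, 0, 1, 2, 3]
assert kmp("abcabcd", "abcd") == 3
```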

Analysis

failureArray

  • At each iteration of the while loop, either:
    1. \(i\) increases by one, or
    2. The guess index \(i - j\) increases by at least one (\(F[j - 1] < j\)).
  • There are no more than \(2m\) iterations of the while loop.
  • Running time: \(\Theta(m)\).

KMP

  • failureArray can be computed in \(\Theta(m)\) time.
  • At each iteration of the while loop, either:
    1. \(i\) increases by one, or
    2. The guess index \(i - j\) increases by at least one (\(F[j - 1] < j\)).
  • There are no more than \(2n\) iterations of the while loop.
  • Running time: \(\Theta(n)\).

Boyer-Moore Algorithm

Based on three key ideas:

  • Reverse-order searching: Compare \(P\) with a subsequence of \(T\) moving backwards.
  • Bad character jumps: When a mismatch occurs at \(T[i] = c\):
    • If \(P\) contains \(c\), shift \(P\) to align the last occurrence of \(c\) in \(P\) with \(T[i]\).
    • Otherwise, shift \(P\) so that \(P[0]\) aligns with \(T[i + 1]\).
  • Good suffix jumps: If a suffix of \(P\) has already matched, but there is a mismatch:
    • Shift \(P\) forward to align with the previous occurrence of that suffix (similar to the failure array in KMP).
  • When a mismatch occurs, Boyer-Moore uses either the bad character or good suffix shift, choosing the one that moves the pattern further.
  • Can skip large parts of \(T\).

Last-Occurrence Function

  • Preprocess the pattern \(P\) and the alphabet \(\Sigma\).
  • Build the last-occurrence function \(L\) mapping \(\Sigma\) to integers.
  • \(L(c)\) is defined as:
    • The largest index \(i\) such that \(P[i] = c\), or
    • \(-1\) if no such index exists.
  • The last-occurrence function can be computed in time \(O(m + |\Sigma|)\).
  • In practice, \(L\) is stored in a size-\(|\Sigma|\) array.
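
A minimal Python sketch of this preprocessing (a dictionary stands in for the size-\(|\Sigma|\) array; characters that do not occur in \(P\) default to \(-1\)):

```python
def last_occurrence(P):
    """Map each character of P to the largest index at which it occurs in P."""
    return {c: i for i, c in enumerate(P)}  # later indices overwrite earlier ones

L = last_occurrence("needle")
assert L["e"] == 5 and L["n"] == 0
assert L.get("x", -1) == -1                 # character not occurring in P
```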

Suffix Skip Array

  • Preprocess the pattern \(P\) to build a table.
  • Suffix skip array \(S\) of size \(m\): For \(0 \leq i < m\), \(S[i]\) is the largest index \(j\) such that

    \(P[i+1..m-1] = P[j+1..j+m-1-i]\) and \(P[j] \neq P[i]\).

  • Note: In this calculation, any negative indices are considered to make the given condition true (these correspond to letters that we might not have checked yet).

  • Can be computed in \(\Theta(m)\) time, similarly to the KMP failure array but with an extra condition; a simpler quadratic construction is sketched below.
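
The sketch below is a direct quadratic translation of the definition above (negative indices are treated as matching anything, as noted); it is only meant for experimenting with small patterns, not the \(\Theta(m)\) construction:

```python
def suffix_skip(P):
    """S[i] = largest index j with P[i+1..m-1] = P[j+1..j+m-1-i] and P[j] != P[i],
    where any comparison involving a negative index is considered to succeed."""
    m = len(P)
    S = [0] * m
    for i in range(m):
        for j in range(i - 1, i - m - 1, -1):       # try the largest j first
            # compare P[i+1+k] with P[j+1+k] for k = 0 .. m-2-i
            suffix_ok = all(
                j + 1 + k < 0 or P[j + 1 + k] == P[i + 1 + k]
                for k in range(m - 1 - i)
            )
            if suffix_ok and (j < 0 or P[j] != P[i]):
                S[i] = j
                break
    return S

assert suffix_skip("aab") == [-3, -2, 1]
```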

Boyer-Moore Algorithm

**BoyerMoore(T, P)**
T: String of length n (text), P: String of length m (pattern)

L ← last occurrence array computed from P
S ← suffix skip array computed from P
i ← m - 1, j ← m - 1
while i < n and j ≥ 0 do
    if T[i] = P[j] then
        i ← i - 1
        j ← j - 1
    else
        i ← i + m - 1 - min(L[T[i]], S[j])
        j ← m - 1
if j = -1 then return i + 1
else return FAIL
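
A Python sketch mirroring the pseudocode above (it reuses `last_occurrence` and `suffix_skip` from the earlier sketches, and returns \(-1\) for FAIL):

```python
def boyer_moore(T, P):
    """Return the index of the first occurrence of P in T, or -1 (FAIL)."""
    n, m = len(T), len(P)
    L = last_occurrence(P)      # bad character table
    S = suffix_skip(P)          # good suffix table
    i = j = m - 1               # compare right to left
    while i < n and j >= 0:
        if T[i] == P[j]:
            i -= 1
            j -= 1
        else:
            # the smaller table value yields the larger jump forward
            i += m - 1 - min(L.get(T[i], -1), S[j])
            j = m - 1
    if j == -1:
        return i + 1            # match starts at T[i + 1]
    return -1                   # FAIL

assert boyer_moore("abacaab", "aab") == 4
```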

Boyer-Moore Algorithm Conclusion

  • Worst-case running time: \(\in O(n + |\Sigma|)\).
  • This complexity is difficult to prove.
  • What is the worst case?
  • On typical English text, the algorithm probes approximately 25% of the characters in \(T\).
  • Faster than KMP in practice on English text.

Tries of Suffixes and Suffix Trees

  • What if we want to search for many patterns \(P\) within the same fixed text \(T\)?
  • Idea: Preprocess the text \(T\) rather than the pattern \(P\).
  • Observation: \(P\) is a substring of \(T\) if and only if \(P\) is a prefix of some suffix of \(T\).
  • To achieve this:
    • Store all suffixes of \(T\) in a trie.
  • To save space:
    • Use a compressed trie.
    • Store suffixes implicitly via indices into \(T\).
  • This is called a suffix tree.
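
A tiny Python illustration of the observation above, listing the suffixes explicitly; this is the naive idea that the suffix tree makes space- and time-efficient, not the tree itself:

```python
T = "banana"
suffixes = [T[i:] for i in range(len(T))]    # all non-empty suffixes of T

def occurs(P):
    # P is a substring of T  <=>  P is a prefix of some suffix of T
    return any(s.startswith(P) for s in suffixes)

assert occurs("nan") and not occurs("nab")
```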

Building Suffix Trees

  • Text \(T\) has \(n\) characters and \(n + 1\) suffixes.
  • Construction:
    • Build the suffix tree by inserting each suffix of \(T\) into a compressed trie.
    • This process takes time \(\Theta(n^2)\).
  • There is a way to build a suffix tree of \(T\) in \(\Theta(n)\) time.
    • However, this method is quite complicated and beyond the scope of the course.
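
A naive Python sketch of the \(\Theta(n^2)\) construction, inserting every suffix of \(T\) (terminated by \(\texttt{\$}\)) into an uncompressed trie of nested dictionaries. This is not the compressed, index-based suffix tree described above, only the simplest version of the idea:

```python
def suffix_trie(T):
    """Uncompressed trie containing every suffix of T + '$'."""
    T = T + "$"
    root = {}
    for i in range(len(T)):          # insert the suffix starting at index i
        node = root
        for c in T[i:]:
            node = node.setdefault(c, {})
        node["start"] = i            # leaf: remember where this suffix begins
    return root
```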

Suffix Trees: String Matching

Assume we have a suffix tree of text \(T\).

To search for pattern \(P\) of length \(m\):

  • We assume that \(P\) does not contain the end-of-string marker \(\texttt{\$}\) appended to \(T\).
  • \(P\) is the prefix of some suffix of \(T\).
  • In the uncompressed trie, searching for \(P\) would be straightforward:
    • \(P\) occurs in \(T\) if and only if we can follow every character of \(P\) from the root, i.e., the search does not fail.
  • In the suffix tree, search for \(P\) until one of the following occurs:
    1. If the search fails due to "no such child," then \(P\) is not in \(T\).
    2. If we reach the end of \(P\), say at node \(v\), then jump to leaf \(\ell\) in the subtree of \(v\).
      • (Suffix trees typically store such shortcuts.)
    3. Else we reach a leaf \(\ell = v\) while characters of \(P\) are left.
  • Either way, the start index stored at leaf \(\ell\) gives the shift in \(T\) that we should check.
  • Time Complexity: This takes \(O(|P|)\) time.
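
For completeness, a search over the uncompressed trie from the previous sketch. Only the simple cases arise here: either some child is missing (case 1, so \(P\) is not in \(T\)), or we consume all of \(P\) and report the smallest start index stored below the node reached, which is the first occurrence; the compressed-tree cases 2 and 3 do not come up in this simplified setting:

```python
def trie_search(root, P):
    """First occurrence of P in the text behind an uncompressed suffix trie, or -1."""
    node = root
    for c in P:
        if c not in node:            # case 1: no such child
            return -1
        node = node[c]

    def min_start(n):                # smallest suffix start index in this subtree
        best = n.get("start", float("inf"))
        for key, child in n.items():
            if key != "start":
                best = min(best, min_start(child))
        return best

    return min_start(node)

trie = suffix_trie("banana")
assert trie_search(trie, "ana") == 1
assert trie_search(trie, "nab") == -1
```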

Pattern Matching Conclusion

| | Brute-Force | KMP | Boyer-Moore | Suffix Trees |
|---|---|---|---|---|
| Preprocessing | none | \(O(m)\) | \(O(m + \vert\Sigma\vert)\) | \(O(n^2)\) |
| Search time | \(O(nm)\) | \(O(n)\) | \(O(n)\) (often better) | \(O(m)\) |
| Extra space | \(O(1)\) | \(O(m)\) | \(O(m + \vert\Sigma\vert)\) | \(O(n)\) |