
Module 6: Dictionaries for special keys

Lower Bound

Dictionary ADT: Implementations thus far

Realizations we have seen so far:

  • Balanced Binary Search trees (AVL trees):
    • \(\Theta(\log n)\) search, insert, and delete (worst-case)
  • Skip lists:
    • \(\Theta(\log n)\) search, insert, and delete (expected)
  • Various other realizations:
    • Sometimes faster on insert, but search always takes \(\Omega(\log n)\) time.

Question:

Can one do better than \(\Theta(\log n)\) time for search?

Answer:

Yes and no! It depends on what we allow.

  • No: Comparison-based searching lower bound is \(\Omega(\log n)\).
  • Yes: Non-comparison-based searching can achieve \(o(\log n)\) (under restrictions!).

Theorem: Any comparison-based algorithm requires \(\Omega(\log n)\) comparisons in the worst case to search among \(n\) distinct items.

\[ \begin{align*} \#\text{outcomes} &\geq n+1 \\ \Rightarrow \text{height of decision tree} &\geq \log(n+1) \\ \Rightarrow \#\text{comparisons for some input} &\geq \log(n+1) \end{align*} \]

[Figure: decision tree for comparison-based search]

Interpolation Search

Code very similar to binary search, but compare at index

\[ \ell + \left\lfloor \frac{k - A[\ell]}{A[r] - A[\ell]} \times (r - \ell) \right\rfloor \]

\(k - A[\ell]\) is the distance from left key, \(A[r] - A[\ell]\) is the distance between left and right keys, and \((r - \ell)\) is the number of indices in range.

A few extra tests are needed to avoid a crash (division by zero or an out-of-range index) during the computation of \(m\).

**interpolation-search(A, k)**

l ← 0, r ← A.size − 1
while (l ≤ r)
    if (k < A[l] or k > A[r]) then return “not found”
    if (k = A[r]) then return “found at A[r]”
    m ← l + ⌊(k − A[l]) · (r − l) / (A[r] − A[l])⌋
    if (A[m] = k) then return “found at A[m]”
    else if (A[m] < k) then l ← m + 1
    else r ← m − 1
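
For concreteness, here is a minimal Python sketch of the same routine (a sketch, not the notes' code; the name `interpolation_search` and the convention of returning an index or None are illustrative):

```python
def interpolation_search(A, k):
    """Search the sorted list A for key k; return its index, or None if not present."""
    l, r = 0, len(A) - 1
    while l <= r:
        if k < A[l] or k > A[r]:
            return None
        if k == A[r]:
            return r
        # Interpolate where k should lie.  The two tests above guarantee
        # A[l] <= k < A[r], so the denominator is non-zero and l <= m < r.
        m = l + (k - A[l]) * (r - l) // (A[r] - A[l])
        if A[m] == k:
            return m
        elif A[m] < k:
            l = m + 1
        else:
            r = m - 1
    return None
```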

In the worst case, the algorithm can be very slow (\(\Theta(n)\) time, e.g. when the values are not uniformly distributed), but it works well on average.

  • Can show: \(T^\text{avg}(n) \leq T^\text{avg}(\sqrt{n}) + \Theta(1)\)
  • This resolves to \(T^\text{avg}(n) \in O(\log \log n)\)
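
A quick way to see the second step (a sketch; substitute \(n = 2^m\), i.e. \(m = \log n\)):

\[ T^\text{avg}(2^m) \leq T^\text{avg}(2^{m/2}) + \Theta(1) \quad\Longrightarrow\quad T^\text{avg}(2^m) \in O(\log m), \]

so \(T^\text{avg}(n) \in O(\log \log n)\).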

Tries

Words (review)

Scenario: the keys in the dictionary are words, so we briefly review the relevant terminology.

Words (= strings): sequences of characters over alphabet \(\Sigma\)

  • Typical alphabets: {0, 1} (→bitstrings), ASCII, {C, G, T, A}
  • Stored in an array: w[i] gets i-th character (for i = 0, 1...)
  • Convention: Words have end-sentinel $ (sometimes not shown)
    • w.size = |w| = # non-sentinel characters, e.g. |be$| = 2.

Should know:

  • Prefix, suffix, substring
  • Sort words lexicographically: be$ \(<_{lex}\) bear$ \(<_{lex}\) beer$
    • This is different from sorting numbers: 001$ \(<_{lex}\) 010$ \(<_{lex}\) 1$

Tries: Introduction

Trie (also known as radix tree): A dictionary for bitstrings.

  • Comes from retrieval, but pronounced "try"
  • A tree based on bitwise comparisons: Edge labelled with corresponding bit
  • Similar to radix sort: use individual bits, not the whole key
  • Due to end-sentinels, all key-value pairs are at leaves.

[Figure: example of a binary trie]

  • Follow links that correspond to current bits in w.
  • Repeat until no such link or w found at a leaf.

As with skip lists, we first compute the search path as a separate routine.

**Trie::get-path-to(w)**
Output: Stack with all ancestors of where w would be stored.

P ← empty stack; z ← root; d ← 0; P.push(z)
while d ≤ |w|
    if z has a child-link labelled with w[d]
        z ← child at this link; d++; P.push(z)
    else break
return P
**Trie::search(w)**

P ← get-path-to(w), z ← P.top
if (z is not a leaf) then
    return “not found, would be in sub-trie of z”
return key-value pair at z
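
A minimal Python sketch of this structure, assuming each word is a string that already ends with the sentinel '$'; the class and attribute names (TrieNode, children, word, value, leaf) are illustrative, not fixed by the notes:

```python
class TrieNode:
    """Node of an (uncompressed) binary trie.
    Child links are keyed by '0', '1', or the end-sentinel '$'."""
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.word = None     # at a leaf: the stored word (including '$')
        self.value = None    # at a leaf: the associated value
        self.leaf = None     # some leaf in this subtree (used for prefix-search below)

def get_path_to(root, w):
    """Return the list of ancestors (root first) of the node where w would be stored."""
    path, z, d = [root], root, 0
    while d < len(w):
        if w[d] in z.children:
            z = z.children[w[d]]
            path.append(z)
            d += 1
        else:
            break
    return path

def search(root, w):
    """Return the value stored for w, or None if w is not in the trie."""
    z = get_path_to(root, w)[-1]
    if z.children or z.word != w:   # stopped at an internal node, or the trie is empty
        return None
    return z.value
```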

Tries: Leaf-references

For later applications of tries, another search operation is essential:

  • prefix-search(w): Find word w' in trie for which w is a prefix.

Features:

  • Testing: Checking whether such a w' exists is straightforward: traverse down the trie and check whether the path for the prefix w exists.
  • Leaf-references: To efficiently find w', nodes store references to leaves:
    • Each node z stores a reference z.leaf to a leaf in its subtree.
    • Convention: Store the leaf with the longest word to ensure all potential matches are covered.

[Figure: binary trie with leaf-references]

**Trie::prefix-search(w)**

P ← get-path-to(w), p ← P.top()
if number of nodes on P is w.size or less
    return “no extension of w found”
return p.leaf

We need more than w.size nodes on P to have an extension (the root is also on the stack P, so matching all of w gives w.size + 1 nodes).
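
Continuing the sketch above, prefix-search needs only the number of nodes on the path and the leaf-reference at its top (here the prefix w is passed without the sentinel; the tuple return value is an illustrative choice):

```python
def prefix_search(root, w):
    """Return (word, value) of some stored word that has w as a prefix, or None."""
    path = get_path_to(root, w)      # w has no '$', so at most len(w) links are followed
    if len(path) <= len(w) or path[-1].leaf is None:
        return None                  # "no extension of w found"
    z = path[-1].leaf                # any leaf below this node is an extension of w
    return (z.word, z.value)
```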

Tries: Insert

Trie::insert(w)

  • \(P \leftarrow\) get-path-to(w) gives the ancestors that already exist.
  • Expand the trie from P.top() by adding necessary nodes that correspond to extra bits of w.
  • Update leaf-references (also on \(P\), if w is longer than the word at the current leaf-reference).
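
A sketch of insert in the same Python setting (w again includes the trailing '$'); the longest-word convention for leaf-references is maintained by comparing the words stored at the leaves:

```python
def insert(root, w, value):
    """Insert word w (ending with '$') with its value; overwrites an existing entry."""
    path = get_path_to(root, w)
    z, d = path[-1], len(path) - 1     # the first d characters of w already have nodes
    while d < len(w):                  # expand the trie with the remaining characters of w
        child = TrieNode()
        z.children[w[d]] = child
        z, d = child, d + 1
        path.append(z)
    z.word, z.value = w, value         # z is now the leaf storing w
    for node in path:                  # update leaf-references (longest-word convention)
        if node.leaf is None or len(node.leaf.word) < len(w):
            node.leaf = z

# Example: build a tiny trie and query it.
root = TrieNode()
for word, val in [("0$", "A"), ("011$", "B"), ("0111$", "C")]:
    insert(root, word, val)
assert search(root, "011$") == "B"
assert prefix_search(root, "01") == ("0111$", "C")   # leaf-reference keeps the longest word
```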

Tries: Delete

Trie::delete(w)

  • \(P \leftarrow\) get-path-to(w) gives all ancestors.
  • Let \(\ell\) be the leaf where w is stored.
  • Delete \(\ell\) and nodes on \(P\) until an ancestor has two or more children.
  • Update leaf-references on the rest of \(P\). (If \(z \in P\) referred to \(\ell\), find a new z.leaf from the other children.)
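
A sketch of delete in the same setting; `pick_leaf_from_children` is a hypothetical helper name, not something from the notes:

```python
def delete(root, w):
    """Remove word w (ending with '$'); return its value, or None if it was not stored."""
    path = get_path_to(root, w)
    leaf = path[-1]
    if leaf.children or leaf.word != w:        # w is not in the trie
        return None
    value = leaf.value
    i = len(path) - 1
    while i > 0 and not path[i].children:      # remove nodes that are left without children
        del path[i - 1].children[w[i - 1]]     # edge from path[i-1] to path[i] is labelled w[i-1]
        i -= 1
    for node in reversed(path[:i + 1]):        # fix leaf-references bottom-up on the rest of P
        if node.leaf is leaf:
            node.leaf = pick_leaf_from_children(node)
    return value

def pick_leaf_from_children(z):
    """Choose a new leaf-reference for z: the longest word among the children's leaves."""
    refs = [c.leaf for c in z.children.values() if c.leaf is not None]
    return max(refs, key=lambda l: len(l.word)) if refs else None
```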

Binary Tries summary

search(w), prefix-search(w), insert(w), delete(w) all take time \(\Theta(|w|)\).

  • Search-time is independent of number \(n\) of words stored in the trie!
  • Search-time is small for short words.

The trie for a given set of words is unique (except for order of children and ties among leaf-references)

Disadvantages:

  • Tries can be wasteful with respect to space.
  • Worst-case space is \(\Theta(n · (\text{maximum length of a word}))\)

[Figure: example trie illustrating worst-case space usage]

Variations of Tries: Pruned Tries

Pruned Trie:

Stop adding nodes to trie as soon as the key is unique.

  • A node has a child only if it has at least two descendants.
  • Saves space if only a few of the bitstrings are long.
  • Could even store infinite bitstrings (e.g. real numbers).

A more efficient version of tries, but the operations get a bit more complicated.

[Figure: example of a pruned trie]

Compressed Tries

Another (important!) variation:

  • Compress paths of nodes with only one child.
  • Each node stores an index, corresponding to the level of the node in the uncompressed trie. (On level \(d\), we searched for link with \(w[d]\).)

[Figure: example of a compressed trie with node indices]

  • As for tries, follow the links that correspond to the current bits of \(w\).
  • Main difference: stored indices say which bits to compare.
  • Also: must compare \(w\) to word found at the leaf
**CompressedTrie::get-path-to(w)**

P ← empty stack; z ← root; P.push(z)
while z is not a leaf and (d ← z.index) ≤ w.size do
    if (z has a child-link labelled with w[d]) then
        z ← child at this link; P.push(z)
    else break
return P
**CompressedTrie::search(w)**

P ← get-path-to(w), z ← P.top
if (z is not a leaf or word stored at z is not w) then
    return "not found"
return key-value pair at z

prefix-search(w): Compare w to z.leaf at last visited node z.
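
A minimal Python sketch of compressed-trie search (CTrieNode and its attributes are illustrative; words again end with '$'):

```python
class CTrieNode:
    """Node of a compressed binary trie."""
    def __init__(self, index=0):
        self.index = index    # internal node: which character of w the children branch on
        self.children = {}    # character -> CTrieNode
        self.word = None      # at a leaf: the full stored word (including '$')
        self.value = None     # at a leaf: the associated value
        self.leaf = None      # some leaf in this subtree (for prefix-search)

def cget_path_to(root, w):
    """Follow child links chosen by the stored indices; return the list of visited nodes."""
    path, z = [root], root
    while z.children and z.index < len(w) and w[z.index] in z.children:
        z = z.children[w[z.index]]
        path.append(z)
    return path

def csearch(root, w):
    """Return the value stored for w, or None.  Because skipped characters were never
    compared, the word stored at the leaf must be checked explicitly."""
    z = cget_path_to(root, w)[-1]
    return z.value if not z.children and z.word == w else None
```

For prefix-search, the analogous final check compares w against the word at the last visited node's leaf-reference, as stated above.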

[Figure: prefix-search in a compressed trie]

Compressed Tries: Summary

  • search(w) and prefix-search(w) are easy.
  • insert(w) and delete(w) are conceptually simple:
    • Search for path \(P\) to word w (say we reach node z)
    • Uncompress this path (using characters of z.leaf)
    • Insert/Delete w as in an uncompressed trie.
    • Compress path from root to where change happened
    • (Pseudocode gets more complicated and is omitted.)
  • All operations take \(O(|w|)\) time for a word w.
  • Compressed tries use \(O(n)\) space
    • We have \(n\) leaves.
    • Every internal node has two or more children.
    • Can show: Therefore more leaves than internal nodes.

Overall, code is more complicated, but space-savings are worth it if words are unevenly distributed.

Multiway Tries: Larger Alphabet

  • To represent strings over any fixed alphabet \(\Sigma\)
  • Any node will have at most \(|\Sigma| + 1\) children (one child for the end-of-word character $)
  • Example: A trie holding strings {bear$, ben$, be$, soul$, soup$}

[Figure: multiway trie holding {bear$, ben$, be$, soul$, soup$}]

Compressed Multiway Tries

  • Variation: Compressed multi-way tries: compress paths as before
  • Example: A compressed trie holding strings {bear$, ben$, be$, soul$, soup$}

[Figure: compressed multiway trie holding {bear$, ben$, be$, soul$, soup$}]

Multiway Tries: Summary

  • Operations search(w), prefix-search(w), insert(w) and delete(w) are exactly as for tries for bitstrings.
  • Run-time \(O(|w| \cdot (\text{time to find the appropriate child}))\)
  • Each node now has up to \(|\Sigma| + 1\) children. How should they be stored?

    [Figure: options for storing the children of a multiway-trie node]

  • Time/space tradeoff:

    • Arrays are fast, lists are space-efficient.
    • Dictionary best in theory, not worth it in practice unless \(|\Sigma|\) is huge.
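
As a small illustration of the trade-off, here is a sketch of a multiway-trie node that keeps its children in a hash map (a Python dict); an array of size \(|\Sigma| + 1\) indexed by character, or a list of (character, child) pairs, would be the other options mentioned above:

```python
class MultiwayTrieNode:
    """Node of a multiway trie over an arbitrary alphabet plus the end-of-word character '$'."""
    def __init__(self):
        self.children = {}   # character -> MultiwayTrieNode (hash map: fast expected lookup)
        self.value = None    # stored only at '$'-leaves

def multiway_insert(root, w, value):
    """Insert word w (ending with '$'); leaf-references are omitted to keep the sketch short."""
    z = root
    for c in w:
        z = z.children.setdefault(c, MultiwayTrieNode())
    z.value = value
```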