Module 6: Dictionaries for special keys¶
Lower Bound¶
Dictionary ADT: Implementations thus far¶
Realizations we have seen so far:
- Balanced Binary Search trees (AVL trees):
- \(\Theta(\log n)\) search, insert, and delete (worst-case)
- Skip lists:
- \(\Theta(\log n)\) search, insert, and delete (expected)
- Various other realizations:
- Sometimes faster on insert, but search always takes \(\Omega(\log n)\) time.
Question:
Can one do better than \(\Theta(\log n)\) time for search?
Answer:
Yes and no! It depends on what we allow.
- No: Comparison-based searching lower bound is \(\Omega(\log n)\).
- Yes: Non-comparison-based searching can achieve \(o(\log n)\) (under restrictions!).
Lower bound for search¶
Theorem: Any comparison-based algorithm requires \(\Omega(\log n)\) comparisons in the worst case to search among \(n\) distinct items.

Interpolation Search¶
Code is very similar to binary search, but we compare at index \(m = \ell + \left\lfloor \frac{k - A[\ell]}{A[r] - A[\ell]} \cdot (r - \ell) \right\rfloor\).
Here \(k - A[\ell]\) is the distance of the search key from the left key, \(A[r] - A[\ell]\) is the distance between left and right keys, and \(r - \ell\) is the number of indices in range.
Need a few extra tests to avoid crash during computation of \(m\).
**interpolation-search(A, k)**
l←0,r←A.size−1
while (l ≤ r)
if (k < A[l] or k > A[r]) return “not found”
if (k = A[r]) then return “found at A[r]”
m ← l + ⌊(k − A[l]) · (r − l) / (A[r] − A[l])⌋
if (A[m] equals k) then return “found at A[m]”
else if (A[m]<k) then l←m+1
else r←m−1
In the worst case the algorithm can be very slow (\(\Theta(n)\) time, e.g. when the values are far from uniformly distributed), but it works well on average.
- Can show: \(T^\text{avg}(n) \leq T^\text{avg}(\sqrt{n}) + \Theta(1)\)
- This resolves to \(T^\text{avg}(n) \in O(\log \log n)\)
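The pseudocode above can be sketched in Python; the function name and the return convention (index if found, `None` otherwise) are my own choices.

```python
def interpolation_search(A, k):
    """Search sorted array A for key k; return its index or None."""
    l, r = 0, len(A) - 1
    while l <= r:
        # Extra tests: avoid out-of-range keys and division by zero below.
        if k < A[l] or k > A[r]:
            return None
        if k == A[r]:
            return r
        # Interpolate: estimate k's position proportionally between A[l], A[r].
        # Since A[l] <= k < A[r] here, the divisor is nonzero and l <= m < r.
        m = l + (k - A[l]) * (r - l) // (A[r] - A[l])
        if A[m] == k:
            return m
        elif A[m] < k:
            l = m + 1
        else:
            r = m - 1
    return None
```

For example, `interpolation_search([1, 3, 5, 7, 9], 7)` returns 3 after a single probe, since the keys are perfectly uniform.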
Tries¶
Words (review)¶
Scenario: Keys in dictionary are words. Need brief review.
Words (= strings): sequences of characters over alphabet \(\Sigma\)
- Typical alphabets: {0, 1} (→bitstrings), ASCII, {C, G, T, A}
- Stored in an array: \(w[i]\) gets the \(i\)-th character (for \(i = 0, 1, \ldots\))
- Convention: Words have end-sentinel $ (sometimes not shown)
- w.size = |w| = # of non-sentinel characters, e.g. |be$| = 2.
Should know:
- Prefix, suffix, substring
- Sort words lexicographically: be$ \(<_{lex}\) bear$ \(<_{lex}\) beer$
- This is different from sorting numbers: 001$ \(<_{lex}\) 010$ \(<_{lex}\) 1$
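Both orderings can be checked directly in Python, because the ASCII code of the sentinel '$' (36) is smaller than that of every digit and letter, so a word that is a proper prefix of another sorts before it:

```python
# Lexicographic order with '$' as end-sentinel: '$' < '0' < '1' < 'a' < ... in ASCII.
words = ["beer$", "be$", "bear$"]
print(sorted(words))    # ['be$', 'bear$', 'beer$']

# Bitstrings sort lexicographically, NOT numerically: "1$" comes last.
bits = ["1$", "010$", "001$"]
print(sorted(bits))     # ['001$', '010$', '1$']
```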
Tries: Introduction¶
Trie (also known as radix tree): A dictionary for bitstrings.
- Comes from retrieval, but pronounced "try"
- A tree based on bitwise comparisons: Edge labelled with corresponding bit
- Similar to radix sort: use individual bits, not the whole key
- Due to end-sentinels, all key-value pairs are at leaves.

Tries: Search¶
- Follow links that correspond to the current bits in \(w\).
- Repeat until no such link exists or \(w\) is found at a leaf.
As with skip lists, we first find the search path separately.
**Trie::get-path-to(w)**
Output: Stack with all ancestors of where w would be stored.
P ← empty stack; z ← root; d ← 0; P.push(z)
while d ≤ |w|
if z has a child-link labelled with w[d]
z ← child at this link; d++; P.push(z)
else break
return P
**Trie::search(w)**
P ← get-path-to(w), z ← P.top
if (z is not a leaf) then
return “not found, would be in sub-trie of z”
return key-value pair at z
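Here is a minimal Python sketch of these two routines. The node class (a children map plus a value field set only at leaves) and the small insert helper used to build the example trie are my own additions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character ('0', '1', or '$') to a child
        self.value = None    # key-value payload, set only at leaves

def get_path_to(root, w):
    """Return the stack (here: a list) of existing ancestors of where
    w would be stored; w must include the end-sentinel '$'."""
    path, z = [root], root
    for c in w:                          # d runs over 0 .. |w|
        if c not in z.children:          # no child-link labelled w[d]
            break
        z = z.children[c]
        path.append(z)
    return path

def search(root, w):
    path = get_path_to(root, w)
    z = path[-1]
    if len(path) == len(w) + 1 and not z.children:   # reached the leaf for w
        return z.value
    return None              # not found; would be in the sub-trie of z

def insert(root, w, value):
    """Helper to build examples: expand the trie from the end of the
    existing path by adding nodes for the remaining bits of w."""
    path = get_path_to(root, w)
    z = path[-1]
    for c in w[len(path) - 1:]:
        z.children[c] = TrieNode()
        z = z.children[c]
    z.value = value
```

For example, after `insert(root, "011$", 7)`, `search(root, "011$")` returns 7 while `search(root, "01$")` returns None.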
Tries: Leaf-references¶
For later applications of tries, another search operation is essential:
- prefix-search(w): Find a word \(w'\) in the trie for which \(w\) is a prefix.
Features:
- Testing: Checking whether such a \(w'\) exists is straightforward: simply traverse down the trie and check that the prefix path exists.
- Leaf-references: To find \(w'\) efficiently, nodes store references to leaves:
  - Each node \(z\) stores a reference z.leaf to a leaf in its subtree.
  - Convention: Store the leaf with the longest word, to ensure all potential matches are covered.

**Trie::prefix-search(w)**
P ← get-path-to(w), p ← P.top()
if number of nodes on P is w.size or less
return “no extension of w found”
return p.leaf
We need more than w.size nodes on P to have an extension: the stack P also contains the root, so a fully matched prefix path has w.size + 1 nodes.
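The following sketch implements prefix-search without stored leaf-references: after following the prefix path it descends to an arbitrary leaf below, which costs O(length of the answer) rather than the O(|w|) achieved with z.leaf. The node layout and the insert helper are assumptions for the demo.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child node
        self.word = None     # full stored word, set only at leaves

def insert(root, w):
    z = root
    for c in w:                              # w includes the sentinel '$'
        z = z.children.setdefault(c, TrieNode())
    z.word = w

def prefix_search(root, w):
    """Find some stored word with prefix w (w given without sentinel)."""
    z = root
    for c in w:                              # follow the prefix path
        if c not in z.children:
            return None                      # no extension of w found
        z = z.children[c]
    while z.word is None:                    # descend to any leaf below z
        z = next(iter(z.children.values()))
    return z.word
```

For instance, with "001$" and "01$" stored, `prefix_search(root, "00")` returns "001$" and `prefix_search(root, "1")` returns None.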
Tries: Insert¶
Trie::insert(w)
- \(P\) ← get-path-to(w) gives the ancestors that exist already.
- Expand the trie from P.top() by adding the nodes that correspond to the extra bits of w.
- Update leaf-references (also on \(P\) if w is longer than the previous leaf-words).
Tries: Delete¶
Trie::delete(w)
- \(P\) ← get-path-to(w) gives all ancestors.
- Let ℓ be the leaf where w is stored.
- Delete ℓ and nodes on \(P\) until ancestor has two or more children.
- Update leaf-references on rest of \(P\). (If z ∈ \(P\) referred to ℓ, find new z.leaf from other children.)
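The delete steps can be sketched as follows, with a list playing the role of the ancestor stack \(P\); leaf-reference maintenance is omitted, and the node class and insert helper are my own additions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child node
        self.value = None    # payload, set only at leaves

def insert(root, w, value):
    z = root
    for c in w:                          # w includes the sentinel '$'
        z = z.children.setdefault(c, TrieNode())
    z.value = value

def delete(root, w):
    """Remove w's leaf, then prune ancestors bottom-up until one that
    still has another child remains.  Returns True on success.
    (Updating leaf-references along the path is omitted here.)"""
    path, z = [root], root               # path plays the role of P
    for c in w:
        if c not in z.children:
            return False                 # w is not stored
        z = z.children[c]
        path.append(z)
    if z.children:
        return False                     # not a leaf
    for i in range(len(w), 0, -1):       # walk back up towards the root
        parent = path[i - 1]
        del parent.children[w[i - 1]]    # detach the now-childless node
        if parent.children:              # parent still has another child
            break
    return True
```

After deleting "00$" from a trie holding {"00$", "01$"}, the node for prefix "0" survives because it still leads to "01$"; deleting "01$" as well empties the trie.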
Binary Tries summary¶
search(w), prefix-search(w), insert(w), delete(w) all take time \(\Theta(|w|)\).
- Search-time is independent of number \(n\) of words stored in the trie!
- Search-time is small for short words.
The trie for a given set of words is unique (except for order of children and ties among leaf-references)
Disadvantages:
- Tries can be wasteful with respect to space.
- Worst-case space is \(\Theta(n · (\text{maximum length of a word}))\)

Variations of Tries: Pruned Tries¶
Pruned Trie:
Stop adding nodes to trie as soon as the key is unique.
- A node has children only if it has at least two descendant words.
- Saves space if only a few of the bitstrings are long.
- Could even store infinite bitstrings (e.g. real numbers).
A more efficient version of tries, but the operations get a bit more complicated.

Compressed Tries¶
Another (important!) variation:
- Compress paths of nodes with only one child.
- Each node stores an index, corresponding to the level of the node in the uncompressed trie. (On level \(d\), we searched for link with \(w[d]\).)

Compressed Tries: Search¶
- As for tries, follow the links that correspond to the current bits in \(w\).
- Main difference: stored indices say which bits to compare.
- Also: must compare \(w\) to word found at the leaf
**CompressedTrie::get-path-to(w)**
P ← empty stack; z ← root; P.push(z)
while z is not a leaf and (d ← z.index) ≤ w.size do
if (z has a child-link labelled with w[d]) then
z ← child at this link; P.push(z)
else break
return P
**CompressedTrie::search(w)**
P ← get-path-to(w), z ← P.top
if (z is not a leaf or word stored at z is not w) then
return "not found"
return key-value pair at z
prefix-search(w): Compare w to z.leaf at last visited node z.
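A Python sketch of compressed-trie search: internal nodes carry the stored bit index, leaves carry the full word for the final comparison. The node layout and the hand-built example trie for {00$, 01$, 1$} are illustrative assumptions.

```python
class CNode:
    def __init__(self, index=None, word=None):
        self.index = index    # which bit of w this node tests (internal nodes)
        self.word = word      # full stored word (leaves only)
        self.children = {}    # bit character -> child node

def search(root, w):
    z = root
    while z.word is None:                 # descend through internal nodes
        d = z.index                       # stored index says which bit to use
        if d >= len(w) or w[d] not in z.children:
            return False
        z = z.children[w[d]]
    return z.word == w                    # must compare w at the leaf

# Hand-built compressed trie for {"00$", "01$", "1$"}:
# the path for "1$" is compressed into a single leaf child of the root.
root = CNode(index=0)
branch = CNode(index=1)                   # tests bit 1 after a leading 0
root.children["0"] = branch
root.children["1"] = CNode(word="1$")
branch.children["0"] = CNode(word="00$")
branch.children["1"] = CNode(word="01$")

print(search(root, "01$"), search(root, "0$"))   # True False
```

The final leaf comparison is what rejects "0$": its prefix path exists, but the leaf it reaches stores a different word.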

Compressed Tries: Summary¶
- search(w) and prefix-search(w) are easy.
- insert(w) and delete(w) are conceptually simple:
- Search for path \(P\) to word w (say we reach node z)
- Uncompress this path (using characters of z.leaf)
- Insert/Delete w as in an uncompressed trie.
- Compress path from root to where change happened
- (Pseudocode gets more complicated and is omitted.)
- All operations take \(O(|w|)\) time for a word w.
- Compressed tries use \(O(n)\) space
- We have \(n\) leaves.
- Every internal node has two or more children.
- Can show: Therefore more leaves than internal nodes.
Overall, code is more complicated, but space-savings are worth it if words are unevenly distributed.
Multiway Tries: Larger Alphabet¶
- To represent strings over any fixed alphabet \(\Sigma\)
- Any node will have at most \(|\Sigma| + 1\) children (one extra child for the end-of-word character $)
- Example: A trie holding the strings {bear$, ben$, be$, soul$, soup$}
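A dictionary-based sketch of a multiway trie over this example set; storing children in a Python dict is one of the storage options weighed later, and the node layout is my own.

```python
class Node:
    def __init__(self):
        self.children = {}   # one entry per used character of Σ ∪ {'$'}
        self.word = None     # stored word, set only at leaves

def insert(root, w):
    z = root
    for c in w:                            # w includes the sentinel '$'
        z = z.children.setdefault(c, Node())
    z.word = w

def search(root, w):
    z = root
    for c in w:
        if c not in z.children:
            return False
        z = z.children[c]
    return z.word == w

root = Node()
for w in ["bear$", "ben$", "be$", "soul$", "soup$"]:
    insert(root, w)

print(search(root, "be$"), search(root, "soup$"), search(root, "bea$"))
# True True False
```

Note that "be$" and "bear$" coexist without conflict: the sentinel gives "be$" its own leaf below the node for prefix "be".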

Compressed Multiway Tries¶
- Variation: Compressed multi-way tries: compress paths as before
- Example: A compressed trie holding the strings {bear$, ben$, be$, soul$, soup$}

Multiway Tries: Summary¶
- Operations search(w), prefix-search(w), insert(w) and delete(w) are exactly as for tries of bitstrings.
- Run-time: \(O(|w|)\) times the time to find the appropriate child.
- Each node now has up to \(|\Sigma| + 1\) children. How should they be stored?

- Time/space tradeoff:
- Arrays are fast, lists are space-efficient.
- Dictionary best in theory, not worth it in practice unless \(|\Sigma|\) is huge.
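The first two options can be contrasted concretely; the node layouts below are illustrative (alphabet {a..z} plus '$'):

```python
class ArrayNode:
    """Array of children: O(1) child lookup, but always |Σ|+1 slots
    per node, even when most are empty."""
    SIGMA = 27                             # 26 letters + sentinel '$'

    def __init__(self):
        self.children = [None] * ArrayNode.SIGMA

    @staticmethod
    def slot(c):
        """Map a character to its fixed array slot ('$' gets slot 0)."""
        return 0 if c == '$' else ord(c) - ord('a') + 1

class ListNode:
    """List of (character, child) pairs: space proportional to the
    children actually present, but O(#children) lookup."""
    def __init__(self):
        self.children = []

    def child(self, c):
        for ch, node in self.children:     # linear scan over present children
            if ch == c:
                return node
        return None
```

An `ArrayNode` with a single child still occupies 27 slots, while a `ListNode` stores one pair; the array wins on lookup speed, the list on space.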