
Module 6: Dictionaries for special keys

Lower Bound

Dictionary ADT: Implementations thus far

Realizations we have seen so far:

  • Balanced Binary Search trees (AVL trees):
    • \(\Theta(\log n)\) search, insert, and delete (worst-case)
  • Skip lists:
    • \(\Theta(\log n)\) search, insert, and delete (expected)
  • Various other realizations:
    • Sometimes faster on insert, but search always takes \(\Omega(\log n)\) time.

Question:

Can one do better than \(\Theta(\log n)\) time for search?

Answer:

Yes and no! It depends on what we allow.

  • No: Comparison-based searching lower bound is \(\Omega(\log n)\).
  • Yes: Non-comparison-based searching can achieve \(o(\log n)\) (under restrictions!).

Theorem: Any comparison-based algorithm requires \(\Omega(\log n)\) comparisons in the worst case to search among \(n\) distinct items.

\[ \begin{align*} \#\text{outcomes} &\geq n+1 \\ \Rightarrow \text{height of decision tree} &\geq \log(n+1) \\ \Rightarrow \#\text{comparisons for some input} &\geq \log(n+1) \end{align*} \]

[Figure: decision tree for comparison-based search]

Interpolation Search

Code very similar to binary search, but compare at index

\[ \ell + \left\lfloor \frac{k - A[\ell]}{A[r] - A[\ell]} \times (r - \ell) \right\rfloor \]

\(k - A[\ell]\) is the distance from left key, \(A[r] - A[\ell]\) is the distance between left and right keys, and \((r - \ell)\) is the number of indices in range.

A few extra tests are needed to avoid a crash (division by zero or an out-of-range index) during the computation of \(m\).

**interpolation-search(A, k)**

l ← 0, r ← A.size − 1
while (l ≤ r)
    if (k < A[l] or k > A[r]) then return “not found”
    if (k = A[r]) then return “found at A[r]”
    m ← l + ⌊(k − A[l]) · (r − l) / (A[r] − A[l])⌋
    if (A[m] = k) then return “found at A[m]”
    else if (A[m] < k) then l ← m + 1
    else r ← m − 1
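
For concreteness, here is a minimal Python sketch of the same routine (a sketch, not the notes' code; the name `interpolation_search` and the convention of returning an index or None are illustrative):

```python
def interpolation_search(A, k):
    """Search the sorted list A for key k; return its index, or None if not present."""
    l, r = 0, len(A) - 1
    while l <= r:
        if k < A[l] or k > A[r]:
            return None
        if k == A[r]:
            return r
        # Interpolate where k should lie.  The two tests above guarantee
        # A[l] <= k < A[r], so the denominator is non-zero and l <= m < r.
        m = l + (k - A[l]) * (r - l) // (A[r] - A[l])
        if A[m] == k:
            return m
        elif A[m] < k:
            l = m + 1
        else:
            r = m - 1
    return None
```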

In the worst case, the algorithm can be very slow (\(\Theta(n)\) time, e.g. when the values are not uniformly distributed), but it works well on average.

  • Can show: \(T^\text{avg}(n) \leq T^\text{avg}(\sqrt{n}) + \Theta(1)\)
  • This resolves to \(T^\text{avg}(n) \in O(\log \log n)\)
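
A quick way to see the second step (a sketch; substitute \(n = 2^m\), i.e. \(m = \log n\)):

\[ T^\text{avg}(2^m) \leq T^\text{avg}(2^{m/2}) + \Theta(1) \quad\Longrightarrow\quad T^\text{avg}(2^m) \in O(\log m), \]

so \(T^\text{avg}(n) \in O(\log \log n)\).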

Tries

Words (review)

Scenario: the keys in the dictionary are words, so we briefly review the relevant terminology.

Words (= strings): sequences of characters over alphabet \(\Sigma\)

  • Typical alphabets: {0, 1} (→bitstrings), ASCII, {C, G, T, A}
  • Stored in an array: w[i] gets i-th character (for i = 0, 1...)
  • Convention: Words have end-sentinel $ (sometimes not shown)
    • w.size = |w| = # non-sentinel characters, e.g. |be$| = 2.

Should know:

  • Prefix, suffix, substring
  • Sort words lexicographically: be$ \(<_{lex}\) bear$ \(<_{lex}\) beer$
    • This is different from sorting numbers: 001$ \(<_{lex}\) 010$ \(<_{lex}\) 1$

Tries: Introduction

Trie (also known as radix tree): A dictionary for bitstrings.

  • Comes from retrieval, but pronounced "try"
  • A tree based on bitwise comparisons: Edge labelled with corresponding bit
  • Similar to radix sort: use individual bits, not the whole key
  • Due to end-sentinels, all key-value pairs are at leaves.

[Figure: example of a binary trie]

  • Follow links that correspond to current bits in w.
  • Repeat until no such link or w found at a leaf.

As with skip lists, we first compute the search path as a separate routine.

**Trie::get-path-to(w)**
Output: Stack with all ancestors of where w would be stored.

P ← empty stack; z ← root; d ← 0; P.push(z)
while d ≤ |w|
    if z has a child-link labelled with w[d]
        z ← child at this link; d++; P.push(z)
    else break
return P
**Trie::search(w)**

P ← get-path-to(w), z ← P.top
if (z is not a leaf) then
    return “not found, would be in sub-trie of z”
return key-value pair at z
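
A minimal Python sketch of this structure, assuming each word is a string that already ends with the sentinel '$'; the class and attribute names (TrieNode, children, word, value, leaf) are illustrative, not fixed by the notes:

```python
class TrieNode:
    """Node of an (uncompressed) binary trie.
    Child links are keyed by '0', '1', or the end-sentinel '$'."""
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.word = None     # at a leaf: the stored word (including '$')
        self.value = None    # at a leaf: the associated value
        self.leaf = None     # some leaf in this subtree (used for prefix-search below)

def get_path_to(root, w):
    """Return the list of ancestors (root first) of the node where w would be stored."""
    path, z, d = [root], root, 0
    while d < len(w):
        if w[d] in z.children:
            z = z.children[w[d]]
            path.append(z)
            d += 1
        else:
            break
    return path

def search(root, w):
    """Return the value stored for w, or None if w is not in the trie."""
    z = get_path_to(root, w)[-1]
    if z.children or z.word != w:   # stopped at an internal node, or the trie is empty
        return None
    return z.value
```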

Tries: Leaf-references

For later applications of tries, another search operation is essential:

  • prefix-search(w): Find word w' in trie for which w is a prefix.

Features:

  • Testing: Checking whether such a w' exists is straightforward: traverse down the trie and check whether the path for the prefix w exists.
  • Leaf-references: To efficiently find w', nodes store references to leaves:
    • Each node z stores a reference z.leaf to a leaf in its subtree.
    • Convention: Store the leaf with the longest word to ensure all potential matches are covered.

[Figure: binary trie with leaf-references]

**Trie::prefix-search(w)**

P ← get-path-to(w), p ← P.top()
if number of nodes on P is w.size or less
    return “no extension of w found”
return p.leaf

We need more than w.size nodes on P to have an extension (the root is also on the stack P, so matching all of w gives w.size + 1 nodes).
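
Continuing the sketch above, prefix-search needs only the number of nodes on the path and the leaf-reference at its top (here the prefix w is passed without the sentinel; the tuple return value is an illustrative choice):

```python
def prefix_search(root, w):
    """Return (word, value) of some stored word that has w as a prefix, or None."""
    path = get_path_to(root, w)      # w has no '$', so at most len(w) links are followed
    if len(path) <= len(w) or path[-1].leaf is None:
        return None                  # "no extension of w found"
    z = path[-1].leaf                # any leaf below this node is an extension of w
    return (z.word, z.value)
```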

Tries: Insert

Trie::insert(w)

  • \(P \leftarrow\) get-path-to(w) gives the ancestors that already exist.
  • Expand the trie from P.top() by adding necessary nodes that correspond to extra bits of w.
  • Update leaf-references (also on \(P\), if w is longer than the word at the current leaf-reference).
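
A sketch of insert in the same Python setting (w again includes the trailing '$'); the longest-word convention for leaf-references is maintained by comparing the words stored at the leaves:

```python
def insert(root, w, value):
    """Insert word w (ending with '$') with its value; overwrites an existing entry."""
    path = get_path_to(root, w)
    z, d = path[-1], len(path) - 1     # the first d characters of w already have nodes
    while d < len(w):                  # expand the trie with the remaining characters of w
        child = TrieNode()
        z.children[w[d]] = child
        z, d = child, d + 1
        path.append(z)
    z.word, z.value = w, value         # z is now the leaf storing w
    for node in path:                  # update leaf-references (longest-word convention)
        if node.leaf is None or len(node.leaf.word) < len(w):
            node.leaf = z

# Example: build a tiny trie and query it.
root = TrieNode()
for word, val in [("0$", "A"), ("011$", "B"), ("0111$", "C")]:
    insert(root, word, val)
assert search(root, "011$") == "B"
assert prefix_search(root, "01") == ("0111$", "C")   # leaf-reference keeps the longest word
```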

Tries: Delete

Trie::delete(w)

  • \(P \leftarrow\) get-path-to(w) gives all ancestors.
  • Let \(\ell\) be the leaf where w is stored.
  • Delete \(\ell\) and nodes on \(P\) until an ancestor has two or more children.
  • Update leaf-references on the rest of \(P\). (If \(z \in P\) referred to \(\ell\), find a new z.leaf from the other children.)
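
A sketch of delete in the same setting; `pick_leaf_from_children` is a hypothetical helper name, not something from the notes:

```python
def delete(root, w):
    """Remove word w (ending with '$'); return its value, or None if it was not stored."""
    path = get_path_to(root, w)
    leaf = path[-1]
    if leaf.children or leaf.word != w:        # w is not in the trie
        return None
    value = leaf.value
    i = len(path) - 1
    while i > 0 and not path[i].children:      # remove nodes that are left without children
        del path[i - 1].children[w[i - 1]]     # edge from path[i-1] to path[i] is labelled w[i-1]
        i -= 1
    for node in reversed(path[:i + 1]):        # fix leaf-references bottom-up on the rest of P
        if node.leaf is leaf:
            node.leaf = pick_leaf_from_children(node)
    return value

def pick_leaf_from_children(z):
    """Choose a new leaf-reference for z: the longest word among the children's leaves."""
    refs = [c.leaf for c in z.children.values() if c.leaf is not None]
    return max(refs, key=lambda l: len(l.word)) if refs else None
```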

Binary Tries summary

search(w), prefix-search(w), insert(w), delete(w) all take time \(\Theta(|w|)\).

  • Search-time is independent of number \(n\) of words stored in the trie!
  • Search-time is small for short words.

The trie for a given set of words is unique (except for order of children and ties among leaf-references)

Disadvantages:

  • Tries can be wasteful with respect to space.
  • Worst-case space is \(\Theta(n · (\text{maximum length of a word}))\)

[Figure: example trie illustrating worst-case space usage]

Variations of Tries: Pruned Tries

Pruned Trie:

Stop adding nodes to trie as soon as the key is unique.

  • A node has a child only if it has at least two descendants.
  • Saves space if only a few of the bitstrings are long.
  • Could even store infinite bitstrings (e.g. real numbers).

A more efficient version of tries, but the operations get a bit more complicated.

[Figure: example of a pruned trie]

Compressed Tries

Another (important!) variation:

  • Compress paths of nodes with only one child.
  • Each node stores an index, corresponding to the level of the node in the uncompressed trie. (On level \(d\), we searched for link with \(w[d]\).)

[Figure: example of a compressed trie with node indices]

  • As for tries, follow the links that correspond to the current bits of \(w\).
  • Main difference: stored indices say which bits to compare.
  • Also: must compare \(w\) to word found at the leaf
**CompressedTrie::get-path-to(w)**

P ← empty stack; z ← root; P.push(z)
while z is not a leaf and (d ← z.index) ≤ w.size do
    if (z has a child-link labelled with w[d]) then
        z ← child at this link; P.push(z)
    else break
return P
**CompressedTrie::search(w)**

P ← get-path-to(w), z ← P.top
if (z is not a leaf or word stored at z is not w) then
    return "not found"
return key-value pair at z

prefix-search(w): Compare w to z.leaf at last visited node z.
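
A minimal Python sketch of compressed-trie search (CTrieNode and its attributes are illustrative; words again end with '$'):

```python
class CTrieNode:
    """Node of a compressed binary trie."""
    def __init__(self, index=0):
        self.index = index    # internal node: which character of w the children branch on
        self.children = {}    # character -> CTrieNode
        self.word = None      # at a leaf: the full stored word (including '$')
        self.value = None     # at a leaf: the associated value
        self.leaf = None      # some leaf in this subtree (for prefix-search)

def cget_path_to(root, w):
    """Follow child links chosen by the stored indices; return the list of visited nodes."""
    path, z = [root], root
    while z.children and z.index < len(w) and w[z.index] in z.children:
        z = z.children[w[z.index]]
        path.append(z)
    return path

def csearch(root, w):
    """Return the value stored for w, or None.  Because skipped characters were never
    compared, the word stored at the leaf must be checked explicitly."""
    z = cget_path_to(root, w)[-1]
    return z.value if not z.children and z.word == w else None
```

For prefix-search, the analogous final check compares w against the word at the last visited node's leaf-reference, as stated above.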

[Figure: prefix-search in a compressed trie]

Compressed Tries: Summary

  • search(w) and prefix-search(w) are easy.
  • insert(w) and delete(w) are conceptually simple:
    • Search for path \(P\) to word w (say we reach node z)
    • Uncompress this path (using characters of z.leaf)
    • Insert/Delete w as in an uncompressed trie.
    • Compress path from root to where change happened
    • (Pseudocode gets more complicated and is omitted.)
  • All operations take \(O(|w|)\) time for a word w.
  • Compressed tries use \(O(n)\) space
    • We have \(n\) leaves.
    • Every internal node has two or more children.
    • Can show: Therefore more leaves than internal nodes.

Overall, code is more complicated, but space-savings are worth it if words are unevenly distributed.

Multiway Tries: Larger Alphabet

  • To represent strings over any fixed alphabet \(\Sigma\)
  • Any node will have at most \(|\Sigma| + 1\) children (one child for the end-of-word character $)
  • Example: A trie holding strings {bear$, ben$, be$, soul$, soup$}

[Figure: multiway trie holding {bear$, ben$, be$, soul$, soup$}]

Compressed Multiway Tries

  • Variation: Compressed multi-way tries: compress paths as before
  • Example: A compressed trie holding strings {bear$, ben$, be$, soul$, soup$}

[Figure: compressed multiway trie holding {bear$, ben$, be$, soul$, soup$}]

Multiway Tries: Summary

  • Operations search(w), prefix-search(w), insert(w) and delete(w) are exactly as for tries for bitstrings.
  • Run-time \(O(|w| \cdot (\text{time to find the appropriate child}))\)
  • Each node now has up to \(|\Sigma| + 1\) children. How should they be stored?

    [Figure: options for storing the children of a multiway-trie node]

  • Time/space tradeoff:

    • Arrays are fast, lists are space-efficient.
    • Dictionary best in theory, not worth it in practice unless \(|\Sigma|\) is huge.
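
As a small illustration of the trade-off, here is a sketch of a multiway-trie node that keeps its children in a hash map (a Python dict); an array of size \(|\Sigma| + 1\) indexed by character, or a list of (character, child) pairs, would be the other options mentioned above:

```python
class MultiwayTrieNode:
    """Node of a multiway trie over an arbitrary alphabet plus the end-of-word character '$'."""
    def __init__(self):
        self.children = {}   # character -> MultiwayTrieNode (hash map: fast expected lookup)
        self.value = None    # stored only at '$'-leaves

def multiway_insert(root, w, value):
    """Insert word w (ending with '$'); leaf-references are omitted to keep the sketch short."""
    z = root
    for c in w:
        z = z.children.setdefault(c, MultiwayTrieNode())
    z.value = value
```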