Final Review Quick Notes
- Binary Search Tree:
- `BST::search`, `BST::insert`, `BST::delete` all have cost \(\Theta(h)\), where \(h\) is the height of the tree (the maximum path length from the root to a leaf).
- AVL Tree: an AVL tree is a BST with an additional height-balance property at every node: the heights of the left and right subtrees differ by at most 1.
- An AVL tree on \(n\) nodes has \(\Theta(\log n)\) height.
- `insert`, `search`, `delete` all cost \(\Theta(\log n)\) time in the worst case.
- For `insert`, `restructure` will be called at most once.
- For `delete`, break ties by avoiding a double rotation (when both subtrees have the same height); `restructure` may be called \(\Theta(\text{height})\) times.
- Skip lists: \(O(n)\) expected space, all operations take \(O(\log n)\) expected time.
- `search` always takes \(\Omega(\log n)\) time (comparison-based searching lower bound).
- Interpolation Search: \(T^{avg}(n) \in O(\log \log n)\); the worst case can be \(\Theta(n)\). It is similar to binary search (see the sketch after the formula) but with
\[
m=\ell + \left\lfloor \frac{k - A[\ell]}{A[r] - A[\ell]} (r - \ell) \right\rfloor
\]
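A minimal sketch of interpolation search using this probe formula, assuming a sorted array of distinct numbers (function and variable names are illustrative):

```python
def interpolation_search(A, k):
    """Return an index i with A[i] == k, or None.
    O(log log n) on average for uniformly distributed keys, Theta(n) worst case."""
    l, r = 0, len(A) - 1
    while l <= r and A[l] <= k <= A[r]:
        if A[l] == A[r]:                  # single value left; avoid dividing by 0
            return l if A[l] == k else None
        # probe position interpolated from the key values, not the midpoint
        m = l + (k - A[l]) * (r - l) // (A[r] - A[l])
        if A[m] == k:
            return m
        elif A[m] < k:
            l = m + 1
        else:
            r = m - 1
    return None

assert interpolation_search([2, 5, 8, 13, 21, 34], 13) == 3
```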
- Trie: due to the end-sentinel `$`, all key-value pairs are at leaves.
- Prefix-Search: we need more than `w.size` nodes on \(P\) to have an extension.
- Binary Trie:
- `search(w)`, `prefix-search(w)`, `insert(w)`, `delete(w)` all take time \(\Theta(|w|)\) (see the sketch below).
- Search time is independent of the number \(n\) of words stored in the trie.
- The worst-case space is \(\Theta(n \cdot \text{(maximum length of a word)})\)
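A minimal dict-based sketch of trie `insert`/`search` with the end-sentinel `$`; the node representation is an assumption for illustration, not a prescribed layout:

```python
def trie_insert(root, w, value):
    """Insert word w with its value; Theta(|w|) node visits."""
    node = root
    for c in w:                      # walk/create one node per character
        node = node.setdefault(c, {})
    node['$'] = value                # end-sentinel: the KVP sits at a leaf

def trie_search(root, w):
    """Return the value stored for w, or None; Theta(|w|) node visits."""
    node = root
    for c in w:
        if c not in node:
            return None
        node = node[c]
    return node.get('$')

root = {}
trie_insert(root, "0110", 42)
assert trie_search(root, "0110") == 42
assert trie_search(root, "011") is None   # "011" itself was never inserted
```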
- Pruned Trie: stop adding nodes to the trie as soon as the key is unique.
- Compressed Tries: compress paths of nodes with only one child. Each node stores an index, corresponding to the level of the node in the uncompressed trie.
- For `search`, we must compare \(w\) to the word found at the leaf.
- All operations take \(O(|w|)\) time for a word \(w\).
- Use \(O(n)\) space.
- More leaves than internal nodes.
- Multiway Tries: run-time \(O(|w| \cdot (\text{time to find the appropriate child}))\)
- Hashing:
- Direct Addressing:
- `search(k)`, `insert(k, v)`, `delete(k)` take \(\Theta(1)\) time; total space is \(\Theta(M)\).
- Hashing with Chaining:
- `insert` takes time \(O(1)\); `search` and `delete` have run-time \(O(1 + \text{length of list at } T[h(k)])\).
- Expected cost of `search` and `delete` is \(\Theta(1+\alpha)\), where \(\alpha = \frac{n}{M}\).
- Keep the load factor \(\alpha\) small via rehashing, which costs \(\Theta(M+n)\) time plus the time to find a new hash function.
- The amortized expected cost for hashing with chaining is \(O(1)\) and the space is \(\Theta(n)\) (assuming uniform hashing and \(\alpha \in O(1)\) throughout).
- Linear Probing: \(h(k,j) = (h(k)+j) \text{ mod } M\), for some hash function \(h\).
- When doing `delete`, mark the spot as `deleted` (rather than NULL); `search` continues past deleted spots, and `insert` reuses deleted spots (see the sketch below).
- Choose \(M\) prime.
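A minimal linear-probing sketch with the `deleted`-marker behaviour described above; the class name, fixed table size, and use of Python's built-in `hash` are assumptions, and overflow would trigger rehashing in practice:

```python
DELETED = object()                         # tombstone marker

class LinearProbingTable:
    def __init__(self, M=11):              # M prime, as suggested above
        self.M = M
        self.slots = [None] * M            # None = never used

    def _probe(self, k):                   # h(k, j) = (h(k) + j) mod M
        h = hash(k) % self.M
        for j in range(self.M):
            yield (h + j) % self.M

    def insert(self, k, v):                # sketch assumes k is not present yet
        for i in self._probe(k):
            if self.slots[i] is None or self.slots[i] is DELETED:
                self.slots[i] = (k, v)     # insert reuses deleted spots
                return
        raise RuntimeError("table full; would rehash in practice")

    def search(self, k):
        for i in self._probe(k):
            if self.slots[i] is None:      # stop only at never-used spots
                return None
            if self.slots[i] is not DELETED and self.slots[i][0] == k:
                return self.slots[i][1]
        return None

    def delete(self, k):
        for i in self._probe(k):
            if self.slots[i] is None:
                return                     # not present
            if self.slots[i] is not DELETED and self.slots[i][0] == k:
                self.slots[i] = DELETED    # mark as deleted, not None
                return
```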
- Double hashing: open addressing with probe sequence
\[ h(k,j)=(h_0(k)+j \cdot h_1(k)) \text{ mod } M \]
- The second hash function is usually chosen as \(h_1(k)= \lfloor M(kA-\lfloor kA \rfloor)\rfloor\).
- Cuckoo Hashing:
- If insertion is possible, then there are at most \(2n\) evictions.
- Expected number of evictions during insert is \(O(1)\).
- In practice, stop evictions much earlier than \(2n\) rounds.
- This requires \(\alpha < \frac12\)
- Expected space is \(O(n)\).
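A minimal sketch of the cuckoo insertion loop with two tables and a capped number of evictions; the multiplicative hash functions with random salts and numeric keys are assumptions for illustration, and a failed insert would trigger rehashing in practice:

```python
import random

class CuckooTable:
    def __init__(self, M=17):
        self.M = M
        self.T = [[None] * M, [None] * M]            # two tables T0, T1
        self.A = [random.random(), random.random()]  # salts for two hash functions

    def _h(self, t, k):                    # multiplicative method; k numeric >= 0
        return int(self.M * ((k * self.A[t]) % 1.0))

    def insert(self, k, v):
        item, t = (k, v), 0
        for _ in range(2 * self.M):        # stop evictions after ~2n rounds
            i = self._h(t, item[0])
            if self.T[t][i] is None:
                self.T[t][i] = item
                return True
            item, self.T[t][i] = self.T[t][i], item  # evict current occupant
            t = 1 - t                      # evicted item tries the other table
        return False                       # give up; rehash in practice

    def search(self, k):                   # O(1): only two possible spots
        for t in (0, 1):
            entry = self.T[t][self._h(t, k)]
            if entry is not None and entry[0] == k:
                return entry[1]
        return None
```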
- Modular method: \(h(k)=k \text{ mod } M\), where \(M\) should be a prime.

- Range Search:
- Unsorted list/array/hash table: range search requires \(\Omega(n)\) time.
- Sorted array: range search can be done in \(O(\log n+s)\) time, where \(s\) is the output-size, that is, the number of items in the range.
- Quadtrees:
- Use the smallest \(k\) such that the max-coordinate in \(P\) is \(< 2^k\).
- Points on split lines belong to right/top side.
- When doing `insert`, split the leaf while there are two points in one region.
- When doing `delete`, remove the point; if its parent has only one point left, delete the parent (recursively for all ancestors that have only one point left).
- Assume that each node of the quadtree stores the associated square.
- Spread factor: \(\frac{\text{sidelength of } R}{\text{minimum distance between points in } P}\), the height \(h\) of quadtree is in \(\Theta(\log(\text{spread factor}))\).
- Complexity to build initial tree: \(\Theta(nh)\) worst case.
- Complexity of range search: \(\Theta(nh)\) worst case, even if the answer is the empty set.
- Space potentially wasteful, but good if points are well-distributed
- kd-trees:
- kd-tree idea: Split the region at upper median of coordinates. Continue splitting, switching between vertical and horizontal lines, until every point is in a separate region
- Constructing kd-trees: we can find the median \(X\) and partition \(P\) in \(\Theta(n)\) expected time, so both subtrees have \(\approx n/2\) points. The following recurrence resolves to \(\Theta(n\log n)\) expected time; this can be reduced to \(\Theta(n\log n)\) worst-case time by pre-sorting (see the sketch below).
\[ T^{exp}(n) = 2\,T^{exp}(n/2)+O(n) \]
- The height of the tree satisfies \(h(1)=0\), \(h(n) \leq h(\lceil n/2 \rceil)+1\), which resolves to \(O(\log n)\). The space is \(\Theta(n)\).
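A minimal 2d construction sketch matching the recurrence above, assuming distinct coordinates; it sorts at each level for simplicity (median selection would give the stated bounds), and the tuple node layout is illustrative:

```python
def build_kdtree(points, depth=0):
    """Nodes are (axis, split_point, left, right); a lone point is a leaf."""
    if len(points) <= 1:
        return points[0] if points else None
    axis = depth % 2                             # alternate x (0) and y (1)
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                          # upper median
    return (axis, pts[mid],
            build_kdtree(pts[:mid], depth + 1),  # strictly below the median
            build_kdtree(pts[mid:], depth + 1))  # median and above (right/top side)

tree = build_kdtree([(7, 2), (5, 4), (9, 6), (4, 7), (8, 1), (2, 3)])
```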
- kd-trees do not handle `insert`/`delete` well, due to the need to occasionally rebuild entire subtrees (otherwise range search becomes slower).
- The complexity of range search in kd-trees is \(O(s+\sqrt{n})\).
- For any dimension \(d\): storage \(O(n)\), height \(O(\log n)\), construction time \(O(n\log n)\), range search time \(O(s+n^{1-1/d})\).
- Range trees: wasteful in space, but permit much faster range search
- The primary tree \(T\) (a balanced binary search tree) uses \(O(n)\) space. A point \(z\) is in the associated tree \(T_{ass}(u)\) if and only if \(u\) is an ancestor of \(z\) in \(T\).
- Every point belongs to \(O(\log n)\) associated trees, so all associated trees together use \(O(n\log n)\) space.
- A range tree with \(n\) points uses \(O(n\log n)\) space.
- We want the binary search trees to be balanced, but this makes `insert`/`delete` expensive if we use AVL trees (a rotation forces rebuilding associated trees). Instead, completely rebuild highly unbalanced subtrees occasionally.
- The run-time for `insert`/`delete` then becomes \(O(\log^2 n)\) amortized.
- Assuming the binary search tree is balanced, searching for the paths \(P_1\) and \(P_2\) takes \(O(\log n)\) time, so there are \(O(\log n)\) boundary nodes, and we spend \(O(1)\) time on each. 1d range search therefore takes \(O(\log n +s)\).
- 2d range search run-time: \(O(\log n)\) time to find the boundary and topmost inside nodes in the primary tree. There are \(O(\log n)\) such nodes, and we spend \(O(\log n + s_v)\) time on each topmost inside node \(v\), where \(s_v\) is the number of points in \(T_{\text{ass}}(v)\) that are reported. Two topmost inside nodes have no common point in their trees, so every point is reported in at most one associated structure; hence \(\sum_{v \text{ topmost inside}} s_v \leq s\). The time for range search in a range tree is proportional to
\[ \sum_{v \text{ topmost inside}} (\log n + s_v) \in O(\log^2 n + s) \]
- For \(d\)-dimensional space: space \(O(n(\log n)^{d-1})\) [kd-trees: \(O(n)\)], construction time \(O(n(\log n)^{d})\) [kd-trees: \(O(n\log n)\)], range search time \(O(s+(\log n)^d)\) [kd-trees: \(O(s+n^{1-1/d})\)].
- Pattern Matching:
- Brute-force algorithm: check every possible guess.
- Worst-case performance \(\Theta((n-m+1)m)\); if \(m \leq n/2\), this is \(\Theta(mn)\).
- KMP Algorithm: shifts are determined by the largest prefix of \(P[0..j]\) that is a suffix of \(P[1..j]\).
- Failure array: preprocess the pattern to find matches of prefixes of the pattern with the pattern itself. The failure array \(F\) has size \(m\): \(F[j]\) is defined as the length of the largest prefix of \(P[0..j]\) that is also a suffix of \(P[1..j]\). If a mismatch occurs at \(P[j] \neq T[i]\), we set \(j \leftarrow F[j-1]\) (see the sketch below).
- Running time of failure array: \(\Theta(m)\)
- Running time of KMP: \(\Theta(n)\)
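A sketch of the failure-array computation and the KMP scan it drives, following the definitions above (names are illustrative):

```python
def failure_array(P):
    """F[j] = length of the longest prefix of P[0..j] that is also a
    suffix of P[1..j]; Theta(m) total time."""
    m = len(P)
    F = [0] * m
    j = 0                         # length of currently matched prefix
    for i in range(1, m):
        while j > 0 and P[i] != P[j]:
            j = F[j - 1]          # fall back, same rule as in the main search
        if P[i] == P[j]:
            j += 1
        F[i] = j
    return F

def kmp_search(T, P):
    """Return the first index where P occurs in T, or -1; Theta(n) time."""
    F, j = failure_array(P), 0
    for i, c in enumerate(T):
        while j > 0 and c != P[j]:
            j = F[j - 1]
        if c == P[j]:
            j += 1
        if j == len(P):
            return i - len(P) + 1
    return -1

assert kmp_search("abababca", "ababca") == -1
assert kmp_search("ababcabab", "abcab") == 2
```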
- Boyer-Moore Algorithm: compare \(P\) with a subsequence of \(T\) moving backwards
- The last-occurrence function \(L\) can be computed in time \(O(m+|\Sigma|)\) (used for Bad character)
- The suffix skip array \(S\) can be computed in \(\Theta(m)\) time (used for Good suffix)
- When \(T[i] \neq P[j]\), \(i \leftarrow i+m-1-\min(L[T[i]], S[j])\)
- Boyer-moore algorithm has worst case running time \(\in O(n+|\Sigma|)\)
- In a Boyer-Moore trace, a letter gets "( )" brackets if it matched the pattern letter in the previous line and that match was part of a suffix of the pattern (good-suffix rule). If the previously mismatched letter in the text is the same as the one in the current cell and that cell holds its last occurrence in the pattern, we put "[ ]" around the letter, since it is known to match based on the last-occurrence array. (In both cases the letter must match the text.)
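A sketch of the last-occurrence preprocessing plus a simplified Boyer-Moore scan that uses only the bad-character rule, i.e. \(j-1\) in place of \(S[j]\) in the shift above (the full algorithm also uses the suffix skip array); it assumes \(T\) and \(P\) use only characters from `alphabet`:

```python
def last_occurrence(P, alphabet):
    """L[c] = largest index i with P[i] == c, else -1; O(m + |Sigma|)."""
    L = {c: -1 for c in alphabet}
    for i, c in enumerate(P):
        L[c] = i
    return L

def bm_bad_character(T, P, alphabet):
    """Return the first match position of P in T, or -1."""
    L, n, m = last_occurrence(P, alphabet), len(T), len(P)
    i = m - 1                                # end-of-pattern position in T
    while i < n:
        j = m - 1
        while j >= 0 and T[i] == P[j]:       # compare backwards
            i, j = i - 1, j - 1
        if j < 0:
            return i + 1                     # full match found
        i += m - 1 - min(L[T[i]], j - 1)     # bad-character shift
    return -1

text = "where is waldo"
assert bm_bad_character(text, "waldo", set(text)) == 9
```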
- Suffix Tree: compressed trie of suffixes
- There is a way to build a suffix tree of \(T\) in \(\Theta(n)\) time, but simple constructions take \(O(n^2)\).
- In the suffix tree, search for \(P\) until one of the following occurs:
- If the search fails due to "no such child," then \(P\) is not in \(T\).
- If we reach the end of \(P\), say at node \(v\), then jump to leaf \(\ell\) in the subtree of \(v\).
- (Suffix trees typically store such shortcuts.)
- Else we reach a leaf \(\ell = v\) while characters of \(P\) are left.
- Either way, the left index of \(\ell\) tells us where to begin checking whether the pattern matches the first \(|P|\) characters of that suffix.
- String matching takes \(O(|P|)\) time
- Conclusion of Pattern Matching (summarizing the bounds above):

| Algorithm | Preprocessing | Search time |
| --- | --- | --- |
| Brute-force | none | \(\Theta((n-m+1)m)\) worst case |
| KMP | \(\Theta(m)\) | \(\Theta(n)\) |
| Boyer-Moore | \(O(m+\vert\Sigma\vert)\) | \(O(n+\vert\Sigma\vert)\) worst case |
| Suffix tree | \(O(n^2)\) (\(\Theta(n)\) possible) | \(O(\vert P\vert)\) |

- Compression Ratio:
\[
\frac{|C|\cdot \log|\Sigma_C|}{|S|\cdot \log|\Sigma_S|}
\]
- No lossless encoding scheme can have compression ratio \(< 1\) for all input strings.
- A character encoding (or single-character encoding) maps each character in the source alphabet to a string in code alphabet. Fixed-length codes do not compress.
- Prefix-free codes \(E\): no codeword is a prefix of another. This corresponds to a trie with characters of \(\Sigma_S\) only at the leaves. Any prefix-free code is uniquely decodable.
- The runtime of `prefixFree::decoding` is \(O(|C|)\). The runtime of `prefixFree::encoding` is \(O(|\Sigma_S|+|C|)\).
- Huffman's Algorithm: always pair up the two least frequent characters (see the sketch below).
- Runtime of encoding: \(O(|\Sigma_S|\log |\Sigma_S|+ |C|)\); decoding: \(O(|C|)\).
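A heap-based sketch of Huffman tree construction and code extraction; the tuple-based tree representation and names are illustrative assumptions:

```python
import heapq

def huffman_codes(freq):
    """Build a prefix-free code from {char: frequency} by repeatedly
    pairing up the two least frequent subtrees; O(|Sigma| log |Sigma|)."""
    # heap entries: (frequency, tiebreak, tree); a leaf is (char,), an
    # internal node is (left_tree, right_tree)
    heap = [(f, i, (c,)) for i, (c, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)      # least frequent
        f2, _, t2 = heapq.heappop(heap)      # second least frequent
        heapq.heappush(heap, (f1 + f2, tiebreak, (t1, t2)))
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):
        if len(tree) == 1:                   # leaf: (char,)
            codes[tree[0]] = prefix or "0"   # lone-character edge case
        else:
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

# frequent characters get short codewords, e.g. 'a' -> '1' here
print(huffman_codes({'a': 5, 'b': 2, 'c': 1, 'd': 1}))
```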
- Run-Length Encoding: input must be a bitstring (\(\{0,1\}\)); see the sketch after this block.
- Send the leftmost bit of \(S\) (either \(0\) or \(1\))
- Then send a sequence of integers indicating run lengths.
- We don’t have to give the bit for runs since they alternate.
- Base-2 encoding: \(E_{b2}(k) :=\) the string \(w\) such that \((w)_2 = k\)
- Bijective base-2 encoding: \(E_{bb2}(k) := E_{b2}(k+1)\) with the leading \(1\) removed (\(A\) represents 0 and \(B\) represents 1).
- Elias gamma code: to encode \(k\):
- Take the binary representation \(E_{b2}(k)\) of \(k\).
- Add leading zeros so that the initial \(1\) is in the middle.
- This means adding \(\lfloor \log k \rfloor\) leading zeros.
- An all-\(0\) string of length \(n\) would be compressed to \(0E_{\gamma}(n)\), which has length \(2 \lfloor \log n \rfloor+2 \in o(n)\). (The \(+2\) consists of the middle bit and the indicator bit for the bit of the first run.)
- No compression usually until run-length \(k \geq 6\)
- Expansion when \(k =2\) or \(k=4\)
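A sketch of Elias gamma coding and the run-length encoder described above (names are illustrative; decoding is symmetric):

```python
def elias_gamma(k):
    """Gamma code of k >= 1: floor(log k) zeros, then the binary of k."""
    b = bin(k)[2:]                  # binary representation, leading 1 included
    return "0" * (len(b) - 1) + b   # len(b) - 1 == floor(log2 k)

def rle_encode(S):
    """First bit of S, then gamma codes of the run lengths (runs alternate)."""
    out, i = [S[0]], 0
    while i < len(S):
        j = i
        while j < len(S) and S[j] == S[i]:
            j += 1                  # scan to the end of the current run
        out.append(elias_gamma(j - i))
        i = j
    return "".join(out)

assert elias_gamma(5) == "00101"
print(rle_encode("11111110010000000000000000000011"))
```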
- Lempel-Ziv-Welch:
- Decoding corner case: if the received code number is the one about to be added to \(D\), then it encodes "previous string" + "first character of previous string".
- Decoding and encoding runtime: \(O(|S|)\) (assuming a constant-size alphabet); see the encoding sketch below.
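A minimal LZW encoding sketch over a small illustrative alphabet; the repeated string slicing makes this version quadratic, whereas a trie-based dictionary achieves the \(O(|S|)\) bound above:

```python
def lzw_encode(S, alphabet="ABC"):
    """Dictionary starts with single characters; after each emitted code,
    'matched string + next character' becomes a new entry."""
    D = {c: i for i, c in enumerate(alphabet)}
    out, i = [], 0
    while i < len(S):
        j = i + 1
        while j <= len(S) and S[i:j] in D:   # longest dictionary match at i
            j += 1
        out.append(D[S[i:j - 1]])
        if j <= len(S):
            D[S[i:j]] = len(D)               # new entry: match + next char
        i = j - 1
    return out

# codes 0..2 are A..C; code 3 is the new entry "AB"
assert lzw_encode("ABABAB") == [0, 1, 3, 3]
```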
- Burrows-Wheeler Transform: The Burrows-Wheeler transform consists of the last characters of the cyclic shifts of \(S\) after sorting them lexicographically.
- Steps:
- Write down cyclic shifts - \(i^{th}\) cyclic shift: move \(i\) characters from front to back
- Observe: every column contains a permutation of \(S\).
- Sort lexicographically - use MSD-radix-sort; \(\Theta(n)\) strings of length \(\Theta(n)\) → \(\Theta(n^2)\) worst-case time.
- Observe: every column continues to contain a permutation of \(S\).
- Extract rightmost column
- BWT encoding cost: \(O(n\log n)\)
- BWT decoding cost: \(O(n+|\Sigma_S|)\)
- Encoding and decoding both use \(O(n)\) space
- BWT (hence bzip2) is a block compression method that compresses one block at a time.
- bzip2 encoding cost: \(O(n(\log n + |\Sigma|))\) with a big constant
- bzip2 decoding cost: \(O(n|\Sigma|)\) with a big constant.
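A naive sketch of the BWT steps above, plus decoding via the stable first-column/last-column correspondence; it assumes a unique, lexicographically smallest sentinel `$` at the end of \(S\):

```python
def bwt_encode(S):
    """Sort all cyclic shifts, extract the rightmost column (naive)."""
    shifts = sorted(S[i:] + S[:i] for i in range(len(S)))
    return "".join(row[-1] for row in shifts)

def bwt_decode(C):
    n = len(C)
    F = sorted(C)                                 # first column of sorted shifts
    order = sorted(range(n), key=lambda i: C[i])  # stable: F[i] == C[order[i]]
    S, row = [], 0                                # row 0 starts with '$'
    for _ in range(n):
        row = order[row]      # follow the correspondence to the next character
        S.append(F[row])
    return "".join(S)

assert bwt_encode("aba$") == "ab$a"
assert bwt_decode(bwt_encode("aba$")) == "aba$"
```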
- Compression summary

- a-b-trees:
- A \(d\)-node stores \(d\) keys, has \(d+1\) subtrees, and stored keys are between the keys in the subtrees. We always have one more subtree than keys (but subtrees may be empty).
- An a-b-tree (for some \(b \geq 3\) and \(2 \leq a \leq \lceil \frac{b}{2} \rceil\)) satisfies:
- Non-root Nodes:
- Each is a \(d\)-node where \(a-1 \leq d \leq b-1\).
- Has \(a\) to \(b\) subtrees and \(a-1\) to \(b-1\) keys.
- Root Node:
- Is a \(d\)-node where \(1 \leq d \leq b-1\).
- Has \(2\) to \(b\) subtrees and \(1\) to \(b-1\) keys.
- Empty Subtrees:
- All empty subtrees are at the same level.
- An a-b-tree with \(n\) keys has \(O(\log_a(n))\) height.
- Total non-root nodes:
\[ \# \text{ non-root nodes} \geq \sum_{i=1}^h 2a^{i-1} = 2 \sum_{j=0}^{h-1} a^j = 2 \cdot \frac{a^h - 1}{a - 1} \]
- The number of KVPs:
\[ n = \#\text{ KVPs} \geq 1 + (a-1) \cdot \frac{2(a^h - 1)}{a - 1} = 2a^h - 1 \]
- Therefore \(h \leq \log_a\left(\frac{n+1}{2}\right)\).
- During a `search` operation, the number of visited nodes is \(O(\log_a n)\) (one per level).
- `insert`: do `abTree::search` and add the key and an empty subtree at the leaf. If the leaf had room, we are done. Otherwise we have overflow: more keys/subtrees than permitted. Resolve overflow by node splitting; when moving a key up to the parent, use the upper median!
- `search` and `insert` visit \(O(\log_a n)\) nodes; `delete` can also be implemented with \(O(\log_a n)\) node visits.
- Convert a 2-4-tree to a red-black tree: a \(d\)-node becomes a black node with \(d-1\) red children. Any red node has a black parent, and any empty subtree \(T\) has the same black depth (the number of black nodes on the path from the root to \(T\)).
- Convert back to a 2-4-tree: a black node with \(0 \leq d \leq 2\) red children becomes a \((d+1)\)-node.
- Red-black trees have height \(O(\log n)\).
- `insert` can be done in \(O(\log n)\) worst-case time:
- Convert the relevant part to a 2-4-tree.
- Do the insertion in the 2-4-tree.
- Convert the relevant parts back to a red-black tree.
- `delete` can also be done in \(O(\log n)\) worst-case time.
- B-tree:
- All operations require \(\Theta(\log_B n)\) block transfers.
- External Hashing
- Only \(O(1)\) block transfers expected for any operation
- To make more space, we typically only add one block. We rarely change the size of the directory. We never have to move all items.