Module 5: Other Dictionary Implementations

Dictionary ADT: Implementations thus far

A dictionary is a collection of key-value pairs (KVPs), supporting operations search, insert, and delete.

Realizations we have seen so far:

  • Unordered array or list:
    • Insert: \(\Theta(1)\)
    • Search: \(\Theta(n)\)
    • Delete: \(\Theta(n)\)
  • Ordered array:
    • Search: \(\Theta(\log n)\)
    • Insert: \(\Theta(n)\)
    • Delete: \(\Theta(n)\)
  • Binary search trees:
    • Search, Insert, and Delete: \(\Theta(\text{height})\)
  • Balanced Binary Search trees (AVL trees):
    • Search, Insert, and Delete: \(\Theta(\log n)\)

Skip Lists

Note

A hierarchy of ordered linked lists (levels) \(L_0, L_1, \ldots, L_h\):

  • Each list \(L_i\) contains the special keys \(-\infty\) and \(+\infty\) (sentinels).
  • List \(L_0\) contains the KVPs of the dictionary \(S\) in non-decreasing order of key.
    • (The other lists store only keys and references.)
  • Each list is a subsequence of the previous one, i.e., \(L_0 \supseteq L_1 \supseteq \ldots \supseteq L_h\)
  • List \(L_h\) contains only the sentinels.

[Figure: an example skip list with \(n = 9\) KVPs and 14 non-sentinel nodes]

A few more definitions related to a hierarchy of ordered linked lists:

  • node: an entry in one of the lists; KVP: one non-sentinel entry of \(L_0\).
    • There are (usually) more nodes than KVPs.
    • In the figure above: # (non-sentinel) nodes = 14, while \(n\) = # KVPs = 9.
  • root: the topmost left sentinel; this is the only field the skip list stores.
  • Each node \(p\) has references \(p.after\) and \(p.below\).
  • Each key \(k\) belongs to a tower of nodes:
    • Height of tower of \(k\): maximal index \(i\) such that \(k \in L_i\).
    • Height of skip list: maximal index \(h\) such that \(L_h\) exists.

Search in Skip Lists

For each list, find the predecessor of \(k\) (the last node before where \(k\) would be). This will also be useful for insert and delete.

**get-predecessors(k)**

p ← root
P ← stack of nodes, initially containing p
while p.below != NULL do
    p ← p.below
    while p.after.key < k do p ← p.after
    P.push(p)
return P
**skipList::search(k)**

P ← get-predecessors(k)
p0 ← P.top() // predecessor of k in L0
if p0.after.key = k return KVP at p0.after
else return “not found, but would be after p0”
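Below is a minimal Python sketch of the skip-list structure and of get-predecessors/search, following the pseudocode above. All names (`Node`, `SkipList`, `after`, `below`) mirror the text but are illustrative, not a fixed API.

```python
import math

class Node:
    def __init__(self, key, value=None):
        self.key = key        # -math.inf / +math.inf for the sentinels
        self.value = value    # only non-sentinel L0 nodes carry values
        self.after = None     # next node in the same list
        self.below = None     # node with the same key, one level down

def sentinel_list():
    """Create a list containing only the two sentinels; return the left one."""
    left, right = Node(-math.inf), Node(math.inf)
    left.after = right
    return left

class SkipList:
    def __init__(self):
        self.root = sentinel_list()   # topmost left sentinel: the only field

    def get_predecessors(self, k):
        """Stack (Python list) of predecessors of k; preds[-1] is in L0."""
        p = self.root
        preds = [p]
        while p.below is not None:
            p = p.below
            while p.after.key < k:
                p = p.after
            preds.append(p)
        return preds

    def search(self, k):
        p0 = self.get_predecessors(k)[-1]   # predecessor of k in L0
        if p0.after.key == k:
            return (k, p0.after.value)      # the KVP at p0.after
        return None                         # not found; would go after p0
```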

Delete in Skip Lists

**skipList::delete(k)**

P ← get-predecessors(k)
while P is non-empty
    p ← P.pop() // predecessor of k in some list
    if p.after.key=k
        p.after ← p.after.after
    else break // no more copies of k
p ← left sentinel of the root-list
while p.below.after is the ∞-sentinel
    // the two top lists are both only sentinels, remove one
    p.below ← p.below.below
    p.after.below ← p.after.below.below
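Continuing the Python sketch from the search section (same illustrative `Node`/`SkipList` names), delete can be expressed as follows; the only addition beyond the pseudocode is a guard for an already-minimal skip list.

```python
import math

def skiplist_delete(self, k):
    """Remove every copy of k, then drop redundant sentinel-only lists."""
    preds = self.get_predecessors(k)
    while preds:
        p = preds.pop()                  # predecessor of k in some list
        if p.after.key == k:
            p.after = p.after.after      # splice this copy of k out
        else:
            break                        # no more copies of k
    p = self.root
    while p.below is not None and p.below.after.key == math.inf:
        # the two topmost lists are both sentinel-only: remove one
        p.below = p.below.below
        p.after.below = p.after.below.below

SkipList.delete = skiplist_delete        # attach to the class sketched above
```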

Insert in Skip Lists

skipList::insert(k, v)

Overview

  • There is no choice as to where to put the tower of k.
  • The only choice is: how tall should we make the tower of k?

Tower Height Determination

  • Choose randomly! Repeatedly toss a fair coin until you get tails.
  • Let \(i\) be the number of times the coin came up heads.
  • We want key \(k\) to be in lists \(L_0, \ldots, L_i\), so \(i\) becomes the height of the tower of \(k\).
  • \(\Pr(\text{tower of key } k \text{ has height} \geq i) = \left(\frac{1}{2}\right)^i\).

Pre-Insertion Setup

  • Before we can insert, we must ensure that lists \(L_0, \ldots, L_i\) exist.
    • Add sentinel-only lists, if needed, until the height \(h\) satisfies \(h > i\).

Insertion Process

  • Use get-predecessors(k) to get stack \(P\).
  • The top \(i+1\) items of \(P\) are the predecessors \(p_0, p_1, \ldots, p_i\) of where k should be in each list \(L_0, L_1, \ldots, L_i\).
  • Insert (k, v) after \(p_0\) in \(L_0\), and k after \(p_j\) in \(L_j\) for \(1 \leq j \leq i\).
**skipList::insert(k, v)**

for (i ← 0; random(2) = 1; i++) {} // random tower height
for (h ← 0, p ← root.below; p != NULL; p ← p.below, h++) {}
while i ≥ h // increase skip-list height?
    create new sentinel-only list; link it in below topmost list
    h++
P ← get-predecessors(k )
p ← P.pop() // insert (k,v) in L0
z_below ← new node with (k,v);
z_below.after ← p.after; p.after ← z_below
while i > 0 // insert k in L1,...,Li
    p ← P.pop()
    z ← new node with k
    z.after ← p.after; p.after ← z; z.below ← z_below ; z_below ← z
    i ← i − 1
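Completing the same sketch: insert flips coins for the tower height, links in sentinel-only lists while \(i \geq h\), and splices the new tower in behind each predecessor. A short usage demo follows; again, the names are illustrative.

```python
import random

def skiplist_insert(self, k, v):
    i = 0                                # tower height: heads before first tails
    while random.random() < 0.5:
        i += 1
    h, p = 0, self.root.below            # current height = # lists below the top
    while p is not None:
        p, h = p.below, h + 1
    while i >= h:                        # link new sentinel-only lists below the top
        top = sentinel_list()
        top.below, top.after.below = self.root.below, self.root.after.below
        self.root.below, self.root.after.below = top, top.after
        h += 1
    preds = self.get_predecessors(k)
    p = preds.pop()                      # insert (k, v) in L0
    z_below = Node(k, v)
    z_below.after, p.after = p.after, z_below
    while i > 0:                         # insert k in L1, ..., Li
        p = preds.pop()
        z = Node(k)                      # upper-list nodes store only the key
        z.after, p.after = p.after, z
        z.below, z_below = z_below, z
        i -= 1

SkipList.insert = skiplist_insert

sl = SkipList()
for key, val in [(65, 'a'), (23, 'b'), (87, 'c')]:
    sl.insert(key, val)
print(sl.search(23))                     # (23, 'b')
sl.delete(23)                            # delete method from the previous sketch
print(sl.search(23))                     # None
```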

Analysis of Skip Lists

Expected Space Usage

  • The expected space usage is \(O(n)\).
  • Define \(X_k\) as the tower height of key \(k\); then \(\Pr(X_k \geq i) = \left(\frac{1}{2}\right)^i\).
  • Define \(|L_i|\) as the number of non-sentinel nodes in list \(L_i\), and observe \(|L_i| = \sum_k \chi(X_k \geq i)\), where \(\chi\) is the indicator function.
  • The expected number of nodes in \(L_i\), denoted as \(E[|L_i|]\), is calculated based on the height probabilities of all keys:
\[ \begin{align*} E[|L_i|] & = \sum_k E[\chi(X_k\geq i)] \\ & = \sum_k \Pr(\chi(X_k\geq i)=1) \cdot 1+\sum_k \Pr(\chi(X_k\geq i)=0) \cdot 0 \\& = \sum_k \Pr(X_k \geq i) \\ & = \sum_k \frac{1}{2^i} \\ & = \frac{n}{2^i} \end{align*} \]

Note

Concept Explanation from ChatGPT:

Here, \(\chi(X_k \geq i)\) is the indicator function that equals 1 if the height of the tower for key \(k\) is at least \(i\), and 0 otherwise. Since each additional level is reached with probability \(\frac{1}{2}\), the probability that \(X_k \geq i\) is \(\frac{1}{2^i}\). Given that there are \(n\) keys, each contributing independently to level \(i\), the total expected number of nodes in \(L_i\) is \(n \cdot \frac{1}{2^i}\).

  • The total expected number of non-sentinel nodes across all lists can be expressed as:
\[ E[\text{\# non-sentinels}] = E\left[\sum_{i=0}^h|L_i|\right] = \sum_{i=0}^h E[|L_i|] = \sum_{i=0}^{h} \frac{n}{2^i} = n\sum_{i=0}^{h} \frac{1}{2^i} \leq n \sum_{i=0}^{\infty}\frac{1}{2^i} = 2n \]
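A quick Monte Carlo check of this bound (a sketch under the fair-coin model above; the function name is illustrative):

```python
import random

def total_tower_nodes(n):
    """Total non-sentinel nodes for n keys: a tower of height i has i+1 nodes."""
    total = 0
    for _ in range(n):
        i = 0
        while random.random() < 0.5:     # heads with probability 1/2
            i += 1
        total += i + 1                   # the key appears in L_0, ..., L_i
    return total

n, trials = 1000, 100
avg = sum(total_tower_nodes(n) for _ in range(trials)) / trials
print(avg / n)   # typically close to 2, consistent with E[...] <= 2n
```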

Expected Height

  • The expected height of the skip list is \(O(\log n)\).
    • Proof sketch: by the union bound, \(\Pr(\text{height} \geq i) \leq \sum_k \Pr(X_k \geq i) = \frac{n}{2^i}\), which becomes negligible once \(i\) exceeds \(\log n\) by a constant factor; the detailed (longer) proof is omitted here.

Get Predecessors Function

  • skipList::get-predecessors: Expected time complexity is \(O(\log n)\).
  • Important operations during this function:
    • Drop down: How often do we execute \(p \leftarrow p.below\)? Once per level, so this is bounded by the height of the skip list: \(O(\log n)\) in expectation.
    • Step forward: How often do we execute \(p \leftarrow p.after\)? In expectation, at most once per list (see the sketch below).
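A sketch of why at most one forward step per list is expected: any node stepped over in \(L_i\) has a tower of height exactly \(i\) (a taller tower would already have been passed in a higher list), and each node of \(L_i\) also appears in \(L_{i+1}\) with probability \(\frac{1}{2}\). Hence

\[ \Pr(\text{at least } j \text{ forward steps in } L_i) \leq \left(\frac{1}{2}\right)^{j}, \qquad E[\text{forward steps in } L_i] = \sum_{j \geq 1} \Pr(\text{at least } j \text{ steps}) \leq \sum_{j \geq 1} \frac{1}{2^{j}} = 1. \]

Combined with the \(O(\log n)\) expected height, this gives \(O(\log n)\) expected time for get-predecessors.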

Overall Complexity for Operations

  • The operations search, insert, and delete all have an expected time complexity of \(O(\log n)\), benefiting from the skip list's structure and probabilistic height distribution.

Summary of Skip Lists

\(O(n)\) expected space, all operations take \(O(\log n)\) expected time.

Being built from lists, skip lists are easy to implement, and further operations (e.g., successor, merge, ...) are easy to add.

As described, skip lists are no better than randomized binary search trees.

But there are numerous improvements on the space:

  • We can save links (and hence space) by implementing each tower as an array.

[Figure: skip list with each tower stored as an array]

  • Biased coin-flips to determine tower-heights give smaller expected space
  • With both ideas, expected space is \(< 2n\) (less than for a BST).

Biased Search Requests

Improving Unsorted Lists/Arrays

Recall unsorted array realization:

  • search: \(O(n)\), insert: \(O(1)\), delete: \(O(1)\) (after a search)

Very simple and popular. Can we do something to make search more effective in practice?

  • No: if all items are equally likely to be accessed.
    • We can show that the average-case cost of search is then \(\Theta(n)\).
  • Yes: if the search requests are biased:
    • some items are accessed much more frequently than others.
    • 80/20 rule: 80% of outcomes result from 20% of causes.
    • access: insertion or successful search
    • Intuition: Frequently accessed items should be in the front.
    • Two scenarios: Do we know the access distribution beforehand or not?

Optimal Static Ordering

Scenario: We know the access distribution and want to find the best order of a list.

Example:

| key | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| frequency of access | 2 | 8 | 1 | 10 | 5 |
| access probability | \(\frac{2}{26}\) | \(\frac{8}{26}\) | \(\frac{1}{26}\) | \(\frac{10}{26}\) | \(\frac{5}{26}\) |

Recall:

\[ \begin{align*} T_{\text{avg}}(n) &= \sum_{l \in I_n} T(l) \cdot (\text{relative frequency of } l) \\ &= \sum_{l \in I_n} T(l) \cdot \Pr(\text{randomly chosen instance is } l) \\ &= \text{expected run-time on a randomly chosen input} \end{align*} \]
  • The access cost is \(i\) if the search key (= instance \(l\)) is at the \(i\)-th position (\(i \geq 1\)).
  • \(T_{\text{avg}}(n) = \text{expected access cost} = \sum_{i \geq 1} i \cdot \Pr(\text{search is for the key at position } i)\)
    • Here \(\Pr(\cdot)\) is the access probability of that key.

[Figure: expected access cost calculation for the example]

Claim: Over all possible static orderings, the one that sorts items by non-increasing access-probability minimizes the expected access cost.
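For the example above, this ordering is D, B, E, A, C, with expected access cost

\[ \frac{1 \cdot 10 + 2 \cdot 8 + 3 \cdot 5 + 4 \cdot 2 + 5 \cdot 1}{26} = \frac{54}{26} \approx 2.08, \]

whereas, e.g., the alphabetical order A, B, C, D, E has expected access cost \(\frac{1 \cdot 2 + 2 \cdot 8 + 3 \cdot 1 + 4 \cdot 10 + 5 \cdot 5}{26} = \frac{86}{26} \approx 3.31\).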

Dynamic Ordering: MTF

Scenario: We do not know the access probabilities ahead of time.

  • Idea: Modify the order dynamically, i.e., while we are accessing.
  • Rule of thumb (temporal locality): A recently accessed item is likely to be used soon again.
  • Move-To-Front heuristic (MTF): Upon a successful search, move the accessed item to the front of the list.

[Figure: Move-To-Front example]

  • Note: We can also apply MTF on an array, but we should then insert and search from the back so that we have room to grow.
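A minimal sketch of MTF on a Python list of KVPs (front = index 0; the function name is illustrative):

```python
def mtf_search(items, key):
    """Linear search; on success, move the accessed KVP to the front."""
    for i, (k, v) in enumerate(items):
        if k == key:
            items.insert(0, items.pop(i))   # move to front
            return v
    return None                             # unsuccessful search

items = [('A', 1), ('B', 2), ('C', 3)]
mtf_search(items, 'C')
print(items)   # [('C', 3), ('A', 1), ('B', 2)]
```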

Other Heuristics for Dynamic Ordering

  • Transpose heuristic: Upon a successful search, swap the accessed item with the item immediately preceding it.

[Figure: transpose heuristic example]

Here the changes are more gradual.
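Under the same conventions, a sketch of the transpose heuristic:

```python
def transpose_search(items, key):
    """Linear search; on success, swap the KVP with its predecessor."""
    for i, (k, v) in enumerate(items):
        if k == key:
            if i > 0:
                items[i - 1], items[i] = items[i], items[i - 1]
            return v
    return None                             # unsuccessful search

items = [('A', 1), ('B', 2), ('C', 3)]
transpose_search(items, 'C')
print(items)   # [('A', 1), ('C', 3), ('B', 2)]
```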

  • Frequency-count heuristic: Keep a counter of how often each item is accessed, and keep the list sorted in non-increasing order of access count. This method is effective but requires auxiliary space (storing the counters); see the sketch below.
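A minimal sketch of the frequency-count heuristic, storing each item's counter inline (this is the auxiliary space mentioned above; names are illustrative):

```python
def freq_count_search(items, key):
    """items: list of [count, key, value], kept in non-increasing count order."""
    for i, entry in enumerate(items):
        if entry[1] == key:
            entry[0] += 1                # update the access counter
            # restore sorted order: bubble the entry forward as needed
            while i > 0 and items[i - 1][0] < items[i][0]:
                items[i - 1], items[i] = items[i], items[i - 1]
                i -= 1
            return entry[2]
    return None                          # unsuccessful search

items = [[0, 'A', 1], [0, 'B', 2], [0, 'C', 3]]
freq_count_search(items, 'C'); freq_count_search(items, 'C')
print(items)   # [[2, 'C', 3], [0, 'A', 1], [0, 'B', 2]]
```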