Module 7: Dictionaries via Hashing¶
Hashing Introduction¶
Direct Addressing¶
Special situation: For a known \(M \in \mathbb{N}\), every key \(k\) is an integer with \(0 \leq k < M\).
We can then implement a dictionary easily: Use an array \(A\) of size \(M\) that stores \((k, v)\) via \(A[k] \leftarrow v\).
- search(\(k\)): Check whether \(A[k]\) is NULL.
- insert(\(k, v\)): \(A[k] \leftarrow v\)
- delete(\(k\)): \(A[k] \leftarrow \text{NULL}\)
Each operation is \(\Theta(1)\)
Total space is \(\Theta(M)\)
This dictionary realization is reminiscent of bucket sort.
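A minimal Python sketch of direct addressing (the class and variable names are illustrative):

```python
class DirectAddressTable:
    """Dictionary for integer keys k with 0 <= k < M, using Theta(M) space."""

    def __init__(self, M):
        self.A = [None] * M          # A[k] holds the value for key k, or None

    def search(self, k):
        return self.A[k]             # Theta(1): None means "key not present"

    def insert(self, k, v):
        self.A[k] = v                # Theta(1)

    def delete(self, k):
        self.A[k] = None             # Theta(1)

# Usage: D = DirectAddressTable(100); D.insert(42, "x"); D.search(42) == "x"
```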
Hashing¶
Two disadvantages of direct addressing:
- It cannot be used if the keys are not integers.
- It wastes space if \(M\) is unknown or \(n \ll M\).
Hashing idea: Map (arbitrary) keys to integers in range \(\{0, \ldots, M - 1\}\) (for an integer \(M\) of our choice), then use direct addressing.
Details:
- Assumption: We know that all keys come from some universe \(U\). (Typically \(U =\) non-negative integers, sometimes \(|U|\) finite.)
- We pick a table-size \(M\).
- We pick a hash function \(h : U \rightarrow \{0, 1, \ldots, M - 1\}\). (Commonly used: \(h(k) = k \mod M\). We will see other choices later.)
- Store dictionary in hash table, i.e., an array \(T\) of size \(M\).
- An item with key \(k\) wants to be stored in slot \(h(k)\), i.e., at \(T[h(k)]\).
Collisions¶
- Generally, hash function \(h\) is not injective, so many keys can map to the same integer.
- For example, \(h(46) = 2 = h(13)\) if \(h(k) = k \mod 11\).
- We get collisions: we want to insert \((k, v)\) into the table, but \(T[h(k)]\) is already occupied.
- There are many strategies to resolve collisions:

Hashing with Chaining¶
The simplest collision-resolution strategy: Each slot stores a bucket containing 0 or more KVPs.
- A bucket could be implemented by any dictionary realization (even another hash table!).
- The simplest approach is to use unsorted lists with MTF for buckets. This is called collision resolution by chaining.
- insert(\(k, v\)): Add \((k, v)\) to the front of the list at \(T[h(k)]\).
- search(\(k\)): Look for key \(k\) in the list at \(T[h(k)]\). Apply MTF-heuristic!
- delete(\(k\)): Perform a search, then delete from the linked list.
insert takes time \(O(1)\).
search and delete have run-time \(O(1 + \text{length of the list at } T[h(k)])\).
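A Python sketch of hashing with chaining, assuming integer keys, the modular hash function \(h(k) = k \mod M\), and Python lists as buckets (an illustrative implementation, not the only possible one):

```python
class ChainingHashTable:
    def __init__(self, M=11):
        self.M = M
        self.T = [[] for _ in range(M)]          # each slot holds a bucket of KVPs

    def _h(self, k):
        return k % self.M                        # simple modular hash function

    def insert(self, k, v):
        self.T[self._h(k)].insert(0, (k, v))     # O(1): add to front of the bucket

    def search(self, k):
        bucket = self.T[self._h(k)]
        for i, (key, val) in enumerate(bucket):
            if key == k:
                bucket.insert(0, bucket.pop(i))  # move-to-front heuristic
                return val
        return None                              # O(1 + bucket size)

    def delete(self, k):
        bucket = self.T[self._h(k)]              # search the bucket, then remove
        for i, (key, _) in enumerate(bucket):
            if key == k:
                bucket.pop(i)
                return
```

The move-to-front step makes repeated searches for the same key cheap, mirroring the MTF-heuristic mentioned above.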
Complexity of chaining¶
Run-times: insert takes time \(\Theta(1)\).
search and delete have run-time \(\Theta\left(1 + \text{size of bucket } T[h(k)]\right)\).
- The average bucket-size is \(\frac{n}{M} =: \alpha\).
- (\(\alpha\) is also called the load factor.)
- However, this does not imply that the average-case cost of search and delete is \(\Theta(1 + \alpha)\).
- Consider the case where all keys hash to the same slot.
- The average bucket-size is still \(\alpha\).
- But the operations take \(\Theta(n)\) time on average.
- To get meaningful average-case bounds, we need some assumptions on the hash-functions and the keys!
- To analyze average behavior, use randomized hashing.
- Assume the hash function is chosen randomly.
- Assume the Uniform Hashing Assumption (UHA): Any possible hash function is equally likely to be chosen.
- (This assumption facilitates analysis, even though it's not entirely realistic.)
UHA implies that the distribution of keys is unimportant.
- Claim: Hash-values are uniform. Formally: \(P(h(k) = i) = \frac{1}{M}\) for any key \(k\) and slot \(i\).
Proof:
- Let \(\mathcal{H}_j\) (for \(j = 0, \ldots, M-1\)) be the set of hash functions \(h\) with \(h(k) = j\).
- For any \(i \neq j\), swapping the output values \(i\) and \(j\) gives a bijection between \(\mathcal{H}_i\) and \(\mathcal{H}_j\), so all these sets have the same size.
- So \(P(h(k) = i) = P(h \in \mathcal{H}_i) = \frac{1}{M}\).
- Claim: Hash-values of any two keys are independent of each other.
Proof: similar.
Back to complexity of chaining:
- For a key \(k\) not in the dictionary, the bucket \(T[h(k)]\) has expected size \(\frac{n}{M} = \alpha\): each of the \(n\) stored keys is in this slot with probability \(\frac{1}{M}\).
- A key \(k\) in the dictionary is expected to collide with \(\frac{n-1}{M} \leq \alpha\) other keys: each of the other \(n - 1\) keys is in the same slot with probability \(\frac{1}{M}\).
- Expected cost of search and delete is hence \(\Theta(1 + \alpha)\).
Load factor and re-hashing¶
- For hashing with chaining (and also other collision resolution strategies), the run-time bound depends on \(\alpha\). (Recall: load factor \(\alpha = \frac{n}{M}\).)
- We keep the load factor small by rehashing when needed, as sketched below:
  - Keep track of \(n\) and \(M\) throughout operations.
  - If \(\alpha\) gets too large, create a new (roughly twice as big) hash table, new hash function(s), and re-insert all items into the new table.
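A sketch of this rehashing step for the chaining table sketched earlier (the growth rule \(M \to 2M + 1\) and the load-factor threshold are illustrative choices, not prescribed by the course notes):

```python
def rehash(table):
    """Move all items of a ChainingHashTable into a roughly twice-as-big table.

    Costs Theta(M + n) time: Theta(M) to scan the old slots, Theta(n) to re-insert.
    """
    old_buckets = table.T
    table.M = 2 * table.M + 1                 # new table size (ideally a prime near 2M)
    table.T = [[] for _ in range(table.M)]
    for bucket in old_buckets:
        for (k, v) in bucket:
            table.T[k % table.M].append((k, v))

# The caller tracks n and M and calls rehash(table) once n / table.M grows too large.
```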
Summary¶
- For Hashing with Chaining: Rehash so that \(\alpha \in \Theta(1)\) throughout.
- Rehashing costs \(\Theta(M + n)\) time (plus the time to find a new hash function).
- Rehashing happens rarely enough that we can ignore this term when amortizing over all operations.
- We should also rehash when \(\alpha\) gets too small, so that \(M \in \Theta(n)\) throughout, and the space is always \(\Theta(n)\).
Summary: The amortized expected cost for hashing with chaining is \(O(1)\) and the space is \(\Theta(n)\)
- (assuming uniform hashing and \(\alpha \in \Theta(1)\) throughout).
Theoretically this is perfect, but chaining is often too slow in practice: the linked lists add space and traversal overhead, which motivates the link-free strategies below.
Probe Sequences¶
Open Addressing¶
Main idea: Avoid the links needed for chaining by permitting only one item per slot, but allowing a key \(k\) to be stored in one of several possible slots.
search and insert follow a probe sequence of possible locations for key \(k\): \(\langle h(k, 0), h(k, 1), h(k, 2), \dots, h(k, M-1) \rangle\)
until an empty spot is found.

The simplest method for open addressing is linear probing: \(h(k, j) = (h(k) + j) \mod M,\) for some hash function \(h\).
Probe sequence operations¶
delete becomes problematic:
- Cannot leave an empty spot behind; the next search might otherwise not go far enough.
- We could try to move later items in the probe sequence forward. (But it is non-trivial to find one that can be moved.)
Better idea: lazy deletion:
- Mark the spot as "deleted" (rather than NULL).
- search continues past deleted spots.
- insert reuses deleted spots.
Keep track of how many items are deleted and re-hash (to keep space at \(\Theta(n)\)) if there are too many.
Probe sequence operations¶
**probe-sequence::insert(T, (k, v))**
for (j = 0; j < M; j++)
    if T[h(k, j)] is NULL or "deleted"
        T[h(k, j)] = (k, v)
        return "success"
return "failure to insert" // need to re-hash
**probe-sequence::search(T, k)**
for (j = 0; j < M; j++)
    if T[h(k, j)] is NULL return "item not found"
    if T[h(k, j)] has key k return T[h(k, j)]
    // key is incorrect or slot is "deleted"
    // try next probe, i.e., continue for-loop
return "item not found"
Independent Hash Functions¶
- Some hashing methods require two hash functions, \(h_0\) and \(h_1\).
- These hash functions should be independent in the sense that the events \(h_0(k) = i\) and \(h_1(k) = j\) are independent for all slots \(i, j\), i.e., \(P(h_0(k) = i \wedge h_1(k) = j) = P(h_0(k) = i) \cdot P(h_1(k) = j)\).
- Using two modular hash functions often leads to dependencies.
Better idea: Use the multiplication method, \(h(k) = \lfloor M (kA - \lfloor kA \rfloor) \rfloor\), for the second hash function:
- \(A\) is some floating-point number with \(0 < A < 1\).
- \(kA - \lfloor kA \rfloor\) computes the fractional part of \(kA\), which is in \([0, 1)\).
- Multiply by \(M\) to get a floating-point number in \([0, M)\).
- Round down to get an integer in \(\{0, \dots, M - 1\}\).
Our examples use \(\varphi = \frac{\sqrt{5} - 1}{2} \approx 0.618033988749...\) as \(A\).
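A Python sketch of the multiplication method (floating-point arithmetic is used here for simplicity; a careful implementation would use enough bits of \(A\), as discussed later):

```python
import math

A = (math.sqrt(5) - 1) / 2             # A = (sqrt(5) - 1)/2 ~ 0.618..., as in our examples

def mult_hash(k, M, A=A):
    frac = k * A - math.floor(k * A)   # fractional part of k*A, in [0, 1)
    return int(M * frac)               # scale to [0, M) and round down

# Keys 13 and 46 collide under k mod 11, but mult_hash(13, 11) and mult_hash(46, 11)
# land in different slots.
```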
Double Hashing¶
- Assume we have two independent hash functions, \(h_0\) and \(h_1\).
- Assume further that \(h_1(k) \neq 0\) and that \(h_1(k)\) is relatively prime to the table size \(M\) for all keys \(k\).
- To satisfy this, choose \(M\) to be a prime number and modify standard hash functions to ensure \(h_1(k) \neq 0\).
- For example, the modified multiplication method: \(h_1(k) = 1 + \lfloor (M - 1)(kA - \lfloor kA \rfloor) \rfloor\).
Double hashing: Open addressing with probe sequence \(h(k, j) = (h_0(k) + j \cdot h_1(k)) \mod M\).
search, insert, and delete work just like for linear probing, but with this different probe sequence.
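A sketch of this probe sequence in Python, reusing the modified multiplication method for \(h_1\) (this particular pairing of \(h_0\) and \(h_1\) is one possible choice, and \(M\) is assumed to be prime):

```python
import math

A = (math.sqrt(5) - 1) / 2

def h0(k, M):
    return k % M                                   # primary hash function

def h1(k, M):
    frac = k * A - math.floor(k * A)
    return 1 + int((M - 1) * frac)                 # in {1, ..., M-1}, so never 0

def double_hash_probe(k, j, M):
    return (h0(k, M) + j * h1(k, M)) % M           # h(k, j) for double hashing
```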
Note ❗
Concept Explanation from ChatGPT: Why we need to choose a prime \(M\)?
Choosing \(M\) as a prime number helps reduce the likelihood of collisions and ensures that all slots in the hash table are accessible when using open addressing techniques, like double hashing or linear probing. Here’s why this is important:
- Uniform Distribution of Hash Values: When \(M\) is prime, the hash function \(h(k) = k \mod M\) distributes keys more uniformly across the table. If \(M\) were composite, there could be patterns or periodicity in the hash values, leading to clusters of values in certain slots, which increases the chance of collisions.
- Full Probe Sequence Coverage: In double hashing, the probe sequence is given by: \(h(k, j) = (h_0(k) + j \cdot h_1(k)) \mod M\) where \(h_1(k)\) generates step sizes that are relatively prime to \(M\) . If \(M\) is prime, we can guarantee that \(j \cdot h_1(k)\) cycles through all possible slots in the table, ensuring that every slot can be probed if necessary. This is crucial for double hashing, as it allows the probe sequence to eventually cover all positions in the table, avoiding "infinite loops" in certain cases.
- Avoiding Cyclic Patterns: With a composite \(M\) , there’s a higher chance that certain step sizes in the probe sequence will cycle within a subset of the table’s slots, rather than covering all slots. This could trap the hashing process in a subset of the table, preventing it from using the full table space efficiently.
By choosing \(M\) as a prime number, we make the hash function more effective and reduce the likelihood of clustering, improving the performance of open addressing techniques.
Cuckoo Hashing¶
We use two independent hash functions, \(h_0\) and \(h_1\), and two tables, \(T_0\) and \(T_1\).
Main idea: An item with key \(k\) can only be at \(T_0[h_0(k)]\) or \(T_1[h_1(k)]\).
search and delete then always take constant time.

Cuckoo Hashing Insertion¶
- insert always initially puts the new item into \(T_0[h_0(k)]\).
- Evict the item that may have been there already.
- If so, the evicted item is inserted at an alternate position.
- This may lead to a loop of evictions.
- Can show: If insertion is possible, then there are at most \(2n\) evictions.
- So, abort after too many attempts.
**cuckoo::insert(k, v)**
(k_insert, v_insert) ← new key-value pair with (k, v)
i ← 0
do at most 2n times:
    (k_evict, v_evict) ← T_i[h_i(k_insert)]      // save old KVP
    T_i[h_i(k_insert)] ← (k_insert, v_insert)    // put in new KVP
    if (k_evict, v_evict) is NULL return "success"
    else
        (k_insert, v_insert) ← (k_evict, v_evict); i ← 1 - i
return "failure to insert" // need to re-hash
Cuckoo Hashing Discussions¶
- Can show: The expected number of evictions during insert is \(O(1)\).
- So in practice, stop evictions much earlier than \(2n\) rounds.
- This crucially requires a load factor \(\alpha < \frac{1}{2}\).
- Here \(\alpha = n / (\text{size of } T_0 + \text{size of } T_1)\).
- So cuckoo hashing is wasteful with space.
- In fact, the space can be \(\omega(n)\) if insert forces lots of re-hashing.
- Can show: Expected space is \(O(n)\).
There are many possible variations:
- The two hash tables could be combined into one.
- Be more flexible when inserting: Always consider both possible positions.
- Use \(k > 2\) allowed locations (i.e., \(k\) hash functions).
Complexity of open addressing strategies¶
- For any open addressing scheme, we must have \(\alpha \leq 1\).
- For the analysis, we require \(0 < \alpha < 1\), bounded away from 1 by a constant.
- Cuckoo hashing requires \(0 < \alpha < \frac{1}{2}\), bounded away from \(\frac{1}{2}\) by a constant.
Under these restrictions (and the uniform hashing assumption):
- All strategies have \(O(1)\) expected time for search, insert, and delete.
- Cuckoo hashing has \(O(1)\) worst-case time for search and delete.
- Probe sequences use \(O(n)\) worst-case space, while cuckoo hashing uses \(O(n)\) expected space.
However, for any hash function, the worst-case runtime is \(\Theta(n)\) for insert.
In practice, double hashing is the most popular method, or cuckoo hashing if there are many more searches than insertions.
Hash Function Strategies¶
Choosing a Good Hash Function¶
- Recall uniform hashing assumption: Hash function is randomly chosen among all possible hash functions.
- Satisfying this is impossible in practice: there are far too many hash functions, and we could not even store a randomly chosen one in order to look up \(h(k)\).
We need to compromise:
- Choose a hash function that is easy to compute.
- Aim for \(P(\text{two keys collide}) = \frac{1}{M}\) with respect to the distribution of the keys.
- This is enough to prove the expected runtime bounds for chaining.
In practice, hope for good performance by choosing a hash function that:
- is unrelated to any possible patterns in the data, and
- depends on all parts of the key.
We saw two basic methods for integer keys:
- Modular method: \(h(k) = k \mod M\).
- We should choose \(M\) to be a prime.
- This means finding a suitable prime quickly when re-hashing.
- This can be done in \(O(M \log \log n)\) time (no details).
- Multiplication method: \(h(k) = \lfloor M(kA - \lfloor kA \rfloor) \rfloor\), for some number \(A\) with \(0 < A < 1\).
- Multiplying with \(A\) is used to scramble the keys. So \(A\) should be irrational to avoid patterns in the keys.
- Experiments show that good scrambling is achieved when \(A\) is \(\varphi = \frac{\sqrt{5} - 1}{2} \approx 0.618033988749\ldots\) (the reciprocal of the golden ratio).
- We should use at least \(\log |U| + \log M\) bits of \(A\).
But every hash function must do badly for some sequences of inputs:
- If the universe contains at least \(M \cdot n\) keys, then there are \(n\) keys that all hash to the same value.
Carter-Wegman's Universal Hashing¶
Better idea: Choose hash function randomly!
- Requires: All keys are in \(\{0, \dots, p - 1\}\) for some (big) prime \(p\).
- At initialization, and whenever we re-hash:
- Choose \(M < p\) arbitrarily; a power of 2 is acceptable.
- Choose (and store) two random numbers \(a, b\):
- \(b = \text{random}(p)\)
- \(a = 1 + \text{random}(p - 1)\) (so \(a \neq 0\))
- Use as hash function: \(h(k) = ((a k + b) \mod p) \mod M\)
- \(h(k)\) can be computed quickly.
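A Python sketch of choosing such a hash function (the particular prime \(p = 2^{31} - 1\) and the use of Python's random module are choices of this example):

```python
import random

def make_carter_wegman_hash(M, p=2**31 - 1):       # p: a prime larger than every key
    a = 1 + random.randrange(p - 1)                # a in {1, ..., p-1}, so a != 0
    b = random.randrange(p)                        # b in {0, ..., p-1}
    return lambda k: ((a * k + b) % p) % M

# Choose a fresh function at initialization and whenever we re-hash:
h = make_carter_wegman_hash(M=64)                  # M may be a power of 2
```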
Analysis of these Carter-Wegman hash functions (no details):
- Choosing \(h\) in this way does not satisfy the uniform hashing assumption.
- But we can show: two keys collide with probability at most \(\frac{1}{M}\).
- This suffices to prove the runtime bounds for hashing with chaining.
Multi-dimensional Data¶
What if the keys are multi-dimensional, such as strings?
The standard approach is to flatten string \(w\) to an integer \(f(w) \in \mathbb{N}\), for example by interpreting each character \(w_i\) as a number (say its ASCII code) and using a radix \(R\): \(f(w) = \sum_{i=0}^{|w|-1} w_i \cdot R^{|w|-1-i}\).
We combine this with a modular hash function: \(h(w) = f(w) \mod M\).
To compute this in \(O(|w|)\) time without overflow, use Horner's rule and apply mod early. For example, \(h(\text{APPLE})\) is \(\big((((65 \cdot R + 80) \cdot R + 80) \cdot R + 76) \cdot R + 69\big) \mod M\), taking the intermediate result mod \(M\) after every step.
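A Python sketch of this computation (the radix \(R = 255\) and the use of ASCII codes are choices made for this example):

```python
def string_hash(w, M, R=255):
    """Hash string w into {0, ..., M-1} in O(|w|) time via Horner's rule."""
    h = 0
    for c in w:
        h = (h * R + ord(c)) % M   # apply mod early so intermediate values stay small
    return h

string_hash("APPLE", 11)           # equals (65*R^4 + 80*R^3 + 80*R^2 + 76*R + 69) mod 11
```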
Hashing vs. Balanced Search Trees¶
Advantages of Balanced Search Trees¶
- \(O(\log n)\) worst-case operation cost
- Does not require any assumptions, special functions, or known properties of input distribution
- Predictable space usage (exactly \(n\) nodes)
- Never need to rebuild the entire structure
- Supports ordered dictionary operations (successor, select, rank, etc.)
Advantages of Hash Tables¶
- \(O(1)\) expected operation cost (if the hash function is random and the load factor is small)
- We can choose space-time tradeoff via load factor
- Cuckoo hashing achieves \(O(1)\) worst-case for search & delete