
Lecture 5 on 02/09/2026 - FKS Hashing and Tail Bounds

Scribes: Carlos Aucacama and Mohammed Zaid

  • Hashing with Chaining has expected constant query time.
  • Using stronger inequalities gives dramatically better tail bounds.
  • Recognizing when a random variable is a sum of independent Bernoullis is powerful.

Professor Goswami began by discussing the three guarantees of hashing with chaining:

1. Expected Query Time

Let Q denote the query time. Then

E[Q] = \frac{n}{M}

If M = \Theta(n), then

E[Q] = O(1)

2. Markov Inequality Guarantee

With M = n (so E[Q] = 1), Markov's inequality gives

\Pr(Q > T) \le \frac{E[Q]}{T} = \frac{1}{T}

3. Chebyshev Inequality Guarantee

Assuming M = n, we have

\mathrm{Var}(Q) \le 1

Applying Chebyshev:

\Pr(Q > T) \le \frac{1}{T^2}
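As a sanity check, the expected-query-time guarantee can be simulated. This is a sketch: the sizes `n`, `M`, and the trial count are illustrative choices, not from the lecture.

```python
import random

# Simulate hashing with chaining under a perfectly random hash:
# each stored key independently picks a uniform bucket.
n = M = 500           # n keys, M buckets, so E[Q] = n/M = 1
trials = 2000

total = 0
for _ in range(trials):
    query_bucket = random.randrange(M)
    # Q = number of the n stored keys that land in the query's bucket.
    Q = sum(1 for _ in range(n) if random.randrange(M) == query_bucket)
    total += Q

print(total / trials)  # empirical E[Q]; should be close to n/M = 1
```

The empirical mean stays near 1 regardless of how large n grows, as long as M grows with it.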

Variance Analysis of Query Time in Hashing with Chaining


Query Time as a Sum of Bernoulli Variables


Let Q denote the query time. We write

Q = \sum_{i=1}^{n} X_i

where

X_i = \begin{cases} 1 & \text{if the $i$-th key hashes to the same bucket as the query,} \\ 0 & \text{otherwise.} \end{cases}

Thus, Q counts the number of keys that hash to the same bucket as the query.

  • The chance that the i-th key hashes to the same bucket as the query is \frac{1}{M}.
  • Since h is perfectly random, the X_i are independent, so the variance of Q is the sum of the variances of the X_i.
  • The variance of a Bernoulli is p(1-p), where p = \frac{1}{M}. Summing over the n variables gives

\mathrm{Var}(Q) = n \cdot \frac{1}{M} \left(1 - \frac{1}{M}\right) = np(1-p)

Since 1 - \frac{1}{M} \le 1, replacing it by 1 gives an upper bound:

\mathrm{Var}(Q) \le \frac{n}{M}

If M = n, then

\mathrm{Var}(Q) \le 1

We compute the variance of Q in order to apply Chebyshev’s inequality: Markov’s inequality only requires the expectation, but Chebyshev requires both the expectation and the variance. Assuming M = n, we have

E[Q] = 1 \quad \text{and} \quad \mathrm{Var}(Q) \le 1

Applying Chebyshev:

\Pr(|Q - 1| > t) \le \frac{1}{t^2}

Since Q > T implies |Q - 1| > T - 1, setting t = T - 1 gives

\Pr(Q > T) \le \frac{1}{(T-1)^2} \approx \frac{1}{T^2}

This is much stronger than the Markov bound,

\Pr(Q > T) \le \frac{1}{T}

For example, when T=50T = 50, Markov gives approximately 2%, while Chebyshev gives 0.04%. There is no contradiction: Chebyshev uses more information (variance), so it gives a tighter bound.

Tail Bounds for Random Variables in Hashing with Chaining

Markov:

\Pr(X \ge t) \le \frac{E[X]}{t}, \quad X \ge 0

  • Applies to any nonnegative random variable.
  • Only requires the expectation E[X].

Chebyshev:

\Pr(|X - E[X]| \ge t) \le \frac{\mathrm{Var}(X)}{t^2}

  • Applies to any random variable.
  • Requires both the expectation E[X] and the variance \mathrm{Var}(X).

Chernoff:

Q = \sum_{i=1}^{n} X_i, \quad X_i \sim \text{Bernoulli}(p), \text{ independent}

\Pr(Q > t) \le \frac{1}{e^t}, \quad e \approx 2.718

  • Only applies to sums of independent Bernoulli random variables (here with E[Q] = 1).
  • Gives a much tighter bound than Markov or Chebyshev for large deviations.

Comparison for Large Deviations (e.g., t = 50)

\Pr(Q > 50) \le \begin{cases} \text{Markov: } \frac{1}{50} = 0.02 \\ \text{Chebyshev: } \frac{1}{50^2} = 0.0004 \\ \text{Chernoff: } \frac{1}{2^{50}} \approx 0 \end{cases}
  • Tail bounds estimate the probability of extreme events (tails) when exact probabilities are difficult to compute.
  • The choice of bound depends on the type of random variable:
    • Markov: expectation only
    • Chebyshev: expectation + variance
    • Chernoff: sum of independent Bernoullis
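The three bounds at t = 50 can be evaluated directly; the snippet below just computes the same numbers as the case analysis above.

```python
# Evaluate the three tail bounds at t = 50.
t = 50
markov = 1 / t          # needs expectation only
chebyshev = 1 / t**2    # needs expectation + variance
chernoff = 1 / 2**t     # needs a sum of independent Bernoullis
print(markov, chebyshev, chernoff)  # 0.02, 0.0004, ~8.9e-16
```

The Chernoff value is smaller than machine epsilon, which is why the notes write it as approximately zero.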

Binary search has query time

O(\log n)

which increases as nn increases.

Hashing with chaining improves this:

  • Preprocessing time: O(n) (always)
  • Query time: O(1) in expectation

However, the worst-case query time is not constant because collisions may occur.

We now design a hashing scheme with:

\text{Preprocessing time: } O(n) \text{ in expectation}

\text{Query time: } O(1) \text{ worst case (always)}

This scheme is called FKS hashing (Fredman–Komlós–Szemerédi). The randomness is moved entirely into preprocessing.

FKS Hashing Two-Level Hashing Construction

Section titled “FKS Hashing Two-Level Hashing Construction”

Let S = \{x_1, \dots, x_n\} be the set of keys.

Pick a perfectly random hash function

h : U \to \{1,2,\dots,n\}

Hash all keys into nn buckets.

For each bucket i, define

b_i = \text{number of keys mapped to bucket } i

The bucket sizes sum to n:

\sum_{i=1}^{n} b_i = n

Compute the sum of squared bucket sizes:

\sum_{i=1}^{n} b_i^2

If the sum of squares is greater than 4n, that is,

\sum_{i=1}^{n} b_i^2 > 4n

discard the hash function h and choose a new one. Repeat until the sum of squares is at most 4n:

\sum_{i=1}^{n} b_i^2 \le 4n

Since E\left[\sum_{i=1}^{n} b_i^2\right] \le 2n for a perfectly random h, Markov's inequality gives \Pr\left(\sum_i b_i^2 > 4n\right) \le \frac{1}{2}, so the expected number of retries is at most 2 and Step 1 takes O(n) expected time.

This completes Step 1.
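Step 1 can be sketched as follows. This is an idealized sketch: the "perfectly random" h is modeled as a table of uniform bucket choices, and the helper name `pick_first_level` and the key set are illustrative, not from the lecture.

```python
import random

def pick_first_level(keys):
    """Retry a random h until the sum of squared bucket sizes is <= 4n."""
    n = len(keys)
    while True:
        # Model a perfectly random h : U -> {0, ..., n-1} as a lookup table.
        h = {x: random.randrange(n) for x in keys}
        sizes = [0] * n
        for x in keys:
            sizes[h[x]] += 1
        if sum(b * b for b in sizes) <= 4 * n:
            return h, sizes          # accept h
        # otherwise discard h and retry; each trial succeeds w.p. >= 1/2

keys = list(range(100))              # illustrative key set
h, sizes = pick_first_level(keys)
```

Because each trial succeeds with probability at least 1/2, the loop runs a constant expected number of times.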

For each bucket i:

  • There are b_i keys in bucket i.
  • Allocate a second-level table of size 2b_i^2 for each bucket.
  • Choose a random hash function g_i mapping those b_i keys into this table.
  • If any collision occurs, discard g_i and choose another hash function.

Repeat until all b_i keys map to distinct cells. (With a table of size 2b_i^2, the expected number of colliding pairs is \binom{b_i}{2} \cdot \frac{1}{2b_i^2} < \frac{1}{4}, so each trial succeeds with probability at least \frac{3}{4}.)

Thus, every second-level table satisfies:

  • Each cell contains at most one key.
  • No collisions occur.

This completes preprocessing.
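Step 2, for a single bucket, might look like the sketch below; g_i is again modeled as a random lookup table, and the helper name `build_second_level` is mine, not from the lecture.

```python
import random

def build_second_level(bucket_keys):
    """Retry a random g_i until the b_i keys occupy distinct cells
    of a table of size 2 * b_i^2."""
    b = len(bucket_keys)
    size = max(1, 2 * b * b)
    while True:
        # Model a random g_i as a lookup table into {0, ..., size-1}.
        g = {x: random.randrange(size) for x in bucket_keys}
        table = [None] * size
        collision = False
        for x in bucket_keys:
            if table[g[x]] is not None:  # collision: discard g_i, retry
                collision = True
                break
            table[g[x]] = x
        if not collision:
            return g, table

g, table = build_second_level([3, 7, 11])  # illustrative bucket contents
```

Each trial succeeds with probability at least 3/4, so this also finishes in a constant expected number of retries per bucket.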

First-level table uses:

n \text{ cells}

Second-level tables use:

\sum_{i=1}^{n} 2 b_i^2 = 2 \sum_{i=1}^{n} b_i^2

From Step 1:

\sum_{i=1}^{n} b_i^2 \le 4n

Therefore:

2 \sum_{i=1}^{n} b_i^2 \le 8n

Total space:

n + 8n = 9n

Thus the total space is

O(n)

Given a query key q:

  • Compute i = h(q).
  • Compute j = g_i(q).
  • Inspect the cell in bucket i at position j.

Since second-level tables contain no collisions:

  • If the cell contains q, return YES.
  • Otherwise, return NO.
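The query procedure can be sketched as follows, over a tiny hand-built instance. The toy hashes and table below are illustrative only and were not produced by the real preprocessing.

```python
def fks_query(q, h, g, tables):
    i = h(q)                  # first-level bucket
    j = g[i](q)               # cell in bucket i's second-level table
    return tables[i][j] == q  # no collisions, so one lookup decides

# Hand-built instance: one bucket whose second-level table stores the key 7.
tables = [[None] * 8]
tables[0][7] = 7
h = lambda q: 0               # toy first-level hash: everything to bucket 0
g = [lambda q: q % 8]         # toy second-level hash for bucket 0

print(fks_query(7, h, g, tables))  # True
print(fks_query(3, h, g, tables))  # False
```

Whatever the query key, exactly two hash evaluations and one cell inspection happen, which is the worst-case O(1) guarantee.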

The query performs:

  • One evaluation of h
  • One evaluation of g_i
  • One table lookup

Therefore:

\text{Query time} = O(1) \text{ worst case}

In summary:

\text{Preprocessing time: } O(n) \text{ in expectation}

\text{Space: } O(n)

\text{Query time: } O(1) \text{ worst case}