Lecture 02/09/2026 - FKS Hashing and Tail Bounds

Scribes: Carlos Aucacama and Mohammed Zaid

Summary of Lecture

Hashing with Chaining has expected constant query time.
Using stronger inequalities gives dramatically better tail bounds.
Recognizing when a random variable is a sum of independent Bernoullis is powerful.

3 Guarantees

Professor Goswami began by discussing the three guarantees of hashing with chaining:

1. Expected Query Time

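Let $Q$ denote the query time, $n$ the number of keys, and $m$ the number of buckets. Then

$$\mathbb{E}[Q] = \frac{n}{m}.$$

If $m = n$, then

$$\mathbb{E}[Q] = 1.$$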

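2. Markov Inequality Guarantee

Since the query time $Q$ is nonnegative, for any $a > 0$,

$$\Pr[Q \ge a] \le \frac{\mathbb{E}[Q]}{a}.$$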

3. Chebyshev Inequality Guarantee

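Assuming $m = n$, we have

$$\mathbb{E}[Q] = 1, \qquad \operatorname{Var}(Q) \le 1.$$

Applying Chebyshev, for any $t > 0$:

$$\Pr\big[\,|Q - \mathbb{E}[Q]| \ge t\,\big] \le \frac{\operatorname{Var}(Q)}{t^2} \le \frac{1}{t^2}.$$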

Variance Analysis of Query Time in Hashing with Chaining

Query Time as a Sum of Bernoulli Variables

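Let $Q$ denote the query time. We write

$$Q = \sum_{i=1}^{n} X_i,$$

where

$$X_i = \begin{cases} 1 & \text{if the } i\text{-th key hashes to the same bucket as the query,} \\ 0 & \text{otherwise.} \end{cases}$$

Thus, $Q$ counts the number of keys that hash to the same bucket as the query.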

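The chance that the $i$-th key hashes to the same bucket as the query is $1/m$, where $m$ is the number of buckets.
Because the $X_i$ are independent, the variance of the query time is the sum of the variances of the $X_i$.
The variance of a Bernoulli variable is $p(1-p)$, where $p = 1/m$. Summing over the $n$ variables gives

$$\operatorname{Var}(Q) = \sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{n}{m}\left(1 - \frac{1}{m}\right).$$

Since $1 - \frac{1}{m} \le 1$, replacing it by $1$ gives an upper bound:

$$\operatorname{Var}(Q) \le \frac{n}{m}.$$

If $m = n$, then

$$\operatorname{Var}(Q) \le 1.$$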

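We compute the variance of $Q$ in order to apply Chebyshev’s inequality. Markov’s inequality requires only the expectation, whereas Chebyshev’s requires both the expectation and the variance. Assuming $m = n$, we have

$$\mathbb{E}[Q] = 1, \qquad \operatorname{Var}(Q) \le 1.$$

Applying Chebyshev, for any $t > 0$:

$$\Pr\big[\,|Q - \mathbb{E}[Q]| \ge t\,\big] \le \frac{\operatorname{Var}(Q)}{t^2} \le \frac{1}{t^2}.$$

Since $Q \ge 1 + t$ is equivalent to $Q - \mathbb{E}[Q] \ge t$ when $\mathbb{E}[Q] = 1$, we get

$$\Pr[Q \ge 1 + t] \le \frac{1}{t^2}.$$

This is much stronger than the Markov bound,

$$\Pr[Q \ge 1 + t] \le \frac{\mathbb{E}[Q]}{1 + t} = \frac{1}{1 + t}.$$

For example, when $t = 50$, Markov gives $1/51$, approximately 2%, while Chebyshev gives $1/2500 = 0.04\%$. There is no contradiction: Chebyshev uses more information (the variance), so it gives a tighter bound.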
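The expectation and variance above can be sanity-checked with a short simulation. This is only a sketch: it assumes each key picks its bucket independently and uniformly at random, uses $n = m = 100$ as an arbitrary example size, and the name `simulate_query_load` is hypothetical.

```python
import random
from statistics import mean, pvariance

def simulate_query_load(n, m, trials, seed=0):
    """Sample Q, the number of the n keys that land in the query's bucket."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        query_bucket = rng.randrange(m)
        # Each key independently picks a uniformly random bucket (assumption).
        q = sum(1 for _ in range(n) if rng.randrange(m) == query_bucket)
        samples.append(q)
    return samples

n = m = 100
samples = simulate_query_load(n, m, trials=5000)
print("empirical mean:    ", mean(samples))       # theory: n/m = 1
print("empirical variance:", pvariance(samples))  # theory: (n/m)(1 - 1/m) = 0.99
```

With these parameters both empirical values should land close to 1, matching the bounds used in the Chebyshev step.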

Tail Bounds for Random Variables in Hashing with Chaining

Markov Inequality

Applies to any nonnegative random variable $X$.
Requires only the expectation $\mathbb{E}[X]$.

Chebyshev Inequality

Applies to any random variable $X$ with finite variance.
Requires the expectation $\mathbb{E}[X]$ and the variance $\operatorname{Var}(X)$.

Chernoff Bound

Only applies to sums of independent Bernoulli random variables.
Gives a much tighter bound than Markov or Chebyshev for large deviations.

Comparison for Large Deviations (e.g., )

Purpose of Tail Bounds

Estimate extreme events (tails) when exact probabilities are difficult.
The choice of bound depends on what is known about the random variable:
- Markov: expectation only
- Chebyshev: expectation + variance
- Chernoff: sum of independent Bernoullis
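To make the comparison concrete, the sketch below evaluates the three bounds on the tail of a sum of independent Bernoullis with mean 1 and variance at most 1. The Chernoff formula used is one standard textbook form, $\Pr[X \ge (1+\delta)\mu] \le \left(e^{\delta}/(1+\delta)^{1+\delta}\right)^{\mu}$, which may differ from the form stated in lecture; the function names are hypothetical.

```python
import math

def markov_tail(mu, a):
    """Markov: P(X >= a) <= mu / a for nonnegative X and a > 0."""
    return min(1.0, mu / a)

def chebyshev_tail(var, t):
    """Chebyshev: P(|X - E[X]| >= t) <= var / t^2."""
    return min(1.0, var / t**2)

def chernoff_tail(mu, delta):
    """Multiplicative Chernoff (assumed textbook form):
    P(X >= (1 + delta) * mu) <= (e^delta / (1 + delta)^(1 + delta))^mu."""
    log_bound = mu * (delta - (1 + delta) * math.log1p(delta))
    return min(1.0, math.exp(log_bound))

mu, var = 1.0, 1.0  # mean 1 and variance bound 1, as in the m = n case
for t in (5, 10, 50):
    # All three lines bound the same event: X >= mu + t.
    print(t, markov_tail(mu, mu + t), chebyshev_tail(var, t), chernoff_tail(mu, t / mu))
```

For each deviation $t$ the Markov bound decays like $1/t$, the Chebyshev bound like $1/t^2$, and the Chernoff bound faster than exponentially in $t$.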

Motivation for New Algorithm

Binary search has query time

$$O(\log n),$$

which increases as $n$ increases.

Hashing with chaining improves this:

Preprocessing time: $O(n)$ (always)
Query time: $O(1)$ in expectation

However, the worst-case query time is not constant because collisions may occur.

We now design a hashing scheme with:

Preprocessing time: $O(n)$ in expectation
Query time: $O(1)$ in the worst case

This scheme is called FKS hashing (Fredman–Komlós–Szemerédi). The randomness is moved entirely into preprocessing.

FKS Hashing: Two-Level Hashing Construction

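Let $S$ be the set of $n$ keys.

Step 1: First-Level Hashing

Pick a perfectly random hash function

$$h : U \to \{1, \dots, n\},$$

where $U$ is the universe of possible keys.

Hash all $n$ keys into $n$ buckets.

For each bucket $i$, define

$$n_i = |\{x \in S : h(x) = i\}|,$$

the number of keys that hash to bucket $i$.

The bucket sizes sum to $n$:

$$\sum_{i=1}^{n} n_i = n.$$

Square each bucket size and add up the squares:

$$\sum_{i=1}^{n} n_i^2.$$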

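If the sum of squared bucket sizes is greater than $cn$, where $c$ is a fixed constant ($c = 4$ is a common choice):

$$\sum_{i=1}^{n} n_i^2 > cn$$

Discard the hash function and choose a new one. Repeat this process until the sum of squares is less than or equal to $cn$:

$$\sum_{i=1}^{n} n_i^2 \le cn.$$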

This completes Step 1.
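Step 1 can be sketched as a retry loop. Several things here are assumptions rather than part of the lecture: keys are taken to be distinct integers in $[0, P)$, a universal family $((ax + b) \bmod P) \bmod m$ stands in for the perfectly random hash function, the threshold constant defaults to 4, and the names `random_hash` and `first_level` are hypothetical.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; keys are assumed to be ints in [0, P)

def random_hash(m, rng):
    """Draw h(x) = ((a*x + b) mod P) mod m from a universal family
    (a stand-in for the perfectly random function in the notes)."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m

def first_level(keys, c=4, rng=None):
    """Retry until the sum of squared bucket sizes is at most c * n."""
    if rng is None:
        rng = random.Random()
    n = len(keys)
    while True:
        h = random_hash(n, rng)
        buckets = [[] for _ in range(n)]
        for x in keys:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= c * n:
            return h, buckets
        # Otherwise discard h and draw a fresh hash function.

keys = random.Random(1).sample(range(10**6), 1000)
h, buckets = first_level(keys, rng=random.Random(2))
print(sum(len(b) ** 2 for b in buckets), "<=", 4 * len(keys))
```

The loop returns both the accepted function and the resulting buckets, since the second level needs to know which keys share a bucket.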

Step 2: Second-Level Hashing

For each bucket $i$:

There are $n_i$ keys in bucket $i$.
Allocate a second-level table of size $n_i^2$ for the bucket.
Choose a random hash function $h_i$ mapping those keys into this table.
If any collision occurs, discard $h_i$ and choose another hash function.

Repeat until all keys map to distinct cells.

Thus, every second-level table satisfies:

(1) Each cell contains at most one key.

(2) No collisions occur.

This completes preprocessing.
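The notes do not record why a quadratic-size table makes a collision-free function easy to find. A standard birthday-paradox argument, sketched here under the assumption that any two distinct keys collide under the random second-level function with probability at most $1/n_i^2$ (writing $n_i$ for the number of keys in bucket $i$), goes as follows. The $n_i$ keys form $\binom{n_i}{2}$ pairs, so

$$\mathbb{E}[\text{number of colliding pairs}] \le \binom{n_i}{2} \cdot \frac{1}{n_i^2} = \frac{n_i - 1}{2 n_i} < \frac{1}{2}.$$

By Markov's inequality, the probability of at least one collision is less than $1/2$. Each attempt therefore succeeds with probability greater than $1/2$, and the expected number of attempts per bucket is less than 2.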

Space Analysis

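The first-level table uses

$$n \text{ cells.}$$

The second-level tables use

$$\sum_{i=1}^{n} n_i^2 \text{ cells,}$$

where $n_i$ is the number of keys in bucket $i$.

From Step 1, for the constant threshold $c$ used there:

$$\sum_{i=1}^{n} n_i^2 \le cn.$$

Therefore the second-level tables together use $O(n)$ cells.

Total space:

$$n + \sum_{i=1}^{n} n_i^2 \le n + cn = (1 + c)\,n.$$

Thus total space is $O(n)$.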

Query Analysis

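Given a query key $x$:

Compute $i = h(x)$.
Compute $j = h_i(x)$, where $h_i$ is the second-level hash function of bucket $i$.
Inspect the cell in bucket $i$ at position $j$.

Since second-level tables contain no collisions:

If the cell contains $x$, return YES.
Otherwise, return NO.

The query performs:

One evaluation of $h$
One evaluation of $h_i$
One table lookup

Each of these takes constant time, so the query time is $O(1)$ in the worst case.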
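Putting the construction and the query together, a minimal end-to-end sketch might look as follows. It rests on the same kind of assumptions as any concrete instantiation: distinct integer keys in $[0, P)$, a universal hash family in place of a perfectly random function, and a threshold constant of 4. The class `FKSTable` and its methods are hypothetical names, not the lecture's.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; keys are assumed to be distinct ints in [0, P)

def random_hash(m, rng):
    """h(x) = ((a*x + b) mod P) mod m, drawn from a universal family."""
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m

class FKSTable:
    def __init__(self, keys, c=4, seed=None):
        rng = random.Random(seed)
        n = max(1, len(keys))
        # Step 1: retry until the sum of squared bucket sizes is at most c * n.
        while True:
            self.h = random_hash(n, rng)
            buckets = [[] for _ in range(n)]
            for x in keys:
                buckets[self.h(x)].append(x)
            if sum(len(b) ** 2 for b in buckets) <= c * n:
                break
        # Step 2: one collision-free table of size n_i^2 per bucket.
        self.second = [self._perfect(b, rng) for b in buckets]

    @staticmethod
    def _perfect(bucket, rng):
        size = len(bucket) ** 2
        while True:
            g = random_hash(size, rng) if size else None
            table = [None] * size
            for x in bucket:
                j = g(x)
                if table[j] is not None:  # collision: discard g and retry
                    break
                table[j] = x
            else:
                return g, table

    def __contains__(self, x):
        # One evaluation of h, one of h_i, one table lookup.
        g, table = self.second[self.h(x)]
        return bool(table) and table[g(x)] == x

    def space(self):
        """Number of cells: first-level table plus all second-level tables."""
        return len(self.second) + sum(len(t) for _, t in self.second)

keys = random.Random(1).sample(range(10**6), 1000)
table = FKSTable(keys, seed=7)
print(keys[0] in table, 10**6 + 1 in table, table.space())
```

All retries happen inside the constructor, so a membership test never loops, and `space()` returns at most $(1 + c)\,n$ cells because of the Step 1 acceptance test.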