Lecture 5 on 02/09/2026 - FKS Hashing and Tail Bounds
Scribes: Carlos Aucacama and Mohammed Zaid
Summary of Lecture
- Hashing with Chaining has expected constant query time.
- Using stronger inequalities gives dramatically better tail bounds.
- Recognizing when a random variable is a sum of independent Bernoullis is powerful.
3 Guarantees
Professor Goswami began by discussing the 3 guarantees of hashing with chaining:
1. Expected Query Time
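Let $X$ denote the query time, $n$ the number of keys, and $m$ the number of buckets. Then
$$E[X] = \frac{n}{m}.$$
If $m = n$, then
$$E[X] = 1.$$
2. Markov Inequality Guarantee
With $m = n$, for any $t > 0$:
$$P(X \ge t) \le \frac{E[X]}{t} = \frac{1}{t}.$$
3. Chebyshev Inequality Guarantee
Assuming $m = n$, we have
$$E[X] = 1, \qquad \mathrm{Var}(X) \le 1.$$
Applying Chebyshev:
$$P(|X - 1| \ge t) \le \frac{\mathrm{Var}(X)}{t^2} \le \frac{1}{t^2}.$$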
Variance Analysis of Query Time in Hashing with Chaining
Query Time as a Sum of Bernoulli Variables
Let $X$ denote the query time. We write
$$X = \sum_{i=1}^{n} X_i,$$
where
$$X_i = \begin{cases} 1 & \text{if the } i\text{-th key hashes to the same bucket as the query,} \\ 0 & \text{otherwise.} \end{cases}$$
Thus, $X$ counts the number of keys that hash to the same bucket as the query.
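- The chance that the $i$-th key hashes to the same bucket as the query is $1/m$, where $m$ is the number of buckets.
- Because the $X_i$ are independent, the variance of the query time is the sum of the variances of the $X_i$.
- The variance of a Bernoulli variable is $p(1 - p)$, where $p = 1/m$. Summing over $n$ variables gives
$$\mathrm{Var}(X) = n \cdot \frac{1}{m}\left(1 - \frac{1}{m}\right).$$
Since $1 - \frac{1}{m} \le 1$, replacing it by $1$ gives an upper bound:
$$\mathrm{Var}(X) \le \frac{n}{m}.$$
If $m = n$, then
$$\mathrm{Var}(X) \le 1.$$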
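The expectation and variance claims above can be checked empirically. The following is a minimal simulation sketch, not part of the lecture: it assumes a perfectly random hash function (modeled with Python's `random` module), a query key that is not one of the stored keys, and the symbol names $n$ (keys) and $m$ (buckets). The function names are hypothetical.

```python
import random


def simulate_query_times(n, m, trials, seed=0):
    """Sample the query time X for a fresh query key, `trials` times.

    Each trial draws a uniform bucket for the query and an independent
    uniform bucket for each of the n stored keys, then counts the matches.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        query_bucket = rng.randrange(m)
        # X is a sum of n independent Bernoulli(1/m) indicators.
        x = sum(1 for _ in range(n) if rng.randrange(m) == query_bucket)
        samples.append(x)
    return samples


def mean_and_variance(samples):
    """Return the empirical mean and (population) variance of the samples."""
    mu = sum(samples) / len(samples)
    var = sum((s - mu) ** 2 for s in samples) / len(samples)
    return mu, var


if __name__ == "__main__":
    n = m = 100
    mu, var = mean_and_variance(simulate_query_times(n, m, trials=4000))
    print(f"empirical E[X]   = {mu:.3f}  (theory: n/m = {n / m})")
    print(f"empirical Var(X) = {var:.3f}  (upper bound: n/m = {n / m})")
```

With $m = n$, both estimates should come out close to 1, up to sampling noise.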
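We compute the variance of $X$ in order to apply Chebyshev’s inequality. Markov’s inequality only requires the expectation, but Chebyshev requires both expectation and variance. Assuming $m = n$, we have
$$E[X] = 1, \qquad \mathrm{Var}(X) \le 1.$$
Applying Chebyshev:
$$P(|X - 1| \ge t) \le \frac{\mathrm{Var}(X)}{t^2} \le \frac{1}{t^2}.$$
Since $X \ge t + 1$ is equivalent to $X - 1 \ge t$, we get
$$P(X \ge t + 1) \le \frac{1}{t^2}.$$
This is much stronger than the Markov bound,
$$P(X \ge t + 1) \le \frac{E[X]}{t + 1} = \frac{1}{t + 1}.$$
For example, when $t = 50$, Markov gives approximately 2%, while Chebyshev gives 0.04%. There is no contradiction: Chebyshev uses more information (the variance), so it gives a tighter bound.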
Tail Bounds for Random Variables in Hashing with Chaining
Markov Inequality
- Applies to any non-negative random variable.
- Only requires the expectation $E[X]$.
Chebyshev Inequality
- Applies to any random variable.
- Requires the expectation $E[X]$ and the variance $\mathrm{Var}(X)$.
Chernoff Bound
- Only applies to sums of independent Bernoulli random variables.
- Gives a much tighter bound than Markov or Chebyshev for large deviations.
Comparison for Large Deviations (e.g., $t = 50$)
With $m = n$ and $t = 50$, Markov gives approximately 2% and Chebyshev gives 0.04%; Chernoff gives a much tighter bound still.
Purpose of Tail Bounds
- Estimate the probability of extreme events (tails) when exact probabilities are difficult to compute.
- The choice of bound depends on what is known about the random variable:
  - Markov: expectation only
  - Chebyshev: expectation + variance
  - Chernoff: sum of independent Bernoullis
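To make the comparison concrete, here is a small sketch that evaluates all three bounds on $P(X \ge a)$ for $a = 51$, assuming $E[X] = 1$ and $\mathrm{Var}(X) \le 1$ (the $m = n$ case). The notes do not record which form of the Chernoff bound was used in lecture; the code assumes the standard multiplicative form $P(X \ge (1+\delta)\mu) \le \left(e^{\delta} / (1+\delta)^{1+\delta}\right)^{\mu}$. The function names are hypothetical.

```python
import math


def markov_bound(mean, a):
    """Markov: P(X >= a) <= E[X] / a for non-negative X and a > 0."""
    return mean / a


def chebyshev_bound(mean, variance, a):
    """Chebyshev: P(X >= a) <= P(|X - E[X]| >= a - E[X]) <= Var(X) / (a - E[X])^2."""
    return variance / (a - mean) ** 2


def chernoff_bound(mean, a):
    """Multiplicative Chernoff bound for a sum of independent Bernoullis.

    With a = (1 + d) * mu, the bound (e^d / (1 + d)^(1 + d))^mu has logarithm
    (a - mu) - a * ln(a / mu); it is evaluated in log space for stability.
    """
    log_bound = (a - mean) - a * math.log(a / mean)
    return math.exp(log_bound)


if __name__ == "__main__":
    mean, variance, a = 1.0, 1.0, 51
    print(f"Markov:    {markov_bound(mean, a):.4%}")
    print(f"Chebyshev: {chebyshev_bound(mean, variance, a):.4%}")
    print(f"Chernoff:  {chernoff_bound(mean, a):.3e}")
```

Under these assumptions the Chernoff value is dozens of orders of magnitude below the other two, which is why recognizing a sum of independent Bernoullis pays off.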
Motivation for New Algorithm
Binary search has query time
$$O(\log n),$$
which increases as $n$ increases.
Hashing with chaining improves this:
- Preprocessing time: $O(n)$ (always)
- Query time: $O(1)$ in expectation
However, the worst-case query time is not constant because collisions may occur.
We now design a hashing scheme with:
- Preprocessing time: $O(n)$ in expectation
- Query time: $O(1)$ in the worst case
- Space: $O(n)$
This scheme is called FKS hashing (Fredman–Komlós–Szemerédi). The randomness is moved entirely into preprocessing.
FKS Hashing Two-Level Hashing Construction
Let $S$ be the set of $n$ keys.
Step 1: First-Level Hashing
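Pick a perfectly random hash function
$$h : U \to \{1, 2, \dots, n\},$$
where $U$ is the universe of possible keys.
Hash all $n$ keys into $n$ buckets.
For each bucket $i$, define
$$n_i = \text{the number of keys } x \text{ with } h(x) = i.$$
The bucket sizes sum to $n$:
$$\sum_{i=1}^{n} n_i = n.$$
Square each bucket size and add them up:
$$\sum_{i=1}^{n} n_i^2.$$
If the sum of squares is greater than $4n$:
$$\sum_{i=1}^{n} n_i^2 > 4n,$$
discard the hash function and choose a new one. Repeat this process until the sum of squares is less than or equal to $4n$:
$$\sum_{i=1}^{n} n_i^2 \le 4n.$$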
This completes Step 1.
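Step 1 can be sketched as a retry loop. This is an illustration under stated assumptions, not the lecture's code: the notes call for a perfectly random hash function, while the sketch substitutes the standard universal family $h(x) = ((ax + b) \bmod p) \bmod n$ so that $h$ can be stored in constant space; keys are assumed to be distinct integers in $[0, p)$; and the threshold constant 4 (`threshold_factor`) is an assumed value. All names are hypothetical.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime; keys are assumed to lie in [0, PRIME)


def random_hash(num_buckets, rng):
    """Draw h(x) = ((a*x + b) mod PRIME) mod num_buckets from a universal family."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(PRIME)
    return lambda x: ((a * x + b) % PRIME) % num_buckets


def first_level(keys, rng, threshold_factor=4):
    """Redraw h until the squared bucket sizes sum to at most threshold_factor * n."""
    n = len(keys)
    while True:
        h = random_hash(n, rng)
        buckets = [[] for _ in range(n)]
        for key in keys:
            buckets[h(key)].append(key)
        if sum(len(b) ** 2 for b in buckets) <= threshold_factor * n:
            return h, buckets


if __name__ == "__main__":
    keys = random.Random(1).sample(range(10**9), 500)
    h, buckets = first_level(keys, random.Random(0))
    print("sum of squares:", sum(len(b) ** 2 for b in buckets), "<=", 4 * len(keys))
```

The loop is expected to finish after a small number of draws, since the expected sum of squares is roughly $2n$ when $n$ keys go into $n$ buckets.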
Step 2: Second-Level Hashing
For each bucket $i$:
- There are $n_i$ keys in bucket $i$.
- Allocate a second-level table of size $n_i^2$ for the bucket.
- Choose a random hash function $h_i$ mapping those keys into this table.
- If any collision occurs, discard $h_i$ and choose another hash function.
Repeat until all keys map to distinct cells.
Thus, every second-level table satisfies:
- It has size $n_i^2$.
- Each cell contains at most one key.
- No collisions occur.
This completes preprocessing.
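Step 2 for a single bucket can be sketched in the same style. As before, this is a hedged illustration: a universal family stands in for the random hash function of the notes, keys are assumed to be distinct integers in $[0, p)$, and the names `random_hash` and `build_second_level` are hypothetical.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime; keys are assumed to lie in [0, PRIME)


def random_hash(table_size, rng):
    """Draw h_i(x) = ((a*x + b) mod PRIME) mod table_size from a universal family."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(PRIME)
    return lambda x: ((a * x + b) % PRIME) % table_size


def build_second_level(bucket_keys, rng):
    """Return (h_i, table): len(table) == n_i**2 and no two keys share a cell."""
    size = len(bucket_keys) ** 2
    while True:
        h_i = random_hash(max(size, 1), rng)
        table = [None] * size
        collided = False
        for key in bucket_keys:
            slot = h_i(key)
            if table[slot] is not None:
                collided = True  # two keys share a cell: discard h_i
                break
            table[slot] = key
        if not collided:
            return h_i, table


if __name__ == "__main__":
    h_i, table = build_second_level([12, 407, 99881, 5], random.Random(0))
    print("table size:", len(table), "occupied:", sum(c is not None for c in table))
```

With $n_i$ keys and $n_i^2$ cells, the expected number of colliding pairs is below $1/2$, so each draw succeeds with probability greater than $1/2$ and the loop is expected to end quickly.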
Space Analysis
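The first-level table uses
$$n \text{ cells}.$$
The second-level tables use
$$\sum_{i=1}^{n} n_i^2 \text{ cells}.$$
From Step 1:
$$\sum_{i=1}^{n} n_i^2 \le 4n.$$
Therefore the second-level tables use at most $4n$ cells.
Total space:
$$n + \sum_{i=1}^{n} n_i^2 \le n + 4n = 5n.$$
Thus total space is
$$O(n).$$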
Query Analysis
Given a query key $q$:
- Compute $i = h(q)$.
- Compute $j = h_i(q)$.
- Inspect the cell in bucket $i$ at position $j$.
Since second-level tables contain no collisions:
- If the cell contains $q$, return YES.
- Otherwise, return NO.
The query performs:
- One evaluation of $h$
- One evaluation of $h_i$
- One table lookup
Therefore, the query time is
$$O(1)$$
in the worst case.
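Putting preprocessing and query together, the whole scheme fits in a short class. This is a sketch under assumptions rather than the lecture's implementation: a universal hash family replaces the perfectly random functions, keys are distinct integers in $[0, p)$, the first-level threshold is taken to be $4n$, and the class and method names (`FKSTable`, `query`, `cells`) are hypothetical.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime; keys are assumed to lie in [0, PRIME)


def random_hash(size, rng):
    """Draw x -> ((a*x + b) mod PRIME) mod size from a universal family."""
    a, b = rng.randrange(1, PRIME), rng.randrange(PRIME)
    return lambda x: ((a * x + b) % PRIME) % size


class FKSTable:
    def __init__(self, keys, seed=0):
        rng = random.Random(seed)
        n = len(keys)
        # Step 1: redraw h until the squared bucket sizes sum to at most 4n.
        while True:
            self.h = random_hash(max(n, 1), rng)
            buckets = [[] for _ in range(max(n, 1))]
            for key in keys:
                buckets[self.h(key)].append(key)
            if sum(len(b) ** 2 for b in buckets) <= 4 * n:
                break
        # Step 2: one collision-free table of size n_i^2 per bucket.
        self.second = [self._build_bucket(b, rng) for b in buckets]

    @staticmethod
    def _build_bucket(bucket, rng):
        size = len(bucket) ** 2
        while True:
            h_i = random_hash(max(size, 1), rng)
            table = [None] * size
            for key in bucket:
                if table[h_i(key)] is not None:
                    break  # collision: fall through and draw a new h_i
                table[h_i(key)] = key
            else:
                return h_i, table  # loop finished without a collision

    def query(self, q):
        """One evaluation of h, one of h_i, and one cell lookup."""
        h_i, table = self.second[self.h(q)]
        return bool(table) and table[h_i(q)] == q

    def cells(self):
        """First-level cells plus all second-level cells."""
        return len(self.second) + sum(len(t) for _, t in self.second)


if __name__ == "__main__":
    keys = random.Random(2).sample(range(10**9), 300)
    table = FKSTable(keys, seed=3)
    print(table.query(keys[0]), table.query(-1), table.cells() <= 5 * len(keys))
```

All of the retrying happens inside the constructor, so `query` contains no loops: its cost does not depend on how the random draws turned out.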