
Lecture 6 on 02/11/2026 - FKS Hashing Analysis and Preprocessing

Scribes: Olivia Xu and Laura Torres

  • Comparison of FKS Hashing vs. Chaining (Worst-case vs. Expected)
  • Mathematical Intuition: Minimizing Sum of Squares for Equality
  • Analysis of Step 2: Probability of Second-level Collisions
  • Analysis of Step 1: Expected Collisions and Markov Bound

Professor Goswami began by distinguishing the guarantees:

  • Hashing with Chaining: Query time is O(1) expected. Preprocessing is always O(n).
  • FKS Hashing: Query time is O(1) worst-case. Preprocessing is O(n) expected.

First a Brief Summary of FKS Hashing Preprocessing Phase


Before analyzing the FKS hashing preprocessing, let’s review its steps.

  • Step 1: We hash the n keys into n buckets using a universal hash function h, letting b_i denote the number of keys that land in bucket i. We accept h only if \sum b_i^2 \leq 4n; otherwise we resample h.
  • Step 2: We take each b_i row from the first step and find a hash function for each row that distributes its keys into a row of length 2b_i^2 such that there are no collisions.
  • By the end, we will have used at most n+1 hash functions: 1 in step 1 and at most n in step 2.
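The two preprocessing steps above can be sketched in Python. This is a minimal illustrative sketch, not the lecture's exact construction: it assumes integer keys smaller than a prime P, and draws hash functions of the form h(x) = ((a·x + b) mod P) mod m from the standard multiply-mod-prime universal family.

```python
import random

P = (1 << 31) - 1  # a Mersenne prime larger than any key (assumption)

def random_hash(m):
    """Draw h(x) = ((a*x + b) mod P) mod m from a universal family."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

def fks_build(keys):
    n = len(keys)
    # Step 1: hash n keys into n buckets; resample h until sum(b_i^2) <= 4n.
    while True:
        h = random_hash(n)
        buckets = [[] for _ in range(n)]
        for x in keys:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    # Step 2: for each bucket, resample until a collision-free hash
    # into 2 * b_i^2 cells is found.
    tables = []
    for bucket in buckets:
        bi = len(bucket)
        if bi == 0:
            tables.append((None, []))
            continue
        m = 2 * bi * bi
        while True:
            g = random_hash(m)
            cells = [None] * m
            ok = True
            for x in bucket:
                j = g(x)
                if cells[j] is not None:   # second-level collision: retry
                    ok = False
                    break
                cells[j] = x
            if ok:
                tables.append((g, cells))
                break
    return h, tables

def fks_query(h, tables, x):
    """O(1) worst-case membership query: two hash evaluations."""
    g, cells = tables[h(x)]
    return g is not None and cells[g(x)] == x
```

Queries always cost exactly two hash evaluations plus one comparison, regardless of the input; only the build phase is randomized.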

To understand why Step 1 limits the sum of squares, we consider a calculus problem:

Problem: Given x + y = 1 and x, y \geq 0, minimize f(x,y) = x^2 + y^2.

  • Substituting y = 1-x, we get f(x) = x^2 + (1-x)^2 = 2x^2 - 2x + 1.
  • To minimize it, we differentiate once and set the derivative equal to zero:
f'(x) = 4x - 2 = 0 \implies x = 1/2, y = 1/2
  • Since the second derivative is positive, f''(x) = 4 > 0, the point x = 1/2, y = 1/2 is a minimum.
  • Insight: Minimizing the sum of squares is a mathematical way to enforce an equal distribution.
  • In FKS, \sum b_i = n. The sum of squares \sum b_i^2 ranges from n (perfectly equal, every b_i = 1) to n^2 (all keys in one bucket). FKS accepts any h where \sum b_i^2 \leq 4n.
    • The most equal distribution occurs when there are no collisions and each b_i equals 1. Then \sum b_i^2 = n:
b_1^2 + b_2^2 + \dots + b_n^2 = 1^2 + 1^2 + \dots + 1^2 = n
    • The most unequal distribution occurs when all keys go into the same bucket, so every other bucket has 0 keys. Then \sum b_i^2 = n^2:
b_1^2 + b_2^2 + \dots + b_n^2 = n^2 + 0^2 + \dots + 0^2 = n^2
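A quick numeric check of these two extremes (a tiny illustration, not part of the lecture):

```python
import random

# With sum(b_i) = n, the sum of squares is minimized by the all-equal
# distribution and maximized by putting every key in one bucket.
n = 8
equal = [1] * n                # b_i = 1 for every bucket
skewed = [n] + [0] * (n - 1)   # all n keys in one bucket
assert sum(b * b for b in equal) == n        # minimum: n
assert sum(b * b for b in skewed) == n * n   # maximum: n^2

# Any other way of dropping n keys into n buckets lands in between.
mid = [0] * n
for _ in range(n):
    mid[random.randrange(n)] += 1
assert n <= sum(b * b for b in mid) <= n * n
```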

Step 2 Analysis: Second-level Collision Probability


We need to prove that Step 2 terminates quickly.

Scenario: We hash b_i keys into m_i = 2b_i^2 cells.

  • Let C be the total number of collisions within the bucket.
  • Using a Universal Hash Family, for any pair of keys, \Pr(\text{collision}) \leq 1/m_i = 1/(2b_i^2).
  • The total number of pairs is \binom{b_i}{2} = \frac{b_i(b_i-1)}{2}.
E[C] = \sum_{\text{pairs}} \Pr(\text{collision}) = \frac{b_i(b_i-1)}{2} \cdot \frac{1}{2b_i^2} < \frac{b_i^2}{4b_i^2} = 1/4
  • By Markov’s Inequality: \Pr(C \geq 1) \leq \frac{E[C]}{1} < 1/4.
  • Conclusion: Since the failure probability is < 1/4, each try succeeds with probability > 3/4. The number of tries for a bucket is therefore a Geometric Random Variable with success probability p > 3/4, so the expected number of tries is 1/p < 4/3 \leq 2.
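To see this bound in action, here is a small simulation. As an assumption on top of the lecture's setup, an idealized fully-random hash stands in for the universal family:

```python
import random

def tries_until_perfect(bi, rng):
    """Count how many random hash functions we sample before b_i keys
    land in 2 * b_i^2 cells with no collision (fully-random model)."""
    m = 2 * bi * bi
    tries = 1
    while True:
        cells = [rng.randrange(m) for _ in range(bi)]
        if len(set(cells)) == bi:   # no two keys share a cell
            return tries
        tries += 1

rng = random.Random(0)
avg = sum(tries_until_perfect(20, rng) for _ in range(2000)) / 2000
assert avg <= 2   # empirically well under the bound of 2 expected tries
```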

Step 1 Analysis: Expected Collisions and Markov Bound

We now prove that we don’t need to resample h too many times: Step 1 succeeds with probability at least 1/2, so the expected number of tries is at most 2.

Key observation: The total number of collisions is related to how keys distribute across buckets.

When b_i keys hash to bucket i, how many collision pairs are there? If we count ordered pairs (so each unordered pair is counted twice), bucket i contributes b_i(b_i-1) ordered collision pairs. Summing over all buckets:

C = \sum_{i=1}^{n} b_i(b_i-1) = \sum_{i=1}^{n} (b_i^2 - b_i) = \sum_{i=1}^{n} b_i^2 - \sum_{i=1}^{n} b_i = \sum b_i^2 - n

since \sum b_i = n (all keys must go somewhere). Therefore:

\sum b_i^2 = C + n

This connects the sum of squares directly to the collision count, tying Step 1’s success condition to a probability argument.
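The identity C = \sum b_i^2 - n is exact for any assignment of keys to buckets, not just in expectation. A brute-force check with a random assignment (illustrative only):

```python
import random

rng = random.Random(1)
n = 60
bucket_of = [rng.randrange(n) for _ in range(n)]  # bucket of each key

# Tally bucket sizes b_i.
b = [0] * n
for i in bucket_of:
    b[i] += 1

# Count ordered colliding pairs (j, k), j != k, directly.
C = sum(1 for j in range(n) for k in range(n)
        if j != k and bucket_of[j] == bucket_of[k])

assert C == sum(bi * bi for bi in b) - n   # C = sum(b_i^2) - n, exactly
```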

When hashing n keys into n buckets using a universal hash function:

What’s the expected number of keys that collide with one particular key x_j?

  • There are n-1 other keys
  • Each hashes to the same bucket as x_j with probability at most 1/n
  • Expected collisions with x_j: (n-1) \cdot \frac{1}{n} < 1

Since this holds for every key:

E[C] = \sum_{j=1}^{n} E[\text{collisions with } x_j] < n

Important fact: When hashing n keys into n buckets, E[C] < n.

Using the result above where E[C] < n, and C = \sum b_i^2 - n:

E[\sum b_i^2 - n] = E[C] < n \implies E[\sum b_i^2] < 2n

Step 1 fails when \sum b_i^2 > 4n. By Markov’s inequality:

\Pr\left(\sum b_i^2 > 4n\right) \leq \frac{E[\sum b_i^2]}{4n} < \frac{2n}{4n} = \frac{1}{2}
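Markov's inequality only guarantees a failure probability below 1/2; in simulation (again substituting an idealized fully-random hash, an assumption beyond the lecture), Step 1 almost never fails:

```python
import random

def step1_fails(n, rng):
    """Hash n keys into n buckets and test the rejection condition."""
    b = [0] * n
    for _ in range(n):
        b[rng.randrange(n)] += 1
    return sum(bi * bi for bi in b) > 4 * n

rng = random.Random(0)
n = 100
fail_rate = sum(step1_fails(n, rng) for _ in range(1000)) / 1000
assert fail_rate < 1 / 2   # far below the Markov bound in practice
```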

Conclusion:

  • Probability Step 1 succeeds: \geq 1/2
  • Expected number of tries: \leq 2 (geometric with success probability \geq 1/2)
  • Total expected work in Step 1: 2 \cdot O(n) = O(n)
  • Query: O(1) worst-case (two hash function evaluations).
  • Space: O(n) (since \sum 2b_i^2 \leq 8n by Step 1’s acceptance condition).
  • Preprocessing: O(n) expected (geometric number of trials at both levels).