Lecture 10 on 03/02/2026 - Linear Probing and Hashing Complexity
Scribes: Xuexiong Wu
Review
Review: Dictionary/Membership Problem by HWC and FSK

| | Pre-processing | Query | |
|---|---|---|---|
| HWC | O(n) worst case | O(1) expected | O(log n / log log n) w.h.p. (Chernoff) |
| FSK | O(n) expected | O(1) worst case | |
What does Chernoff actually mean? At its core, Chernoff tells us something powerful: if a random variable is the sum of independent Bernoulli variables, then with very high probability it stays close to its expectation.
To understand this intuitively: imagine flipping many independent coins and counting heads. You expect about half to be heads, but you’re curious—how far can the actual count deviate from this expectation? Chernoff answers this: deviations become exponentially less likely the further you go from expectation. This is remarkable because for other types of random variables, deviations might be quite common.
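This intuition is easy to test empirically. The following is an illustrative simulation (not from the lecture): flip 1000 fair coins many times and count how often the number of heads strays more than 10% from its expectation of 500.

```python
import random

# Monte Carlo check of the concentration claim: flip 1000 fair coins and
# record how often the head count deviates >10% from its expectation.
random.seed(0)
flips, trials = 1000, 2000
mu = flips / 2                       # expected number of heads: 500
big_deviations = 0
for _ in range(trials):
    heads = sum(random.random() < 0.5 for _ in range(flips))
    if abs(heads - mu) > 0.10 * mu:  # more than 10% away from 500
        big_deviations += 1

print(f"fraction of trials off by >10%: {big_deviations / trials:.4f}")
```

Running this, almost no trial lands outside the 10% band, matching the exponential tail Chernoff predicts.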
More formally, Chernoff bounds give us precise probabilities: the probability that a sum-of-Bernoullis random variable deviates far from its expectation is very small. If you understand the expectation of such a random variable, you can confidently trust that the actual value will stay close to that expectation.
When can we use Chernoff? Chernoff can only apply to random variables that are sums of independent Bernoulli variables. This is crucial: if your random variable has this structure, Chernoff gives you very strong (exponential) bounds. When the random variable is NOT a sum of independent Bernoullis, we can only use Markov or Chebyshev, which give weaker bounds.
Why keep Markov and Chebyshev? You might ask: if Chernoff is stronger, why do we ever use Markov or Chebyshev? The answer is that Chernoff only works when you have the specific structure of a sum of independent Bernoullis. Many random variables don’t have this structure. In those cases, Markov and Chebyshev are the only tools available, so they’re still essential.
HWC: The pre-processing time is O(n) in the worst case. The expected query time is O(1). For a high-probability bound on the query time, we get O(log n / log log n) by Chernoff, O(n) by Markov, or O(√n) by Chebyshev.
FSK: The pre-processing time is O(n) expected; the query time is O(1) in the worst case.
Linear Probing
Preprocessing
Let S be a set of n keys, and let the hash table have m cells. We hash the keys one by one using a hash function h.
- If the cell h(x) is empty, we store key x in cell h(x).
- If the cell h(x) is already occupied, we walk to the right from h(x) until we find an empty slot.
- Suppose we walk i steps. Then we store x in cell h(x) + i (mod m).

Query
- Compute h(q), look at cell h(q).
- If q is there, return yes.
- If q is not there, move to the right until we find q or we find an empty slot. If q is found, return yes. If an empty slot is found, return no.
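The insert and query procedures above can be sketched as a small Python class. This is a minimal sketch: the name `LinearProbingTable` is illustrative, and Python's built-in `hash` stands in for the random hash function h.

```python
# Minimal sketch of a linear-probing hash table (illustrative names).
class LinearProbingTable:
    def __init__(self, m):
        self.m = m                   # number of cells
        self.cells = [None] * m      # None marks an empty cell

    def _hash(self, key):
        # Stand-in for a truly random hash function h.
        return hash(key) % self.m

    def insert(self, key):
        i = self._hash(key)
        while self.cells[i] is not None:  # walk right, wrapping around
            if self.cells[i] == key:      # key already stored
                return
            i = (i + 1) % self.m
        self.cells[i] = key

    def query(self, key):
        i = self._hash(key)
        while self.cells[i] is not None:
            if self.cells[i] == key:
                return True               # found the key: yes
            i = (i + 1) % self.m
        return False                      # hit an empty slot: no
```

Note that `insert` assumes n < m (at least one empty cell), otherwise the walk would never terminate.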
Who is Donald Knuth?
Before discussing his remarkable analysis, it’s worth knowing who Donald Knuth is. Knuth is one of the most influential computer scientists of all time. He authored The Art of Computer Programming, a legendary five-volume series considered the encyclopedia of algorithms and the definitive reference on many topics in computer science. These volumes contain an incredible collection of problems, some of which remain open research questions today.
Interestingly, Knuth still offers rewards (originally 1 hexadecimal dollar, i.e. $2.56) for anyone who finds an error in his books—a testament to their precision. Additionally, Knuth invented TeX, the typesetting system underlying the LaTeX you’re likely using to write your lecture notes!
Donald Knuth’s Analysis
Here the query time is the number of steps we have to walk to the right until we find q or an empty cell, and it depends on the load factor α = n/m. Knuth showed that the expected number of probes for an unsuccessful search is about (1/2)(1 + (1/(1−α))²). Therefore, if the load factor α increases, the expected query time also increases.
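As an illustration, we can evaluate Knuth's classic unsuccessful-search estimate, (1/2)(1 + (1/(1−α))²), at a few load factors; treat the formula as a standard reference value rather than something derived in these notes.

```python
# Knuth's estimate for the expected number of probes in an unsuccessful
# linear-probing search, as a function of the load factor alpha = n/m.
def expected_probes(alpha):
    return 0.5 * (1 + (1 / (1 - alpha)) ** 2)

for alpha in (0.2, 0.5, 0.8, 0.9):
    print(f"alpha = {alpha}: about {expected_probes(alpha):.2f} probes")
```

The blow-up as α approaches 1 is exactly the growth the text describes.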
Analysis
Block Definition and Intuition
To understand the behavior of linear probing, we need to introduce the concept of a block. This is the key to understanding why linear probing works well.
Intuitive idea: Imagine the hash table as a row of cells. When you insert keys, some cells become occupied and some remain empty. A block is simply a contiguous group of occupied cells sandwiched between two empty cells.
Formal definition: A sequence of k consecutive cells is called a block if:
- The cell immediately before this sequence is empty (no one hashes there)
- The cell immediately after this sequence is empty (no one hashes there)
- Exactly k keys have hashed into this block
Why blocks matter for query time: When you do a query and your hash value lands in a block, you have to walk through the entire block to find an empty cell. Therefore, your query time is determined by the size of the block you land in. The longer the blocks, the longer your queries take.
What makes a block “bad”? We call a block of k cells bad when it is completely full (all k cells occupied). When you insert a new element whose hash location falls into a bad block, you must walk past all occupied cells before finding an empty slot. This is expensive.
How many keys end up in a block? Let’s think about expectations. When we have n keys and m cells with load factor α = n/m, the expected number of keys hashing to a single cell is α. For a block of k consecutive cells, we expect: αk keys.
Now here’s the intuition: If α < 1 (fewer keys than cells), then we expect fewer than k keys in any k-sized block. But what if, by bad luck, a block actually gets k or more keys? That would be a significant deviation from expectation. This is exactly where Chernoff bounds help: they quantify how unlikely such “bad” scenarios are.
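A quick Monte Carlo sketch of this expectation (the window position and parameters below are arbitrary choices for illustration): hash n keys into m cells and count how many land in one fixed window of k consecutive cells.

```python
import random

# Empirically check: with n keys in m cells (load alpha = n/m), a window
# of k consecutive cells receives about alpha*k keys on average.
random.seed(1)
n, m, k = 2_000, 10_000, 10        # load factor alpha = 1/5, so alpha*k = 2
trials = 500
total = 0
for _ in range(trials):
    hashes = (random.randrange(m) for _ in range(n))
    total += sum(1 for h in hashes if h < k)  # keys landing in cells 0..k-1
avg = total / trials

print(f"average keys in a {k}-cell window: {avg:.2f} (expected {n / m * k})")
```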
Applying Chernoff: Step-by-Step Analysis
Let X be the random variable representing the number of keys hashing to a particular block of size k. Note that X is a sum of n independent Bernoulli indicators (one per key, indicating whether that key hashes into the block), so Chernoff applies.
What we know:
- E[X] = αk (expected number of keys in the block)
- A block is bad if X ≥ k (at least k keys hash into the block)
- We want to bound Pr[X ≥ k]
Concrete example: Assume α = 1/5 (5 times more cells than keys). Then:
- For a block of size k = 10: we expect only αk = 2 keys in the block
- But the block is bad if 10 keys somehow hash into it
- We need to find the probability of this worst-case 5× deviation from expectation
Setting up Chernoff: We want to apply the Chernoff bound, which comes in the following simplified form for sums of independent Bernoullis (valid for δ ≥ 1, writing μ = E[X]):

Pr[X ≥ (1+δ)μ] ≤ e^(−δμ/3)

This tells us: the probability that X exceeds its expectation by a factor of (1+δ) decays exponentially in δμ.
Finding δ for our case: We want to find δ when a block is bad, which means X ≥ k. So we need to express this in the form X ≥ (1+δ)E[X] by solving:

(1+δ)E[X] = k

Since E[X] = αk, we substitute:

(1+δ)αk = k

Dividing both sides by αk:

1+δ = 1/α

Therefore:

δ = 1/α − 1 = (1−α)/α
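A two-line sanity check of this algebra, for the running example α = 1/5 and k = 10:

```python
# If mu = alpha*k and (1 + delta)*mu = k, then delta = (1 - alpha)/alpha.
alpha, k = 1 / 5, 10
mu = alpha * k                       # expected keys in the block: 2
delta = (1 - alpha) / alpha          # the derived deviation parameter
print(f"delta = {delta:.0f}, 1 + delta = {1 + delta:.0f}")
assert abs((1 + delta) * mu - k) < 1e-9   # (1 + delta)*mu recovers k
```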
Concrete numbers: For α = 1/5 (meaning we have 5 times more cells than keys):

δ = 1/α − 1 = 5 − 1 = 4, so 1+δ = 5

This means a bad block requires a 5× deviation from the expected number of keys.
Plugging into Chernoff: Now we substitute our expression for δ into the Chernoff bound:

Pr[block of size k is bad] = Pr[X ≥ (1+δ)αk] ≤ e^(−δαk/3)

Substituting δ = (1−α)/α and μ = αk:

Pr[block of size k is bad] ≤ e^(−((1−α)/α)·αk/3) = e^(−(1−α)k/3)

Important simplification: Look at the exponent:

−(1−α)k/3

Notice that the term (1−α)/3 is a constant—it only depends on our choice of α, not on k. Let’s call this constant c:

c = (1−α)/3

Then we can rewrite our probability bound more simply as:

Pr[block of size k is bad] ≤ e^(−ck)

Define p = e^(−c). Since c > 0 (because α < 1), we have p < 1. Therefore:

Pr[block of size k is bad] ≤ p^k
What this means—Exponential Decay: The probability that a block is bad decreases exponentially with the block size k — this is crucial. For example, if p = 0.7:
- k = 1: probability ≤ 0.7 (70%)
- k = 2: probability ≤ 0.49 (49%)
- k = 5: probability ≤ 0.17 (17%)
- k = 10: probability ≤ 0.028 (2.8%)
You can see how rapidly the probabilities shrink. This exponential decay is exactly what we need: it ensures that long blocks are very rare, which means our expected insertion time stays constant.
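The percentages above can be reproduced directly from p^k:

```python
# Tabulate the bound p**k for p = 0.7, matching the percentages above.
p = 0.7
for k in (1, 2, 5, 10):
    print(f"k = {k:2d}: p^k = {p**k:.3f}")
```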
Expected insertion time: The insertion (and query) time is proportional to the length of the block containing the hash location, so we bound the expected block length next.
Computing Expected Insertion Time via Series Summation
Now let’s compute the expected insertion time. When you hash a query into position i, you might collide with blocks of various sizes. How many blocks of size k could contain your hash position?
Blocks containing a position: A hash position i can be contained in up to k different blocks of size k (one where i is the rightmost position, one where it’s second-from-right, …, one where it’s the leftmost position). So there are k possible k-sized blocks that could contain any given hash position.
Expected insertion time formula:

E[insertion time] ≤ Σ_{k≥1} k · Pr[the hash position falls in a bad block of size k]

By overestimating (even if all k blocks of size k that could contain the position are bad), we get:

E[insertion time] ≤ Σ_{k≥1} k · k · Pr[a fixed block of size k is bad]

Substituting our Chernoff bound:

E[insertion time] ≤ Σ_{k≥1} k² · p^k
Why Chernoff is Essential: Comparison with Markov
Now let’s see why we needed Chernoff and why simpler inequalities don’t work.
What if we used Markov? Recall that Markov’s inequality says:

Pr[X ≥ t] ≤ E[X]/t

In our case, with E[X] = αk and t = k:

Pr[X ≥ k] ≤ αk/k = α

Notice something troubling: this bound doesn’t depend on k at all! Whether we’re looking at blocks of size 1 or size 1000, Markov gives us the same bound α. This is too weak.
Why this breaks our analysis: If we tried to compute expected insertion time using Markov’s bound, we’d get:

E[insertion time] ≤ Σ_{k=1}^{m} k² · α

The sum grows as Θ(αm³). Since α is a constant, we’d conclude that the expected insertion time could be polynomial in m—a terrible result!
Why Chernoff succeeds: Chernoff gives us the exponential bound p^k where p < 1. The exponential decay dominates the polynomial growth of k², making the sum converge to a constant. This is the power of Chernoff: it exploits the specific structure of sums of independent Bernoullis, while Markov only uses the expectation.
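A small numeric comparison of the two bounds makes the gap vivid. Here c = (1−α)/3 is the constant from the simplified Chernoff form e^(−δμ/3) used in this sketch.

```python
import math

# Compare the two tail bounds on Pr[block of size k is bad] as k grows:
# Markov gives the flat bound alpha; Chernoff gives p**k with p < 1.
alpha = 1 / 5                 # load factor
c = (1 - alpha) / 3           # constant from the Chernoff exponent
p = math.exp(-c)              # p < 1 since c > 0
for k in (1, 10, 50, 100):
    print(f"k = {k:3d}: Markov = {alpha:.3f}, Chernoff = {p**k:.3e}")
```

Markov stays at 0.2 forever, while the Chernoff bound is astronomically small already at k = 100.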
Series Convergence: Showing the Sum is O(1)
Now we need to compute the sum of all our probabilities over all possible block sizes:

Σ_{k≥1} k² · p^k

The question is: does this sum stay bounded as m grows? Or does it blow up?
Intuition—Exponential beats polynomial: This is where the magic happens. We have:
- k² grows polynomially (slowly)
- p^k decays exponentially (very fast)
When exponential decay competes with polynomial growth, exponential always wins.
Making this rigorous: For large enough k (say k ≥ k₀), the term p^k becomes incredibly small—so small that k² · p^k < 1/k². So:

Σ_{k≥1} k² · p^k = Σ_{k<k₀} k² · p^k + Σ_{k≥k₀} k² · p^k

The first sum has just a few terms, so it’s finite. The second sum is bounded by the convergent series:

Σ_{k≥1} 1/k² = π²/6
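We can also check the convergence numerically. The closed form Σ_{k≥1} k²p^k = p(1+p)/(1−p)³ for |p| < 1 is a standard power-series identity, used here purely as a cross-check.

```python
# Confirm numerically that sum_{k>=1} k^2 * p^k converges for p < 1.
p = 0.7
partial = sum(k * k * p**k for k in range(1, 500))  # partial sum to k=499
closed = p * (1 + p) / (1 - p) ** 3                 # closed form

print(f"partial sum: {partial:.6f}, closed form: {closed:.6f}")
```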
Conclusion: The entire series is bounded by some constant C (which may depend on α, but not on n or m). Therefore:

E[insertion time] = O(1)

This completes the argument. With the right load factor (like α = 1/5), linear probing achieves constant expected insertion and query time.
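As a sanity check, here is an illustrative simulation (not part of the lecture) measuring the average number of cells inspected per successful query at load factor 1/5, for growing table sizes:

```python
import random

# Simulate linear probing at load factor 1/5 and measure the average
# number of cells inspected per successful query as the table grows.
random.seed(2)
for m in (1_000, 10_000, 100_000):
    n = m // 5
    cells = [None] * m
    h = [random.randrange(m) for _ in range(n)]  # simulated random hash
    for x in range(n):                           # insert keys 0..n-1
        i = h[x]
        while cells[i] is not None:
            i = (i + 1) % m
        cells[i] = x
    probes = 0
    for x in range(n):                           # query every stored key
        i, steps = h[x], 1
        while cells[i] != x:
            i = (i + 1) % m
            steps += 1
        probes += steps
    print(f"m = {m:6d}: average probes per query = {probes / n:.2f}")
```

The average stays a small constant no matter how large m gets, exactly as the analysis predicts.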
The Proof in Two Sentences
The complete analysis can be summarized as follows:
-
Long runs are very improbable: If you have enough cells compared to your keys (load factor α < 1), then finding long contiguous runs of packed elements is rare.
-
They become more improbable as they get longer: The probability that a block is bad decreases exponentially with its length (at most p^k, where p < 1). This exponential decay dominates the polynomial growth of block sizes, making the expected insertion time sum to a constant.
Together, these two observations guarantee that linear probing—despite being simple to implement—achieves constant expected insertion and query time when the load factor is kept as a small constant (like α = 1/5).