Lecture 21-22 (04/27/2026) - Finish Counting Distinct Elements: Flajolet-Martin Factor-32, Epsilon; Online Algorithms
Scribes: Juno Joseph, Raquel Gelbtuch, and Olivia Xu
Topics covered:
- Review of the Distinct Elements Problem
- Idealized hashing algorithm using minimum hash values
- Limitations of infinite precision hashing
- Flajolet-Martin / HyperLogLog intuition
- Least Significant Bit position tracking
- Probabilistic analysis of trailing zeros
- Approximation guarantees for distinct counting
- Markov and Chebyshev bounds in algorithm analysis
- Introduction to Online Algorithms
- Competitive analysis
- Ski Rental problem
- Pizza Finding / Linear Search problem
Review: Distinct Elements Problem
The distinct elements problem asks the following question:
Given a stream of elements, how many distinct elements appear in the stream?
The challenge is that the stream may be extremely large, so storing every element may be too expensive. Instead, we want a randomized streaming algorithm that uses small space and gives a good approximation.
Idealized Algorithm Using Hashing
Suppose we have a perfectly random hash function:
$$h : U \to [0, 1]$$
which maps every distinct element of the universe $U$ uniformly to a real number in the interval $[0, 1]$.
The algorithm is:
- Hash every distinct element.
- Maintain the smallest hash value observed.
- Output $\frac{1}{m} - 1$, where $m$ is the smallest hash value observed.
Why Does This Work?
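Suppose there are $d$ distinct elements.
If we throw $d$ random points uniformly into the interval $[0, 1]$, then the expected value of the smallest point is:
$$\mathbb{E}[\min] = \frac{1}{d + 1}$$
For example:
- If $d = 1$, then $\mathbb{E}[\min] = \frac{1}{2}$.
- If $d = 2$, then $\mathbb{E}[\min] = \frac{1}{3}$.
So:
$$d \approx \frac{1}{\min} - 1$$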
This motivates the estimator.
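The estimator can be tried out with a short Python sketch. It is only an illustration under stated assumptions: `unit_hash` is a hypothetical helper that uses SHA-256 as a stand-in for a perfectly random hash, and it returns a 64-bit fraction rather than a true real number. A single minimum is a high-variance statistic, so individual estimates can be far from the true count.

```python
import hashlib

def unit_hash(item, seed=0):
    """Hypothetical helper: map an item to a pseudo-random float in [0, 1]."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def min_hash_estimate(stream, seed=0):
    """Return 1/min - 1, where min is the smallest hash value seen."""
    smallest = 1.0
    for item in stream:
        smallest = min(smallest, unit_hash(item, seed))
    return 1.0 / smallest - 1.0

# 1000 distinct values, each repeated ten times.
stream = [i % 1000 for i in range(10_000)]

# One estimate per seed; repeats of an element never change the minimum.
estimates = [min_hash_estimate(stream, seed) for seed in range(5)]
print(estimates)
```

Because equal elements hash to the same value, duplicates in the stream have no effect on the output, which is exactly why the minimum is a function of the distinct elements only.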
Limitation of the Idealized Algorithm
This algorithm is not realistic because it hashes to real numbers in $[0, 1]$, which requires infinite precision. Therefore, we need a version that uses finite bit representations.
HyperLogLog / Flajolet-Martin Intuition
To mimic the idealized algorithm, we instead hash stream elements into integers.
Suppose:
$$h : U \to \{0, 1, \dots, n - 1\}$$
where $n$ is the size of the universe $U$, so each hash value fits in $\log_2 n$ bits.
For every arriving stream element $x$:
- Compute its hash value $h(x)$.
- Convert the hash into binary form.
- Find its least significant bit position: the first position from the right containing a $1$.
Positions are counted starting at $1$ from the right.
For example, in the bit string $10100$, the least significant $1$ occurs at position $3$.
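Finding the least significant bit position takes one bit trick. The sketch below assumes positions are counted from $1$ at the right end; `lsb_position` is a hypothetical helper name, not something from the lecture.

```python
def lsb_position(value):
    """Return the 1-indexed position (from the right) of the lowest set bit."""
    if value <= 0:
        raise ValueError("value must be a positive integer")
    # value & -value keeps only the lowest set bit, e.g. 0b10100 -> 0b00100.
    lowest_bit = value & -value
    # A power of two 2**k has bit_length k + 1, which is the 1-indexed position.
    return lowest_bit.bit_length()

print(lsb_position(0b10100))  # the lowest 1 sits in the third position from the right
```

An all-zero hash has no set bit, so a caller has to decide how to treat it; this sketch simply rejects it.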
Algorithm
The algorithm keeps track of $z$, where $z$ is the largest least-significant-bit position observed so far.
At the end of the stream, the algorithm outputs:
$$2^z$$
Why Does This Work?
The intuition comes from trailing zeros.
- About half of bit strings end in $1$.
- About one fourth end in $10$.
- About one eighth end in $100$.
In general, if the stream has $d$ distinct elements, the number of elements whose hash has least significant bit at position $r$ is about:
$$\frac{d}{2^r}$$
The largest $r$ we expect to see is when this quantity is roughly $1$:
$$\frac{d}{2^r} \approx 1$$
Therefore:
$$r \approx \log_2 d$$
So the algorithm outputs $2^z$ as its estimate for the number of distinct elements.
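A minimal sketch of this basic estimator follows. The helper `hash_to_int`, the 32-bit hash width, and the use of SHA-256 as a stand-in for a random hash are all assumptions made for illustration; positions are counted from $1$.

```python
import hashlib

def hash_to_int(item, seed=0, bits=32):
    """Hypothetical stand-in for a random hash: a `bits`-bit integer from SHA-256."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)

def fm_estimate(stream, seed=0):
    """Track z, the largest 1-indexed LSB position seen, and return 2**z."""
    z = 0
    for item in stream:
        h = hash_to_int(item, seed)
        if h == 0:
            continue  # an all-zero hash has no set bit; skipped in this sketch
        z = max(z, (h & -h).bit_length())
    return 2 ** z

# 500 distinct values, each repeated ten times.
stream = [i % 500 for i in range(5_000)]
estimate = fm_estimate(stream)
print(estimate)
```

The only state carried across the stream is the single counter `z`, which is what makes the approach attractive for streaming.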
Approximation Guarantee of the Basic Algorithm
Let $d$ be the true number of distinct elements in the stream. Let $z$ be the largest least significant bit position observed by the algorithm.
The output is $2^z$.
The first guarantee is that this basic algorithm gives a $32$-factor approximation with probability at least $\frac{15}{16}$:
$$\frac{d}{32} \le 2^z \le 32d$$
Although a factor of $32$ is not very accurate, this algorithm is useful as a building block for stronger algorithms such as HyperLogLog.
Why the Factor 32 Appears
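Ideally, we would like the counter $z$ to be close to $\log_2 d$, where $d$ is the true number of distinct elements.
The key claim is:
$$\log_2 d - 5 \le z \le \log_2 d + 5$$
with probability at least $\frac{15}{16}$.
If this holds, then raising $2$ to the power of each side gives:
$$2^{\log_2 d - 5} \le 2^z \le 2^{\log_2 d + 5}$$
Since $2^{\log_2 d - 5} = \frac{d}{32}$ and $2^{\log_2 d + 5} = 32d$, we get:
$$\frac{d}{32} \le 2^z \le 32d$$
So proving the algorithm is a $32$-approximation reduces to proving that $z$ is within $\pm 5$ of $\log_2 d$.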
Random Variables for the Analysis
For each position $r$, define $Y_r$ to be the number of distinct elements whose least significant bit occurs exactly at position $r$.
For example:
- $Y_1$ counts elements whose hash ends in $1$.
- $Y_2$ counts elements whose hash ends in $10$.
- $Y_3$ counts elements whose hash ends in $100$.
For a fixed element:
$$\Pr[\text{least significant bit is at position } r] = \frac{1}{2^r}$$
Therefore, with $d$ distinct elements:
$$\mathbb{E}[Y_r] = \frac{d}{2^r}$$
We also define $Y_{>r}$ to be the number of elements whose least significant bit occurs at a position greater than $r$, meaning farther left than position $r$.
Using a geometric series:
$$\mathbb{E}[Y_{>r}] = \sum_{j > r} \frac{d}{2^j} = \frac{d}{2^r}$$
The important intuition is that these probabilities decrease geometrically as we move farther left in the bit representation.
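These probabilities can be checked exactly by enumerating every hash value of a small fixed width. The sketch below assumes 10-bit hashes and 1-indexed positions; with finitely many bits the tail probability falls short of $2^{-r}$ only by the mass of the all-zero string, which has no set bit.

```python
from fractions import Fraction

BITS = 10  # small enough to enumerate every possible hash value
total = 1 << BITS

# counts[r] = number of BITS-bit values whose lowest set bit is at position r.
counts = {}
for value in range(1, total):
    r = (value & -value).bit_length()  # 1-indexed LSB position
    counts[r] = counts.get(r, 0) + 1

# Exact probability that a uniform hash has its LSB at position r.
exact = {r: Fraction(counts[r], total) for r in counts}

# Exact probability that the LSB position is greater than r.
tail = {r: sum(exact[j] for j in exact if j > r) for r in range(BITS)}

for r in range(1, 5):
    print(r, exact[r], tail[r])
```

Multiplying either probability by the number of distinct elements gives the corresponding expectation by linearity.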
Bad Events
The algorithm fails to give a $32$-approximation if $z$, the largest least significant bit position observed, is not within $\pm 5$ of $\log_2 d$, where $d$ is the true number of distinct elements.
There are two possible bad events:
- Overestimate: the algorithm sees a least significant bit position that is too far left.
- Underestimate: the algorithm never sees a least significant bit position far enough left.
Let $a = \log_2 d + 5$ and $b = \log_2 d - 5$.
The overestimate event occurs if at least one element has least significant bit position greater than $a$.
The underestimate event occurs if no element has least significant bit position greater than $b$.
Bounding the Overestimate Event Using Markov’s Inequality
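We want to bound the probability that at least one element has least significant bit position greater than $a = \log_2 d + 5$:
$$\Pr[Y_{>a} \ge 1]$$
First:
$$\mathbb{E}[Y_{>a}] = \frac{d}{2^a}$$
Since $a = \log_2 d + 5$:
$$\mathbb{E}[Y_{>a}] = \frac{d}{2^{\log_2 d + 5}}$$
Using $2^{\log_2 d + 5} = 32d$:
$$\mathbb{E}[Y_{>a}] = \frac{d}{32d} = \frac{1}{32}$$
Now apply Markov’s inequality, which holds because $Y_{>a}$ is nonnegative:
$$\Pr[Y_{>a} \ge t] \le \frac{\mathbb{E}[Y_{>a}]}{t}$$
Using $t = 1$ and $\mathbb{E}[Y_{>a}] = \frac{1}{32}$:
$$\Pr[Y_{>a} \ge 1] \le \frac{1}{32}$$
So the probability of a large overestimate is at most $\frac{1}{32}$.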
Bounding the Underestimate Event Using Chebyshev’s Inequality
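The underestimate event happens when $Y_{>b} = 0$, where $b = \log_2 d - 5$.
To bound this, we use Chebyshev’s inequality:
$$\Pr\big[\,|Y - \mathbb{E}[Y]| \ge t\,\big] \le \frac{\mathrm{Var}[Y]}{t^2}$$
Write $Y_r$ as a sum of indicator random variables:
$$Y_r = \sum_{i=1}^{d} X_i$$
where:
$$X_i = \begin{cases} 1 & \text{if element } i \text{ has least significant bit at position } r \\ 0 & \text{otherwise} \end{cases}$$
Each $X_i$ is Bernoulli with $p = 2^{-r}$.
For a Bernoulli random variable with probability $p$:
$$\mathrm{Var}[X_i] = p(1 - p)$$
Since $1 - p \le 1$:
$$\mathrm{Var}[X_i] \le p = \mathbb{E}[X_i]$$
Because the hash values are independent:
$$\mathrm{Var}[Y_r] = \sum_{i=1}^{d} \mathrm{Var}[X_i] \le \sum_{i=1}^{d} \mathbb{E}[X_i]$$
Thus:
$$\mathrm{Var}[Y_r] \le \mathbb{E}[Y_r]$$
The same reasoning applies to $Y_{>b}$.
For $Y_{>b}$, we first compute $\mathbb{E}[Y_{>b}]$ to see what Chebyshev gives us.
Each position $j$ contributes $\frac{d}{2^j}$ in expectation, so:
$$\mathbb{E}[Y_{>b}] = \sum_{j > b} \frac{d}{2^j} = \frac{d}{2^b}$$
Substituting $b = \log_2 d - 5$:
$$\mathbb{E}[Y_{>b}] = \frac{d}{2^{\log_2 d - 5}} = 32$$
The last step uses $2^{\log_2 d - 5} = \frac{d}{32}$, which gives $\frac{d}{d/32} = 32$.
Now apply Chebyshev. The underestimate event requires $Y_{>b}$ to deviate from its mean by at least $32$:
$$\Pr[Y_{>b} = 0] \le \Pr\big[\,|Y_{>b} - 32| \ge 32\,\big] \le \frac{\mathrm{Var}[Y_{>b}]}{32^2} \le \frac{32}{32^2} = \frac{1}{32}$$
In essence:
$$\Pr[Y_{>b} = 0] \le \frac{1}{\mathbb{E}[Y_{>b}]} = \frac{1}{32}$$
So the probability of the underestimate event is at most $\frac{1}{32}$.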
Combining the Bad Events
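The algorithm fails only if either the overestimate event or the underestimate event occurs.
Using the union bound:
$$\Pr[\text{fail}] \le \Pr[\text{overestimate}] + \Pr[\text{underestimate}]$$
Substituting the bounds:
$$\Pr[\text{fail}] \le \frac{1}{32} + \frac{1}{32}$$
Since $\frac{1}{32} + \frac{1}{32} = \frac{1}{16}$:
$$\Pr[\text{fail}] \le \frac{1}{16}$$
Therefore, the algorithm gives a $32$-approximation with probability at least $\frac{15}{16}$.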
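The arithmetic behind the two bounds can be replayed with exact fractions. The value $d = 2^{10}$ below is an arbitrary example chosen so that $\log_2 d$ is an integer; the variable names are illustrative only.

```python
from fractions import Fraction

d = 2 ** 10                    # example number of distinct elements
log_d = d.bit_length() - 1     # log2(d) = 10 for a power of two
a, b = log_d + 5, log_d - 5    # thresholds for the two bad events

# Overestimate: Markov with threshold 1 on Y_{>a}.
mean_over = Fraction(d, 2 ** a)        # E[Y_{>a}] = d / 2^a
markov_bound = mean_over / 1

# Underestimate: Chebyshev on Y_{>b}, using Var <= E and deviation t = E.
mean_under = Fraction(d, 2 ** b)       # E[Y_{>b}] = d / 2^b
variance_cap = mean_under              # Var[Y_{>b}] <= E[Y_{>b}]
chebyshev_bound = variance_cap / mean_under ** 2

# Union bound over the two bad events.
failure_bound = markov_bound + chebyshev_bound
print(markov_bound, chebyshev_bound, failure_bound)
```

Both bounds are independent of $d$, since $d$ cancels in each expectation.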
Improving the Algorithm
The basic estimator is not accurate enough by itself. A factor of $32$ is too large for most applications.
Improving the algorithm happens in two stages: first reduce the failure probability, then shrink the approximation factor from $32$ down to $1 \pm \varepsilon$.
Boosting the Success Probability
To push the success probability from $\frac{15}{16}$ to $1 - \delta$, run $O(\log(1/\delta))$ independent copies of the basic estimator in parallel and take the median of their outputs. A Chernoff-bound argument shows the median is a $32$-approximation with probability at least $1 - \delta$. This is the same boosting technique from earlier in the course and works as a black box: you do not need to know the algorithm’s internals to apply it.
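The median trick is short to sketch. In the code below, different seeds play the role of independent hash functions, SHA-256 is a stand-in for a random hash, and the default of 15 copies is an arbitrary choice for illustration. A real streaming implementation would update all copies in one pass rather than re-reading a stored list.

```python
import hashlib
import statistics

def fm_estimate(stream, seed):
    """Basic estimator: 2**z for the largest 1-indexed LSB position z."""
    z = 0
    for item in stream:
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        h = int.from_bytes(digest[:4], "big")  # 32-bit hash value
        if h:
            z = max(z, (h & -h).bit_length())
    return 2 ** z

def boosted_estimate(stream, copies=15):
    """Run `copies` seeded estimators and return the median of their outputs."""
    estimates = [fm_estimate(stream, seed) for seed in range(copies)]
    return statistics.median(estimates)

distinct = 2_000
stream = [i % distinct for i in range(3 * distinct)]
estimate = boosted_estimate(stream)
print(estimate)
```

The median is bad only when at least half of the copies are bad, which is why a modest number of copies already drives the failure probability down sharply.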
Improving the Approximation Factor
Getting a $(1 \pm \varepsilon)$-approximation takes a different idea. The starting observation is: if the stream has very few distinct elements, we do not need any clever algorithm. We can just store them all directly.
Algorithm for Small Sets (AFSS): Maintain an array of size $O(1/\varepsilon^2)$. As elements arrive, add each new distinct element to the array. Once the array is full, stop tracking new elements. When queried, output the exact count of distinct elements stored.
AFSS is perfectly accurate whenever the number of distinct elements is at most the array size, and it uses $O\!\left(\frac{1}{\varepsilon^2} \log n\right)$ bits (each element from a universe of size $n$ takes $\log_2 n$ bits).
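AFSS is simple enough to write down directly. The class name `SmallSetCounter` and its methods are hypothetical, and a Python set stands in for the fixed-size array described above.

```python
class SmallSetCounter:
    """Exact distinct counter that stops accepting new items once it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = set()  # a set stands in for the array in the notes

    def add(self, item):
        # Store the item only if it is new and there is still room.
        if item not in self.items and len(self.items) < self.capacity:
            self.items.add(item)

    def count(self):
        """Number of distinct items stored (exact while not full)."""
        return len(self.items)

    def is_full(self):
        return len(self.items) >= self.capacity

counter = SmallSetCounter(capacity=4)
for item in ["a", "b", "a", "c", "b"]:
    counter.add(item)
print(counter.count(), counter.is_full())
```

Once `is_full()` returns true the count is only a lower bound, so a caller should treat a full counter as "too many to track" rather than as an exact answer.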
The Combined Algorithm
The full algorithm runs A1 (the basic 32-approximation algorithm) and $O(\log n)$ copies of AFSS in parallel:
- A1 sees the entire stream and tracks the global maximum LSB position.
- AFSS[0], AFSS[1], …, AFSS[$\log_2 n$] each see only a filtered portion of the stream.
For each arriving element $x$:
- Hash $x$ and find its LSB position $j$.
- Feed $x$ to A1 (which updates its maximum LSB counter).
- Feed $x$ to AFSS[$j$] only, not to any other copy.
The filtering is the key idea. If the stream has $d$ distinct elements, about $\frac{d}{2^j}$ of them hash to LSB position $j$. The copy that matters most is AFSS[$z$], where $z$ is the position A1 currently reports. At that level the number of distinct elements feeding into AFSS[$z$] is roughly a small constant, well within the array’s capacity.
Query: When asked for an estimate:
- Ask A1 for the current maximum LSB position $z$.
- Ask AFSS[$z$] for its distinct element count $c$.
- Output $c \cdot 2^z$.
This output makes sense for the same reason A1’s estimator does: $c$ counts the elements at the rarest hashing level, and each such element represents $2^z$ distinct elements in expectation. The difference is that AFSS gives an exact count rather than the rough estimate A1 would produce.
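The update and query steps can be sketched as one small class. This follows only the bookkeeping described above and is not a tuned implementation: the class name `CombinedCounter`, the 32 levels, the capacity of 64, and SHA-256 as the hash are all assumptions. Level 0 is used here for the rare all-zero hash, which has no set bit.

```python
import hashlib

class CombinedCounter:
    """A1 plus one small-set counter per LSB level, as described in the notes."""

    def __init__(self, levels=32, capacity=64, seed=0):
        self.levels = levels
        self.capacity = capacity
        self.seed = seed
        self.z = 0  # A1: largest LSB position seen so far
        self.small_sets = [set() for _ in range(levels + 1)]  # AFSS[0..levels]

    def _lsb_position(self, item):
        digest = hashlib.sha256(f"{self.seed}:{item}".encode()).digest()
        h = int.from_bytes(digest, "big") & ((1 << self.levels) - 1)
        return (h & -h).bit_length()  # 1-indexed; 0 only for an all-zero hash

    def add(self, item):
        j = self._lsb_position(item)
        self.z = max(self.z, j)           # feed the item to A1
        bucket = self.small_sets[j]       # feed the item to AFSS[j] only
        if item not in bucket and len(bucket) < self.capacity:
            bucket.add(item)

    def query(self):
        c = len(self.small_sets[self.z])  # exact count at the level A1 reports
        return c * 2 ** self.z

counter = CombinedCounter()
for x in (i % 1000 for i in range(5_000)):
    counter.add(x)
print(counter.z, counter.query())
```

Each element lands in exactly one bucket, so the total storage is bounded by the number of levels times the capacity regardless of the stream length.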
Space Complexity
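Each AFSS copy uses $O\!\left(\frac{1}{\varepsilon^2} \log n\right)$ bits, and there are $O(\log n)$ copies:
$$O\!\left(\frac{\log^2 n}{\varepsilon^2}\right) \text{ bits}$$
With probability boosting to $1 - \delta$, multiply by $O(\log(1/\delta))$. A result by Jelani Nelson and others later showed the space can be reduced to $O\!\left(\frac{1}{\varepsilon^2} + \log n\right)$, which is also optimal.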
Introduction to Online Algorithms
An online algorithm is an algorithm that must make decisions without knowing the future.
This is different from an offline algorithm, which has access to the entire input in advance.
Online vs. Offline Algorithms
An offline algorithm sees the whole input before making decisions.
An online algorithm receives the input piece by piece and must make decisions immediately.
Once an online algorithm makes a decision, it usually cannot go back and change it.
Why Online Algorithms Matter
Online algorithms appear in many real-world settings, including:
- renting or buying equipment,
- caching web pages,
- scheduling jobs,
- assigning resources,
- accepting or rejecting requests,
- making financial decisions without knowing future prices.
Competitive Analysis
Since online algorithms do not know the future, we compare them to an ideal offline algorithm.
The offline algorithm knows the entire input in advance and can make the best possible decisions. This ideal algorithm is called $\mathrm{OPT}$.
The performance of an online algorithm is measured using the competitive ratio.
An online algorithm $\mathrm{ALG}$ is $c$-competitive if:
$$\mathrm{ALG}(\sigma) \le c \cdot \mathrm{OPT}(\sigma)$$
for every possible input sequence $\sigma$.
The smaller the competitive ratio, the better the online algorithm.
The Ski Rental Problem
The ski rental problem is a classic example of online algorithms.
Suppose renting skis costs $1$ per day, while buying skis costs $B$. The skier does not know how many days they will ski.
The question is: when should the skier stop renting and buy skis?
Offline Optimal Solution
If the skier knew in advance that they would ski for $k$ days, the optimal offline algorithm would choose:
$$\mathrm{OPT}(k) = \min(k, B)$$
This means:
- If $k < B$, it is cheaper to rent every day.
- If $k \ge B$, buying is at least as cheap as renting.
Online Strategy
A natural online strategy is: rent for the first $B - 1$ days, then buy on day $B$.
The online algorithm does not know whether skiing will stop early or continue for a long time, so it keeps renting until the total rent paid is about to reach the buying cost.
Cost of the Online Algorithm
If skiing stops before day $B$ (that is, after $k < B$ days), the online algorithm only rents, so its cost is $k$, which matches the optimal offline cost.
If skiing continues for at least $B$ days, the online algorithm rents for $B - 1$ days and then buys:
$$\mathrm{ALG} = (B - 1) + B = 2B - 1$$
The offline algorithm would have bought immediately, paying $B$. Thus:
$$\frac{\mathrm{ALG}}{\mathrm{OPT}} = \frac{2B - 1}{B} = 2 - \frac{1}{B} < 2$$
So this online strategy is $2$-competitive.
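The case analysis can be checked by brute force over small instances. The sketch assumes rent costs $1$ per day, the purchase price is `buy_price`, and the online rule rents for `buy_price - 1` days before buying; the function names are illustrative.

```python
def offline_cost(days, buy_price):
    """Cost of the optimal offline choice: rent throughout or buy on day one."""
    return min(days, buy_price)

def online_cost(days, buy_price):
    """Rent for buy_price - 1 days, then buy if skiing continues."""
    if days < buy_price:
        return days                       # season ended while still renting
    return (buy_price - 1) + buy_price    # rent paid so far plus the purchase

# Worst ratio over a grid of purchase prices and season lengths.
worst_ratio = max(
    online_cost(k, B) / offline_cost(k, B)
    for B in range(1, 21)
    for k in range(1, 101)
)
print(worst_ratio)
```

The worst case on this grid occurs at the largest purchase price tested, in line with the ratio $2 - \frac{1}{B}$ growing toward $2$ as $B$ increases.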
Pizza Finding / Linear Search Problem
The pizza finding problem is an example of an online search problem.
Imagine standing at a starting point on a long road. There is a pizza shop somewhere on the road, but we do not know whether it is to the left or to the right, and we do not know how far away it is.
The goal is to find the pizza shop while walking as little total distance as possible.
Why This Is an Online Problem
This is an online problem because the algorithm does not know the location of the pizza shop ahead of time. It must decide which direction to search and how far to walk without knowing whether that choice is correct.
An offline algorithm that already knows where the pizza shop is would walk directly to it. If the pizza shop is distance $D$ away, then $\mathrm{OPT} = D$.
Doubling Strategy
A strong online strategy is the doubling trick. The algorithm searches in alternating directions, doubling the search distance each time:
$$1, 2, 4, 8, \dots$$
For example:
- Go distance $1$ to the right, then return.
- Go distance $2$ to the left, then return.
- Go distance $4$ to the right, then return.
- Go distance $8$ to the left, then return.
- Continue alternating directions and doubling the distance.
This guarantees that eventually the algorithm searches far enough in the correct direction to find the pizza shop.
Competitive Ratio
Suppose the pizza shop is distance $D$ from the starting point. The offline optimal algorithm pays $D$.
The online algorithm may waste distance searching in the wrong direction and returning to the starting point. However, because the search distance doubles each time, the total wasted distance is bounded by a constant factor of $D$. Let’s actually work out that constant.
Setting up the worst case
Number the rounds $i = 0, 1, 2, \dots$ where round $i$ walks distance $2^i$ in its search direction and then returns to the start. Directions alternate, so (say) even-numbered rounds go right and odd-numbered rounds go left.
Assume WLOG the pizza is to the right at distance $D$. We finally find the pizza in some round $j$ where:
- $j$ is even (it’s a right-direction round), and
- $2^j \ge D$ (we walk far enough this time to reach the pizza), but
- $2^{j-2} < D$ (the previous right-direction round didn’t go far enough)
The adversary picks $D$ to make us as inefficient as possible. The worst $D$ is just barely larger than $2^{j-2}$: if it were any smaller, we would’ve already found the pizza in round $j - 2$ and not paid for rounds $j - 1$ and $j$.
Computing total cost
The algorithm’s total walking distance breaks into two parts:
- Rounds $0$ through $j - 1$: each round $i$ is a complete round trip of length $2 \cdot 2^i = 2^{i+1}$ (walk out, walk back). Summing the geometric series:
$$\sum_{i=0}^{j-1} 2^{i+1} = 2^{j+1} - 2$$
- Round $j$: we walk distance $D$ to the right and stop when we hit the pizza (no return needed). This contributes $D$.
So the total cost is:
$$\mathrm{ALG} = 2^{j+1} - 2 + D$$
Taking the ratio
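With the pizza at distance $D$ and found in round $j$, the competitive ratio is:
$$\frac{\mathrm{ALG}}{\mathrm{OPT}} = \frac{2^{j+1} - 2 + D}{D} = 1 + \frac{2^{j+1} - 2}{D}$$
Adversary’s worst-case choice: $D$ just above $2^{j-2}$ (just barely past the previous right-direction walk, so we’re forced through rounds $j - 1$ and $j$). Substituting $D \approx 2^{j-2}$:
$$\frac{\mathrm{ALG}}{\mathrm{OPT}} \approx 1 + \frac{2^{j+1} - 2}{2^{j-2}} = 9 - 2^{3-j}$$
As $j$ grows (i.e., for far-away pizza shops), the correction $2^{3-j}$ vanishes and the ratio approaches:
$$9$$
So the doubling strategy is $9$-competitive:
$$\mathrm{ALG} \le 9 \cdot \mathrm{OPT}$$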
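The bound can be checked numerically by simulating the walk. The sketch assumes rounds are numbered from $0$, round $i$ reaches distance $2^i$, even rounds go right, and the shop is at distance at least $1$; `doubling_search_cost` is a hypothetical helper name.

```python
def doubling_search_cost(distance, on_right=True):
    """Total distance walked by the doubling strategy until the shop is reached."""
    total = 0
    i = 0
    while True:
        reach = 2 ** i
        going_right = (i % 2 == 0)
        if going_right == on_right and reach >= distance:
            return total + distance   # final leg: walk out and stop at the shop
        total += 2 * reach            # failed round: walk out and come back
        i += 1

# Shops just past a turning point are the expensive cases.
for shop in (1.0001, 4.0001, 16.0001):
    print(shop, doubling_search_cost(shop) / shop)

# Largest ratio over a grid of distances on both sides of the start.
worst_ratio = max(
    doubling_search_cost(1 + 0.01 * t, side) / (1 + 0.01 * t)
    for t in range(10_000)
    for side in (True, False)
)
print(worst_ratio)
```

On this grid the ratio stays below $9$ and creeps toward it only for shops placed just beyond a large power of two.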
Main Idea
The important lesson is that even without knowing where the target is, the online algorithm can still stay within a constant factor of the optimal offline solution by expanding its search exponentially.
This is why the doubling trick is useful in online algorithms: it prevents the algorithm from wasting too much time on small searches while still guaranteeing that the target will eventually be found.
Main Takeaways
- The distinct elements problem asks us to estimate the number of unique items in a stream.
- The idealized minimum-hash algorithm estimates the number of distinct elements using the smallest hash value, but it requires infinite precision.
- The practical bit-based approach hashes elements to binary strings and tracks the largest least significant bit position.
- Seeing many trailing zeros is rare, so it suggests that many distinct elements have appeared.
- The basic estimator outputs $2^z$, where $z$ is the largest least significant bit position observed.
- The basic estimator gives a $32$-approximation with probability at least $\frac{15}{16}$.
- Markov’s inequality is used to bound the probability of overestimating.
- Chebyshev’s inequality is used to bound the probability of underestimating.
- HyperLogLog improves the basic estimator by using multiple estimates and combining them.
- Online algorithms make decisions without knowing the future.
- Competitive analysis compares an online algorithm to the optimal offline algorithm.
- The ski rental problem is a classic example, and the rent-then-buy strategy is $2$-competitive.
- The pizza finding problem is an online linear search problem, and the doubling strategy gives a $9$-competitive algorithm.