Lecture 20 (04/22/2026) - Finish Counting Distinct Elements; Uniform RV; Introduce Flajolet-Martin
Scribes: Malaika Khan and Saartaj Alam
Summary of the Lecture
- The idealized algorithm for the counting distinct elements problem
- Properties of uniform random variables and how they can be used in the distinct elements problem
- Analysis of the Flajolet-Martin algorithm
Counting Distinct Elements
Problem: Given a stream $x_1, x_2, \ldots, x_n$ with each $x_i$ drawn from some universe $U$, output (approximately) how many distinct elements have appeared in the stream so far.
- Real World Application: Imagine a cell tower with users connecting and disconnecting - how many people have connected to the cell tower?
- We will first look at an idealized algorithm and then a practical algorithm!
Ideal Algorithm
Use a random hash function $h$ that maps each stream element to $[0,1]$, where $[0,1]$ is the continuous interval.
When we say the hash function is “random,” we mean that for any given input, its output value is equally likely to be any point in $[0,1]$, like throwing a dart at the interval at random. But the hash function is also consistent: if element 5 hashes to 0.31, it will always hash to 0.31 no matter how many times it appears in the stream. The randomness is in how the function was chosen at the start, not in how it evaluates each time.
- We hash each $x_i$ to get $h(x_i)$
- We only maintain the smallest hash value we got so far, i.e., $Y = \min_i h(x_i)$
- To get the number of distinct elements so far, we output $\frac{1}{Y} - 1$
To see why the output makes sense: suppose the stream contains only 2 distinct elements (say, 1 and 8 appearing repeatedly). Since repeated elements always hash to the same value, you only ever compute 2 distinct hash values throughout the entire stream. The expected minimum of 2 uniform variables is $\frac{1}{3}$, so plugging this value into the output formula $\frac{1}{Y} - 1$ (with $Y$ the smallest hash value) gives $\frac{1}{1/3} - 1 = 2$. More generally, for $d$ distinct elements, the expected minimum hash value is $\frac{1}{d+1}$, and plugging it in gives $(d+1) - 1 = d$. The proof below establishes this.
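The idealized procedure can be simulated with a short sketch. A true random hash into $[0,1]$ cannot be implemented, so this sketch assumes a memo table (`hash_values`, a hypothetical name) that assigns each new element a fresh uniform value and reuses it on repeats. The table itself uses memory proportional to the number of distinct elements, so this only illustrates the estimator and is not a space-efficient implementation.

```python
import random

def ideal_distinct_estimate(stream, seed=0):
    """Simulate the idealized estimator: return 1 / (smallest hash) - 1."""
    rng = random.Random(seed)
    hash_values = {}   # stand-in for the random hash function h (assumption)
    smallest = 1.0     # with no elements seen, the estimate is 1/1 - 1 = 0
    for x in stream:
        if x not in hash_values:
            # First time we see x: fix its hash value once and for all.
            hash_values[x] = rng.random()
        smallest = min(smallest, hash_values[x])
    return 1.0 / smallest - 1.0

# Repeats do not change the estimate, because they reuse the same hash value.
print(ideal_distinct_estimate([1, 8, 1, 8, 1], seed=3))
print(ideal_distinct_estimate([1, 8], seed=3))
```

A single run can be far from the true count, since it depends on one random minimum; the analysis that follows concerns the expected value of that minimum.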
To analyze this algorithm, we need to understand how the minimum of several uniform random variables behaves, since each hash value acts like a uniform random variable. The following sections build up that understanding.
Continuous Random Variable
A continuous random variable is a random variable that takes values in a continuum.
- It can take on any value in a specified interval. Because there are uncountably many possible values and the total probability must still be 1, each individual value has probability 0; only intervals of values carry positive probability.
- We can define a continuous random variable by its density or its distribution. The density $f(x)$ captures the probability that the random variable $X$ lies in a small interval around $x$: $\Pr[x \le X \le x + dx] \approx f(x)\,dx$.
Uniform Random Variable
A uniform random variable is a continuous random variable where $X$ can take any value in $[0,1]$, with every part of the interval equally likely. This equal likelihood is what the density $f(x) = 1$ for all $x \in [0,1]$ encodes: no region is favored over any other of equal length.
- Density: $f(x) = 1$ for $x \in [0,1]$, and $f(x) = 0$ otherwise
- Distribution: $F(x) = \Pr[X \le x] = x$ for $x \in [0,1]$
Expectation of Uniform Random Variables
The general formula is $E[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx$.
- For our discussed uniform random variable, $E[X] = \int_0^1 x \cdot 1\,dx = \frac{1}{2}$
- If there are two uniform random variables, or you threw two darts on the interval $[0,1]$, then the expected value of the smaller point is $\frac{1}{3}$ and the expected value of the larger point is $\frac{2}{3}$. These can be rewritten as $\frac{1}{2+1}$ and $\frac{2}{2+1}$.
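The two-dart claim can be checked empirically. The following is a minimal Monte Carlo sketch; the function name, trial count, and seed are illustrative choices, not part of the lecture.

```python
import random

def mean_min_and_max(trials=200_000, seed=0):
    """Average the smaller and larger of two uniform [0, 1) draws."""
    rng = random.Random(seed)
    total_min = 0.0
    total_max = 0.0
    for _ in range(trials):
        a, b = rng.random(), rng.random()
        total_min += min(a, b)
        total_max += max(a, b)
    return total_min / trials, total_max / trials

smaller, larger = mean_min_and_max()
print(smaller, larger)  # should land close to 1/3 and 2/3
```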
Proof that the Output is Correct
We showed the output was $\frac{1}{Y} - 1$. We claim that if $Y = \min_i h(x_i)$ is the smallest hash value, then $E[Y] = \frac{1}{d+1}$.
Proof: Let $d$ = number of distinct elements.
A key observation: repeated elements always hash to the same value, so they don’t produce new hash values. Even if an element appears hundreds of times in the stream, it contributes exactly one hash value to the pool. This means that across the entire stream you compute exactly $d$ distinct hash values, one for each distinct element, which is why the product below runs to $d$ rather than to the stream length $n$.
We want to compute $E[Y]$. One useful formula for the expectation of a non-negative random variable is $E[Y] = \int_0^\infty \Pr[Y > t]\,dt$, which is the continuous analog of the discrete formula $E[Y] = \sum_{k=1}^{\infty} \Pr[Y \ge k]$.
So it suffices to compute $\Pr[Y > t]$.
If the minimum of all hash values is greater than $t$, that means every hash value is greater than $t$. Writing $h_1, \ldots, h_d$ for the $d$ distinct hash values:
$$\Pr[Y > t] = \Pr[h_1 > t, \ldots, h_d > t]$$
The hash values are independent (each element’s hash is chosen independently), so the probability that all are above $t$ is the product of the individual probabilities. For any one uniform random variable $X$ and $t \in [0,1]$, $\Pr[X > t] = 1 - t$ (since the length of the interval to the right of $t$ is $1 - t$). Multiplying $1 - t$ by itself $d$ times gives $\Pr[Y > t] = \prod_{j=1}^{d} \Pr[h_j > t] = (1-t)^d$. Substituting into $E[Y] = \int_0^1 \Pr[Y > t]\,dt$ (the integrand is 0 for $t > 1$):
$$E[Y] = \int_0^1 (1-t)^d\,dt = \left[-\frac{(1-t)^{d+1}}{d+1}\right]_0^1 = \frac{1}{d+1}$$
Therefore $E[Y] = \frac{1}{d+1}$.
We can also substitute this expected value for $Y$ in the previously mentioned output, which will return the number of distinct elements:
$$\frac{1}{E[Y]} - 1 = (d + 1) - 1 = d$$
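As a sanity check on the integral, the value of $\int_0^1 (1-t)^d\,dt$ can be approximated numerically and compared against $\frac{1}{d+1}$. This is a small sketch using the midpoint rule; the function name and step count are illustrative assumptions.

```python
def expected_min_numeric(d, steps=100_000):
    """Midpoint-rule approximation of the integral of (1 - t)^d over [0, 1]."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * width        # midpoint of the i-th subinterval
        total += (1.0 - t) ** d      # Pr[min of d uniforms > t]
    return total * width

for d in (1, 2, 5, 10):
    approx = expected_min_numeric(d)
    # Compare with the closed form 1/(d+1), then plug into 1/E[Y] - 1.
    print(d, approx, 1 / (d + 1), 1 / approx - 1)
```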
Why “Ideal”?
The algorithm is called “ideal” because the hash function it requires, one that maps to the continuous interval $[0,1]$, cannot actually be implemented. Storing a real number with full precision would require infinitely many bits, which defeats the purpose of a space-efficient streaming algorithm. The Flajolet-Martin algorithm described next resolves this by hashing to integers and working with their bit representations instead.
Flajolet-Martin Algorithm
The Algorithm
Here is an overview of the algorithm:
- Initialize a counter $z = 0$.
- Choose a hash function $h$ that maps elements to integers from $0$ to $2^L - 1$, i.e., to $L$-bit integers.
- For each element $x$ in the stream:
  - Compute $h(x)$
  - Represent $h(x)$ in binary notation
  - Find the position $r$ of the least significant bit in this bit vector
  - Keep track of the maximum $r$ encountered so far: $z = \max(z, r)$
- After all elements are processed, output $2^{z+1}$
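The steps above can be sketched in a few lines. Several details here are assumptions rather than part of the lecture: the hash width `BITS = 32`, the use of SHA-256 (salted with a hypothetical `seed` argument) as a stand-in for a randomly chosen hash function, and the output rule $2^{z+1}$, which follows the heuristic that about $\frac{d}{2^{r+1}}$ elements have their rightmost 1-bit at position $r$.

```python
import hashlib

BITS = 32  # assumed hash width; not fixed by the notes

def hash_to_int(x, seed=0):
    """Map x to a BITS-bit integer; SHA-256 stands in for a random hash (assumption)."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:BITS // 8], "big")

def lsb_position(v):
    """Zero-indexed position of the rightmost 1-bit; BITS if v has no 1-bit."""
    if v == 0:
        return BITS
    return (v & -v).bit_length() - 1   # v & -v isolates the lowest set bit

def flajolet_martin(stream, seed=0):
    z = 0                              # largest LSB position seen so far
    for x in stream:
        z = max(z, lsb_position(hash_to_int(x, seed)))
    return 2 ** (z + 1)                # assumed output rule (see lead-in)

print(flajolet_martin(range(1000)))
print(flajolet_martin([1, 8, 1, 8]), flajolet_martin([1, 8]))  # repeats do not matter
```

Only the single integer `z` is kept between elements, which is the point of the construction: memory does not grow with the stream.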
The Least Significant Bit
The Least Significant Bit (LSB) position, as used here, refers to the position of the rightmost 1-bit, or equivalently the number of trailing zeros in the binary representation. (In standard terminology this is sometimes called the “lowest set bit” position.)
Example. Consider a bit vector such as $01101000$: reading positions from the right starting at 0, the rightmost 1 is at position 3, so the LSB position is 3.
The counter $z$ tracks the largest LSB position seen across all elements so far in the stream.
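One straightforward way to compute this position is to shift right until the last bit is 1, counting the shifts. The helper below is a sketch with a hypothetical name; it assumes a positive integer input, since zero has no 1-bit.

```python
def trailing_zeros(v):
    """Count trailing zeros of a positive integer by shifting right."""
    if v <= 0:
        raise ValueError("expects a positive integer")
    count = 0
    while (v & 1) == 0:   # last bit is 0, so the rightmost 1 is further left
        v >>= 1
        count += 1
    return count

for v in (0b1, 0b10, 0b1100, 0b01101000):
    print(format(v, "08b"), trailing_zeros(v))
```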
The Output
Since the hash function is chosen randomly, we can think of each hash value as a random bit vector: each bit is 0 or 1 with equal probability, like a coin flip.
If the stream has $d$ distinct elements, then we compute $d$ distinct hash values in total (repeated elements hash to the same value).
We want to find a pattern in the position of the LSB for these $d$ values. Since each bit in a random hash value is independently 0 or 1 with equal probability, we can count how many elements land at each LSB position:
- About $\frac{d}{2}$ of them end in $1$ (last bit is 1): LSB exactly at position 0.
- About $\frac{d}{4}$ of them end in $10$ (second-to-last bit is 1, last bit is 0): LSB exactly at position 1.
- About $\frac{d}{8}$ of them end in $100$: LSB exactly at position 2.
- In general, about $\frac{d}{2^{r+1}}$ elements have their LSB exactly at position $r$.
Each position is half as common as the one before it: the probability of ending in exactly $r$ zeros followed by a 1 is $\frac{1}{2^{r+1}}$.
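The halving pattern is easy to observe in simulation. In this sketch, drawing fresh random 32-bit values stands in for hashing $d$ distinct elements; the bit width, seed, and function name are illustrative assumptions.

```python
import random
from collections import Counter

def lsb_histogram(d, bits=32, seed=0):
    """Count how many of d random bit vectors have their rightmost 1 at each position."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(d):
        v = rng.getrandbits(bits)
        if v == 0:
            continue  # no 1-bit at all; vanishingly rare for 32 bits
        counts[(v & -v).bit_length() - 1] += 1
    return counts

d = 100_000
counts = lsb_histogram(d)
for r in range(6):
    # Observed count at position r next to the predicted d / 2^(r+1).
    print(r, counts[r], d / 2 ** (r + 1))
```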
As $r$ increases, the expected number of elements with LSB exactly at position $r$ gets smaller and smaller. There is a transition point where this expected count crosses 1:
- For positions well below the transition, many elements land there, so the running maximum will certainly rise at least that high.
- For positions well above the transition, essentially no elements land there, so the maximum won’t reach that far.
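The maximum therefore concentrates right near the transition, the largest $r$ where we still expect at least one element. With $d$ distinct elements the expected count at position $r$ is $\frac{d}{2^{r+1}}$; setting $\frac{d}{2^{r+1}} = 1$ and solving gives $d = 2^{r+1}$, which is exactly the output of the algorithm.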
It should be noted that the algorithm has a lot of variance: different random choices of the hash function can give noticeably different outputs on the same stream. In addition, our output is always a power of 2, so if the true answer is 6, the algorithm can never output that exactly.
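The spread can be seen by repeating the experiment with fresh randomness. This sketch models each run by drawing random bit vectors directly rather than hashing, and it assumes the output rule $2^{z+1}$ for the largest LSB position $z$; both are simplifying assumptions for illustration.

```python
import random

def simulated_output(d, rng, bits=32):
    """One run: draw d random bit vectors, return 2^(z+1) for the max LSB position z."""
    z = 0
    for _ in range(d):
        v = rng.getrandbits(bits)
        if v:  # skip the all-zero vector, which has no 1-bit
            z = max(z, (v & -v).bit_length() - 1)
    return 2 ** (z + 1)

def outputs_over_runs(d, runs=20, seed=0):
    """Repeat the experiment with independent randomness in each run."""
    rng = random.Random(seed)
    return [simulated_output(d, rng) for _ in range(runs)]

# With 6 distinct elements, every output is a power of 2, so 6 never appears.
print(sorted(outputs_over_runs(6)))
```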
Guarantees
The output of the algorithm is between $\frac{d}{32}$ and $32d$, where $d$ is the true number of distinct elements:
$$\frac{d}{32} \le \text{output} \le 32d$$
This is a factor-32 approximation: we can ensure the output is never more than a factor of 32 away from the true answer. The proof of this guarantee is deferred.
Improvements
We want to get from a factor-32 approximation down to a $(1 \pm \varepsilon)$ relative error approximation. The idea is to run many copies of the algorithm at different granularity levels, and then ask the appropriate copy for the answer.