Lecture 21-22 (04/27/2026) - Finish Counting Distinct Elements: Flajolet-Martin Factor-32, Epsilon; Online Algorithms | CSCI 328

Scribes: Juno Joseph, Raquel Gelbtuch, and Olivia Xu

Topics covered:

Review of the Distinct Elements Problem
Idealized hashing algorithm using minimum hash values
Limitations of infinite precision hashing
Flajolet-Martin / HyperLogLog intuition
Least Significant Bit position tracking
Probabilistic analysis of trailing zeros
Approximation guarantees for distinct counting
Markov and Chebyshev bounds in algorithm analysis
Introduction to Online Algorithms
Competitive analysis
Ski Rental problem
Pizza Finding / Linear Search problem

Review: Distinct Elements Problem

The distinct elements problem asks the following question:

Given a stream of elements, how many distinct elements appear in the stream?

The challenge is that the stream may be extremely large, so storing every element may be too expensive. Instead, we want a randomized streaming algorithm that uses small space and gives a good approximation.

Idealized Algorithm Using Hashing

Suppose we have a perfectly random hash function:

h(x) : x \rightarrow [0,1]

which maps every distinct element uniformly to a real number in the interval $[0,1]$ .

The algorithm is:

Hash every distinct element.
Maintain the smallest hash value observed.
Output:

\frac{1}{\min h(x)} - 1

Why Does This Work?

Suppose there are $T$ distinct elements.

If we throw $T$ random points uniformly into the interval $[0,1]$ , then:

E[\min h(x)] = \frac{1}{T+1}

For example:

If $T=2$ , then $E[\min h(x)] = \frac{1}{3}$ .
If $T=3$ , then $E[\min h(x)] = \frac{1}{4}$ .
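
The expectation above can be checked with a short calculation. This is a sketch of one standard derivation, assuming the $T$ hash values are independent and uniform on $[0,1]$ ; it uses the tail formula $E[X] = \int_0^1 \Pr(X > t)\,dt$ for a random variable $X$ taking values in $[0,1]$ :

```latex
\Pr(\min h(x) > t) = (1-t)^T \quad \text{for } t \in [0,1],
\qquad
E[\min h(x)] = \int_0^1 (1-t)^T \, dt = \frac{1}{T+1}
```

The first equality holds because the minimum exceeds $t$ exactly when all $T$ independent hash values do.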

So:

T \approx \frac{1}{\min h(x)} - 1

This motivates the estimator.
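
The estimator can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `idealized_estimate`, the `seed` parameter, and the dictionary that simulates a random hash are all hypothetical, and the dictionary stores every element, so the sketch does not have the small-space property the real algorithm needs.

```python
import random


def idealized_estimate(stream, seed=0):
    """Estimate the number of distinct elements as 1 / min h(x) - 1."""
    rng = random.Random(seed)
    hash_of = {}      # simulates a random hash; stores every element (illustration only)
    smallest = 1.0
    for x in stream:
        if x not in hash_of:
            # 1 - random() lies in (0, 1], which avoids a division by zero below
            hash_of[x] = 1.0 - rng.random()
        smallest = min(smallest, hash_of[x])
    return 1.0 / smallest - 1.0
```

Repeated elements reuse their stored hash value, so duplicates never change the estimate.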

Limitation of the Idealized Algorithm

This algorithm is not realistic because it hashes to real numbers in $[0,1]$ , which requires infinite precision. Therefore, we need a version that uses finite bit representations.

HyperLogLog / Flajolet-Martin Intuition

To mimic the idealized algorithm, we instead hash stream elements into integers.

Suppose:

h(x) \in \{0, 1, \dots, n-1\}

For every arriving stream element $x_i$ :

Compute its hash value.
Convert the hash into binary form.
Find its least significant bit position — the first position from the right containing a $1$ .

Positions are counted starting at $0$ from the right.

For example, in the bit string $1011000$ , the least significant $1$ occurs at position $3$ .
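
As a small illustration, the position can be computed with a standard bit trick. The helper name `lsb_position` is an assumption made for this sketch, not something from the lecture.

```python
def lsb_position(h: int) -> int:
    """Position (counted from 0 at the right) of the lowest 1 bit of h."""
    if h <= 0:
        raise ValueError("h must be a positive integer; 0 has no 1 bit")
    # h & -h isolates the lowest set bit; bit_length() - 1 gives its index
    return (h & -h).bit_length() - 1
```

For the example above, `lsb_position(0b1011000)` evaluates to `3`.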

Algorithm

The algorithm keeps track of $P$ , where $P$ is the largest least-significant-bit position observed so far.

At the end of the stream, the algorithm outputs:

2^{P+1}

Why Does This Work?

The intuition comes from trailing zeros.

About half of bit strings end in $1$ .
About one fourth end in $10$ .
About one eighth end in $100$ .

In general, the number of elements whose hash has least significant bit at position $P$ is about:

\frac{T}{2^{P+1}}

The largest $P$ we expect to see is when this quantity is roughly $1$ :

\frac{T}{2^{P+1}} \approx 1

Therefore:

T \approx 2^{P+1}

So the algorithm outputs $2^{P+1}$ as its estimate for the number of distinct elements.
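
A minimal Python sketch of this basic estimator follows. The names `hash_to_int`, `basic_estimate`, `salt`, and `bits` are hypothetical, and SHA-256 is used only as a convenient stand-in for a random hash function; the lecture does not specify a particular hash.

```python
import hashlib


def hash_to_int(x, salt=0, bits=32):
    """Stand-in for a random hash into {0, ..., 2**bits - 1} (assumption: SHA-256)."""
    digest = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)


def lsb_position(h):
    """Index of the lowest 1 bit of a positive integer."""
    return (h & -h).bit_length() - 1


def basic_estimate(stream, salt=0):
    """Track the largest LSB position P seen so far and output 2**(P + 1)."""
    P = -1                       # nothing seen yet
    for x in stream:
        h = hash_to_int(x, salt)
        if h != 0:               # an all-zero hash has no 1 bit; skip it
            P = max(P, lsb_position(h))
    return 2 ** (P + 1)
```

The only state kept across the stream is the single integer `P`, which is what makes the approach attractive for streaming. With this convention an empty stream yields the output 1.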

Approximation Guarantee of the Basic Algorithm

Let $T$ be the true number of distinct elements in the stream. Let $X_m$ be the largest least significant bit position observed by the algorithm.

The output is $2^{X_m+1}$ .

The first guarantee is that this basic algorithm gives a $32$ -factor approximation with probability at least $\frac{2}{3}$ :

\frac{T}{32} \leq 2^{X_m+1} \leq 32T

Although a factor of $32$ is not very accurate, this algorithm is useful as a building block for stronger algorithms such as HyperLogLog.

Why the Factor 32 Appears

Ideally, we would like the counter to be close to $\log T$ , where all logarithms are base $2$ .

The key claim is:

\log T - 5 \leq X_m + 1 \leq \log T + 5

with probability at least $\frac{2}{3}$ .

If this holds, then taking $2$ to the power of each side gives:

2^{\log T - 5} \leq 2^{X_m+1} \leq 2^{\log T + 5}

Since $2^{\log T} = T$ and $2^5 = 32$ , we get:

\frac{T}{32} \leq 2^{X_m+1} \leq 32T

So proving the algorithm is a $32$ -approximation reduces to proving that $X_m + 1$ is within $5$ of $\log T$ .

Random Variables for the Analysis

For each position $j$ , define $Z_j$ to be the number of distinct elements whose least significant bit occurs exactly at position $j$ .

For example:

$Z_0$ counts elements whose hash ends in $1$ .
$Z_1$ counts elements whose hash ends in $10$ .
$Z_2$ counts elements whose hash ends in $100$ .

For a fixed element:

\Pr(\text{least significant bit is at position } j) = \frac{1}{2^{j+1}}

Therefore:

E[Z_j] = \frac{T}{2^{j+1}}

We also define $Z_{>j}$ to be the number of elements whose least significant bit occurs at a position greater than $j$ — meaning farther left than position $j$ .

Summing $E[Z_k]$ over all $k > j$ gives a geometric series, which is cut off because the hash has only finitely many bits:

E[Z_{>j}] < \frac{T}{2^{j+1}}

The important intuition is that these probabilities decrease geometrically as we move farther left in the bit representation.
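
These probabilities can be verified exactly by brute force for a small hash length. The sketch below assumes a uniform hash over all 10-bit values; the names `lsb_distribution` and `tail` are made up for this illustration.

```python
from fractions import Fraction


def lsb_position(h):
    """Index of the lowest 1 bit of a positive integer."""
    return (h & -h).bit_length() - 1


def lsb_distribution(bits):
    """Exact Pr(LSB position = j) for a uniform hash in {0, ..., 2**bits - 1}."""
    counts = [0] * bits
    for h in range(1, 1 << bits):      # h = 0 has no 1 bit and is left out
        counts[lsb_position(h)] += 1
    return [Fraction(c, 1 << bits) for c in counts]


def tail(dist, j):
    """Exact Pr(LSB position > j)."""
    return sum(dist[j + 1:])


dist = lsb_distribution(10)
```

Each entry `dist[j]` equals $1/2^{j+1}$ , and `tail(dist, j)` falls short of $1/2^{j+1}$ by exactly $1/2^{10}$ , the probability of the all-zero hash.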

Bad Events

The algorithm fails to give a $32$ -approximation if $X_m + 1$ is not within $[\log T - 5,\ \log T + 5]$ .

There are two possible bad events:

Overestimate: the algorithm sees a least significant bit position that is too far left.
Underestimate: the algorithm never sees a least significant bit position far enough left.

Let $J_+ = \lfloor \log T \rfloor + 5$ and $J_- = \lfloor \log T \rfloor - 5$ .

The overestimate event occurs if at least one element has least significant bit position greater than $J_+$ .

The underestimate event occurs if no element has least significant bit position greater than $J_-$ .

Bounding the Overestimate Event Using Markov’s Inequality

We want to bound:

\Pr(Z_{>J_+} \geq 1)

First:

E[Z_{>J_+}] < \frac{T}{2^{J_+ + 1}}

Since $J_+ = \lfloor \log T \rfloor + 5$ :

E[Z_{>J_+}] < \frac{T}{2^{\lfloor \log T \rfloor + 6}}

Using $2^{\lfloor \log T \rfloor} \geq \frac{T}{2}$ :

E[Z_{>J_+}] \leq \frac{1}{32}

Now apply Markov’s inequality:

\Pr(X \geq a) \leq \frac{E[X]}{a}

Using $X = Z_{>J_+}$ and $a = 1$ :

\Pr(Z_{>J_+} \geq 1) \leq \frac{1}{32}

So the probability of a large overestimate is at most $\frac{1}{32}$ .

Bounding the Underestimate Event Using Chebyshev’s Inequality

The underestimate event happens when $Z_{>J_-} = 0$ .

To bound this, we use Chebyshev’s inequality:

\Pr(|X - E[X]| \geq a) \leq \frac{\mathrm{Var}(X)}{a^2}

Write $Z_j$ as a sum of indicator random variables:

Z_j = Y_1 + Y_2 + \cdots + Y_T

where:

Y_i = \begin{cases} 1 & \text{if the } i\text{-th distinct element has least significant bit at position } j \\ 0 & \text{otherwise} \end{cases}

Each $Y_i$ is Bernoulli with $\Pr(Y_i = 1) = \frac{1}{2^{j+1}}$ .

For a Bernoulli random variable with probability $p$ :

E[Y_i] = p \qquad \mathrm{Var}(Y_i) = p(1-p)

Since $p(1-p) \leq p$ :

\mathrm{Var}(Y_i) \leq E[Y_i]

Because the hash values are independent:

\mathrm{Var}(Z_j) = \sum_{i=1}^{T} \mathrm{Var}(Y_i)

Thus:

\mathrm{Var}(Z_j) \leq E[Z_j]

The same reasoning applies to $Z_{>j}$ .

For $J_-$ , we first compute $E[Z_{>J_-}]$ to see what Chebyshev gives us.

Each position $k > J_-$ contributes $E[Z_k] = T/2^{k+1}$ in expectation. Idealizing the hash as having unboundedly many bits, so that the series is infinite and sums exactly:

E[Z_{>J_-}] = \sum_{k=J_-+1}^{\infty} \frac{T}{2^{k+1}} = \frac{T}{2^{J_-+1}}

Substituting $J_- = \lfloor \log T \rfloor - 5$ :

E[Z_{>J_-}] = \frac{T}{2^{\lfloor \log T \rfloor - 4}} = \frac{T}{2^{\lfloor \log T \rfloor}} \cdot 2^4 \geq 16

The last step uses $2^{\lfloor \log T \rfloor} \leq T$ , which gives $T / 2^{\lfloor \log T \rfloor} \geq 1$ .

Now apply Chebyshev. The underestimate event requires $Z_{>J_-}$ to deviate from its mean by at least $E[Z_{>J_-}]$ :

\Pr(Z_{>J_-} = 0) \leq \Pr\!\left(|Z_{>J_-} - E[Z_{>J_-}]| \geq E[Z_{>J_-}]\right) \leq \frac{\mathrm{Var}(Z_{>J_-})}{(E[Z_{>J_-}])^2} \leq \frac{E[Z_{>J_-}]}{(E[Z_{>J_-}])^2} = \frac{1}{E[Z_{>J_-}]} \leq \frac{1}{16}

In essence:

\Pr(Z_{>J_-} = 0) \leq \frac{1}{16}

So the probability of the underestimate event is at most $\frac{1}{16}$ .

Combining the Bad Events

The algorithm fails only if either the overestimate event or the underestimate event occurs.

Using the union bound:

\Pr(\text{failure}) \leq \Pr(\text{overestimate}) + \Pr(\text{underestimate})

Substituting the bounds:

\Pr(\text{failure}) \leq \frac{1}{32} + \frac{1}{16} = \frac{3}{32}

Since $\frac{3}{32} < \frac{1}{3}$ :

\Pr(\text{success}) \geq 1 - \frac{3}{32} = \frac{29}{32} \geq \frac{2}{3}

Therefore, the algorithm gives a $32$ -approximation with probability at least $\frac{2}{3}$ .
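
The arithmetic behind the two bounds can be checked numerically for concrete values of $T$ . The sketch below uses exact fractions and the same idealized expectations as the analysis above; `floor_log2` and `bad_event_bounds` are names invented for this check.

```python
from fractions import Fraction


def floor_log2(T):
    """Exact floor(log2 T) for an integer T >= 1."""
    return T.bit_length() - 1


def bad_event_bounds(T):
    """Return (Markov bound, Chebyshev bound) for T distinct elements."""
    j_plus = floor_log2(T) + 5
    j_minus = floor_log2(T) - 5
    # Upper bound on E[Z_{>J+}], which Markov turns into Pr(overestimate)
    over = Fraction(T) / Fraction(2) ** (j_plus + 1)
    # 1 / E[Z_{>J-}], which Chebyshev turns into Pr(underestimate)
    under = 1 / (Fraction(T) / Fraction(2) ** (j_minus + 1))
    return over, under
```

For example, `bad_event_bounds(1000)` gives $125/4096 \approx 0.0305$ and $4/125 = 0.032$ , both below the worst-case values $\frac{1}{32}$ and $\frac{1}{16}$ used in the union bound.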

Improving the Algorithm

The basic estimator is not accurate enough by itself. A factor of $32$ is too large for most applications.

Improving the algorithm happens in two stages: first reduce the failure probability, then shrink the approximation factor from $32$ down to $(1 \pm \varepsilon)$ .

Boosting the Success Probability

To push the success probability from $\frac{2}{3}$ to $1 - \delta$ , run $O(\log \frac{1}{\delta})$ independent copies of the basic estimator in parallel and take the median of their outputs. The median can fall outside $[T/32,\ 32T]$ only if at least half of the copies fail, and a Chernoff-bound argument shows this happens with probability at most $\delta$ . This is the same boosting technique from earlier in the course and works as a black box: you do not need to know the algorithm’s internals to apply it.
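
The median trick might look as follows in Python. This is a sketch under assumptions: the copies are made independent by salting a SHA-256 stand-in hash, and the names `hash_to_int`, `boosted_estimate`, and `copies` are hypothetical. The number of copies is left as a parameter rather than derived from $\delta$ .

```python
import hashlib
import statistics


def hash_to_int(x, salt, bits=32):
    """Stand-in for an independent random hash per salt (assumption: SHA-256)."""
    digest = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)


def boosted_estimate(stream, copies=9):
    """Run `copies` basic estimators in one pass and return the median output."""
    P = [-1] * copies                     # one max-LSB counter per copy
    for x in stream:
        for salt in range(copies):
            h = hash_to_int(x, salt)
            if h != 0:
                P[salt] = max(P[salt], (h & -h).bit_length() - 1)
    return statistics.median(2 ** (p + 1) for p in P)
```

Using an odd number of copies keeps the median equal to one of the individual outputs, so the result is still a power of two.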

Improving the Approximation Factor

Getting an $\varepsilon$ -approximation takes a different idea. The starting observation is: if the stream has very few distinct elements, we do not need any clever algorithm — we can just store them all directly.

Algorithm for Small Sets (AFSS): Maintain an array of size $\frac{10}{\varepsilon^2}$ . As elements arrive, add each new distinct element to the array. Once the array is full, stop tracking new elements. When queried, output the exact count of distinct elements stored.

AFSS is perfectly accurate whenever the number of distinct elements is at most $\frac{10}{\varepsilon^2}$ , and it uses $O\!\left(\frac{\log n}{\varepsilon^2}\right)$ bits (each element from a universe of size $n$ takes $\log n$ bits).
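
A minimal sketch of AFSS in Python is shown below. The class name and the methods `add` and `count` are assumptions made for illustration, and a Python set stands in for the fixed-size array.

```python
import math


class AFSS:
    """Exact distinct counter that stops accepting new elements once full."""

    def __init__(self, eps):
        self.capacity = math.ceil(10 / eps ** 2)   # array size 10 / eps^2
        self.items = set()                          # stands in for the array

    def add(self, x):
        if x in self.items:
            return                                  # already stored
        if len(self.items) < self.capacity:
            self.items.add(x)                       # room left: store it
        # otherwise the array is full and new elements are ignored

    def count(self):
        return len(self.items)
```

With `eps = 1.0` the capacity is 10, so feeding 7 distinct elements reports exactly 7, while feeding 100 distinct elements reports 10.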

The Combined Algorithm

The full algorithm runs A1 (the basic 32-approximation algorithm) and $\log n$ copies of AFSS in parallel:

A1 sees the entire stream and tracks the global maximum LSB position.
AFSS[0], AFSS[1], …, AFSS[ $\log n - 1$ ], one copy per possible LSB position, each see only a filtered portion of the stream.

For each arriving element $x_i$ :

Hash $x_i$ and find its LSB position $j$ .
Feed $x_i$ to A1 (which updates its maximum LSB counter).
Feed $x_i$ to AFSS[ $j$ ] only — not to any other copy.

The filtering is the key idea. About $T/2^{j+1}$ distinct elements hash to LSB position $j$ . The copy that matters most is AFSS[ $P$ ], where $P \approx \log T$ is the position A1 currently reports. At that level the number of distinct elements feeding into AFSS[ $P$ ] is roughly a small constant — well within the array’s capacity.

Query: When asked for an estimate:

Ask A1 for the current maximum LSB position $j$ .
Ask AFSS[ $j$ ] for its distinct element count $C$ .
Output $C \cdot 2^{j+1}$ .

This output makes sense for the same reason A1’s estimator does: $C$ counts the elements at the rarest hashing level, and each such element represents $\approx 2^{j+1}$ distinct elements in expectation. The difference is that AFSS gives an exact count $C$ rather than the rough estimate A1 would produce.
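
The routing and query steps can be traced with a small Python sketch that follows the description above literally. Everything named here is hypothetical: the class `CombinedCounter`, the `hash_fn` and `bits` parameters, and the use of sets in place of arrays. The hash is passed in so that the behaviour can be traced by hand; `hash_fn` is assumed to return values below `2**bits`.

```python
import math


def lsb_position(h):
    """Index of the lowest 1 bit of a positive integer."""
    return (h & -h).bit_length() - 1


class CombinedCounter:
    def __init__(self, eps, hash_fn, bits=32):
        self.hash_fn = hash_fn                        # must return values < 2**bits
        self.capacity = math.ceil(10 / eps ** 2)
        self.levels = [set() for _ in range(bits)]    # levels[j] plays the role of AFSS[j]
        self.max_pos = -1                             # the state kept by A1

    def process(self, x):
        h = self.hash_fn(x)
        if h == 0:
            return                                    # no 1 bit, so no LSB position
        j = lsb_position(h)
        self.max_pos = max(self.max_pos, j)           # feed A1
        level = self.levels[j]                        # feed AFSS[j] only
        if x in level or len(level) < self.capacity:
            level.add(x)

    def query(self):
        if self.max_pos < 0:
            return 0                                  # empty stream
        j = self.max_pos
        return len(self.levels[j]) * 2 ** (j + 1)     # C * 2^(j+1)
```

As a hand trace with the identity function as a toy hash, the stream $1, 2, \dots, 8$ has maximum LSB position $3$ (from the element $8$ ), AFSS[3] holds one element, and the query returns $1 \cdot 2^4 = 16$ .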

Space Complexity

Each AFSS copy uses $O\!\left(\frac{\log n}{\varepsilon^2}\right)$ bits, and there are $\log n$ copies:

O\!\left(\frac{\log^2 n}{\varepsilon^2}\right)

With probability boosting to $1 - \delta$ , multiply by $\log \frac{1}{\delta}$ . Kane, Nelson, and Woodruff later showed that the space can be reduced to $O\!\left(\frac{1}{\varepsilon^2} + \log n\right)$ bits, which is optimal.

Introduction to Online Algorithms

An online algorithm is an algorithm that must make decisions without knowing the future.

This is different from an offline algorithm, which has access to the entire input in advance.

Online vs. Offline Algorithms

An offline algorithm sees the whole input before making decisions.

An online algorithm receives the input piece by piece and must make decisions immediately.

Once an online algorithm makes a decision, it usually cannot go back and change it.

Why Online Algorithms Matter

Online algorithms appear in many real-world settings, including:

renting or buying equipment,
caching web pages,
scheduling jobs,
assigning resources,
accepting or rejecting requests,
making financial decisions without knowing future prices.

Competitive Analysis

Since online algorithms do not know the future, we compare them to an ideal offline algorithm.

The offline algorithm knows the entire input in advance and can make the best possible decisions. This ideal algorithm is called $\text{OPT}$ .

The performance of an online algorithm is measured using the competitive ratio.

An online algorithm is $c$ -competitive if:

\text{Cost of online algorithm} \leq c \cdot \text{Cost of OPT}

for every possible input sequence.

The smaller the competitive ratio, the better the online algorithm.

The Ski Rental Problem

The ski rental problem is a classic example of an online problem.

Suppose renting skis costs $1$ per day, while buying skis costs $B$ . The skier does not know how many days they will ski.

The question is: when should the skier stop renting and buy skis?

Offline Optimal Solution

If the skier knew in advance that they would ski for $d$ days, the optimal offline algorithm would choose:

\text{OPT} = \min(d, B)

This means:

If $d < B$ , it is cheaper to rent every day.
If $d \geq B$ , it is cheaper to buy.

Online Strategy

A natural online strategy is: rent for the first $B$ days, then buy.

The online algorithm does not know whether skiing will stop early or continue for a long time, so it waits until the rental cost equals the buying cost.

Cost of the Online Algorithm

If skiing lasts at most $B$ days ( $d \leq B$ ), the online algorithm only rents, so its cost is $d$ , which matches the optimal offline cost $\min(d, B) = d$ .

If skiing continues for more than $B$ days ( $d > B$ ), the online algorithm rents for $B$ days and then buys:

B + B = 2B

The offline algorithm would have bought immediately, paying $B$ . Thus:

\frac{\text{Online Cost}}{\text{OPT}} = \frac{2B}{B} = 2

So this online strategy is $2$ -competitive.
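
The case analysis can be checked by brute force over small inputs. The function names `opt_cost` and `online_cost` are made up for this sketch, which assumes the rent-for-$B$-days-then-buy strategy described above with a rental price of 1 per day.

```python
def opt_cost(d, B):
    """Offline optimum: rent every day or buy up front, whichever is cheaper."""
    return min(d, B)


def online_cost(d, B):
    """Rent on days 1..B, then buy on day B + 1 if still skiing."""
    if d <= B:
        return d          # skiing ended before a purchase was needed
    return B + B          # B days of rent plus the purchase price


worst_ratio = max(
    online_cost(d, B) / opt_cost(d, B)
    for B in range(1, 21)
    for d in range(1, 101)
)
```

Over this range `worst_ratio` is exactly 2, attained whenever $d > B$ .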

Pizza Finding / Linear Search Problem

The pizza finding problem is an example of an online search problem.

Imagine standing at a starting point on a long road. There is a pizza shop somewhere on the road, but we do not know whether it is to the left or to the right, and we do not know how far away it is.

The goal is to find the pizza shop while walking as little total distance as possible.

Why This Is an Online Problem

This is an online problem because the algorithm does not know the location of the pizza shop ahead of time. It must decide which direction to search and how far to walk without knowing whether that choice is correct.

An offline algorithm that already knows where the pizza shop is would walk directly to it. If the pizza shop is distance $d$ away, then $\text{OPT} = d$ .

Doubling Strategy

A strong online strategy is the doubling trick. The algorithm searches in alternating directions, doubling the search distance each time:

1,\ 2,\ 4,\ 8,\ 16,\ \dots

For example:

Go distance $1$ to the right, then return.
Go distance $2$ to the left, then return.
Go distance $4$ to the right, then return.
Go distance $8$ to the left, then return.
Continue alternating directions and doubling the distance.

This guarantees that eventually the algorithm searches far enough in the correct direction to find the pizza shop.

Competitive Ratio

Suppose the pizza shop is distance $d$ from the starting point. The offline optimal algorithm pays $\text{OPT} = d$ .

The online algorithm may waste distance searching in the wrong direction and returning to the starting point. However, because the search distance doubles each time, the total wasted distance is bounded by a constant factor of $d$ . Let’s actually work out that constant.

Setting up the worst case

Number the rounds $i = 0, 1, 2, \ldots$ where round $i$ walks distance $2^i$ in its search direction and then returns to the start. Directions alternate, so (say) even-numbered rounds go right and odd-numbered rounds go left.

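Assume WLOG the pizza is to the right at distance $d \geq 1$ (some lower bound on $d$ is needed, since otherwise the very first step in the wrong direction could cost arbitrarily more than $\text{OPT}$ ). We finally find the pizza in some round $k$ where: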

$k$ is even (it’s a right-direction round), and
$2^k \ge d$ (we walk far enough this time to reach the pizza), but
$2^{k-2} < d$ (the previous right-direction round didn’t go far enough)

The adversary picks $d$ to make us as inefficient as possible. The worst $d$ is just barely larger than $2^{k-2}$ — if it were any smaller, we would’ve already found the pizza in round $k - 2$ and not paid for rounds $k-1$ and $k$ .

Computing total cost

The algorithm’s total walking distance breaks into two parts:

Rounds $0$ through $k - 1$ : each round is a complete round-trip of length $2 \cdot 2^i$ (walk out, walk back). Summing the geometric series:
$\sum_{i=0}^{k-1} 2 \cdot 2^i = 2 \cdot \frac{2^k - 1}{2 - 1} = 2^{k+1} - 2$
Round $k$ : we walk distance $d$ in the right direction and stop when we hit the pizza (no return needed). This contributes $d$ .

So the total cost is:

\text{cost}(\text{ALG}) = (2^{k+1} - 2) + d

Taking the ratio

The competitive ratio is:

\frac{\text{cost}(\text{ALG})}{\text{OPT}} = \frac{(2^{k+1} - 2) + d}{d}

Adversary’s worst-case choice: $d \to 2^{k-2}$ (just barely past the previous right-direction walk, so we’re forced through rounds $k-1$ and $k$ ). Substituting:

\frac{\text{cost}(\text{ALG})}{\text{OPT}} \to \frac{2^{k+1} - 2 + 2^{k-2}}{2^{k-2}} = \frac{2^{k+1}}{2^{k-2}} - \frac{2}{2^{k-2}} + 1 = 8 - 2^{3-k} + 1 = 9 - 2^{3-k}

As $k$ grows (i.e., for far-away pizza shops), the $2^{3-k}$ correction vanishes and the ratio approaches:

\boxed{\lim_{k \to \infty} \frac{\text{cost}(\text{ALG})}{\text{OPT}} = 9}

So the doubling strategy is $9$ -competitive:

\text{cost}(\text{ALG}) \leq 9 \cdot \text{OPT}
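
The bound can also be checked by simulating the walk. The sketch below assumes the round structure described above (round $i$ reaches distance $2^i$ , even rounds go right) and a pizza at distance $d \geq 1$ ; the function name `doubling_cost` and the `on_right` flag are hypothetical.

```python
def doubling_cost(d, on_right=True):
    """Total distance walked by the doubling strategy until the pizza is reached."""
    total = 0.0
    i = 0
    while True:
        reach = 2 ** i
        goes_right = (i % 2 == 0)          # even rounds search to the right
        if goes_right == on_right and reach >= d:
            return total + d               # found it: walk out d and stop
        total += 2 * reach                 # failed round: walk out and back
        i += 1
```

For instance, a pizza just past distance 4 on the right is found only in round 4, after wasted round trips totalling $2(1 + 2 + 4 + 8) = 30$ , which gives a ratio close to $8.5 = 9 - 2^{3-4}$ .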

Main Idea

The important lesson is that even without knowing where the target is, the online algorithm can still stay within a constant factor of the optimal offline solution by expanding its search exponentially.

This is why the doubling trick is useful in online algorithms: it prevents the algorithm from wasting too much time on small searches while still guaranteeing that the target will eventually be found.

Main Takeaways

The distinct elements problem asks us to estimate the number of unique items in a stream.
The idealized minimum-hash algorithm estimates $T$ using the smallest hash value, but it requires infinite precision.
The practical bit-based approach hashes elements to binary strings and tracks the largest least significant bit position.
Seeing many trailing zeros is rare, so it suggests that many distinct elements have appeared.
The basic estimator outputs $2^{P+1}$ .
The basic estimator gives a $32$ -approximation with probability at least $\frac{2}{3}$ .
Markov’s inequality is used to bound the probability of overestimating.
Chebyshev’s inequality is used to bound the probability of underestimating.
HyperLogLog improves the basic estimator by using multiple estimates and combining them.
Online algorithms make decisions without knowing the future.
Competitive analysis compares an online algorithm to the optimal offline algorithm.
The ski rental problem is a classic example, and the rent-then-buy strategy is $2$ -competitive.
The pizza finding problem is an online linear search problem, and the doubling strategy gives a $9$ -competitive algorithm.