Lecture 21-22 (04/27/2026) - Finish Counting Distinct Elements: Flajolet-Martin Factor-32, Epsilon; Online Algorithms

Scribes: Juno Joseph, Raquel Gelbtuch, and Olivia Xu

Topics covered:

  • Review of the Distinct Elements Problem
  • Idealized hashing algorithm using minimum hash values
  • Limitations of infinite precision hashing
  • Flajolet-Martin / HyperLogLog intuition
  • Least Significant Bit position tracking
  • Probabilistic analysis of trailing zeros
  • Approximation guarantees for distinct counting
  • Markov and Chebyshev bounds in algorithm analysis
  • Introduction to Online Algorithms
  • Competitive analysis
  • Ski Rental problem
  • Pizza Finding / Linear Search problem

The distinct elements problem asks the following question:

Given a stream of elements, how many distinct elements appear in the stream?

The challenge is that the stream may be extremely large, so storing every element may be too expensive. Instead, we want a randomized streaming algorithm that uses small space and gives a good approximation.

Suppose we have a perfectly random hash function:

$$h(x) : x \rightarrow [0,1]$$

which maps every distinct element uniformly to a real number in the interval $[0,1]$.

The algorithm is:

  1. Hash every distinct element.
  2. Maintain the smallest hash value observed.
  3. Output:
$$\frac{1}{\min h(x)} - 1$$

Suppose there are $T$ distinct elements.

If we throw $T$ random points uniformly into the interval $[0,1]$, then:

$$E[\min h(x)] = \frac{1}{T+1}$$

For example:

  • If $T=2$, then $E[\min h(x)] = \frac{1}{3}$.
  • If $T=3$, then $E[\min h(x)] = \frac{1}{4}$.

So:

$$T \approx \frac{1}{\min h(x)} - 1$$

This motivates the estimator.
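As a quick sanity check on the expectation $E[\min h(x)] = \frac{1}{T+1}$, the following sketch simulates the idealized algorithm. It assumes Python's `random.random()` is an acceptable stand-in for a perfectly random hash into $[0,1]$; the function names and the seed are ours and purely illustrative.

```python
import random

def min_hash_estimate(num_distinct, rng):
    """Draw one uniform 'hash' per distinct element and return 1/min - 1."""
    smallest = min(rng.random() for _ in range(num_distinct))
    if smallest == 0.0:  # essentially never happens, but avoids dividing by zero
        return float("inf")
    return 1.0 / smallest - 1.0

def average_min(num_distinct, trials, rng):
    """Empirical mean of the minimum of num_distinct uniform draws."""
    total = 0.0
    for _ in range(trials):
        total += min(rng.random() for _ in range(num_distinct))
    return total / trials

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
T = 9
print(average_min(T, 20000, rng), "vs", 1 / (T + 1))
```

The empirical mean of the minimum should land near $\frac{1}{T+1}$. A single call to `min_hash_estimate` can be far from $T$, which is one reason the analysis later in these notes works with probability bounds instead of a single run.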

This algorithm is not realistic because it hashes to real numbers in $[0,1]$, which requires infinite precision. Therefore, we need a version that uses finite bit representations.

To mimic the idealized algorithm, we instead hash stream elements into integers.

Suppose:

$$h(x) \in \{0, 1, \dots, n-1\}$$

For every arriving stream element $x_i$:

  1. Compute its hash value.
  2. Convert the hash into binary form.
  3. Find its least significant bit position: the first position from the right containing a $1$.

Positions are counted starting at $0$ from the right.

For example, in the bit string $1011000$, the least significant $1$ occurs at position $3$.

The algorithm keeps track of $P$, where $P$ is the largest least-significant-bit position observed so far.

At the end of the stream, the algorithm outputs:

$$2^{P+1}$$

The intuition comes from trailing zeros.

  • About half of bit strings end in $1$.
  • About one fourth end in $10$.
  • About one eighth end in $100$.

In general, the number of elements whose hash has least significant bit at position $P$ is about:

$$\frac{T}{2^{P+1}}$$

The largest $P$ we expect to see is when this quantity is roughly $1$:

$$\frac{T}{2^{P+1}} \approx 1$$

Therefore:

$$T \approx 2^{P+1}$$

So the algorithm outputs $2^{P+1}$ as its estimate for the number of distinct elements.
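The steps above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: SHA-256 truncated to 64 bits stands in for the random hash function, and the names `hash64`, `lsb_position`, `basic_estimate`, and `HASH_BITS` are ours, not part of the lecture.

```python
import hashlib

HASH_BITS = 64  # assumed hash width for this sketch

def hash64(item):
    """Deterministic 64-bit hash of str(item); SHA-256 stands in for an ideal hash."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def lsb_position(value):
    """Index (from 0 at the right) of the lowest set bit; HASH_BITS if value is 0."""
    if value == 0:
        return HASH_BITS
    return (value & -value).bit_length() - 1

def basic_estimate(stream):
    """Track the largest LSB position P over the stream and return 2**(P+1)."""
    largest = -1  # no element seen yet
    for item in stream:
        largest = max(largest, lsb_position(hash64(item)))
    return 0 if largest < 0 else 2 ** (largest + 1)

print(lsb_position(0b1011000))  # 3, matching the example in the text
```

Because the hash is deterministic, repeated elements always land on the same LSB position, so duplicates never change the estimate. The only state kept is the single integer `largest`.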

Approximation Guarantee of the Basic Algorithm

Let $T$ be the true number of distinct elements in the stream. Let $X_m$ be the largest least significant bit position observed by the algorithm.

The output is $2^{X_m+1}$.

The first guarantee is that this basic algorithm gives a $32$-factor approximation with probability at least $\frac{2}{3}$:

$$\frac{T}{32} \leq 2^{X_m+1} \leq 32T$$

Although a factor of $32$ is not very accurate, this algorithm is useful as a building block for stronger algorithms such as HyperLogLog.

Ideally, we would like the counter to be close to $\log T$ (logarithms are base $2$ throughout).

The key claim is:

$$\log T - 5 \leq X_m + 1 \leq \log T + 5$$

with probability at least $\frac{2}{3}$.

If this holds, then raising $2$ to the power of each side gives:

$$2^{\log T - 5} \leq 2^{X_m+1} \leq 2^{\log T + 5}$$

Since $2^{\log T} = T$ and $2^5 = 32$, we get:

$$\frac{T}{32} \leq 2^{X_m+1} \leq 32T$$

So proving the algorithm is a $32$-approximation reduces to proving that $X_m + 1$ is within $5$ of $\log T$.

For each position $j$, define $Z_j$ to be the number of distinct elements whose least significant bit occurs exactly at position $j$.

For example:

  • $Z_0$ counts elements whose hash ends in $1$.
  • $Z_1$ counts elements whose hash ends in $10$.
  • $Z_2$ counts elements whose hash ends in $100$.

For a fixed element:

$$\Pr(\text{least significant bit is at position } j) = \frac{1}{2^{j+1}}$$

Therefore:

$$E[Z_j] = \frac{T}{2^{j+1}}$$

We also define $Z_{>j}$ to be the number of elements whose least significant bit occurs at a position greater than $j$, meaning farther left than position $j$.

Using a geometric series:

$$E[Z_{>j}] < \frac{T}{2^{j+1}}$$

The important intuition is that these probabilities decrease geometrically as we move farther left in the bit representation.

The algorithm fails to give a $32$-approximation if $X_m + 1$ is not within $[\log T - 5,\ \log T + 5]$.

There are two possible bad events:

  1. Overestimate: the algorithm sees a least significant bit position that is too far left.
  2. Underestimate: the algorithm never sees a least significant bit position far enough left.

Let $J_+ = \lfloor \log T \rfloor + 5$ and $J_- = \lfloor \log T \rfloor - 5$.

The overestimate event occurs if at least one element has least significant bit position greater than $J_+$.

The underestimate event occurs if no element has least significant bit position greater than $J_-$.

Bounding the Overestimate Event Using Markov’s Inequality

We want to bound:

$$\Pr(Z_{>J_+} \geq 1)$$

First:

$$E[Z_{>J_+}] < \frac{T}{2^{J_+ + 1}}$$

Since $J_+ = \lfloor \log T \rfloor + 5$:

$$E[Z_{>J_+}] < \frac{T}{2^{\lfloor \log T \rfloor + 6}}$$

Using $2^{\lfloor \log T \rfloor} \geq \frac{T}{2}$:

$$E[Z_{>J_+}] \leq \frac{1}{32}$$

Now apply Markov’s inequality:

$$\Pr(X \geq a) \leq \frac{E[X]}{a}$$

Using $X = Z_{>J_+}$ and $a = 1$:

$$\Pr(Z_{>J_+} \geq 1) \leq \frac{1}{32}$$

So the probability of a large overestimate is at most $\frac{1}{32}$.
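The arithmetic behind the $\frac{1}{32}$ bound can be checked exactly for a range of $T$. The sketch below evaluates $\frac{T}{2^{J_+ + 1}}$ with $J_+ = \lfloor \log T \rfloor + 5$ using exact fractions; the function names and the range of $T$ tested are our own choices.

```python
from fractions import Fraction

def floor_log2(t):
    """floor(log2(t)) for a positive integer t, using exact integer arithmetic."""
    return t.bit_length() - 1

def overestimate_mean_bound(t):
    """The bound T / 2**(J_plus + 1) with J_plus = floor(log2 T) + 5."""
    j_plus = floor_log2(t) + 5
    return Fraction(t, 2 ** (j_plus + 1))

# Largest value of the bound over a range of stream sizes.
worst = max(overestimate_mean_bound(t) for t in range(1, 5000))
print(worst, worst <= Fraction(1, 32))
```

The bound is closest to $\frac{1}{32}$ when $T$ is just below a power of two, where $2^{\lfloor \log T \rfloor}$ is only slightly more than $\frac{T}{2}$.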

Bounding the Underestimate Event Using Chebyshev’s Inequality

The underestimate event happens when $Z_{>J_-} = 0$.

To bound this, we use Chebyshev’s inequality:

$$\Pr(|X - E[X]| \geq a) \leq \frac{\mathrm{Var}(X)}{a^2}$$

Write $Z_j$ as a sum of indicator random variables:

$$Z_j = Y_1 + Y_2 + \cdots + Y_T$$

where:

$$Y_i = \begin{cases} 1 & \text{if the } i\text{-th distinct element has least significant bit at position } j \\ 0 & \text{otherwise} \end{cases}$$

Each $Y_i$ is Bernoulli with $\Pr(Y_i = 1) = \frac{1}{2^{j+1}}$.

For a Bernoulli random variable with probability $p$:

$$E[Y_i] = p \qquad \mathrm{Var}(Y_i) = p(1-p)$$

Since $p(1-p) \leq p$:

$$\mathrm{Var}(Y_i) \leq E[Y_i]$$

Because the hash values are independent:

$$\mathrm{Var}(Z_j) = \sum_{i=1}^{T} \mathrm{Var}(Y_i)$$

Thus:

$$\mathrm{Var}(Z_j) \leq E[Z_j]$$

The same reasoning applies to $Z_{>j}$.

For $J_-$, we first compute $E[Z_{>J_-}]$ to see what Chebyshev gives us.

Each position $k > J_-$ contributes $E[Z_k] = T/2^{k+1}$ in expectation, so:

$$E[Z_{>J_-}] = \sum_{k=J_-+1}^{\infty} \frac{T}{2^{k+1}} = \frac{T}{2^{J_-+1}}$$

Substituting $J_- = \lfloor \log T \rfloor - 5$:

$$E[Z_{>J_-}] = \frac{T}{2^{\lfloor \log T \rfloor - 4}} = \frac{T}{2^{\lfloor \log T \rfloor}} \cdot 2^4 \geq 16$$

The last step uses $2^{\lfloor \log T \rfloor} \leq T$, which gives $T / 2^{\lfloor \log T \rfloor} \geq 1$.

Now apply Chebyshev. The underestimate event requires $Z_{>J_-}$ to deviate from its mean by at least $E[Z_{>J_-}]$:

$$\Pr(Z_{>J_-} = 0) \leq \Pr\!\left(|Z_{>J_-} - E[Z_{>J_-}]| \geq E[Z_{>J_-}]\right) \leq \frac{\mathrm{Var}(Z_{>J_-})}{(E[Z_{>J_-}])^2} \leq \frac{E[Z_{>J_-}]}{(E[Z_{>J_-}])^2} = \frac{1}{E[Z_{>J_-}]} \leq \frac{1}{16}$$

In essence:

$$\Pr(Z_{>J_-} = 0) \leq \frac{1}{16}$$

So the probability of the underestimate event is at most $\frac{1}{16}$.
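The same kind of exact check works for the underestimate side. The sketch below evaluates $E[Z_{>J_-}] = \frac{T}{2^{J_- + 1}}$ and the resulting Chebyshev bound $\frac{1}{E[Z_{>J_-}]}$. It assumes $T \geq 32$ so that $J_- \geq 0$; that restriction and the function names are ours.

```python
from fractions import Fraction

def underestimate_mean(t):
    """T / 2**(J_minus + 1) with J_minus = floor(log2 T) - 5, as an exact fraction."""
    j_minus = (t.bit_length() - 1) - 5
    if j_minus < 0:
        raise ValueError("this sketch assumes T >= 32 so that J_minus >= 0")
    return Fraction(t, 2 ** (j_minus + 1))

def chebyshev_bound(t):
    """Upper bound 1 / E[Z_{>J_minus}] on the probability that Z_{>J_minus} = 0."""
    return 1 / underestimate_mean(t)

print(underestimate_mean(1024), chebyshev_bound(1024))
```

When $T$ is an exact power of two the mean is exactly $16$, which is the tight case for the $\frac{1}{16}$ bound; for other $T$ the mean is larger and the bound is smaller.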

The algorithm fails only if either the overestimate event or the underestimate event occurs.

Using the union bound:

$$\Pr(\text{failure}) \leq \Pr(\text{overestimate}) + \Pr(\text{underestimate})$$

Substituting the bounds:

$$\Pr(\text{failure}) \leq \frac{1}{32} + \frac{1}{16} = \frac{3}{32}$$

Since $\frac{3}{32} < \frac{1}{3}$:

$$\Pr(\text{success}) \geq \frac{2}{3}$$

Therefore, the algorithm gives a $32$-approximation with probability at least $\frac{2}{3}$.
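To see the guarantee empirically, one can simulate the basic estimator many times and count how often the output lands in $[T/32,\ 32T]$. The sketch below is a simulation under our own assumptions: each distinct element gets a fresh 64-bit value from `random.getrandbits` in place of a hash, and the choice of $T = 1000$, the trial count, and the seed are arbitrary.

```python
import random

def lsb_position(value, bits=64):
    """Index of the lowest set bit, counted from 0 at the right; `bits` if value is 0."""
    if value == 0:
        return bits
    return (value & -value).bit_length() - 1

def one_run(num_distinct, rng):
    """Basic estimate 2**(P+1) using a fresh random 64-bit 'hash' per distinct element."""
    largest = max(lsb_position(rng.getrandbits(64)) for _ in range(num_distinct))
    return 2 ** (largest + 1)

def success_rate(num_distinct, trials, rng):
    """Fraction of runs whose output lies in [T/32, 32T]."""
    good = 0
    for _ in range(trials):
        estimate = one_run(num_distinct, rng)
        if num_distinct / 32 <= estimate <= 32 * num_distinct:
            good += 1
    return good / trials

print(success_rate(1000, 200, random.Random(0)))
```

In practice the observed success rate tends to sit well above $\frac{2}{3}$, since Markov and Chebyshev are loose bounds.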

The basic estimator is not accurate enough by itself. A factor of $32$ is too large for most applications.

Improving the algorithm happens in two stages: first reduce the failure probability, then shrink the approximation factor from $32$ down to $(1 \pm \varepsilon)$.

To push the success probability from $\frac{2}{3}$ to $1 - \delta$, run $O(\log \frac{1}{\delta})$ independent copies of the basic estimator in parallel and take the median of their outputs. A Chernoff-bound argument shows the median is correct with probability at least $1 - \delta$. This is the same boosting technique from earlier in the course and works as a black box: you do not need to know the algorithm’s internals to apply it.
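The median trick treats the estimator as a black box, which makes it easy to sketch. Below, `median_of_copies` accepts any zero-argument callable; `noisy_estimator` is a hypothetical stand-in that is right with probability $\frac{2}{3}$ and wildly off otherwise, not the actual estimator from these notes. We use `statistics.median_low` so the result is always one of the actual outputs.

```python
import random
import statistics

def median_of_copies(run_estimator, copies):
    """Run `run_estimator` (a zero-argument callable) `copies` times; return the low median."""
    return statistics.median_low([run_estimator() for _ in range(copies)])

def noisy_estimator(rng, truth=1000):
    """Hypothetical stand-in: correct with probability 2/3, far too small or large otherwise."""
    if rng.random() < 2 / 3:
        return truth
    return rng.choice([1, truth * 10**6])

rng = random.Random(0)  # arbitrary seed for repeatability
print(median_of_copies(lambda: noisy_estimator(rng), 31))
```

The median is wrong only if at least half of the copies fail in the same direction, which is why the failure probability drops exponentially in the number of copies.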

Getting an $\varepsilon$-approximation takes a different idea. The starting observation is: if the stream has very few distinct elements, we do not need any clever algorithm; we can just store them all directly.

Algorithm for Small Sets (AFSS): Maintain an array of size $\frac{10}{\varepsilon^2}$. As elements arrive, add each new distinct element to the array. Once the array is full, stop tracking new elements. When queried, output the exact count of distinct elements stored.

AFSS is perfectly accurate whenever the number of distinct elements is at most $\frac{10}{\varepsilon^2}$, and it uses $O\!\left(\frac{\log n}{\varepsilon^2}\right)$ bits (each element from a universe of size $n$ takes $\log n$ bits).

The full algorithm runs A1 (the basic $32$-approximation algorithm) and $\log n$ copies of AFSS in parallel:

  • A1 sees the entire stream and tracks the global maximum LSB position.
  • AFSS[$0$], AFSS[$1$], …, AFSS[$\log n$] each see only a filtered portion of the stream.

For each arriving element $x_i$:

  • Hash $x_i$ and find its LSB position $j$.
  • Feed $x_i$ to A1 (which updates its maximum LSB counter).
  • Feed $x_i$ to AFSS[$j$] only, not to any other copy.

The filtering is the key idea. About $T/2^{j+1}$ distinct elements hash to LSB position $j$. The copy that matters most is AFSS[$P$], where $P \approx \log T$ is the position A1 currently reports. At that level the number of distinct elements feeding into AFSS[$P$] is roughly a small constant, well within the array’s capacity.

Query: When asked for an estimate:

  • Ask A1 for the current maximum LSB position $j$.
  • Ask AFSS[$j$] for its distinct element count $C$.
  • Output $C \cdot 2^{j+1}$.

This output makes sense for the same reason A1’s estimator does: $C$ counts the elements at the rarest hashing level, and each such element represents $\approx 2^{j+1}$ distinct elements in expectation. The difference is that AFSS gives an exact count $C$ rather than the rough estimate A1 would produce.
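The update and query steps can be written down directly. The sketch below follows the steps as stated in these notes and makes no attempt to verify the $\varepsilon$ guarantee. Our own assumptions: SHA-256 truncated to 64 bits plays the role of the hash, each AFSS copy is a Python `set` capped at $\lceil 10/\varepsilon^2 \rceil$ entries, and there is one copy per possible LSB position of a 64-bit hash. The class and function names are illustrative.

```python
import hashlib
import math

def hash64(item):
    """Deterministic 64-bit hash of str(item); SHA-256 stands in for an ideal hash."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def lsb_position(value, bits=64):
    """Index of the lowest set bit, counted from 0 at the right; `bits` if value is 0."""
    if value == 0:
        return bits
    return (value & -value).bit_length() - 1

class FilteredCounter:
    """A1 plus one capped set per LSB level, following the steps in the text."""

    def __init__(self, epsilon, levels=65):
        self.capacity = math.ceil(10 / epsilon ** 2)  # AFSS array size
        self.max_level = -1                           # A1's counter (largest LSB position)
        self.afss = [set() for _ in range(levels)]    # AFSS[0], ..., AFSS[levels - 1]

    def add(self, item):
        level = lsb_position(hash64(item))
        self.max_level = max(self.max_level, level)   # feed A1
        bucket = self.afss[level]                     # feed AFSS[level] only
        if item in bucket or len(bucket) < self.capacity:
            bucket.add(item)                          # a full bucket ignores new items

    def query(self):
        if self.max_level < 0:
            return 0                                  # empty stream
        count = len(self.afss[self.max_level])        # exact count C at the reported level
        return count * 2 ** (self.max_level + 1)

counter = FilteredCounter(epsilon=0.5)
for x in range(1000):
    counter.add(x)
print(counter.query())
```

Storing the elements themselves in a `set` is a simplification; the space bound in the notes counts $\log n$ bits per stored element.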

Each AFSS copy uses $O\!\left(\frac{\log n}{\varepsilon^2}\right)$ bits, and there are $\log n$ copies:

$$O\!\left(\frac{\log^2 n}{\varepsilon^2}\right)$$

With probability boosting to $1 - \delta$, multiply by $\log \frac{1}{\delta}$. A result by Jelani Nelson and others later showed the space can be reduced to $O\!\left(\frac{1}{\varepsilon^2} + \log n\right)$, which is also optimal.

An online algorithm is an algorithm that must make decisions without knowing the future.

This is different from an offline algorithm, which sees the entire input before making any decision.

An online algorithm receives the input piece by piece and must make decisions immediately.

Once an online algorithm makes a decision, it usually cannot go back and change it.

Online algorithms appear in many real-world settings, including:

  • renting or buying equipment,
  • caching web pages,
  • scheduling jobs,
  • assigning resources,
  • accepting or rejecting requests,
  • making financial decisions without knowing future prices.

Since online algorithms do not know the future, we compare them to an ideal offline algorithm.

The offline algorithm knows the entire input in advance and can make the best possible decisions. This ideal algorithm is called $\text{OPT}$.

The performance of an online algorithm is measured using the competitive ratio.

An online algorithm is $c$-competitive if:

$$\text{Cost of online algorithm} \leq c \cdot \text{Cost of OPT}$$

for every possible input sequence.

The smaller the competitive ratio, the better the online algorithm.

The ski rental problem is a classic example of online algorithms.

Suppose renting skis costs $1$ per day, while buying skis costs $B$. The skier does not know how many days they will ski.

The question is: when should the skier stop renting and buy skis?

If the skier knew in advance that they would ski for $d$ days, the optimal offline algorithm would choose:

$$\text{OPT} = \min(d, B)$$

This means:

  • If $d < B$, it is cheaper to rent every day.
  • If $d \geq B$, it is cheaper to buy.

A natural online strategy is: rent for the first $B$ days, then buy.

The online algorithm does not know whether skiing will stop early or continue for a long time, so it waits until the rental cost equals the buying cost.

If skiing stops within the first $B$ days ($d \leq B$), the online algorithm only rents, so its cost is $d$, which matches the optimal offline cost.

If skiing continues past day $B$ ($d > B$), the online algorithm rents for $B$ days and then buys:

$$B + B = 2B$$

The offline algorithm would have bought immediately, paying $B$. Thus:

$$\frac{\text{Online Cost}}{\text{OPT}} = \frac{2B}{B} = 2$$

So this online strategy is $2$-competitive.
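The case analysis is small enough to check by brute force. The sketch below encodes the rent-then-buy strategy under one reading of the text, namely that the skier rents on days $1$ through $B$ and buys on day $B+1$ if still skiing. The function names are ours.

```python
def offline_cost(days, buy_price):
    """Cost of OPT, which knows the number of ski days in advance."""
    return min(days, buy_price)

def online_cost(days, buy_price):
    """Rent for the first buy_price days, then buy on the next day if still skiing."""
    if days <= buy_price:
        return days                   # only ever rented
    return buy_price + buy_price      # B days of rent plus the purchase

# Worst ratio over a range of trip lengths with B = 10.
worst = max(online_cost(d, 10) / offline_cost(d, 10) for d in range(1, 101))
print(worst)  # 2.0
```

The ratio is $1$ for short trips and jumps to exactly $2$ as soon as the purchase happens, so the adversary's best move is to make the trip last one day longer than the rental phase.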

The pizza finding problem is an example of an online search problem.

Imagine standing at a starting point on a long road. There is a pizza shop somewhere on the road, but we do not know whether it is to the left or to the right, and we do not know how far away it is.

The goal is to find the pizza shop while walking as little total distance as possible.

This is an online problem because the algorithm does not know the location of the pizza shop ahead of time. It must decide which direction to search and how far to walk without knowing whether that choice is correct.

An offline algorithm that already knows where the pizza shop is would walk directly to it. If the pizza shop is distance $d$ away, then $\text{OPT} = d$.

A strong online strategy is the doubling trick. The algorithm searches in alternating directions, doubling the search distance each time:

$$1,\ 2,\ 4,\ 8,\ 16,\ \dots$$

For example:

  1. Go distance $1$ to the right, then return.
  2. Go distance $2$ to the left, then return.
  3. Go distance $4$ to the right, then return.
  4. Go distance $8$ to the left, then return.
  5. Continue alternating directions and doubling the distance.

This guarantees that eventually the algorithm searches far enough in the correct direction to find the pizza shop.

Suppose the pizza shop is distance $d$ from the starting point. The offline optimal algorithm pays $\text{OPT} = d$.

The online algorithm may waste distance searching in the wrong direction and returning to the starting point. However, because the search distance doubles each time, the total wasted distance is bounded by a constant factor of $d$. Let’s actually work out that constant.

Number the rounds $i = 0, 1, 2, \ldots$ where round $i$ walks distance $2^i$ in its search direction and then returns to the start. Directions alternate, so (say) even-numbered rounds go right and odd-numbered rounds go left.

Assume WLOG the pizza is to the right at distance $d$. We finally find the pizza in some round $k$ where:

  • $k$ is even (it’s a right-direction round), and
  • $2^k \ge d$ (we walk far enough this time to reach the pizza), but
  • $2^{k-2} < d$ (the previous right-direction round didn’t go far enough)

The adversary picks $d$ to make us as inefficient as possible. The worst $d$ is just barely larger than $2^{k-2}$: if it were any smaller, we would’ve already found the pizza in round $k - 2$ and not paid for rounds $k-1$ and $k$.

The algorithm’s total walking distance breaks into two parts:

  1. Rounds $0$ through $k - 1$: each round is a complete round-trip of length $2 \cdot 2^i$ (walk out, walk back). Summing the geometric series:

    $$\sum_{i=0}^{k-1} 2 \cdot 2^i = 2 \cdot \frac{2^k - 1}{2 - 1} = 2^{k+1} - 2$$
  2. Round $k$: we walk distance $d$ in the right direction and stop when we hit the pizza (no return needed). This contributes $d$.

So the total cost is:

$$\text{cost}(\text{ALG}) = (2^{k+1} - 2) + d$$

The competitive ratio is:

$$\frac{\text{cost}(\text{ALG})}{\text{OPT}} = \frac{(2^{k+1} - 2) + d}{d}$$

Adversary’s worst-case choice: $d \to 2^{k-2}$ (just barely past the previous right-direction walk, so we’re forced through rounds $k-1$ and $k$). Substituting:

$$\frac{\text{cost}(\text{ALG})}{\text{OPT}} \to \frac{2^{k+1} - 2 + 2^{k-2}}{2^{k-2}} = \frac{2^{k+1}}{2^{k-2}} - \frac{2}{2^{k-2}} + 1 = 8 - 2^{3-k} + 1 = 9 - 2^{3-k}$$

As $k$ grows (i.e., for far-away pizza shops), the $2^{3-k}$ correction vanishes and the ratio approaches:

$$\boxed{\lim_{k \to \infty} \frac{\text{cost}(\text{ALG})}{\text{OPT}} = 9}$$

So the doubling strategy is $9$-competitive:

$$\text{cost}(\text{ALG}) \leq 9 \cdot \text{OPT}$$
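The cost formula can be checked by simulating the walk. The sketch below computes the total distance for a pizza shop at a signed position (positive means right, negative means left) and compares it with $\text{OPT} = d$. It assumes $d \geq 1$, since a shop much closer than the first step on the left side would break the constant-factor bound; that assumption and the function name are ours.

```python
def doubling_search_cost(target):
    """Total distance walked to reach `target` (positive = right, negative = left).

    Round i walks 2**i from the start, going right on even i and left on odd i.
    """
    if target == 0:
        return 0.0
    distance = abs(target)
    target_is_right = target > 0
    walked = 0.0
    i = 0
    while True:
        reach = 2 ** i
        goes_right = (i % 2 == 0)
        if goes_right == target_is_right and reach >= distance:
            return walked + distance  # found on the way out, no return trip
        walked += 2 * reach           # full round trip: out and back
        i += 1

d = 4.001  # just past the reach of round 2, so the shop is found in round 4
print(doubling_search_cost(d) / d)
```

For $d$ just past $2^{k-2}$ with $k = 4$, the formula above predicts a ratio near $9 - 2^{3-4} = 8.5$, and the ratio creeps toward $9$ as the shop moves farther away.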

The important lesson is that even without knowing where the target is, the online algorithm can still stay within a constant factor of the optimal offline solution by expanding its search exponentially.

This is why the doubling trick is useful in online algorithms: it prevents the algorithm from wasting too much time on small searches while still guaranteeing that the target will eventually be found.

  • The distinct elements problem asks us to estimate the number of unique items in a stream.
  • The idealized minimum-hash algorithm estimates $T$ using the smallest hash value, but it requires infinite precision.
  • The practical bit-based approach hashes elements to binary strings and tracks the largest least significant bit position.
  • Seeing many trailing zeros is rare, so it suggests that many distinct elements have appeared.
  • The basic estimator outputs $2^{P+1}$.
  • The basic estimator gives a $32$-approximation with probability at least $\frac{2}{3}$.
  • Markov’s inequality is used to bound the probability of overestimating.
  • Chebyshev’s inequality is used to bound the probability of underestimating.
  • HyperLogLog improves the basic estimator by using multiple estimates and combining them.
  • Online algorithms make decisions without knowing the future.
  • Competitive analysis compares an online algorithm to the optimal offline algorithm.
  • The ski rental problem is a classic example, and the rent-then-buy strategy is $2$-competitive.
  • The pizza finding problem is an online linear search problem, and the doubling strategy gives a $9$-competitive algorithm.