Lecture 19 (04/15/2026) - AMS Sampling: Guarantee Boosting; Counting Distinct Elements; Uniform RV | CSCI 328

Scribes: Vinesh Seepersaud, Anastasiia Tcyrenzhapova

Summary

Frequency moment estimation in streams
AMS sampling for estimating $F_2$
Boosting expectation to an $(\varepsilon, \delta)$ guarantee
Using a non-Bernoulli Chernoff bound
Proof of a key lemma used to bound the number of copies

Frequency Moment Estimation

Let the stream be

S = \{x_1, x_2, \dots, x_m\}, \qquad x_i \in \{1, 2, \dots, n\}.

For each item $i$ , let $f(i)$ be its frequency. The goal is to estimate

F_2 = \sum_{i=1}^{n} f(i)^2.

More generally, one may estimate higher moments such as $\sum_i f(i)^k$ .

AMS Sampling

The AMS (Alon-Matias-Szegedy) estimator works as follows:

Sample a position $i$ uniformly at random from the stream.
Let the sampled item be $x_i$ .
Let $r$ be the number of occurrences of $x_i$ at or after position $i$ .
Output $X = m\big(r^2 - (r-1)^2\big) = m(2r-1).$

For the $k$ -th moment, the estimator becomes

X = m\big(r^k - (r-1)^k\big).
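The steps above can be sketched in Python as follows. This is a minimal illustration rather than a true streaming implementation: it assumes the whole stream is available as a list (so it can be indexed and rescanned), and the function names `ams_value_at` and `ams_estimate` are hypothetical.

```python
import random


def ams_value_at(stream, pos, k=2):
    """Value X the AMS estimator outputs if it samples index `pos` (0-based)."""
    m = len(stream)
    item = stream[pos]
    # r = number of occurrences of the sampled item at or after the sampled position
    r = sum(1 for x in stream[pos:] if x == item)
    return m * (r**k - (r - 1)**k)


def ams_estimate(stream, k=2, rng=random):
    """One run of the estimator: sample a position uniformly, then evaluate X there."""
    pos = rng.randrange(len(stream))
    return ams_value_at(stream, pos, k)
```

For the stream `[5, 8, 2, 2, 5, 8, 8, 5, 3, 3]`, sampling the first position finds item 5 with $r = 3$ , so `ams_value_at(stream, 0)` returns $10 \cdot (2 \cdot 3 - 1) = 50$ .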

A key fact is $\mathbb{E}[X] = F_2$ , so one run of AMS gives an unbiased estimator for the second frequency moment.

To see why, think about what happens when we sample position $i$ and it lands on item $a$ . Say this is the $j$ -th occurrence of $a$ in the stream ( $j = 1$ being the first). Then there are $r = f(a) - j + 1$ occurrences of $a$ from position $i$ onward, so $X = m(2r - 1)$ . Since each position is equally likely to be sampled (with probability $\frac{1}{m}$ ), the contribution to $\mathbb{E}[X]$ from item $a$ is:

\frac{1}{m} \sum_{j=1}^{f(a)} m \bigl(2(f(a) - j + 1) - 1\bigr) = \sum_{r=1}^{f(a)} (2r - 1) = f(a)^2.

The last equality is the fact that the sum of the first $k$ odd numbers equals $k^2$ : $\sum_{r=1}^{k}(2r-1) = 2\cdot\frac{k(k+1)}{2} - k = k^2$ . Summing the contributions over all distinct items gives $\mathbb{E}[X] = \sum_a f(a)^2 = F_2$ .
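As a small sanity check of this argument, consider a hypothetical three-item stream $S = (a, b, a)$ , chosen only for illustration. Enumerating all three equally likely sample positions recovers $F_2$ exactly:

```latex
% Stream S = (a, b, a): m = 3, f(a) = 2, f(b) = 1, F_2 = 2^2 + 1^2 = 5.
\begin{array}{c|c|c|c}
\text{position } i & x_i & r & X = m(2r-1) \\ \hline
1 & a & 2 & 9 \\
2 & b & 1 & 3 \\
3 & a & 1 & 3
\end{array}
\qquad
\mathbb{E}[X] = \tfrac{1}{3}(9 + 3 + 3) = 5 = F_2.
```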

Boosting to an $(\varepsilon, \delta)$ Guarantee

The $(\varepsilon, \delta)$ Guarantee

We want an output $Y$ such that

\Pr\big[|Y - F_2| \geq \varepsilon F_2\big] \leq \delta.

This means that with probability at least $1 - \delta$ , the estimate has relative error at most $\varepsilon$ .

Averaging Independent Copies

Run the AMS estimator independently $t$ times, producing $X_1, X_2, \dots, X_t$ . Define

Y = \frac{X_1 + X_2 + \cdots + X_t}{t}.

By linearity of expectation, $\mathbb{E}[Y] = F_2$ .

Chernoff Bound for Non-Bernoulli Variables

Because the $X_i$ are not Bernoulli variables, we use a generalized Chernoff bound. If $X_1, \dots, X_t$ are i.i.d. random variables taking values in $[0, C]$ and $Y = \frac{1}{t} \sum_{i=1}^{t} X_i$ , then for $0 < \varepsilon \leq 1$ ,

\Pr\big[|Y - \mathbb{E}[Y]| \geq \varepsilon \mathbb{E}[Y]\big] \leq 2\exp\left(-\frac{\varepsilon^2 \mathbb{E}[Y] \, t}{3C}\right).

Bounding the Output Range

Let $f^* = \max_{1 \leq i \leq n} f(i)$ be the maximum frequency. Since $X = m(2r - 1)$ and $r$ counts occurrences of a single item, we have $r \leq f^*$ , hence $0 \leq X \leq 2mf^*$ . So we may take $C = 2mf^*$ .

Substituting $\mathbb{E}[Y] = F_2$ and $C = 2mf^*$ gives

\Pr\big[|Y - F_2| \geq \varepsilon F_2\big] \leq 2\exp\left(-\frac{\varepsilon^2 F_2 \, t}{6mf^*}\right).

To make this at most $\delta$ , it is enough that

t \geq \frac{6mf^*}{\varepsilon^2 F_2} \ln\left(\frac{2}{\delta}\right).
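For completeness, the algebra behind this condition can be spelled out by setting the bound to at most $\delta$ and solving for $t$ :

```latex
\begin{aligned}
2\exp\left(-\frac{\varepsilon^2 F_2 \, t}{6mf^*}\right) \leq \delta
&\iff \exp\left(-\frac{\varepsilon^2 F_2 \, t}{6mf^*}\right) \leq \frac{\delta}{2} \\
&\iff -\frac{\varepsilon^2 F_2 \, t}{6mf^*} \leq \ln\left(\frac{\delta}{2}\right) \\
&\iff \frac{\varepsilon^2 F_2 \, t}{6mf^*} \geq \ln\left(\frac{2}{\delta}\right) \\
&\iff t \geq \frac{6mf^*}{\varepsilon^2 F_2} \ln\left(\frac{2}{\delta}\right).
\end{aligned}
```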

The issue is that both $f^*$ (the max frequency) and $F_2$ (the quantity we’re trying to estimate in the first place) are unknown while the stream is being processed. So even though we know the right formula for $t$ , we can’t compute it yet.

Key Lemma

To get a computable bound, we replace the unknown ratio $\frac{mf^*}{F_2}$ with an upper bound involving only things we know ahead of time. The key fact is:

\frac{mf^*}{F_2} \leq \sqrt{n}.

Since $n$ — the universe size — is known before the stream begins, this gives us a concrete value to use. Substituting $\sqrt{n}$ for $\frac{mf^*}{F_2}$ may overestimate the true minimum number of copies needed, but that is fine: we just run a few more copies than strictly necessary, and the guarantee still holds.

Using this, it suffices to choose $t \geq \frac{6\sqrt{n}}{\varepsilon^2} \ln\!\left(\frac{2}{\delta}\right)$ . Therefore, running

O\!\left(\frac{\sqrt{n}}{\varepsilon^2} \log \frac{1}{\delta}\right)

independent copies of AMS and averaging them yields an $(\varepsilon, \delta)$ -approximation for $F_2$ .
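A rough Python sketch of the boosted estimator is shown below. It assumes the stream fits in a list and that the universe size `n` is supplied by the caller; the names `num_copies`, `ams_sample`, and `boosted_f2` are hypothetical, and the seed is fixed only to make the run reproducible.

```python
import math
import random


def num_copies(n, eps, delta):
    """Smallest integer t with t >= (6 * sqrt(n) / eps^2) * ln(2 / delta)."""
    return math.ceil(6 * math.sqrt(n) / eps**2 * math.log(2 / delta))


def ams_sample(stream, rng):
    """One AMS run for F_2: sample a position, count occurrences from there on."""
    m = len(stream)
    pos = rng.randrange(m)
    r = sum(1 for x in stream[pos:] if x == stream[pos])
    return m * (2 * r - 1)


def boosted_f2(stream, n, eps, delta, seed=0):
    """Average t independent AMS runs, with t computed from n, eps, delta only."""
    rng = random.Random(seed)
    t = num_copies(n, eps, delta)
    return sum(ams_sample(stream, rng) for _ in range(t)) / t


# Example stream over the universe {1, ..., 8}; its true F_2 is 9 + 9 + 4 + 4 = 26.
stream = [5, 8, 2, 2, 5, 8, 8, 5, 3, 3]
estimate = boosted_f2(stream, n=8, eps=0.2, delta=0.05)
```

Note that `num_copies` never looks at the stream, which is the point of replacing $\frac{mf^*}{F_2}$ by $\sqrt{n}$ .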

Proof of the Key Lemma

We prove that $\dfrac{mf^*}{F_2} \leq \sqrt{n}$ .

First note that

F_2 = \sum_{i=1}^{n} f(i)^2, \qquad \sum_{i=1}^{n} f(i) = m.

We need two lower bounds on $F_2$ . The first comes from asking: among all ways to distribute $m$ total frequency across $n$ items, which minimizes $\sum f(i)^2$ ? The answer is to spread frequency evenly — give each item $m/n$ . This is the same principle as: for fixed sum, the sum of squares is minimized when all values are equal. The minimum is then $n \cdot (m/n)^2 = m^2/n$ . So:

F_2 \geq \frac{m^2}{n}.

The second bound is simpler: one of the terms in $F_2$ is $(f^*)^2$ , so trivially $F_2 \geq (f^*)^2$ .

Now consider two cases, depending on whether $f^*$ is “small” or “large” relative to $m/\sqrt{n}$ . Why that threshold? It’s where both bounds on $F_2$ become equal: if $f^* = m/\sqrt{n}$ , then $(f^*)^2 = m^2/n$ , so the two lower bounds coincide. Below that threshold the first bound ( $F_2 \geq m^2/n$ ) does the work; above it the second bound ( $F_2 \geq (f^*)^2$ ) does the work.

Case 1. Suppose $f^*$ is small:

(f^*)^2 \leq \frac{m^2}{n} \implies f^* \leq \frac{m}{\sqrt{n}}.

Multiplying both sides by $\frac{m}{F_2} > 0$ :

\frac{mf^*}{F_2} \leq \frac{m^2}{\sqrt{n}\, F_2}.

Now substitute the lower bound $F_2 \geq m^2/n$ into the denominator (a larger denominator makes the fraction smaller, so replacing $F_2$ with the smaller value $m^2/n$ only increases the right-hand side — keeping the inequality valid):

\frac{mf^*}{F_2} \leq \frac{m^2}{\sqrt{n}(m^2/n)} = \frac{m^2 \cdot n}{\sqrt{n} \cdot m^2} = \sqrt{n}.

Case 2. Suppose $f^*$ is large:

(f^*)^2 > \frac{m^2}{n} \implies f^* > \frac{m}{\sqrt{n}}.

Since $F_2 \geq (f^*)^2$ , we have $\frac{1}{F_2} \leq \frac{1}{(f^*)^2}$ . Multiplying both sides by $mf^*$ gives

\frac{mf^*}{F_2} \leq \frac{m}{f^*}.

Since $f^* > m/\sqrt{n}$ , replacing the denominator $f^*$ by the smaller value $m/\sqrt{n}$ only increases the fraction:

\frac{m}{f^*} < \frac{m}{m/\sqrt{n}} = \sqrt{n},

and so $\dfrac{mf^*}{F_2} \leq \sqrt{n}$ .
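The lemma can also be checked numerically. The sketch below draws random frequency vectors and confirms that $\frac{mf^*}{F_2}$ never exceeds $\sqrt{n}$ ; the ranges for `n` and the frequencies are arbitrary choices made for illustration, and the function names are hypothetical.

```python
import math
import random


def lemma_ratio(freqs):
    """Return m * f_star / F_2 for a list of item frequencies."""
    m = sum(freqs)
    f_star = max(freqs)
    f2 = sum(f * f for f in freqs)
    return m * f_star / f2


def check_lemma(trials=1000, seed=0):
    """Return True if m * f_star / F_2 <= sqrt(n) held in every random trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 50)
        freqs = [rng.randint(0, 20) for _ in range(n)]
        if sum(freqs) == 0:
            freqs[0] = 1  # the stream must be non-empty so that F_2 > 0
        # small tolerance guards against floating-point rounding
        if lemma_ratio(freqs) > math.sqrt(n) + 1e-9:
            return False
    return True
```

A passing check is evidence rather than proof; the two-case argument above is what establishes the bound for every stream.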

Conclusion

This lecture followed the standard boosting pattern for streaming algorithms:

Design an unbiased estimator. Find a one-shot randomized algorithm that returns the right answer in expectation, but may be far off on any individual run.
Repeat independently. Run $t$ copies in parallel and average their outputs.
Determine how many copies suffice. Apply a concentration bound (here, the non-Bernoulli Chernoff bound) to find the minimum $t$ that achieves the $(\varepsilon, \delta)$ guarantee.

For AMS sampling, a subtlety arose in step 3: the minimum $t$ involved unknown quantities ( $f^*$ and $F_2$ ). The Key Lemma resolved this by bounding $mf^*/F_2 \leq \sqrt{n}$ , replacing unknown values with the known universe size $n$ .

Running $O\!\left(\frac{\sqrt{n}}{\varepsilon^2} \ln\!\left(\frac{1}{\delta}\right)\right)$ copies of AMS sampling and averaging gives an output $Y$ satisfying

\Pr\!\left(\left|Y - \sum_i f(i)^2\right| \geq \varepsilon \sum_i f(i)^2\right) \leq \delta.

This was for $\sum_i f(i)^2$ . What about $\sum_i f(i)^k$ ?

Running $O\!\left(\frac{n^{1 - 1/k}}{\varepsilon^2} \ln\!\left(\frac{1}{\delta}\right)\right)$ copies suffices for estimating $\sum_i f(i)^k$ , treating $k$ as a constant. When $k=2$ the factor is $n^{1-1/2} = \sqrt{n}$ , matching the result above. As $k$ grows, the exponent $1 - 1/k$ approaches 1, so higher moments need more copies.

Counting Distinct Elements

Context: You’re working for Amazon, you have a stream of purchases made by people, and at any moment you want to answer how many distinct products the company has sold today.

Given a stream $\{x_1, \dots, x_m\}$ with $x_i \in \{1, \dots, n\}$ , output $F_0$ , the number of distinct elements that have appeared in the stream so far.

For example, for the stream $S = \{5, 8, 2, 2, 5, 8, 8, 5, 3, 3\}$ , the output should be $F_0 = 4$ , because the stream contains 4 distinct items: 5, 8, 2, and 3.

The naive approach: start with an empty set, and for each item in the stream, check whether it is already in the set; if it is not, add it. At the end, the set contains exactly the distinct elements, so it stores $F_0$ items. Since each item is a value in $\{1, \dots, n\}$ and takes about $\log n$ bits to store, this approach needs at least $F_0 \log n$ bits of space.
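The naive approach might look like this in Python (the function name `count_distinct_naive` is hypothetical):

```python
def count_distinct_naive(stream):
    """Count distinct elements by remembering every element seen so far."""
    seen = set()
    for x in stream:
        if x not in seen:
            seen.add(x)
    # the set holds one entry per distinct element, so its size is F_0
    return len(seen)
```

On the example stream `[5, 8, 2, 2, 5, 8, 8, 5, 3, 3]` this returns 4, at the cost of storing all four distinct items.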

The Flajolet-Martin algorithm addresses this problem:

First: “Idealized algorithm”
Next: Practical algorithm

Ideal Algorithm

You have a hash function that takes a number and gives you a uniformly random number between 0 and 1:

h : \{1, \dots, n\} \to [0, 1]

where $n$ is the upper bound on the number of distinct elements and $[0, 1]$ is the continuous interval. Hash each $x_i$ to a value $0 < h(x_i) < 1$ . We only maintain the smallest hash seen so far.

Output:

\frac{1}{\min h(x_i)} - 1

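The intuition behind this output formula: if there are $d$ distinct elements, each gets an independent uniform hash in $[0,1]$ (repeated occurrences of an element hash to the same value, so they do not change the minimum). The minimum of $d$ such values has expected value $\frac{1}{d+1}$ ; the case $d = 2$ appears at the end of the Uniform Random Variable section below. Inverting gives approximately $d+1$ , and subtracting 1 gives $d = F_0$ . This is a heuristic rather than a proof of unbiasedness: in general $\mathbb{E}[1/Z] \neq 1/\mathbb{E}[Z]$ , so $\frac{1}{\min h} - 1$ is best read as a natural estimate of $F_0$ whose quality depends on how tightly $\min h$ concentrates around $\frac{1}{d+1}$ .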
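A minimal simulation of the idealized algorithm is sketched below. A truly uniform hash into the continuous interval does not exist in practice, so this sketch substitutes SHA-256 truncated to 64 bits as a stand-in; that substitution, the `salt` parameter, and the function names are assumptions made for illustration only.

```python
import hashlib


def hash_to_unit_interval(x, salt="demo"):
    """Map x to a pseudo-random value in (0, 1]; a stand-in for the ideal hash h."""
    digest = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    v = int.from_bytes(digest[:8], "big")  # 64-bit integer from the digest
    return (v + 1) / 2**64                 # never 0, so 1 / min is always defined


def fm_ideal_estimate(stream, salt="demo"):
    """Keep only the smallest hash seen so far and output 1 / min - 1."""
    min_hash = 1.0
    for x in stream:
        min_hash = min(min_hash, hash_to_unit_interval(x, salt))
    return 1.0 / min_hash - 1.0
```

Because equal items hash to equal values, the output depends only on the set of distinct items, and the algorithm stores a single number. A single run can be far from $F_0$ , so the value for any particular salt should not be read as accurate.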

Uniform Random Variable

Let $X$ be a continuous random variable distributed uniformly on $[0, 1]$ .

Density:

f(x) = 1 \quad \forall\, x \in [0, 1].

Cumulative distribution function:

F(x) = \Pr(X \leq x) = \begin{cases} 0 & x < 0 \quad \text{(the chance that the r.v. is smaller than 0 is 0)} \\ x & 0 \leq x \leq 1 \quad \text{(}\Pr(X \le x) = \frac{x}{1} = x\text{)} \\ 1 & x > 1 \quad \text{(if } x > 1\text{, then } X \le x \text{ always holds because } X \le 1\text{)} \end{cases}

For a continuous r.v. we do not talk about the probability of a single point, which is always zero, but only about the probability of a segment.

The expectation of a continuous r.v. is defined via its density:

\mathbb{E}[X] = \int_{0}^{1} x f(x) \, dx = \int_{0}^{1} x \, dx = \left.\frac{x^2}{2}\right|_{0}^{1} = \frac{1}{2}

The lecture closed with a question: suppose you throw two darts that land independently and uniformly at random on $[0, 1]$ , at positions $X_1$ and $X_2$ .

What is the expected position of the smaller of the two darts? $\mathbb{E}[\min(X_1, X_2)] = 1/3$

What is the expected position of the bigger of the two darts? $\mathbb{E}[\max(X_1, X_2)] = 2/3$
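One way to check these two values, not necessarily the derivation used in class, is the tail formula $\mathbb{E}[Z] = \int_0^1 \Pr[Z > x] \, dx$ for a random variable $Z \in [0, 1]$ , assuming the two darts are independent and uniform:

```latex
\begin{aligned}
\Pr[\min(X_1, X_2) > x] &= \Pr[X_1 > x]\,\Pr[X_2 > x] = (1 - x)^2, \\
\mathbb{E}[\min(X_1, X_2)] &= \int_0^1 (1 - x)^2 \, dx = \frac{1}{3}, \\
\Pr[\max(X_1, X_2) > x] &= 1 - \Pr[X_1 \leq x]\,\Pr[X_2 \leq x] = 1 - x^2, \\
\mathbb{E}[\max(X_1, X_2)] &= \int_0^1 (1 - x^2) \, dx = \frac{2}{3}.
\end{aligned}
```

The two answers sum to 1, as they must, since $\min(X_1, X_2) + \max(X_1, X_2) = X_1 + X_2$ and $\mathbb{E}[X_1 + X_2] = 1$ .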

We’ll explore this further in the next lecture.