
Lecture 19 (04/15/2026) - AMS Sampling: Guarantee Boosting; Counting Distinct Elements; Uniform RV

Scribes: Vinesh Seepersaud, Anastasiia Tcyrenzhapova

  • Frequency moment estimation in streams
  • AMS sampling for estimating $F_2$
  • Boosting expectation to an $(\varepsilon, \delta)$ guarantee
  • Using a non-Bernoulli Chernoff bound
  • Proof of a key lemma used to bound the number of copies

Let the stream be

$$S = \{x_1, x_2, \dots, x_m\}, \qquad x_i \in \{1, 2, \dots, n\}.$$

For each item $i$, let $f(i)$ be its frequency. The goal is to estimate

$$F_2 = \sum_{i=1}^{n} f(i)^2.$$

More generally, one may estimate higher moments such as $\sum_i f(i)^k$.

The AMS (Alon-Matias-Szegedy) estimator works as follows:

  • Sample a position $i$ uniformly at random from the stream.
  • Let the sampled item be $x_i$.
  • Let $r$ be the number of occurrences of $x_i$ at or after position $i$.
  • Output $X = m\big(r^2 - (r-1)^2\big) = m(2r-1)$.

For the $k$-th moment, the estimator becomes

$$X = m\big(r^k - (r-1)^k\big).$$

A key fact is $\mathbb{E}[X] = F_2$, so one run of AMS gives an unbiased estimator for the second frequency moment.
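The steps above can be sketched in Python as follows. This is only a minimal illustration, assuming the whole stream is available as a list so that a position can be drawn directly; the names `ams_value` and `ams_sample` are hypothetical and not from the lecture.

```python
import random


def ams_value(stream, pos, k=2):
    """Estimator value X when the sampled position is `pos` (0-based)."""
    m = len(stream)
    item = stream[pos]
    # r = number of occurrences of the sampled item at or after `pos`.
    r = sum(1 for x in stream[pos:] if x == item)
    return m * (r**k - (r - 1) ** k)


def ams_sample(stream, rng, k=2):
    """One AMS run: draw a uniformly random position, then evaluate X."""
    pos = rng.randrange(len(stream))
    return ams_value(stream, pos, k)


stream = [5, 8, 2, 2, 5, 8, 8, 5, 3, 3]
# Item 5 occurs 3 times from position 0 onward, so X = 10 * (2*3 - 1) = 50.
print(ams_value(stream, 0))
# A random run returns one of 10, 30, 50 for this stream.
print(ams_sample(stream, random.Random(0)))
```

For `k=2` the returned value is exactly $m(2r-1)$; other values of `k` use the general formula $m(r^k - (r-1)^k)$.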

To see why, think about what happens when we sample position $i$ and it lands on item $a$. Say this is the $j$-th occurrence of $a$ in the stream ($j = 1$ being the first). Then there are $r = f(a) - j + 1$ occurrences of $a$ from position $i$ onward, so $X = m(2r - 1)$. Since each position is equally likely to be sampled (with probability $\frac{1}{m}$), the contribution to $\mathbb{E}[X]$ from item $a$ is:

$$\frac{1}{m} \sum_{j=1}^{f(a)} m \bigl(2(f(a) - j + 1) - 1\bigr) = \sum_{r=1}^{f(a)} (2r - 1) = f(a)^2.$$

The last equality is the fact that the sum of the first $k$ odd numbers equals $k^2$: $\sum_{r=1}^{k}(2r-1) = 2\cdot\frac{k(k+1)}{2} - k = k^2$. Summing the contributions over all distinct items gives $\mathbb{E}[X] = \sum_a f(a)^2 = F_2$.
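As a small worked instance of this sum (the value $f(a) = 3$ is chosen only for illustration), the three occurrences of $a$ are sampled with $r = 3, 2, 1$, and their contributions add up to $f(a)^2$:

```latex
\frac{1}{m}\Bigl(m(2\cdot 3 - 1) + m(2\cdot 2 - 1) + m(2\cdot 1 - 1)\Bigr) = 5 + 3 + 1 = 9 = 3^2
```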

Boosting to an $(\varepsilon, \delta)$ Guarantee


The $(\varepsilon, \delta)$ Guarantee


We want an output $Y$ such that

$$\Pr\big[|Y - F_2| \geq \varepsilon F_2\big] \leq \delta.$$

This means that with probability at least $1 - \delta$, the estimate has relative error at most $\varepsilon$.

Run the AMS estimator independently $t$ times, producing $X_1, X_2, \dots, X_t$. Define

$$Y = \frac{X_1 + X_2 + \cdots + X_t}{t}.$$

By linearity of expectation, $\mathbb{E}[Y] = F_2$.
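The averaging step can be sketched as below. This is an illustrative simulation rather than a streaming implementation: it assumes the stream is held in a list, and `ams_sample` and `averaged_estimate` are hypothetical helper names.

```python
import random


def ams_sample(stream, rng):
    """One AMS run for F_2: X = m(2r - 1) at a uniformly random position."""
    m = len(stream)
    pos = rng.randrange(m)
    r = stream[pos:].count(stream[pos])  # occurrences at or after pos
    return m * (2 * r - 1)


def averaged_estimate(stream, t, rng):
    """Y = (X_1 + ... + X_t) / t over t independent AMS runs."""
    return sum(ams_sample(stream, rng) for _ in range(t)) / t


stream = [5, 8, 2, 2, 5, 8, 8, 5, 3, 3]
f2 = sum(stream.count(a) ** 2 for a in set(stream))  # exact F_2 = 9 + 9 + 4 + 4 = 26
y = averaged_estimate(stream, t=2000, rng=random.Random(1))
print(f2, y)  # y should land close to 26
```

A single run only ever returns 10, 30, or 50 on this stream, so the averaging is what brings the output near $F_2 = 26$.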

Chernoff Bound for Non-Bernoulli Variables


Because each $X_i$ is not Bernoulli, we use a generalized Chernoff bound. If $X_1, \dots, X_t$ are i.i.d. random variables in $[0, C]$ and $Y = \frac{1}{t} \sum_{i=1}^{t} X_i$, then

$$\Pr\big[|Y - \mathbb{E}[Y]| \geq \varepsilon \mathbb{E}[Y]\big] \leq 2\exp\left(-\frac{\varepsilon^2 \mathbb{E}[Y] \, t}{3C}\right).$$

Let $f^* = \max_{1 \leq i \leq n} f(i)$ be the maximum frequency. Since $X = m(2r - 1)$ and $r \leq f^*$, we have $X \leq 2mf^*$, so we may take $C = 2mf^*$.

Substituting $\mathbb{E}[Y] = F_2$ and $C = 2mf^*$ gives

$$\Pr\big[|Y - F_2| \geq \varepsilon F_2\big] \leq 2\exp\left(-\frac{\varepsilon^2 F_2 \, t}{6mf^*}\right).$$

To make this at most $\delta$, it is enough that

$$t \geq \frac{6mf^*}{\varepsilon^2 F_2} \ln\left(\frac{2}{\delta}\right).$$

The issue is that both $f^*$ (the max frequency) and $F_2$ (the quantity we’re trying to estimate in the first place) are unknown while the stream is being processed. So even though we know the right formula for $t$, we can’t compute it yet.

To get a computable bound, we replace the unknown ratio $\frac{mf^*}{F_2}$ with an upper bound involving only things we know ahead of time. The key fact is:

$$\frac{mf^*}{F_2} \leq \sqrt{n}.$$

Since $n$ (the universe size) is known before the stream begins, this gives us a concrete value to use. Substituting $\sqrt{n}$ for $\frac{mf^*}{F_2}$ may overestimate the true minimum number of copies needed, but that is fine: we just run a few more copies than strictly necessary, and the guarantee still holds.

Using this, it suffices to choose $t \geq \frac{6\sqrt{n}}{\varepsilon^2} \ln\!\left(\frac{2}{\delta}\right)$. Therefore, running

$$O\!\left(\frac{\sqrt{n}}{\varepsilon^2} \log \frac{1}{\delta}\right)$$

independent copies of AMS and averaging them yields an $(\varepsilon, \delta)$-approximation for $F_2$.
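To make the two choices of $t$ concrete, here is a small Python sketch that evaluates both formulas. The function names and the sample parameters ($n = 100$, $\varepsilon = 0.5$, $\delta = 0.1$, and a stream with $m = 10$, $f^* = 3$, $F_2 = 26$) are assumptions chosen for illustration.

```python
import math


def copies_exact(m, f_star, f2, eps, delta):
    """Smallest integer t with t >= 6 m f* / (eps^2 F_2) * ln(2/delta).

    Needs f* and F_2, which are unknown while the stream is processed.
    """
    return math.ceil(6 * m * f_star / (eps**2 * f2) * math.log(2 / delta))


def copies_upper(n, eps, delta):
    """Computable choice obtained by replacing m f* / F_2 with sqrt(n)."""
    return math.ceil(6 * math.sqrt(n) / eps**2 * math.log(2 / delta))


print(copies_upper(n=100, eps=0.5, delta=0.1))                   # 719
print(copies_exact(m=10, f_star=3, f2=26, eps=0.5, delta=0.1))   # 83
```

The gap between the two numbers shows how much the $\sqrt{n}$ bound can overshoot on a particular stream; the guarantee still holds with the larger value.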

We prove that $\dfrac{mf^*}{F_2} \leq \sqrt{n}$.

First note that

$$F_2 = \sum_{i=1}^{n} f(i)^2, \qquad \sum_{i=1}^{n} f(i) = m.$$

We need two lower bounds on $F_2$. The first comes from asking: among all ways to distribute $m$ total frequency across $n$ items, which minimizes $\sum f(i)^2$? The answer is to spread frequency evenly, giving each item $m/n$. This is the same principle as: for fixed sum, the sum of squares is minimized when all values are equal. The minimum is then $n \cdot (m/n)^2 = m^2/n$. So:

$$F_2 \geq \frac{m^2}{n}.$$
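One standard way to justify this bound rigorously (not spelled out in the notes, so treat it as a supplementary sketch) is the Cauchy-Schwarz inequality applied to the frequency vector and the all-ones vector:

```latex
m^2 = \Bigl(\sum_{i=1}^{n} f(i) \cdot 1\Bigr)^2
    \leq \Bigl(\sum_{i=1}^{n} f(i)^2\Bigr)\Bigl(\sum_{i=1}^{n} 1^2\Bigr)
    = n F_2
\quad\Longrightarrow\quad
F_2 \geq \frac{m^2}{n}
```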

The second bound is simpler: one of the terms in $F_2$ is $(f^*)^2$, so trivially $F_2 \geq (f^*)^2$.

Now consider two cases, depending on whether $f^*$ is “small” or “large” relative to $m/\sqrt{n}$. Why that threshold? It’s where both bounds on $F_2$ become equal: if $f^* = m/\sqrt{n}$, then $(f^*)^2 = m^2/n$, so the two lower bounds coincide. Below that threshold the first bound ($F_2 \geq m^2/n$) does the work; above it the second bound ($F_2 \geq (f^*)^2$) does the work.

Case 1. Suppose $f^*$ is small:

$$(f^*)^2 \leq \frac{m^2}{n} \implies f^* \leq \frac{m}{\sqrt{n}}.$$

Multiplying both sides by $\frac{m}{F_2} > 0$:

$$\frac{mf^*}{F_2} \leq \frac{m^2}{\sqrt{n}\, F_2}.$$

Now substitute the lower bound $F_2 \geq m^2/n$ into the denominator. A larger denominator makes the fraction smaller, so replacing $F_2$ with the smaller value $m^2/n$ only increases the right-hand side, which keeps the inequality valid:

$$\frac{mf^*}{F_2} \leq \frac{m^2}{\sqrt{n}\,(m^2/n)} = \frac{m^2 \cdot n}{\sqrt{n} \cdot m^2} = \sqrt{n}.$$

Case 2. Suppose $f^*$ is large:

$$(f^*)^2 > \frac{m^2}{n} \implies f^* > \frac{m}{\sqrt{n}}.$$

Since $F_2 \geq (f^*)^2$, we have $\frac{1}{F_2} \leq \frac{1}{(f^*)^2}$. Multiplying both sides by $mf^*$ gives

$$\frac{mf^*}{F_2} \leq \frac{m}{f^*}.$$

Substituting the lower bound $f^* > m/\sqrt{n}$ into the denominator:

$$\frac{m}{f^*} < \frac{m}{m/\sqrt{n}} = \sqrt{n},$$

and so $\dfrac{mf^*}{F_2} \leq \sqrt{n}$.
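As a sanity check on the lemma, the following sketch tests the inequality exhaustively on small frequency vectors. It squares both sides so that everything stays in exact integer arithmetic; the helper name `lemma_holds` and the choice of $n = 4$ with frequencies in $0..5$ are arbitrary illustrative assumptions.

```python
from itertools import product


def lemma_holds(freqs):
    """Check m f* / F_2 <= sqrt(n), squared to (m f*)^2 <= n F_2^2 to avoid floats."""
    n = len(freqs)
    m = sum(freqs)
    f_star = max(freqs)
    f2 = sum(f * f for f in freqs)
    return (m * f_star) ** 2 <= n * f2**2


# Every frequency vector with n = 4 items and each frequency in 0..5,
# excluding the empty stream (m = 0).
vectors = [v for v in product(range(6), repeat=4) if sum(v) > 0]
print(all(lemma_holds(v) for v in vectors))  # True
```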

This lecture followed the standard boosting pattern for streaming algorithms:

  • Design an unbiased estimator. Find a one-shot randomized algorithm that returns the right answer in expectation, but may be far off on any individual run.
  • Repeat independently. Run $t$ copies in parallel and average their outputs.
  • Determine how many copies suffice. Apply a concentration bound (here, the non-Bernoulli Chernoff bound) to find the minimum $t$ that achieves the $(\varepsilon, \delta)$ guarantee.

For AMS sampling, a subtlety arose in the third step: the minimum $t$ involved unknown quantities ($f^*$ and $F_2$). The Key Lemma resolved this by bounding $mf^*/F_2 \leq \sqrt{n}$, replacing unknown values with the known universe size $n$.

Running $O\!\left(\frac{\sqrt{n}}{\varepsilon^2} \ln\!\left(\frac{1}{\delta}\right)\right)$ copies of AMS sampling and averaging gives an output $Y$ satisfying

$$\Pr\!\left(\left|Y - \sum f^2(i)\right| \geq \varepsilon \sum f^2(i)\right) \leq \delta.$$

This was for $\sum f^2(i)$. What about $\sum f^k(i)$?

Example: Running $O\!\left(\frac{n^{1 - 1/k}}{\varepsilon^2} \ln\!\left(\frac{1}{\delta}\right)\right)$ copies suffices for estimating $\sum f^k(i)$. Note that when $k = 2$ this becomes $n^{1-1/2} = \sqrt{n}$, matching the result above. For $k > 2$, the exponent $1 - 1/k$ grows toward 1, meaning more copies are needed for higher moments.

Context: You’re working for Amazon, you have a stream of purchases made by people, and at any moment you want to answer how many distinct products the company has sold today.

Given a stream $\{x_1, \dots, x_m\}$ with $x_i \in \{1, \dots, n\}$, output $F_0$, the number of distinct elements that have appeared in the stream so far.

For example, for the stream $S = \{5, 8, 2, 2, 5, 8, 8, 5, 3, 3\}$, the output should be 4 because the stream contains 4 distinct items: $F_0 = 4$.

The naive approach: start with an empty set, and whenever you see the next item in the stream, check whether it’s in the set. If it is not, add it. In the end, the set contains exactly the distinct elements, so it holds $F_0$ items. Since each item $x_i \in \{1, \dots, n\}$ takes about $\log n$ bits to store, we’ll need at least $F_0 \log n$ bits.
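The naive approach is short enough to write out directly; the function name `count_distinct_naive` is only illustrative.

```python
def count_distinct_naive(stream):
    """Exact F_0, at the cost of storing every distinct element seen so far."""
    seen = set()
    for x in stream:
        if x not in seen:
            seen.add(x)
    return len(seen)


print(count_distinct_naive([5, 8, 2, 2, 5, 8, 8, 5, 3, 3]))  # 4
```

The set `seen` is what makes the memory grow with $F_0$, which is the cost the streaming approach tries to avoid.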

The Flajolet-Martin algorithm addresses this problem:

  • First: “Idealized algorithm”
  • Next: Practical algorithm

You have a hash function that takes a number and gives you a uniformly random number between 0 and 1:

$$h : \{1, \dots, n\} \to [0, 1]$$

where $n$ is the upper bound on the number of distinct elements and $[0, 1]$ is the continuous interval. Hash each $x_i$ to a value $0 < h(x_i) < 1$. We only maintain the smallest hash seen so far.

Output:

$$\frac{1}{\min h(x_i)} - 1$$

The intuition behind this output formula: if there are $d$ distinct elements, each gets an independent uniform hash in $[0,1]$, so the minimum of $d$ such values has expected value $\frac{1}{d+1}$ (the case $d = 2$ is the two-darts question at the end of these notes). Inverting gives approximately $d+1$, and subtracting 1 gives $d = F_0$. So $\frac{1}{\min h} - 1$ is a natural estimate of $F_0$. It is not exactly unbiased, however, because $\mathbb{E}[1/Z] \neq 1/\mathbb{E}[Z]$ in general; the argument only shows that $\min h$ concentrates around $\frac{1}{d+1}$ on average.
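A minimal sketch of the idealized algorithm follows. A truly uniform random hash into $[0,1]$ is an idealization, so this example substitutes a salted SHA-256 digest scaled into $(0, 1]$; that substitution, the salt value, and the function names are all assumptions made for illustration, not part of the lecture.

```python
import hashlib


def uniform_hash(x, salt="demo"):
    """Map x to a fixed pseudo-uniform value in (0, 1]; a stand-in for the idealized h."""
    digest = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    v = int.from_bytes(digest[:8], "big")  # integer in [0, 2^64)
    return (v + 1) / 2**64


def idealized_distinct_estimate(stream):
    """Track only the minimum hash value, then output 1/min - 1."""
    min_hash = 1.0
    for x in stream:
        min_hash = min(min_hash, uniform_hash(x))
    return 1.0 / min_hash - 1.0


stream = [5, 8, 2, 2, 5, 8, 8, 5, 3, 3]
print(idealized_distinct_estimate(stream))
# Repeating elements never changes the minimum, so the estimate is unchanged.
print(idealized_distinct_estimate(stream * 3))
```

A single run like this can be far from $F_0$; the point of the sketch is only that the state is one number and that duplicates have no effect on it.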

Consider a continuous random variable $X \in [0, 1]$, uniformly distributed.

Density:

$$f(x) = 1 \quad \forall\, x \in [0, 1].$$

Cumulative distribution function:

$$F(x) = \Pr(X \leq x) = \begin{cases} 0 & x \leq 0 \quad \text{(the chance that the r.v. is smaller than 0 is 0)} \\ x & 0 \leq x \leq 1 \quad \text{(}\Pr(X \le x) = \frac{x}{1} = x\text{)} \\ 1 & x > 1 \quad \text{(if } x > 1\text{, then } X \le x \text{ always holds because } X \le 1\text{)} \end{cases}$$

For a continuous r.v. such as this one, we do not talk about the probability of a single point but only about the probability of a small segment (a single point always has probability zero).

The expectation of a continuous r.v. is defined via its density:

$$\mathbb{E}[X] = \int_{0}^{1} x f(x) \, dx = \int_{0}^{1} x \, dx = \left.\frac{x^2}{2}\right|_{0}^{1} = \frac{1}{2}$$

As we close this lecture, we are posed the question: suppose you throw two darts that land independently and uniformly at random on $[0, 1]$, at positions $X_1$ and $X_2$.

What is the expected position of the smaller of the two darts? $\mathbb{E}[\min(X_1, X_2)] = 1/3$

What is the expected position of the bigger of the two darts? $\mathbb{E}[\max(X_1, X_2)] = 2/3$
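These two values can be checked empirically with a quick Monte Carlo simulation. The sketch below assumes the darts are independent uniform draws on $[0, 1]$; the function name, trial count, and seed are arbitrary.

```python
import random


def mean_min_max(trials, rng):
    """Average of min(X1, X2) and max(X1, X2) for independent uniform darts on [0, 1]."""
    total_min = 0.0
    total_max = 0.0
    for _ in range(trials):
        x1, x2 = rng.random(), rng.random()
        total_min += min(x1, x2)
        total_max += max(x1, x2)
    return total_min / trials, total_max / trials


avg_min, avg_max = mean_min_max(100_000, random.Random(0))
print(avg_min, avg_max)  # close to 1/3 and 2/3
```

Note that the two averages always sum to the average of $X_1 + X_2$, which is why the expectations $1/3$ and $2/3$ add up to $1$.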

We’ll explore this further in the next lecture.