Lecture 18 (04/13/2026) - Frequency Moments; AMS Sampling Algorithm

Scribes: Simrandeep Singh and Amrina Qayyum

In the streaming algorithms part of the course, we have already seen:

  • sampling,
  • counting,
  • approximate median,
  • heavy hitters using Count-Min Sketch.

The remaining topics in the streaming section are:

  • frequency estimation,
  • counting distinct elements.

After that, the course will move to other topics.

A course summary shown in class listed the following topics:

  • Dictionary problem:
    • HWC
    • FKS
    • linear probing
    • cuckoo hashing
  • Approximate membership problem:
    • Bloom filter
  • Streaming algorithms:
    • sampling
    • counting: Morris, Morris+, Morris++
    • approximate median
    • heavy hitters: CMS with $(\varepsilon,\delta)$ guarantee
    • frequency estimation
    • HyperLogLog (counting distinct elements)
  • External memory algorithms
  • Nearest neighbor, dimensionality reduction
  • Gen AI models, privacy (differential privacy)

The probability tools covered in the course: Bernoulli, Geometric, Coupon Collector, Expectation, Variance, Markov, Chebyshev, Chernoff, Balls & Bins.

In this lecture, we covered the following topics:

  • Heavy Hitters,
  • Count-Min Sketch,
  • Bloom Filter Discussion,
  • Frequency Moment Estimation,
  • AMS Sampling,
  • Chernoff Bound for Non-Bernoulli Variables.

This lecture began with a brief review of heavy hitters and the Count-Min Sketch. The discussion then moved to practical questions and limitations related to Bloom filters, including locality of memory access, adaptive handling of false positives, counting variants, and range- or distance-based queries. The second main part of the lecture introduced frequency moment estimation, with particular emphasis on the second frequency moment. The AMS sampling method was presented for estimating this quantity, followed by a proof that the estimator is correct in expectation. The lecture ended with the idea of boosting accuracy by repetition, a connection to Exercise 4.9, and a Chernoff-style concentration bound for non-Bernoulli random variables. The remaining issue of the unknown maximum frequency was postponed to the next class, which will continue with HyperLogLog for counting distinct elements.

Consider a stream

$$S=\{10,2,2,5,1,2,10,5,5,5,3,1,\ldots\}.$$

Assume that the stream contains numbers from $\{1,\ldots,n\}$.

For any $1\le i\le n$, define

$$F(i)=\text{frequency of } i \text{ in } S,$$

that is, the number of times item $i$ appears in the stream.

For example,

$$F(2)=3,\qquad F(5)=4.$$

The goal is to keep track of the top-$K$ most frequent items, called the heavy hitters.

If no error is allowed, then even for $K=1$ we may need to store the full stream. For this reason, the lecture allows approximation.

It is enough to build a data structure that answers queries of the form

$$\text{"What is }F(i)\text{ for a given }i\text{?"}$$

Then we maintain a min-heap of size $K$.

When a new item $i'$ appears:

  • estimate $F(i')$ using the data structure,
  • compare this estimate with the minimum-frequency item in the heap,
  • if the new estimate is larger, remove the minimum item and insert the new one,
  • if the item is already in the heap, update its value in the heap.

Thus, on top of the frequency-estimation structure, the heap uses space proportional to $K$, rather than storing frequencies for all items.
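The update rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `track_top_k`, `update`, and `estimate` are hypothetical, a plain dictionary stands in for the min-heap (so finding the minimum costs $O(K)$ instead of $O(1)$), and exact counts stand in for the approximate frequency structure.

```python
from collections import Counter

def track_top_k(stream, k, update, estimate):
    """Keep the k items with the largest estimated frequency seen so far."""
    top = {}  # item -> latest estimate (a dict standing in for the min-heap)
    for item in stream:
        update(item)              # feed the item to the frequency structure
        est = estimate(item)      # query its (possibly approximate) frequency
        if item in top:
            top[item] = est       # already tracked: refresh its value
        elif len(top) < k:
            top[item] = est       # still room: insert directly
        else:
            weakest = min(top, key=top.get)  # plays the role of the heap minimum
            if est > top[weakest]:
                del top[weakest]
                top[item] = est
    return top

# Toy run: exact counts are used in place of an approximate structure.
counts = Counter()

def record(item):
    counts[item] += 1

stream = [10, 2, 2, 5, 1, 2, 10, 5, 5, 5, 3, 1]
top2 = track_top_k(stream, 2, update=record, estimate=lambda i: counts[i])
print(top2)  # {2: 3, 5: 4}
```

Any structure exposing the same two operations could be passed in place of the exact counter; only the quality of the estimates changes.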

Let

$$S=\{x_1,x_2,\ldots,x_m\},\qquad x_i\in[n]=\{1,\ldots,n\}.$$

The goal is to store the stream so as to answer frequency queries:

$$\text{"What is }F(i)\text{?"}$$

The Count-Min Sketch has an $(\varepsilon,\delta)$ guarantee. Take

$$c=\frac{e}{\varepsilon},\qquad r=\ln\!\left(\frac{1}{\delta}\right),$$

rounded up to integers.

For each $1\le j\le r$, choose a hash function

$$h_j:\{1,\ldots,n\}\to\{1,\ldots,c\}.$$

The structure is an $r\times c$ matrix of counters, all initially $0$.

When an item $x$ arrives:

  • compute $h_1(x),h_2(x),\ldots,h_r(x)$,
  • increment the counters in those $r$ cells by $1$.

To estimate the frequency of an item $y$:

  • compute $h_1(y),h_2(y),\ldots,h_r(y)$,
  • look at the corresponding $r$ counters,
  • output the minimum of those values.

Call the returned value $\widehat{F}(y)$.
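A minimal sketch of the update and query procedures above, assuming integer items. The class name `CountMinSketch`, the seed parameter, and the hash family $((a x + b) \bmod p) \bmod c$ with a fixed prime $p$ are illustrative choices, not taken from the lecture; cells are indexed from $0$ rather than $1$.

```python
import math
import random

class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.c = math.ceil(math.e / eps)                  # columns: e / eps
        self.r = max(1, math.ceil(math.log(1 / delta)))   # rows: ln(1 / delta)
        self.p = (1 << 61) - 1                            # a large prime (assumption)
        rng = random.Random(seed)
        # One (a, b) pair per row defines that row's hash function.
        self.coeffs = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(self.r)]
        self.table = [[0] * self.c for _ in range(self.r)]

    def _cell(self, j, x):
        a, b = self.coeffs[j]
        return ((a * x + b) % self.p) % self.c

    def update(self, x):
        # Increment one counter in each of the r rows.
        for j in range(self.r):
            self.table[j][self._cell(j, x)] += 1

    def estimate(self, y):
        # The minimum over the r counters that y hashes to.
        return min(self.table[j][self._cell(j, y)] for j in range(self.r))

cms = CountMinSketch(eps=0.1, delta=0.01)
for x in [10, 2, 2, 5, 1, 2, 10, 5, 5, 5, 3, 1]:
    cms.update(x)
print(cms.r, cms.c)      # 5 28
print(cms.estimate(5))   # at least the true count 4
```

With $\varepsilon=0.1$ and $\delta=0.01$ this allocates $5\times 28$ counters regardless of the stream length.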

Count-Min Sketch never underestimates:

$$\widehat{F}(y)\ge F(y).$$

Hence it only overestimates frequencies.

Let $m$ be the length of the stream. Then

$$\Pr\!\bigl(\widehat{F}(y)-F(y)\ge \varepsilon m\bigr)\le \delta.$$

Equivalently,

$$\Pr\!\bigl(\widehat{F}(y)\le F(y)+\varepsilon m\bigr)\ge 1-\delta.$$

This is an additive error guarantee with respect to the stream length $m$, not a relative error guarantee.

Fix one row, and let $\mathrm{CTR}$ be the corresponding counter for item $y$ in that row. Since every true occurrence of $y$ increments that counter,

$$\mathrm{CTR}\ge F(y).$$

From the class proof (each of the at most $m$ other stream elements collides with $y$ in this row with probability at most $1/c$),

$$\mathbb{E}[\mathrm{CTR}] \le F(y)+\frac{m}{c}.$$

Define

$$\mathrm{Error}=\mathrm{CTR}-F(y).$$

Then

$$\mathbb{E}[\mathrm{Error}] \le \frac{m}{c}=\frac{\varepsilon m}{e}.$$

By Markov’s inequality, which applies because $\mathrm{Error}\ge 0$, for one row,

$$\Pr(\mathrm{Error}\ge \varepsilon m)\le \frac{1}{e}.$$

Now suppose we have $r$ rows. For the Count-Min Sketch estimate to exceed $F(y)+\varepsilon m$, every row must have error at least $\varepsilon m$. Since the $r$ hash functions are chosen independently, these events are independent, and therefore

$$\begin{aligned} \Pr\!\bigl(\widehat{F}(y)\ge F(y)+\varepsilon m\bigr) &\le \left(\frac{1}{e}\right)^r \\ &= \left(\frac{1}{e}\right)^{\ln(1/\delta)} \\ &= \delta. \end{aligned}$$

Hence,

$$\Pr\!\bigl(\widehat{F}(y)\le F(y)+\varepsilon m\bigr)\ge 1-\delta.$$

Questions About Bloom Filters and Related Variants

After reviewing heavy hitters and Count-Min Sketch, the lecture discussed several questions about Bloom filters and related structures.

In Bloom filters and Count-Min Sketch, the query algorithm reads multiple cells that are usually scattered in memory. A natural question is whether one can design a structure with similar guarantees but with a query algorithm that avoids such scattered access. A related structure mentioned in class was the Quotient Filter (Bender et al.).

A Bloom filter can return a false positive. Suppose a query is not in the true data set, but the Bloom filter says “yes.” If the real data set is then checked and this is confirmed to be a false positive, the question is whether the filter can be updated so that the same incorrect answer does not occur again. This idea was discussed in class in connection with a Broom Filter.

(c) What if the Set is Actually a Multiset?

If the stored object is a multiset, then items may repeat. Instead of asking only whether $q\in S$, one may ask:

$$\text{What is }f(q)\text{ in }S?$$

That is, how many times does $q$ appear? This leads to counting filters / counting Bloom filters, which are closely related to Count-Min Sketch.

Instead of asking whether one element belongs to the set, one may ask:

$$\text{Is there an element from }[a,b]\text{ in }S?$$

Another possible query is:

$$\text{Is there an element in }S\text{ that is close to }q\text{ within distance }d?$$

This leads to distance-sensitive Bloom filters.

The next main topic in class was frequency moment estimation.

Consider the stream

$$S=\{1,5,6,5,1,1,2,3,2,3,4\}.$$

Its item frequencies are:

$$f(1)=3,\quad f(2)=2,\quad f(3)=2,\quad f(4)=1,\quad f(5)=2,\quad f(6)=1.$$

Therefore,

$$\sum_{i=1}^{6} f(i)^2 = 23, \qquad \sum_{i=1}^{6} f(i)=11.$$

If $X$ is a random variable, then:

  • $\mathbb{E}[X^2]$ is the second moment,
  • $\mathbb{E}[X^3]$ is the third moment,
  • in general, $\mathbb{E}[X^k]$ is the $k$-th moment.

For stream frequencies, the $k$-th frequency moment is the unnormalized sum $\sum_{i=1}^{n} f(i)^k$, so:

  • the $0$-th moment is the number of distinct elements (with the convention $0^0=0$),
  • the $1$-st moment is the sum of frequencies, i.e. the stream length,
  • the $2$-nd moment is the sum of squared frequencies.
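As a quick check of these three cases, the exact moments of the example stream can be computed by brute force. This is a sketch that stores all frequencies, which a streaming algorithm cannot afford; the function name `frequency_moment` is illustrative.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment: sum of f(i)**k over items that occur."""
    freqs = Counter(stream)
    return sum(f ** k for f in freqs.values())

stream = [1, 5, 6, 5, 1, 1, 2, 3, 2, 3, 4]
print(frequency_moment(stream, 0))  # 6  (distinct elements)
print(frequency_moment(stream, 1))  # 11 (stream length)
print(frequency_moment(stream, 2))  # 23 (sum of squared frequencies)
```

Only items that actually occur are summed, which matches the convention that absent items contribute $0$ to every moment.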

Given a stream

$$S=\{x_1,\ldots,x_m\},\qquad x_j\in\{1,\ldots,n\},$$

define

$$f(i)=\left|\{j:x_j=i\}\right|.$$

The goal is to output

$$\sum_{i=1}^{n} f(i)^2.$$

This is the second frequency moment.

Why care about this quantity? One application mentioned in class is the Gini index, a measure of how unequal a distribution (for example, of income) is. Imagine the stream items are salary brackets, and each bracket’s frequency is the number of people earning in that range. For a fixed stream length $m$, the sum of squared frequencies is large when the population is concentrated in a few brackets with very high frequencies, and small when it is spread evenly across brackets. The second frequency moment therefore acts as a concentration measure for the distribution.

More generally, if $g(x)$ is any function with $g(0)=0$, then the goal can be generalized to

$$\sum_{i=1}^{n} g(f(i)).$$

Examples:

$$g(x)=x^2 \quad\Longrightarrow\quad \sum_{i=1}^{n} f(i)^2,$$

$$g(x)=x^k \quad\Longrightarrow\quad \sum_{i=1}^{n} f(i)^k.$$

The algorithm introduced in class was AMS sampling (Alon-Matias-Szegedy).

  • Sample an element from the stream uniformly at random. Suppose the sampled element is $x_i$.
  • Count how many times the value $x_i$ appears at or after position $i$ in the stream. Call this value $r$.
  • Output $m\bigl(r^2-(r-1)^2\bigr)$ for the second moment.

For the $k$-th moment, the output becomes

$$m\bigl(r^k-(r-1)^k\bigr).$$

For the general function $g$, the output becomes

$$m\bigl(g(r)-g(r-1)\bigr).$$
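The three steps can be sketched as follows. This version assumes the whole stream is available as a Python list, so it illustrates the estimator rather than a one-pass implementation; the names `ams_output` and `ams_sample` are hypothetical, and positions are 0-based.

```python
import random

def ams_output(stream, pos, g=lambda x: x * x):
    """AMS output when the sampled position is `pos` (0-based)."""
    m = len(stream)
    r = stream[pos:].count(stream[pos])   # copies at or after the sample
    return m * (g(r) - g(r - 1))

def ams_sample(stream, g=lambda x: x * x, rng=random):
    """One AMS run: pick a position uniformly at random, then score it."""
    pos = rng.randrange(len(stream))
    return ams_output(stream, pos, g)

stream = [1, 5, 6, 5, 1, 1, 2, 3, 2, 3, 4]
print(ams_output(stream, 0))                  # first 1: r = 3, output 55
print(ams_output(stream, 1))                  # first 5: r = 2, output 33
print(ams_sample(stream, rng=random.Random(0)))
```

Passing a different `g`, for example `lambda x: x ** 3`, gives the estimator for the corresponding generalized sum.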

Take

$$S=\{1,5,6,5,1,1,2,3,2,3,4\}, \qquad m=11.$$

Suppose the sampled value is $x_i=5$ and from that sampled position onward it appears $r=2$ times. Then the AMS output is

$$11\bigl(2^2-(2-1)^2\bigr)=11(4-1)=33.$$

The true value of the second moment in this stream is $23$, so one run gives a random estimate rather than the exact answer.

To develop some intuition for why this works on average, consider tracing the algorithm across all possible sampled elements. Item $1$ has $f(1)=3$ occurrences, so it could be sampled at any of its three positions in the stream:

  • Sampling the first occurrence of $1$: $r=3$, output $= 11(3^2-2^2) = 11 \cdot 5 = 55$.
  • Sampling the second occurrence of $1$: $r=2$, output $= 11(2^2-1^2) = 11 \cdot 3 = 33$.
  • Sampling the third occurrence of $1$: $r=1$, output $= 11(1^2-0^2) = 11 \cdot 1 = 11$.

Sampling item $6$ (which appears exactly once) gives $r=1$, output $= 11(1^2-0^2) = 11$.

Each of the $m=11$ stream positions is equally likely to be sampled. If you computed the AMS output for every position and averaged all $11$ results, you would get exactly $23$, the true second moment. The claim proved below shows that this holds in general.
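The averaging argument can be checked by enumerating every position of the example stream. This is a small verification sketch, with `ams_output` as an illustrative helper name and 0-based positions.

```python
def ams_output(stream, pos):
    """Second-moment AMS output for the sampled position `pos` (0-based)."""
    m = len(stream)
    r = stream[pos:].count(stream[pos])          # copies at or after the sample
    return m * (r * r - (r - 1) * (r - 1))

stream = [1, 5, 6, 5, 1, 1, 2, 3, 2, 3, 4]
outputs = [ams_output(stream, pos) for pos in range(len(stream))]
print(outputs)                       # [55, 33, 11, 11, 33, 11, 33, 33, 11, 11, 11]
print(sum(outputs) / len(outputs))   # 23.0
```

The eleven outputs sum to $253 = 11\cdot 23$, so their average is exactly the second moment.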

Define

$$X = m\bigl(r^2-(r-1)^2\bigr).$$

The claim proved in class is

$$\mathbb{E}(X)=\sum_{i=1}^{n} f(i)^2.$$

Let

$$A_i= \begin{cases} 1, & \text{if the sampled element has value } i,\\ 0, & \text{otherwise.} \end{cases}$$

Then

$$\mathbb{E}(X)=\sum_{i=1}^{n}\mathbb{E}(X\mid A_i=1)\Pr(A_i=1).$$

Since the stream is sampled uniformly over positions,

$$\Pr(A_i=1)=\frac{f(i)}{m}.$$

Now, conditioned on sampling value $i$, each of its $f(i)$ occurrences is equally likely to be chosen. If the sampled occurrence is the $j$-th occurrence of item $i$, then the number of copies of $i$ from that point onward is

$$f(i)-j+1.$$

So

$$\mathbb{E}(X\mid A_i=1) = \sum_{j=1}^{f(i)} m\left((f(i)-j+1)^2-(f(i)-j)^2\right)\cdot \frac{1}{f(i)}.$$

Substituting into the expectation gives

$$\mathbb{E}(X) = \sum_{i=1}^{n} \left( \sum_{j=1}^{f(i)} m\left((f(i)-j+1)^2-(f(i)-j)^2\right)\cdot \frac{1}{f(i)} \right) \frac{f(i)}{m}.$$

The factors $\frac{1}{f(i)}$ and $f(i)$ cancel, and $m$ also cancels. Let

$$k=f(i)-j+1.$$

Then

$$\mathbb{E}(X) = \sum_{i=1}^{n}\sum_{k=1}^{f(i)} \bigl(k^2-(k-1)^2\bigr).$$

Now the inner sum telescopes. The negative part of each term cancels the positive part of the term before it: the $-1^2$ in the second term cancels the $+1^2$ in the first, the $-2^2$ in the third cancels the $+2^2$ in the second, and so on. What remains is the positive part of the last term, together with the $-0^2=0$ from the first term:

$$(1^2-0^2)+(2^2-1^2)+(3^2-2^2)+\cdots+\bigl(f(i)^2-(f(i)-1)^2\bigr)=f(i)^2.$$

Therefore,

$$\mathbb{E}(X)=\sum_{i=1}^{n} f(i)^2.$$

The same proof works for

$$X=m\bigl(g(r)-g(r-1)\bigr),$$

where the inner sum now telescopes to $g(f(i))-g(0)=g(f(i))$ because $g(0)=0$, and gives

$$\mathbb{E}(X)=\sum_{i=1}^{n} g(f(i)).$$
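The general statement can be sanity-checked on the example stream by averaging the output over all $m$ positions, which is exactly the expectation under uniform sampling. The helper names `expected_ams` and `exact_sum`, and the two test functions `cube` and `present`, are illustrative; both functions satisfy $g(0)=0$.

```python
from collections import Counter

def expected_ams(stream, g):
    """Average of m*(g(r)-g(r-1)) over all m equally likely positions."""
    m = len(stream)
    total = 0
    for pos in range(m):
        r = stream[pos:].count(stream[pos])   # copies at or after the sample
        total += m * (g(r) - g(r - 1))
    return total / m

def exact_sum(stream, g):
    """Sum of g(f(i)) over the items that occur in the stream."""
    return sum(g(f) for f in Counter(stream).values())

stream = [1, 5, 6, 5, 1, 1, 2, 3, 2, 3, 4]
cube = lambda x: x ** 3                  # third frequency moment
present = lambda x: 1 if x > 0 else 0    # counts distinct elements

print(expected_ams(stream, cube), exact_sum(stream, cube))        # 53.0 53
print(expected_ams(stream, present), exact_sum(stream, present))  # 6.0 6
```

With `present`, the difference $g(r)-g(r-1)$ is $1$ only when the sampled position is the last occurrence of its item, so the average counts distinct elements.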

From Expectation to an $(\varepsilon,\delta)$ Guarantee

One run of AMS sampling gives a single random estimate that is correct in expectation but could be far off on any individual run. To get an $(\varepsilon,\delta)$ guarantee, meaning the estimate is within a $(1\pm\varepsilon)$ factor of the truth with probability at least $1-\delta$, we use the standard boosting strategy: run the algorithm $t$ independent times

$$X_1,X_2,\ldots,X_t,$$

and report the average

$$Y=\frac{X_1+\cdots+X_t}{t}.$$

The question is how large $t$ needs to be. This is where the Chernoff bound comes in: it lets us solve for a value of $t$ that suffices for the desired $(\varepsilon,\delta)$ guarantee. This is the same approach used in Exercise 4.9 of the textbook.
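The repetition step can be sketched as below, again assuming the stream is held in a list so that each run can sample a position directly. The names `ams_sample` and `ams_average` and the fixed seed are illustrative assumptions.

```python
import random

def ams_sample(stream, rng):
    """One AMS run for the second moment."""
    m = len(stream)
    pos = rng.randrange(m)                       # uniform position
    r = stream[pos:].count(stream[pos])          # copies at or after the sample
    return m * (r * r - (r - 1) * (r - 1))

def ams_average(stream, t, seed=0):
    """Average of t independent AMS runs."""
    rng = random.Random(seed)
    runs = [ams_sample(stream, rng) for _ in range(t)]
    return sum(runs) / t

stream = [1, 5, 6, 5, 1, 1, 2, 3, 2, 3, 4]
for t in (1, 10, 100, 2000):
    print(t, ams_average(stream, t))   # tends toward the true value 23 as t grows
```

On this stream each run returns $11$, $33$, or $55$, so a single run can never equal $23$; only the average can get close.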

Chernoff Bound for Non-Bernoulli Variables

A version of Chernoff’s bound for non-Bernoulli variables was written in class.

Let $\{X_i\}_{i=1}^{t}$ be i.i.d. random variables taking values in $[0,C]$, and let

$$Y=\frac{\sum_{i=1}^{t} X_i}{t}.$$

Then, for $0 < \gamma \leq 1$:

$$\Pr\!\left(|Y-\mathbb{E}(Y)|\ge \gamma\,\mathbb{E}(Y)\right) \le 2\exp\!\left(-\frac{\gamma^2\,\mathbb{E}(Y)\,t}{3C}\right).$$

For AMS sampling,

$$X=m\bigl(r^2-(r-1)^2\bigr).$$

Expanding $(r-1)^2 = r^2 - 2r + 1$, the $r^2$ terms cancel and this simplifies to

$$X = m(2r-1).$$

Let

$$f^*=\max_{1\le i\le n} f(i).$$

Since $r$ counts how many times the sampled element appears from its position onward, $r$ can never exceed the frequency of the most frequent element in the stream. Even in the extreme case, sampling the most frequent item at its very first occurrence, there are exactly $f^*$ copies from that position onward. So

$$r\le f^*.$$

Hence, since $2r-1\le 2r$,

$$X\le m(2f^*).$$

Also $X\ge m>0$ because $r\ge 1$, so $X$ takes values in $[0,C]$ if, for the non-Bernoulli Chernoff bound, one takes

$$C=m(2f^*).$$

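With $C$ in hand, the Chernoff bound gives a concrete tail bound on $|Y - \mathbb{E}(Y)|$. Taking $\gamma=\varepsilon$, requiring that tail bound to be at most $\delta$, and solving for $t$ gives a number of AMS repetitions sufficient for the desired $(\varepsilon,\delta)$ guarantee:

$$t\ \ge\ \frac{3C\ln(2/\delta)}{\varepsilon^2\,\mathbb{E}(Y)}=\frac{6\,m\,f^*\ln(2/\delta)}{\varepsilon^2\sum_{i=1}^{n} f(i)^2}.$$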
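Solving the tail bound for $t$ can be made concrete with a short calculation. This is a sketch under a loud assumption: $f^*$ and the second moment are passed in as known numbers purely for illustration, whereas in the streaming setting the second moment is the unknown being estimated and the handling of $f^*$ was postponed to the next class. The function names are hypothetical.

```python
import math

def chernoff_tail(gamma, mean, t, C):
    """Right-hand side of the bound: 2 * exp(-gamma^2 * E(Y) * t / (3C))."""
    return 2 * math.exp(-(gamma ** 2) * mean * t / (3 * C))

def repetitions_needed(eps, delta, m, f_star, second_moment):
    """Smallest integer t for which the tail bound is at most delta."""
    C = 2 * m * f_star   # upper bound on a single AMS output
    return math.ceil(3 * C * math.log(2 / delta) / (eps ** 2 * second_moment))

# Example stream from the notes: m = 11, f* = 3, second moment 23.
t = repetitions_needed(eps=0.5, delta=0.1, m=11, f_star=3, second_moment=23)
print(t)
print(chernoff_tail(0.5, 23, t, C=66))       # at most 0.1
print(chernoff_tail(0.5, 23, t - 1, C=66))   # slightly above 0.1
```

Because $\sum_i f(i)^2 \ge \sum_i f(i) = m$, the requirement is always met by $t \ge 6 f^* \ln(2/\delta)/\varepsilon^2$, which no longer involves the unknown second moment but still depends on $f^*$.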