Lecture 18 (04/13/2026) - Frequency Moments; AMS Sampling Algorithm | CSCI 328

Scribes: Simrandeep Singh and Amrina Qayyum

Plan of the Remaining Course

In the streaming algorithms part of the course, we have already seen:

sampling,
counting,
approximate median,
heavy hitters using Count-Min Sketch.

The remaining topics in the streaming section are:

frequency estimation,
counting distinct elements.

After that, the course will move to other topics.

A course summary shown in class listed the following topics:

Dictionary problem:
- HWC
- FKS
- linear probing
- cuckoo hashing
Approximate membership problem:
- Bloom filter
Streaming algorithms:
- sampling
- counting: Morris, Morris+, Morris++
- approximate median
- heavy hitters: CMS with $(\varepsilon,\delta)$ guarantee
- frequency estimation
- HyperLogLog (counting distinct elements)
External memory algorithms
Nearest neighbor, dimensionality reduction
Gen AI models, privacy (differential privacy)

The probability tools covered in the course: Bernoulli, Geometric, Coupon Collector, Expectation, Variance, Markov, Chebyshev, Chernoff, Balls & Bins.

Summary of Lecture

In this lecture, we covered the following topics:

Heavy Hitters,
Count-Min Sketch,
Bloom Filter Discussion,
Frequency Moment Estimation,
AMS Sampling,
Chernoff Bound for Non-Bernoulli Variables.

This lecture began with a brief review of heavy hitters and the Count-Min Sketch. The discussion then moved to practical questions and limitations related to Bloom filters, including locality of memory access, adaptive handling of false positives, counting variants, and range- or distance-based queries. The second main part of the lecture introduced frequency moment estimation, with particular emphasis on the second frequency moment. The AMS sampling method was presented for estimating this quantity, followed by a proof that the estimator is correct in expectation. The lecture ended with the idea of boosting accuracy by repetition, a connection to Exercise 4.9, and a Chernoff-style concentration bound for non-Bernoulli random variables. The remaining issue of the unknown maximum frequency was postponed to the next class, which will continue with HyperLogLog for counting distinct elements.

Heavy Hitters

Consider a stream

S=\{10,2,2,5,1,2,10,5,5,5,3,1,\ldots\}.

Assume that the stream contains numbers from $\{1,\ldots,n\}$ .

For any $1\le i\le n$ , define

F(i)=\text{frequency of } i \text{ in } S,

that is, the number of times item $i$ appears in the stream.

For example,

F(2)=3,\qquad F(5)=4.

The goal is to keep track of the top- $K$ most frequent items, called the heavy hitters.

Why Allow Error?

If no error is allowed, then even for $K=1$ we may need to store the full stream. For this reason, the lecture allows approximation.

How to Maintain Top- $K$

It is enough to build a data structure that answers queries of the form

\text{"What is }F(i)\text{ for a given }i\text{?"}

Then we maintain a min-heap of size $K$ .

When a new item $i'$ appears:

estimate $F(i')$ using the data structure,
if the item is already in the heap, update its value in the heap,
if the heap holds fewer than $K$ items, insert the new item,
otherwise, compare this estimate with the minimum-frequency item in the heap; if the new estimate is larger, remove the minimum item and insert the new one.

Thus, apart from the frequency data structure itself, the method uses space proportional to $K$, rather than storing frequencies for all items.
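The maintenance steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the structure from class: the function name `track_top_k` is hypothetical, an exact `Counter` stands in for the approximate frequency data structure, and a plain dict with a linear scan for the minimum stands in for the min-heap to keep the sketch short.

```python
from collections import Counter

def track_top_k(stream, k):
    """Keep at most k candidate heavy hitters while reading the stream once."""
    counts = Counter()   # stand-in frequency oracle (exact, so NOT small-space)
    candidates = {}      # item -> latest estimate; never more than k entries
    for x in stream:
        counts[x] += 1
        estimate = counts[x]
        # Already tracked, or there is still room: record the estimate.
        if x in candidates or len(candidates) < k:
            candidates[x] = estimate
            continue
        # Otherwise compare against the current minimum-frequency candidate.
        weakest = min(candidates, key=candidates.get)
        if estimate > candidates[weakest]:
            del candidates[weakest]
            candidates[x] = estimate
    return candidates

stream = [10, 2, 2, 5, 1, 2, 10, 5, 5, 5, 3, 1]
print(track_top_k(stream, 2))
```

On the prefix of the example stream shown above, with $K=2$, the candidates that remain are items $2$ and $5$, matching $F(2)=3$ and $F(5)=4$. Swapping the dict for a real heap changes the cost of finding the minimum, not the logic.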

Count-Min Sketch

Let

S=\{x_1,x_2,\ldots,x_m\},\qquad x_i\in[n]=\{1,\ldots,n\}.

The goal is to store the stream so as to answer frequency queries:

\text{"What is }F(i)\text{?"}

Parameters

The Count-Min Sketch has an $(\varepsilon,\delta)$ guarantee. Take

c=\frac{e}{\varepsilon},\qquad r=\ln\!\left(\frac{1}{\delta}\right).

Both values are rounded up to integers. For each $1\le j\le r$, independently choose a random hash function

h_j:\{1,\ldots,n\}\to\{1,\ldots,c\}.

Data Structure

The structure is an $r\times c$ matrix of counters.

Update Rule

When an item $x$ arrives:

compute $h_1(x),h_2(x),\ldots,h_r(x)$ ,
increment the counters in those $r$ cells by $1$ .

Query Rule

To estimate the frequency of an item $y$ :

compute $h_1(y),h_2(y),\ldots,h_r(y)$ ,
look at the corresponding $r$ counters,
output the minimum of those values.

Call the returned value $\widehat{F}(y)$ .
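The update and query rules above can be sketched in Python as follows. This is a minimal sketch, not the implementation from class: the class name `CountMinSketch`, the `seed` parameter, and the hash family $h_j(x)=((a_jx+b_j)\bmod p)\bmod c$ with $p=2^{61}-1$ are assumptions made for illustration, and $c$ and $r$ are rounded up to integers.

```python
import math
import random

class CountMinSketch:
    PRIME = (1 << 61) - 1  # assumed modulus for the illustrative hash family

    def __init__(self, eps, delta, seed=0):
        self.c = math.ceil(math.e / eps)                 # columns: c = e / eps
        self.r = max(1, math.ceil(math.log(1 / delta)))  # rows: r = ln(1 / delta)
        rng = random.Random(seed)
        # One independently chosen (a_j, b_j) pair per row.
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.r)]
        self.table = [[0] * self.c for _ in range(self.r)]

    def _cell(self, j, x):
        a, b = self.hashes[j]
        return ((a * x + b) % self.PRIME) % self.c

    def update(self, x):
        # Increment one counter in every row.
        for j in range(self.r):
            self.table[j][self._cell(j, x)] += 1

    def query(self, y):
        # Report the smallest of the r counters that y maps to.
        return min(self.table[j][self._cell(j, y)] for j in range(self.r))

cms = CountMinSketch(eps=0.1, delta=0.01)
for x in [10, 2, 2, 5, 1, 2, 10, 5, 5, 5, 3, 1]:
    cms.update(x)
print(cms.query(5))  # at least 4, the true frequency of item 5
```

With $\varepsilon=0.1$ and $\delta=0.01$ the table has $28$ columns and $5$ rows, independent of the stream length.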

Basic Property

Count-Min Sketch never underestimates:

\widehat{F}(y)\ge F(y).

Hence it only overestimates frequencies.

Guarantee

Let $m$ be the length of the stream. Then

\Pr\!\bigl(\widehat{F}(y)-F(y)\ge \varepsilon m\bigr)\le \delta.

Equivalently,

\Pr\!\bigl(\widehat{F}(y)\le F(y)+\varepsilon m\bigr)\ge 1-\delta.

Important Note

This is an additive error guarantee with respect to the stream length $m$ , not a relative error guarantee.

Proof Sketch

Fix one row, and let $\mathrm{CTR}$ be the corresponding counter for item $y$ in that row. Since every true occurrence of $y$ increments that counter,

\mathrm{CTR}\ge F(y).

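Each other stream element lands in the same cell as $y$ with probability $1/c$ under a uniformly random hash function, so, as shown in class,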

\mathbb{E}[\mathrm{CTR}] \le F(y)+\frac{m}{c}.

Define

\mathrm{Error}=\mathrm{CTR}-F(y).

Then

\mathbb{E}[\mathrm{Error}] \le \frac{m}{c}=\frac{\varepsilon m}{e}.

By Markov’s inequality, for one row,

\Pr(\mathrm{Error}\ge \varepsilon m)\le \frac{1}{e}.

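Now suppose we have $r$ rows with independently chosen hash functions. The estimate is the minimum over the rows, so for it to exceed $F(y)+\varepsilon m$, every row must have error at least $\varepsilon m$. Since the rows are independent, these probabilities multiply. Therefore,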

\begin{align*} \Pr\!\bigl(\widehat{F}(y)\ge F(y)+\varepsilon m\bigr) &\le \left(\frac{1}{e}\right)^r \\ &= \left(\frac{1}{e}\right)^{\ln(1/\delta)} \\ &= \delta. \end{align*}

Hence,

\Pr\!\bigl(\widehat{F}(y)\le F(y)+\varepsilon m\bigr)\ge 1-\delta.

Questions About Bloom Filters and Related Variants

After reviewing heavy hitters and Count-Min Sketch, the lecture discussed several questions about Bloom filters and related structures.

(a) Can We Avoid Reading Scattered Cells?

In Bloom filters and Count-Min Sketch, the query algorithm reads multiple cells that are usually scattered in memory. A natural question is whether one can design a structure with similar guarantees but with a query algorithm that avoids such scattered access. A related structure mentioned in class was the Quotient Filter (Bender et al.).

(b) Can a Bloom Filter Fix Its Mistakes?

A Bloom filter can return a false positive. Suppose a query is not in the true data set, but the Bloom filter says “yes.” If the real data set is then checked and this is confirmed to be a false positive, the question is whether the filter can be updated so that the same incorrect answer does not occur again. This idea was discussed in class in connection with a Broom Filter.

(c) What if the Set is Actually a Multiset?

If the stored object is a multiset, then items may repeat. Instead of asking only whether $q\in S$ , one may ask:

\text{What is }f(q)\text{ in }S?

That is, how many times does $q$ appear? This leads to counting filters / counting Bloom filters, which are closely related to Count-Min Sketch.

(d) Range Query

Instead of asking whether one element belongs to the set, one may ask:

\text{Is there an element from }[a,b]\text{ in }S?

(e) Distance Query

Another possible query is:

\text{Is there an element in }S\text{ that is close to }q\text{ within distance }d?

This leads to distance-sensitive Bloom filters.

Frequency Moment Estimation

The next main topic in class was frequency moment estimation.

Consider the stream

S=\{1,5,6,5,1,1,2,3,2,3,4\}.

Its item frequencies are:

f(1)=3,\quad f(2)=2,\quad f(3)=2,\quad f(4)=1,\quad f(5)=2,\quad f(6)=1.

Therefore,

\sum_{i=1}^{6} f(i)^2 = 23, \qquad \sum_{i=1}^{6} f(i)=11.

Moments

If $X$ is a random variable, then:

$\mathbb{E}[X^2]$ is the second moment,
$\mathbb{E}[X^3]$ is the third moment,
in general, $\mathbb{E}[X^k]$ is the $k$-th moment.

For stream frequencies:

the $0$ -th moment is the number of distinct elements,
the $1$ -st moment is the sum of frequencies, i.e. the stream length,
the $2$ -nd moment is the sum of squared frequencies.
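As a reference point, the three moments of the example stream can be computed exactly. The helper name `frequency_moment` is hypothetical, and this exact computation stores one counter per distinct item, which is precisely the cost a streaming algorithm tries to avoid.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment: the sum of f(i)**k over items that appear."""
    counts = Counter(stream)
    return sum(f ** k for f in counts.values())

stream = [1, 5, 6, 5, 1, 1, 2, 3, 2, 3, 4]
moments = [frequency_moment(stream, k) for k in range(3)]
print(moments)  # [6, 11, 23]: distinct elements, stream length, sum of squares
```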

Problem Statement

Given a stream

S=\{x_1,\ldots,x_m\},\qquad x_j\in\{1,\ldots,n\},

define

f(i)=\left|\{j:x_j=i\}\right|.

The goal is to output

\sum_{i=1}^{n} f(i)^2.

This is the second frequency moment.

Why care about this quantity? The second moment measures how concentrated the stream is. The lecture motivated it with the Gini index from economics, which measures income inequality. Imagine the stream items are salary brackets, and each bracket's frequency is the number of people earning in that range. When income is concentrated, so that a few brackets have very high frequencies, the sum of squared frequencies is large. When income is spread more evenly across brackets, the sum is smaller. Concentration measures of this kind, such as Gini's index of homogeneity, are built on the sum of squared frequencies.

More generally, if $g(x)$ is any function with $g(0)=0$ , then the goal can be generalized to

\sum_{i=1}^{n} g(f(i)).

Examples:

g(x)=x^2 \quad\Longrightarrow\quad \sum_{i=1}^{n} f(i)^2,

g(x)=x^k \quad\Longrightarrow\quad \sum_{i=1}^{n} f(i)^k.

AMS Sampling

The algorithm introduced in class was AMS sampling (Alon-Matias-Szegedy).

Algorithm

Sample a position $i\in\{1,\ldots,m\}$ uniformly at random, so the sampled element is $x_i$.
Count how many times the value $x_i$ appears at or after position $i$ in the stream. Call this count $r$.
Output $m\bigl(r^2-(r-1)^2\bigr)$ as the estimate of the second moment.

For the $k$ -th moment, the output becomes

m\bigl(r^k-(r-1)^k\bigr).

For the general function $g$ , the output becomes

m\bigl(g(r)-g(r-1)\bigr).
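The algorithm can be run in a single pass. The sketch below is one possible way to do so, assuming the uniform position is drawn by size-one reservoir sampling (the $t$-th element replaces the current sample with probability $1/t$); the function name `ams_estimate` and the `rng` parameter are hypothetical.

```python
import random

def ams_estimate(stream, g=lambda x: x * x, rng=random):
    """One AMS sample: returns m * (g(r) - g(r - 1))."""
    sample = None  # value at the currently sampled position
    r = 0          # occurrences of `sample` at or after the sampled position
    m = 0          # number of stream elements seen so far
    for x in stream:
        m += 1
        if rng.random() < 1.0 / m:
            # Reservoir step: move the sample to this position, restart the count.
            sample, r = x, 1
        elif x == sample:
            r += 1
    return m * (g(r) - g(r - 1))

stream = [1, 5, 6, 5, 1, 1, 2, 3, 2, 3, 4]
print(ams_estimate(stream))  # one of 11, 33, 55 for this stream
```

Only the sampled value, the counter $r$, and the stream length $m$ are stored, so the space does not grow with the number of distinct items.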

Example

Take

S=\{1,5,6,5,1,1,2,3,2,3,4\}, \qquad m=11.

Suppose the sampled position is the first occurrence of $5$ (position $2$), so $x_i=5$ and from that position onward it appears $r=2$ times. Then the AMS output is

11\bigl(2^2-(2-1)^2\bigr)=11(4-1)=33.

The true value of the second moment in this stream is $23$ , so one run gives a random estimate rather than the exact answer.

To develop some intuition for why this works on average, consider tracing the algorithm across all possible sampled elements. Item $1$ has $f(1)=3$ occurrences, so it could be sampled at any of its three positions in the stream:

Sampling the first occurrence of $1$ : $r=3$ , output $= 11(3^2-2^2) = 11 \cdot 5 = 55$ .
Sampling the second occurrence of $1$ : $r=2$ , output $= 11(2^2-1^2) = 11 \cdot 3 = 33$ .
Sampling the third occurrence of $1$ : $r=1$ , output $= 11(1^2-0^2) = 11 \cdot 1 = 11$ .

Sampling item $6$ (which appears exactly once) gives $r=1$ , output $= 11(1^2-0^2) = 11$ .

Each of the $m=11$ stream positions is equally likely to be sampled. Computing the AMS output for every position gives six outputs of $11$, four of $33$, and one of $55$, which sum to $253=11\cdot 23$; their average is exactly $23$, the true second moment. The theorem below proves this holds in general.
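The averaging claim can be checked by brute force. The helper `ams_output_at` is a hypothetical name; it computes the output the algorithm would produce if a given position (0-indexed here) were the sampled one.

```python
def ams_output_at(stream, i, g=lambda x: x * x):
    """AMS output when position i (0-indexed) is the sampled position."""
    m = len(stream)
    # Occurrences of the sampled value at or after position i.
    r = sum(1 for x in stream[i:] if x == stream[i])
    return m * (g(r) - g(r - 1))

stream = [1, 5, 6, 5, 1, 1, 2, 3, 2, 3, 4]
outputs = [ams_output_at(stream, i) for i in range(len(stream))]
average = sum(outputs) / len(outputs)
print(outputs)  # [55, 33, 11, 11, 33, 11, 33, 33, 11, 11, 11]
print(average)  # 23.0
```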

Expected Value of AMS Sampling

Define

X = m\bigl(r^2-(r-1)^2\bigr).

The claim proved in class is

\mathbb{E}(X)=\sum_{i=1}^{n} f(i)^2.

Proof

Let

A_i= \begin{cases} 1, & \text{if the sampled element has value } i,\\ 0, & \text{otherwise.} \end{cases}

Then

\mathbb{E}(X)=\sum_{i=1}^{n}\mathbb{E}(X\mid A_i=1)\Pr(A_i=1).

Since the stream is sampled uniformly over positions,

\Pr(A_i=1)=\frac{f(i)}{m}.

Now, conditioned on sampling value $i$ , each of its $f(i)$ occurrences is equally likely to be chosen. If the sampled occurrence is the $j$ -th occurrence of item $i$ , then the number of copies of $i$ from that point onward is

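f(i)-j+1.

So $r=f(i)-j+1$ for that occurrence, and averaging the output over the $f(i)$ equally likely occurrences gives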

\mathbb{E}(X\mid A_i=1) = \sum_{j=1}^{f(i)} m\left((f(i)-j+1)^2-(f(i)-j)^2\right)\cdot \frac{1}{f(i)}.

Substituting into the expectation gives

\mathbb{E}(X) = \sum_{i=1}^{n} \left( \sum_{j=1}^{f(i)} m\left((f(i)-j+1)^2-(f(i)-j)^2\right)\cdot \frac{1}{f(i)} \right) \frac{f(i)}{m}.

The factors $\frac{1}{f(i)}$ and $f(i)$ cancel, and $m$ also cancels. Let

k=f(i)-j+1.

Then

\mathbb{E}(X) = \sum_{i=1}^{n}\sum_{k=1}^{f(i)} \bigl(k^2-(k-1)^2\bigr).

Now the inner sum telescopes. Each consecutive pair of terms cancels: the $+1^2$ in the first term is cancelled by the $-1^2$ in the second term; similarly the $+2^2$ in the second is cancelled by the $-2^2$ in the third; and so on. Everything cancels except the $+f(i)^2$ in the last term and the $-0^2=0$ in the first:

(1^2-0^2)+(2^2-1^2)+(3^2-2^2)+\cdots+\bigl(f(i)^2-(f(i)-1)^2\bigr)=f(i)^2.

Therefore,

\mathbb{E}(X)=\sum_{i=1}^{n} f(i)^2.
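As a quick check of the proof on the example stream, take item $1$, which has $f(1)=3$. Its three occurrences give $r=3,2,1$, so

\mathbb{E}(X\mid A_1=1)=\frac{m\,(3^2-2^2)+m\,(2^2-1^2)+m\,(1^2-0^2)}{3}=\frac{m\,(5+3+1)}{3}=3m,

and multiplying by the probability of sampling item $1$ gives its contribution to the expectation:

\mathbb{E}(X\mid A_1=1)\Pr(A_1=1)=3m\cdot\frac{3}{m}=9=f(1)^2.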

General Form

The same proof works for

X=m\bigl(g(r)-g(r-1)\bigr),

and gives

\mathbb{E}(X)=\sum_{i=1}^{n} g(f(i)).
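The general claim can be checked numerically by averaging the output over all positions for different choices of $g$. This is a brute-force check on a small stream, not a streaming algorithm, and the function names are hypothetical.

```python
from collections import Counter

def average_ams_output(stream, g):
    """Average of m * (g(r) - g(r - 1)) over every possible sampled position."""
    m = len(stream)
    total = 0
    for i, x in enumerate(stream):
        r = stream[i:].count(x)  # occurrences of x at or after position i
        total += m * (g(r) - g(r - 1))
    return total / m

def exact_sum(stream, g):
    """The target quantity: the sum of g(f(i)) over all items that appear."""
    return sum(g(f) for f in Counter(stream).values())

stream = [1, 5, 6, 5, 1, 1, 2, 3, 2, 3, 4]
cube = lambda x: x ** 3
print(average_ams_output(stream, cube), exact_sum(stream, cube))  # 53.0 53
```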

From Expectation to an $(\varepsilon,\delta)$ Guarantee

One run of AMS sampling gives a single random estimate that is correct in expectation but could be far off on any individual run. To get an $(\varepsilon,\delta)$ guarantee — meaning the estimate is within a $(1\pm\varepsilon)$ factor of the truth with probability at least $1-\delta$ — we use the standard boosting strategy: run the algorithm $t$ independent times

X_1,X_2,\ldots,X_t,

and report the average

Y=\frac{X_1+\cdots+X_t}{t}.

The question is: how large does $t$ need to be? This is where the Chernoff bound comes in: it lets us solve for a value of $t$ that suffices for the desired $(\varepsilon,\delta)$ guarantee. This is the same approach used in Exercise 4.9 of the textbook.
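The repetition step can be sketched as follows. It assumes each run draws its position by size-one reservoir sampling and that all runs share one seeded generator; the names `ams_estimate` and `boosted_estimate` are hypothetical. For simplicity the sketch rereads the stream once per run, whereas a streaming version would keep $t$ samples in parallel during a single pass.

```python
import random

def ams_estimate(stream, rng):
    """One AMS sample for the second moment: m * (r^2 - (r - 1)^2)."""
    sample, r, m = None, 0, 0
    for x in stream:
        m += 1
        if rng.random() < 1.0 / m:  # reservoir step keeps a uniform position
            sample, r = x, 1
        elif x == sample:
            r += 1
    return m * (r * r - (r - 1) * (r - 1))

def boosted_estimate(stream, t, seed=0):
    """Average of t independent AMS samples."""
    rng = random.Random(seed)
    return sum(ams_estimate(stream, rng) for _ in range(t)) / t

stream = [1, 5, 6, 5, 1, 1, 2, 3, 2, 3, 4]
print(boosted_estimate(stream, 1))      # a single noisy sample
print(boosted_estimate(stream, 20000))  # close to the true value 23
```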

Chernoff Bound for Non-Bernoulli Variables

A version of Chernoff’s bound for non-Bernoulli variables was written in class.

Let $\{X_i\}_{i=1}^{t}$ be i.i.d. random variables taking values in $[0,C]$ , and let

Y=\frac{\sum_{i=1}^{t} X_i}{t}.

Then, for $0 < \gamma \leq 1$ :

\Pr\!\left(|Y-\mathbb{E}(Y)|\ge \gamma\,\mathbb{E}(Y)\right) \le 2\exp\!\left(-\frac{\gamma^2\,\mathbb{E}(Y)\,t}{3C}\right).
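One way to see where the factor $C$ comes from, assuming the standard multiplicative Chernoff bound for sums of independent $[0,1]$-valued variables, is to rescale. Let

Z_i=\frac{X_i}{C}\in[0,1],\qquad Z=\sum_{i=1}^{t} Z_i,\qquad \mu=\mathbb{E}(Z)=\frac{t\,\mathbb{E}(Y)}{C}.

The standard bound states that, for $0<\gamma\le 1$,

\Pr\!\left(|Z-\mu|\ge \gamma\mu\right)\le 2\exp\!\left(-\frac{\gamma^2\mu}{3}\right).

Since $Y=CZ/t$, the event $|Y-\mathbb{E}(Y)|\ge\gamma\,\mathbb{E}(Y)$ is the same as $|Z-\mu|\ge\gamma\mu$, and substituting $\mu=t\,\mathbb{E}(Y)/C$ gives the bound stated above.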

Applying This to AMS Sampling

For AMS sampling,

X=m\bigl(r^2-(r-1)^2\bigr).

Expanding $(r-1)^2 = r^2 - 2r + 1$ , the $r^2$ terms cancel and this simplifies to

X = m(2r-1).

Let

f^*=\max_{1\le i\le n} f(i).

Since $r$ counts how many times the sampled element appears from its position onward, $r$ can never exceed the frequency of the most frequent element in the stream. Even in the best case — sampling the most frequent item at its very first occurrence — you get at most $f^*$ copies remaining. So

r\le f^*.

Hence (dropping the $-1$ since it only makes $X$ smaller)

X\le m(2f^*).

So, for the non-Bernoulli Chernoff bound, one may take

C=m(2f^*).

With $C$ in hand, the Chernoff bound gives a concrete tail bound on $|Y - \mathbb{E}(Y)|$. Taking $\gamma=\varepsilon$, setting that tail bound equal to $\delta$, and solving for $t$ gives a number of AMS repetitions that suffices for the desired $(\varepsilon,\delta)$ guarantee.
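Solving $2\exp\!\left(-\varepsilon^2\,\mathbb{E}(Y)\,t/(3C)\right)\le\delta$ for $t$ gives $t\ge 3C\ln(2/\delta)/(\varepsilon^2\,\mathbb{E}(Y))$ with $C=2mf^*$. The sketch below evaluates this expression; the function names are hypothetical, and passing the true second moment as an input is for illustration only, since in a real stream it is the unknown being estimated. Because every item that appears has $f(i)\ge 1$, the second moment is at least $m$, which gives the cruder bound $t\ge 6f^*\ln(2/\delta)/\varepsilon^2$ that depends only on $f^*$.

```python
import math

def repetitions_needed(eps, delta, m, f_star, second_moment):
    """Smallest integer t with 2 * exp(-eps^2 * F2 * t / (3 * C)) <= delta."""
    C = 2 * m * f_star  # upper bound on a single AMS output
    return math.ceil(3 * C * math.log(2 / delta) / (eps ** 2 * second_moment))

def repetitions_upper_bound(eps, delta, f_star):
    """Cruder bound that uses only F2 >= m, so m cancels out."""
    return math.ceil(6 * f_star * math.log(2 / delta) / eps ** 2)

# Example stream from the notes: m = 11, f* = 3, second moment 23.
print(repetitions_needed(0.5, 0.1, 11, 3, 23))  # 104
print(repetitions_upper_bound(0.5, 0.1, 3))     # 216
```

Both expressions still depend on $f^*$, which is not known in advance; the notes defer that issue to the next class.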
