
Lecture 17 (03/23/2026) - Frequency Estimation (Heavy Hitters); Count-Min Sketch

Consider a stream $S = \{10, 2, 2, 5, 1, 2, 10, 5, 5, 5, 3, 1, \ldots\}$, where each element represents an Amazon product ID.

Assume that $S$ contains numbers $1, \ldots, n$. For any $1 \leq i \leq n$:

$$F(i) = \text{frequency of } i \text{ in } S$$

(i.e., the number of times $i$ appears in $S$). For example, $F(2) = 3$, $F(5) = 4$.

We want to keep track of the top-$k$ most frequent items (the heavy hitters).

  • Lower bound: if we don’t allow any errors, then even $k = 1$ requires us to store the full stream. Intuitively, to know the exact most frequent item you may need to distinguish between streams that differ only in their last element, which forces you to remember everything.
  1. If we want to save space, we have to allow errors.
  2. It is enough to build a data structure/sketch that answers queries of the form: what is $F(i)$, for any $1 \leq i \leq n$?

Given such a frequency-query sketch, we can track the top-$k$ items using a min-heap of size $k$ (with each item’s current frequency as its priority). Whenever a new item arrives in the stream, we query its frequency and compare it to the smallest frequency currently in the heap. If it is larger, it displaces the least popular item.

We can define a heap that stores the top-$k$ items in the stream so far.

Initialization: For the first $k$ distinct items seen in the stream, insert them directly into the heap (they are trivially the top-$k$ so far).

For each subsequent item $i'$ arriving in the stream:

  • Query $F(i')$
  • Compare $F(i')$ to $F(\text{min\_heap})$ (the frequency of the current minimum in the heap)
  • If $F(i') > F(\text{min\_heap})$:
    • Delete the minimum from the heap
    • Insert $(i', F(i'))$

The total update time per arriving item is $O(\text{query time} + \log k)$, since the heap stores only $k$ items and heap operations take $O(\log k)$ time.
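The heap procedure above can be sketched in Python. This is a minimal illustration, not the lecture’s canonical code: `query_frequency` is a stand-in for the sketch’s $F(i)$ query (answered exactly in the usage example below; in practice it would be a Count-Min Sketch query), and repeated items simply get their priority refreshed with an $O(k)$ rebuild for simplicity.

```python
import heapq
from collections import Counter

def track_top_k(stream, k, query_frequency):
    """Maintain the top-k items seen so far, given a frequency oracle."""
    heap = []     # min-heap of (estimated frequency, item)
    members = {}  # item -> latest queried frequency
    for item in stream:
        freq = query_frequency(item)
        if item in members:
            # Already tracked: refresh its priority (O(k) rebuild,
            # kept deliberately simple for exposition).
            members[item] = freq
            heap = [(members[x], x) for x in members]
            heapq.heapify(heap)
        elif len(heap) < k:
            # The first k distinct items go straight into the heap.
            members[item] = freq
            heapq.heappush(heap, (freq, item))
        elif freq > heap[0][0]:
            # Displace the least frequent tracked item.
            _, evicted = heapq.heappop(heap)
            del members[evicted]
            members[item] = freq
            heapq.heappush(heap, (freq, item))
    return sorted(members, key=members.get, reverse=True)

# Usage with exact counts as the frequency oracle:
counts = Counter()

def counted(seq):
    for x in seq:
        counts[x] += 1  # the count is updated before the query
        yield x

top2 = track_top_k(counted([10, 2, 2, 5, 1, 2, 10, 5, 5, 5, 3, 1]),
                   k=2, query_frequency=lambda i: counts[i])
print(top2)  # -> [5, 2] for this stream
```

With a sketch in place of the exact oracle, the returned items are only approximately the top-$k$, since queried frequencies may be overestimates.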

Problem Statement: Build a Sketch to Answer F(i) Queries


The setting is that we are given a stream $S = \{x_1, x_2, \ldots, x_m\}$ where $x_i \in [n] = \{1, \ldots, n\}$.

Goal: Store $S$ so as to answer queries: what is $F(i)$?

The Count-Min Sketch (CMS) can also be thought of as a stacked Bloom filter: instead of storing bits, we keep actual counters, and when querying we return the minimum counter value across all rows.

Data Structure (r × c Matrix, One Hash Function per Row)


Consider a table (sort of like a matrix) with rr rows and cc columns, where:

  • Number of columns: $c = e / \varepsilon$
  • Number of rows: $r = \ln(1 / \delta)$
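Concretely, these dimensions can be computed for a given accuracy $\varepsilon$ and failure probability $\delta$. Rounding up to whole cells is an implementation choice; the lecture leaves $c$ and $r$ as real numbers.

```python
import math

def cms_dimensions(eps, delta):
    """Table size for additive error at most eps*m with
    probability at least 1 - delta (per query)."""
    c = math.ceil(math.e / eps)         # columns: c = e / eps
    r = math.ceil(math.log(1 / delta))  # rows:    r = ln(1 / delta)
    return r, c

# e.g. 1% additive error with 99% confidence per query:
print(cms_dimensions(0.01, 0.01))  # -> (5, 272)
```

Note that the space is $r \cdot c = O\!\left(\frac{1}{\varepsilon}\ln\frac{1}{\delta}\right)$ counters, independent of $n$ and of the stream length $m$.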


For every row, we pick a hash function. There are $r$ hash functions $h_1, h_2, \ldots, h_r$, one per row, where $h_j$ is a perfectly random hash function:

$$h_j : \{1, \ldots, n\} \to \{1, \ldots, c\}, \quad \text{for } 1 \leq j \leq r$$

All of these hash functions take any of the $n$ products as input and output one of the $c$ columns. In other words, each hash function places an item into one of the columns of its row, uniformly at random.

Given a stream $S = \{x_1, \ldots, x_i, \ldots\}$, for each arriving item $x$:

  1. Compute each of the hash functions $h_1(x), h_2(x), \ldots, h_r(x)$. In every row, this gives the cell that the item gets hashed to by that row’s hash function.
  2. Increment the counters in those cells by 1, indicating that the item has been hashed into each of those cells.

We do this for all items in the stream.

Query: Given an item $y$, what is its frequency, i.e. $F(y)$?

Query algorithm: Compute $h_1(y), h_2(y), \ldots, h_r(y)$ and output the minimum value among these cells.

The reason we take the minimum is that hash collisions can only inflate a counter above the true frequency (other items that collide into the same cell add to it). So the row whose counter is least inflated gives the best estimate, and taking the minimum across all rows gives us the closest value to the truth. This is also where the name comes from: we are counting frequencies using a min.
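The update and query procedures can be sketched as a small class. This is a minimal illustration under stated assumptions, not the canonical implementation: it uses Python’s built-in `hash` with a per-row random salt as a stand-in for perfectly random hash functions, and rounds $c$ and $r$ up to integers.

```python
import math
import random

class CountMinSketch:
    """Minimal Count-Min Sketch: r rows of c counters, one hash per row."""

    def __init__(self, eps, delta, seed=0):
        self.c = math.ceil(math.e / eps)         # columns: c = e / eps
        self.r = math.ceil(math.log(1 / delta))  # rows:    r = ln(1 / delta)
        rng = random.Random(seed)
        # One 64-bit salt per row; hash((salt, x)) stands in for a
        # perfectly random hash function h_j : [n] -> [c].
        self.salts = [rng.getrandbits(64) for _ in range(self.r)]
        self.table = [[0] * self.c for _ in range(self.r)]

    def _cell(self, j, x):
        return hash((self.salts[j], x)) % self.c

    def update(self, x):
        # Increment the hashed cell in every row.
        for j in range(self.r):
            self.table[j][self._cell(j, x)] += 1

    def query(self, y):
        # The minimum across rows is the least-inflated estimate of F(y).
        return min(self.table[j][self._cell(j, y)] for j in range(self.r))

# Feed the example stream and query a few frequencies.
cms = CountMinSketch(eps=0.01, delta=0.01)
for x in [10, 2, 2, 5, 1, 2, 10, 5, 5, 5, 3, 1]:
    cms.update(x)
print(cms.query(5), cms.query(2))  # each estimate is >= the true frequency
```

Each returned estimate is at least the true frequency ($F(5) = 4$, $F(2) = 3$ here), matching the overestimate-only property discussed next.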

Let $\hat{F}(y)$ denote the value output by Count-Min Sketch. Note:

  1. Can $\hat{F}(y) < F(y)$? No: every time $y$ arrives in the stream, all $r$ of its counters are incremented, so each counter that we query for $y$ is always at least $F(y)$.
  2. So $\hat{F}(y) \geq F(y)$ always; CMS can only overestimate frequencies. The real question is: by how much can it overestimate?

Theorem:

$$\Pr\!\left(\hat{F}(y) - F(y) \geq \varepsilon m\right) \leq \delta \iff \Pr\!\left(\hat{F}(y) \leq F(y) + \varepsilon m\right) \geq 1 - \delta$$

where $m$ is the length of the stream so far.

We’ll focus on only one row here, for which $h_1$ is the hash function.


  • Let $\text{ctr}$ denote the counter value at cell $h_1(y)$ in this row. Then $\text{ctr} \geq F(y)$ (since every occurrence of $y$ increments that cell, any collisions only add more).

Lemma: $E[\text{ctr}] \leq F(y) + \dfrac{m}{c}$

Proof: Consider one fixed row. There are $m$ total items in the stream. Each item gets hashed (uniformly at random) to one of the $c$ cells in this row. For any single cell, the expected number of items (out of the full $m$) that land in it is $m/c$. Since $y$’s own occurrences always land in cell $h_1(y)$ (contributing exactly $F(y)$ to the counter), and every other item contributes independently with probability $1/c$, the expected counter value is at most $F(y) + m/c$. Note that this bound is actually slightly loose, because we also include $y$’s own occurrences in the $m/c$ term, but the claim still holds.
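Written out, this counting argument is a one-line calculation, splitting the counter into $y$’s own contribution and the expected collisions from the other $m - F(y)$ items:

$$E[\text{ctr}] = F(y) + \sum_{x \neq y} F(x)\cdot \Pr[h_1(x) = h_1(y)] = F(y) + \frac{m - F(y)}{c} \leq F(y) + \frac{m}{c}$$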

Corollary: Let $\text{Error} = \text{ctr} - F(y)$; then $E[\text{Error}] \leq \dfrac{m}{c}$.

Proof: Markov on One Row → P[error ≥ εm] ≤ 1/e


Continuing the proof of the guarantee. Since $c = e/\varepsilon$, we can substitute into the bound from the corollary:

$$E[\text{Error}] \leq \frac{m}{c} = \frac{m}{e/\varepsilon} = m \cdot \frac{\varepsilon}{e} = \frac{\varepsilon m}{e}$$

So the error in the first row satisfies $E[\text{Error}_1] \leq \dfrac{\varepsilon m}{e}$. By Markov’s inequality (applied with threshold $\varepsilon m$):

$$\Pr(\text{Error}_1 \geq \varepsilon m) \leq \frac{E[\text{Error}_1]}{\varepsilon m} \leq \frac{1}{e}$$

By the same argument applied to each row independently, let $\text{Error}_j = \text{ctr}_j - F(y)$ denote the error in row $j$, where $\text{ctr}_j$ is the counter value at cell $h_j(y)$ in row $j$. Each $\text{Error}_j$ satisfies the same bound: $E[\text{Error}_j] \leq \varepsilon m / e$, and by Markov, $\Pr(\text{Error}_j \geq \varepsilon m) \leq 1/e$.

For CMS to return an overestimate $\hat{F}(y) \geq F(y) + \varepsilon m$, all the errors $\text{Error}_1, \text{Error}_2, \ldots, \text{Error}_r$ must be $\geq \varepsilon m$. Since the rows use independent hash functions:

$$\Pr\!\left(\text{Error}_1 \geq \varepsilon m \,\text{ and }\, \text{Error}_2 \geq \varepsilon m \,\text{ and } \cdots \text{ and }\, \text{Error}_r \geq \varepsilon m\right) \leq \left(\frac{1}{e}\right)^r = \left(\frac{1}{e}\right)^{\ln(1/\delta)} = \delta$$

Therefore:

$$\Pr(\text{all errors are } \geq \varepsilon m) \leq \delta$$

Taking the complement (“not all rows exceed $\varepsilon m$” is the same as “at least one row has error $\leq \varepsilon m$”):

$$\implies \Pr(\text{some error is } \leq \varepsilon m) \geq 1 - \delta$$

And if even one row $j$ has $\text{Error}_j \leq \varepsilon m$, then that row’s counter value is $\leq F(y) + \varepsilon m$. Since $\hat{F}(y)$ is the minimum across all rows, it is $\leq$ that counter in particular, so:

$$\implies \Pr\!\left(\hat{F}(y) \leq F(y) + \varepsilon m\right) \geq 1 - \delta$$
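The guarantee can also be sanity-checked empirically. The simulation below is a self-contained sketch under the same assumptions as before (salted built-in hashing in place of perfectly random hash functions): it feeds a uniform random stream through a small CMS, verifies that every estimate satisfies $\hat{F}(y) \geq F(y)$, and reports the fraction of items whose error exceeds $\varepsilon m$, which the theorem says should be at most $\delta$ per query.

```python
import math
import random
from collections import Counter

def simulate(eps=0.1, delta=0.05, m=10_000, n=100, seed=1):
    """Empirical check of the CMS guarantee (a sanity check, not a proof)."""
    c = math.ceil(math.e / eps)
    r = math.ceil(math.log(1 / delta))
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(r)]
    table = [[0] * c for _ in range(r)]

    def cell(j, x):
        return hash((salts[j], x)) % c

    true = Counter()
    for _ in range(m):
        x = rng.randint(1, n)  # uniform stream over [n]
        true[x] += 1
        for j in range(r):
            table[j][cell(j, x)] += 1

    bad = 0
    for y in range(1, n + 1):
        est = min(table[j][cell(j, y)] for j in range(r))
        assert est >= true[y]          # CMS never underestimates
        if est - true[y] > eps * m:    # error beyond the eps*m bound
            bad += 1
    return bad / n  # fraction of queries violating the bound; expect <= delta

print(simulate())
```

With these parameters the expected per-row inflation is around $m/c \approx 357$, far below $\varepsilon m = 1000$, so the observed violation fraction is typically $0$, comfortably within $\delta$.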