
Lecture 17 (03/23/2026) - Frequency Estimation (Heavy Hitters); Count-Min Sketch

Consider a stream $S = \{10, 2, 2, 5, 1, 2, 10, 5, 5, 5, 3, 1, \ldots\}$, where each element represents an Amazon product ID.

Assume that $S$ contains numbers $1, \ldots, n$. For any $1 \leq i \leq n$:

$$F(i) = \text{frequency of } i \text{ in } S$$

(i.e., the number of times $i$ appears in $S$). For example, $F(2) = 3$, $F(5) = 4$.

We want to keep track of the top-$k$ most frequent items (the heavy hitters).

  • Lower bound: if we don’t allow any errors, then even $k = 1$ requires us to store the full stream. Intuitively, to know the exact most frequent item you may need to distinguish between streams that differ only in their last element, which forces you to remember everything.
  1. If we want to save space, we have to allow errors.
  2. It is enough to build a data structure/sketch that answers queries of the form: what is $F(i)$, for any $1 \leq i \leq n$?

Given such a frequency-query sketch, we can track the top-$k$ items using a min-heap of size $k$ (with each item’s current frequency as its priority). Whenever a new item arrives in the stream, we query its frequency and compare it to the smallest frequency currently in the heap. If it is larger, it displaces the least popular item.

We can define a heap that stores the top-$k$ items in the stream so far.

Initialization: For the first $k$ distinct items seen in the stream, insert them directly into the heap (they are trivially the top-$k$ so far).

For each subsequent item $i'$ arriving in the stream:

  • Query $F(i')$
  • Compare $F(i')$ to $F(\text{min\_heap})$ (the frequency of the current minimum in the heap)
  • If $F(i') > F(\text{min\_heap})$:
    • Delete the minimum from the heap
    • Insert $(i', F(i'))$

The total update time per arriving item is $O(\text{query time} + \log k)$, since the heap stores only $k$ items and heap operations take $O(\log k)$ time.
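The heap procedure above can be sketched in Python. This is a minimal illustration, not the lecture’s canonical code: `query_frequency` is a stand-in for the sketch’s $F(i)$ query (answered exactly in the usage example below; in practice it would be a Count-Min Sketch query), and repeated items simply get their priority refreshed with an $O(k)$ rebuild for simplicity.

```python
import heapq
from collections import Counter

def track_top_k(stream, k, query_frequency):
    """Maintain the top-k items seen so far, given a frequency oracle."""
    heap = []     # min-heap of (estimated frequency, item)
    members = {}  # item -> latest queried frequency
    for item in stream:
        freq = query_frequency(item)
        if item in members:
            # Already tracked: refresh its priority (O(k) rebuild,
            # kept deliberately simple for exposition).
            members[item] = freq
            heap = [(members[x], x) for x in members]
            heapq.heapify(heap)
        elif len(heap) < k:
            # The first k distinct items go straight into the heap.
            members[item] = freq
            heapq.heappush(heap, (freq, item))
        elif freq > heap[0][0]:
            # Displace the least frequent tracked item.
            _, evicted = heapq.heappop(heap)
            del members[evicted]
            members[item] = freq
            heapq.heappush(heap, (freq, item))
    return sorted(members, key=members.get, reverse=True)

# Usage with exact counts as the frequency oracle:
counts = Counter()

def counted(seq):
    for x in seq:
        counts[x] += 1  # the count is updated before the query
        yield x

top2 = track_top_k(counted([10, 2, 2, 5, 1, 2, 10, 5, 5, 5, 3, 1]),
                   k=2, query_frequency=lambda i: counts[i])
print(top2)  # -> [5, 2] for this stream
```

With a sketch in place of the exact oracle, the returned items are only approximately the top-$k$, since queried frequencies may be overestimates.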

Problem Statement: Build a Sketch to Answer F(i) Queries


The setting is that we are given a stream $S = \{x_1, x_2, \ldots, x_m\}$ where $x_i \in [n] = \{1, \ldots, n\}$.

Goal: Store $S$ so as to answer queries: what is $F(i)$?

The Count-Min Sketch (CMS) can also be thought of as a stacked Bloom filter: instead of storing bits, we keep actual counters, and when querying we return the minimum counter value across all rows.

Data Structure (r × c Matrix, One Hash Function per Row)


Consider a table (sort of like a matrix) with rr rows and cc columns, where:

  • Number of columns: $c = e / \varepsilon$
  • Number of rows: $r = \ln(1 / \delta)$
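Concretely, these dimensions can be computed for a given accuracy $\varepsilon$ and failure probability $\delta$. Rounding up to whole cells is an implementation choice; the lecture leaves $c$ and $r$ as real numbers.

```python
import math

def cms_dimensions(eps, delta):
    """Table size for additive error at most eps*m with
    probability at least 1 - delta (per query)."""
    c = math.ceil(math.e / eps)         # columns: c = e / eps
    r = math.ceil(math.log(1 / delta))  # rows:    r = ln(1 / delta)
    return r, c

# e.g. 1% additive error with 99% confidence per query:
print(cms_dimensions(0.01, 0.01))  # -> (5, 272)
```

Note that the space is $r \cdot c = O\!\left(\frac{1}{\varepsilon}\ln\frac{1}{\delta}\right)$ counters, independent of $n$ and of the stream length $m$.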


For every row, we pick a hash function. There are $r$ hash functions $h_1, h_2, \ldots, h_r$, one per row, where $h_j$ is a perfectly random hash function:

$$h_j : \{1, \ldots, n\} \to \{1, \ldots, c\}, \quad \text{for } 1 \leq j \leq r$$

All of these hash functions take any of the $n$ products as input and output one of the $c$ columns. In other words, each hash function places an item into one of the columns of its row, uniformly at random.

Given a stream $S = \{x_1, \ldots, x_i, \ldots\}$, for each arriving item $x$:

  1. Compute each of the hash functions $h_1(x), h_2(x), \ldots, h_r(x)$. In every row, this gives the cell that the item gets hashed to by that row’s hash function.
  2. Increment the counters in those cells by 1, indicating that the item has been hashed into each of those cells.

We do this for all items in the stream.

Query: Given an item $y$, what is its frequency, i.e. $F(y)$?

Query algorithm: Compute $h_1(y), h_2(y), \ldots, h_r(y)$ and output the minimum value among these cells.

The reason we take the minimum is that hash collisions can only inflate a counter above the true frequency (other items that collide into the same cell add to it). So the row whose counter is least inflated gives the best estimate, and taking the minimum across all rows gives us the closest value to the truth. This is also where the name comes from: we are counting frequencies using a min.
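The update and query procedures can be sketched as a small class. This is a minimal illustration under stated assumptions, not the canonical implementation: it uses Python’s built-in `hash` with a per-row random salt as a stand-in for perfectly random hash functions, and rounds $c$ and $r$ up to integers.

```python
import math
import random

class CountMinSketch:
    """Minimal Count-Min Sketch: r rows of c counters, one hash per row."""

    def __init__(self, eps, delta, seed=0):
        self.c = math.ceil(math.e / eps)         # columns: c = e / eps
        self.r = math.ceil(math.log(1 / delta))  # rows:    r = ln(1 / delta)
        rng = random.Random(seed)
        # One 64-bit salt per row; hash((salt, x)) stands in for a
        # perfectly random hash function h_j : [n] -> [c].
        self.salts = [rng.getrandbits(64) for _ in range(self.r)]
        self.table = [[0] * self.c for _ in range(self.r)]

    def _cell(self, j, x):
        return hash((self.salts[j], x)) % self.c

    def update(self, x):
        # Increment the hashed cell in every row.
        for j in range(self.r):
            self.table[j][self._cell(j, x)] += 1

    def query(self, y):
        # The minimum across rows is the least-inflated estimate of F(y).
        return min(self.table[j][self._cell(j, y)] for j in range(self.r))

# Feed the example stream and query a few frequencies.
cms = CountMinSketch(eps=0.01, delta=0.01)
for x in [10, 2, 2, 5, 1, 2, 10, 5, 5, 5, 3, 1]:
    cms.update(x)
print(cms.query(5), cms.query(2))  # each estimate is >= the true frequency
```

Each returned estimate is at least the true frequency ($F(5) = 4$, $F(2) = 3$ here), matching the overestimate-only property discussed next.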

Let $\hat{F}(y)$ denote the value output by Count-Min Sketch. Note:

  1. Can $\hat{F}(y) < F(y)$? No: every time $y$ arrives in the stream, all $r$ of its counters are incremented, so each counter that we query for $y$ is always at least $F(y)$.
  2. So $\hat{F}(y) \geq F(y)$ always; CMS can only overestimate frequencies. The real question is: by how much can it overestimate?

Theorem:

$$\Pr\!\left(\hat{F}(y) - F(y) \geq \varepsilon m\right) \leq \delta \iff \Pr\!\left(\hat{F}(y) \leq F(y) + \varepsilon m\right) \geq 1 - \delta$$

where $m$ is the length of the stream so far.

We’ll focus on only one row here, for which $h_1$ is the hash function.


  • Let $\text{ctr}$ denote the counter value at cell $h_1(y)$ in this row. Then $\text{ctr} \geq F(y)$ (since every occurrence of $y$ increments that cell, any collisions only add more).

Lemma: $E[\text{ctr}] \leq F(y) + \dfrac{m}{c}$

Proof: Consider one fixed row. There are $m$ total items in the stream. Each item gets hashed (uniformly at random) to one of the $c$ cells in this row. For any single cell, the expected number of items (out of the full $m$) that land in it is $m/c$. Since $y$’s own occurrences always land in cell $h_1(y)$ (contributing exactly $F(y)$ to the counter), and every other item contributes independently with probability $1/c$, the expected counter value is at most $F(y) + m/c$. Note that this bound is actually slightly loose, because we also include $y$’s own occurrences in the $m/c$ term, but the claim still holds.
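Written out, this counting argument is a one-line calculation, splitting the counter into $y$’s own contribution and the expected collisions from the other $m - F(y)$ items:

$$E[\text{ctr}] = F(y) + \sum_{x \neq y} F(x)\cdot \Pr[h_1(x) = h_1(y)] = F(y) + \frac{m - F(y)}{c} \leq F(y) + \frac{m}{c}$$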

Corollary: Let $\text{Error} = \text{ctr} - F(y)$; then $E[\text{Error}] \leq \dfrac{m}{c}$.

Proof: Markov on One Row → P[error ≥ εm] ≤ 1/e


Continuing the proof of the guarantee. Since $c = e/\varepsilon$, we can substitute into the bound from the corollary:

$$E[\text{Error}] \leq \frac{m}{c} = \frac{m}{e/\varepsilon} = m \cdot \frac{\varepsilon}{e} = \frac{\varepsilon m}{e}$$

So the error in the first row satisfies $E[\text{Error}_1] \leq \dfrac{\varepsilon m}{e}$. By Markov’s inequality (applied with threshold $\varepsilon m$):

$$\Pr(\text{Error}_1 \geq \varepsilon m) \leq \frac{E[\text{Error}_1]}{\varepsilon m} \leq \frac{1}{e}$$

By the same argument applied to each row independently, let $\text{Error}_j = \text{ctr}_j - F(y)$ denote the error in row $j$, where $\text{ctr}_j$ is the counter value at cell $h_j(y)$ in row $j$. Each $\text{Error}_j$ satisfies the same bound: $E[\text{Error}_j] \leq \varepsilon m / e$, and by Markov, $\Pr(\text{Error}_j \geq \varepsilon m) \leq 1/e$.

For CMS to return an overestimate $\hat{F}(y) \geq F(y) + \varepsilon m$, all the errors $\text{Error}_1, \text{Error}_2, \ldots, \text{Error}_r$ must be $\geq \varepsilon m$. Since the rows use independent hash functions:

$$\Pr\!\left(\text{Error}_1 \geq \varepsilon m \,\text{ and }\, \text{Error}_2 \geq \varepsilon m \,\text{ and } \cdots \text{ and }\, \text{Error}_r \geq \varepsilon m\right) \leq \left(\frac{1}{e}\right)^r = \left(\frac{1}{e}\right)^{\ln(1/\delta)} = \delta$$

Therefore:

$$\Pr(\text{all errors are } \geq \varepsilon m) \leq \delta$$

Taking the complement (“not all rows exceed $\varepsilon m$” is the same as “at least one row has error $\leq \varepsilon m$”):

$$\implies \Pr(\text{some error is } \leq \varepsilon m) \geq 1 - \delta$$

And if even one row $j$ has $\text{Error}_j \leq \varepsilon m$, then that row’s counter value is $\leq F(y) + \varepsilon m$. Since $\hat{F}(y)$ is the minimum across all rows, it is $\leq$ that counter in particular, so:

$$\implies \Pr\!\left(\hat{F}(y) \leq F(y) + \varepsilon m\right) \geq 1 - \delta$$
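The guarantee can also be sanity-checked empirically. The simulation below is a self-contained sketch under the same assumptions as before (salted built-in hashing in place of perfectly random hash functions): it feeds a uniform random stream through a small CMS, verifies that every estimate satisfies $\hat{F}(y) \geq F(y)$, and reports the fraction of items whose error exceeds $\varepsilon m$, which the theorem says should be at most $\delta$ per query.

```python
import math
import random
from collections import Counter

def simulate(eps=0.1, delta=0.05, m=10_000, n=100, seed=1):
    """Empirical check of the CMS guarantee (a sanity check, not a proof)."""
    c = math.ceil(math.e / eps)
    r = math.ceil(math.log(1 / delta))
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(r)]
    table = [[0] * c for _ in range(r)]

    def cell(j, x):
        return hash((salts[j], x)) % c

    true = Counter()
    for _ in range(m):
        x = rng.randint(1, n)  # uniform stream over [n]
        true[x] += 1
        for j in range(r):
            table[j][cell(j, x)] += 1

    bad = 0
    for y in range(1, n + 1):
        est = min(table[j][cell(j, y)] for j in range(r))
        assert est >= true[y]          # CMS never underestimates
        if est - true[y] > eps * m:    # error beyond the eps*m bound
            bad += 1
    return bad / n  # fraction of queries violating the bound; expect <= delta

print(simulate())
```

With these parameters the expected per-row inflation is around $m/c \approx 357$, far below $\varepsilon m = 1000$, so the observed violation fraction is typically $0$, comfortably within $\delta$.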