Lecture 17 (03/23/2026) - Frequency Estimation (Heavy Hitters); Count-Min Sketch
Frequency Estimation in Streams
Heavy Hitters Problem Statement
Consider a stream x_1, x_2, …, x_m, where each element represents an Amazon product ID.
Assume that the stream contains numbers from {1, …, n}. For any i ∈ {1, …, n}:
F(i) = |{t : x_t = i}|
(i.e., the number of times i appears in the stream). For example, in the stream 1, 2, 1, 3 we have F(1) = 2 and F(2) = 1.
We want to keep track of the top-k most frequent items (the heavy hitters).
- Lower bound: if we don't allow any errors, then even k = 1 requires us to store the full stream. Intuitively, to know the exact most frequent item you may need to distinguish between streams that differ only in their last element, which forces you to remember everything.
Frequency Query Data Structure
- If we want to save space, we have to allow errors.
- It is enough to build a data structure/sketch that performs queries of the form: What is F(i)? given any i ∈ {1, …, n}.
Given such a frequency-query sketch, we can track the top-k items using a min-heap of size k (with each item's current frequency as its priority). Whenever a new item arrives in the stream, we query its frequency and compare it to the smallest frequency currently in the heap. If it is larger, it displaces the least popular item.
We can define a heap H that stores the top-k items in the stream so far.
Initialization: For the first k distinct items seen in the stream, insert them directly into the heap (they are trivially the top-k so far).
For each subsequent item x arriving in the stream:
- Query F(x)
- Compare F(x) to F(min), the frequency of the current minimum in the heap
- If F(x) > F(min):
  - Delete the minimum from the heap
  - Insert x with priority F(x)
The total update time per arriving item is O(log k), since the heap stores only k items and heap operations take O(log k) time.
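The heap-based tracker above can be sketched as follows. This is a simplified demo, not the lecture's exact pseudocode: exact counts from a `Counter` stand in for the frequency-query sketch, and priority refreshes use a simple O(k) rebuild rather than a decrease-key structure.

```python
import heapq
from collections import Counter

def top_k_stream(stream, k):
    """Track the top-k most frequent items with a size-k min-heap.

    Exact counts stand in for the frequency-query sketch here; with a
    CMS, `counts[x]` would be replaced by a sketch query. Heap entries
    are (frequency, item), so heap[0] is the least popular tracked item.
    """
    counts = Counter()
    heap, in_heap = [], set()
    for x in stream:
        counts[x] += 1
        f = counts[x]
        if x in in_heap:
            # Refresh priorities (simple O(k) rebuild for the demo; a
            # real implementation would use a decrease-key structure).
            heap = [(counts[it], it) for _, it in heap]
            heapq.heapify(heap)
        elif len(heap) < k:
            heapq.heappush(heap, (f, x))
            in_heap.add(x)
        elif f > heap[0][0]:
            _, evicted = heapq.heappop(heap)
            in_heap.discard(evicted)
            heapq.heappush(heap, (f, x))
            in_heap.add(x)
    return sorted(item for _, item in heap)

# Stream: item 1 appears 4x, item 2 appears 3x, item 3 appears once.
print(top_k_stream([1, 2, 1, 3, 2, 1, 2, 1], k=2))  # → [1, 2]
```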
Problem Statement: Build a Sketch to Answer F(i) Queries
The setting is that we are given a stream x_1, …, x_m where each x_t ∈ {1, …, n}.
Goal: Store a small summary of the stream so as to answer queries: what is F(i)?
Count-Min Sketch
Section titled “Count-Min Sketch”Introduction
CMS can also be thought of as a stacked Bloom filter: instead of storing bits, we keep actual counters, and when querying we return the minimum counter value across all rows.
Data Structure (r × c Matrix, One Hash Function per Row)
Consider a table T (sort of like a matrix) with r rows and c columns, where:
- Number of columns: c = e/ε
- Number of rows: r = ln(1/δ)
For every row, we pick a hash function. There are r hash functions h_1, …, h_r, one per row, where each h_j is a perfectly random hash function:
h_j : {1, …, n} → {1, …, c}
All of these hash functions take any of the n products as inputs and will output one of the c columns. In other words, each hash function takes an item and places it into one of the columns in its row, uniformly at random.
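A minimal sketch of the initialization, under the parameter choices above. The function name `make_cms` is my own, and the "perfectly random" hash functions are simulated with a seeded lookup table, which is fine for a demo but not how practical implementations work (those use pairwise-independent hash families).

```python
import math
import random

def make_cms(eps, delta, n, seed=0):
    """Build an empty Count-Min Sketch table plus its row hash functions.

    r = ceil(ln(1/delta)) rows, c = ceil(e/eps) columns. Each row's
    hash function is a lookup table mapping every item in {0,...,n-1}
    to a uniformly random column.
    """
    r = math.ceil(math.log(1 / delta))
    c = math.ceil(math.e / eps)
    rng = random.Random(seed)
    # hashes[j][i] = column assigned to item i by row j's hash function
    hashes = [[rng.randrange(c) for _ in range(n)] for _ in range(r)]
    table = [[0] * c for _ in range(r)]
    return table, hashes

table, hashes = make_cms(eps=0.1, delta=0.01, n=100)
print(len(table), len(table[0]))  # → 5 28  (ceil(ln 100) rows, ceil(e/0.1) columns)
```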
Update Algorithm
Given a stream x_1, …, x_m, for each arriving item x_t:
- Compute each of the hash functions: h_1(x_t), h_2(x_t), …, h_r(x_t).
So in every row j we get a cell h_j(x_t) that the item gets hashed to by that row's hash function. Then we increment the counters in those cells by 1 (T[j, h_j(x_t)] += 1 for every row j), indicating that the item has been hashed into that cell. We do this for all items in the stream.
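The update step, on a toy sketch with hard-coded hash values (the table size, the stream, and the hash assignments below are all made up for illustration):

```python
# Toy sketch: r = 2 rows, c = 4 columns, items 0..2.
# hashes[j][i] = h_j(i); the values are hard-coded for illustration.
table = [[0] * 4 for _ in range(2)]
hashes = [[0, 1, 1],   # row 0: items 1 and 2 collide in column 1
          [2, 3, 0]]   # row 1: no collisions among the three items

def cms_update(table, hashes, x):
    """Process one arriving item x: increment T[j][h_j(x)] in every row j."""
    for j, h in enumerate(hashes):
        table[j][h[x]] += 1

for x in [0, 1, 0, 2]:   # stream: item 0 twice, items 1 and 2 once each
    cms_update(table, hashes, x)
print(table)  # → [[2, 2, 0, 0], [1, 0, 2, 1]]
```

Note that each arriving item touches exactly one cell per row, so an update costs r counter increments.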
Query Algorithm
Query: given an item y, what is the frequency of y, i.e. F(y)?
Query algorithm: Compute h_1(y), …, h_r(y) and output the minimum value in these cells:
F̃(y) = min_j T[j, h_j(y)].
The reason we take the minimum is that hash collisions can only inflate a counter above the true frequency (other items that collide into the same cell add to it). So the row whose counter is least inflated gives the best estimate, and taking the minimum across all rows gives us the closest value to the truth. This is also where the name comes from: we are counting frequencies using a min.
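The query step, continuing the same toy setup (counters, hash values, and the stream 0, 1, 0, 2 are hard-coded and purely illustrative; true frequencies are F(0) = 2, F(1) = 1, F(2) = 1):

```python
# Counters after processing the toy stream 0, 1, 0, 2; hashes[j][i] = h_j(i).
table = [[2, 2, 0, 0],
         [1, 0, 2, 1]]
hashes = [[0, 1, 1],
          [2, 3, 0]]

def cms_query(table, hashes, y):
    """Estimate F(y) as the minimum of y's counters across all rows."""
    return min(table[j][h[y]] for j, h in enumerate(hashes))

# In row 0, items 1 and 2 collide, so that counter is inflated to 2;
# row 1 has no collision for item 1, and the min recovers the truth.
print(cms_query(table, hashes, 1))  # → 1
```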
Let F̃(y) denote the value output by Count-Min Sketch. Note:
- Can F̃(y) < F(y)? No - because every time this item arrives in the stream, we would have updated the counters at least that many times, so the counters that we query for will always be at least F(y).
- So F̃(y) ≥ F(y) always - CMS can only overestimate frequencies. The real question is: by how much can it overestimate?
Theorem: P[F̃(y) ≤ F(y) + εm] ≥ 1 − δ,
where m is the length of the stream so far.
Absolute vs. Relative Error
Note that the guarantee above is an absolute (additive) error bound: the estimate is off by at most εm, a fraction of the total stream length, not a relative error like ε·F(y). For items whose true frequency is small compared to εm, the estimate can be off by a large multiplicative factor, so the guarantee is most meaningful for heavy hitters, whose frequency is a constant fraction of m.
Section titled “Absolute vs. Relative Error”Proof: Lemma E[ctr] ≤ F(y) + m/c
We'll focus on only one row here, for which h_1 is the hash function. Let ctr = T[1, h_1(y)] denote the counter the query reads in this row.
- ctr ≥ F(y) (since every occurrence of y increments that cell, any collisions only add more).
Lemma: E[ctr] ≤ F(y) + m/c.
Proof: Consider one fixed row. There are m total items in the stream. Each item gets hashed (uniformly at random) to one of the c cells in this row. For any single cell, the expected number of items (out of the full m) that land in it is m/c. Since y's own occurrences always land in cell h_1(y) (contributing exactly F(y) to the counter), and every other item contributes independently with probability 1/c, the expected counter value is at most F(y) + m/c. Note that this bound is actually slightly loose because we include even y's occurrences in the m/c term, but the claim still holds.
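A quick Monte Carlo sanity check of the lemma for a single row. All parameter values (F(y) = 20, 80 other items, c = 10, so m = 100) are made up for the demo; the function name `avg_counter` is my own.

```python
import random

def avg_counter(F_y=20, m_other=80, c=10, trials=2000, seed=1):
    """Estimate E[ctr] for one row by simulation.

    All F_y occurrences of y land in y's cell; each of the m_other
    remaining items collides with it independently with probability
    1/c. So E[ctr] = F_y + m_other/c, which is at most F(y) + m/c.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        collisions = sum(1 for _ in range(m_other) if rng.randrange(c) == 0)
        total += F_y + collisions
    return total / trials

print(avg_counter())  # ≈ 20 + 80/10 = 28, below the bound F(y) + m/c = 30
```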
Corollary: Let err_1 = ctr − F(y), then E[err_1] ≤ m/c.
Proof: Markov on One Row → P[error > εm] ≤ 1/e
Continuing the proof of the guarantee. Since c = e/ε, we can substitute into the bound from the corollary:
E[err_1] ≤ m/c = εm/e.
So the error in the first row satisfies E[err_1] ≤ εm/e. By Markov's inequality:
P[err_1 > εm] ≤ E[err_1]/(εm) ≤ (εm/e)/(εm) = 1/e.
By the same argument applied to each row independently, let err_j = ctr_j − F(y) denote the error in row j, where ctr_j = T[j, h_j(y)] is the counter value at cell h_j(y) in row j. Each err_j satisfies the same bound: E[err_j] ≤ εm/e, and by Markov, P[err_j > εm] ≤ 1/e.
For CMS to return an overestimate F̃(y) > F(y) + εm, all the errors err_1, …, err_r must be > εm. Since the rows use independent hash functions:
P[err_1 > εm and … and err_r > εm] = ∏_{j=1}^{r} P[err_j > εm] ≤ (1/e)^r.
Therefore, with r = ln(1/δ):
P[F̃(y) > F(y) + εm] ≤ (1/e)^{ln(1/δ)} = δ.
Taking the complement ("not all rows exceed εm" is the same as "at least one row has error ≤ εm"):
P[at least one row j has err_j ≤ εm] ≥ 1 − δ.
And if even one row has err_j ≤ εm, then that row's counter value is ≤ F(y) + εm. Since F̃(y) is the minimum across all rows, it is ≤ that counter in particular, so:
P[F̃(y) ≤ F(y) + εm] ≥ 1 − δ. ∎