Final Example
No calculators allowed (if I feel that you have used a calculator, you get minus 15 points). The exam has 6 questions, adding up to 120 points. Your final score is min(your total score on the 5 problems, 100). Please do the problems in order.
Chernoff Bound: Let be i.i.d. Bernoulli random variables and . Let . Then for any :
Problem 1: Count-Min Sketch (15 pts)
Section titled “Problem 1: Count-Min Sketch (15 pts)”Which real-world problem is the count-min sketch intended to solve? What are the space-requirements and guarantees of the count-min sketch? What other basic data structure would you pair the count-min sketch with to solve the real-world problem stated earlier?
Solution
Real-World Problem
The Count-Min Sketch solves the frequency estimation / heavy hitters problem: given a data stream of items, efficiently answer “what is the frequency of item ?” and maintain the top- most frequent items (“heavy hitters”) using sublinear space.
Space Requirements
The sketch is an matrix of counters, where:
Total space: counters.
Guarantees
Given a stream of length , with probability at least the estimated frequency satisfies:
The sketch never underestimates, and overestimates by at most an additive with high probability.
Paired Data Structure
Pair with a min-heap of size to maintain the top- heavy hitters. As each new item arrives, query the sketch for and compare it against the minimum in the heap; if larger, evict the minimum and insert .
Problem 2: Set Membership Storage (18 pts)
Section titled “Problem 2: Set Membership Storage (18 pts)”You are given a set of records, where each record is a string (in the English alphabet, all lower case) of length 100. You are asked to store the records in order to solve the membership problem.
- (2) What is the membership problem?
- (5) In order to always give a correct answer, how many bits (roughly) would you use in total? How many bits per record does that amount to?
- (1) Suppose you can only store 13 bits per record. What data structure would you use to solve membership?
- (10) What is the guarantee of the data structure above? How often can it make an “error”? You do not need to explain how to build the data structure.
Solution
(2) The Membership Problem
Given a set and a query element , answer: is ?
(5) Bits for Exact Storage
Each record is a string of 100 lowercase letters over an alphabet of size 26. The universe of all possible records has size . To uniquely identify any key in we need:
For records the total is roughly bits, i.e., about 471 bits per record.
(1) With 13 Bits Per Record
Use a Bloom filter (total bits).
(10) Bloom Filter Guarantee
A Bloom filter provides:
- No false negatives: if , the filter always answers “yes.”
- Bounded false positive rate : if , the filter incorrectly answers “yes” with probability at most .
For bits with the optimal number of hash functions , we can find from the space formula :
So the filter errs (false positive) on roughly 1 in 512 non-member queries.
Problem 3: Approximate Percentile Streaming (30 pts)
Section titled “Problem 3: Approximate Percentile Streaming (30 pts)”In class we saw the approximate median algorithm. Now we consider the problem where we are given a stream of distinct elements, and instead of the median (which is the th smallest element) we want to output the 60th percentile (or the key with rank ). Since this is hard to do exactly, we are happy with an element such that . Give a streaming algorithm for this problem such that:
-
- the probability of the returned number being too small is at most
- the probability of the returned number being too large is at most
What is the space usage of your algorithm?
Solution
Algorithm
- Draw elements uniformly at random from the stream (via reservoir sampling).
- Return the th smallest element among the samples.
Analysis
Partition the stream by rank into three regions:
- : the elements with rank (too small)
- Good region: elements with rank in (size )
- : the elements with rank (too large)
Let denote the 180th smallest sample. We bound each failure event.
Event 1 — Too Small ()
falls in iff more than samples land in (since elements are the smallest, if at least of the samples are in then the 180th smallest sample is one of them).
Let . Each sample lands in with probability , so .
Set and solve:
Applying Chernoff (with , ):
For this to be : need , so . ✅
Event 2 — Too Large ()
falls in iff more than samples land in (if there are samples in , which are the largest elements, then the 180th smallest sample is not among the largest samples and could still be in ). More precisely: if fewer than samples have rank , then the 180th smallest has rank .
Let . Each sample lands in with probability , so .
iff . Set and solve:
Applying Chernoff (with , ):
For this to be : need , so . ✅
Conclusion
Setting satisfies both constraints simultaneously:
Space Usage
The algorithm stores a reservoir of samples. Since is a constant (determined only by the error parameters), the total space is words, or bits where is the element universe size.
Problem 4: Bloom Filter Analysis (18 pts)
Section titled “Problem 4: Bloom Filter Analysis (18 pts)”Your friend has built a Bloom Filter on keys using 6 hash functions, and sent it to you. Their Bloom Filter uses 100 bits.
- (5) What is the probability that a given bit in this Bloom Filter is 0?
- (3) What is the expected number of 0 bits in the Bloom Filter?
- (10) Suppose your friend forgot to tell you what was. How would you estimate ? [Hint: Is there some empirical measurement you can make and use your formula above?]
Solution
(5) Probability a Given Bit Is 0
Inserting one key sets bits chosen uniformly at random from positions. The probability that a specific bit is not touched by one insertion is . Since all insertions are independent:
(3) Expected Number of 0 Bits
By linearity of expectation, since each of the 100 bits is independently 0 with the probability above:
(10) Estimating
Count the number of 0 bits in the received filter; call it . From part (3), . Solving for :
So count the 0 bits in the filter to get , then report .
(If all bits are 1 — — the formula breaks down; you can only say is very large relative to .)
Problem 5: MTA Commute Strategy (19 pts)
Section titled “Problem 5: MTA Commute Strategy (19 pts)”Suppose a daily ride on the MTA costs $6, and an annual card costs $834. For the purposes of this question assume there are no weekly/monthly cards. You commute every day for work, but due to the pandemic your company may switch to remote work. Your boss knows the date when everyone will switch to remote work, but has not informed you, and will only do so on the morning of the switch. What would you do in such a situation? Is there a number such that your strategy can guarantee not to spend more than times the amount spent by your boss on the train rides? Assuming you start from January 1 this year, when is the “special date” in your strategy when you will make a decision?
Solution
Reduction to Ski Rental
This is the ski rental problem. The daily MTA fare ($6) plays the role of daily rental cost, and the annual card ($834) plays the role of the one-time purchase. The break-even day is:
After renting for 139 days you’ve spent exactly the cost of the annual card.
Strategy
Buy daily tickets for the first 139 days. On day 140, if you are still commuting, buy the annual card.
Competitive Analysis ()
Let be the total number of days you end up commuting (unknown in advance). Your boss, knowing , would pay .
- If : You only rent. Cost . Ratio .
- If : You rent 139 days then buy. Cost is dollars. Your boss buys immediately for $834. Ratio .
The worst-case ratio is , so this strategy is 2-competitive.
The Special Date
Counting 139 days from January 1 (2026 is not a leap year):
Day 139 May 19. Day 140 is May 20 — that is the special date when, if you are still commuting, you buy the annual card.
Problem 6: Third Frequency Moment (20 pts)
Section titled “Problem 6: Third Frequency Moment (20 pts)”Given a stream, explain what the third frequency moment is. Given an algorithm (without analysis) to output an estimate of the third moment on the stream. Run your algorithm three times independently on the stream and show the output each time. Compute the average of the three outputs, and compare with the true answer.
Solution
Definition
Given stream , let be the frequency of item . The -th frequency moment is . The third frequency moment is:
AMS Sampling Algorithm (for )
- Sample a position uniformly at random from .
- Count = number of occurrences of at positions (i.e., from to the end of the stream).
- Output .
One run gives an unbiased estimate (). Average several independent runs for accuracy.
Running on
Stream length . Item frequencies:
| Item | Positions | ||
|---|---|---|---|
| 1 | 1, 2, 15, 16 | 4 | 64 |
| 2 | 4, 11, 13 | 3 | 27 |
| 3 | 3, 14 | 2 | 8 |
| 5 | 5, 8, 9 | 3 | 27 |
| 6 | 12 | 1 | 1 |
| 7 | 6, 10 | 2 | 8 |
| 8 | 7, 17 | 2 | 8 |
True answer: .
Run 1 — sample position 4 ().
2 appears at positions 4, 11, 13, so occurrences from position 4 onward.
Run 2 — sample position 7 ().
8 appears at positions 7, 17, so occurrences from position 7 onward.
Run 3 — sample position 15 ().
1 appears at positions 15, 16 from position 15 onward, so .
Average:
Comparison: The estimate (187) overshoots the true value (143) by about 31%. This is expected — AMS has high variance and only approaches the true answer with many independent repetitions ( copies are needed for an guarantee on ; in general for ).