Skip to content

Final Example

No calculators allowed (if I feel that you have used a calculator, you get minus 15 points). The exam has 6 questions, adding up to 120 points. Your final score is min(your total score on the 5 problems, 100). Please do the problems in order.

Chernoff Bound: Let {Xi}i=1n\{X_i\}_{i=1}^n be i.i.d. Bernoulli random variables and X=i=1nXiX = \sum_{i=1}^n X_i. Let μ=E[X]\mu = \mathbb{E}[X]. Then for any 0<δ10 < \delta \le 1:

Pr(X(1+δ)μ)eμδ2/3\Pr(X \ge (1+\delta)\mu) \le e^{-\mu\delta^2/3}

Which real-world problem is the count-min sketch intended to solve? What are the space-requirements and guarantees of the count-min sketch? What other basic data structure would you pair the count-min sketch with to solve the real-world problem stated earlier?

Solution

Real-World Problem

The Count-Min Sketch solves the frequency estimation / heavy hitters problem: given a data stream of items, efficiently answer “what is the frequency F(y)F(y) of item yy?” and maintain the top-KK most frequent items (“heavy hitters”) using sublinear space.

Space Requirements

The sketch is an r×cr \times c matrix of counters, where:

c=eεr=ln ⁣(1δ)c = \left\lceil \frac{e}{\varepsilon} \right\rceil \qquad r = \left\lceil \ln\!\left(\frac{1}{\delta}\right) \right\rceil

Total space: O ⁣(1εlog1δ)O\!\left(\frac{1}{\varepsilon} \log \frac{1}{\delta}\right) counters.

Guarantees

Given a stream of length mm, with probability at least 1δ1 - \delta the estimated frequency F^(y)\widehat{F}(y) satisfies:

F(y)F^(y)F(y)+εmF(y) \le \widehat{F}(y) \le F(y) + \varepsilon m

The sketch never underestimates, and overestimates by at most an additive εm\varepsilon m with high probability.

Paired Data Structure

Pair with a min-heap of size KK to maintain the top-KK heavy hitters. As each new item xx arrives, query the sketch for F^(x)\widehat{F}(x) and compare it against the minimum in the heap; if larger, evict the minimum and insert xx.

Problem 2: Set Membership Storage (18 pts)

Section titled “Problem 2: Set Membership Storage (18 pts)”

You are given a set of nn records, where each record is a string (in the English alphabet, all lower case) of length 100. You are asked to store the records in order to solve the membership problem.

  • (2) What is the membership problem?
  • (5) In order to always give a correct answer, how many bits (roughly) would you use in total? How many bits per record does that amount to?
  • (1) Suppose you can only store 13 bits per record. What data structure would you use to solve membership?
  • (10) What is the guarantee of the data structure above? How often can it make an “error”? You do not need to explain how to build the data structure.
Solution

(2) The Membership Problem

Given a set S={x1,,xn}S = \{x_1, \ldots, x_n\} and a query element qq, answer: is qSq \in S?

(5) Bits for Exact Storage

Each record is a string of 100 lowercase letters over an alphabet of size 26. The universe of all possible records has size U=26100|U| = 26^{100}. To uniquely identify any key in UU we need:

log2(26100)=100log226100×4.7=471 bits per record\left\lceil \log_2(26^{100}) \right\rceil = \lceil 100 \log_2 26 \rceil \approx \lceil 100 \times 4.7 \rceil = 471 \text{ bits per record}

For nn records the total is roughly 471n\mathbf{471n} bits, i.e., about 471 bits per record.

(1) With 13 Bits Per Record

Use a Bloom filter (total m=13nm = 13n bits).

(10) Bloom Filter Guarantee

A Bloom filter provides:

  • No false negatives: if qSq \in S, the filter always answers “yes.”
  • Bounded false positive rate ε\varepsilon: if qSq \notin S, the filter incorrectly answers “yes” with probability at most ε\varepsilon.

For m=13nm = 13n bits with the optimal number of hash functions k=0.693m/n=9k = \lceil 0.693 \cdot m/n \rceil = 9, we can find ε\varepsilon from the space formula m=1.44nlog2(1/ε)m = 1.44n \log_2(1/\varepsilon):

13=1.44log2 ⁣(1ε)    log2 ⁣(1ε)9    ε15120.2%13 = 1.44 \log_2\!\left(\frac{1}{\varepsilon}\right) \implies \log_2\!\left(\frac{1}{\varepsilon}\right) \approx 9 \implies \varepsilon \approx \frac{1}{512} \approx 0.2\%

So the filter errs (false positive) on roughly 1 in 512 non-member queries.

Problem 3: Approximate Percentile Streaming (30 pts)

Section titled “Problem 3: Approximate Percentile Streaming (30 pts)”

In class we saw the approximate median algorithm. Now we consider the problem where we are given a stream S={x1,,xm}S = \{x_1, \cdots, x_m\} of mm distinct elements, and instead of the median (which is the m/2m/2th smallest element) we want to output the 60th percentile (or the key with rank 3m/53m/5). Since this is hard to do exactly, we are happy with an element xix_i such that m/2rank(xi)4m/5m/2 \le \operatorname{rank}(x_i) \le 4m/5. Give a streaming algorithm for this problem such that:

  • the probability of the returned number being too small is at most e2e^{-2}
  • the probability of the returned number being too large is at most e20e^{-20}

What is the space usage of your algorithm?

Solution

Algorithm

  1. Draw t=300t = 300 elements uniformly at random from the stream (via reservoir sampling).
  2. Return the 3t/5=180\lceil 3t/5 \rceil = 180th smallest element among the samples.

Analysis

Partition the stream by rank into three regions:

  • SLS_L: the m/2m/2 elements with rank <m/2< m/2 (too small)
  • Good region: elements with rank in [m/2, 4m/5][m/2,\ 4m/5] (size 3m/103m/10)
  • SRS_R: the m/5m/5 elements with rank >4m/5> 4m/5 (too large)

Let xx^* denote the 180th smallest sample. We bound each failure event.

Event 1 — Too Small (rank(x)<m/2\operatorname{rank}(x^*) < m/2)

xx^* falls in SLS_L iff more than 3t/53t/5 samples land in SLS_L (since SLS_L elements are the smallest, if at least 3t/53t/5 of the tt samples are in SLS_L then the 180th smallest sample is one of them).

Let TL=# samples in SLT_L = \#\text{ samples in }S_L. Each sample lands in SLS_L with probability 1/21/2, so μL=E[TL]=t/2\mu_L = \mathbb{E}[T_L] = t/2.

Set 3t/5=(1+γL)(t/2)3t/5 = (1 + \gamma_L)(t/2) and solve:

γL=3t/5t/21=651=15\gamma_L = \frac{3t/5}{t/2} - 1 = \frac{6}{5} - 1 = \frac{1}{5}

Applying Chernoff (with 0<γL=1/510 < \gamma_L = 1/5 \le 1, μL=t/2\mu_L = t/2):

Pr ⁣(TL3t5)eγL2μL/3=exp ⁣(125t213)=et/150\Pr\!\left(T_L \ge \frac{3t}{5}\right) \le e^{-\gamma_L^2 \mu_L / 3} = \exp\!\left(-\frac{1}{25} \cdot \frac{t}{2} \cdot \frac{1}{3}\right) = e^{-t/150}

For this to be e2\le e^{-2}: need t/1502t/150 \ge 2, so t300t \ge 300. ✅

Event 2 — Too Large (rank(x)>4m/5\operatorname{rank}(x^*) > 4m/5)

xx^* falls in SRS_R iff more than 2t/52t/5 samples land in SRS_R (if there are >2t/5> 2t/5 samples in SRS_R, which are the largest elements, then the 180th smallest sample is not among the >2t/5> 2t/5 largest samples and could still be in SRS_R). More precisely: if fewer than 3t/53t/5 samples have rank 4m/5\le 4m/5, then the 180th smallest has rank >4m/5> 4m/5.

Let Y=# samples in SRY = \#\text{ samples in }S_R. Each sample lands in SRS_R with probability SR/m=1/5|S_R|/m = 1/5, so μR=E[Y]=t/5\mu_R = \mathbb{E}[Y] = t/5.

xSRx^* \in S_R iff Y2t/5Y \ge 2t/5. Set 2t/5=(1+γR)(t/5)2t/5 = (1 + \gamma_R)(t/5) and solve:

γR=2t/5t/51=21=1\gamma_R = \frac{2t/5}{t/5} - 1 = 2 - 1 = 1

Applying Chernoff (with 0<γR=110 < \gamma_R = 1 \le 1, μR=t/5\mu_R = t/5):

Pr ⁣(Y2t5)eγR2μR/3=exp ⁣(1t513)=et/15\Pr\!\left(Y \ge \frac{2t}{5}\right) \le e^{-\gamma_R^2 \mu_R / 3} = \exp\!\left(-1 \cdot \frac{t}{5} \cdot \frac{1}{3}\right) = e^{-t/15}

For this to be e20\le e^{-20}: need t/1520t/15 \ge 20, so t300t \ge 300. ✅

Conclusion

Setting t=300t = 300 satisfies both constraints simultaneously:

Pr(too small)e300/150=e2Pr(too large)e300/15=e20\Pr(\text{too small}) \le e^{-300/150} = e^{-2} \qquad \Pr(\text{too large}) \le e^{-300/15} = e^{-20}

Space Usage

The algorithm stores a reservoir of t=300t = 300 samples. Since tt is a constant (determined only by the error parameters), the total space is O(1)O(1) words, or O(logn)O(\log n) bits where nn is the element universe size.

Your friend has built a Bloom Filter on nn keys using 6 hash functions, and sent it to you. Their Bloom Filter uses 100 bits.

  • (5) What is the probability that a given bit in this Bloom Filter is 0?
  • (3) What is the expected number of 0 bits in the Bloom Filter?
  • (10) Suppose your friend forgot to tell you what nn was. How would you estimate nn? [Hint: Is there some empirical measurement you can make and use your formula above?]
Solution

(5) Probability a Given Bit Is 0

Inserting one key sets k=6k = 6 bits chosen uniformly at random from m=100m = 100 positions. The probability that a specific bit is not touched by one insertion is 1k/m=16/1001 - k/m = 1 - 6/100. Since all nn insertions are independent:

Pr(bit=0)=(1km) ⁣n=(16100) ⁣nekn/m=e6n/100=e3n/50\Pr(\text{bit} = 0) = \left(1 - \frac{k}{m}\right)^{\!n} = \left(1 - \frac{6}{100}\right)^{\!n} \approx e^{-kn/m} = e^{-6n/100} = e^{-3n/50}

(3) Expected Number of 0 Bits

By linearity of expectation, since each of the 100 bits is independently 0 with the probability above:

E[# zero bits]=mPr(bit=0)=100e3n/50\mathbb{E}[\text{\# zero bits}] = m \cdot \Pr(\text{bit} = 0) = 100 \cdot e^{-3n/50}

(10) Estimating nn

Count the number of 0 bits in the received filter; call it zz. From part (3), z100e6n/100z \approx 100 \cdot e^{-6n/100}. Solving for nn:

e6n/100=z100    6n100=ln ⁣(z100)    n^=1006ln ⁣(100z)e^{-6n/100} = \frac{z}{100} \implies -\frac{6n}{100} = \ln\!\left(\frac{z}{100}\right) \implies \hat{n} = \frac{100}{6}\ln\!\left(\frac{100}{z}\right)

So count the 0 bits in the filter to get zz, then report n^=1006ln ⁣(100z)\hat{n} = \frac{100}{6}\ln\!\left(\frac{100}{z}\right).

(If all bits are 1 — z=0z = 0 — the formula breaks down; you can only say nn is very large relative to m/km/k.)

Suppose a daily ride on the MTA costs $6, and an annual card costs $834. For the purposes of this question assume there are no weekly/monthly cards. You commute every day for work, but due to the pandemic your company may switch to remote work. Your boss knows the date when everyone will switch to remote work, but has not informed you, and will only do so on the morning of the switch. What would you do in such a situation? Is there a number α\alpha such that your strategy can guarantee not to spend more than α\alpha times the amount spent by your boss on the train rides? Assuming you start from January 1 this year, when is the “special date” in your strategy when you will make a decision?

Solution

Reduction to Ski Rental

This is the ski rental problem. The daily MTA fare ($6) plays the role of daily rental cost, and the annual card ($834) plays the role of the one-time purchase. The break-even day is:

B=$834$6=139 daysB = \frac{\$834}{\$6} = 139 \text{ days}

After renting for 139 days you’ve spent exactly the cost of the annual card.

Strategy

Buy daily tickets for the first 139 days. On day 140, if you are still commuting, buy the annual card.

Competitive Analysis (α=2\alpha = 2)

Let dd be the total number of days you end up commuting (unknown in advance). Your boss, knowing dd, would pay OPT=min(6d, 834)\text{OPT} = \min(6d,\ 834).

  • If d<139d < 139: You only rent. Cost =6d=OPT= 6d = \text{OPT}. Ratio =1= 1.
  • If d139d \ge 139: You rent 139 days then buy. Cost is 139×6+834=834+834=1668139 \times 6 + 834 = 834 + 834 = 1668 dollars. Your boss buys immediately for $834. Ratio =1668/834=2= 1668/834 = 2.

The worst-case ratio is α=2\alpha = 2, so this strategy is 2-competitive.

The Special Date

Counting 139 days from January 1 (2026 is not a leap year):

31Jan+28Feb+31Mar+30Apr=120 days through April 30\underbrace{31}_{\text{Jan}} + \underbrace{28}_{\text{Feb}} + \underbrace{31}_{\text{Mar}} + \underbrace{30}_{\text{Apr}} = 120 \text{ days through April 30}

Day 139 =120+19== 120 + 19 = May 19. Day 140 is May 20 — that is the special date when, if you are still commuting, you buy the annual card.

Problem 6: Third Frequency Moment (20 pts)

Section titled “Problem 6: Third Frequency Moment (20 pts)”

Given a stream, explain what the third frequency moment is. Given an algorithm (without analysis) to output an estimate of the third moment on the stream. Run your algorithm three times independently on the stream S={1,1,3,2,5,7,8,5,5,7,2,6,2,3,1,1,8}S = \{1, 1, 3, 2, 5, 7, 8, 5, 5, 7, 2, 6, 2, 3, 1, 1, 8\} and show the output each time. Compute the average of the three outputs, and compare with the true answer.

Solution

Definition

Given stream S={x1,,xm}S = \{x_1, \ldots, x_m\}, let f(i)={j:xj=i}f(i) = |\{j : x_j = i\}| be the frequency of item ii. The kk-th frequency moment is Fk=if(i)kF_k = \sum_i f(i)^k. The third frequency moment is:

F3=if(i)3F_3 = \sum_i f(i)^3

AMS Sampling Algorithm (for F3F_3)

  1. Sample a position jj uniformly at random from {1,,m}\{1, \ldots, m\}.
  2. Count rr = number of occurrences of xjx_j at positions j\ge j (i.e., from jj to the end of the stream).
  3. Output X=m(r3(r1)3)X = m(r^3 - (r-1)^3).

One run gives an unbiased estimate (E[X]=F3\mathbb{E}[X] = F_3). Average several independent runs for accuracy.

Running on S={1,1,3,2,5,7,8,5,5,7,2,6,2,3,1,1,8}S = \{1, 1, 3, 2, 5, 7, 8, 5, 5, 7, 2, 6, 2, 3, 1, 1, 8\}

Stream length m=17m = 17. Item frequencies:

Item iiPositionsf(i)f(i)f(i)3f(i)^3
11, 2, 15, 16464
24, 11, 13327
33, 1428
55, 8, 9327
61211
76, 1028
87, 1728

True answer: F3=64+27+8+27+1+8+8=143F_3 = 64 + 27 + 8 + 27 + 1 + 8 + 8 = 143.

Run 1 — sample position 4 (x=2x = 2).
2 appears at positions 4, 11, 13, so r=3r = 3 occurrences from position 4 onward.

X1=17(3323)=17(278)=17×19=323X_1 = 17(3^3 - 2^3) = 17(27 - 8) = 17 \times 19 = 323

Run 2 — sample position 7 (x=8x = 8).
8 appears at positions 7, 17, so r=2r = 2 occurrences from position 7 onward.

X2=17(2313)=17(81)=17×7=119X_2 = 17(2^3 - 1^3) = 17(8 - 1) = 17 \times 7 = 119

Run 3 — sample position 15 (x=1x = 1).
1 appears at positions 15, 16 from position 15 onward, so r=2r = 2.

X3=17(2313)=17×7=119X_3 = 17(2^3 - 1^3) = 17 \times 7 = 119

Average:

Y=X1+X2+X33=323+119+1193=5613=187Y = \frac{X_1 + X_2 + X_3}{3} = \frac{323 + 119 + 119}{3} = \frac{561}{3} = 187

Comparison: The estimate (187) overshoots the true value (143) by about 31%. This is expected — AMS has high variance and only approaches the true answer with many independent repetitions (O(n2/3/ε2log(1/δ))O(n^{2/3}/\varepsilon^2 \cdot \log(1/\delta)) copies are needed for an (ε,δ)(\varepsilon, \delta) guarantee on F3F_3; in general O(kn11/k/ε2log(1/δ))O(k \cdot n^{1-1/k}/\varepsilon^2 \cdot \log(1/\delta)) for FkF_k).