Final Example | CSCI 328

No calculators allowed (if I feel that you have used a calculator, you get minus 15 points). The exam has 6 questions, adding up to 120 points. Your final score is min(your total score on the 6 problems, 100). Please do the problems in order.

Chernoff Bound: Let $\{X_i\}_{i=1}^n$ be i.i.d. Bernoulli random variables and $X = \sum_{i=1}^n X_i$ . Let $\mu = \mathbb{E}[X]$ . Then for any $0 < \delta \le 1$ :

\Pr(X \ge (1+\delta)\mu) \le e^{-\mu\delta^2/3}

Problem 1: Count-Min Sketch (15 pts)

Which real-world problem is the count-min sketch intended to solve? What are the space requirements and guarantees of the count-min sketch? What other basic data structure would you pair the count-min sketch with to solve the real-world problem stated earlier?

Solution

Real-World Problem

The Count-Min Sketch solves the frequency estimation / heavy hitters problem: given a data stream of items, efficiently answer “what is the frequency $F(y)$ of item $y$ ?” and maintain the top- $K$ most frequent items (“heavy hitters”) using sublinear space.

Space Requirements

The sketch is an $r \times c$ matrix of counters, where:

c = \left\lceil \frac{e}{\varepsilon} \right\rceil \qquad r = \left\lceil \ln\!\left(\frac{1}{\delta}\right) \right\rceil

Total space: $O\!\left(\frac{1}{\varepsilon} \log \frac{1}{\delta}\right)$ counters.

Guarantees

Given a stream of length $m$ , with probability at least $1 - \delta$ the estimated frequency $\widehat{F}(y)$ satisfies:

F(y) \le \widehat{F}(y) \le F(y) + \varepsilon m

The sketch never underestimates, and overestimates by at most an additive $\varepsilon m$ with high probability.
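The dimensions and guarantee above can be illustrated with a minimal sketch. The class name `CountMinSketch`, the salted use of Python's built-in `hash`, and the parameter values are assumptions made for illustration; a real implementation would use a pairwise-independent hash family.

```python
import math
import random


class CountMinSketch:
    """r x c counter matrix with one (salted) hash function per row."""

    def __init__(self, eps, delta, seed=0):
        self.c = math.ceil(math.e / eps)          # columns: ceil(e / eps)
        self.r = math.ceil(math.log(1 / delta))   # rows: ceil(ln(1 / delta))
        self.table = [[0] * self.c for _ in range(self.r)]
        rng = random.Random(seed)
        # Assumption: hashing (salt, item) stands in for a proper
        # pairwise-independent hash family.
        self.salts = [rng.getrandbits(64) for _ in range(self.r)]

    def _col(self, row, item):
        return hash((self.salts[row], item)) % self.c

    def add(self, item):
        # Increment one counter in every row.
        for row in range(self.r):
            self.table[row][self._col(row, item)] += 1

    def estimate(self, item):
        # Every row overcounts, so the minimum is the tightest estimate.
        return min(self.table[row][self._col(row, item)]
                   for row in range(self.r))


stream = [1, 1, 3, 2, 5, 7, 8, 5, 5, 7, 2, 6, 2, 3, 1, 1, 8]
cms = CountMinSketch(eps=0.1, delta=0.01)
for x in stream:
    cms.add(x)
```

Because counters are only ever incremented, `cms.estimate(y)` can never fall below the true count of `y`; only the upper bound is probabilistic.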

Paired Data Structure

Pair with a min-heap of size $K$ , keyed by estimated frequency, to maintain the top- $K$ heavy hitters. As each new item $x$ arrives, update the sketch and query it for $\widehat{F}(x)$ . If $x$ is already in the heap, update its key. Otherwise compare $\widehat{F}(x)$ against the minimum in the heap; if larger, evict the minimum and insert $x$ .
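One possible way to wire the heap to a frequency estimator is sketched below. To stay short and self-contained it uses an exact `collections.Counter` as a stand-in for the sketch (an assumption; in practice the estimate would come from the count-min sketch), and it handles key updates by lazy deletion, since Python's `heapq` has no decrease-key or increase-key operation. The function name `top_k_stream` is hypothetical.

```python
import heapq
from collections import Counter


def top_k_stream(stream, k):
    """Track k candidate heavy hitters with a min-heap of estimates."""
    est = Counter()   # stand-in for the sketch (assumption)
    current = {}      # item -> latest estimate, for items in the heap
    heap = []         # (estimate, item) pairs; may contain stale pairs
    for x in stream:
        est[x] += 1
        f = est[x]
        if x in current or len(current) < k:
            # Already tracked, or there is still room: record new estimate.
            current[x] = f
            heapq.heappush(heap, (f, x))
            continue
        # Drop stale pairs so heap[0] is the true current minimum.
        while heap[0][1] not in current or current[heap[0][1]] != heap[0][0]:
            heapq.heappop(heap)
        if f > heap[0][0]:
            _, evicted = heapq.heappop(heap)
            del current[evicted]
            current[x] = f
            heapq.heappush(heap, (f, x))
    return current


stream = [1, 1, 3, 2, 5, 7, 8, 5, 5, 7, 2, 6, 2, 3, 1, 1, 8]
heavy = top_k_stream(stream, k=2)
```

The returned dictionary maps each tracked item to its last recorded estimate.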

Problem 2: Set Membership Storage (18 pts)

You are given a set of $n$ records, where each record is a string (in the English alphabet, all lower case) of length 100. You are asked to store the records in order to solve the membership problem.

(2) What is the membership problem?
(5) In order to always give a correct answer, how many bits (roughly) would you use in total? How many bits per record does that amount to?
(1) Suppose you can only store 13 bits per record. What data structure would you use to solve membership?
(10) What is the guarantee of the data structure above? How often can it make an “error”? You do not need to explain how to build the data structure.

Solution

(2) The Membership Problem

Given a set $S = \{x_1, \ldots, x_n\}$ and a query element $q$ , answer: is $q \in S$ ?

(5) Bits for Exact Storage

Each record is a string of 100 lowercase letters over an alphabet of size 26. The universe of all possible records has size $|U| = 26^{100}$ . To uniquely identify any key in $U$ we need:

\left\lceil \log_2(26^{100}) \right\rceil = \lceil 100 \log_2 26 \rceil = \lceil 470.04\ldots \rceil = 471 \text{ bits per record}

For $n$ records the total is roughly $\mathbf{471n}$ bits, i.e., about 471 bits per record.
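The arithmetic can be checked in a couple of lines. The record count `n` used for the total is an arbitrary illustrative value, not part of the problem.

```python
import math

# Bits needed to name one string out of 26**100 possibilities.
bits_per_record = math.ceil(100 * math.log2(26))

n = 1_000_000                      # illustrative record count (assumption)
total_bits = bits_per_record * n
print(bits_per_record, total_bits)
```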

(1) With 13 Bits Per Record

Use a Bloom filter (total $m = 13n$ bits).

(10) Bloom Filter Guarantee

A Bloom filter provides:

No false negatives: if $q \in S$ , the filter always answers “yes.”
Bounded false positive rate $\varepsilon$ : if $q \notin S$ , the filter incorrectly answers “yes” with probability at most $\varepsilon$ .

For $m = 13n$ bits with the optimal number of hash functions $k = (\ln 2) \cdot m/n \approx 0.693 \times 13 \approx 9$ , we can find $\varepsilon$ from the space formula $m = 1.44n \log_2(1/\varepsilon)$ :

13 = 1.44 \log_2\!\left(\frac{1}{\varepsilon}\right) \implies \log_2\!\left(\frac{1}{\varepsilon}\right) \approx 9 \implies \varepsilon \approx \frac{1}{512} \approx 0.2\%

So the filter errs (false positive) on roughly 1 in 512 non-member queries.
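As a cross-check, the standard approximation $(1 - e^{-kn/m})^k$ for the false positive rate can be evaluated directly. The function name below is hypothetical, and the formula is the usual textbook approximation rather than an exact value.

```python
import math


def bloom_false_positive_rate(bits_per_key, k):
    """Approximate false positive rate (1 - e^{-k/b})^k for b bits per key."""
    return (1 - math.exp(-k / bits_per_key)) ** k


bits_per_key = 13
k = round(math.log(2) * bits_per_key)   # optimal number of hash functions
fp = bloom_false_positive_rate(bits_per_key, k)
print(k, fp)
```

The result lands close to $1/512$ , in line with the estimate from the space formula.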

Problem 3: Approximate Percentile Streaming (30 pts)

In class we saw the approximate median algorithm. Now we consider the problem where we are given a stream $S = \{x_1, \cdots, x_m\}$ of $m$ distinct elements, and instead of the median (which is the $m/2$ th smallest element) we want to output the 60th percentile (or the key with rank $3m/5$ ). Since this is hard to do exactly, we are happy with an element $x_i$ such that $m/2 \le \operatorname{rank}(x_i) \le 4m/5$ . Give a streaming algorithm for this problem such that:

the probability of the returned number being too small is at most $e^{-2}$
the probability of the returned number being too large is at most $e^{-20}$

What is the space usage of your algorithm?

Solution

Algorithm

Draw $t = 300$ elements uniformly at random from the stream (via reservoir sampling).
Return the $\lceil 3t/5 \rceil = 180$ th smallest element among the samples.
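A minimal one-pass sketch of these two steps follows. It assumes the $t$ samples are drawn independently (with replacement) by running $t$ single-slot reservoirs side by side; the function name `approx_percentile` and the `seed` parameter are hypothetical.

```python
import math
import random


def approx_percentile(stream, t=300, seed=None):
    """Return the ceil(3t/5)-th smallest of t independent uniform samples."""
    rng = random.Random(seed)
    samples = [None] * t
    for i, x in enumerate(stream, start=1):
        for j in range(t):
            # Slot j replaces its sample with the i-th element with
            # probability 1/i, so each slot ends up uniform over the stream.
            if rng.random() < 1.0 / i:
                samples[j] = x
    samples.sort()
    return samples[math.ceil(3 * t / 5) - 1]


data = list(range(1, 2001))
random.Random(1).shuffle(data)
answer = approx_percentile(data, seed=7)
```

In this example the values are $1, \ldots, 2000$ , so the rank of an element equals its value and an acceptable answer lies between 1000 and 1600.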

Analysis

Partition the stream by rank into three regions:

$S_L$ : the $m/2$ elements with rank $< m/2$ (too small)
Good region: elements with rank in $[m/2,\ 4m/5]$ (size $3m/10$ )
$S_R$ : the $m/5$ elements with rank $> 4m/5$ (too large)

Let $x^*$ denote the 180th smallest sample. We bound each failure event.

Event 1 — Too Small ( $\operatorname{rank}(x^*) < m/2$ )

$x^*$ falls in $S_L$ iff at least $3t/5$ samples land in $S_L$ : the elements of $S_L$ are the smallest in the stream, so the 180th smallest sample is one of them exactly when at least $3t/5 = 180$ of the $t$ samples are in $S_L$ .

Let $T_L = \#\text{ samples in }S_L$ . Each sample lands in $S_L$ with probability $1/2$ , so $\mu_L = \mathbb{E}[T_L] = t/2$ .

Set $3t/5 = (1 + \gamma_L)(t/2)$ and solve:

\gamma_L = \frac{3t/5}{t/2} - 1 = \frac{6}{5} - 1 = \frac{1}{5}

Applying Chernoff (with $0 < \gamma_L = 1/5 \le 1$ , $\mu_L = t/2$ ):

\Pr\!\left(T_L \ge \frac{3t}{5}\right) \le e^{-\gamma_L^2 \mu_L / 3} = \exp\!\left(-\frac{1}{25} \cdot \frac{t}{2} \cdot \frac{1}{3}\right) = e^{-t/150}

For this to be $\le e^{-2}$ : need $t/150 \ge 2$ , so $t \ge 300$ .

Event 2 — Too Large ( $\operatorname{rank}(x^*) > 4m/5$ )

$x^*$ falls in $S_R$ iff more than $2t/5$ samples land in $S_R$ : the elements of $S_R$ are the largest in the stream, so if more than $2t/5$ samples lie in $S_R$ then fewer than $3t/5$ samples have rank $\le 4m/5$ , and the 180th smallest sample must have rank $> 4m/5$ . Conversely, if at most $2t/5$ samples lie in $S_R$ , then at least $3t/5$ samples have rank $\le 4m/5$ and $x^*$ is one of them.

Let $Y = \#\text{ samples in }S_R$ . Each sample lands in $S_R$ with probability $|S_R|/m = 1/5$ , so $\mu_R = \mathbb{E}[Y] = t/5$ .

If $x^* \in S_R$ then $Y > 2t/5$ , so it suffices to bound $\Pr(Y \ge 2t/5)$ . Set $2t/5 = (1 + \gamma_R)(t/5)$ and solve:

\gamma_R = \frac{2t/5}{t/5} - 1 = 2 - 1 = 1

Applying Chernoff (with $0 < \gamma_R = 1 \le 1$ , $\mu_R = t/5$ ):

\Pr\!\left(Y \ge \frac{2t}{5}\right) \le e^{-\gamma_R^2 \mu_R / 3} = \exp\!\left(-1 \cdot \frac{t}{5} \cdot \frac{1}{3}\right) = e^{-t/15}

For this to be $\le e^{-20}$ : need $t/15 \ge 20$ , so $t \ge 300$ .

Conclusion

Setting $t = 300$ satisfies both constraints simultaneously:

\Pr(\text{too small}) \le e^{-300/150} = e^{-2} \qquad \Pr(\text{too large}) \le e^{-300/15} = e^{-20}
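The two sample-size requirements can be recomputed with exact rational arithmetic. The helper `samples_needed` is a hypothetical name; it simply solves $\gamma^2 \cdot (pt) / 3 \ge \text{exponent}$ for $t$ , where $p$ is the probability that one sample lands in the bad region.

```python
import math
from fractions import Fraction


def samples_needed(p, gamma, exponent):
    """Smallest integer t with gamma^2 * (p * t) / 3 >= exponent."""
    return math.ceil(Fraction(3 * exponent) / (Fraction(gamma) ** 2 * Fraction(p)))


t_small = samples_needed(Fraction(1, 2), Fraction(1, 5), 2)    # too small
t_large = samples_needed(Fraction(1, 5), Fraction(1), 20)      # too large
t = max(t_small, t_large)
print(t_small, t_large, t)
```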

Space Usage

The algorithm stores a reservoir of $t = 300$ samples. Since $t$ is a constant (determined only by the error parameters), the total space is $O(1)$ words, or $O(\log n)$ bits where $n$ is the element universe size.

Problem 4: Bloom Filter Analysis (18 pts)

Your friend has built a Bloom Filter on $n$ keys using 6 hash functions, and sent it to you. Their Bloom Filter uses 100 bits.

(5) What is the probability that a given bit in this Bloom Filter is 0?
(3) What is the expected number of 0 bits in the Bloom Filter?
(10) Suppose your friend forgot to tell you what $n$ was. How would you estimate $n$ ? [Hint: Is there some empirical measurement you can make and use your formula above?]

Solution

(5) Probability a Given Bit Is 0

Each of the $k = 6$ hash functions maps a key to a position chosen uniformly at random from the $m = 100$ bits. The probability that a specific bit is not touched by one hash evaluation is $1 - 1/m = 1 - 1/100$ . The $n$ insertions perform $kn = 6n$ independent hash evaluations in total, so:

\Pr(\text{bit} = 0) = \left(1 - \frac{1}{m}\right)^{\!kn} = \left(1 - \frac{1}{100}\right)^{\!6n} \approx e^{-kn/m} = e^{-6n/100} = e^{-3n/50}

(3) Expected Number of 0 Bits

By linearity of expectation (which holds even though the bits are not independent of one another), since each of the 100 bits is 0 with the probability above:

\mathbb{E}[\text{\# zero bits}] = m \cdot \Pr(\text{bit} = 0) = 100 \cdot e^{-3n/50}
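To see how tight the exponential approximation is, the short sketch below compares it with the product form, assuming each of the $kn$ hash evaluations picks a position independently and uniformly, so that a bit survives with probability $(1 - 1/m)^{kn}$ . The function names are hypothetical.

```python
import math


def prob_bit_zero(n, m=100, k=6):
    """Probability a bit is untouched after k*n uniform hash evaluations."""
    return (1 - 1 / m) ** (k * n)


def expected_zero_bits(n, m=100, k=6):
    """Approximate expected number of 0 bits, m * e^{-kn/m}."""
    return m * math.exp(-k * n / m)


for n in (5, 10, 20):
    print(n, prob_bit_zero(n), math.exp(-6 * n / 100), expected_zero_bits(n))
```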

(10) Estimating $n$

Count the number of 0 bits in the received filter; call it $z$ . From part (3), $z \approx 100 \cdot e^{-6n/100}$ . Solving for $n$ :

e^{-6n/100} = \frac{z}{100} \implies -\frac{6n}{100} = \ln\!\left(\frac{z}{100}\right) \implies \hat{n} = \frac{100}{6}\ln\!\left(\frac{100}{z}\right)

So count the 0 bits in the filter to get $z$ , then report $\hat{n} = \frac{100}{6}\ln\!\left(\frac{100}{z}\right)$ .

(If all bits are 1 — $z = 0$ — the formula breaks down; you can only say $n$ is very large relative to $m/k$ .)
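The estimator can be tried on a simulated filter. This is only a sketch: `simulate_zero_bits` models each hash evaluation as an independent uniform position (an assumption), and both function names are hypothetical.

```python
import math
import random


def estimate_n(z, m=100, k=6):
    """Estimate the number of inserted keys from the count z of 0 bits."""
    if z == 0:
        raise ValueError("all bits are set; n cannot be estimated")
    return (m / k) * math.log(m / z)


def simulate_zero_bits(n, m=100, k=6, seed=0):
    """Insert n keys with k uniform hash positions each; count the 0 bits."""
    rng = random.Random(seed)
    bits = [0] * m
    for _ in range(k * n):
        bits[rng.randrange(m)] = 1
    return bits.count(0)


z = simulate_zero_bits(n=10)
n_hat = estimate_n(z)
print(z, n_hat)
```

With only 100 bits the estimate is noisy, so it should be read as a rough figure rather than an exact count.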

Problem 5: MTA Commute Strategy (19 pts)

Suppose a daily ride on the MTA costs $6, and an annual card costs $834. For the purposes of this question assume there are no weekly/monthly cards. You commute every day for work, but due to the pandemic your company may switch to remote work. Your boss knows the date when everyone will switch to remote work, but has not informed you, and will only do so on the morning of the switch. What would you do in such a situation? Is there a number $\alpha$ such that your strategy can guarantee not to spend more than $\alpha$ times the amount spent by your boss on the train rides? Assuming you start from January 1 this year, when is the “special date” in your strategy when you will make a decision?

Solution

Reduction to Ski Rental

This is the ski rental problem. The daily MTA fare ($6) plays the role of daily rental cost, and the annual card ($834) plays the role of the one-time purchase. The break-even day is:

B = \frac{\$834}{\$6} = 139 \text{ days}

After renting for 139 days you’ve spent exactly the cost of the annual card.

Strategy

Buy daily tickets for the first 139 days. On day 140, if you are still commuting, buy the annual card.

Competitive Analysis ( $\alpha = 2$ )

Let $d$ be the total number of days you end up commuting (unknown in advance). Your boss, knowing $d$ , would pay $\text{OPT} = \min(6d,\ 834)$ .

If $d \le 139$ : You only rent. Cost $= 6d = \text{OPT}$ (since $6d \le 834$ ). Ratio $= 1$ .
If $d \ge 140$ : You rent 139 days then buy. Cost is $139 \times 6 + 834 = 834 + 834 = 1668$ dollars. Your boss buys immediately for 834 dollars. Ratio $= 1668/834 = 2$ .

The worst-case ratio is $\alpha = 2$ , so this strategy is 2-competitive.
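The worst-case ratio can be confirmed by brute force over every possible number of commuting days in a year. The function names are hypothetical; the strategy encoded is the one stated above (daily tickets through day 139, annual card on day 140).

```python
from fractions import Fraction

DAILY, ANNUAL = 6, 834
BREAK_EVEN = ANNUAL // DAILY   # 139 days


def strategy_cost(d):
    """Cost of renting through day 139 and buying on day 140."""
    if d <= BREAK_EVEN:
        return DAILY * d
    return DAILY * BREAK_EVEN + ANNUAL


def opt_cost(d):
    """Cost paid by someone who knows d in advance."""
    return min(DAILY * d, ANNUAL)


worst = max(Fraction(strategy_cost(d), opt_cost(d)) for d in range(1, 366))
print(worst)
```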

The Special Date

Counting 139 days from January 1 (2026 is not a leap year):

\underbrace{31}_{\text{Jan}} + \underbrace{28}_{\text{Feb}} + \underbrace{31}_{\text{Mar}} + \underbrace{30}_{\text{Apr}} = 120 \text{ days through April 30}

Day 139 $= 120 + 19 =$ May 19. Day 140 is May 20 — that is the special date when, if you are still commuting, you buy the annual card.
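The calendar arithmetic can be double-checked with the standard library, taking January 1, 2026 as day 1.

```python
from datetime import date, timedelta

start = date(2026, 1, 1)                          # day 1
last_rental_day = start + timedelta(days=139 - 1)
special_date = start + timedelta(days=140 - 1)
print(last_rental_day, special_date)
```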

Problem 6: Third Frequency Moment (20 pts)

Given a stream, explain what the third frequency moment is. Give an algorithm (without analysis) to output an estimate of the third moment on the stream. Run your algorithm three times independently on the stream $S = \{1, 1, 3, 2, 5, 7, 8, 5, 5, 7, 2, 6, 2, 3, 1, 1, 8\}$ and show the output each time. Compute the average of the three outputs, and compare with the true answer.

Solution

Definition

Given stream $S = \{x_1, \ldots, x_m\}$ , let $f(i) = |\{j : x_j = i\}|$ be the frequency of item $i$ . The $k$ -th frequency moment is $F_k = \sum_i f(i)^k$ . The third frequency moment is:

F_3 = \sum_i f(i)^3

AMS Sampling Algorithm (for $F_3$ )

Sample a position $j$ uniformly at random from $\{1, \ldots, m\}$ .
Count $r$ = number of occurrences of $x_j$ at positions $\ge j$ (i.e., from $j$ to the end of the stream).
Output $X = m(r^3 - (r-1)^3)$ .

One run gives an unbiased estimate ( $\mathbb{E}[X] = F_3$ ). Average several independent runs for accuracy.
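The three steps can be carried out in a single pass without knowing $m$ in advance, by combining them with a size-one reservoir: whenever the sampled position is replaced, the counter $r$ restarts. The sketch below is one way to do this; the function name `ams_f3_one_pass` is hypothetical.

```python
import random


def ams_f3_one_pass(stream, rng):
    """One AMS estimate of F_3 using a single pass over the stream."""
    m = 0
    sample = None
    r = 0
    for x in stream:
        m += 1
        if rng.random() < 1.0 / m:
            # Position m becomes the sampled position with probability 1/m.
            sample = x
            r = 1
        elif x == sample:
            # Another occurrence of the sampled item after its position.
            r += 1
    return m * (r**3 - (r - 1)**3)


stream = [1, 1, 3, 2, 5, 7, 8, 5, 5, 7, 2, 6, 2, 3, 1, 1, 8]
rng = random.Random(0)
estimates = [ams_f3_one_pass(stream, rng) for _ in range(20000)]
mean_estimate = sum(estimates) / len(estimates)
```

Averaging many independent runs should bring `mean_estimate` close to the true $F_3$ of the stream, while any single run can be far off.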

Running on $S = \{1, 1, 3, 2, 5, 7, 8, 5, 5, 7, 2, 6, 2, 3, 1, 1, 8\}$

Stream length $m = 17$ . Item frequencies:

Item $i$	Positions	$f(i)$	$f(i)^3$
1	1, 2, 15, 16	4	64
2	4, 11, 13	3	27
3	3, 14	2	8
5	5, 8, 9	3	27
6	12	1	1
7	6, 10	2	8
8	7, 17	2	8

True answer: $F_3 = 64 + 27 + 8 + 27 + 1 + 8 + 8 = 143$ .

Run 1 — sample position 4 ( $x = 2$ ).
2 appears at positions 4, 11, 13, so $r = 3$ occurrences from position 4 onward.

X_1 = 17(3^3 - 2^3) = 17(27 - 8) = 17 \times 19 = 323

Run 2 — sample position 7 ( $x = 8$ ).
8 appears at positions 7, 17, so $r = 2$ occurrences from position 7 onward.

X_2 = 17(2^3 - 1^3) = 17(8 - 1) = 17 \times 7 = 119

Run 3 — sample position 15 ( $x = 1$ ).
1 appears at positions 15, 16 from position 15 onward, so $r = 2$ .

X_3 = 17(2^3 - 1^3) = 17 \times 7 = 119

Average:

Y = \frac{X_1 + X_2 + X_3}{3} = \frac{323 + 119 + 119}{3} = \frac{561}{3} = 187
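The three runs and the true value can be replayed deterministically by fixing the sampled positions. The helper `ams_output` is a hypothetical name; it evaluates $X = m(r^3 - (r-1)^3)$ for a given 1-indexed position.

```python
from collections import Counter

S = [1, 1, 3, 2, 5, 7, 8, 5, 5, 7, 2, 6, 2, 3, 1, 1, 8]


def ams_output(stream, j):
    """AMS output X when the sampled position is j (1-indexed)."""
    m = len(stream)
    r = stream[j - 1:].count(stream[j - 1])   # occurrences from j onward
    return m * (r**3 - (r - 1)**3)


true_f3 = sum(f**3 for f in Counter(S).values())
runs = [ams_output(S, j) for j in (4, 7, 15)]
average = sum(runs) / len(runs)

# Averaging X over every possible position recovers F_3 exactly,
# which is the unbiasedness property in action.
mean_over_positions = sum(ams_output(S, j) for j in range(1, len(S) + 1)) / len(S)
print(true_f3, runs, average, mean_over_positions)
```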

Comparison: The estimate (187) overshoots the true value (143) by about 31%. This is expected: a single AMS estimate has high variance, and the average only approaches the true answer with many independent repetitions ( $O(n^{2/3}/\varepsilon^2 \cdot \log(1/\delta))$ copies are needed for an $(\varepsilon, \delta)$ guarantee on $F_3$ , where $n$ is the number of possible distinct items; in general $O(k \cdot n^{1-1/k}/\varepsilon^2 \cdot \log(1/\delta))$ for $F_k$ ).
