Final Prep: Problem Sets | CSCI 328

Problem 1: Morris Counter - Streaming Algorithms

(a) What is an $(\varepsilon, \delta)$ -guarantee for a randomized estimator $\hat{n}$ of some quantity $n$ ? What do $\varepsilon$ and $\delta$ each control?

Solution

An $(\varepsilon, \delta)$ -guarantee is a way of formally expressing that an algorithm gives a good approximate answer with high probability. It has two components:

$\varepsilon$ — the accuracy parameter. This controls how close the estimate is to the true answer. Specifically, it says the estimate is within a multiplicative factor of $(1 \pm \varepsilon)$ of the true value:

\hat{n} \in [(1 - \varepsilon)\,n,\ (1 + \varepsilon)\,n]

So a smaller $\varepsilon$ means a tighter, more accurate estimate.

$\delta$ — the failure probability. This controls how often the algorithm is allowed to be wrong. The guarantee holds with probability at least $1 - \delta$ :

\Pr\!\bigl(\hat{n} \in [(1 - \varepsilon)\,n,\ (1 + \varepsilon)\,n]\bigr) \ge 1 - \delta

So a smaller $\delta$ means the algorithm fails less often.

Putting it together. The full $(\varepsilon, \delta)$ -guarantee says:

With probability at least $1 - \delta$ , my estimate is within a $(1 \pm \varepsilon)$ factor of the true answer.

The tradeoff. You can always make $\varepsilon$ and $\delta$ smaller — but it costs you. Typically making them smaller requires running more independent copies of the algorithm (boosting via median-of-means), which increases space and time usage. So $\varepsilon$ and $\delta$ let you precisely tune the accuracy vs. resource tradeoff.

(b) Describe the Morris Counter algorithm and analyze its space complexity compared to a naive counting approach. What is the trade-off being made?

Solution

Morris Counter algorithm:

Initialize counter $C = 0$
On each new stream element, increment $C$ to $C+1$ with probability $p = 1/2^C$
Return estimate $\hat{n} = 2^C - 1$
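
A minimal Python sketch of the three steps above. The class name `MorrisCounter`, its method names, and the seeded `random.Random` source are illustrative assumptions, not part of the course notes.

```python
import random


class MorrisCounter:
    """Approximate counter that stores only the exponent C."""

    def __init__(self, rng=None):
        self.c = 0  # C = 0 initially
        self.rng = rng or random.Random()

    def increment(self):
        # Increase C with probability 1 / 2^C.
        if self.rng.random() < 2.0 ** -self.c:
            self.c += 1

    def estimate(self):
        # Estimate of the number of increments seen so far.
        return 2 ** self.c - 1


counter = MorrisCounter(random.Random(0))
for _ in range(1000):
    counter.increment()
print(counter.c, counter.estimate())  # a single run can be far from 1000
```

Only `self.c` would need to be stored in a real deployment; the estimate is always of the form $2^C - 1$ , which is one reason a single counter is so coarse.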

Space complexity: Only $C$ is stored. Since $2^C - 1$ estimates $n$ , the value of $C$ is about $\log_2 n$ with high probability, so storing it takes $O(\log \log n)$ bits.

Naive approach: Storing the exact count requires $\Theta(\log n)$ bits.

Trade-off: By randomizing increments, we achieve exponential space savings ( $O(\log \log n)$ vs. $O(\log n)$ ), but accept a random error in the estimate. The output is approximate, not exact.

(c) What are the mean and variance of the Morris Counter’s output? Why is the variance a problem?

Solution

Mean and variance:

From the Morris Counter algorithm:

$E[\hat{n}] = E[2^C - 1] = N$ (unbiased)
$\text{Var}(\hat{n}) = \text{Var}(2^C - 1) = \frac{N(N-1)}{2} = O(N^2)$

Why it’s a problem: The standard deviation is $\Theta(N)$ , the same order as the quantity being estimated, so the output can deviate wildly from the true count. A single estimate can easily be off by a constant fraction of $N$ , making a single Morris Counter unreliable for high-confidence estimates.
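
To see these moments concretely, the sketch below computes the exact distribution of $C$ after $n$ increments by dynamic programming and evaluates the mean and variance of $2^C - 1$ with exact fractions. The function names are made up for this illustration, and it is only practical for small $n$ .

```python
from fractions import Fraction


def morris_distribution(n):
    """Exact distribution of C after n increments, as {c: probability}."""
    dist = {0: Fraction(1)}
    for _ in range(n):
        nxt = {}
        for c, p in dist.items():
            up = Fraction(1, 2 ** c)  # probability of C -> C + 1
            nxt[c + 1] = nxt.get(c + 1, 0) + p * up
            if up < 1:
                nxt[c] = nxt.get(c, 0) + p * (1 - up)
        dist = nxt
    return dist


def mean_and_variance(n):
    """Mean and variance of the output 2^C - 1 after n increments."""
    dist = morris_distribution(n)
    mean = sum(p * (2 ** c - 1) for c, p in dist.items())
    second = sum(p * (2 ** c - 1) ** 2 for c, p in dist.items())
    return mean, second - mean ** 2


for n in (1, 2, 10, 50):
    mean, var = mean_and_variance(n)
    print(n, mean, var, Fraction(n * (n - 1), 2))
```

For each $n$ tried, the mean comes out as $n$ and the variance matches $n(n-1)/2$ , so the standard deviation really is a constant fraction of the count.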

(d) Describe three successive strategies for improving the Morris Counter’s reliability. How does each strategy reduce the failure probability, and what is the total space cost?

Solution

Three successive strategies:

Strategy 1: Morris (basic)

Run a single Morris Counter
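Success probability: no useful guarantee; with $\mathrm{Var}(\hat{n}) \le N^2/2$ , Chebyshev only gives $\Pr(|\hat{n} - N| \ge \varepsilon N) \le 1/(2\varepsilon^2)$
Failure probability: that bound exceeds $1$ for every $\varepsilon < 1/\sqrt{2}$ , so it says nothing for small $\varepsilon$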
Space: $O(\log \log N)$ bits

Strategy 2: Morris+ (weak estimate via averaging)

Run $T = \lceil 1/\varepsilon^2 \rceil$ independent Morris Counters in parallel
Return the average of the $T$ estimates
By Chebyshev’s inequality (or Exercise 4.9(b)), this gives an $(\varepsilon, 3/4)$ -estimate
Success probability: $\geq 3/4$ (failure probability $\le 1/4$ )
Space: $O(1/\varepsilon^2 \cdot \log \log N)$ bits

Strategy 3: Morris++ (strong estimate via boosting)

Run $M = \lceil \log(1/\delta) \rceil$ independent Morris+ instances in parallel
Return the median of their outputs
By the median-of-weak-estimates trick (Exercise 4.9(c)), this gives a $(\varepsilon, \delta)$ -estimate
Success probability: $\geq 1 - \delta$ (arbitrary confidence)
Space: $O(\log(1/\delta) \cdot 1/\varepsilon^2 \cdot \log \log N)$ bits

Each strategy improves failure probability at the cost of more parallel instances (and more space).
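
The averaging and median steps can be sketched as follows. This is an illustration under stated assumptions: the counters are simulated one after another on a stream of length `n` instead of in parallel, the logarithm is taken base 2, and the constants are exactly the ones quoted above. All function names are hypothetical.

```python
import math
import random
import statistics


def morris(n, rng):
    """One Morris counter run on a stream of n items; returns 2^C - 1."""
    c = 0
    for _ in range(n):
        if rng.random() < 2.0 ** -c:
            c += 1
    return 2 ** c - 1


def morris_plus(n, eps, rng):
    """Average of T = ceil(1 / eps^2) independent counters."""
    t = math.ceil(1 / eps ** 2)
    return sum(morris(n, rng) for _ in range(t)) / t


def morris_plus_plus(n, eps, delta, rng):
    """Median of M = ceil(log2(1 / delta)) independent Morris+ runs."""
    m = math.ceil(math.log2(1 / delta))
    return statistics.median(morris_plus(n, eps, rng) for _ in range(m))


rng = random.Random(328)
print(morris_plus_plus(1000, eps=0.2, delta=0.01, rng=rng))
```

Averaging shrinks the variance by a factor of $T$ ; the median then discards the occasional bad average, which is why the number of Morris+ copies only grows logarithmically in $1/\delta$ .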

(e) For an application requiring count estimation within $\varepsilon N$ of the true count with confidence $\geq 1 - \delta$ , where $\varepsilon = 0.1$ and $\delta = 0.01$ , how many parallel Morris counters are needed in each of the three strategies from part (d)?

Solution

Numerical calculation for $\varepsilon = 0.1$ , $\delta = 0.01$ :

Morris: No reliability guarantee.

Morris+:

T = \lceil 1/\varepsilon^2 \rceil = \lceil 1/(0.1)^2 \rceil = \lceil 100 \rceil = 100 \text{ counters}

Morris++:

M = \lceil \log_2(1/\delta) \rceil = \lceil \log_2(1/0.01) \rceil = \lceil \log_2 100 \rceil = \lceil 6.64\ldots \rceil = 7 \text{ instances of Morris+}

Total counters needed:

M \times T = 7 \times 100 = 700 \text{ Morris Counters}
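
The same arithmetic as a small helper, assuming the sizing from part (d) with a base-2 logarithm. The function name is hypothetical, and `Fraction` inputs are used only to sidestep floating-point round-off in $1/\varepsilon^2$ .

```python
import math
from fractions import Fraction


def morris_counter_budget(eps, delta):
    """Return (T, M, T * M) for Morris+ / Morris++ as sized in part (d)."""
    t = math.ceil(1 / eps ** 2)          # counters per Morris+ instance
    m = math.ceil(math.log2(1 / delta))  # Morris+ instances (log base 2)
    return t, m, t * m


print(morris_counter_budget(Fraction(1, 10), Fraction(1, 100)))  # (100, 7, 700)
```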

Problem 2: Count-Min Sketch

The Count-Min Sketch is given as a problem on the sample final the professor distributed. Throughout, let the stream be $S = \{x_1,\ldots,x_m\}$ with $x_j \in \{1,\ldots,n\}$ , and let $F(i)$ be the true frequency of item $i$ .

(a) What is the heavy hitters / frequency estimation problem, and why does CMS pair with a min-heap to solve it? (No need to give the CMS data structure yet — describe the problem and the high-level role of CMS.)

Solution

The heavy hitters problem is the real-world question Amazon (or any platform that tracks views, purchases, clicks, etc.) actually wants to answer: out of all the items flowing past in a stream, which are the top- $K$ most popular ones right now? You can’t just count everything exactly because the universe of items is huge and the stream never stops, so we need something cheaper.

A natural way to solve heavy hitters is to break it into two pieces:

The sketch. A small data structure that, given any item $y$ , can quickly answer “roughly how many times has $y$ shown up so far?” — i.e. an estimate $\hat{F}(y)$ of its true frequency $F(y)$ . This is the frequency estimation problem, and it’s what Count-Min Sketch is built for.

The heap. A min-heap of size $K$ that holds the current top- $K$ candidates. Whenever a new item $x$ arrives, we ask the sketch for $\hat{F}(x)$ and compare it against the smallest estimated frequency currently in the heap. If the new item’s estimate is larger, it kicks out the current minimum and takes its slot.

So CMS doesn’t solve heavy hitters by itself — it provides the cheap frequency lookups, and the min-heap turns those lookups into a rolling top- $K$ . We’re forced into this approximate setup because tracking exact frequencies would require $\Omega(n)$ space, which is exactly what sublinear-space streaming forbids.

(b) Describe the Count-Min Sketch data structure with parameters $c$ and $r$ , including how an update is processed and how a query for $F(y)$ is answered. Why is the min over rows used and not, say, the average?

Solution

Pick $r$ hash functions $h_1,\ldots,h_r$ with $h_j : \{1,\ldots,n\} \to \{1,\ldots,c\}$ , and keep an $r \times c$ counter matrix initialized to $0$ .

Update $x$ : for each row $j$ , increment cell $(j, h_j(x))$ by $1$ .
Query $y$ : return $\widehat{F}(y) = \min_{1 \le j \le r} \mathrm{CTR}[j, h_j(y)]$ .

The min is used because each counter $\mathrm{CTR}[j, h_j(y)]$ equals $F(y)$ (every occurrence of $y$ is counted) plus the contributions from other items hashing to the same cell. So every row gives an overestimate, and the smallest overestimate is the most accurate. Averaging would mix in the rows with heavier collisions and can only move the answer further above $F(y)$ .
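
A compact Python sketch of the structure described above. The class name and the salted use of Python's built-in `hash` are assumptions made for illustration; they stand in for the $r$ independent hash functions $h_1,\ldots,h_r$ and are not a vetted hash family.

```python
import random


class CountMinSketch:
    def __init__(self, r, c, seed=0):
        self.r, self.c = r, c
        self.table = [[0] * c for _ in range(r)]  # r x c counters, all 0
        rng = random.Random(seed)
        # One random salt per row stands in for r independent hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(r)]

    def _cell(self, j, x):
        return hash((self.salts[j], x)) % self.c

    def update(self, x):
        for j in range(self.r):
            self.table[j][self._cell(j, x)] += 1

    def query(self, y):
        # Every row overestimates F(y), so the minimum is the best row.
        return min(self.table[j][self._cell(j, y)] for j in range(self.r))


stream = [1, 2, 2, 3, 3, 3] * 50
cms = CountMinSketch(r=7, c=272)
for x in stream:
    cms.update(x)
print(cms.query(3))  # at least 150, the true frequency of item 3
```

Each update touches exactly one cell per row, so an update costs $O(r)$ time regardless of the stream length.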

(c) Why does the sketch only overestimate, i.e. why is $\widehat{F}(y) \ge F(y)$ for every item $y$ ?

Solution

Every occurrence of $y$ in the stream increments $\mathrm{CTR}[j, h_j(y)]$ by $1$ in every row $j$ . Therefore each row’s counter is at least $F(y)$ , so the minimum across rows is also at least $F(y)$ .

(d) State the $(\varepsilon,\delta)$ -guarantee proved in class and indicate the values of $c$ and $r$ needed.

Solution

With

c = \frac{e}{\varepsilon}, \qquad r = \ln\!\left(\frac{1}{\delta}\right),

the Count-Min Sketch satisfies

\Pr\!\bigl(\widehat{F}(y) \le F(y) + \varepsilon m\bigr) \ge 1 - \delta.

This is an additive error in terms of the stream length $m$ , not a relative error.

(e) Prove the guarantee. You may use Markov’s inequality without proof.

Solution

Fix one row $j$ and let $\mathrm{CTR} = \mathrm{CTR}[j, h_j(y)]$ . Define $\mathrm{Error} = \mathrm{CTR} - F(y)$ . The expected number of other items that collide with $y$ in this row is at most $m/c$ (each of the at most $m$ non- $y$ stream elements lands in cell $h_j(y)$ with probability $1/c$ ), so

\mathbb{E}[\mathrm{Error}] \le \frac{m}{c} = \frac{\varepsilon m}{e}.

By Markov,

\Pr\bigl(\mathrm{Error} \ge \varepsilon m\bigr) \le \frac{\mathbb{E}[\mathrm{Error}]}{\varepsilon m} \le \frac{1}{e}.

For $\widehat{F}(y) = \min_j \mathrm{CTR}[j, h_j(y)]$ to be at least $F(y) + \varepsilon m$ , every row’s error must be at least $\varepsilon m$ . Since rows use independent hash functions,

\Pr\bigl(\widehat{F}(y) \ge F(y) + \varepsilon m\bigr) \le \left(\frac{1}{e}\right)^r = e^{-\ln(1/\delta)} = \delta.

(f) Suppose $m = 10^6$ , $\varepsilon = 0.01$ and $\delta = 0.001$ . Compute the sketch size $r \times c$ (in counters).

Solution

c = \lceil e/0.01 \rceil = \lceil 271.83 \rceil = 272,

r = \lceil \ln(1/0.001) \rceil = \lceil \ln(1000) \rceil = \lceil 6.91 \rceil = 7.

Total counters $= r \cdot c = 7 \cdot 272 = 1904$ .
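
The sizing rule can be checked with a few lines of Python (the helper name is hypothetical). For these parameters the guarantee is an additive error of at most $\varepsilon m = 10^4$ with probability at least $0.999$ .

```python
import math


def cms_dimensions(eps, delta):
    """Return (r, c, r * c) using c = ceil(e / eps), r = ceil(ln(1 / delta))."""
    c = math.ceil(math.e / eps)
    r = math.ceil(math.log(1 / delta))
    return r, c, r * c


print(cms_dimensions(0.01, 0.001))  # (7, 272, 1904)
```

Note that the stream length $m$ does not enter the sketch size at all; it only scales the additive error.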

Problem 3: $(\varepsilon,\delta)$ -Approximate Median

This topic was Question 3 on the sample final.

Let $S = \{x_1,\ldots,x_N\}$ be a stream. For $\varepsilon, \delta \in (0,1)$ , we want to return an element $m$ whose rank satisfies

\left(\tfrac{1}{2} - \varepsilon\right) N \le \mathrm{rank}(m) \le \left(\tfrac{1}{2} + \varepsilon\right) N

with probability at least $1 - \delta$ .

(a) What is the $(\varepsilon, \delta)$ -approximate median problem? State the input, the output condition, and what the $\varepsilon$ and $\delta$ parameters each control here.

Solution

The $(\varepsilon, \delta)$ -approximate median problem is a relaxation of “find the exact median element of the stream”. Exact computation is too expensive in sublinear streaming space, so instead of insisting on the element with rank exactly $N/2$ , we settle for any element whose rank is close to the middle, and we allow ourselves to fail every so often.

Concretely, the input is a stream $S = \{x_1, \ldots, x_N\}$ together with two knobs $\varepsilon$ and $\delta$ in $(0, 1)$ . The job is to return some element $m$ whose rank lands inside a window around $N/2$ :

\left(\tfrac{1}{2} - \varepsilon\right) N \le \mathrm{rank}(m) \le \left(\tfrac{1}{2} + \varepsilon\right) N

and the guarantee says this must hold with probability at least $1 - \delta$ .

$\varepsilon$ — the rank window. Controls how far the returned element’s rank is allowed to be from the true median rank $N/2$ , measured as a fraction of $N$ . Smaller $\varepsilon$ means a tighter window; at $\varepsilon = 0$ you’re back to demanding the exact median.

$\delta$ — the failure probability. Controls how often the algorithm is allowed to return something outside that window. Smaller $\delta$ means the algorithm is allowed to mess up less often.

Putting it together: “with probability at least $1 - \delta$ , the returned element’s rank is within an $\varepsilon N$ -sized window of the true median rank $N/2$ .”

(b) Describe the sampling algorithm and explain in words why the sampled median is likely to be a good approximate median.

Solution

Sample $t$ elements $T = \{y_1, \ldots, y_t\}$ uniformly and independently from the stream.
Output the median $m$ of the sample.

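Call the smallest $(\tfrac{1}{2} - \varepsilon)N$ elements of the sorted stream the left bad region $S_L$ and the largest $(\tfrac{1}{2} - \varepsilon)N$ elements the right bad region $S_R$ (formalized in part (c)). If we sample uniformly, then in expectation only a $(\tfrac{1}{2} - \varepsilon)$ fraction of the samples falls in each bad region. For the sample median to land in $S_L$ , at least half of the samples must come from $S_L$ , which is well above the expected fraction; the same holds for $S_R$ . Chernoff makes this unlikely once $t$ is large enough.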
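
A minimal sketch of the sampling step, assuming for simplicity that the stream is available as a list so uniform draws with replacement can use `random.Random.choices`. A true one-pass version would need something like reservoir sampling, which is not shown. The function name is hypothetical.

```python
import random
import statistics


def approx_median(stream, t, rng):
    """Median of t elements drawn uniformly and independently from stream."""
    sample = rng.choices(stream, k=t)      # uniform draws with replacement
    return statistics.median_low(sample)   # always an actual stream element


N = 100_000
stream = list(range(1, N + 1))  # value == rank, which makes checking easy
random.Random(1).shuffle(stream)
m = approx_median(stream, t=2862, rng=random.Random(0))
print(m, 0.45 * N <= m <= 0.55 * N)
```

Because the values here are $1,\ldots,N$ , the returned value is its own rank, so the printed flag reports whether this run landed inside the $\varepsilon = 0.05$ window.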

(c) Define the bad regions $S_L$ and $S_R$ and write down the failure event of the algorithm in terms of $|T_L|$ and $|T_R|$ , where $T$ is the sample set.

Solution

Sort the stream and let

$S_L$ = the smallest $(\tfrac{1}{2} - \varepsilon)N$ elements,
$S_R$ = the largest $(\tfrac{1}{2} - \varepsilon)N$ elements.

Let $T_L = T \cap S_L$ and $T_R = T \cap S_R$ . The algorithm fails iff

|T_L| > t/2 \quad \text{or} \quad |T_R| > t/2.

(d) Using a Chernoff bound (stated below), show that $t = O\!\left(\varepsilon^{-2} \log \delta^{-1}\right)$ samples suffice.

Solution

For $T_L$ , each sample lands in $S_L$ independently with probability $p = \tfrac{1}{2} - \varepsilon$ . So $\mathbb{E}[|T_L|] = (\tfrac{1}{2} - \varepsilon) t = \mu$ . The threshold $t/2$ in multiplicative form is $(1+\gamma)\mu$ with

1 + \gamma = \frac{t/2}{(1/2 - \varepsilon)t} = \frac{1/2}{1/2 - \varepsilon} \implies \gamma = \frac{\varepsilon}{1/2 - \varepsilon}.

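The multiplicative Chernoff bound $\Pr(X \ge (1+\gamma)\mu) \le \exp(-\gamma^2 \mu / 3)$ , valid for $0 < \gamma \le 1$ (here, for $\varepsilon \le 1/4$ ), gives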

\Pr(|T_L| > t/2) \le \exp\!\left(-\frac{\gamma^2 \mu}{3}\right) = \exp\!\left(-\frac{1}{3}\cdot\frac{\varepsilon^2}{1/2-\varepsilon}\cdot t\right).

A symmetric bound holds for $T_R$ . Union bound:

\Pr(\text{fail}) \le 2\exp\!\left(-\frac{\varepsilon^2}{3(1/2 - \varepsilon)}\, t\right).

Setting this $\le \delta$ and solving yields

t \ge \frac{3(1/2 - \varepsilon)}{\varepsilon^2}\, \ln\!\frac{2}{\delta} = O\!\left(\frac{1}{\varepsilon^2} \log\!\frac{1}{\delta}\right).

(e) Suppose $\varepsilon = 0.05$ and $\delta = 0.01$ . Give a concrete value of $t$ (using the constant $3$ from the Chernoff bound).

Solution

With $\varepsilon = 0.05$ and $\delta = 0.01$ :

t \ge \frac{3(0.45)}{(0.05)^2}\, \ln\!\frac{2}{0.01} = \frac{1.35}{0.0025}\, \ln 200 \approx 540 \cdot 5.30 \approx 2862.

So roughly $t \approx 2862$ samples suffice.
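
The bound from part (d) as a one-line helper (hypothetical name), which reproduces the number above:

```python
import math


def median_sample_size(eps, delta):
    """Smallest integer t with t >= 3 (1/2 - eps) / eps^2 * ln(2 / delta)."""
    return math.ceil(3 * (0.5 - eps) / eps ** 2 * math.log(2 / delta))


print(median_sample_size(0.05, 0.01))  # 2862
```

The sample size does not depend on the stream length $N$ , only on $\varepsilon$ and $\delta$ .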

Problem 4: AMS Sampling for the Second Frequency Moment

This topic was Question 6 on the sample final. Recall $F_2 = \sum_{i=1}^{n} f(i)^2$ for a stream of length $m$ over universe $\{1,\ldots,n\}$ .

(a) What is the $k$ -th frequency moment $F_k$ of a stream? What do $F_0$ , $F_1$ , and $F_2$ correspond to in plain English, and why might a data engineer care about $F_2$ specifically?

Solution

Frequency moments are a family of summary statistics that capture different “shapes” of how items appear in a stream. If $f(i)$ is the number of times item $i$ shows up, then the $k$ -th frequency moment is just the sum of those counts raised to the $k$ -th power:

F_k = \sum_{i=1}^{n} f(i)^k.

The value of $k$ changes what aspect of the stream you’re capturing:

$F_0$ counts the number of distinct items in the stream (using the convention $0^0 = 0$ , so absent items contribute nothing).
$F_1 = m$ is just the total stream length — every item contributes its raw frequency, so you’re really just adding up all the appearances.
$F_2$ is the sum of squared frequencies. Squaring amplifies large frequencies, so $F_2$ becomes a measure of how concentrated the stream is: it’s large when a few items dominate, small when items are spread evenly.

Why care about $F_2$ in particular? It is a standard concentration measure (it is also known as the repeat rate, or Gini’s index of homogeneity): a few items with huge counts blow up the squared sum, while an even distribution keeps it small, much as inequality measures in economics react to a few very large incomes. The same idea shows up in databases, where $F_2$ gives the size, and hence the cost, of a self-join (you’re summing the number of matching pairs per key).

(b) State the AMS sampling algorithm. Describe what $X$ is output for a single sample.

Solution

Sample a position $i \in \{1,\ldots,m\}$ uniformly at random.
Let $r$ be the number of occurrences of $x_i$ from position $i$ to the end of the stream.
Output $X = m\bigl(r^2 - (r-1)^2\bigr) = m(2r - 1)$ .
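
A small sketch of one AMS sample, assuming the whole stream is held in a list for clarity (a real streaming version would pick the position on the fly and count later occurrences as they arrive). `ams_expectation` is a hypothetical helper that averages $X$ over all $m$ equally likely positions, which is exactly $\mathbb{E}[X]$ .

```python
import random


def ams_sample(stream, rng):
    """One AMS estimate X = m (2r - 1) for F_2."""
    m = len(stream)
    i = rng.randrange(m)              # uniform position (0-based)
    r = stream[i:].count(stream[i])   # occurrences of x_i from i to the end
    return m * (2 * r - 1)


def ams_expectation(stream):
    """Exact E[X]: average of X over all m positions."""
    m = len(stream)
    total = sum(m * (2 * stream[i:].count(stream[i]) - 1) for i in range(m))
    return total // m  # total is a multiple of m, so this is exact


stream = [1, 2, 1, 3, 1, 2]
print(ams_expectation(stream))  # 14 = 3^2 + 2^2 + 1^2
```

On this stream the frequencies are $3, 2, 1$ , so $F_2 = 14$ , while any single sample is one of $6$ , $18$ , or $30$ .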

(c) Prove $\mathbb{E}[X] = F_2$ .

Solution

Condition on the value of the sampled item being $a$ . Given this, the sampled occurrence is uniformly one of the $f(a)$ occurrences of $a$ . If it is the $j$ -th occurrence, then $r = f(a) - j + 1$ , so $X = m(2r - 1)$ .

\mathbb{E}[X \mid \text{value} = a] = \frac{1}{f(a)} \sum_{j=1}^{f(a)} m\bigl(2(f(a) - j + 1) - 1\bigr) = \frac{m}{f(a)} \sum_{r=1}^{f(a)} (2r - 1) = \frac{m \cdot f(a)^2}{f(a)} = m \cdot f(a).

The value is $a$ with probability $f(a)/m$ , so

\mathbb{E}[X] = \sum_{a=1}^{n} \Pr(\text{value} = a)\, \mathbb{E}[X \mid \text{value}=a] = \sum_{a=1}^{n} \frac{f(a)}{m} \cdot m f(a) = \sum_{a=1}^{n} f(a)^2 = F_2.

(d) A single AMS run is unbiased but high variance. Describe the two-stage boosting strategy used to obtain an $(\varepsilon,\delta)$ -estimate of $F_2$ .

Solution

The two-stage strategy (mirroring Morris++):

Stage 1 (averaging). Run $t$ independent AMS estimators in parallel and report their average $Y$ . This reduces variance and gives an $(\varepsilon, \tfrac{1}{4})$ -style guarantee.
Stage 2 (median). Run $O(\log 1/\delta)$ independent copies of Stage 1 and report the median, boosting the failure probability down to $\delta$ .

For part (e) below we directly use a Chernoff bound on the average and skip the median step.

(e) Using the bound $X \le 2 m f^*$ (where $f^* = \max_i f(i)$ ) and the key lemma $\frac{m f^*}{F_2} \le \sqrt{n}$ , show that

t = O\!\left(\frac{\sqrt{n}}{\varepsilon^2}\, \log\!\frac{1}{\delta}\right)

independent copies suffice when their average is reported.

You may use the non-Bernoulli Chernoff bound: if $X_1,\ldots,X_t$ are i.i.d. in $[0,C]$ and $Y = \frac{1}{t}\sum X_i$ , then for $0 < \varepsilon \le 1$ ,

\Pr\bigl(|Y - \mathbb{E}[Y]| \ge \varepsilon \mathbb{E}[Y]\bigr) \le 2\exp\!\left(-\frac{\varepsilon^2 \mathbb{E}[Y]\, t}{3 C}\right).

Solution

With $C = 2 m f^*$ and $\mathbb{E}[Y] = F_2$ , the non-Bernoulli Chernoff bound gives

\Pr\bigl(|Y - F_2| \ge \varepsilon F_2\bigr) \le 2 \exp\!\left(-\frac{\varepsilon^2 F_2\, t}{3 \cdot 2 m f^*}\right) = 2 \exp\!\left(-\frac{\varepsilon^2\, t}{6 \cdot \frac{m f^*}{F_2}}\right).

Substituting the key lemma $\frac{m f^*}{F_2} \le \sqrt{n}$ :

\Pr\bigl(|Y - F_2| \ge \varepsilon F_2\bigr) \le 2\exp\!\left(-\frac{\varepsilon^2 t}{6 \sqrt{n}}\right).

Setting the right-hand side $\le \delta$ and solving for $t$ :

t \ge \frac{6 \sqrt{n}}{\varepsilon^2} \, \ln\!\frac{2}{\delta} = O\!\left(\frac{\sqrt{n}}{\varepsilon^2} \log\!\frac{1}{\delta}\right).

Problem 5: Counting Distinct Elements - Idealized Algorithm with Max-Hash

Recall the idealized setup: a hash function $h : \{1,\ldots,n\} \to [0,1]$ produces a uniform hash for each distinct element. Let $t$ be the number of distinct elements seen so far.

(a) State the distinct elements problem ( $F_0$ ), and describe the original idealized algorithm that this problem is a variant of. What output formula does the original algorithm use, and what is the intuition behind it?

Solution

The distinct elements problem is the streaming version of “how many different items have I seen so far?” — think of it as counting unique users hitting a website, or unique products sold today, while the events fly by. Storing every distinct item in a set works but costs $\Omega(F_0 \log n)$ bits, which defeats the point of streaming. So we want a way to estimate $F_0$ using much less space.

The original min-hash algorithm does this with a surprisingly clean idea: hash every incoming element into the real interval $[0, 1]$ , but only bother to remember the smallest hash you’ve seen so far. Concretely:

Pick a perfectly random hash $h : \{1, \ldots, n\} \to [0, 1]$ once at the start.
As the stream arrives, hash each element and keep updating a running minimum $\min h(x_i)$ .
When asked, output $\widehat{t} = \dfrac{1}{\min h(x_i)} - 1$ .

The intuition is what makes this work. If you’ve seen $t$ distinct elements, their hashes are like $t$ darts thrown uniformly at random into $[0, 1]$ . The more darts you throw, the closer the smallest one gets to $0$ — and in fact the expected minimum of $t$ uniform samples is exactly $\dfrac{1}{t+1}$ . So inverting and subtracting $1$ recovers $t$ . Repeated occurrences of the same item don’t mess things up because they always hash to the same value, so only new items can pull the minimum lower.
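
The idealized algorithm can be simulated as below. A dictionary of remembered random values plays the role of the perfectly random hash $h$ ; that is an assumption for illustration only, since it uses linear space and so defeats the purpose in practice. The class name is hypothetical.

```python
import random


class MinHashDistinct:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.hashes = {}     # simulates a perfectly random h; NOT small space
        self.min_hash = 1.0  # running minimum of h(x_i)

    def _h(self, x):
        # The same item always gets the same hash value.
        if x not in self.hashes:
            self.hashes[x] = self.rng.random()
        return self.hashes[x]

    def add(self, x):
        self.min_hash = min(self.min_hash, self._h(x))

    def estimate(self):
        return 1.0 / self.min_hash - 1.0


est = MinHashDistinct(seed=1)
for x in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]:
    est.add(x)
print(est.estimate())  # noisy estimate of the 7 distinct values
```

A single run is very noisy, because the estimate depends on one random minimum.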

(b) Suppose a friend proposes tracking the maximum hash $\max h(x_i)$ instead of the minimum. Is your friend’s algorithm doomed, or can the output be changed to give a correct estimate of $t$ ?

Solution

Not doomed - the maximum of $t$ uniform random variables has a known expectation, and inverting the relationship lets us recover $t$ .

(c) Derive $\mathbb{E}[\max h(x_i)]$ when there are $t$ distinct elements. Use the identity

\mathbb{E}[X] = \int_0^1 \Pr(X > x)\, dx

for $X \in [0,1]$ .

Solution

Let $X' = \max h(x_i)$ . The event $X' > x$ means at least one of the $t$ hash values exceeds $x$ , but it’s easier to compute the complement: $X' \le x$ means all hash values are $\le x$ . By independence,

\Pr(X' \le x) = x^t \implies \Pr(X' > x) = 1 - x^t.

Using $\mathbb{E}[X'] = \int_0^1 \Pr(X' > x)\, dx$ :

\mathbb{E}[X'] = \int_0^1 (1 - x^t)\, dx = \left[ x - \frac{x^{t+1}}{t+1} \right]_0^1 = 1 - \frac{1}{t+1} = \frac{t}{t+1}.

(d) Based on (c), what should the output of the modified algorithm be, in terms of $X' = \max h(x_i)$ ?

Solution

If $\mathbb{E}[X'] = \tfrac{t}{t+1}$ , then

1 - \mathbb{E}[X'] = \frac{1}{t+1} \implies \frac{1}{1 - \mathbb{E}[X']} - 1 = t.

So the modified algorithm should output

\widehat{t} = \frac{1}{1 - X'} - 1.

(e) As a sanity check, plug in $t = 2$ and verify the output evaluates to $2$ in expectation.

Solution

For $t = 2$ , $\mathbb{E}[X'] = \tfrac{2}{3}$ . Plugging into the output:

\frac{1}{1 - 2/3} - 1 = \frac{1}{1/3} - 1 = 3 - 1 = 2.\ \checkmark
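
The expectation from part (c) can also be checked by simulation. The sketch below (hypothetical function name, seeded generator) averages the maximum of $t$ uniform draws over many trials and plugs that average into the output formula from part (d).

```python
import random


def mean_max_hash(t, trials, rng):
    """Monte Carlo average of the maximum of t uniform [0, 1) values."""
    total = sum(max(rng.random() for _ in range(t)) for _ in range(trials))
    return total / trials


rng = random.Random(0)
for t in (1, 2, 9):
    avg = mean_max_hash(t, 20000, rng)
    # avg should be close to t / (t + 1); the last column close to t.
    print(t, round(avg, 3), round(1 / (1 - avg) - 1, 2))
```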

Problem 6: Flajolet-Martin Factor- $32$ Guarantee

In lecture, we proved that the basic Flajolet-Martin algorithm (track the largest least-significant-bit position $X_m$ in the hashes seen so far, output $2^{X_m + 1}$ ) gives a $32$ -approximation with probability at least $\tfrac{2}{3}$ .

(a) What problem does Flajolet-Martin solve, and why is it preferred over the idealized min-hash algorithm? Briefly restate the basic algorithm.

Solution

Flajolet-Martin solves the same problem as min-hash: counting the number of distinct elements ( $F_0$ ) in a stream, in sublinear space. The reason FM is preferred is purely practical — the min-hash algorithm is idealized because it requires hashing into the continuous interval $[0, 1]$ , which means storing real numbers with full precision. That’d take infinitely many bits, which obviously breaks any space guarantee. FM gets around this by working with integer hashes and looking at their bit patterns instead.

The trick FM uses is to track trailing zeros in the binary representations of the hashes. The algorithm is:

Pick a hash $h : \{1, \ldots, n\} \to \{0, 1, \ldots, n - 1\}$ .
For each stream element $x_i$ , look at $h(x_i)$ in binary and find its least significant bit (LSB) position — the position of the rightmost $1$ , which is the same as the number of trailing zeros.
Keep a running maximum $X_m$ of all the LSB positions seen so far.
When asked, output $\widehat{T} = 2^{X_m + 1}$ .

The intuition (developed more in part (b)) is that the more distinct elements you’ve seen, the more likely it is that some of them happened to hash to a value with a really long trailing-zeros pattern — so the running maximum LSB grows roughly like $\log T$ .
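
A rough Python sketch of the four steps above. As an assumption, a truncated SHA-256 digest stands in for the hash $h$ (the notes hash into $\{0, \ldots, n-1\}$ instead), and hashes equal to $0$ are skipped because they have no $1$ bit. Function names are hypothetical.

```python
import hashlib


def lsb_position(v):
    """Index of the rightmost 1 bit of v > 0 (number of trailing zeros)."""
    return (v & -v).bit_length() - 1


def flajolet_martin(stream, bits=32):
    x_m = -1  # running maximum LSB position; -1 means nothing seen yet
    for x in stream:
        digest = hashlib.sha256(str(x).encode()).digest()
        v = int.from_bytes(digest[:bits // 8], "big")
        if v:  # an all-zero hash has no 1 bit; skipped in this sketch
            x_m = max(x_m, lsb_position(v))
    return 2 ** (x_m + 1)


print(flajolet_martin(range(1000)))          # a power of two
print(flajolet_martin(list(range(1000)) * 3))  # same value: repeats are free
```

Only the single integer `x_m` is kept between stream elements, which is where the space saving comes from.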

(b) Why is the output set to $2^{X_m + 1}$ ? Give an informal one-paragraph justification using the geometric structure of trailing zeros.

Solution

In a uniformly random bit string, the chance of seeing trailing zeros at position $j$ (so the least significant $1$ is at position $j$ ) is $1/2^{j+1}$ . Across $T$ distinct hashes, roughly $T / 2^{j+1}$ end at position $j$ . The largest position $j$ likely to be observed is the one where $T / 2^{j+1} \approx 1$ , i.e. $j \approx \log T - 1$ , so $2^{j+1} \approx T$ . That motivates outputting $2^{X_m + 1}$ as a rough estimate for $T$ .

(c) Identify the two bad events (overestimate and underestimate) in terms of $J_+ = \lfloor \log T \rfloor + 5$ and $J_- = \lfloor \log T \rfloor - 5$ .

Solution

Overestimate: some element’s LSB position exceeds $J_+$ , i.e. $Z_{> J_+} \ge 1$ , where $Z_{>j}$ counts distinct elements with LSB position $> j$ . Then $X_m + 1 > \log T + 5$ , so $2^{X_m + 1} > 32 T$ .
Underestimate: no element has LSB position $> J_-$ , i.e. $Z_{> J_-} = 0$ . Then $X_m + 1 < \log T - 5$ , so $2^{X_m + 1} < T / 32$ .

(d) Use Markov’s inequality on $Z_{>J_+}$ to show $\Pr(\text{overestimate}) \le \tfrac{1}{32}$ .

Solution

Using $2^{\lfloor \log T \rfloor} \ge T/2$ , we get $2^{J_+ + 1} = 2^{\lfloor \log T \rfloor + 6} \ge 2^6 \cdot T/2 = 32 T$ , so

\mathbb{E}[Z_{>J_+}] < \frac{T}{2^{J_+ + 1}} \le \frac{T}{32 T} = \frac{1}{32}.

By Markov,

\Pr(Z_{>J_+} \ge 1) \le \mathbb{E}[Z_{>J_+}] \le \frac{1}{32}.

(e) Use Chebyshev’s inequality on $Z_{>J_-}$ to show $\Pr(\text{underestimate}) \le \tfrac{1}{16}$ .

Solution

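Write $Z_{>J_-} = \sum_i Y_i$ , where $Y_i$ indicates that the $i$ -th distinct element has LSB position $> J_-$ . Each $Y_i$ is Bernoulli with success probability $1/2^{J_- + 1}$ , so $\mathrm{Var}(Y_i) \le \mathbb{E}[Y_i]$ , and summing over independent elements gives $\mathrm{Var}(Z_{>J_-}) \le \mathbb{E}[Z_{>J_-}]$ . Also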

\mathbb{E}[Z_{>J_-}] = \frac{T}{2^{J_- + 1}} = \frac{T \cdot 2^{4}}{2^{\lfloor \log T \rfloor}} \ge 16,

since $2^{\lfloor \log T \rfloor} \le T$ . By Chebyshev,

\Pr(Z_{>J_-} = 0) \le \Pr\bigl(|Z_{>J_-} - \mathbb{E}[Z_{>J_-}]| \ge \mathbb{E}[Z_{>J_-}]\bigr) \le \frac{\mathrm{Var}(Z_{>J_-})}{(\mathbb{E}[Z_{>J_-}])^2} \le \frac{1}{\mathbb{E}[Z_{>J_-}]} \le \frac{1}{16}.

(f) Combine the bounds and conclude the $32$ -approximation holds with probability $\ge \tfrac{2}{3}$ .

Solution

By the union bound,

\Pr(\text{failure}) \le \Pr(\text{overestimate}) + \Pr(\text{underestimate}) \le \frac{1}{32} + \frac{1}{16} = \frac{3}{32} < \frac{1}{3}.

Hence the algorithm succeeds (output is within a factor of $32$ of $T$ ) with probability at least $1 - 3/32 \ge 2/3$ .

Problem 7: Online Algorithms - Competitive Ratio

(a) What is an online algorithm, and how does it differ from an offline algorithm? Define what it means for an online algorithm to be $c$ -competitive against an offline optimum $\mathrm{OPT}$ . In particular, over what is the ratio taken?

Solution

An online algorithm is one that has to react to its input as it arrives, piece by piece, without seeing what’s coming next. Once it commits to a decision, it usually can’t go back and change it. This is what makes problems like “should I rent skis today or buy them?” hard — you don’t know whether you’ll be skiing for one more day or another fifty.

An offline algorithm, on the other hand, gets to see the entire input up front before making any decisions. It’s free to plan optimally with full knowledge of the future. This isn’t realistic for most real-world problems, but it serves as a useful benchmark: the offline optimum (call it $\mathrm{OPT}$ ) is the best possible cost we could ever hope to achieve, so it tells us how much we’re losing by being forced to act in the dark.

The competitive ratio measures exactly that loss. We say an online algorithm $A$ is $c$ -competitive if its cost is never more than $c$ times $\mathrm{OPT}$ ’s cost, no matter what the input looks like:

\mathrm{cost}_A(\sigma) \le c \cdot \mathrm{cost}_{\mathrm{OPT}}(\sigma) \quad \text{for every input sequence } \sigma.

So the ratio is taken over all possible inputs, not averaged or expected — we’re guaranteeing that even on the worst possible input, $A$ is at most $c$ times worse than the offline optimum.

(b) Ski rental. Renting skis costs $1$ /day; buying skis costs $B$ . The skier does not know in advance how many days $d$ they will ski. Consider the online algorithm: rent for the first $B$ days, then buy. Prove this algorithm is $2$ -competitive.

Solution

Let $d$ be the (unknown) number of skiing days. The offline optimum is $\mathrm{OPT} = \min(d, B)$ .

Case 1: $d \le B$ . The online algorithm rents every day and never buys, paying $d$ . $\mathrm{OPT} = d$ . Ratio $= 1$ .
Case 2: $d > B$ . The online algorithm rents $B$ days, then buys on day $B+1$ , total $B + B = 2B$ . $\mathrm{OPT} = B$ . Ratio $= 2$ .

So in the worst case (Case 2), $\frac{\text{ALG}(d)}{\mathrm{OPT}(d)} = 2$ , and the algorithm is exactly $2$ -competitive.
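
The case analysis can be checked numerically. This sketch reads “rent for the first $B$ days, then buy” as buying on day $B+1$ if the skier is still skiing; the function names are hypothetical.

```python
def ski_rental_cost(d, b):
    """Rent for the first b days; buy on day b + 1 if still skiing."""
    return d if d <= b else b + b


def offline_opt(d, b):
    """Offline optimum: rent throughout or buy immediately."""
    return min(d, b)


b = 10
worst = max(ski_rental_cost(d, b) / offline_opt(d, b) for d in range(1, 200))
print(worst)  # 2.0
```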

(c) Pizza finding. You are on a number line at position $0$ , looking for a pizza shop at unknown position $\pm d$ (sign unknown). Consider the online algorithm: walk $1$ right, return; walk $2$ left, return; walk $4$ right, return; … doubling each step and alternating direction. State (without re-proving) the competitive ratio of this algorithm.

Solution

The doubling/zig-zag strategy gives a competitive ratio of $9$ :

\text{cost}(\text{ALG}) \le 9 \cdot \mathrm{OPT}.

(The proof was assigned as homework in class.)
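
The bound is easy to probe empirically. The sketch below assumes the shop sits at an integer position with $|p| \ge 1$ (for distances below $1$ on the left side the ratio is not bounded by $9$ ) and measures total distance walked; the function name is hypothetical.

```python
def doubling_search_cost(p):
    """Distance walked until reaching the shop at signed position p, |p| >= 1."""
    cost, step, direction = 0, 1, 1  # start by walking right
    while True:
        if p * direction > 0 and abs(p) <= step:
            return cost + abs(p)     # found on this outward walk
        cost += 2 * step             # walk out and come back
        step *= 2
        direction = -direction


worst = max(doubling_search_cost(p) / abs(p)
            for p in range(-1000, 1001) if p != 0)
print(worst < 9)  # True
```

The worst ratios occur just past a turning point, for example at $p = 2^k + 1$ on the side that was just searched out to $2^k$ .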

(d) Suppose someone proposes the following algorithm for ski rental: always buy on day $1$ . Compute its competitive ratio. Find a sequence (i.e., a value of $d$ ) that demonstrates this ratio.

Solution

The online algorithm pays $B$ regardless of $d$ . The offline OPT is $\min(d, B)$ , which is minimized for small $d$ .

The competitive ratio is

\sup_d \frac{B}{\min(d, B)}.

The supremum is attained at $d = 1$ (the skier only skis one day): $\mathrm{OPT} = 1$ while the algorithm pays $B$ , so the ratio is $B$ , which is linear in $B$ . So “always buy” is $B$ -competitive and no better, which is bad whenever $B$ is large.

Problem 8: List Update - Move-to-Front

The list update problem: maintain a linked list of $n$ keys under a request sequence $r_1,\ldots,r_m$ with $m > n$ . Accessing the key at position $j$ costs $j$ ; after the access, you may move the accessed key to any earlier position for free.

(a) Restate the list update problem in your own words. What is the cost incurred when you access an element, and what operation is allowed for free after each access? Why is it intuitive that frequently-accessed keys should sit near the front of the list?

Solution

You maintain a linked list of $n$ keys. Each request asks for some specific key; to serve it, you walk from the front of the list to that key, paying a cost equal to the key’s current position (so position 1 costs 1, position $n$ costs $n$ ). After serving the request, you may move the just-accessed key to any earlier position in the list for free; moving any other key around still costs.

Intuitively, keys requested often should sit near the front so that future accesses to them are cheap. Sequences that repeatedly ask for keys in the back of the list are the costly ones, so a good list-update strategy tries to keep the “hot” keys at the front.

(b) Consider the “do nothing” algorithm. Find a request sequence on which this algorithm has competitive ratio $\Omega(n)$ .

Solution

Start with the list $1 \to 2 \to \cdots \to n$ and request $r_i = n$ for $i = 1,\ldots,m$ . Each access costs $n$ , so “do nothing” pays $mn$ . An offline algorithm could move $n$ to the front after its first request, paying $n + (m-1) \cdot 1 \le 2m$ . Hence

\frac{mn}{2m} = \frac{n}{2} = \Omega(n).
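
A short simulation of this sequence, comparing “do nothing” with moving the requested key to the front after each access (the function name and its `move_to_front` flag are assumptions for illustration):

```python
def serve(requests, keys, move_to_front):
    """Total access cost of a request sequence; positions are 1-based."""
    lst = list(keys)
    total = 0
    for key in requests:
        pos = lst.index(key)
        total += pos + 1                  # cost = current position
        if move_to_front:
            lst.insert(0, lst.pop(pos))   # free move of the accessed key
    return total


n, m = 10, 100
requests = [n] * m
print(serve(requests, range(1, n + 1), move_to_front=False))  # 1000 = m * n
print(serve(requests, range(1, n + 1), move_to_front=True))   # 109 = n + (m - 1)
```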

(c) Consider the “order by current frequency” algorithm: maintain the list so that keys are sorted by how many times they have been requested so far. Find a request sequence on which this algorithm has competitive ratio $\Omega(n)$ .

Solution

Request key $1$ exactly $n$ times, then key $2$ exactly $n$ times, …, then key $n$ exactly $n$ times. After the first half of the sequence, the first half of the list contains $\{1,\ldots,n/2\}$ and the second half contains $\{n/2+1,\ldots,n\}$ , but the rest of the sequence asks only for second-half keys, each access costing at least $n/2$ . The cost of “order by frequency” is therefore at least $\frac{n^2}{2} \cdot \frac{n}{2} = \frac{n^3}{4} = \Omega(n^3)$ , while moving each newly requested key to the front pays at most $n + (n-1) \cdot 1 \le 2n$ per distinct key, hence $\mathrm{OPT} \le 2n^2$ . The ratio is at least $\frac{n^3/4}{2n^2} = \frac{n}{8} = \Omega(n)$ .

(d) State the theorem proved (without re-doing the proof) about Move-to-Front (MTF): what does it say about $\mathrm{cost}_{\mathrm{MTF}}(S)$ in terms of $\mathrm{OPT}(S)$ , $m$ , and $n$ ? Why does this give a competitive ratio strictly less than $2$ for long sequences?

Solution

For every request sequence $S$ of length $m$ ,

\mathrm{cost}_{\mathrm{MTF}}(S) \le 2 \cdot \mathrm{OPT}(S) - m + \binom{n}{2}.

When $m \gg \binom{n}{2}$ , the additive term $-(m - \binom{n}{2})$ is negative, so

\mathrm{cost}_{\mathrm{MTF}}(S) < 2 \cdot \mathrm{OPT}(S).

Hence MTF’s competitive ratio is strictly less than $2$ for long enough sequences.

Problem 9: Paging - LRU, FIFO, and Resource Augmentation

(a) Define the cost of a paging algorithm and what it means for an online paging algorithm to be $k$ -competitive.

Solution

In paging, the only thing we care about is how often the cache fails us. The cost of a paging algorithm on a given request sequence is simply the number of cache misses it incurs — every time a requested item isn’t already in the cache and has to be fetched fresh, that’s one unit of cost. Hits are free.

We compare paging algorithms to the offline optimum $\mathrm{OPT}$ (which gets to see all future requests and evict perfectly using Farthest-in-Future). An online paging algorithm $A$ running with a cache of size $k$ is $k$ -competitive if, no matter what sequence of requests $S$ comes in, $A$ ’s misses are at most $k$ times $\mathrm{OPT}$ ’s:

\mathrm{cost}_A(S) \le k \cdot \mathrm{cost}_{\mathrm{OPT}}(S).

The catch with $k$ -competitive is that $k$ is also the cache size, so the guarantee gets worse as the cache gets bigger. That’s a discouraging-sounding bound, but a classical result of Sleator-Tarjan says you can’t do better than this with a deterministic online algorithm.

(b) State (without proof) the Sleator-Tarjan theorem about LRU and FIFO.

Solution

Both LRU and FIFO are $k$ -competitive (where $k$ is the cache size). Moreover, no deterministic online paging algorithm can be better than $k$ -competitive.

(c) Suppose the online algorithm has cache size $k$ and OPT has cache size $k' < k$ . State the competitive ratio of LRU and FIFO under this resource-augmentation model.

Solution

If the online algorithm has cache size $k$ and OPT has cache size $k' < k$ , then LRU and FIFO are

\frac{k}{k - k' + 1}\text{-competitive}.

(d) Apply the formula in (c): if $k = 2000$ and $k' = 1000$ , what is the resulting competitive ratio of LRU and FIFO?

Solution

With $k = 2000$ , $k' = 1000$ :

\frac{k}{k - k' + 1} = \frac{2000}{2000 - 1000 + 1} = \frac{2000}{1001} \approx 2.

So LRU and FIFO are roughly $2$ -competitive when given twice the cache that OPT has.
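A tiny Python helper makes it easy to try other cache sizes in the formula from (c). The function name `augmented_ratio` is a hypothetical label, not something from the course.

```python
def augmented_ratio(k, k_opt):
    """Competitive ratio k / (k - k' + 1) when OPT has the smaller cache k'."""
    if not 1 <= k_opt <= k:
        raise ValueError("need 1 <= k' <= k")
    return k / (k - k_opt + 1)

ratio = augmented_ratio(2000, 1000)      # the case from part (d)
same_size = augmented_ratio(2000, 2000)  # k' = k falls back to plain k-competitiveness
print(ratio, same_size)
```

Plugging in $k' = k$ gives $k/1 = k$ , which matches the unaugmented bound from (b).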

Problem 10: Multiplicative Weight Updates (Experts)

For this topic the professor explicitly said you should understand the setup and the algorithm; the full proof was not covered during class.

(a) State the experts’ problem. What are the inputs, what does the algorithm output at each time step, and what feedback do we receive?

Solution

The experts’ problem is a stylized version of a real-world dilemma: you have access to $n$ different “experts” (advisors, models, weather services, financial gurus, whatever) and every day you have to make a binary decision based on what they recommend — say, buy or sell. Some experts are reliable, some are not, but you don’t know in advance which is which.

Each round $t = 1, 2, \ldots$ proceeds as follows:

All $n$ experts simultaneously announce their suggestion for the round (buy or sell).
Your algorithm has to commit to one of the two actions, before seeing the truth.
Then the actual outcome is revealed, and each expert is judged “correct” or “mistake” based on whether their suggestion matched.

The whole goal here is to design a strategy whose total number of mistakes is comparable to the number of mistakes made by the best expert in hindsight — that is, the one who turns out after the fact to have been most accurate. The catch is that you don’t know which expert that’s going to be while the rounds are happening, so you can’t just blindly follow one of them.

(b) Describe the Weighted Majority algorithm: how weights are initialized, how decisions are made each round, and how weights are updated after the round.

Solution

Initialize $w_i^{(1)} = 1$ for all $i \in [n]$ .
At round $t$ , compute total weight $\Phi^{(t)} = \sum_i w_i^{(t)}$ . Take the weighted vote: choose the action whose supporting experts hold $\ge \Phi^{(t)}/2$ of the weight (break ties arbitrarily, e.g. flip a coin).
After the truth is revealed, update each expert’s weight: $w_i^{(t+1)} = \begin{cases} w_i^{(t)} & \text{if expert } i \text{ was correct,}\\ (1 - \varepsilon) w_i^{(t)} & \text{if expert } i \text{ made a mistake.} \end{cases}$
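The three steps above can be sketched in Python as follows. The encoding of the two actions as `0` / `1`, the rule that ties go to action `1`, and the toy data are all assumptions made for illustration.

```python
def weighted_majority(predictions, outcomes, eps):
    """predictions[t][i] and outcomes[t] are 0 or 1. Returns (mistakes, weights)."""
    n = len(predictions[0])
    w = [1.0] * n                          # step 1: all weights start at 1
    mistakes = 0
    for preds, truth in zip(predictions, outcomes):
        total = sum(w)                     # Phi^(t)
        weight_for_1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        guess = 1 if weight_for_1 >= total / 2 else 0  # step 2: weighted vote
        if guess != truth:
            mistakes += 1
        # step 3: multiply the weight of every wrong expert by (1 - eps)
        w = [wi if p == truth else (1 - eps) * wi for wi, p in zip(w, preds)]
    return mistakes, w

# Toy run: expert 0 is always right, experts 1 and 2 are always wrong.
rounds = 5
predictions = [[1, 0, 0]] * rounds
outcomes = [1] * rounds
mistakes, weights = weighted_majority(predictions, outcomes, eps=0.5)
print(mistakes, weights)
```

In this toy run the two bad experts outvote the good one only in the first round; after one penalty their combined weight drops to half the total and the algorithm follows the good expert from then on.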

(c) State the mistake-bound theorem proved in class. What are the meanings of $m^{(t)}$ , $m_i^{(t)}$ , $n$ , and $\varepsilon$ in the bound?

Solution

After $t$ rounds, for every expert $i \in [n]$ ,

m^{(t)} \le 2(1 + \varepsilon)\, m_i^{(t)} + \frac{2 \ln n}{\varepsilon},

where:

$m^{(t)}$ = number of mistakes made by Weighted Majority through round $t$ ,
$m_i^{(t)}$ = number of mistakes made by expert $i$ through round $t$ ,
$n$ = number of experts,
$\varepsilon > 0$ = the small penalty parameter used in the multiplicative update.

In particular, taking $i$ to be the best expert in hindsight,

m^{(t)} \le 2(1 + \varepsilon)\, m_{\text{best}}^{(t)} + \frac{2 \ln n}{\varepsilon}.

(d) Briefly explain what the bound implies as $t$ grows large (informally - no proof needed).

Solution

The additive $\frac{2 \ln n}{\varepsilon}$ term is independent of $t$ , so it becomes negligible next to the first term once the best expert’s mistake count grows with $t$ . The multiplicative factor $2(1+\varepsilon)$ shows that the algorithm makes at most roughly twice as many mistakes as the best expert, with a slight $\varepsilon$ -overhead that can be tuned. So Weighted Majority is competitive with the best expert even though it must commit before knowing the truth each round.

Problem 11: Johnson-Lindenstrauss Lemma

(a) State the Johnson-Lindenstrauss lemma. Be clear about what $N$ , $D$ , $D'$ , and $\varepsilon$ each mean and what the guaranteed distortion is.

Solution

The Johnson-Lindenstrauss lemma says that if you have a bunch of points sitting in some absurdly high-dimensional space, you can squash them down into a much lower-dimensional space and still preserve all the pairwise distances almost exactly — as long as you allow a small amount of distortion. The remarkable part is that the target dimension only needs to depend on how many points you have and how much distortion you’re willing to accept, not on how big the original dimension was.

Formally: for any $0 < \varepsilon < 1$ and any integer $N > 1$ , if you pick a target dimension

D' \ge \frac{\log N}{\varepsilon^2},

then for any set $S$ of $N$ points in $\mathbb{R}^D$ there exists a map $F : \mathbb{R}^D \to \mathbb{R}^{D'}$ such that every pairwise distance is preserved up to a $(1 \pm \varepsilon)$ factor:

(1 - \varepsilon)\, \|x - y\| \le \|F(x) - F(y)\| \le (1 + \varepsilon)\, \|x - y\| \quad \text{for all } x, y \in S.

The four parameters:

$N$ — how many points you have.
$D$ — the original (high) dimension your points live in.
$D'$ — the target (low) dimension you’re projecting down to.
$\varepsilon$ — how much distortion you’re willing to tolerate. Smaller $\varepsilon$ means distances are preserved more faithfully, but you need a higher $D'$ to get there.

Two things stand out about this. First, $D$ doesn’t show up anywhere in the target-dimension formula — so it doesn’t matter if your original space has $5{,}000$ dimensions or $5{,}000{,}000$ . Second, the cost scales as $\log N$ , which is incredibly slow, so even with millions of points you only need modest target dimensions.
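The lemma only asserts that a good map $F$ exists. One standard way to obtain one with high probability (an assumption here, since the statement above does not name a construction) is a scaled Gaussian random matrix. The sketch below uses only the Python standard library; the sizes, seeds, and function names are illustrative.

```python
import math
import random

def random_projection(points, d_target, seed=0):
    """Map each point x to (1/sqrt(D')) * G x, where G has i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    d = len(points[0])
    g = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d_target)]
    scale = 1.0 / math.sqrt(d_target)
    return [[scale * sum(row[j] * p[j] for j in range(d)) for row in g]
            for p in points]

def dist(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Illustrative data: N = 20 random points in D = 500 dimensions, projected to D' = 100.
data_rng = random.Random(1)
points = [[data_rng.uniform(-1, 1) for _ in range(500)] for _ in range(20)]
projected = random_projection(points, d_target=100)

# Ratio of projected to original distance for every pair; JL says these stay near 1.
ratios = [dist(projected[i], projected[j]) / dist(points[i], points[j])
          for i in range(20) for j in range(i + 1, 20)]
print(min(ratios), max(ratios))
```

The printed minimum and maximum give the empirical distortion on this sample; note that the matrix `g` is drawn without looking at the points, which is why the original dimension never enters the target-dimension formula.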

(b) Suppose you have $N = 10^6$ points in $\mathbb{R}^D$ with $D = 5000$ , and your downstream algorithm only works in $D' = 200$ dimensions. What is the best (smallest) distortion $\varepsilon$ you can guarantee via JL?

Solution

Solve $D' = \frac{\log N}{\varepsilon^2}$ for $\varepsilon$ :

\varepsilon = \sqrt{\frac{\log N}{D'}}.

Using natural log (the constant depends on the version of JL used, but the dependence on $N$ and $D'$ is the same):

\varepsilon = \sqrt{\frac{\ln 10^6}{200}} = \sqrt{\frac{13.82}{200}} = \sqrt{0.0691} \approx 0.263.

So roughly $\varepsilon \approx 0.26$ - distances are preserved up to about $\pm 26\%$ .
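The same calculation in Python, as a quick check. It assumes the constant-free form $D' = \ln N / \varepsilon^2$ used above; the function name `jl_epsilon` is hypothetical.

```python
import math

def jl_epsilon(num_points, target_dim):
    """Smallest eps with target_dim >= ln(num_points) / eps^2 (constant-free form)."""
    return math.sqrt(math.log(num_points) / target_dim)

eps = jl_epsilon(10**6, 200)
print(round(eps, 3))
```

Note that $D = 5000$ never appears in the computation.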

(c) Why is the target dimension $D'$ independent of the original dimension $D$ ? Briefly explain in one or two sentences.

Solution

JL’s target dimension $D' \ge \log N / \varepsilon^2$ depends only on the number of points and the desired distortion, not on the ambient dimension. Intuitively, $N$ points span at most an $(N-1)$ -dimensional affine subspace regardless of how many ambient dimensions $D$ they sit in, so the geometric “intrinsic complexity” we must preserve depends on $N$ rather than $D$ .

(d) Suppose you’re OK with the projected distances being up to twice the original. What value of $\varepsilon$ does that correspond to in the JL guarantee, and what’s the catch?

Solution

The JL guarantee is two-sided:

(1 - \varepsilon)\,\|x - y\| \le \|F(x) - F(y)\| \le (1 + \varepsilon)\,\|x - y\|

“Distances at most doubled” only constrains the upper bound: $(1 + \varepsilon) = 2 \Rightarrow \varepsilon = 1$ .

But here’s the catch: $\varepsilon$ also controls the lower bound. At $\varepsilon = 1$ , the lower bound becomes $(1 - 1)\,\|x - y\| = 0$ — which means projected distances are allowed to shrink all the way to zero. Two genuinely far-apart points in the original space could end up at the same location after projection, completely destroying the nearest-neighbor information you were trying to preserve.

So when you “ask for” doubled distances by setting $\varepsilon = 1$ , you’re also accepting that distances could be arbitrarily compressed. That is usually a much worse failure mode than stretching, since a stretched distance still keeps distinct points apart while a collapsed one does not. This is why JL is typically stated for $0 < \varepsilon < 1$ and why useful applications usually pick much smaller $\varepsilon$ values.

(e) Suppose you tolerate distances being off by up to $\pm 50\%$ in either direction. What is $\varepsilon$ ? And as a rule of thumb, what range of $\varepsilon$ makes a JL embedding actually useful for downstream tasks like nearest-neighbor search?

Solution

“ $\pm 50\%$ in either direction” means $\varepsilon = 0.5$ . The guarantee becomes:

0.5\,\|x - y\| \le \|F(x) - F(y)\| \le 1.5\,\|x - y\|

so a projected distance could be anywhere between half and one-and-a-half times the original.

Is that useful? It depends on the task, but generally no — because the ratio between the largest and smallest possible projected distance for the same pair is $1.5 / 0.5 = 3\times$ . Two pairs of points that were originally the same distance apart could end up with projected distances differing by a factor of $3$ . For nearest-neighbor search where you’re trying to distinguish “close” from “very close”, that’s way too much noise.

Practical rule of thumb. For an embedding to be useful for NNS or clustering, you typically want $\varepsilon \le 0.1$ or so, giving distances preserved within $\pm 10\%$ . That keeps the worst-case “max/min projected distance” ratio close to $1.1 / 0.9 \approx 1.22$ , which is usually small enough that the geometric structure of the data survives the projection.

The cost of a smaller $\varepsilon$ is a larger target dimension: $D' \propto 1/\varepsilon^2$ , so cutting $\varepsilon$ in half quadruples the required $D'$ . That’s the central tradeoff in dimension reduction.
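To see the $1/\varepsilon^2$ scaling in numbers, here is a short Python sketch. It again assumes the constant-free bound $D' \ge \ln N / \varepsilon^2$ , so the absolute values are only indicative; `jl_target_dim` is a hypothetical name.

```python
import math

def jl_target_dim(num_points, eps):
    """Smallest integer D' with D' >= ln(num_points) / eps^2 (constant-free form)."""
    return math.ceil(math.log(num_points) / eps ** 2)

dims = {eps: jl_target_dim(10**6, eps) for eps in (0.5, 0.25, 0.1)}
for eps, d in dims.items():
    print(eps, d)
```

Going from $\varepsilon = 0.5$ to $\varepsilon = 0.25$ raises the target dimension from $56$ to $222$ , roughly a factor of four, and $\varepsilon = 0.1$ needs $1382$ dimensions for a million points.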

Problem 12: Approximate Nearest Neighbor and LSH

Given $n$ data points $x_1,\ldots,x_n$ in $\mathbb{R}^D$ , exact NNS asks for $\arg\min_i d(q, x_i)$ . The trivial brute-force solution runs in time $O(n D)$ .

(a) State the nearest neighbor search (NNS) problem. What is the input, the query, and the desired output? What is the cost of the brute-force solution?

Solution

Nearest neighbor search is the basic primitive behind similarity search — think recommendation systems, image retrieval, or $k$ -NN classifiers in machine learning. The setup is that you preprocess a dataset of $n$ points once, then have to answer a stream of “given this new point, which of my stored points is most similar to it?” queries as fast as possible.

Concretely, you’re given $n$ data points $x_1, \ldots, x_n \in \mathbb{R}^D$ at preprocessing time. Later, a query point $q \in \mathbb{R}^D$ arrives and you need to return the data point closest to $q$ under some distance metric (Euclidean, Hamming, etc.):

x^* = \arg\min_{1 \le i \le n}\,d(q, x_i).

The brute-force solution is the obvious one: just compute $d(q, x_i)$ for every single $i$ and return the smallest. Each distance computation between two $D$ -dimensional vectors takes $O(D)$ time, and there are $n$ of them, so each query costs $O(nD)$ .

In modern datasets both $n$ and $D$ are huge — think a billion images each represented as a $1000$ -dimensional embedding — so $O(nD)$ per query is way too slow. That’s the gap NNS algorithms try to close by doing more clever preprocessing.
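For reference, the brute-force scan looks like this in Python. It is a minimal sketch that assumes Euclidean distance and uses `math.dist` (available from Python 3.8); the sample points are made up.

```python
import math

def nearest_neighbor(points, q):
    """Linear scan: n distance computations of O(D) each, so O(nD) per query."""
    best_index, best_dist = -1, math.inf
    for i, x in enumerate(points):
        d = math.dist(q, x)   # Euclidean distance between two D-dimensional points
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
idx, d = nearest_neighbor(points, (0.9, 1.2))
print(idx, d)
```

There is no preprocessing at all here, which is exactly why every query has to touch every stored point.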

(b) State (informally) the hardness result of Williams-Alman for exact NNS. Why does this push us toward approximate NNS?

Solution

For exact NNS in high dimensions, any algorithm running in time $O(n^{1 - \alpha} D)$ for some $\alpha > 0$ (i.e. strictly sub-linear in $n$ ) would violate the Strong Exponential Time Hypothesis (SETH). Since refuting SETH is considered highly unlikely, this rules out fast exact NNS in high dimensions. So we relax the problem to approximate NNS.

(c) Define the $c$ -approximate NNS problem for an approximation factor $c > 1$ .

Solution

Given $c > 1$ , return some $x_j$ satisfying

d(q, x_j) \le c \cdot \min_{1 \le i \le n} d(q, x_i).

That is, return any point whose distance to $q$ is at most $c$ times the true nearest-neighbor distance.

(d) State the preprocessing and query time of the LSH data structure in terms of $n$ , $D$ , and a parameter $\rho$ . What is $\rho$ for Hamming distance, and what is $\rho$ for Euclidean distance?

Solution

LSH gives preprocessing/storage $O(n^{1 + \rho} D)$ and query time $O(n^{\rho} D)$ , where

Hamming distance: $\rho = \frac{1}{c}$ ,
Euclidean distance: $\rho = \frac{1}{c^2}$ .

(e) If $c = 2$ , what is the query time of LSH for Hamming distance and for Euclidean distance? Express the answers as $O(\cdot)$ expressions involving $n$ and $D$ .

Solution

For $c = 2$ :

Hamming: $\rho = 1/2$ , so query time $O(n^{1/2} D) = O(\sqrt{n}\, D)$ .
Euclidean: $\rho = 1/4$ , so query time $O(n^{1/4} D)$ .

Both are dramatically faster than the brute-force $O(n D)$ when $n$ is large.
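To put numbers on “dramatically faster”, the sketch below evaluates $n^{\rho}$ for an illustrative $n = 10^6$ . It ignores the common factor $D$ and all hidden constants, and the helper name is hypothetical.

```python
def lsh_exponents(c):
    """Exponent rho for the two metrics, as stated in part (d)."""
    return {"hamming": 1 / c, "euclidean": 1 / c ** 2}

n = 10**6                      # illustrative dataset size (assumption)
rhos = lsh_exponents(2)
work = {metric: n ** rho for metric, rho in rhos.items()}
for metric in rhos:
    print(metric, rhos[metric], round(work[metric], 1))
```

So for a million points the query touches on the order of $1000$ candidates under Hamming distance and about $32$ under Euclidean distance, instead of all $10^6$ .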

Problem 13: Run AMS by Hand on a Concrete Stream

Consider the stream

S = \{3, 1, 4, 1, 5, 3, 2, 1, 3, 5\}, \qquad m = 10,

with positions numbered $1, \ldots, 10$ from left to right.

(a) What does the second frequency moment $F_2$ count, and what is the AMS estimator $X$ for a single sampled position?

Solution

$F_2 = \sum_{i=1}^{n} f(i)^2$ is the sum of squared frequencies of items in the stream. For one sample at position $j$ , let $r$ be the number of occurrences of $x_j$ at positions $\ge j$ . The AMS estimator is

X = m\bigl(r^2 - (r-1)^2\bigr) = m(2r - 1).

A single run satisfies $\mathbb{E}[X] = F_2$ .

(b) Compute the true value of $F_2$ for the stream above.

Solution

Frequencies:

Item $i$	Positions	$f(i)$	$f(i)^2$
1	2, 4, 8	3	9
2	7	1	1
3	1, 6, 9	3	9
4	3	1	1
5	5, 10	2	4

F_2 = 9 + 1 + 9 + 1 + 4 = 24.

(c) Run AMS three times for $F_2$ on this stream, sampling positions $1$ , $4$ , and $10$ in turn. Show $r$ and $X$ for each run.

Solution

Run 1 — sample position $1$ ( $x_1 = 3$ ). Occurrences of $3$ from position $1$ onward: positions $1, 6, 9$ , so $r = 3$ .
$X_1 = 10(2 \cdot 3 - 1) = 10 \cdot 5 = 50.$
Run 2 — sample position $4$ ( $x_4 = 1$ ). Occurrences of $1$ from position $4$ onward: positions $4, 8$ , so $r = 2$ .
$X_2 = 10(2 \cdot 2 - 1) = 10 \cdot 3 = 30.$
Run 3 — sample position $10$ ( $x_{10} = 5$ ). Occurrences of $5$ from position $10$ onward: position $10$ , so $r = 1$ .
$X_3 = 10(2 \cdot 1 - 1) = 10 \cdot 1 = 10.$
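The three runs can be reproduced with a short Python sketch. Positions are 1-based as in the problem; the function name `ams_estimate` is an illustrative choice. As a bonus, averaging the estimator over all ten positions checks $\mathbb{E}[X] = F_2$ directly on this stream.

```python
from collections import Counter

stream = [3, 1, 4, 1, 5, 3, 2, 1, 3, 5]
m = len(stream)

def ams_estimate(stream, position):
    """X = m(2r - 1), where r counts occurrences of x_j at positions >= j (1-based)."""
    item = stream[position - 1]
    r = stream[position - 1:].count(item)
    return len(stream) * (2 * r - 1)

true_f2 = sum(f * f for f in Counter(stream).values())
estimates = [ams_estimate(stream, j) for j in (1, 4, 10)]

# A uniformly random position gives expectation = average over all m positions.
expected_x = sum(ams_estimate(stream, j) for j in range(1, m + 1)) / m
print(true_f2, estimates, expected_x)
```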

(d) Average the three outputs $Y = (X_1 + X_2 + X_3)/3$ . By what percentage does the average overshoot or undershoot the true $F_2$ ? Briefly explain why one would not be alarmed by this discrepancy.

Solution

Y = \frac{X_1 + X_2 + X_3}{3} = \frac{50 + 30 + 10}{3} = \frac{90}{3} = 30.

Compared to the true $F_2 = 24$ , the average overshoots by $\frac{30 - 24}{24} = 25\%$ .

Why is this expected? AMS is unbiased ( $\mathbb{E}[X] = F_2$ ) but its single-run variance is large. With only three samples, the empirical average can easily land $25\%$ off the truth. The full $(\varepsilon, \delta)$ -guarantee from Problem 4 requires $O(\sqrt{n}/\varepsilon^2 \cdot \log(1/\delta))$ samples, which would be far more than three even for modest $\varepsilon$ .

Problem 14: Reverse-Engineering a Streaming Counter

(a) Morris counter. Your friend ran a single Morris counter on an unknown stream and tells you the final value of the counter is $C = 8$ . What is their best point estimate $\widehat{N}$ of the number of stream items processed?

Solution

The Morris counter output is $\widehat{N} = 2^C - 1$ , so

\widehat{N} = 2^8 - 1 = 255.

(b) Use the Morris counter’s variance result $\mathrm{Var}(\widehat{N}) \approx N^2/2$ to estimate the standard deviation of the estimate from part (a). Why does this make a single Morris counter unsatisfying for high-confidence applications?

Solution

The standard deviation of the estimate is roughly

\sigma \approx \sqrt{N^2/2} = N/\sqrt{2}.

Using the point estimate $\widehat{N} = 255$ as a stand-in for $N$ , $\sigma \approx 255/\sqrt{2} \approx 180$ . So the true $N$ could plausibly be anywhere from roughly $75$ to $435$ on a single run — the standard deviation is on the order of the mean itself. This is exactly why one needs the Morris+/Morris++ boosting strategies for any application that demands a tight confidence interval.
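To get a feel for this spread, the sketch below computes the numbers above and then runs ten independent Morris counters on a stream of $255$ items, using the increment rule from Problem 1. The seed and the number of repetitions are arbitrary assumptions.

```python
import math
import random

def morris_counter(num_items, rng):
    """Run one Morris counter: increment C with probability 1 / 2^C per item."""
    c = 0
    for _ in range(num_items):
        if rng.random() < 1 / 2 ** c:
            c += 1
    return c

n_hat = 2 ** 8 - 1                 # point estimate from C = 8
sigma = n_hat / math.sqrt(2)       # from Var ~ N^2 / 2, with n_hat standing in for N
print(n_hat, round(sigma), round(n_hat - sigma), round(n_hat + sigma))

rng = random.Random(0)             # fixed seed only for reproducibility
estimates = [2 ** morris_counter(255, rng) - 1 for _ in range(10)]
print(sorted(estimates))
```

Every estimate has the form $2^C - 1$ , so neighbouring outputs differ by about a factor of two, which is another way to see why a single counter is so coarse.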

(c) Flajolet-Martin. Another friend ran the basic Flajolet-Martin algorithm and tells you the final largest LSB position observed is $P = 9$ . What is their point estimate $\widehat{T}$ of the number of distinct elements?

Solution

The basic Flajolet-Martin algorithm outputs $\widehat{T} = 2^{P+1}$ , so

\widehat{T} = 2^{9+1} = 2^{10} = 1024.

(d) Using the factor- $32$ guarantee from Problem 6, give a range of plausible values for the true number of distinct elements $T$ (with probability $\ge 2/3$ ).

Solution

The factor- $32$ guarantee says that with probability $\ge 2/3$ ,

\frac{T}{32} \le \widehat{T} \le 32 T.

Rearranging both sides for $T$ with $\widehat{T} = 1024$ :

\frac{\widehat{T}}{32} \le T \le 32 \widehat{T} \implies \frac{1024}{32} \le T \le 32 \cdot 1024 \implies 32 \le T \le 32768.

So the true distinct-element count is somewhere between $32$ and $32{,}768$ with probability at least $2/3$ — a very wide window, again motivating the boosted Flajolet-Martin / HyperLogLog improvements.
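The same two steps as a Python sketch, using the $2^{P+1}$ output rule from part (c) and the factor- $32$ window from part (d); both helper names are hypothetical.

```python
def fm_estimate(p):
    """Basic Flajolet-Martin output for largest observed LSB position p."""
    return 2 ** (p + 1)

def plausible_range(t_hat, factor=32):
    """Interval for T implied by T/factor <= t_hat <= factor * T."""
    return t_hat / factor, t_hat * factor

t_hat = fm_estimate(9)
low, high = plausible_range(t_hat)
print(t_hat, low, high)
```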

Problem 15: Daily Weather Predictions via MWU

You subscribe to $n = 3$ competing weather services that each post a daily binary forecast (rain / no-rain) for your city. After $T = 1000$ days, you want a strategy whose total wrong predictions are not much more than those of whichever service turns out to have been the most accurate (which you do not know in advance).

You may use the mistake bound proven in class without re-deriving it:

m^{(t)} \le 2(1 + \varepsilon)\, m_i^{(t)} + \frac{2 \ln n}{\varepsilon} \qquad \forall\, i \in [n].

(a) Identify the experts and the binary decisions in MWU language. What is $n$ here?

Solution

The three weather services are the experts ( $n = 3$ ). The binary decision each day is “rain” or “no rain”. Your strategy’s job is to combine the three forecasts into a single daily decision.

(b) State the Weighted Majority algorithm in this context, including the initial weights, the daily decision rule, and the weight update rule after each day.

Solution

Initialize weights $w_1 = w_2 = w_3 = 1$ .
Each morning, compute the total weight $\Phi = w_1 + w_2 + w_3$ . Add up the weights of services predicting rain. If they sum to $\ge \Phi/2$ , your strategy predicts rain; otherwise predict no-rain.
After the actual weather is observed, for each service $i$ that was wrong, set $w_i \leftarrow (1 - \varepsilon) w_i$ ; leave the weight of correctly predicting services unchanged.

(c) Suppose you choose $\varepsilon = 0.1$ , and after the 1000 days the best-performing service was wrong on $50$ days. Give an upper bound on the number of days your Weighted Majority strategy was wrong. (You may approximate $\ln 3 \approx 1.1$ .)

Solution

Plug $\varepsilon = 0.1$ , $n = 3$ , $m_{\text{best}}^{(1000)} = 50$ :

m^{(1000)} \le 2(1.1)(50) + \frac{2 \ln 3}{0.1} \approx 110 + \frac{2 \cdot 1.1}{0.1} = 110 + 22 = 132.

So Weighted Majority makes at most about $132$ wrong predictions over the $1000$ days. That’s roughly $2.6 \times$ the best service — and crucially, you achieved this without knowing in advance which service would be best.
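The bound is easy to evaluate for other choices of $\varepsilon$ . The Python sketch below uses the exact $\ln 3$ rather than the $1.1$ approximation, so the result differs slightly from $132$ ; the function name is an assumption.

```python
import math

def wm_mistake_bound(best_mistakes, n, eps):
    """Right-hand side of the bound: 2(1 + eps) * m_best + 2 ln(n) / eps."""
    return 2 * (1 + eps) * best_mistakes + 2 * math.log(n) / eps

bound = wm_mistake_bound(50, 3, 0.1)
print(round(bound, 2))

# Larger eps inflates the first term, smaller eps inflates the second.
for eps in (0.05, 0.1, 0.2, 0.5):
    print(eps, round(wm_mistake_bound(50, 3, eps), 2))
```

The loop illustrates the tuning tradeoff: the two terms pull in opposite directions as $\varepsilon$ changes.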

(d) What would happen to the bound from part (c) if you instead naively committed to always following service #1 from day 1 onward, and service #1 turned out to be the worst of the three?

Solution

If you blindly followed service #1 and it happened to be the worst service, you would make exactly as many mistakes as that service, and nothing bounds this number in terms of the best service’s mistakes: service #1 could be wrong on nearly all $1000$ days while the best service is wrong on only $50$ . The bound from part (c) simply would not apply. MWU’s whole point is that you get a guarantee comparable to the best expert in hindsight, regardless of which one that turns out to be, so you never get stuck mimicking a bad expert.

Problem 16: Hand-Simulating a Browser Cache

Your browser holds at most $k = 3$ tabs in memory. You visit URLs in the following order:

S = \{A, B, C, D, A, B, E, A, C, D\}.

(a) Briefly describe each eviction policy: LRU, FIFO, and the offline optimal (Farthest-in-Future).

Solution

LRU (Least Recently Used): on a miss with a full cache, evict the item whose most recent access is furthest in the past.
FIFO (First In, First Out): on a miss with a full cache, evict the item that was inserted into the cache earliest. Accessing an item already in the cache does not change its FIFO position.
Optimal offline (Farthest-in-Future): on a miss with a full cache, evict the item whose next request is furthest in the future (or whose next request never comes).

(b) Simulate LRU on $S$ and report the total number of cache misses.

Solution

Track the cache and recency order (most-recent on the left):

Step	Request	Hit/Miss	Cache after	Recency order (left = most recent)
1	A	Miss	{A}	[A]
2	B	Miss	{A, B}	[B, A]
3	C	Miss	{A, B, C}	[C, B, A]
4	D	Miss	{B, C, D}	[D, C, B] (evicted A, the LRU)
5	A	Miss	{C, D, A}	[A, D, C] (evicted B)
6	B	Miss	{D, A, B}	[B, A, D] (evicted C)
7	E	Miss	{A, B, E}	[E, B, A] (evicted D)
8	A	Hit	{A, B, E}	[A, E, B]
9	C	Miss	{A, E, C}	[C, A, E] (evicted B)
10	D	Miss	{A, C, D}	[D, C, A] (evicted E)

LRU misses: 9 (only step 8 is a hit).
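The table can be double-checked with a small LRU simulator. This sketch uses `collections.OrderedDict` to keep the recency order; the function name `lru_misses` is an illustrative choice.

```python
from collections import OrderedDict

def lru_misses(requests, k):
    """Count cache misses of LRU with capacity k."""
    cache = OrderedDict()              # keys ordered from least to most recently used
    misses = 0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)    # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) == k:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = True
    return misses

requests = list("ABCDABEACD")
lru_count = lru_misses(requests, k=3)
print(lru_count)
```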

(c) Simulate FIFO on $S$ and report the total number of cache misses.

Solution

Track the cache and the insertion-order queue (left = oldest):

Step	Request	Hit/Miss	Cache after	Queue (left = first in)
1	A	Miss	{A}	[A]
2	B	Miss	{A, B}	[A, B]
3	C	Miss	{A, B, C}	[A, B, C]
4	D	Miss	{B, C, D}	[B, C, D] (evicted A)
5	A	Miss	{C, D, A}	[C, D, A] (evicted B)
6	B	Miss	{D, A, B}	[D, A, B] (evicted C)
7	E	Miss	{A, B, E}	[A, B, E] (evicted D)
8	A	Hit	{A, B, E}	[A, B, E] (queue unchanged)
9	C	Miss	{B, E, C}	[B, E, C] (evicted A)
10	D	Miss	{E, C, D}	[E, C, D] (evicted B)

FIFO misses: 9 (only step 8 is a hit).
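A matching FIFO simulator, again as a sketch with a hypothetical function name. The only difference from LRU is that a hit leaves the queue untouched.

```python
from collections import deque

def fifo_misses(requests, k):
    """Count cache misses of FIFO with capacity k."""
    queue = deque()                    # left end = page inserted earliest
    cached = set()
    misses = 0
    for page in requests:
        if page in cached:
            continue                   # hit: insertion order does not change
        misses += 1
        if len(queue) == k:
            cached.remove(queue.popleft())  # evict the oldest insertion
        queue.append(page)
        cached.add(page)
    return misses

requests = list("ABCDABEACD")
fifo_count = fifo_misses(requests, k=3)
print(fifo_count)
```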

(d) Simulate the optimal offline (Farthest-in-Future) algorithm and report the total misses.

Solution

At each miss, look ahead at the next occurrence of each cached item; evict the one whose next request is farthest (or never).

Step	Request	Hit/Miss	Cache after	Eviction reasoning
1	A	Miss	{A}	-
2	B	Miss	{A, B}	-
3	C	Miss	{A, B, C}	-
4	D	Miss	{A, B, D}	Next A@5, B@6, C@9. Evict C (farthest).
5	A	Hit	{A, B, D}	-
6	B	Hit	{A, B, D}	-
7	E	Miss	{A, D, E}	Next A@8, B never, D@10. Evict B (never).
8	A	Hit	{A, D, E}	-
9	C	Miss	{D, E, C}	Next A never, D@10, E never. Evict A.
10	D	Hit	{D, E, C}	-

OPT misses: 6 (steps 1, 2, 3, 4, 7, 9).
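Farthest-in-Future is also easy to simulate once the whole sequence is known. The sketch below finds each cached page's next use with a plain forward scan, which is fine for a ten-request example but not tuned for long traces; the names are illustrative.

```python
def opt_misses(requests, k):
    """Count cache misses of the offline Farthest-in-Future policy with capacity k."""
    cache = set()
    misses = 0
    for t, page in enumerate(requests):
        if page in cache:
            continue
        misses += 1
        if len(cache) == k:
            def next_use(p):
                # Index of the next request for p after time t, or infinity if none.
                try:
                    return requests.index(p, t + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))  # evict the farthest next use
        cache.add(page)
    return misses

requests = list("ABCDABEACD")
opt_count = opt_misses(requests, k=3)
print(opt_count)
```

When several cached pages are never requested again, any of them may be evicted; the choice does not change the miss count.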

(e) What is the empirical ratio (algorithm misses / OPT misses) for LRU and FIFO on this sequence? How does it compare to the worst-case guarantee of $k$ -competitiveness for $k = 3$ ?

Solution

\frac{\mathrm{LRU\ misses}}{\mathrm{OPT\ misses}} = \frac{9}{6} = 1.5, \qquad \frac{\mathrm{FIFO\ misses}}{\mathrm{OPT\ misses}} = \frac{9}{6} = 1.5.

By Sleator-Tarjan, the worst-case bound for both LRU and FIFO at $k = 3$ is $3$ -competitive, i.e. ratio $\le 3$ . On this particular sequence the achieved ratio is only $1.5$ , well within the worst-case bound. (The $k$ -competitive bound is tight only for adversarial sequences — for typical traces, both LRU and FIFO usually perform much better than the worst case.)
