Skip to content

Final Prep: Problem Sets

Problem 1: Morris Counter - Streaming Algorithms

Section titled “Problem 1: Morris Counter - Streaming Algorithms”

(a) What is an (ε,δ)(\varepsilon, \delta)-guarantee for a randomized estimator n^\hat{n} of some quantity nn? What do ε\varepsilon and δ\delta each control?

Solution

An (ε,δ)(\varepsilon, \delta)-guarantee is a way of formally expressing that an algorithm gives a good approximate answer with high probability. It has two components:

ε\varepsilon — the accuracy parameter. This controls how close the estimate is to the true answer. Specifically, it says the estimate is within a multiplicative factor of (1±ε)(1 \pm \varepsilon) of the true value:

n^[(1ε)n, (1+ε)n]\hat{n} \in [(1 - \varepsilon)\,n,\ (1 + \varepsilon)\,n]

So a smaller ε\varepsilon means a tighter, more accurate estimate.

δ\delta — the failure probability. This controls how often the algorithm is allowed to be wrong. The guarantee holds with probability at least 1δ1 - \delta:

Pr ⁣(n^[(1ε)n, (1+ε)n])1δ\Pr\!\bigl(\hat{n} \in [(1 - \varepsilon)\,n,\ (1 + \varepsilon)\,n]\bigr) \ge 1 - \delta

So a smaller δ\delta means the algorithm fails less often.

Putting it together. The full (ε,δ)(\varepsilon, \delta)-guarantee says:

With probability at least 1δ1 - \delta, my estimate is within ε\varepsilon of the true answer.

The tradeoff. You can always make ε\varepsilon and δ\delta smaller — but it costs you. Typically making them smaller requires running more independent copies of the algorithm (boosting via median-of-means), which increases space and time usage. So ε\varepsilon and δ\delta let you precisely tune the accuracy vs. resource tradeoff.

(b) Describe the Morris Counter algorithm and analyze its space complexity compared to a naive counting approach. What is the trade-off being made?

Solution

Morris Counter algorithm:

  1. Initialize counter C=0C = 0
  2. On each new stream element, increment CC to C+1C+1 with probability p=1/2Cp = 1/2^C
  3. Return estimate n^=2C1\hat{n} = 2^C - 1

Space complexity: Only CC is stored, where 0Clog2n0 \le C \le \log_2 n. This requires O(loglogn)O(\log \log n) bits.

Naive approach: Storing the exact count requires Θ(logn)\Theta(\log n) bits.

Trade-off: By randomizing increments, we achieve exponential space savings (O(loglogn)O(\log \log n) vs. O(logn)O(\log n)), but accept a random error in the estimate. The output is approximate, not exact.

(c) What are the mean and variance of the Morris Counter’s output? Why is the variance a problem?

Solution

Mean and variance:

From the Morris Counter algorithm:

  • E[n^]=E[2C1]=NE[\hat{n}] = E[2^C - 1] = N (unbiased)
  • Var(n^)=Var(2C1)N(N1)2=O(N2)\text{Var}(\hat{n}) = \text{Var}(2^C - 1) \approx \frac{N(N-1)}{2} = O(N^2)

Why it’s a problem: The standard deviation is O(N)O(N), so the output can deviate wildly from the true count. With high probability, the estimate could be off by NN or more, making a single Morris Counter unreliable for high-confidence estimates.

(d) Describe three successive strategies for improving the Morris Counter’s reliability. How does each strategy reduce the failure probability, and what is the total space cost?

Solution

Three successive strategies:

Strategy 1: Morris (basic)

  • Run a single Morris Counter
  • Success probability: undefined (no confidence guarantee)
  • Failure probability: unbounded
  • Space: O(loglogN)O(\log \log N) bits

Strategy 2: Morris+ (weak estimate via averaging)

  • Run T=1/ε2T = \lceil 1/\varepsilon^2 \rceil independent Morris Counters in parallel
  • Return the average of the TT estimates
  • By Chebyshev’s inequality (or Exercise 4.9(b)), this gives an (ε,3/4)(\varepsilon, 3/4)-estimate
  • Success probability: 3/4\geq 3/4 (failure probability 1/4\le 1/4)
  • Space: O(1/ε2loglogN)O(1/\varepsilon^2 \cdot \log \log N) bits

Strategy 3: Morris++ (strong estimate via boosting)

  • Run M=log(1/δ)M = \lceil \log(1/\delta) \rceil independent Morris+ instances in parallel
  • Return the median of their outputs
  • By the median-of-weak-estimates trick (Exercise 4.9(c)), this gives a (ε,δ)(\varepsilon, \delta)-estimate
  • Success probability: 1δ\geq 1 - \delta (arbitrary confidence)
  • Space: O(log(1/δ)1/ε2loglogN)O(\log(1/\delta) \cdot 1/\varepsilon^2 \cdot \log \log N) bits

Each strategy improves failure probability at the cost of more parallel instances (and more space).

(e) For an application requiring count estimation within εN\varepsilon N of the true count with confidence 1δ\geq 1 - \delta, where ε=0.1\varepsilon = 0.1 and δ=0.01\delta = 0.01, how many parallel Morris counters are needed in each of the three strategies from part (d)?

Solution

Numerical calculation for ε=0.1\varepsilon = 0.1, δ=0.01\delta = 0.01:

Morris: No reliability guarantee.

Morris+:

T=1/ε2=1/(0.1)2=100=100 countersT = \lceil 1/\varepsilon^2 \rceil = \lceil 1/(0.1)^2 \rceil = \lceil 100 \rceil = 100 \text{ counters}

Morris++:

M=log(1/δ)=log(1/0.01)=log(100)6.64=7 instances of Morris+M = \lceil \log(1/\delta) \rceil = \lceil \log(1/0.01) \rceil = \lceil \log(100) \rceil \approx \lceil 6.64 \rceil = 7 \text{ instances of Morris+}

Total counters needed:

M×T=7×100=700 Morris CountersM \times T = 7 \times 100 = 700 \text{ Morris Counters}

The Count-Min Sketch is given as a problem on the sample final the professor distributed. Throughout, let the stream be S={x1,,xm}S = \{x_1,\ldots,x_m\} with xj{1,,n}x_j \in \{1,\ldots,n\}, and let F(i)F(i) be the true frequency of item ii.

(a) What is the heavy hitters / frequency estimation problem, and why does CMS pair with a min-heap to solve it? (No need to give the CMS data structure yet — describe the problem and the high-level role of CMS.)

Solution

The heavy hitters problem is the real-world question Amazon (or any platform that tracks views, purchases, clicks, etc.) actually wants to answer: out of all the items flowing past in a stream, which are the top-KK most popular ones right now? You can’t just count everything exactly because the universe of items is huge and the stream never stops, so we need something cheaper.

A natural way to solve heavy hitters is to break it into two pieces:

The sketch. A small data structure that, given any item yy, can quickly answer “roughly how many times has yy shown up so far?” — i.e. an estimate F^(y)\hat{F}(y) of its true frequency F(y)F(y). This is the frequency estimation problem, and it’s what Count-Min Sketch is built for.

The heap. A min-heap of size KK that holds the current top-KK candidates. Whenever a new item xx arrives, we ask the sketch for F^(x)\hat{F}(x) and compare it against the smallest frequency currently in the heap. If the new item is larger, it kicks out the current minimum and takes its slot.

So CMS doesn’t solve heavy hitters by itself — it provides the cheap frequency lookups, and the min-heap turns those lookups into a rolling top-KK. We’re forced into this approximate setup because tracking exact frequencies would require Ω(n)\Omega(n) space, which is exactly what sublinear-space streaming forbids.

(b) Describe the Count-Min Sketch data structure with parameters cc and rr, including how an update is processed and how a query for F(y)F(y) is answered. Why is the min over rows used and not, say, the average?

Solution

Pick rr hash functions h1,,hrh_1,\ldots,h_r with hj:{1,,n}{1,,c}h_j : \{1,\ldots,n\} \to \{1,\ldots,c\}, and keep an r×cr \times c counter matrix initialized to 00.

  • Update xx: for each row jj, increment cell (j,hj(x))(j, h_j(x)) by 11.
  • Query yy: return F^(y)=min1jrCTR[j,hj(y)]\widehat{F}(y) = \min_{1 \le j \le r} \mathrm{CTR}[j, h_j(y)].

The min is used because each counter CTR[j,hj(y)]\mathrm{CTR}[j, h_j(y)] is at least F(y)F(y) (every occurrence of yy is counted) plus the contributions from other items hashing to the same cell. So every row gives an overestimate, and the smallest overestimate is the most accurate.

(c) Why does the sketch only overestimate, i.e. why is F^(y)F(y)\widehat{F}(y) \ge F(y) for every item yy?

Solution

Every occurrence of yy in the stream increments CTR[j,hj(y)]\mathrm{CTR}[j, h_j(y)] by 11 in every row jj. Therefore each row’s counter is at least F(y)F(y), so the minimum across rows is also at least F(y)F(y).

(d) State the (ε,δ)(\varepsilon,\delta)-guarantee proved in class and indicate the values of cc and rr needed.

Solution

With

c=eε,r=ln ⁣(1δ),c = \frac{e}{\varepsilon}, \qquad r = \ln\!\left(\frac{1}{\delta}\right),

the Count-Min Sketch satisfies

Pr ⁣(F^(y)F(y)+εm)1δ.\Pr\!\bigl(\widehat{F}(y) \le F(y) + \varepsilon m\bigr) \ge 1 - \delta.

This is an additive error in terms of the stream length mm, not a relative error.

(e) Prove the guarantee. You may use Markov’s inequality without proof.

Solution

Fix one row jj and let CTR=CTR[j,hj(y)]\mathrm{CTR} = \mathrm{CTR}[j, h_j(y)]. Define Error=CTRF(y)\mathrm{Error} = \mathrm{CTR} - F(y). The expected number of other items that collide with yy in this row is at most m/cm/c (each of the at most mm non-yy stream elements lands in cell hj(y)h_j(y) with probability 1/c1/c), so

E[Error]mc=εme.\mathbb{E}[\mathrm{Error}] \le \frac{m}{c} = \frac{\varepsilon m}{e}.

By Markov,

Pr(Errorεm)E[Error]εm1e.\Pr\bigl(\mathrm{Error} \ge \varepsilon m\bigr) \le \frac{\mathbb{E}[\mathrm{Error}]}{\varepsilon m} \le \frac{1}{e}.

For F^(y)=minj\widehat{F}(y) = \min_j to exceed F(y)+εmF(y) + \varepsilon m, every row’s error must be at least εm\varepsilon m. Since rows use independent hash functions,

Pr(F^(y)F(y)+εm)(1e)r=eln(1/δ)=δ.\Pr\bigl(\widehat{F}(y) \ge F(y) + \varepsilon m\bigr) \le \left(\frac{1}{e}\right)^r = e^{-\ln(1/\delta)} = \delta.

(f) Suppose m=106m = 10^6, ε=0.01\varepsilon = 0.01 and δ=0.001\delta = 0.001. Compute the sketch size r×cr \times c (in counters).

Solution c=e/0.01=271.83=272,c = \lceil e/0.01 \rceil = \lceil 271.83 \rceil = 272, r=ln(1/0.001)=ln(1000)=6.91=7.r = \lceil \ln(1/0.001) \rceil = \lceil \ln(1000) \rceil = \lceil 6.91 \rceil = 7.

Total counters =rc=7272=1904= r \cdot c = 7 \cdot 272 = 1904.


Problem 3: (ε,δ)(\varepsilon,\delta)-Approximate Median

Section titled “Problem 3: (ε,δ)(\varepsilon,\delta)(ε,δ)-Approximate Median”

This topic was Question 3 on the sample final.

Let S={x1,,xN}S = \{x_1,\ldots,x_N\} be a stream. For ε,δ(0,1)\varepsilon, \delta \in (0,1), we want to return an element mm whose rank satisfies

(12ε)Nrank(m)(12+ε)N\left(\tfrac{1}{2} - \varepsilon\right) N \le \mathrm{rank}(m) \le \left(\tfrac{1}{2} + \varepsilon\right) N

with probability at least 1δ1 - \delta.

(a) What is the (ε,δ)(\varepsilon, \delta)-approximate median problem? State the input, the output condition, and what the ε\varepsilon and δ\delta parameters each control here.

Solution

The (ε,δ)(\varepsilon, \delta)-approximate median problem is a relaxation of “find the exact median element of the stream”. Exact computation is too expensive in sublinear streaming space, so instead of insisting on the element with rank exactly N/2N/2, we settle for any element whose rank is close to the middle, and we allow ourselves to fail every so often.

Concretely, the input is a stream S={x1,,xN}S = \{x_1, \ldots, x_N\} together with two knobs ε\varepsilon and δ\delta in (0,1)(0, 1). The job is to return some element mm whose rank lands inside a window around N/2N/2:

(12ε)Nrank(m)(12+ε)N\left(\tfrac{1}{2} - \varepsilon\right) N \le \mathrm{rank}(m) \le \left(\tfrac{1}{2} + \varepsilon\right) N

and the guarantee says this must hold with probability at least 1δ1 - \delta.

ε\varepsilon — the rank window. Controls how far the returned element’s rank is allowed to be from the true median rank N/2N/2, measured as a fraction of NN. Smaller ε\varepsilon means a tighter window; at ε=0\varepsilon = 0 you’re back to demanding the exact median.

δ\delta — the failure probability. Controls how often the algorithm is allowed to return something outside that window. Smaller δ\delta means the algorithm is allowed to mess up less often.

Putting it together: “with probability at least 1δ1 - \delta, the returned element’s rank is within an εN\varepsilon N-sized window of the true median rank N/2N/2.”

(b) Describe the sampling algorithm and explain in words why the sampled median is likely to be a good approximate median.

Solution
  • Sample tt elements T={y1,,yt}T = \{y_1, \ldots, y_t\} uniformly and independently from the stream.
  • Output the median mm of the sample.

If we sample uniformly, then in expectation only a (12ε)(\tfrac{1}{2} - \varepsilon) fraction of samples fall in either bad region. For the sample median to land in SLS_L, more than half of the samples must come from SLS_L, which is far above the expected fraction. Chernoff makes this unlikely once tt is large enough.

(c) Define the bad regions SLS_L and SRS_R and write down the failure event of the algorithm in terms of TL|T_L| and TR|T_R|, where TT is the sample set.

Solution

Sort the stream and let

  • SLS_L = the smallest (12ε)N(\tfrac{1}{2} - \varepsilon)N elements,
  • SRS_R = the largest (12ε)N(\tfrac{1}{2} - \varepsilon)N elements.

Let TL=TSLT_L = T \cap S_L and TR=TSRT_R = T \cap S_R. The algorithm fails iff

TL>t/2orTR>t/2.|T_L| > t/2 \quad \text{or} \quad |T_R| > t/2.

(d) Using a Chernoff bound (stated below), show that t=O ⁣(ε2logδ1)t = O\!\left(\varepsilon^{-2} \log \delta^{-1}\right) samples suffice.

Solution

For TLT_L, each sample lands in SLS_L independently with probability p=12εp = \tfrac{1}{2} - \varepsilon. So E[TL]=(12ε)t=μ\mathbb{E}[|T_L|] = (\tfrac{1}{2} - \varepsilon) t = \mu. The threshold t/2t/2 in multiplicative form is (1+γ)μ(1+\gamma)\mu with

1+γ=t/2(1/2ε)t=1/21/2ε    γ=ε1/2ε.1 + \gamma = \frac{t/2}{(1/2 - \varepsilon)t} = \frac{1/2}{1/2 - \varepsilon} \implies \gamma = \frac{\varepsilon}{1/2 - \varepsilon}.

Chernoff gives

Pr(TL>t/2)exp ⁣(γ2μ3)=exp ⁣(13ε21/2εt).\Pr(|T_L| > t/2) \le \exp\!\left(-\frac{\gamma^2 \mu}{3}\right) = \exp\!\left(-\frac{1}{3}\cdot\frac{\varepsilon^2}{1/2-\varepsilon}\cdot t\right).

A symmetric bound holds for TRT_R. Union bound:

Pr(fail)2exp ⁣(ε23(1/2ε)t).\Pr(\text{fail}) \le 2\exp\!\left(-\frac{\varepsilon^2}{3(1/2 - \varepsilon)}\, t\right).

Setting this δ\le \delta and solving yields

t3(1/2ε)ε2ln ⁣2δ=O ⁣(1ε2log ⁣1δ).t \ge \frac{3(1/2 - \varepsilon)}{\varepsilon^2}\, \ln\!\frac{2}{\delta} = O\!\left(\frac{1}{\varepsilon^2} \log\!\frac{1}{\delta}\right).

(e) Suppose ε=0.05\varepsilon = 0.05 and δ=0.01\delta = 0.01. Give a concrete value of tt (using the constant 33 from the Chernoff bound).

Solution

With ε=0.05\varepsilon = 0.05 and δ=0.01\delta = 0.01:

t3(0.45)(0.05)2ln ⁣20.01=1.350.0025ln2005405.302862.t \ge \frac{3(0.45)}{(0.05)^2}\, \ln\!\frac{2}{0.01} = \frac{1.35}{0.0025}\, \ln 200 \approx 540 \cdot 5.30 \approx 2862.

So roughly t2862t \approx 2862 samples suffice.


Problem 4: AMS Sampling for the Second Frequency Moment

Section titled “Problem 4: AMS Sampling for the Second Frequency Moment”

This topic was Question 6 on the sample final. Recall F2=i=1nf(i)2F_2 = \sum_{i=1}^{n} f(i)^2 for a stream of length mm over universe {1,,n}\{1,\ldots,n\}.

(a) What is the kk-th frequency moment FkF_k of a stream? What do F0F_0, F1F_1, and F2F_2 correspond to in plain English, and why might a data engineer care about F2F_2 specifically?

Solution

Frequency moments are a family of summary statistics that capture different “shapes” of how items appear in a stream. If f(i)f(i) is the number of times item ii shows up, then the kk-th frequency moment is just the sum of those counts raised to the kk-th power:

Fk=i=1nf(i)k.F_k = \sum_{i=1}^{n} f(i)^k.

The value of kk changes what aspect of the stream you’re capturing:

  • F0F_0 counts the number of distinct items in the stream (using the convention 00=00^0 = 0, so absent items contribute nothing).
  • F1=mF_1 = m is just the total stream length — every item contributes its raw frequency, so you’re really just adding up all the appearances.
  • F2F_2 is the sum of squared frequencies. Squaring amplifies large frequencies, so F2F_2 becomes a measure of how concentrated the stream is: it’s large when a few items dominate, small when items are spread evenly.

Why care about F2F_2 in particular? It’s exactly the kind of statistic the Gini index uses in economics to measure income inequality — a few people earning huge amounts blow up the squared sum, while an even distribution keeps it small. The same idea shows up in databases, where F2F_2 estimates the cost of a self-join (you’re summing the number of matching pairs per key).

(b) State the AMS sampling algorithm. Describe what XX is output for a single sample.

Solution
  • Sample a position i{1,,m}i \in \{1,\ldots,m\} uniformly at random.
  • Let rr be the number of occurrences of xix_i from position ii to the end of the stream.
  • Output X=m(r2(r1)2)=m(2r1)X = m\bigl(r^2 - (r-1)^2\bigr) = m(2r - 1).

(c) Prove E[X]=F2\mathbb{E}[X] = F_2.

Solution

Condition on the value of the sampled item being aa. Given this, the sampled occurrence is uniformly one of the f(a)f(a) occurrences of aa. If it is the jj-th occurrence, then r=f(a)j+1r = f(a) - j + 1, so X=m(2r1)X = m(2r - 1).

E[Xvalue=a]=1f(a)j=1f(a)m(2(f(a)j+1)1)=mf(a)r=1f(a)(2r1)=mf(a)2f(a)=mf(a).\mathbb{E}[X \mid \text{value} = a] = \frac{1}{f(a)} \sum_{j=1}^{f(a)} m\bigl(2(f(a) - j + 1) - 1\bigr) = \frac{m}{f(a)} \sum_{r=1}^{f(a)} (2r - 1) = \frac{m \cdot f(a)^2}{f(a)} = m \cdot f(a).

The value is aa with probability f(a)/mf(a)/m, so

E[X]=a=1nPr(value=a)E[Xvalue=a]=a=1nf(a)mmf(a)=a=1nf(a)2=F2.\mathbb{E}[X] = \sum_{a=1}^{n} \Pr(\text{value} = a)\, \mathbb{E}[X \mid \text{value}=a] = \sum_{a=1}^{n} \frac{f(a)}{m} \cdot m f(a) = \sum_{a=1}^{n} f(a)^2 = F_2.

(d) A single AMS run is unbiased but high variance. Describe the two-stage boosting strategy used to obtain an (ε,δ)(\varepsilon,\delta)-estimate of F2F_2.

Solution

The two-stage strategy (mirroring Morris++):

  • Stage 1 (averaging). Run tt independent AMS estimators in parallel and report their average YY. This reduces variance and gives an (ε,14)(\varepsilon, \tfrac{1}{4})-style guarantee.
  • Stage 2 (median). Run O(log1/δ)O(\log 1/\delta) independent copies of Stage 1 and report the median, boosting the failure probability down to δ\delta.

For part (e) below we directly use a Chernoff bound on the average and skip the median step.

(e) Using the bound X2mfX \le 2 m f^* (where f=maxif(i)f^* = \max_i f(i)) and the key lemma mfF2n\frac{m f^*}{F_2} \le \sqrt{n}, show that

t=O ⁣(nε2log ⁣1δ)t = O\!\left(\frac{\sqrt{n}}{\varepsilon^2}\, \log\!\frac{1}{\delta}\right)

independent copies suffice when their average is reported.

You may use the non-Bernoulli Chernoff bound: if X1,,XtX_1,\ldots,X_t are i.i.d. in [0,C][0,C] and Y=1tXiY = \frac{1}{t}\sum X_i, then for 0<ε10 < \varepsilon \le 1,

Pr(YE[Y]εE[Y])2exp ⁣(ε2E[Y]t3C).\Pr\bigl(|Y - \mathbb{E}[Y]| \ge \varepsilon \mathbb{E}[Y]\bigr) \le 2\exp\!\left(-\frac{\varepsilon^2 \mathbb{E}[Y]\, t}{3 C}\right).
Solution

With C=2mfC = 2 m f^* and E[Y]=F2\mathbb{E}[Y] = F_2, the non-Bernoulli Chernoff bound gives

Pr(YF2εF2)2exp ⁣(ε2F2t32mf)=2exp ⁣(ε2t6mfF2).\Pr\bigl(|Y - F_2| \ge \varepsilon F_2\bigr) \le 2 \exp\!\left(-\frac{\varepsilon^2 F_2\, t}{3 \cdot 2 m f^*}\right) = 2 \exp\!\left(-\frac{\varepsilon^2\, t}{6 \cdot \frac{m f^*}{F_2}}\right).

Substituting the key lemma mfF2n\frac{m f^*}{F_2} \le \sqrt{n}:

Pr(YF2εF2)2exp ⁣(ε2t6n).\Pr\bigl(|Y - F_2| \ge \varepsilon F_2\bigr) \le 2\exp\!\left(-\frac{\varepsilon^2 t}{6 \sqrt{n}}\right).

Setting the right-hand side δ\le \delta and solving for tt:

t6nε2ln ⁣2δ=O ⁣(nε2log ⁣1δ).t \ge \frac{6 \sqrt{n}}{\varepsilon^2} \, \ln\!\frac{2}{\delta} = O\!\left(\frac{\sqrt{n}}{\varepsilon^2} \log\!\frac{1}{\delta}\right).

Problem 5: Counting Distinct Elements - Idealized Algorithm with Max-Hash

Section titled “Problem 5: Counting Distinct Elements - Idealized Algorithm with Max-Hash”

This is the explicit “final question” the professor posed at the end of Lecture 20. Your friend implemented the idealized distinct-elements algorithm but tracked the maximum hash value instead of the minimum.

Recall the idealized setup: a hash function h:{1,,n}[0,1]h : \{1,\ldots,n\} \to [0,1] produces a uniform hash for each distinct element. Let tt be the number of distinct elements seen so far.

(a) State the distinct elements problem (F0F_0), and describe the original idealized algorithm that this problem is a variant of. What output formula does the original algorithm use, and what is the intuition behind it?

Solution

The distinct elements problem is the streaming version of “how many different items have I seen so far?” — think of it as counting unique users hitting a website, or unique products sold today, while the events fly by. Storing every distinct item in a set works but costs Ω(F0logn)\Omega(F_0 \log n) bits, which defeats the point of streaming. So we want a way to estimate F0F_0 using much less space.

The original min-hash algorithm does this with a surprisingly clean idea: hash every incoming element into the real interval [0,1][0, 1], but only bother to remember the smallest hash you’ve seen so far. Concretely:

  1. Pick a perfectly random hash h:{1,,n}[0,1]h : \{1, \ldots, n\} \to [0, 1] once at the start.
  2. As the stream arrives, hash each element and keep updating a running minimum minh(xi)\min h(x_i).
  3. When asked, output t^=1minh(xi)1\widehat{t} = \dfrac{1}{\min h(x_i)} - 1.

The intuition is what makes this work. If you’ve seen tt distinct elements, their hashes are like tt darts thrown uniformly at random into [0,1][0, 1]. The more darts you throw, the closer the smallest one gets to 00 — and in fact the expected minimum of tt uniform samples is exactly 1t+1\dfrac{1}{t+1}. So inverting and subtracting 11 recovers tt. Repeated occurrences of the same item don’t mess things up because they always hash to the same value, so only new items can pull the minimum lower.

(b) Is your friend’s algorithm doomed, or can the output be changed to give a correct estimate of tt?

Solution

Not doomed - the maximum of tt uniform random variables has a known expectation, and inverting the relationship lets us recover tt.

(c) Derive E[maxh(xi)]\mathbb{E}[\max h(x_i)] when there are tt distinct elements. Use the identity

E[X]=01Pr(X>x)dx\mathbb{E}[X] = \int_0^1 \Pr(X > x)\, dx

for X[0,1]X \in [0,1].

Solution

Let X=maxh(xi)X' = \max h(x_i). The event X>xX' > x means at least one of the tt hash values exceeds xx, but it’s easier to compute the complement: XxX' \le x means all hash values are x\le x. By independence,

Pr(Xx)=xt    Pr(X>x)=1xt.\Pr(X' \le x) = x^t \implies \Pr(X' > x) = 1 - x^t.

Using E[X]=01Pr(X>x)dx\mathbb{E}[X'] = \int_0^1 \Pr(X' > x)\, dx:

E[X]=01(1xt)dx=[xxt+1t+1]01=11t+1=tt+1.\mathbb{E}[X'] = \int_0^1 (1 - x^t)\, dx = \left[ x - \frac{x^{t+1}}{t+1} \right]_0^1 = 1 - \frac{1}{t+1} = \frac{t}{t+1}.

(d) Based on (c), what should the output of the modified algorithm be, in terms of X=maxh(xi)X' = \max h(x_i)?

Solution

If E[X]=tt+1\mathbb{E}[X'] = \tfrac{t}{t+1}, then

1E[X]=1t+1    11E[X]1=t.1 - \mathbb{E}[X'] = \frac{1}{t+1} \implies \frac{1}{1 - \mathbb{E}[X']} - 1 = t.

So the modified algorithm should output

t^=11X1.\widehat{t} = \frac{1}{1 - X'} - 1.

(e) As a sanity check, plug in t=2t = 2 and verify the output evaluates to 22 in expectation.

Solution

For t=2t = 2, E[X]=23\mathbb{E}[X'] = \tfrac{2}{3}. Plugging into the output:

112/31=11/31=31=2. \frac{1}{1 - 2/3} - 1 = \frac{1}{1/3} - 1 = 3 - 1 = 2.\ \checkmark

Problem 6: Flajolet-Martin Factor-3232 Guarantee

Section titled “Problem 6: Flajolet-Martin Factor-323232 Guarantee”

In lecture, we proved that the basic Flajolet-Martin algorithm (track the largest least-significant-bit position XmX_m in the hashes seen so far, output 2Xm+12^{X_m + 1}) gives a 3232-approximation with probability at least 23\tfrac{2}{3}.

(a) What problem does Flajolet-Martin solve, and why is it preferred over the idealized min-hash algorithm? Briefly restate the basic algorithm.

Solution

Flajolet-Martin solves the same problem as min-hash: counting the number of distinct elements (F0F_0) in a stream, in sublinear space. The reason FM is preferred is purely practical — the min-hash algorithm is idealized because it requires hashing into the continuous interval [0,1][0, 1], which means storing real numbers with full precision. That’d take infinitely many bits, which obviously breaks any space guarantee. FM gets around this by working with integer hashes and looking at their bit patterns instead.

The trick FM uses is to track trailing zeros in the binary representations of the hashes. The algorithm is:

  1. Pick a hash h:{1,,n}{0,1,,n1}h : \{1, \ldots, n\} \to \{0, 1, \ldots, n - 1\}.
  2. For each stream element xix_i, look at h(xi)h(x_i) in binary and find its least significant bit (LSB) position — the position of the rightmost 11, which is the same as the number of trailing zeros.
  3. Keep a running maximum XmX_m of all the LSB positions seen so far.
  4. When asked, output T^=2Xm+1\widehat{T} = 2^{X_m + 1}.

The intuition (developed more in part (b)) is that the more distinct elements you’ve seen, the more likely it is that some of them happened to hash to a value with a really long trailing-zeros pattern — so the running maximum LSB grows roughly like logT\log T.

(b) Why is the output set to 2Xm+12^{X_m + 1}? Give an informal one-paragraph justification using the geometric structure of trailing zeros.

Solution

In a uniformly random bit string, the chance of seeing trailing zeros at position jj (so the least significant 11 is at position jj) is 1/2j+11/2^{j+1}. Across TT distinct hashes, roughly T/2j+1T / 2^{j+1} end at position jj. The largest position jj likely to be observed is the one where T/2j+11T / 2^{j+1} \approx 1, i.e. jlogT1j \approx \log T - 1, so 2j+1T2^{j+1} \approx T. That motivates outputting 2Xm+12^{X_m + 1} as a rough estimate for TT.

(c) Identify the two bad events (overestimate and underestimate) in terms of J+=logT+5J_+ = \lfloor \log T \rfloor + 5 and J=logT5J_- = \lfloor \log T \rfloor - 5.

Solution
  • Overestimate: some element’s LSB position exceeds J+J_+, i.e. Z>J+1Z_{> J_+} \ge 1, where Z>jZ_{>j} counts distinct elements with LSB position >j> j. Then Xm+1>logT+5X_m + 1 > \log T + 5, so 2Xm+1>32T2^{X_m + 1} > 32 T.
  • Underestimate: no element has LSB position >J> J_-, i.e. Z>J=0Z_{> J_-} = 0. Then Xm+1<logT5X_m + 1 < \log T - 5, so 2Xm+1<T/322^{X_m + 1} < T / 32.

(d) Use Markov’s inequality on Z>J+Z_{>J_+} to show Pr(overestimate)132\Pr(\text{overestimate}) \le \tfrac{1}{32}.

Solution

Using 2logTT/22^{\lfloor \log T \rfloor} \ge T/2, we get 2J++1=2logT+626T/2=32T2^{J_+ + 1} = 2^{\lfloor \log T \rfloor + 6} \ge 2^6 \cdot T/2 = 32 T, so

E[Z>J+]<T2J++1T32T=132.\mathbb{E}[Z_{>J_+}] < \frac{T}{2^{J_+ + 1}} \le \frac{T}{32 T} = \frac{1}{32}.

By Markov,

Pr(Z>J+1)E[Z>J+]132.\Pr(Z_{>J_+} \ge 1) \le \mathbb{E}[Z_{>J_+}] \le \frac{1}{32}.

(e) Use Chebyshev’s inequality on Z>JZ_{>J_-} to show Pr(underestimate)116\Pr(\text{underestimate}) \le \tfrac{1}{16}.

Solution

Each indicator is Bernoulli with success probability 1/2j+11/2^{j+1} and Var(Yi)E[Yi]\mathrm{Var}(Y_i) \le \mathbb{E}[Y_i], so summing over independent elements gives Var(Z>J)E[Z>J]\mathrm{Var}(Z_{>J_-}) \le \mathbb{E}[Z_{>J_-}]. Also

E[Z>J]=T2J+1=T242logT16,\mathbb{E}[Z_{>J_-}] = \frac{T}{2^{J_- + 1}} = \frac{T \cdot 2^{4}}{2^{\lfloor \log T \rfloor}} \ge 16,

since 2logTT2^{\lfloor \log T \rfloor} \le T. By Chebyshev,

Pr(Z>J=0)Pr(Z>JE[Z>J]E[Z>J])Var(Z>J)(E[Z>J])21E[Z>J]116.\Pr(Z_{>J_-} = 0) \le \Pr\bigl(|Z_{>J_-} - \mathbb{E}[Z_{>J_-}]| \ge \mathbb{E}[Z_{>J_-}]\bigr) \le \frac{\mathrm{Var}(Z_{>J_-})}{(\mathbb{E}[Z_{>J_-}])^2} \le \frac{1}{\mathbb{E}[Z_{>J_-}]} \le \frac{1}{16}.

(f) Combine the bounds and conclude the 3232-approximation holds with probability 23\ge \tfrac{2}{3}.

Solution

By the union bound,

Pr(failure)Pr(overestimate)+Pr(underestimate)132+116=332<13.\Pr(\text{failure}) \le \Pr(\text{overestimate}) + \Pr(\text{underestimate}) \le \frac{1}{32} + \frac{1}{16} = \frac{3}{32} < \frac{1}{3}.

Hence the algorithm succeeds (output is within a factor of 3232 of TT) with probability at least 13/322/31 - 3/32 \ge 2/3.


Problem 7: Online Algorithms - Competitive Ratio

Section titled “Problem 7: Online Algorithms - Competitive Ratio”

This problem matches the question format the professor described for online algorithms: “I give you a problem, I give you an algorithm, you find its competitive ratio.” (See Lecture 23 transcript.)

(a) What is an online algorithm, and how does it differ from an offline algorithm? Define what it means for an online algorithm to be cc-competitive against an offline optimum OPT\mathrm{OPT}. In particular, over what is the ratio taken?

Solution

An online algorithm is one that has to react to its input as it arrives, piece by piece, without seeing what’s coming next. Once it commits to a decision, it usually can’t go back and change it. This is what makes problems like “should I rent skis today or buy them?” hard — you don’t know whether you’ll be skiing for one more day or another fifty.

An offline algorithm, on the other hand, gets to see the entire input up front before making any decisions. It’s free to plan optimally with full knowledge of the future. This isn’t realistic for most real-world problems, but it serves as a useful benchmark: the offline optimum (call it OPT\mathrm{OPT}) is the best possible cost we could ever hope to achieve, so it tells us how much we’re losing by being forced to act in the dark.

The competitive ratio measures exactly that loss. We say an online algorithm AA is cc-competitive if its cost is never more than cc times OPT\mathrm{OPT}‘s cost, no matter what the input looks like:

costA(σ)ccostOPT(σ)for every input sequence σ.\mathrm{cost}_A(\sigma) \le c \cdot \mathrm{cost}_{\mathrm{OPT}}(\sigma) \quad \text{for every input sequence } \sigma.

So the ratio is taken over all possible inputs, not averaged or expected — we’re guaranteeing that even on the worst possible input, AA is at most cc times worse than the offline optimum.

(b) Ski rental. Renting skis costs 11/day; buying skis costs BB. The skier does not know in advance how many days dd they will ski. Consider the online algorithm: rent for the first BB days, then buy. Prove this algorithm is 22-competitive.

Solution

Let dd be the (unknown) number of skiing days. The offline optimum is OPT=min(d,B)\mathrm{OPT} = \min(d, B).

  • Case 1: d<Bd < B. The online algorithm rents every day, paying dd. OPT=d\mathrm{OPT} = d. Ratio =1= 1.
  • Case 2: dBd \ge B. The online algorithm rents BB days, then buys, total B+B=2BB + B = 2B. OPT=B\mathrm{OPT} = B. Ratio =2= 2.

So in the worst case (Case 2), ALG(d)OPT(d)=2\frac{\text{ALG}(d)}{\mathrm{OPT}(d)} = 2, and the algorithm is exactly 22-competitive.

(c) Pizza finding. You are on a number line at position 00, looking for a pizza shop at unknown position ±d\pm d (sign unknown). Consider the online algorithm: walk 11 right, return; walk 22 left, return; walk 44 right, return; … doubling each step and alternating direction. State (without re-proving) the competitive ratio of this algorithm.

Solution

The doubling/zig-zag strategy gives a competitive ratio of 99:

cost(ALG)9OPT.\text{cost}(\text{ALG}) \le 9 \cdot \mathrm{OPT}.

(The proof was assigned as homework in class.)

(d) Suppose someone proposes the following algorithm for ski rental: always buy on day 11. Compute its competitive ratio. Find a sequence (i.e., a value of dd) that demonstrates this ratio.

Solution

The online algorithm pays BB regardless of dd. The offline OPT is min(d,B)\min(d, B), which is minimized for small dd.

The competitive ratio is

supdBmin(d,B).\sup_d \frac{B}{\min(d, B)}.

When d=1d = 1, OPT=1\mathrm{OPT} = 1 and the ratio is BB. As d0d \to 0 or for d=1d = 1 this gives ratio BB, which is Ω(B)\Omega(B) - linear in BB. So “always buy” is BB-competitive in the worst case, which is bad whenever BB is large. The bad sequence is d=1d = 1 (the skier only skis one day).


The list update problem: maintain a linked list of nn keys under a request sequence r1,,rmr_1,\ldots,r_m with m>nm > n. Accessing the key at position jj costs jj; after the access, you may move the key freely to any earlier position for free.

(a) Restate the list update problem in your own words. What is the cost incurred when you access an element, and what operation is allowed for free after each access? Why is it intuitive that frequently-accessed keys should sit near the front of the list?

Solution

You maintain a linked list of nn keys. Each request asks for some specific key; to serve it, you walk from the front of the list to that key, paying a cost equal to the key’s current position (so position 1 costs 1, position nn costs nn). After serving the request, you may move the just-accessed key to any earlier position in the list for free; moving any other key around still costs.

Intuitively, keys requested often should sit near the front so that future accesses to them are cheap. Sequences that repeatedly ask for keys in the back of the list are the costly ones, so a good list-update strategy tries to keep the “hot” keys at the front.

(b) Consider the “do nothing” algorithm. Find a request sequence on which this algorithm has competitive ratio Ω(n)\Omega(n).

Solution

Start with the list 12n1 \to 2 \to \cdots \to n and request ri=nr_i = n for i=1,,mi = 1,\ldots,m. Each access costs nn, so “do nothing” pays mnmn. An offline algorithm could move nn to the front after its first request, paying n+(m1)12mn + (m-1) \cdot 1 \le 2m. Hence

mn2m=n2=Ω(n).\frac{mn}{2m} = \frac{n}{2} = \Omega(n).

(c) Consider the “order by current frequency” algorithm: maintain the list so that keys are sorted by how many times they have been requested so far. Find a request sequence on which this algorithm has competitive ratio Ω(n)\Omega(n).

Solution

Request key 11 exactly nn times, then key 22 exactly nn times, …, then key nn exactly nn times. After about half the sequence, the first half of the list contains {1,,n/2}\{1,\ldots,n/2\} and the second half contains {n/2+1,,n}\{n/2+1,\ldots,n\} - but the next half of the sequence asks for the second-half keys, each costing at least n/2n/2. The cost of “order by frequency” is at least n22n2=Ω(n3)\frac{n^2}{2} \cdot \frac{n}{2} = \Omega(n^3), while moving each new key to the front pays at most n+(n1)12nn + (n-1) \cdot 1 \le 2n per distinct key, hence OPT2n2\mathrm{OPT} \le 2n^2. Ratio =n/8=Ω(n)= n/8 = \Omega(n).

(d) State the theorem proved (without re-doing the proof) about Move-to-Front (MTF): what does it say about costMTF(S)\mathrm{cost}_{\mathrm{MTF}}(S) in terms of OPT(S)\mathrm{OPT}(S), mm, and nn? Why does this give a competitive ratio strictly less than 22 for long sequences?

Solution

For every request sequence SS of length mm,

costMTF(S)2OPT(S)m+(n2).\mathrm{cost}_{\mathrm{MTF}}(S) \le 2 \cdot \mathrm{OPT}(S) - m + \binom{n}{2}.

When m(n2)m \gg \binom{n}{2}, the additive term (m(n2))-(m - \binom{n}{2}) is negative, so

costMTF(S)<2OPT(S).\mathrm{cost}_{\mathrm{MTF}}(S) < 2 \cdot \mathrm{OPT}(S).

Hence MTF’s competitive ratio is strictly less than 22 for long enough sequences.


Problem 9: Paging - LRU, FIFO, and Resource Augmentation

Section titled “Problem 9: Paging - LRU, FIFO, and Resource Augmentation”

(a) Define the cost of a paging algorithm and what it means for an online paging algorithm to be kk-competitive.

Solution

In paging, the only thing we care about is how often the cache fails us. The cost of a paging algorithm on a given request sequence is simply the number of cache misses it incurs — every time a requested item isn’t already in the cache and has to be fetched fresh, that’s one unit of cost. Hits are free.

We compare paging algorithms to the offline optimum OPT\mathrm{OPT} (which gets to see all future requests and evict perfectly using Farthest-in-Future). An online paging algorithm AA running with a cache of size kk is kk-competitive if, no matter what sequence of requests SS comes in, AA‘s misses are at most kk times OPT\mathrm{OPT}‘s:

costA(S)kcostOPT(S).\mathrm{cost}_A(S) \le k \cdot \mathrm{cost}_{\mathrm{OPT}}(S).

The catch with kk-competitive is that kk is also the cache size, so the guarantee gets worse as the cache gets bigger. That’s a discouraging-sounding bound, but a classical result of Sleator-Tarjan says you can’t do better than this with a deterministic online algorithm.

(b) State (without proof) the Sleator-Tarjan theorem about LRU and FIFO.

Solution

Both LRU and FIFO are kk-competitive (where kk is the cache size). Moreover, no deterministic online paging algorithm can be better than kk-competitive.

(c) Suppose the online algorithm has cache size kk and OPT has cache size k<kk' < k. State the competitive ratio of LRU and FIFO under this resource-augmentation model.

Solution

If the online algorithm has cache size kk and OPT has cache size k<kk' < k, then LRU and FIFO are

kkk+1-competitive.\frac{k}{k - k' + 1}\text{-competitive}.

(d) Apply the formula in (c): if k=2000k = 2000 and k=1000k' = 1000, what is the resulting competitive ratio of LRU and FIFO?

Solution

With k=2000k = 2000, k=1000k' = 1000:

kkk+1=200020001000+1=200010012.\frac{k}{k - k' + 1} = \frac{2000}{2000 - 1000 + 1} = \frac{2000}{1001} \approx 2.

So LRU and FIFO are roughly 22-competitive when given twice the cache that OPT has.


Problem 10: Multiplicative Weight Updates (Experts)

Section titled “Problem 10: Multiplicative Weight Updates (Experts)”

For this topic the professor explicitly said you should understand the setup and the algorithm; the full proof is not required for the exam.

(a) State the experts’ problem. What are the inputs, what does the algorithm output at each time step, and what feedback do we receive?

Solution

The experts’ problem is a stylized version of a real-world dilemma: you have access to nn different “experts” (advisors, models, weather services, financial gurus, whatever) and every day you have to make a binary decision based on what they recommend — say, buy or sell. Some experts are reliable, some are not, but you don’t know in advance which is which.

Each round t=1,2,t = 1, 2, \ldots proceeds as follows:

  • All nn experts simultaneously announce their suggestion for the round (buy or sell).
  • Your algorithm has to commit to one of the two actions, before seeing the truth.
  • Then the actual outcome is revealed, and each expert is judged “correct” or “mistake” based on whether their suggestion matched.

The whole goal here is to design a strategy whose total number of mistakes is comparable to the number of mistakes made by the best expert in hindsight — that is, the one who turns out after the fact to have been most accurate. The catch is that you don’t know which expert that’s going to be while the rounds are happening, so you can’t just blindly follow one of them.

(b) Describe the Weighted Majority algorithm: how weights are initialized, how decisions are made each round, and how weights are updated after the round.

Solution
  • Initialize wi(1)=1w_i^{(1)} = 1 for all i[n]i \in [n].
  • At round tt, compute total weight Φ(t)=iwi(t)\Phi^{(t)} = \sum_i w_i^{(t)}. Take the weighted vote: choose the action whose supporting experts hold Φ(t)/2\ge \Phi^{(t)}/2 of the weight (break ties arbitrarily, e.g. flip a coin).
  • After the truth is revealed, update each expert’s weight: wi(t+1)={wi(t)if expert i was correct,(1ε)wi(t)if expert i made a mistake.w_i^{(t+1)} = \begin{cases} w_i^{(t)} & \text{if expert } i \text{ was correct,}\\ (1 - \varepsilon) w_i^{(t)} & \text{if expert } i \text{ made a mistake.} \end{cases}

(c) State the mistake-bound theorem proved in class. What are the meanings of m(t)m^{(t)}, mi(t)m_i^{(t)}, nn, and ε\varepsilon in the bound?

Solution

After tt rounds, for every expert i[n]i \in [n],

m(t)2(1+ε)mi(t)+2lnnε,m^{(t)} \le 2(1 + \varepsilon)\, m_i^{(t)} + \frac{2 \ln n}{\varepsilon},

where:

  • m(t)m^{(t)} = number of mistakes made by Weighted Majority through round tt,
  • mi(t)m_i^{(t)} = number of mistakes made by expert ii through round tt,
  • nn = number of experts,
  • ε>0\varepsilon > 0 = the small penalty parameter used in the multiplicative update.

In particular, taking ii to be the best expert in hindsight,

m(t)2(1+ε)mbest(t)+2lnnε.m^{(t)} \le 2(1 + \varepsilon)\, m_{\text{best}}^{(t)} + \frac{2 \ln n}{\varepsilon}.

(d) Briefly explain what the bound implies as tt grows large (informally - no proof needed).

Solution

The additive 2lnnε\frac{2 \ln n}{\varepsilon} term is independent of tt, so it becomes negligible as tt grows. The multiplicative factor 2(1+ε)2(1+\varepsilon) shows that the algorithm pays at most roughly twice as many mistakes as the best expert, with a slight ε\varepsilon-overhead that can be tuned. So Weighted Majority is competitive with the best expert even though it must commit before knowing the truth each round.


The professor explicitly described one exam-style question on JL (Lecture 25 transcript): “I give you NN, DD, DD'; calculate the best ε\varepsilon you can get.”

(a) State the Johnson-Lindenstrauss lemma. Be clear about what NN, DD, DD', and ε\varepsilon each mean and what the guaranteed distortion is.

Solution

The Johnson-Lindenstrauss lemma says that if you have a bunch of points sitting in some absurdly high-dimensional space, you can squash them down into a much lower-dimensional space and still preserve all the pairwise distances almost exactly — as long as you allow a small amount of distortion. The remarkable part is that the target dimension only needs to depend on how many points you have and how much distortion you’re willing to accept, not on how big the original dimension was.

Formally: for any 0<ε<10 < \varepsilon < 1 and any integer N>1N > 1, if you pick a target dimension

DlogNε2,D' \ge \frac{\log N}{\varepsilon^2},

then for any set SS of NN points in RD\mathbb{R}^D there exists a map F:RDRDF : \mathbb{R}^D \to \mathbb{R}^{D'} such that every pairwise distance is preserved up to a (1±ε)(1 \pm \varepsilon) factor:

(1ε)xyF(x)F(y)(1+ε)xyfor all x,yS.(1 - \varepsilon)\, \|x - y\| \le \|F(x) - F(y)\| \le (1 + \varepsilon)\, \|x - y\| \quad \text{for all } x, y \in S.

The four parameters:

  • NN — how many points you have.
  • DD — the original (high) dimension your points live in.
  • DD' — the target (low) dimension you’re projecting down to.
  • ε\varepsilon — how much distortion you’re willing to tolerate. Smaller ε\varepsilon means distances are preserved more faithfully, but you need a higher DD' to get there.

Two things stand out about this. First, DD doesn’t show up anywhere in the target-dimension formula — so it doesn’t matter if your original space has 5,0005{,}000 dimensions or 5,000,0005{,}000{,}000. Second, the cost scales as logN\log N, which is incredibly slow, so even with millions of points you only need modest target dimensions.

(b) Suppose you have N=106N = 10^6 points in RD\mathbb{R}^D with D=5000D = 5000, and your downstream algorithm only works in D=200D' = 200 dimensions. What is the best (smallest) distortion ε\varepsilon you can guarantee via JL?

Solution

Solve D=logNε2D' = \frac{\log N}{\varepsilon^2} for ε\varepsilon:

ε=logND.\varepsilon = \sqrt{\frac{\log N}{D'}}.

Using natural log (the constant depends on the version of JL used, but the dependence on NN and DD' is the same):

ε=ln106200=13.82200=0.06910.263.\varepsilon = \sqrt{\frac{\ln 10^6}{200}} = \sqrt{\frac{13.82}{200}} = \sqrt{0.0691} \approx 0.263.

So roughly ε0.26\varepsilon \approx 0.26 - distances are preserved up to about ±26%\pm 26\%.

(c) Why is the target dimension DD' independent of the original dimension DD? Briefly explain in one or two sentences.

Solution

JL’s target dimension DlogN/ε2D' \ge \log N / \varepsilon^2 depends only on the number of points and the desired distortion, not on the ambient dimension. Intuitively, NN points span at most an NN-dimensional affine subspace regardless of how many ambient dimensions DD they sit in, so the geometric “intrinsic complexity” we must preserve scales with NN rather than DD.

(d) Suppose you’re OK with the projected distances being up to twice the original. What value of ε\varepsilon does that correspond to in the JL guarantee, and what’s the catch?

Solution

The JL guarantee is two-sided:

(1ε)xyF(x)F(y)(1+ε)xy(1 - \varepsilon)\,\|x - y\| \le \|F(x) - F(y)\| \le (1 + \varepsilon)\,\|x - y\|

“Distances at most doubled” only constrains the upper bound: (1+ε)=2ε=1(1 + \varepsilon) = 2 \Rightarrow \varepsilon = 1.

But here’s the catch: ε\varepsilon also controls the lower bound. At ε=1\varepsilon = 1, the lower bound becomes (11)xy=0(1 - 1)\,\|x - y\| = 0 — which means projected distances are allowed to shrink all the way to zero. Two genuinely far-apart points in the original space could end up at the same location after projection, completely destroying the nearest-neighbor information you were trying to preserve.

So when you “ask for” doubled distances by setting ε=1\varepsilon = 1, you’re also accepting that distances could be arbitrarily compressed — which is usually a much worse failure mode than distances being stretched, since stretching at least preserves ordering. This is why JL is typically stated for 0<ε<10 < \varepsilon < 1 and why useful applications usually pick much smaller ε\varepsilon values.

(e) Suppose you tolerate distances being off by up to ±50%\pm 50\% in either direction. What is ε\varepsilon? And as a rule of thumb, what range of ε\varepsilon makes a JL embedding actually useful for downstream tasks like nearest-neighbor search?

Solution

±50%\pm 50\% in either direction” means ε=0.5\varepsilon = 0.5. The guarantee becomes:

0.5xyF(x)F(y)1.5xy0.5\,\|x - y\| \le \|F(x) - F(y)\| \le 1.5\,\|x - y\|

so a projected distance could be anywhere between half and one-and-a-half times the original.

Is that useful? It depends on the task, but generally no — because the ratio between the largest and smallest possible projected distance for the same pair is 1.5/0.5=3×1.5 / 0.5 = 3\times. Two pairs of points that were originally the same distance apart could end up with projected distances differing by a factor of 33. For nearest-neighbor search where you’re trying to distinguish “close” from “very close”, that’s way too much noise.

Practical rule of thumb. For an embedding to be useful for NNS or clustering, you typically want ε0.1\varepsilon \le 0.1 or so, giving distances preserved within ±10%\pm 10\%. That keeps the worst-case “max/min projected distance” ratio close to 1.1/0.91.221.1 / 0.9 \approx 1.22, which is usually small enough that the geometric structure of the data survives the projection.

The cost of a smaller ε\varepsilon is a larger target dimension: D1/ε2D' \propto 1/\varepsilon^2, so cutting ε\varepsilon in half quadruples the required DD'. That’s the central tradeoff in dimension reduction.


Problem 12: Approximate Nearest Neighbor and LSH

Section titled “Problem 12: Approximate Nearest Neighbor and LSH”

Given nn data points x1,,xnx_1,\ldots,x_n in RD\mathbb{R}^D, exact NNS asks for argminid(q,xi)\arg\min_i d(q, x_i). The trivial brute-force solution runs in time O(nD)O(n D).

(a) State the nearest neighbor search (NNS) problem. What is the input, the query, and the desired output? What is the cost of the brute-force solution?

Solution

Nearest neighbor search is the basic primitive behind similarity search — think recommendation systems, image retrieval, or kk-NN classifiers in machine learning. The setup is that you preprocess a dataset of nn points once, then have to answer a stream of “given this new point, which of my stored points is most similar to it?” queries as fast as possible.

Concretely, you’re given nn data points x1,,xnRDx_1, \ldots, x_n \in \mathbb{R}^D at preprocessing time. Later, a query point qRDq \in \mathbb{R}^D arrives and you need to return the data point closest to qq under some distance metric (Euclidean, Hamming, etc.):

x=argmin1ind(q,xi).x^* = \arg\min_{1 \le i \le n}\,d(q, x_i).

The brute-force solution is the obvious one: just compute d(q,xi)d(q, x_i) for every single ii and return the smallest. Each distance computation between two DD-dimensional vectors takes O(D)O(D) time, and there are nn of them, so each query costs O(nD)O(nD).

In modern datasets both nn and DD are huge — think a billion images each represented as a 10001000-dimensional embedding — so O(nD)O(nD) per query is way too slow. That’s the gap NNS algorithms try to close by doing more clever preprocessing.

(b) State (informally) the hardness result of Williams-Alman for exact NNS. Why does this push us toward approximate NNS?

Solution

For exact NNS in high dimensions, any algorithm running in time O(n1αD)O(n^{1 - \alpha} D) for some α>0\alpha > 0 (i.e. strictly sub-linear in nn) would violate the Strong Exponential Time Hypothesis (SETH). Since refuting SETH is considered highly unlikely, this rules out fast exact NNS in high dimensions. So we relax the problem to approximate NNS.

(c) Define the cc-approximate NNS problem for an approximation factor c>1c > 1.

Solution

Given c>1c > 1, return some xjx_j satisfying

d(q,xj)cmin1ind(q,xi).d(q, x_j) \le c \cdot \min_{1 \le i \le n} d(q, x_i).

That is, return any point whose distance to qq is at most cc times the true nearest-neighbor distance.

(d) State the preprocessing and query time of the LSH data structure in terms of nn, DD, and a parameter ρ\rho. What is ρ\rho for Hamming distance, and what is ρ\rho for Euclidean distance?

Solution

LSH gives preprocessing/storage O(n1+ρD)O(n^{1 + \rho} D) and query time O(nρD)O(n^{\rho} D), where

  • Hamming distance: ρ=1c\rho = \frac{1}{c},
  • Euclidean distance: ρ=1c2\rho = \frac{1}{c^2}.

(e) If c=2c = 2, what is the query time of LSH for Hamming distance and for Euclidean distance? Express the answers as O()O(\cdot) expressions involving nn and DD.

Solution

For c=2c = 2:

  • Hamming: ρ=1/2\rho = 1/2, so query time O(n1/2D)=O(nD)O(n^{1/2} D) = O(\sqrt{n}\, D).
  • Euclidean: ρ=1/4\rho = 1/4, so query time O(n1/4D)O(n^{1/4} D).

Both are dramatically faster than the brute-force O(nD)O(n D) when nn is large.


Problem 13: Run AMS by Hand on a Concrete Stream

Section titled “Problem 13: Run AMS by Hand on a Concrete Stream”

This problem mirrors the style of Question 6 on the sample final (final-example).

Consider the stream

S={3,1,4,1,5,3,2,1,3,5},m=10,S = \{3, 1, 4, 1, 5, 3, 2, 1, 3, 5\}, \qquad m = 10,

with positions numbered 1,,101, \ldots, 10 from left to right.

(a) What does the second frequency moment F2F_2 count, and what is the AMS estimator XX for a single sampled position?

Solution

F2=i=1nf(i)2F_2 = \sum_{i=1}^{n} f(i)^2 is the sum of squared frequencies of items in the stream. For one sample at position jj, let rr be the number of occurrences of xjx_j at positions j\ge j. The AMS estimator is

X=m(r2(r1)2)=m(2r1).X = m\bigl(r^2 - (r-1)^2\bigr) = m(2r - 1).

A single run satisfies E[X]=F2\mathbb{E}[X] = F_2.

(b) Compute the true value of F2F_2 for the stream above.

Solution

Frequencies:

Item iiPositionsf(i)f(i)f(i)2f(i)^2
12, 4, 839
2711
31, 6, 939
4311
55, 1024
F2=9+1+9+1+4=24.F_2 = 9 + 1 + 9 + 1 + 4 = 24.

(c) Run AMS three times for F2F_2 on this stream, sampling positions 11, 44, and 1010 in turn. Show rr and XX for each run.

Solution
  • Run 1 — sample position 11 (x1=3x_1 = 3). Occurrences of 33 from position 11 onward: positions 1,6,91, 6, 9, so r=3r = 3.

    X1=10(231)=105=50.X_1 = 10(2 \cdot 3 - 1) = 10 \cdot 5 = 50.
  • Run 2 — sample position 44 (x4=1x_4 = 1). Occurrences of 11 from position 44 onward: positions 4,84, 8, so r=2r = 2.

    X2=10(221)=103=30.X_2 = 10(2 \cdot 2 - 1) = 10 \cdot 3 = 30.
  • Run 3 — sample position 1010 (x10=5x_{10} = 5). Occurrences of 55 from position 1010 onward: position 1010, so r=1r = 1.

    X3=10(211)=101=10.X_3 = 10(2 \cdot 1 - 1) = 10 \cdot 1 = 10.

(d) Average the three outputs Y=(X1+X2+X3)/3Y = (X_1 + X_2 + X_3)/3. By what percentage does the average overshoot or undershoot the true F2F_2? Briefly explain why one would not be alarmed by this discrepancy.

Solution Y=X1+X2+X33=50+30+103=903=30.Y = \frac{X_1 + X_2 + X_3}{3} = \frac{50 + 30 + 10}{3} = \frac{90}{3} = 30.

Compared to the true F2=24F_2 = 24, the average overshoots by 302424=25%\frac{30 - 24}{24} = 25\%.

Why is this expected? AMS is unbiased (E[X]=F2\mathbb{E}[X] = F_2) but its single-run variance is large. With only three samples, the empirical average can easily land 25%25\% off the truth. The full (ε,δ)(\varepsilon, \delta)-guarantee from Problem 4 requires O(n/ε2log(1/δ))O(\sqrt{n}/\varepsilon^2 \cdot \log(1/\delta)) samples, which would be far more than three even for modest ε\varepsilon.


Problem 14: Reverse-Engineering a Streaming Counter

Section titled “Problem 14: Reverse-Engineering a Streaming Counter”

This problem mirrors the style of Question 4 of the sample final (final-example), where you’re handed a friend’s data structure and must reverse-engineer information from it.

(a) Morris counter. Your friend ran a single Morris counter on an unknown stream and tells you the final value of the counter is C=8C = 8. What is their best point estimate N^\widehat{N} of the number of stream items processed?

Solution

The Morris counter output is N^=2C1\widehat{N} = 2^C - 1, so

N^=281=255.\widehat{N} = 2^8 - 1 = 255.

(b) Use the Morris counter’s variance result Var(N^)N2/2\mathrm{Var}(\widehat{N}) \approx N^2/2 to estimate the standard deviation of the estimate from part (a). Why does this make a single Morris counter unsatisfying for high-confidence applications?

Solution

The standard deviation of the estimate is roughly

σN2/2=N/2.\sigma \approx \sqrt{N^2/2} = N/\sqrt{2}.

Using the point estimate N^=255\widehat{N} = 255 as a stand-in for NN, σ255/2180\sigma \approx 255/\sqrt{2} \approx 180. So the true NN could plausibly be anywhere from roughly 7575 to 435435 on a single run — the standard deviation is on the order of the mean itself. This is exactly why one needs the Morris+/Morris++ boosting strategies for any application that demands a tight confidence interval.

(c) Flajolet-Martin. Another friend ran the basic Flajolet-Martin algorithm and tells you the final largest LSB position observed is P=9P = 9. What is their point estimate T^\widehat{T} of the number of distinct elements?

Solution

The basic Flajolet-Martin algorithm outputs T^=2P+1\widehat{T} = 2^{P+1}, so

T^=29+1=210=1024.\widehat{T} = 2^{9+1} = 2^{10} = 1024.

(d) Using the factor-3232 guarantee from Problem 6, give a range of plausible values for the true number of distinct elements TT (with probability 2/3\ge 2/3).

Solution

The factor-3232 guarantee says that with probability 2/3\ge 2/3,

T32T^32T.\frac{T}{32} \le \widehat{T} \le 32 T.

Rearranging both sides for TT with T^=1024\widehat{T} = 1024:

T^32T32T^    102432T321024    32T32768.\frac{\widehat{T}}{32} \le T \le 32 \widehat{T} \implies \frac{1024}{32} \le T \le 32 \cdot 1024 \implies 32 \le T \le 32768.

So the true distinct-element count is somewhere between 3232 and 32,76832{,}768 with probability at least 2/32/3 — a very wide window, again motivating the boosted Flajolet-Martin / HyperLogLog improvements.


Problem 15: Daily Weather Predictions via MWU

Section titled “Problem 15: Daily Weather Predictions via MWU”

You subscribe to n=3n = 3 competing weather services that each post a daily binary forecast (rain / no-rain) for your city. After T=1000T = 1000 days, you want a strategy whose total wrong predictions are not much more than those of whichever service turns out to have been the most accurate (which you do not know in advance).

You may use the mistake bound proven in class without re-deriving it:

m(t)2(1+ε)mi(t)+2lnnεi[n].m^{(t)} \le 2(1 + \varepsilon)\, m_i^{(t)} + \frac{2 \ln n}{\varepsilon} \qquad \forall\, i \in [n].

(a) Identify the experts and the binary decisions in MWU language. What is nn here?

Solution

The three weather services are the experts (n=3n = 3). The binary decision each day is “rain” or “no rain”. Your strategy’s job is to combine the three forecasts into a single daily decision.

(b) State the Weighted Majority algorithm in this context, including the initial weights, the daily decision rule, and the weight update rule after each day.

Solution
  • Initialize weights w1=w2=w3=1w_1 = w_2 = w_3 = 1.
  • Each morning, compute the total weight Φ=w1+w2+w3\Phi = w_1 + w_2 + w_3. Add up the weights of services predicting rain. If they sum to Φ/2\ge \Phi/2, your strategy predicts rain; otherwise predict no-rain.
  • After the actual weather is observed, for each service ii that was wrong, set wi(1ε)wiw_i \leftarrow (1 - \varepsilon) w_i; leave the weight of correctly predicting services unchanged.

(c) Suppose you choose ε=0.1\varepsilon = 0.1, and after the 1000 days the best-performing service was wrong on 5050 days. Give an upper bound on the number of days your Weighted Majority strategy was wrong. (You may approximate ln31.1\ln 3 \approx 1.1.)

Solution

Plug ε=0.1\varepsilon = 0.1, n=3n = 3, mbest(1000)=50m_{\text{best}}^{(1000)} = 50:

m(1000)2(1.1)(50)+2ln30.1110+21.10.1=110+22=132.m^{(1000)} \le 2(1.1)(50) + \frac{2 \ln 3}{0.1} \approx 110 + \frac{2 \cdot 1.1}{0.1} = 110 + 22 = 132.

So Weighted Majority makes at most about 132132 wrong predictions over the 10001000 days. That’s roughly 2.6×2.6 \times the best service — and crucially, you achieved this without knowing in advance which service would be best.

(d) What would happen to the bound from part (c) if you instead naively committed to always following service #1 from day 1 onward, and service #1 turned out to be the worst of the three?

Solution

If you blindly followed service #1 and it happened to be the worst service, you would make as many mistakes as the worst service — which could be arbitrarily large with no bound relative to the best. MWU’s whole point is that you get a guarantee comparable to the best expert in hindsight, regardless of which one that turns out to be, so you never get stuck mimicking a bad expert.


Problem 16: Hand-Simulating a Browser Cache

Section titled “Problem 16: Hand-Simulating a Browser Cache”

This problem mirrors the run-by-hand style of Question 6 of the sample final (final-example), applied to the paging problem.

Your browser holds at most k=3k = 3 tabs in memory. You visit URLs in the following order:

S={A,B,C,D,A,B,E,A,C,D}.S = \{A, B, C, D, A, B, E, A, C, D\}.

(a) Briefly describe each eviction policy: LRU, FIFO, and the offline optimal (Farthest-in-Future).

Solution
  • LRU (Least Recently Used): on a miss with a full cache, evict the item whose most recent access is furthest in the past.
  • FIFO (First In, First Out): on a miss with a full cache, evict the item that was inserted into the cache earliest. Accessing an item already in the cache does not change its FIFO position.
  • Optimal offline (Farthest-in-Future): on a miss with a full cache, evict the item whose next request is furthest in the future (or whose next request never comes).

(b) Simulate LRU on SS and report the total number of cache misses.

Solution

Track the cache and recency order (most-recent on the left):

StepRequestHit/MissCache afterRecency order (left = most recent)
1AMiss{A}[A]
2BMiss{A, B}[B, A]
3CMiss{A, B, C}[C, B, A]
4DMiss{B, C, D}[D, C, B] (evicted A, the LRU)
5AMiss{C, D, A}[A, D, C] (evicted B)
6BMiss{D, A, B}[B, A, D] (evicted C)
7EMiss{A, B, E}[E, B, A] (evicted D)
8AHit{A, B, E}[A, E, B]
9CMiss{A, E, C}[C, A, E] (evicted B)
10DMiss{A, C, D}[D, C, A] (evicted E)

LRU misses: 9 (only step 8 is a hit).

(c) Simulate FIFO on SS and report the total number of cache misses.

Solution

Track the cache and the insertion-order queue (left = oldest):

StepRequestHit/MissCache afterQueue (left = first in)
1AMiss{A}[A]
2BMiss{A, B}[A, B]
3CMiss{A, B, C}[A, B, C]
4DMiss{B, C, D}[B, C, D] (evicted A)
5AMiss{C, D, A}[C, D, A] (evicted B)
6BMiss{D, A, B}[D, A, B] (evicted C)
7EMiss{A, B, E}[A, B, E] (evicted D)
8AHit{A, B, E}[A, B, E] (queue unchanged)
9CMiss{B, E, C}[B, E, C] (evicted A)
10DMiss{E, C, D}[E, C, D] (evicted B)

FIFO misses: 9 (only step 8 is a hit).

(d) Simulate the optimal offline (Farthest-in-Future) algorithm and report the total misses.

Solution

At each miss, look ahead at the next occurrence of each cached item; evict the one whose next request is farthest (or never).

StepRequestHit/MissCache afterEviction reasoning
1AMiss{A}-
2BMiss{A, B}-
3CMiss{A, B, C}-
4DMiss{A, B, D}Next A@5, B@6, C@9. Evict C (farthest).
5AHit{A, B, D}-
6BHit{A, B, D}-
7EMiss{A, D, E}Next A@8, B never, D@10. Evict B (never).
8AHit{A, D, E}-
9CMiss{D, E, C}Next A never, D@10, E never. Evict A.
10DHit{D, E, C}-

OPT misses: 6 (steps 1, 2, 3, 4, 7, 9).

(e) What is the empirical ratio (algorithm misses / OPT misses) for LRU and FIFO on this sequence? How does it compare to the worst-case guarantee of kk-competitiveness for k=3k = 3?

Solution LRU missesOPT misses=96=1.5,FIFO missesOPT misses=96=1.5.\frac{\mathrm{LRU\ misses}}{\mathrm{OPT\ misses}} = \frac{9}{6} = 1.5, \qquad \frac{\mathrm{FIFO\ misses}}{\mathrm{OPT\ misses}} = \frac{9}{6} = 1.5.

By Sleator-Tarjan, the worst-case bound for both LRU and FIFO at k=3k = 3 is 33-competitive, i.e. ratio 3\le 3. On this particular sequence the achieved ratio is only 1.51.5, well within the worst-case bound. (The kk-competitive bound is tight only for adversarial sequences — for typical traces, both LRU and FIFO usually perform much better than the worst case.)