Final Prep: Problem Sets
Problem 1: Morris Counter - Streaming Algorithms
Section titled “Problem 1: Morris Counter - Streaming Algorithms”(a) What is an -guarantee for a randomized estimator of some quantity ? What do and each control?
Solution
An -guarantee is a way of formally expressing that an algorithm gives a good approximate answer with high probability. It has two components:
— the accuracy parameter. This controls how close the estimate is to the true answer. Specifically, it says the estimate is within a multiplicative factor of of the true value:
So a smaller means a tighter, more accurate estimate.
— the failure probability. This controls how often the algorithm is allowed to be wrong. The guarantee holds with probability at least :
So a smaller means the algorithm fails less often.
Putting it together. The full -guarantee says:
With probability at least , my estimate is within of the true answer.
The tradeoff. You can always make and smaller — but it costs you. Typically making them smaller requires running more independent copies of the algorithm (boosting via median-of-means), which increases space and time usage. So and let you precisely tune the accuracy vs. resource tradeoff.
(b) Describe the Morris Counter algorithm and analyze its space complexity compared to a naive counting approach. What is the trade-off being made?
Solution
Morris Counter algorithm:
- Initialize counter
- On each new stream element, increment to with probability
- Return estimate
Space complexity: Only is stored, where . This requires bits.
Naive approach: Storing the exact count requires bits.
Trade-off: By randomizing increments, we achieve exponential space savings ( vs. ), but accept a random error in the estimate. The output is approximate, not exact.
(c) What are the mean and variance of the Morris Counter’s output? Why is the variance a problem?
Solution
Mean and variance:
From the Morris Counter algorithm:
- (unbiased)
Why it’s a problem: The standard deviation is , so the output can deviate wildly from the true count. With high probability, the estimate could be off by or more, making a single Morris Counter unreliable for high-confidence estimates.
(d) Describe three successive strategies for improving the Morris Counter’s reliability. How does each strategy reduce the failure probability, and what is the total space cost?
Solution
Three successive strategies:
Strategy 1: Morris (basic)
- Run a single Morris Counter
- Success probability: undefined (no confidence guarantee)
- Failure probability: unbounded
- Space: bits
Strategy 2: Morris+ (weak estimate via averaging)
- Run independent Morris Counters in parallel
- Return the average of the estimates
- By Chebyshev’s inequality (or Exercise 4.9(b)), this gives an -estimate
- Success probability: (failure probability )
- Space: bits
Strategy 3: Morris++ (strong estimate via boosting)
- Run independent Morris+ instances in parallel
- Return the median of their outputs
- By the median-of-weak-estimates trick (Exercise 4.9(c)), this gives a -estimate
- Success probability: (arbitrary confidence)
- Space: bits
Each strategy improves failure probability at the cost of more parallel instances (and more space).
(e) For an application requiring count estimation within of the true count with confidence , where and , how many parallel Morris counters are needed in each of the three strategies from part (d)?
Solution
Numerical calculation for , :
Morris: No reliability guarantee.
Morris+:
Morris++:
Total counters needed:
Problem 2: Count-Min Sketch
Section titled “Problem 2: Count-Min Sketch”The Count-Min Sketch is given as a problem on the sample final the professor distributed. Throughout, let the stream be with , and let be the true frequency of item .
(a) What is the heavy hitters / frequency estimation problem, and why does CMS pair with a min-heap to solve it? (No need to give the CMS data structure yet — describe the problem and the high-level role of CMS.)
Solution
The heavy hitters problem is the real-world question Amazon (or any platform that tracks views, purchases, clicks, etc.) actually wants to answer: out of all the items flowing past in a stream, which are the top- most popular ones right now? You can’t just count everything exactly because the universe of items is huge and the stream never stops, so we need something cheaper.
A natural way to solve heavy hitters is to break it into two pieces:
The sketch. A small data structure that, given any item , can quickly answer “roughly how many times has shown up so far?” — i.e. an estimate of its true frequency . This is the frequency estimation problem, and it’s what Count-Min Sketch is built for.
The heap. A min-heap of size that holds the current top- candidates. Whenever a new item arrives, we ask the sketch for and compare it against the smallest frequency currently in the heap. If the new item is larger, it kicks out the current minimum and takes its slot.
So CMS doesn’t solve heavy hitters by itself — it provides the cheap frequency lookups, and the min-heap turns those lookups into a rolling top-. We’re forced into this approximate setup because tracking exact frequencies would require space, which is exactly what sublinear-space streaming forbids.
(b) Describe the Count-Min Sketch data structure with parameters and , including how an update is processed and how a query for is answered. Why is the min over rows used and not, say, the average?
Solution
Pick hash functions with , and keep an counter matrix initialized to .
- Update : for each row , increment cell by .
- Query : return .
The min is used because each counter is at least (every occurrence of is counted) plus the contributions from other items hashing to the same cell. So every row gives an overestimate, and the smallest overestimate is the most accurate.
(c) Why does the sketch only overestimate, i.e. why is for every item ?
Solution
Every occurrence of in the stream increments by in every row . Therefore each row’s counter is at least , so the minimum across rows is also at least .
(d) State the -guarantee proved in class and indicate the values of and needed.
Solution
With
the Count-Min Sketch satisfies
This is an additive error in terms of the stream length , not a relative error.
(e) Prove the guarantee. You may use Markov’s inequality without proof.
Solution
Fix one row and let . Define . The expected number of other items that collide with in this row is at most (each of the at most non- stream elements lands in cell with probability ), so
By Markov,
For to exceed , every row’s error must be at least . Since rows use independent hash functions,
(f) Suppose , and . Compute the sketch size (in counters).
Solution
Total counters .
Problem 3: -Approximate Median
Section titled “Problem 3: (ε,δ)(\varepsilon,\delta)(ε,δ)-Approximate Median”This topic was Question 3 on the sample final.
Let be a stream. For , we want to return an element whose rank satisfies
with probability at least .
(a) What is the -approximate median problem? State the input, the output condition, and what the and parameters each control here.
Solution
The -approximate median problem is a relaxation of “find the exact median element of the stream”. Exact computation is too expensive in sublinear streaming space, so instead of insisting on the element with rank exactly , we settle for any element whose rank is close to the middle, and we allow ourselves to fail every so often.
Concretely, the input is a stream together with two knobs and in . The job is to return some element whose rank lands inside a window around :
and the guarantee says this must hold with probability at least .
— the rank window. Controls how far the returned element’s rank is allowed to be from the true median rank , measured as a fraction of . Smaller means a tighter window; at you’re back to demanding the exact median.
— the failure probability. Controls how often the algorithm is allowed to return something outside that window. Smaller means the algorithm is allowed to mess up less often.
Putting it together: “with probability at least , the returned element’s rank is within an -sized window of the true median rank .”
(b) Describe the sampling algorithm and explain in words why the sampled median is likely to be a good approximate median.
Solution
- Sample elements uniformly and independently from the stream.
- Output the median of the sample.
If we sample uniformly, then in expectation only a fraction of samples fall in either bad region. For the sample median to land in , more than half of the samples must come from , which is far above the expected fraction. Chernoff makes this unlikely once is large enough.
(c) Define the bad regions and and write down the failure event of the algorithm in terms of and , where is the sample set.
Solution
Sort the stream and let
- = the smallest elements,
- = the largest elements.
Let and . The algorithm fails iff
(d) Using a Chernoff bound (stated below), show that samples suffice.
Solution
For , each sample lands in independently with probability . So . The threshold in multiplicative form is with
Chernoff gives
A symmetric bound holds for . Union bound:
Setting this and solving yields
(e) Suppose and . Give a concrete value of (using the constant from the Chernoff bound).
Solution
With and :
So roughly samples suffice.
Problem 4: AMS Sampling for the Second Frequency Moment
Section titled “Problem 4: AMS Sampling for the Second Frequency Moment”This topic was Question 6 on the sample final. Recall for a stream of length over universe .
(a) What is the -th frequency moment of a stream? What do , , and correspond to in plain English, and why might a data engineer care about specifically?
Solution
Frequency moments are a family of summary statistics that capture different “shapes” of how items appear in a stream. If is the number of times item shows up, then the -th frequency moment is just the sum of those counts raised to the -th power:
The value of changes what aspect of the stream you’re capturing:
- counts the number of distinct items in the stream (using the convention , so absent items contribute nothing).
- is just the total stream length — every item contributes its raw frequency, so you’re really just adding up all the appearances.
- is the sum of squared frequencies. Squaring amplifies large frequencies, so becomes a measure of how concentrated the stream is: it’s large when a few items dominate, small when items are spread evenly.
Why care about in particular? It’s exactly the kind of statistic the Gini index uses in economics to measure income inequality — a few people earning huge amounts blow up the squared sum, while an even distribution keeps it small. The same idea shows up in databases, where estimates the cost of a self-join (you’re summing the number of matching pairs per key).
(b) State the AMS sampling algorithm. Describe what is output for a single sample.
Solution
- Sample a position uniformly at random.
- Let be the number of occurrences of from position to the end of the stream.
- Output .
(c) Prove .
Solution
Condition on the value of the sampled item being . Given this, the sampled occurrence is uniformly one of the occurrences of . If it is the -th occurrence, then , so .
The value is with probability , so
(d) A single AMS run is unbiased but high variance. Describe the two-stage boosting strategy used to obtain an -estimate of .
Solution
The two-stage strategy (mirroring Morris++):
- Stage 1 (averaging). Run independent AMS estimators in parallel and report their average . This reduces variance and gives an -style guarantee.
- Stage 2 (median). Run independent copies of Stage 1 and report the median, boosting the failure probability down to .
For part (e) below we directly use a Chernoff bound on the average and skip the median step.
(e) Using the bound (where ) and the key lemma , show that
independent copies suffice when their average is reported.
You may use the non-Bernoulli Chernoff bound: if are i.i.d. in and , then for ,
Solution
With and , the non-Bernoulli Chernoff bound gives
Substituting the key lemma :
Setting the right-hand side and solving for :
Problem 5: Counting Distinct Elements - Idealized Algorithm with Max-Hash
Section titled “Problem 5: Counting Distinct Elements - Idealized Algorithm with Max-Hash”This is the explicit “final question” the professor posed at the end of Lecture 20. Your friend implemented the idealized distinct-elements algorithm but tracked the maximum hash value instead of the minimum.
Recall the idealized setup: a hash function produces a uniform hash for each distinct element. Let be the number of distinct elements seen so far.
(a) State the distinct elements problem (), and describe the original idealized algorithm that this problem is a variant of. What output formula does the original algorithm use, and what is the intuition behind it?
Solution
The distinct elements problem is the streaming version of “how many different items have I seen so far?” — think of it as counting unique users hitting a website, or unique products sold today, while the events fly by. Storing every distinct item in a set works but costs bits, which defeats the point of streaming. So we want a way to estimate using much less space.
The original min-hash algorithm does this with a surprisingly clean idea: hash every incoming element into the real interval , but only bother to remember the smallest hash you’ve seen so far. Concretely:
- Pick a perfectly random hash once at the start.
- As the stream arrives, hash each element and keep updating a running minimum .
- When asked, output .
The intuition is what makes this work. If you’ve seen distinct elements, their hashes are like darts thrown uniformly at random into . The more darts you throw, the closer the smallest one gets to — and in fact the expected minimum of uniform samples is exactly . So inverting and subtracting recovers . Repeated occurrences of the same item don’t mess things up because they always hash to the same value, so only new items can pull the minimum lower.
(b) Is your friend’s algorithm doomed, or can the output be changed to give a correct estimate of ?
Solution
Not doomed - the maximum of uniform random variables has a known expectation, and inverting the relationship lets us recover .
(c) Derive when there are distinct elements. Use the identity
for .
Solution
Let . The event means at least one of the hash values exceeds , but it’s easier to compute the complement: means all hash values are . By independence,
Using :
(d) Based on (c), what should the output of the modified algorithm be, in terms of ?
Solution
If , then
So the modified algorithm should output
(e) As a sanity check, plug in and verify the output evaluates to in expectation.
Solution
For , . Plugging into the output:
Problem 6: Flajolet-Martin Factor- Guarantee
Section titled “Problem 6: Flajolet-Martin Factor-323232 Guarantee”In lecture, we proved that the basic Flajolet-Martin algorithm (track the largest least-significant-bit position in the hashes seen so far, output ) gives a -approximation with probability at least .
(a) What problem does Flajolet-Martin solve, and why is it preferred over the idealized min-hash algorithm? Briefly restate the basic algorithm.
Solution
Flajolet-Martin solves the same problem as min-hash: counting the number of distinct elements () in a stream, in sublinear space. The reason FM is preferred is purely practical — the min-hash algorithm is idealized because it requires hashing into the continuous interval , which means storing real numbers with full precision. That’d take infinitely many bits, which obviously breaks any space guarantee. FM gets around this by working with integer hashes and looking at their bit patterns instead.
The trick FM uses is to track trailing zeros in the binary representations of the hashes. The algorithm is:
- Pick a hash .
- For each stream element , look at in binary and find its least significant bit (LSB) position — the position of the rightmost , which is the same as the number of trailing zeros.
- Keep a running maximum of all the LSB positions seen so far.
- When asked, output .
The intuition (developed more in part (b)) is that the more distinct elements you’ve seen, the more likely it is that some of them happened to hash to a value with a really long trailing-zeros pattern — so the running maximum LSB grows roughly like .
(b) Why is the output set to ? Give an informal one-paragraph justification using the geometric structure of trailing zeros.
Solution
In a uniformly random bit string, the chance of seeing trailing zeros at position (so the least significant is at position ) is . Across distinct hashes, roughly end at position . The largest position likely to be observed is the one where , i.e. , so . That motivates outputting as a rough estimate for .
(c) Identify the two bad events (overestimate and underestimate) in terms of and .
Solution
- Overestimate: some element’s LSB position exceeds , i.e. , where counts distinct elements with LSB position . Then , so .
- Underestimate: no element has LSB position , i.e. . Then , so .
(d) Use Markov’s inequality on to show .
Solution
Using , we get , so
By Markov,
(e) Use Chebyshev’s inequality on to show .
Solution
Each indicator is Bernoulli with success probability and , so summing over independent elements gives . Also
since . By Chebyshev,
(f) Combine the bounds and conclude the -approximation holds with probability .
Solution
By the union bound,
Hence the algorithm succeeds (output is within a factor of of ) with probability at least .
Problem 7: Online Algorithms - Competitive Ratio
Section titled “Problem 7: Online Algorithms - Competitive Ratio”This problem matches the question format the professor described for online algorithms: “I give you a problem, I give you an algorithm, you find its competitive ratio.” (See Lecture 23 transcript.)
(a) What is an online algorithm, and how does it differ from an offline algorithm? Define what it means for an online algorithm to be -competitive against an offline optimum . In particular, over what is the ratio taken?
Solution
An online algorithm is one that has to react to its input as it arrives, piece by piece, without seeing what’s coming next. Once it commits to a decision, it usually can’t go back and change it. This is what makes problems like “should I rent skis today or buy them?” hard — you don’t know whether you’ll be skiing for one more day or another fifty.
An offline algorithm, on the other hand, gets to see the entire input up front before making any decisions. It’s free to plan optimally with full knowledge of the future. This isn’t realistic for most real-world problems, but it serves as a useful benchmark: the offline optimum (call it ) is the best possible cost we could ever hope to achieve, so it tells us how much we’re losing by being forced to act in the dark.
The competitive ratio measures exactly that loss. We say an online algorithm is -competitive if its cost is never more than times ‘s cost, no matter what the input looks like:
So the ratio is taken over all possible inputs, not averaged or expected — we’re guaranteeing that even on the worst possible input, is at most times worse than the offline optimum.
(b) Ski rental. Renting skis costs /day; buying skis costs . The skier does not know in advance how many days they will ski. Consider the online algorithm: rent for the first days, then buy. Prove this algorithm is -competitive.
Solution
Let be the (unknown) number of skiing days. The offline optimum is .
- Case 1: . The online algorithm rents every day, paying . . Ratio .
- Case 2: . The online algorithm rents days, then buys, total . . Ratio .
So in the worst case (Case 2), , and the algorithm is exactly -competitive.
(c) Pizza finding. You are on a number line at position , looking for a pizza shop at unknown position (sign unknown). Consider the online algorithm: walk right, return; walk left, return; walk right, return; … doubling each step and alternating direction. State (without re-proving) the competitive ratio of this algorithm.
Solution
The doubling/zig-zag strategy gives a competitive ratio of :
(The proof was assigned as homework in class.)
(d) Suppose someone proposes the following algorithm for ski rental: always buy on day . Compute its competitive ratio. Find a sequence (i.e., a value of ) that demonstrates this ratio.
Solution
The online algorithm pays regardless of . The offline OPT is , which is minimized for small .
The competitive ratio is
When , and the ratio is . As or for this gives ratio , which is - linear in . So “always buy” is -competitive in the worst case, which is bad whenever is large. The bad sequence is (the skier only skis one day).
Problem 8: List Update - Move-to-Front
Section titled “Problem 8: List Update - Move-to-Front”The list update problem: maintain a linked list of keys under a request sequence with . Accessing the key at position costs ; after the access, you may move the key freely to any earlier position for free.
(a) Restate the list update problem in your own words. What is the cost incurred when you access an element, and what operation is allowed for free after each access? Why is it intuitive that frequently-accessed keys should sit near the front of the list?
Solution
You maintain a linked list of keys. Each request asks for some specific key; to serve it, you walk from the front of the list to that key, paying a cost equal to the key’s current position (so position 1 costs 1, position costs ). After serving the request, you may move the just-accessed key to any earlier position in the list for free; moving any other key around still costs.
Intuitively, keys requested often should sit near the front so that future accesses to them are cheap. Sequences that repeatedly ask for keys in the back of the list are the costly ones, so a good list-update strategy tries to keep the “hot” keys at the front.
(b) Consider the “do nothing” algorithm. Find a request sequence on which this algorithm has competitive ratio .
Solution
Start with the list and request for . Each access costs , so “do nothing” pays . An offline algorithm could move to the front after its first request, paying . Hence
(c) Consider the “order by current frequency” algorithm: maintain the list so that keys are sorted by how many times they have been requested so far. Find a request sequence on which this algorithm has competitive ratio .
Solution
Request key exactly times, then key exactly times, …, then key exactly times. After about half the sequence, the first half of the list contains and the second half contains - but the next half of the sequence asks for the second-half keys, each costing at least . The cost of “order by frequency” is at least , while moving each new key to the front pays at most per distinct key, hence . Ratio .
(d) State the theorem proved (without re-doing the proof) about Move-to-Front (MTF): what does it say about in terms of , , and ? Why does this give a competitive ratio strictly less than for long sequences?
Solution
For every request sequence of length ,
When , the additive term is negative, so
Hence MTF’s competitive ratio is strictly less than for long enough sequences.
Problem 9: Paging - LRU, FIFO, and Resource Augmentation
Section titled “Problem 9: Paging - LRU, FIFO, and Resource Augmentation”(a) Define the cost of a paging algorithm and what it means for an online paging algorithm to be -competitive.
Solution
In paging, the only thing we care about is how often the cache fails us. The cost of a paging algorithm on a given request sequence is simply the number of cache misses it incurs — every time a requested item isn’t already in the cache and has to be fetched fresh, that’s one unit of cost. Hits are free.
We compare paging algorithms to the offline optimum (which gets to see all future requests and evict perfectly using Farthest-in-Future). An online paging algorithm running with a cache of size is -competitive if, no matter what sequence of requests comes in, ‘s misses are at most times ‘s:
The catch with -competitive is that is also the cache size, so the guarantee gets worse as the cache gets bigger. That’s a discouraging-sounding bound, but a classical result of Sleator-Tarjan says you can’t do better than this with a deterministic online algorithm.
(b) State (without proof) the Sleator-Tarjan theorem about LRU and FIFO.
Solution
Both LRU and FIFO are -competitive (where is the cache size). Moreover, no deterministic online paging algorithm can be better than -competitive.
(c) Suppose the online algorithm has cache size and OPT has cache size . State the competitive ratio of LRU and FIFO under this resource-augmentation model.
Solution
If the online algorithm has cache size and OPT has cache size , then LRU and FIFO are
(d) Apply the formula in (c): if and , what is the resulting competitive ratio of LRU and FIFO?
Solution
With , :
So LRU and FIFO are roughly -competitive when given twice the cache that OPT has.
Problem 10: Multiplicative Weight Updates (Experts)
Section titled “Problem 10: Multiplicative Weight Updates (Experts)”For this topic the professor explicitly said you should understand the setup and the algorithm; the full proof is not required for the exam.
(a) State the experts’ problem. What are the inputs, what does the algorithm output at each time step, and what feedback do we receive?
Solution
The experts’ problem is a stylized version of a real-world dilemma: you have access to different “experts” (advisors, models, weather services, financial gurus, whatever) and every day you have to make a binary decision based on what they recommend — say, buy or sell. Some experts are reliable, some are not, but you don’t know in advance which is which.
Each round proceeds as follows:
- All experts simultaneously announce their suggestion for the round (buy or sell).
- Your algorithm has to commit to one of the two actions, before seeing the truth.
- Then the actual outcome is revealed, and each expert is judged “correct” or “mistake” based on whether their suggestion matched.
The whole goal here is to design a strategy whose total number of mistakes is comparable to the number of mistakes made by the best expert in hindsight — that is, the one who turns out after the fact to have been most accurate. The catch is that you don’t know which expert that’s going to be while the rounds are happening, so you can’t just blindly follow one of them.
(b) Describe the Weighted Majority algorithm: how weights are initialized, how decisions are made each round, and how weights are updated after the round.
Solution
- Initialize for all .
- At round , compute total weight . Take the weighted vote: choose the action whose supporting experts hold of the weight (break ties arbitrarily, e.g. flip a coin).
- After the truth is revealed, update each expert’s weight:
(c) State the mistake-bound theorem proved in class. What are the meanings of , , , and in the bound?
Solution
After rounds, for every expert ,
where:
- = number of mistakes made by Weighted Majority through round ,
- = number of mistakes made by expert through round ,
- = number of experts,
- = the small penalty parameter used in the multiplicative update.
In particular, taking to be the best expert in hindsight,
(d) Briefly explain what the bound implies as grows large (informally - no proof needed).
Solution
The additive term is independent of , so it becomes negligible as grows. The multiplicative factor shows that the algorithm pays at most roughly twice as many mistakes as the best expert, with a slight -overhead that can be tuned. So Weighted Majority is competitive with the best expert even though it must commit before knowing the truth each round.
Problem 11: Johnson-Lindenstrauss Lemma
Section titled “Problem 11: Johnson-Lindenstrauss Lemma”The professor explicitly described one exam-style question on JL (Lecture 25 transcript): “I give you , , ; calculate the best you can get.”
(a) State the Johnson-Lindenstrauss lemma. Be clear about what , , , and each mean and what the guaranteed distortion is.
Solution
The Johnson-Lindenstrauss lemma says that if you have a bunch of points sitting in some absurdly high-dimensional space, you can squash them down into a much lower-dimensional space and still preserve all the pairwise distances almost exactly — as long as you allow a small amount of distortion. The remarkable part is that the target dimension only needs to depend on how many points you have and how much distortion you’re willing to accept, not on how big the original dimension was.
Formally: for any and any integer , if you pick a target dimension
then for any set of points in there exists a map such that every pairwise distance is preserved up to a factor:
The four parameters:
- — how many points you have.
- — the original (high) dimension your points live in.
- — the target (low) dimension you’re projecting down to.
- — how much distortion you’re willing to tolerate. Smaller means distances are preserved more faithfully, but you need a higher to get there.
Two things stand out about this. First, doesn’t show up anywhere in the target-dimension formula — so it doesn’t matter if your original space has dimensions or . Second, the cost scales as , which is incredibly slow, so even with millions of points you only need modest target dimensions.
(b) Suppose you have points in with , and your downstream algorithm only works in dimensions. What is the best (smallest) distortion you can guarantee via JL?
Solution
Solve for :
Using natural log (the constant depends on the version of JL used, but the dependence on and is the same):
So roughly - distances are preserved up to about .
(c) Why is the target dimension independent of the original dimension ? Briefly explain in one or two sentences.
Solution
JL’s target dimension depends only on the number of points and the desired distortion, not on the ambient dimension. Intuitively, points span at most an -dimensional affine subspace regardless of how many ambient dimensions they sit in, so the geometric “intrinsic complexity” we must preserve scales with rather than .
(d) Suppose you’re OK with the projected distances being up to twice the original. What value of does that correspond to in the JL guarantee, and what’s the catch?
Solution
The JL guarantee is two-sided:
“Distances at most doubled” only constrains the upper bound: .
But here’s the catch: also controls the lower bound. At , the lower bound becomes — which means projected distances are allowed to shrink all the way to zero. Two genuinely far-apart points in the original space could end up at the same location after projection, completely destroying the nearest-neighbor information you were trying to preserve.
So when you “ask for” doubled distances by setting , you’re also accepting that distances could be arbitrarily compressed — which is usually a much worse failure mode than distances being stretched, since stretching at least preserves ordering. This is why JL is typically stated for and why useful applications usually pick much smaller values.
(e) Suppose you tolerate distances being off by up to in either direction. What is ? And as a rule of thumb, what range of makes a JL embedding actually useful for downstream tasks like nearest-neighbor search?
Solution
” in either direction” means . The guarantee becomes:
so a projected distance could be anywhere between half and one-and-a-half times the original.
Is that useful? It depends on the task, but generally no — because the ratio between the largest and smallest possible projected distance for the same pair is . Two pairs of points that were originally the same distance apart could end up with projected distances differing by a factor of . For nearest-neighbor search where you’re trying to distinguish “close” from “very close”, that’s way too much noise.
Practical rule of thumb. For an embedding to be useful for NNS or clustering, you typically want or so, giving distances preserved within . That keeps the worst-case “max/min projected distance” ratio close to , which is usually small enough that the geometric structure of the data survives the projection.
The cost of a smaller is a larger target dimension: , so cutting in half quadruples the required . That’s the central tradeoff in dimension reduction.
Problem 12: Approximate Nearest Neighbor and LSH
Section titled “Problem 12: Approximate Nearest Neighbor and LSH”Given data points in , exact NNS asks for . The trivial brute-force solution runs in time .
(a) State the nearest neighbor search (NNS) problem. What is the input, the query, and the desired output? What is the cost of the brute-force solution?
Solution
Nearest neighbor search is the basic primitive behind similarity search — think recommendation systems, image retrieval, or -NN classifiers in machine learning. The setup is that you preprocess a dataset of points once, then have to answer a stream of “given this new point, which of my stored points is most similar to it?” queries as fast as possible.
Concretely, you’re given data points at preprocessing time. Later, a query point arrives and you need to return the data point closest to under some distance metric (Euclidean, Hamming, etc.):
The brute-force solution is the obvious one: just compute for every single and return the smallest. Each distance computation between two -dimensional vectors takes time, and there are of them, so each query costs .
In modern datasets both and are huge — think a billion images each represented as a -dimensional embedding — so per query is way too slow. That’s the gap NNS algorithms try to close by doing more clever preprocessing.
(b) State (informally) the hardness result of Williams-Alman for exact NNS. Why does this push us toward approximate NNS?
Solution
For exact NNS in high dimensions, any algorithm running in time for some (i.e. strictly sub-linear in ) would violate the Strong Exponential Time Hypothesis (SETH). Since refuting SETH is considered highly unlikely, this rules out fast exact NNS in high dimensions. So we relax the problem to approximate NNS.
(c) Define the -approximate NNS problem for an approximation factor .
Solution
Given , return some satisfying
That is, return any point whose distance to is at most times the true nearest-neighbor distance.
(d) State the preprocessing and query time of the LSH data structure in terms of , , and a parameter . What is for Hamming distance, and what is for Euclidean distance?
Solution
LSH gives preprocessing/storage and query time , where
- Hamming distance: ,
- Euclidean distance: .
(e) If , what is the query time of LSH for Hamming distance and for Euclidean distance? Express the answers as expressions involving and .
Solution
For :
- Hamming: , so query time .
- Euclidean: , so query time .
Both are dramatically faster than the brute-force when is large.
Problem 13: Run AMS by Hand on a Concrete Stream
Section titled “Problem 13: Run AMS by Hand on a Concrete Stream”This problem mirrors the style of Question 6 on the sample final (final-example).
Consider the stream
with positions numbered from left to right.
(a) What does the second frequency moment count, and what is the AMS estimator for a single sampled position?
Solution
is the sum of squared frequencies of items in the stream. For one sample at position , let be the number of occurrences of at positions . The AMS estimator is
A single run satisfies .
(b) Compute the true value of for the stream above.
Solution
Frequencies:
| Item | Positions | ||
|---|---|---|---|
| 1 | 2, 4, 8 | 3 | 9 |
| 2 | 7 | 1 | 1 |
| 3 | 1, 6, 9 | 3 | 9 |
| 4 | 3 | 1 | 1 |
| 5 | 5, 10 | 2 | 4 |
(c) Run AMS three times for on this stream, sampling positions , , and in turn. Show and for each run.
Solution
-
Run 1 — sample position (). Occurrences of from position onward: positions , so .
-
Run 2 — sample position (). Occurrences of from position onward: positions , so .
-
Run 3 — sample position (). Occurrences of from position onward: position , so .
(d) Average the three outputs . By what percentage does the average overshoot or undershoot the true ? Briefly explain why one would not be alarmed by this discrepancy.
Solution
Compared to the true , the average overshoots by .
Why is this expected? AMS is unbiased () but its single-run variance is large. With only three samples, the empirical average can easily land off the truth. The full -guarantee from Problem 4 requires samples, which would be far more than three even for modest .
Problem 14: Reverse-Engineering a Streaming Counter
Section titled “Problem 14: Reverse-Engineering a Streaming Counter”This problem mirrors the style of Question 4 of the sample final (final-example), where you’re handed a friend’s data structure and must reverse-engineer information from it.
(a) Morris counter. Your friend ran a single Morris counter on an unknown stream and tells you the final value of the counter is . What is their best point estimate of the number of stream items processed?
Solution
The Morris counter output is , so
(b) Use the Morris counter’s variance result to estimate the standard deviation of the estimate from part (a). Why does this make a single Morris counter unsatisfying for high-confidence applications?
Solution
The standard deviation of the estimate is roughly
Using the point estimate as a stand-in for , . So the true could plausibly be anywhere from roughly to on a single run — the standard deviation is on the order of the mean itself. This is exactly why one needs the Morris+/Morris++ boosting strategies for any application that demands a tight confidence interval.
(c) Flajolet-Martin. Another friend ran the basic Flajolet-Martin algorithm and tells you the final largest LSB position observed is . What is their point estimate of the number of distinct elements?
Solution
The basic Flajolet-Martin algorithm outputs , so
(d) Using the factor- guarantee from Problem 6, give a range of plausible values for the true number of distinct elements (with probability ).
Solution
The factor- guarantee says that with probability ,
Rearranging both sides for with :
So the true distinct-element count is somewhere between and with probability at least — a very wide window, again motivating the boosted Flajolet-Martin / HyperLogLog improvements.
Problem 15: Daily Weather Predictions via MWU
Section titled “Problem 15: Daily Weather Predictions via MWU”You subscribe to competing weather services that each post a daily binary forecast (rain / no-rain) for your city. After days, you want a strategy whose total wrong predictions are not much more than those of whichever service turns out to have been the most accurate (which you do not know in advance).
You may use the mistake bound proven in class without re-deriving it:
(a) Identify the experts and the binary decisions in MWU language. What is here?
Solution
The three weather services are the experts (). The binary decision each day is “rain” or “no rain”. Your strategy’s job is to combine the three forecasts into a single daily decision.
(b) State the Weighted Majority algorithm in this context, including the initial weights, the daily decision rule, and the weight update rule after each day.
Solution
- Initialize weights .
- Each morning, compute the total weight . Add up the weights of services predicting rain. If they sum to , your strategy predicts rain; otherwise predict no-rain.
- After the actual weather is observed, for each service that was wrong, set ; leave the weight of correctly predicting services unchanged.
(c) Suppose you choose , and after the 1000 days the best-performing service was wrong on days. Give an upper bound on the number of days your Weighted Majority strategy was wrong. (You may approximate .)
Solution
Plug , , :
So Weighted Majority makes at most about wrong predictions over the days. That’s roughly the best service — and crucially, you achieved this without knowing in advance which service would be best.
(d) What would happen to the bound from part (c) if you instead naively committed to always following service #1 from day 1 onward, and service #1 turned out to be the worst of the three?
Solution
If you blindly followed service #1 and it happened to be the worst service, you would make as many mistakes as the worst service — which could be arbitrarily large with no bound relative to the best. MWU’s whole point is that you get a guarantee comparable to the best expert in hindsight, regardless of which one that turns out to be, so you never get stuck mimicking a bad expert.
Problem 16: Hand-Simulating a Browser Cache
Section titled “Problem 16: Hand-Simulating a Browser Cache”This problem mirrors the run-by-hand style of Question 6 of the sample final (final-example), applied to the paging problem.
Your browser holds at most tabs in memory. You visit URLs in the following order:
(a) Briefly describe each eviction policy: LRU, FIFO, and the offline optimal (Farthest-in-Future).
Solution
- LRU (Least Recently Used): on a miss with a full cache, evict the item whose most recent access is furthest in the past.
- FIFO (First In, First Out): on a miss with a full cache, evict the item that was inserted into the cache earliest. Accessing an item already in the cache does not change its FIFO position.
- Optimal offline (Farthest-in-Future): on a miss with a full cache, evict the item whose next request is furthest in the future (or whose next request never comes).
(b) Simulate LRU on and report the total number of cache misses.
Solution
Track the cache and recency order (most-recent on the left):
| Step | Request | Hit/Miss | Cache after | Recency order (left = most recent) |
|---|---|---|---|---|
| 1 | A | Miss | {A} | [A] |
| 2 | B | Miss | {A, B} | [B, A] |
| 3 | C | Miss | {A, B, C} | [C, B, A] |
| 4 | D | Miss | {B, C, D} | [D, C, B] (evicted A, the LRU) |
| 5 | A | Miss | {C, D, A} | [A, D, C] (evicted B) |
| 6 | B | Miss | {D, A, B} | [B, A, D] (evicted C) |
| 7 | E | Miss | {A, B, E} | [E, B, A] (evicted D) |
| 8 | A | Hit | {A, B, E} | [A, E, B] |
| 9 | C | Miss | {A, E, C} | [C, A, E] (evicted B) |
| 10 | D | Miss | {A, C, D} | [D, C, A] (evicted E) |
LRU misses: 9 (only step 8 is a hit).
(c) Simulate FIFO on and report the total number of cache misses.
Solution
Track the cache and the insertion-order queue (left = oldest):
| Step | Request | Hit/Miss | Cache after | Queue (left = first in) |
|---|---|---|---|---|
| 1 | A | Miss | {A} | [A] |
| 2 | B | Miss | {A, B} | [A, B] |
| 3 | C | Miss | {A, B, C} | [A, B, C] |
| 4 | D | Miss | {B, C, D} | [B, C, D] (evicted A) |
| 5 | A | Miss | {C, D, A} | [C, D, A] (evicted B) |
| 6 | B | Miss | {D, A, B} | [D, A, B] (evicted C) |
| 7 | E | Miss | {A, B, E} | [A, B, E] (evicted D) |
| 8 | A | Hit | {A, B, E} | [A, B, E] (queue unchanged) |
| 9 | C | Miss | {B, E, C} | [B, E, C] (evicted A) |
| 10 | D | Miss | {E, C, D} | [E, C, D] (evicted B) |
FIFO misses: 9 (only step 8 is a hit).
(d) Simulate the optimal offline (Farthest-in-Future) algorithm and report the total misses.
Solution
At each miss, look ahead at the next occurrence of each cached item; evict the one whose next request is farthest (or never).
| Step | Request | Hit/Miss | Cache after | Eviction reasoning |
|---|---|---|---|---|
| 1 | A | Miss | {A} | - |
| 2 | B | Miss | {A, B} | - |
| 3 | C | Miss | {A, B, C} | - |
| 4 | D | Miss | {A, B, D} | Next A@5, B@6, C@9. Evict C (farthest). |
| 5 | A | Hit | {A, B, D} | - |
| 6 | B | Hit | {A, B, D} | - |
| 7 | E | Miss | {A, D, E} | Next A@8, B never, D@10. Evict B (never). |
| 8 | A | Hit | {A, D, E} | - |
| 9 | C | Miss | {D, E, C} | Next A never, D@10, E never. Evict A. |
| 10 | D | Hit | {D, E, C} | - |
OPT misses: 6 (steps 1, 2, 3, 4, 7, 9).
(e) What is the empirical ratio (algorithm misses / OPT misses) for LRU and FIFO on this sequence? How does it compare to the worst-case guarantee of -competitiveness for ?
Solution
By Sleator-Tarjan, the worst-case bound for both LRU and FIFO at is -competitive, i.e. ratio . On this particular sequence the achieved ratio is only , well within the worst-case bound. (The -competitive bound is tight only for adversarial sequences — for typical traces, both LRU and FIFO usually perform much better than the worst case.)