Skip to content

Final Prep: Basic Understanding Questions

Questions explicitly posed to students across lectures 12-15, 17-24 (March 9 — May 6).

Why Does the Bloom Filter Have No False Negatives?

Section titled “Why Does the Bloom Filter Have No False Negatives?”

Why does the Bloom Filter guarantee no false negatives?

Answer

When a key is inserted, all kk of its hash positions are set to 1, and bits are never reset from 1 back to 0. So if a key was ever inserted, all kk positions it mapped to must still be 1. Any query for that key will find all kk positions set to 1 and correctly return “yes.”

Bloom Filter Hash Function Count for 2% FPR

Section titled “Bloom Filter Hash Function Count for 2% FPR”

If asked to build a Bloom Filter with a 2% false positive rate, how many hash functions should you use?

Answer k=log(1/ε)=log(1/0.02)=log(50)=6 hash functionsk = \lceil \log(1/\varepsilon) \rceil = \lceil \log(1/0.02) \rceil = \lceil \log(50) \rceil = 6 \text{ hash functions}

For the same 2% false positive rate, how many bits per key does the Bloom Filter use?

Answer mn=1.44log(1/ε)=1.44log(50)1.44×6=8.64    8.64=9 bits per key\frac{m}{n} = 1.44 \cdot \log(1/\varepsilon) = 1.44 \cdot \log(50) \approx 1.44 \times 6 = 8.64 \implies \lceil 8.64 \rceil = \textbf{9 bits per key}

Finding a Missing Number in O(log n) Space

Section titled “Finding a Missing Number in O(log n) Space”

You’re shown 4,999 of the numbers 1 through 5,000 one at a time in a random order. One number is missing. You’re only allowed to store O(logn)O(\log n) bits at any time. How do you find the missing number?

Answer

Precompute the total expected sum Stotal=n(n+1)/2S_{\text{total}} = n(n+1)/2. Maintain a single running sum ss, initialized to 0. As each element xx arrives, update ss+xs \leftarrow s + x. At the end, the missing number is StotalsS_{\text{total}} - s.

The only value stored is the running sum, which is at most n2/2\approx n^2/2, requiring 2logn\approx 2\log n bits — far fewer than the O(n)O(n) bits needed to store the full array.

Uniform Sampling Probability: What Do s and t Represent?

Section titled “Uniform Sampling Probability: What Do s and t Represent?”

In the uniform sampling algorithm, when the sample size is s=2s = 2 and 9 elements have passed so far, any given element has probability 2/92/9 of being in the sample. Where does the 2 come from, and where does the 9 come from?

Answer
  • 2=s2 = s, the required sample size
  • 9=t9 = t, the number of elements seen so far (current time)

For the sample to be uniform at time tt, all tt elements must have an equal s/ts/t chance of being in the sample.

How Is Streaming Sampling Different from Static Sampling?

Section titled “How Is Streaming Sampling Different from Static Sampling?”

How is the streaming sampling problem different from the kind of sampling problem you’d encounter in CSCI 323?

Answer

Two key differences:

  1. Space constraint — you cannot store the entire stream, so you can’t collect all elements and then randomly choose ss of them after the fact.
  2. Online / at-every-moment guarantee — at every time tt, the current sample must already satisfy the uniform guarantee Pr[element in sample]=s/t\Pr[\text{element in sample}] = s/t. The sample must continuously evolve as new elements arrive, which a one-shot static algorithm doesn’t need to worry about.

What Happens If You Always Include the New Key?

Section titled “What Happens If You Always Include the New Key?”

In the streaming sampling algorithm, if you always put every newly arrived key into the sample (kicking out a random existing key when full), does that give a uniform sample?

Answer

No — your sample would always consist of exactly the last ss keys seen. Every earlier key would have zero probability of remaining in the sample at later times, violating the uniformity requirement that all elements have equal probability.

What Must Be Stored Beyond the Sample Itself?

Section titled “What Must Be Stored Beyond the Sample Itself?”

Can the streaming sampling algorithm be implemented by storing only the ss sampled keys, or must something else be stored?

Answer

You must also store ii, the current count of how many keys have been seen so far. Without knowing ii, you cannot compute the acceptance probability s/is/i for each newly arriving key. Storing ii requires only logi\log i bits, so the total space is O(s+logT)O(s + \log T).

Inductive Proof Base Case: Why t=s+1t = s + 1 and Not t=1t = 1?

Section titled “Inductive Proof Base Case: Why t=s+1t = s + 1t=s+1 and Not t=1t = 1t=1?”

In the inductive correctness proof for the uniform sampling algorithm, the base case is t=s+1t = s + 1 rather than t=1t = 1. Why does this make sense?

Answer

For the first ss arrivals (tst \leq s), the algorithm simply stores every key — no probabilistic decisions are made. Since s/t=1s/t = 1 for tst \leq s, all keys have probability 1 of being in the sample, which trivially satisfies the guarantee. The first genuinely interesting case — where the algorithm must decide whether to accept or reject a new key — is when the (s+1)(s+1)th key arrives, making t=s+1t = s + 1 the natural base case.

When proving the inductive step, you use the total probability formula for the event that key xjx_j is in the sample at time k+1k+1, conditioning on whether it was in the sample at time kk. One of the two resulting terms is zero. Why?

Answer

The term Pr[xjAk+1xjAk]=0\Pr[x_j \in A_{k+1} \mid x_j \notin A_k] = 0, because keys can only enter the sample at the moment they first appear in the stream. If xjx_j was not accepted when it arrived, it has already passed and there is no mechanism to add it retroactively.

Lecture 13 - Morris Counter & Variance Reduction

Section titled “Lecture 13 - Morris Counter & Variance Reduction”

How many bits of working memory do you need to keep an exact count of TT items that have passed through a stream?

Answer

logT\lceil \log T \rceil bits — the minimum required to represent the number TT in binary.

Midterm Review: Cuckoo Hashing vs. Hashing with Chaining

Section titled “Midterm Review: Cuckoo Hashing vs. Hashing with Chaining”

What is the main advantage of cuckoo hashing over hashing with chaining?

Answer

O(1)O(1) worst-case query time. Hashing with chaining guarantees O(1)O(1) expected query time, but a single query can take O(n)O(n) in the worst case. Cuckoo hashing stores every key in exactly one of two possible positions, so any query checks at most 2 locations.

Midterm Review: Hashing with Chaining vs. FKS

Section titled “Midterm Review: Hashing with Chaining vs. FKS”

What is the main advantage of hashing with chaining over FKS hashing?

Answer

Hashing with chaining has O(n)O(n) worst-case preprocessing (build) time, since each of the nn insertions takes O(1)O(1) worst-case. FKS hashing only guarantees O(n)O(n) expected build time — there is some probability of needing to rebuild, so its build time is not worst-case O(n)O(n).

What Does the Morris Counter Reduce to with Increment Probability 1?

Section titled “What Does the Morris Counter Reduce to with Increment Probability 1?”

What does the Morris counter algorithm become if you change the increment probability from 1/2c1/2^c to 1 (always increment)?

Answer

A regular exact counter — it increments by 1 every time a key appears, and at the end its value equals exactly NN (the total number of keys seen). This uses logN\lceil \log N \rceil bits, which is what the Morris counter is designed to improve upon.

Simulating a Biased Coin with Only a Fair Coin

Section titled “Simulating a Biased Coin with Only a Fair Coin”

You have only a single fair coin (probability 1/21/2 heads). How do you simulate an event that occurs with probability 1/2k1/2^k?

Answer

Flip the coin kk times. Declare “success” if every flip comes up heads. Since each flip is independent, Pr[all heads]=(1/2)k=1/2k\Pr[\text{all heads}] = (1/2)^k = 1/2^k.

This matters in implementing the Morris counter: rather than storing and computing the full probability 1/2c1/2^c (which would require storing 2cN2^c \approx N — defeating the purpose), you just flip a fair coin cc times and increment only if all cc flips are heads.

First Thing to Verify About a New Randomized Algorithm

Section titled “First Thing to Verify About a New Randomized Algorithm”

If you’ve just designed a randomized algorithm that outputs an estimate Q^\hat{Q}, what is the first property you’d want to verify?

Answer

That E[Q^]=Q\mathbb{E}[\hat{Q}] = Q, i.e., the estimate is unbiased — correct on average. If the algorithm produces the wrong answer in expectation, there is little hope of making it useful. Establishing this is typically the first step before analyzing variance or applying concentration inequalities.

Variance of an Average of Independent Estimates

Section titled “Variance of an Average of Independent Estimates”

You have TT independent random variables X1,,XTX_1, \ldots, X_T, each with variance σ\sigma. If Y=(X1++XT)/TY = (X_1 + \cdots + X_T) / T, what is Var(Y)\text{Var}(Y)?

Answer Var(Y)=σT\text{Var}(Y) = \frac{\sigma}{T}

Derivation:

Var(Y)=Var ⁣(XiT)=1T2Var ⁣(Xi)=1T2Tσ=σT\text{Var}(Y) = \text{Var}\!\left(\frac{\sum X_i}{T}\right) = \frac{1}{T^2}\,\text{Var}\!\left(\sum X_i\right) = \frac{1}{T^2} \cdot T\sigma = \frac{\sigma}{T}

where independence lets the variance of the sum equal the sum of variances, and the scaling rule Var(cX)=c2Var(X)\text{Var}(cX) = c^2\,\text{Var}(X) pulls out the 1/T21/T^2.

Suppose you have an unbiased estimator Q^\hat{Q} (correct in expectation) but with high variance. How can you reduce the variance while keeping the estimate unbiased?

Answer

Run TT independent copies of the algorithm to get estimates Q^1,,Q^T\hat{Q}_1, \ldots, \hat{Q}_T, and output their average Qˉ=(Q^1++Q^T)/T\bar{Q} = (\hat{Q}_1 + \cdots + \hat{Q}_T)/T.

  • E[Qˉ]=Q\mathbb{E}[\bar{Q}] = Q — the average is still unbiased.
  • Var(Qˉ)=Var(Q^)/T\text{Var}(\bar{Q}) = \text{Var}(\hat{Q}) / T — variance shrinks by a factor of TT.

By Chebyshev’s inequality, the probability that Qˉ\bar{Q} deviates from QQ by more than cc is at most Var(Q^)/(Tc2)\text{Var}(\hat{Q}) / (T c^2), which can be made as small as desired by increasing TT.

Lecture 14 was led by a TA (Gibyo) working through textbook exercises 4.9 and 4.5. The main explicit question posed mid-solution:

Expectation of the Number of Bad Weak Estimates (Exercise 4.9)

Section titled “Expectation of the Number of Bad Weak Estimates (Exercise 4.9)”

In the proof of exercise 4.9, we define Yi=1Y_i = 1 if the ii-th weak estimate is bad (outside the ε\varepsilon-relative error range), and Yi=0Y_i = 0 otherwise. We collect tt' independent weak estimates. What is E ⁣[i=1tYi]\mathbb{E}\!\left[\sum_{i=1}^{t'} Y_i\right]?

Answer E ⁣[i=1tYi]=t4\mathbb{E}\!\left[\sum_{i=1}^{t'} Y_i\right] = \frac{t'}{4}

Each YiY_i is a Bernoulli random variable. A “weak estimate” is defined with δ=1/4\delta = 1/4, meaning each individual estimate is bad with probability at most 1/41/4. By linearity of expectation and independence of the estimates:

E ⁣[i=1tYi]=tE[Yi]=t14=t4\mathbb{E}\!\left[\sum_{i=1}^{t'} Y_i\right] = t' \cdot \mathbb{E}[Y_i] = t' \cdot \frac{1}{4} = \frac{t'}{4}

Lecture 15 - Approximate Median & Morris+/++

Section titled “Lecture 15 - Approximate Median & Morris+/++”

Lecture 15 was again led by a TA (Gibyo). The session covered the (ε,δ)(\varepsilon, \delta)-approximate median algorithm and the Morris+ / Morris++ algorithms. Unlike the professor’s lectures, the TA rarely posed explicit conceptual questions to the class. There are no substantial Q&A exchanges to include here.

Lecture 17 - Heavy Hitters & Count Min Sketch

Section titled “Lecture 17 - Heavy Hitters & Count Min Sketch”

How to Find the Approximate Median of a Stream

Section titled “How to Find the Approximate Median of a Stream”

If you are given a stream of numbers, how would you find an approximate median?

Answer

Take T=O ⁣(log(1/δ)ε2)T = O\!\left(\dfrac{\log(1/\delta)}{\varepsilon^2}\right) uniform samples from the stream and return the median of those samples.

This gives an (ε,δ)(\varepsilon, \delta)-approximate median — meaning the returned value has rank within εn\varepsilon n of the true median — with probability at least 1δ1 - \delta. No repetition or averaging step is needed beyond choosing a large enough sample size TT.

Using Count Min Sketch for Top-K: The Role of a Min-Heap

Section titled “Using Count Min Sketch for Top-K: The Role of a Min-Heap”

Suppose you have a Count Min Sketch that can estimate the frequency of any item. How do you use it to maintain the top-KK most frequent items as the stream arrives?

Answer

Maintain a min-heap of size KK, keyed by estimated frequency. Initialize it with the first KK items.

Whenever a new item appears in the stream, query its estimated frequency f^\hat{f} from the CMS. If f^\hat{f} exceeds the frequency of the current minimum in the heap (the least popular of the KK tracked items), evict that minimum and insert the new item.

This keeps the heap holding the current top-KK candidates at all times, using only O(K)O(K) extra space and O(logK)O(\log K) update time per arrival.

When querying the Count Min Sketch for the frequency of item yy, what is the final output?

Answer

Apply every row’s hash function to yy: compute h1(y),h2(y),,hR(y)h_1(y), h_2(y), \ldots, h_R(y) to obtain one cell index per row. Read the counter stored in each of those RR cells and return the minimum value among them. This is why the data structure is called the count min sketch.

Why does the Count Min Sketch return the minimum counter value across all rows, rather than, say, the maximum or the average?

Answer

Whenever two items hash to the same cell in a row, their counts collide and that cell’s counter gets inflated by items other than yy. The minimum across rows is the counter that has been inflated the least, making it the closest estimate to the true frequency. Taking the maximum or average would allow a single heavily-collided row to dominate the result.

Can the Count Min Sketch ever return f^(y)<f(y)\hat{f}(y) < f(y) — an estimate smaller than the true frequency?

Answer

No. Every time yy appeared in the stream, its designated cell in every row was incremented by 1. Counters are never decremented. So each of those RR cell values is at least f(y)f(y), and therefore so is their minimum. The Count Min Sketch can only overestimate — it never underestimates.

How Is the CMS Guarantee Different from the Standard Relative-Error Guarantee?

Section titled “How Is the CMS Guarantee Different from the Standard Relative-Error Guarantee?”

The Count Min Sketch guarantees f^(y)f(y)+εm\hat{f}(y) \leq f(y) + \varepsilon m with probability at least 1δ1 - \delta. How does this differ from the usual relative-error (ε,δ)(\varepsilon, \delta) guarantee?

Answer

In a standard relative-error guarantee, the error is bounded by εf(y)\varepsilon \cdot f(y) — a fraction of the true frequency of the queried item. For the CMS, the error bound is εm\varepsilon m, where mm is the total stream length, regardless of how large or small f(y)f(y) is.

Since f(y)mf(y) \leq m for any item, εmεf(y)\varepsilon m \geq \varepsilon f(y), so this is a weaker, absolute error bound. A matching lower bound shows that a relative-error guarantee requires storing the full stream, so the absolute bound is the best achievable in sublinear space.

The expected error in one row’s counter for item yy is at most m/cm/c (where c=e/εc = e/\varepsilon columns). Using Markov’s inequality, what is the probability that this row’s error exceeds εm\varepsilon m?

Answer

By Markov’s inequality, Pr[Z>t]E[Z]/t\Pr[Z > t] \leq \mathbb{E}[Z] / t. Here:

Pr[row error>εm]m/cεm=mε/eεm=1e\Pr[\text{row error} > \varepsilon m] \leq \frac{m/c}{\varepsilon m} = \frac{m \cdot \varepsilon / e}{\varepsilon m} = \frac{1}{e}

So each row independently has at most a 1/e1/2.721/e \approx 1/2.72 chance of its error exceeding the allowed tolerance.

Why Must All Rows Fail for the CMS to Give a Bad Answer?

Section titled “Why Must All Rows Fail for the CMS to Give a Bad Answer?”

For the Count Min Sketch to return an overestimate larger than f(y)+εmf(y) + \varepsilon m, what must be true of all RR rows simultaneously?

Answer

The CMS output is the minimum counter across all rows. For this minimum to exceed f(y)+εmf(y) + \varepsilon m, every row’s counter must exceed that threshold — all RR bad events must occur together.

If even one row has an error within εm\varepsilon m, the minimum is controlled and the guarantee holds. Because the rows use independent hash functions, the probability that all RR rows simultaneously have error greater than εm\varepsilon m is at most (1/e)R(1/e)^R. With R=log(1/δ)R = \log(1/\delta) rows, this is (1/e)log(1/δ)=δ(1/e)^{\log(1/\delta)} = \delta.

Lecture 18 - Frequency Moments & AMS Sampling

Section titled “Lecture 18 - Frequency Moments & AMS Sampling”

Where Does E[X2]\mathbb{E}[X^2] Show Up?

Section titled “Where Does E[X2]\mathbb{E}[X^2]E[X2] Show Up?”

If you have a random variable XX, for what purpose would you ever need to compute E[X2]\mathbb{E}[X^2]?

Answer

In computing the variance:

Var(X)=E[X2](E[X])2\text{Var}(X) = \mathbb{E}[X^2] - \left(\mathbb{E}[X]\right)^2

The second moment E[X2]\mathbb{E}[X^2] is needed any time you want to measure how spread out a distribution is around its mean.

What does the 0th frequency moment of a stream represent?

Answer F0=i=1f(i)>0nf(i)0=i=1f(i)>0n1F_0 = \sum_{\substack{i=1 \\ f(i) > 0}}^{n} f(i)^0 = \sum_{\substack{i=1 \\ f(i) > 0}}^{n} 1

This counts the number of distinct elements in the stream. Each item that has appeared at least once contributes 11 (since f(i)0=1f(i)^0 = 1 for f(i)>0f(i) > 0), and items that never appeared contribute 00. Estimating F0F_0 efficiently is the problem solved by HyperLogLog.

What does the 1st frequency moment represent, and what streaming algorithm computes it?

Answer F1=i=1nf(i)F_1 = \sum_{i=1}^{n} f(i)

This is the total number of elements in the stream — its length mm — since summing all frequencies counts every arrival. The Morris counter is precisely the algorithm for approximating F1F_1 in sublinear space.

AMS Sampling: Output When Sampling a Rare Item

Section titled “AMS Sampling: Output When Sampling a Rare Item”

Consider the stream 1,2,5,1,3,5,5,6,5,1,31, 2, 5, 1, 3, 5, 5, 6, 5, 1, 3 (length m=11m = 11). If the AMS algorithm samples element 66 (which appears only once), what is the output?

Answer

Sampling 66 gives r=1r = 1 (it appears exactly once at or after the sampled position). The second-moment output formula is m(r2(r1)2)m \cdot (r^2 - (r-1)^2):

11(1202)=111=1111 \cdot (1^2 - 0^2) = 11 \cdot 1 = 11

AMS Sampling: Output When Sampling the First Occurrence of a Frequent Item

Section titled “AMS Sampling: Output When Sampling the First Occurrence of a Frequent Item”

In the same stream, if the algorithm samples the first occurrence of element 11 (which appears a total of 3 times), what is the output?

Answer

Sampling the first occurrence of 11 means r=3r = 3 (all three occurrences fall at or after that position). The output is:

11(3222)=11(94)=115=5511 \cdot (3^2 - 2^2) = 11 \cdot (9 - 4) = 11 \cdot 5 = 55

Boosting a Single Unbiased Estimate: General Strategy

Section titled “Boosting a Single Unbiased Estimate: General Strategy”

AMS sampling produces a single estimate that is correct in expectation but has high variance. What is the general strategy for converting this into an (ε,δ)(\varepsilon, \delta)-guarantee?

Answer

The standard two-level strategy from exercise 4.9:

  1. Run T1=O(1/ε2)T_1 = O(1/\varepsilon^2) independent copies of AMS and take their average. This produces a “weak estimate” that is within ε\varepsilon of the true answer with probability at least 3/43/4 (variance is reduced by averaging).
  2. Run T2=O(log(1/δ))T_2 = O(\log(1/\delta)) independent copies of step 1 and take their median. By the median amplification argument, the final output achieves the full (ε,δ)(\varepsilon, \delta)-guarantee.

This mirrors the Morris+ / Morris++ construction applied to AMS sampling.

Applying Chernoff to AMS: What Are We Solving For?

Section titled “Applying Chernoff to AMS: What Are We Solving For?”

When applying the Chernoff bound in the analysis of AMS sampling, what is the goal of the calculation?

Answer

We want to find the minimum number of independent copies TT of the AMS estimator that must be averaged in order to achieve the desired (ε,δ)(\varepsilon, \delta) error guarantee.

We know the estimator is correct in expectation. We apply the generalized Chernoff bound (for random variables bounded in [0,C][0, C] rather than just {0,1}\{0, 1\}) to bound the probability that the average of TT copies deviates from the true answer by more than ε\varepsilon. Solving for the smallest TT such that this probability stays at or below δ\delta gives the required number of replications.

Lecture 19 - AMS Sampling (cont.) & Counting Distinct Elements

Section titled “Lecture 19 - AMS Sampling (cont.) & Counting Distinct Elements”

Why Does Averaging T Copies Preserve the Expected Value?

Section titled “Why Does Averaging T Copies Preserve the Expected Value?”

When you run TT independent copies of AMS sampling and average their outputs to get YY, why is E[Y]=E[X]\mathbb{E}[Y] = \mathbb{E}[X], where XX is the output of one copy?

Answer

By linearity of expectation. Since all TT copies are identically distributed:

E[Y]=E ⁣[X1++XTT]=1Ti=1TE[Xi]=1TTE[X]=E[X]\mathbb{E}[Y] = \mathbb{E}\!\left[\frac{X_1 + \cdots + X_T}{T}\right] = \frac{1}{T}\sum_{i=1}^{T}\mathbb{E}[X_i] = \frac{1}{T} \cdot T \cdot \mathbb{E}[X] = \mathbb{E}[X]

Does Running More Copies Make the Error Probability Better or Worse?

Section titled “Does Running More Copies Make the Error Probability Better or Worse?”

In the AMS analysis, as you increase the number of parallel copies TT being averaged, does the probability of a large error increase or decrease?

Answer

It decreases. Both from intuition (more samples give a better estimate) and from the Chernoff expression: the probability bound is eΩ(ε2T/C)e^{-\Omega(\varepsilon^2 \cdot T / C)}, which shrinks exponentially as TT grows. Running more copies always improves the guarantee.

What Is the Problem with the Exact Expression for TT?

Section titled “What Is the Problem with the Exact Expression for TTT?”

After applying Chernoff to derive the exact minimum number of copies TT needed for an (ε,δ)(\varepsilon, \delta) guarantee, what prevents you from using that formula directly?

Answer

The formula depends on two unknown quantities:

  • ff^* — the frequency of the most frequent item in the stream, which you don’t know ahead of time.
  • F=if(i)2F = \sum_i f(i)^2 — the second frequency moment, which is precisely the quantity you are trying to compute.

Since neither is known before processing the stream, the formula can’t be evaluated. The fix is to upper-bound mf/Fnm f^* / F \leq \sqrt{n} (a provable lemma), replacing the unknown expression with n\sqrt{n}, which is a computable quantity.

What Streams Maximize and Minimize F=f(i)2F = \sum f(i)^2?

Section titled “What Streams Maximize and Minimize F=∑f(i)2F = \sum f(i)^2F=∑f(i)2?”

The second frequency moment F=i=1nf(i)2F = \sum_{i=1}^{n} f(i)^2 depends on how stream elements are distributed. Which stream structure maximizes FF, and which minimizes it?

Answer
  • Maximized when all mm elements are the same item: that item has frequency mm and all others have frequency 00, giving F=m2F = m^2.
  • Minimized when the stream is spread as evenly as possible across all nn items: each item has frequency m/nm/n, giving F=n(m/n)2=m2/nF = n \cdot (m/n)^2 = m^2/n.

So FF is always in the range [m2/n,m2][m^2/n,\, m^2], and a high FF signals a skewed distribution (few items dominate), while a low FF signals a more uniform one.

Counting Distinct Elements Without Space Constraints

Section titled “Counting Distinct Elements Without Space Constraints”

If you had no space constraint, how would you count the number of distinct elements in a stream?

Answer

Maintain a running set SS. Whenever a new element xx arrives, check if xSx \in S. If it is not, insert xx; if it already is, skip it. At any point, output S|S| as the count of distinct elements seen so far.

This works but uses O(f0logn)O(f_0 \log n) bits, where f0f_0 is the number of distinct elements and each element is a number between 11 and nn requiring logn\log n bits — far too much for a streaming setting.

Expected Position of the Minimum of Two Random Darts

Section titled “Expected Position of the Minimum of Two Random Darts”

You throw two darts independently and uniformly at random on the interval [0,1][0, 1]. What is the expected position of the smaller (leftmost) of the two darts?

Answer E[min(X1,X2)]=13\mathbb{E}[\min(X_1, X_2)] = \frac{1}{3}

The intuition: two darts divide the interval into three equal pieces on average, so the left dart lands at 1/31/3 and the right dart at 2/32/3.

More generally, if you throw tt darts uniformly on [0,1][0,1], the expected minimum is 1/(t+1)1/(t+1). This fact is the key to understanding why the Flajolet-Martin ideal algorithm works.

Lecture 20 - Flajolet-Martin Ideal Algorithm

Section titled “Lecture 20 - Flajolet-Martin Ideal Algorithm”

How Many Hash Values Does the FM Ideal Algorithm Compute?

Section titled “How Many Hash Values Does the FM Ideal Algorithm Compute?”

In the Flajolet-Martin ideal algorithm, the stream may have length mm with many repetitions. Over the entire stream, how many distinct hash values h(xi)h(x_i) are actually computed and kept track of?

Answer

Only tt distinct hash values are computed — one per distinct element, since the hash function is fixed and deterministic. If the same element appears ten times, it hashes to the same value every time. So repeated appearances of an item never produce a new hash value; only a genuinely new element does.

The FM ideal algorithm outputs 1X1\frac{1}{X} - 1, where XX is the minimum hash value. Why the 1-1?

Answer

If tt is the number of distinct elements, we proved that:

E[X]=1t+1\mathbb{E}[X] = \frac{1}{t + 1}

So 1/Xt+11/X \approx t + 1 in expectation, and subtracting 1 gives approximately tt — the quantity we want to output.

Exam Question: Tracking the Maximum Hash Instead of the Minimum

Section titled “Exam Question: Tracking the Maximum Hash Instead of the Minimum”

You are implementing the FM ideal algorithm, but your friend accidentally tracks the maximum hash value XX' instead of the minimum. Is the algorithm salvageable? If so, what output formula should be used?

Answer

The algorithm is not doomed. If tt is the number of distinct elements, then:

E[X]=tt+1\mathbb{E}[X'] = \frac{t}{t+1}

So 1X1/(t+1)1 - X' \approx 1/(t+1) in expectation, and therefore 1/(1X)t+11/(1 - X') \approx t + 1. The corrected output is:

11X1\frac{1}{1 - X'} - 1

The proof follows the same structure as the minimum case, but uses the complementary probability: the probability that all tt hashes land to the left of a line at position xx is xtx^t, and integrating 1xt1 - x^t gives the expectation of the max.

Why Is the Continuous-Hash Algorithm Called “Ideal”?

Section titled “Why Is the Continuous-Hash Algorithm Called “Ideal”?”

The Flajolet-Martin analysis begins with a hash function mapping items to real numbers in [0,1][0,1]. Why is this called the “ideal” algorithm?

Answer

Because such a hash function cannot actually be implemented in finite space. A real number in [0,1][0,1] (like 1/21/\sqrt{2}) may require infinite bits to represent exactly. Since the entire goal of the streaming setting is to save space, an algorithm that requires infinite precision to store a single hash value is impractical — hence “ideal.” It serves only to convey the main idea before showing the practical approximation.

The HyperLogLog algorithm tracks the position pp of the leftmost least-significant-bit across all hash bit vectors. Why is the output 2p+12^{p+1} (and not, say, 2p2^p or pp itself)?

Answer

For tt distinct elements, the expected number of elements whose hash ends in exactly p+1p+1 zeros (i.e., whose least-significant-bit is at position pp) is:

t2p+1\frac{t}{2^{p+1}}

The largest position pp we observe is the one where this count drops to roughly 11. Setting t/2p+11t / 2^{p+1} \approx 1 and solving gives t2p+1t \approx 2^{p+1}, which is why we output that quantity.

Lecture 21/22 - HyperLogLog Analysis & Online Algorithms

Section titled “Lecture 21/22 - HyperLogLog Analysis & Online Algorithms”

In the HyperLogLog analysis, why is the variance of each indicator Bernoulli variable YiY_i guaranteed to be no larger than its expectation?

Answer

For a Bernoulli with success probability pp:

Var(Yi)=p(1p)p=E[Yi]\text{Var}(Y_i) = p(1-p) \leq p = \mathbb{E}[Y_i]

since (1p)1(1-p) \leq 1. Multiplying pp by something at most 1 can only make it smaller or equal. This lets you upper-bound the variance of indicator sums by their expectation — a convenient shortcut throughout the HyperLogLog proof.

Ski Rental: What Is the Competitive Ratio?

Section titled “Ski Rental: What Is the Competitive Ratio?”

In the ski rental problem, skis cost \Ptobuyorto buy or$1/daytorent.Youdontknowhowmanydays/day to rent. You don't know how many days Nyoullski.Theonlinealgorithmrentsforthefirstyou'll ski. The online algorithm rents for the firstP-1days,thenbuysondaydays, then buys on dayP$ if you’re still there. What is the competitive ratio?

Answer

At most 2. There are two cases:

  • If N<PN < P: you only rent, spending NN total — exactly what OPT spends. Ratio = 1.
  • If NPN \geq P: you spend (P1)+P=2P1(P-1) + P = 2P - 1 (rent days 1 through P1P-1, buy on day PP). OPT spends PP (just buy on day 1). Ratio = (2P1)/P<2(2P-1)/P < 2.

In the worst case, the ratio approaches 22, so the competitive ratio is at most 2\mathbf{2}.

Pizza Finding: Why Does “Walk to One End First” Have a Bad Competitive Ratio?

Section titled “Pizza Finding: Why Does “Walk to One End First” Have a Bad Competitive Ratio?”

You are in room 0 of a corridor with nn rooms on each side. Pizza is in room jj (unknown). One algorithm: walk all the way to one end; if not found, walk to the other. Why is this algorithm’s competitive ratio bad?

Answer

This algorithm’s cost is at most 2n+j2n + |j|: at most nn to reach one end, nn to return, then j|j| to reach the pizza on the other side.

OPT costs j|j| (walk straight there).

The competitive ratio is:

2n+jj=2nj+1\frac{2n + |j|}{|j|} = \frac{2n}{|j|} + 1

If j|j| is small (e.g., pizza is in room 1), this ratio grows like 2n2n — it depends on nn, the corridor length. A good online algorithm should have a constant competitive ratio, not one that scales with the input size.

Pizza Finding: Why Does the Zigzag Algorithm Still Fail?

Section titled “Pizza Finding: Why Does the Zigzag Algorithm Still Fail?”

The zigzag algorithm alternates between checking room 1, 1-1, 2, 2-2, 33, 3-3, … until the pizza is found. Why does this also have a poor competitive ratio?

Answer

If the pizza is in room jj, this algorithm travels roughly 2j22j^2 total distance before finding it (you visit jj rooms in each direction, multiple times). OPT travels j|j|.

The competitive ratio is at least:

2j2j=2j\frac{2j^2}{j} = 2j

This still depends on jj, which could be as large as nn. A competitive ratio of 2j2j is not a constant, so the zigzag algorithm is not competitive in the desired sense either. The fix is to turn at powers of 2 (i.e., visit rooms 1,1,2,2,4,4,8,8,1, -1, 2, -2, 4, -4, 8, -8, \ldots), which achieves a constant competitive ratio of at most 99.

Lecture 23 - List Update Problem & Multiplicative Weight Updates

Section titled “Lecture 23 - List Update Problem & Multiplicative Weight Updates”

What Is a Bad Sequence for the “Do Nothing” List Algorithm?

Section titled “What Is a Bad Sequence for the “Do Nothing” List Algorithm?”

You maintain a linked list of NN keys. One algorithm never moves any accessed key (it just walks to the key and leaves it in place). What is the worst sequence for this algorithm, and what is its cost?

Answer

The worst sequence is one that repeatedly requests the key at the very end of the list — for example, the sequence N,N,N,,NN, N, N, \ldots, N of length MM.

Since the algorithm never moves anything, the key stays at position NN every time. Each access costs NN, so the total cost is MNM \cdot N.

By contrast, an algorithm that moves the first accessed key to the front of the list would pay NN once and then 11 for each subsequent access of the same key — a total of roughly MM. Its cost on this sequence is at most 2M2M, so the “do nothing” algorithm’s competitive ratio on this sequence is MN/2M=N/2=Ω(N)M \cdot N / 2M = N/2 = \Omega(N).

Move-to-Front: Cost After the First Access

Section titled “Move-to-Front: Cost After the First Access”

In the list update problem, the “move to front” heuristic moves an accessed key to the front of the list every time it is accessed. On the sequence N,N,N,N, N, N, \ldots (always requesting the last element), how much does the algorithm pay for the first access of NN, and how much for each subsequent access?

Answer
  • First access: NN — the algorithm must walk from the front to position NN.
  • Each subsequent access: 11 — after the first access, NN is moved to the front of the list and stays there.

So the total cost over MM accesses is N+(M1)1MN + (M - 1) \cdot 1 \approx M, which is far cheaper than the “do nothing” algorithm’s MNM \cdot N.

What Is a Bad Sequence for the Ordering-by-Frequency Algorithm?

Section titled “What Is a Bad Sequence for the Ordering-by-Frequency Algorithm?”

The “order by frequency” algorithm maintains the linked list in decreasing order of how many times each key has been requested so far. What is a bad access sequence for this algorithm?

Answer

Access item 11 exactly NN times, then item 22 exactly NN times, then item 33 exactly NN times, and so on up to item NN — a total sequence length of N2N^2.

After the first half of the sequence (items 11 through N/2N/2 have each been accessed NN times), the first half of the list is occupied by items 1,2,,N/21, 2, \ldots, N/2, all with frequency NN. Items N/2+1N/2 + 1 through NN have frequency 00 and sit in the second half.

Now accessing item N/2+1N/2 + 1 costs at least N/2N/2 each time (it is in the second half of the list), and this is true for all N2/2N^2/2 remaining accesses. Total cost (N2/2)(N/2)=N3/4=Ω(N3)\geq (N^2/2) \cdot (N/2) = N^3/4 = \Omega(N^3).

The “move to front” algorithm achieves O(N2)O(N^2) on the same sequence, so the ordering-by-frequency competitive ratio is at least N3/N2=Ω(N)N^3 / N^2 = \Omega(N) — just as bad as doing nothing.

What is the formal definition of the competitive ratio of an online algorithm?

Answer

The competitive ratio is the maximum over all possible input sequences of the ratio:

cost of the online algorithm on that sequencecost of OPT on that sequence\frac{\text{cost of the online algorithm on that sequence}}{\text{cost of OPT on that sequence}}

It is not simply the worst-case cost of the algorithm alone — an algorithm can have a high cost on a sequence where OPT also has a high cost, and the ratio might still be small. What matters is how much worse the online algorithm does compared to the offline optimal on the same input.

Can Experts Gain Weight in the Multiplicative Weight Updates Algorithm?

Section titled “Can Experts Gain Weight in the Multiplicative Weight Updates Algorithm?”

In the MWU algorithm, experts start with weight 11 and may be penalized by a factor of (1ε)(1 - \varepsilon) each time they make a mistake. Can an expert ever have its weight increased?

Answer

No. The algorithm only ever decreases weights (multiplying by 1ε<11 - \varepsilon < 1 on mistakes) or leaves them unchanged (on correct predictions). There is no mechanism for an expert’s weight to increase. Consequently, the total weight Φt=iwit\Phi^t = \sum_i w_i^t is non-increasing over time, starting at NN and only going down.

Lecture 24 - Experts Theorem Proof & Paging

Section titled “Lecture 24 - Experts Theorem Proof & Paging”

Simplified Experts Setting: One Expert Is Always Correct

Section titled “Simplified Experts Setting: One Expert Is Always Correct”

Suppose you have NN experts and you are told that one of them is always correct. What strategy guarantees the fewest mistakes, and how many mistakes does it make in the worst case?

Answer

Follow an expert until they make a mistake, then drop them and pick another. Since the one always-correct expert never makes a mistake, you only ever drop the wrong ones. Each expert you drop made at most one mistake (the one that caused you to drop them), and you start with NN experts, so you make at most N1N - 1 mistakes total. After that, only the always-correct expert remains, and you make no further mistakes.

This case is simpler than the general MWU setting, where no expert is guaranteed to be always correct.

Why Is wit=(1ε)mitw_i^t = (1 - \varepsilon)^{m_i^t}?

Section titled “Why Is wit=(1−ε)mitw_i^t = (1 - \varepsilon)^{m_i^t}wit​=(1−ε)mit​?”

In the MWU algorithm, expert ii‘s weight on day tt is (1ε)mit(1 - \varepsilon)^{m_i^t}, where mitm_i^t is the number of mistakes expert ii has made up to day tt. Why?

Answer

Each expert starts with weight 11. Every time expert ii makes a mistake, its weight is multiplied by (1ε)(1 - \varepsilon). Since expert ii has made mitm_i^t mistakes over tt days:

wit=1(1ε)mit=(1ε)mitw_i^t = 1 \cdot (1 - \varepsilon)^{m_i^t} = (1 - \varepsilon)^{m_i^t}

The number of mistakes is exactly the exponent because each mistake contributes exactly one factor of (1ε)(1 - \varepsilon).

Why Does the Potential Function Start at NN?

Section titled “Why Does the Potential Function Start at NNN?”

In the MWU potential function proof, Φt=i=1Nwit\Phi^t = \sum_{i=1}^{N} w_i^t denotes the total weight on day tt. Why is Φ1=N\Phi^1 = N?

Answer

On day 1, every expert has weight 11 (no mistakes have been made yet, so no weights have been penalized). There are NN experts, so:

Φ1=i=1Nwi1=N1=N\Phi^1 = \sum_{i=1}^{N} w_i^1 = N \cdot 1 = N

As days pass, what happens to the potential function Φt\Phi^t?

Answer

Φt\Phi^t either decreases or stays the same — it never increases. This is because expert weights are only ever penalized (multiplied by 1ε<11 - \varepsilon < 1) or left unchanged; they are never increased. Since the total weight is the sum of individual weights that can only go down, the potential can only go down.

More precisely: on any day the algorithm makes a mistake, at least half the total weight was behind the wrong decision, and those experts’ weights get multiplied by (1ε)(1 - \varepsilon). This forces:

Φt+1Φt(1ε2)\Phi^{t+1} \leq \Phi^t \cdot \left(1 - \frac{\varepsilon}{2}\right)

so each mistake shrinks the potential by a factor of at least (1ε/2)(1 - \varepsilon/2).

In the paging problem, both LRU (least recently used) and FIFO (first in, first out) are cache eviction policies. How are they different?

Answer
  • LRU: when a cache miss forces an eviction, evict the item in cache that was requested least recently — i.e., the item that has gone the longest without being asked for.
  • FIFO: evict the item that was brought into cache the earliest — the one that has been sitting in cache the longest, regardless of whether it was requested recently.

They differ when an old item is still actively used. For example: items 1,2,31, 2, 3 are cached (item 11 was brought in first). Item 11 is then requested 50 more times. When a cache miss on item 44 requires an eviction:

  • FIFO evicts item 11 (it was brought in first).
  • LRU evicts item 22 (item 11 was requested most recently).

If you could see the entire future request sequence, what is the optimal algorithm for deciding which cached item to evict on a cache miss?

Answer

Farthest in Future (proved optimal by Bélády): when a cache miss occurs, look at every item currently in cache and find the one that will not be requested again until the farthest point in the future. Evict that item.

Intuitively, keeping items you will need soon and evicting items you won’t need for a long time minimizes future cache misses. Counting frequencies ignores the order of future requests and can perform worse — a very frequent item that is only needed at the very end is not worth keeping now.

Is LRU Being KK-Competitive a Good Result?

Section titled “Is LRU Being KKK-Competitive a Good Result?”

Both LRU and FIFO are proven to be KK-competitive, where KK is the cache size. Is this a good guarantee?

Answer

No — KK can be very large (e.g., thousands for a megabyte-sized cache), making the bound practically useless. Furthermore, it is also proved that no online algorithm can achieve a competitive ratio better than KK, so KK-competitiveness is the best possible in the standard online setting. This makes the paging problem essentially hopeless without additional assumptions.

Resource Augmentation: LRU with a Larger Cache

Section titled “Resource Augmentation: LRU with a Larger Cache”

In the resource augmentation model, the online algorithm has a cache of size KK while the optimal offline algorithm is restricted to a smaller cache of size KK'. If K=2KK = 2K', what is LRU’s competitive ratio under this comparison?

Answer

With K=2KK = 2K', the competitive ratio is:

KKK+1KKK=KK/2=2\frac{K}{K - K' + 1} \approx \frac{K}{K - K'} = \frac{K}{K/2} = 2

So LRU is 2-competitive when given a cache twice the size of OPT’s. The intuition: being unable to see the future is a significant handicap, but doubling the cache size roughly compensates for it, bringing the competitive ratio down to a small constant.