Final Prep: Basic Understanding Questions | CSCI 328

Questions explicitly posed to students across lectures 12-15, 17-24 (March 9 — May 6).

Lecture 12 - Streaming & Uniform Sampling

Why Does the Bloom Filter Have No False Negatives?

Why does the Bloom Filter guarantee no false negatives?

Answer

When a key is inserted, all $k$ of its hash positions are set to 1, and bits are never reset from 1 back to 0. So if a key was ever inserted, all $k$ positions it mapped to must still be 1. Any query for that key will find all $k$ positions set to 1 and correctly return “yes.”
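The argument can be checked on a toy implementation. The sketch below is only illustrative: the class name `BloomFilter`, the method names, and the use of salted SHA-256 to derive the $k$ positions are assumptions, not the hash family from lecture.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter with m bits and k hash positions per key."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = [0] * m  # bits only ever change from 0 to 1

    def _positions(self, key: str):
        # Assumed hash family: SHA-256 salted with the hash index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key: str) -> bool:
        # "Yes" only if all k positions are set; this may be a false positive.
        return all(self.bits[p] for p in self._positions(key))


bf = BloomFilter(m=64, k=3)
inserted = [f"key{i}" for i in range(20)]
for key in inserted:
    bf.insert(key)
```

Because `insert` never clears a bit, `might_contain` returns `True` for every inserted key no matter how full the filter gets.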

Bloom Filter Hash Function Count for 2% FPR

If asked to build a Bloom Filter with a 2% false positive rate, how many hash functions should you use?

Answer

k = \lceil \log_2(1/\varepsilon) \rceil = \lceil \log_2(1/0.02) \rceil = \lceil \log_2(50) \rceil = \lceil 5.64 \rceil = 6 \text{ hash functions}

Bloom Filter Bits Per Key for 2% FPR

For the same 2% false positive rate, how many bits per key does the Bloom Filter use?

Answer

\frac{m}{n} = 1.44 \cdot \log_2(1/\varepsilon) = 1.44 \cdot \log_2(50) \approx 1.44 \times 5.64 \approx 8.13 \implies \lceil 8.13 \rceil = \textbf{9 bits per key}
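Both parameter calculations can be reproduced in a few lines. This sketch assumes the logarithm is base 2; the function name `bloom_parameters` is made up for illustration.

```python
import math


def bloom_parameters(eps: float):
    """Return (number of hash functions, bits per key) for false positive rate eps."""
    log_term = math.log2(1 / eps)              # assumes log base 2
    k = math.ceil(log_term)                    # number of hash functions
    bits_per_key = math.ceil(1.44 * log_term)  # m / n, rounded up
    return k, bits_per_key


k, bits_per_key = bloom_parameters(0.02)
```

For $\varepsilon = 0.02$ this gives 6 hash functions and 9 bits per key.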

Finding a Missing Number in O(log n) Space

You’re shown 4,999 of the numbers 1 through 5,000 one at a time in a random order. One number is missing. You’re only allowed to store $O(\log n)$ bits at any time. How do you find the missing number?

Answer

Precompute the total expected sum $S_{\text{total}} = n(n+1)/2$ . Maintain a single running sum $s$ , initialized to 0. As each element $x$ arrives, update $s \leftarrow s + x$ . At the end, the missing number is $S_{\text{total}} - s$ .

The only value stored is the running sum, which is at most $\approx n^2/2$ , requiring $\approx 2\log n$ bits — far fewer than the $O(n)$ bits needed to store the full array.
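A minimal sketch of the running-sum idea follows. The function name `find_missing` and the choice of 1234 as the hidden number are illustrative assumptions.

```python
import random


def find_missing(stream, n: int) -> int:
    """Find the one number from 1..n absent from the stream, storing only a sum."""
    expected_total = n * (n + 1) // 2
    running_sum = 0  # the only state kept while the stream passes
    for x in stream:
        running_sum += x
    return expected_total - running_sum


n = 5000
missing = 1234  # hypothetical hidden number for the demo
numbers = [x for x in range(1, n + 1) if x != missing]
random.Random(0).shuffle(numbers)  # arrival order does not matter
```

The demo list exists only to feed the stream; `find_missing` itself never stores more than one integer.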

Uniform Sampling Probability: What Do s and t Represent?

In the uniform sampling algorithm, when the sample size is $s = 2$ and 9 elements have passed so far, any given element has probability $2/9$ of being in the sample. Where does the 2 come from, and where does the 9 come from?

Answer

$2 = s$ , the required sample size
$9 = t$ , the number of elements seen so far (current time)

For the sample to be uniform at time $t$ , all $t$ elements must have an equal $s/t$ chance of being in the sample.

How Is Streaming Sampling Different from Static Sampling?

How is the streaming sampling problem different from the kind of sampling problem you’d encounter in CSCI 323?

Answer

Two key differences:

Space constraint — you cannot store the entire stream, so you can’t collect all elements and then randomly choose $s$ of them after the fact.
Online / at-every-moment guarantee — at every time $t$ , the current sample must already satisfy the uniform guarantee $\Pr[\text{element in sample}] = s/t$ . The sample must continuously evolve as new elements arrive, which a one-shot static algorithm doesn’t need to worry about.

What Happens If You Always Include the New Key?

In the streaming sampling algorithm, if you always put every newly arrived key into the sample (kicking out a random existing key when full), does that give a uniform sample?

Answer

No — your sample would always consist of exactly the last $s$ keys seen. Every earlier key would have zero probability of remaining in the sample at later times, violating the uniformity requirement that all elements have equal probability.

What Must Be Stored Beyond the Sample Itself?

Can the streaming sampling algorithm be implemented by storing only the $s$ sampled keys, or must something else be stored?

Answer

You must also store $i$, the current count of how many keys have been seen so far. Without knowing $i$, you cannot compute the acceptance probability $s/i$ for each newly arriving key. Storing $i$ requires only about $\log i$ bits, so the total space is $O(s + \log T)$, where $T$ is the total length of the stream.
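The state described here (the $s$ sampled keys plus the counter $i$) is all the following sketch keeps. The function name `reservoir_sample` and the optional `rng` parameter are assumptions for illustration.

```python
import random


def reservoir_sample(stream, s: int, rng=None):
    """Keep a uniform sample of size s using only the sample and a counter."""
    rng = rng or random.Random()
    sample = []
    i = 0  # number of keys seen so far
    for x in stream:
        i += 1
        if i <= s:
            sample.append(x)  # the first s keys are always stored
        elif rng.random() < s / i:
            # Accept the new key with probability s/i and
            # evict a uniformly random key currently in the sample.
            sample[rng.randrange(s)] = x
    return sample
```

Replacing `rng.random() < s / i` with `True` gives the "always include the new key" variant, which biases the sample toward recent keys.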

Inductive Proof Base Case: Why $t = s + 1$ and Not $t = 1$ ?

In the inductive correctness proof for the uniform sampling algorithm, the base case is $t = s + 1$ rather than $t = 1$ . Why does this make sense?

Answer

For the first $s$ arrivals ( $t \leq s$ ), the algorithm simply stores every key — no probabilistic decisions are made. Since $s/t = 1$ for $t \leq s$ , all keys have probability 1 of being in the sample, which trivially satisfies the guarantee. The first genuinely interesting case — where the algorithm must decide whether to accept or reject a new key — is when the $(s+1)$ th key arrives, making $t = s + 1$ the natural base case.

Zero Term in the Inductive Step

When proving the inductive step, you use the total probability formula for the event that key $x_j$ is in the sample at time $k+1$ , conditioning on whether it was in the sample at time $k$ . One of the two resulting terms is zero. Why?

Answer

The term $\Pr[x_j \in A_{k+1} \mid x_j \notin A_k] = 0$ , because keys can only enter the sample at the moment they first appear in the stream. If $x_j$ was not accepted when it arrived, it has already passed and there is no mechanism to add it retroactively.

Lecture 13 - Morris Counter & Variance Reduction

How Many Bits to Exactly Count $T$ Items?

How many bits of working memory do you need to keep an exact count of $T$ items that have passed through a stream?

Answer

$\lceil \log T \rceil$ bits — the minimum required to represent the number $T$ in binary.

Midterm Review: Cuckoo Hashing vs. Hashing with Chaining

What is the main advantage of cuckoo hashing over hashing with chaining?

Answer

$O(1)$ worst-case query time. Hashing with chaining guarantees $O(1)$ expected query time, but a single query can take $O(n)$ in the worst case. Cuckoo hashing stores every key in exactly one of two possible positions, so any query checks at most 2 locations.

Midterm Review: Hashing with Chaining vs. FKS

What is the main advantage of hashing with chaining over FKS hashing?

Answer

Hashing with chaining has $O(n)$ worst-case preprocessing (build) time, since each of the $n$ insertions takes $O(1)$ worst-case. FKS hashing only guarantees $O(n)$ expected build time — there is some probability of needing to rebuild, so its build time is not worst-case $O(n)$ .

What Does the Morris Counter Reduce to with Increment Probability 1?

What does the Morris counter algorithm become if you change the increment probability from $1/2^c$ to 1 (always increment)?

Answer

A regular exact counter — it increments by 1 every time a key appears, and at the end its value equals exactly $N$ (the total number of keys seen). This uses $\lceil \log N \rceil$ bits, which is what the Morris counter is designed to improve upon.

Simulating a Biased Coin with Only a Fair Coin

You have only a single fair coin (probability $1/2$ heads). How do you simulate an event that occurs with probability $1/2^k$ ?

Answer

Flip the coin $k$ times. Declare “success” if every flip comes up heads. Since each flip is independent, $\Pr[\text{all heads}] = (1/2)^k = 1/2^k$ .

This matters in implementing the Morris counter: rather than storing and computing the full probability $1/2^c$ (which would require storing $2^c \approx N$ — defeating the purpose), you just flip a fair coin $c$ times and increment only if all $c$ flips are heads.
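A hedged sketch of a Morris counter built on fair coin flips is below. The class and method names are illustrative, and the output formula $2^c - 1$ is the standard Morris estimate, which is an assumption here because the text above does not state the output.

```python
import random


class MorrisCounter:
    """Approximate counter that stores only the small exponent c."""

    def __init__(self, rng=None):
        self.c = 0
        self.rng = rng or random.Random()

    def _all_heads(self, flips: int) -> bool:
        # Simulates an event of probability 1/2^flips with fair coin flips.
        # With flips == 0 this is always True (probability 1).
        return all(self.rng.random() < 0.5 for _ in range(flips))

    def increment(self) -> None:
        if self._all_heads(self.c):
            self.c += 1

    def estimate(self) -> int:
        # Assumed output formula (standard Morris estimate).
        return 2 ** self.c - 1


counter = MorrisCounter(random.Random(0))
counter.increment()  # the first increment always succeeds since c == 0
```

Only `c` is stored, and the probability $1/2^c$ is never materialized as a number.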

First Thing to Verify About a New Randomized Algorithm

If you’ve just designed a randomized algorithm that outputs an estimate $\hat{Q}$ , what is the first property you’d want to verify?

Answer

That $\mathbb{E}[\hat{Q}] = Q$ , i.e., the estimate is unbiased — correct on average. If the algorithm produces the wrong answer in expectation, there is little hope of making it useful. Establishing this is typically the first step before analyzing variance or applying concentration inequalities.

Variance of an Average of Independent Estimates

You have $T$ independent random variables $X_1, \ldots, X_T$ , each with variance $\sigma$ . If $Y = (X_1 + \cdots + X_T) / T$ , what is $\text{Var}(Y)$ ?

Answer

\text{Var}(Y) = \frac{\sigma}{T}

Derivation:

\text{Var}(Y) = \text{Var}\!\left(\frac{\sum X_i}{T}\right) = \frac{1}{T^2}\,\text{Var}\!\left(\sum X_i\right) = \frac{1}{T^2} \cdot T\sigma = \frac{\sigma}{T}

where independence lets the variance of the sum equal the sum of variances, and the scaling rule $\text{Var}(cX) = c^2\,\text{Var}(X)$ pulls out the $1/T^2$ .

How to Reduce Variance by Repetition

Suppose you have an unbiased estimator $\hat{Q}$ (correct in expectation) but with high variance. How can you reduce the variance while keeping the estimate unbiased?

Answer

Run $T$ independent copies of the algorithm to get estimates $\hat{Q}_1, \ldots, \hat{Q}_T$ , and output their average $\bar{Q} = (\hat{Q}_1 + \cdots + \hat{Q}_T)/T$ .

$\mathbb{E}[\bar{Q}] = Q$ — the average is still unbiased.
$\text{Var}(\bar{Q}) = \text{Var}(\hat{Q}) / T$ — variance shrinks by a factor of $T$ .

By Chebyshev’s inequality, the probability that $\bar{Q}$ deviates from $Q$ by more than $c$ is at most $\text{Var}(\hat{Q}) / (T c^2)$ , which can be made as small as desired by increasing $T$ .
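The Chebyshev bound can be turned around to ask how many copies are enough. In this sketch the function name `copies_needed` and the target failure probability `delta` are assumptions introduced for illustration.

```python
import math


def copies_needed(variance: float, c: float, delta: float) -> int:
    """Smallest T such that variance / (T * c**2) <= delta."""
    return math.ceil(variance / (delta * c ** 2))


T = copies_needed(variance=100, c=2, delta=0.25)
chebyshev_bound = 100 / (T * 2 ** 2)  # bound on Pr[|average - Q| > c]
```

With variance 100, deviation $c = 2$, and target failure probability $0.25$, averaging 100 copies suffices.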

Lecture 14 - Practice Problems (TA-led)

Lecture 14 was led by a TA (Gibyo) working through textbook exercises 4.9 and 4.5. The main explicit question posed mid-solution:

Expectation of the Number of Bad Weak Estimates (Exercise 4.9)

In the proof of exercise 4.9, we define $Y_i = 1$ if the $i$ -th weak estimate is bad (outside the $\varepsilon$ -relative error range), and $Y_i = 0$ otherwise. We collect $t'$ independent weak estimates. What is $\mathbb{E}\!\left[\sum_{i=1}^{t'} Y_i\right]$ ?

Answer

\mathbb{E}\!\left[\sum_{i=1}^{t'} Y_i\right] = \frac{t'}{4}

Each $Y_i$ is a Bernoulli random variable. A “weak estimate” is defined with $\delta = 1/4$, meaning each individual estimate is bad with probability at most $1/4$. Taking that probability to be exactly $1/4$ (the worst case), linearity of expectation gives the result below; independence of the estimates is not required for this step:

\mathbb{E}\!\left[\sum_{i=1}^{t'} Y_i\right] = t' \cdot \mathbb{E}[Y_i] = t' \cdot \frac{1}{4} = \frac{t'}{4}

Lecture 15 - Approximate Median & Morris+/++

Lecture 15 was again led by a TA (Gibyo). The session covered the $(\varepsilon, \delta)$ -approximate median algorithm and the Morris+ / Morris++ algorithms. Unlike the professor’s lectures, the TA rarely posed explicit conceptual questions to the class. There are no substantial Q&A exchanges to include here.

Lecture 17 - Heavy Hitters & Count Min Sketch

How to Find the Approximate Median of a Stream

If you are given a stream of numbers, how would you find an approximate median?

Answer

Take $T = O\!\left(\dfrac{\log(1/\delta)}{\varepsilon^2}\right)$ uniform samples from the stream and return the median of those samples.

This gives an $(\varepsilon, \delta)$ -approximate median — meaning the returned value has rank within $\varepsilon n$ of the true median — with probability at least $1 - \delta$ . No repetition or averaging step is needed beyond choosing a large enough sample size $T$ .

Using Count Min Sketch for Top-K: The Role of a Min-Heap

Suppose you have a Count Min Sketch that can estimate the frequency of any item. How do you use it to maintain the top- $K$ most frequent items as the stream arrives?

Answer

Maintain a min-heap of size $K$ , keyed by estimated frequency. Initialize it with the first $K$ items.

Whenever a new item appears in the stream, query its estimated frequency $\hat{f}$ from the CMS. If $\hat{f}$ exceeds the frequency of the current minimum in the heap (the least popular of the $K$ tracked items), evict that minimum and insert the new item.

This keeps the heap holding the current top- $K$ candidates at all times, using only $O(K)$ extra space and $O(\log K)$ update time per arrival.
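The heap maintenance can be sketched as follows. Several things here are assumptions: the function name `top_k`, the `update`/`estimate` callables standing in for the CMS operations, the exact `Counter` used as a stand-in sketch in the demo, and the handling of an item that is already in the heap (its stored estimate is refreshed).

```python
import heapq
from collections import Counter


def top_k(stream, k, update, estimate):
    """Track k candidate heavy items using a min-heap keyed by estimated frequency."""
    heap = []   # min-heap of [estimated frequency, item]
    entry = {}  # item -> its [estimate, item] record inside the heap
    for x in stream:
        update(x)            # stand-in for the CMS update
        f_hat = estimate(x)  # stand-in for the CMS query
        if x in entry:
            entry[x][0] = f_hat
            heapq.heapify(heap)  # O(k) repair, chosen for simplicity
        elif len(heap) < k:
            entry[x] = [f_hat, x]
            heapq.heappush(heap, entry[x])
        elif f_hat > heap[0][0]:
            record = [f_hat, x]
            evicted = heapq.heapreplace(heap, record)  # pop min, push new
            del entry[evicted[1]]
            entry[x] = record
    return sorted(heap, reverse=True)


counts = Counter()  # exact counts as a stand-in for a real sketch


def exact_update(x):
    counts[x] += 1


def exact_estimate(x):
    return counts[x]


top = top_k("abacabdabe", 2, exact_update, exact_estimate)
```

The `heapify` call costs $O(K)$ rather than $O(\log K)$; an indexed heap would restore the logarithmic update time at the cost of more bookkeeping.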

CMS Query Output

When querying the Count Min Sketch for the frequency of item $y$ , what is the final output?

Answer

Apply every row’s hash function to $y$ : compute $h_1(y), h_2(y), \ldots, h_R(y)$ to obtain one cell index per row. Read the counter stored in each of those $R$ cells and return the minimum value among them. This is why the data structure is called the count min sketch.

Why Return the Minimum?

Why does the Count Min Sketch return the minimum counter value across all rows, rather than, say, the maximum or the average?

Answer

Whenever two items hash to the same cell in a row, their counts collide and that cell’s counter gets inflated by items other than $y$ . The minimum across rows is the counter that has been inflated the least, making it the closest estimate to the true frequency. Taking the maximum or average would allow a single heavily-collided row to dominate the result.

Can the Count Min Sketch Underestimate?

Can the Count Min Sketch ever return $\hat{f}(y) < f(y)$ — an estimate smaller than the true frequency?

Answer

No. Every time $y$ appeared in the stream, its designated cell in every row was incremented by 1. Counters are never decremented. So each of those $R$ cell values is at least $f(y)$ , and therefore so is their minimum. The Count Min Sketch can only overestimate — it never underestimates.
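A toy Count Min Sketch makes the one-sided error easy to check. The class name, the method names, and the salted SHA-256 row hashes are illustrative assumptions, not the hash family from lecture.

```python
import hashlib


class CountMinSketch:
    """Toy Count Min Sketch with `rows` hash functions and `cols` counters per row."""

    def __init__(self, rows: int, cols: int):
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]

    def _cell(self, row: int, item) -> int:
        # Assumed hash family: SHA-256 salted with the row index.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest, "big") % self.cols

    def update(self, item) -> None:
        # Increment the item's cell in every row; counters never decrease.
        for r in range(self.rows):
            self.table[r][self._cell(r, item)] += 1

    def estimate(self, item) -> int:
        # Return the minimum counter across rows.
        return min(self.table[r][self._cell(r, item)] for r in range(self.rows))


stream = [1, 2, 5, 1, 3, 5, 5, 6, 5, 1, 3]
cms = CountMinSketch(rows=3, cols=4)
for x in stream:
    cms.update(x)
true_counts = {x: stream.count(x) for x in set(stream)}
```

Even with only 4 columns, where collisions are frequent, every estimate is at least the true count.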

How Is the CMS Guarantee Different from the Standard Relative-Error Guarantee?

The Count Min Sketch guarantees $\hat{f}(y) \leq f(y) + \varepsilon m$ with probability at least $1 - \delta$ . How does this differ from the usual relative-error $(\varepsilon, \delta)$ guarantee?

Answer

In a standard relative-error guarantee, the error is bounded by $\varepsilon \cdot f(y)$ — a fraction of the true frequency of the queried item. For the CMS, the error bound is $\varepsilon m$ , where $m$ is the total stream length, regardless of how large or small $f(y)$ is.

Since $f(y) \leq m$ for any item, $\varepsilon m \geq \varepsilon f(y)$ , so this is a weaker, absolute error bound. A matching lower bound shows that a relative-error guarantee requires storing the full stream, so the absolute bound is the best achievable in sublinear space.

Applying Markov to a Single CMS Row

The expected error in one row’s counter for item $y$ is at most $m/c$ (where $c = e/\varepsilon$ columns). Using Markov’s inequality, what is the probability that this row’s error exceeds $\varepsilon m$ ?

Answer

By Markov’s inequality, $\Pr[Z > t] \leq \mathbb{E}[Z] / t$ . Here:

\Pr[\text{row error} > \varepsilon m] \leq \frac{m/c}{\varepsilon m} = \frac{m \cdot \varepsilon / e}{\varepsilon m} = \frac{1}{e}

So each row independently has at most a $1/e \approx 1/2.72$ chance of its error exceeding the allowed tolerance.

Why Must All Rows Fail for the CMS to Give a Bad Answer?

For the Count Min Sketch to return an overestimate larger than $f(y) + \varepsilon m$ , what must be true of all $R$ rows simultaneously?

Answer

The CMS output is the minimum counter across all rows. For this minimum to exceed $f(y) + \varepsilon m$ , every row’s counter must exceed that threshold — all $R$ bad events must occur together.

If even one row has an error within $\varepsilon m$, the minimum is controlled and the guarantee holds. Because the rows use independent hash functions, the probability that all $R$ rows simultaneously have error greater than $\varepsilon m$ is at most $(1/e)^R$. With $R = \ln(1/\delta)$ rows (natural logarithm), this is $(1/e)^{\ln(1/\delta)} = \delta$.
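The two parameter choices can be computed directly. This sketch assumes the row count uses the natural logarithm and rounds both dimensions up; the function name `cms_dimensions` is made up for illustration.

```python
import math


def cms_dimensions(eps: float, delta: float):
    """Return (rows, columns) for error eps * m with failure probability delta."""
    cols = math.ceil(math.e / eps)         # c = e / eps columns
    rows = math.ceil(math.log(1 / delta))  # R = ln(1 / delta) rows
    return rows, cols


rows, cols = cms_dimensions(eps=0.01, delta=0.01)
failure_bound = math.exp(-rows)  # (1/e)^R
```

For $\varepsilon = \delta = 0.01$ this gives 5 rows of 272 counters, and $(1/e)^5 \approx 0.0067 \leq \delta$.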

Lecture 18 - Frequency Moments & AMS Sampling

Where Does $\mathbb{E}[X^2]$ Show Up?

If you have a random variable $X$ , for what purpose would you ever need to compute $\mathbb{E}[X^2]$ ?

Answer

In computing the variance:

\text{Var}(X) = \mathbb{E}[X^2] - \left(\mathbb{E}[X]\right)^2

The second moment $\mathbb{E}[X^2]$ is needed any time you want to measure how spread out a distribution is around its mean.

The 0th Frequency Moment

What does the 0th frequency moment of a stream represent?

Answer

F_0 = \sum_{\substack{i=1 \\ f(i) > 0}}^{n} f(i)^0 = \sum_{\substack{i=1 \\ f(i) > 0}}^{n} 1

This counts the number of distinct elements in the stream. Each item that has appeared at least once contributes $1$ (since $f(i)^0 = 1$ for $f(i) > 0$ ), and items that never appeared contribute $0$ . Estimating $F_0$ efficiently is the problem solved by HyperLogLog.

The 1st Frequency Moment

What does the 1st frequency moment represent, and what streaming algorithm computes it?

Answer

F_1 = \sum_{i=1}^{n} f(i)

This is the total number of elements in the stream — its length $m$ — since summing all frequencies counts every arrival. The Morris counter is precisely the algorithm for approximating $F_1$ in sublinear space.

AMS Sampling: Output When Sampling a Rare Item

Consider the stream $1, 2, 5, 1, 3, 5, 5, 6, 5, 1, 3$ (length $m = 11$ ). If the AMS algorithm samples element $6$ (which appears only once), what is the output?

Answer

Sampling $6$ gives $r = 1$ (it appears exactly once at or after the sampled position). The second-moment output formula is $m \cdot (r^2 - (r-1)^2)$ :

11 \cdot (1^2 - 0^2) = 11 \cdot 1 = 11

AMS Sampling: Output When Sampling the First Occurrence of a Frequent Item

In the same stream, if the algorithm samples the first occurrence of element $1$ (which appears a total of 3 times), what is the output?

Answer

Sampling the first occurrence of $1$ means $r = 3$ (all three occurrences fall at or after that position). The output is:

11 \cdot (3^2 - 2^2) = 11 \cdot (9 - 4) = 11 \cdot 5 = 55
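Both worked examples can be reproduced with a short helper. This sketch stores the whole stream purely for illustration (a real streaming implementation would not), and the function name `ams_output` with 0-indexed positions is an assumption.

```python
def ams_output(stream, position: int) -> int:
    """AMS second-moment output when the sampled position is `position` (0-indexed)."""
    m = len(stream)
    item = stream[position]
    r = stream[position:].count(item)  # occurrences at or after the sample
    return m * (r ** 2 - (r - 1) ** 2)


stream = [1, 2, 5, 1, 3, 5, 5, 6, 5, 1, 3]
outputs = [ams_output(stream, p) for p in range(len(stream))]
f2 = sum(stream.count(x) ** 2 for x in set(stream))  # true second moment
```

Position 7 holds the element $6$ and position 0 holds the first $1$. Averaging `outputs` over all 11 equally likely positions gives exactly `f2`, which is 31 for this stream.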

Boosting a Single Unbiased Estimate: General Strategy

AMS sampling produces a single estimate that is correct in expectation but has high variance. What is the general strategy for converting this into an $(\varepsilon, \delta)$ -guarantee?

Answer

The standard two-level strategy from exercise 4.9:

1. Run $T_1 = O(1/\varepsilon^2)$ independent copies of AMS and take their average. This produces a “weak estimate” that is within $\varepsilon$ of the true answer with probability at least $3/4$ (variance is reduced by averaging).
2. Run $T_2 = O(\log(1/\delta))$ independent copies of step 1 and take their median. By the median amplification argument, the final output achieves the full $(\varepsilon, \delta)$-guarantee.

This mirrors the Morris+ / Morris++ construction applied to AMS sampling.
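The two-level strategy can be sketched generically. Here `run_estimator` is a hypothetical zero-argument function returning one independent estimate, and the Gaussian-noise estimator in the demo is made up purely to exercise the code.

```python
import random
import statistics


def median_of_means(run_estimator, t1: int, t2: int) -> float:
    """Median of t2 weak estimates, each the average of t1 independent runs."""
    weak_estimates = []
    for _ in range(t2):
        mean = sum(run_estimator() for _ in range(t1)) / t1
        weak_estimates.append(mean)
    return statistics.median(weak_estimates)


rng = random.Random(0)


def noisy_estimate() -> float:
    # Hypothetical unbiased estimator: true value 10 plus noise.
    return 10 + rng.gauss(0, 3)


result = median_of_means(noisy_estimate, t1=50, t2=9)
```

The constants hidden inside $O(1/\varepsilon^2)$ and $O(\log(1/\delta))$ are not specified above, so `t1` and `t2` are left as parameters.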

Applying Chernoff to AMS: What Are We Solving For?

When applying the Chernoff bound in the analysis of AMS sampling, what is the goal of the calculation?

Answer

We want to find the minimum number of independent copies $T$ of the AMS estimator that must be averaged in order to achieve the desired $(\varepsilon, \delta)$ error guarantee.

We know the estimator is correct in expectation. We apply the generalized Chernoff bound (for random variables bounded in $[0, C]$ rather than just $\{0, 1\}$ ) to bound the probability that the average of $T$ copies deviates from the true answer by more than $\varepsilon$ . Solving for the smallest $T$ such that this probability stays at or below $\delta$ gives the required number of replications.

Lecture 19 - AMS Sampling (cont.) & Counting Distinct Elements

Why Does Averaging T Copies Preserve the Expected Value?

When you run $T$ independent copies of AMS sampling and average their outputs to get $Y$ , why is $\mathbb{E}[Y] = \mathbb{E}[X]$ , where $X$ is the output of one copy?

Answer

By linearity of expectation. Since all $T$ copies are identically distributed:

\mathbb{E}[Y] = \mathbb{E}\!\left[\frac{X_1 + \cdots + X_T}{T}\right] = \frac{1}{T}\sum_{i=1}^{T}\mathbb{E}[X_i] = \frac{1}{T} \cdot T \cdot \mathbb{E}[X] = \mathbb{E}[X]

Does Running More Copies Make the Error Probability Better or Worse?

In the AMS analysis, as you increase the number of parallel copies $T$ being averaged, does the probability of a large error increase or decrease?

Answer

It decreases. Both from intuition (more samples give a better estimate) and from the Chernoff expression: the probability bound is $e^{-\Omega(\varepsilon^2 \cdot T / C)}$ , which shrinks exponentially as $T$ grows. Running more copies always improves the guarantee.

What Is the Problem with the Exact Expression for $T$ ?

After applying Chernoff to derive the exact minimum number of copies $T$ needed for an $(\varepsilon, \delta)$ guarantee, what prevents you from using that formula directly?

Answer

The formula depends on two unknown quantities:

$f^*$ — the frequency of the most frequent item in the stream, which you don’t know ahead of time.
$F = \sum_i f(i)^2$ — the second frequency moment, which is precisely the quantity you are trying to compute.

Since neither is known before processing the stream, the formula can’t be evaluated. The fix is to upper-bound $m f^* / F \leq \sqrt{n}$ (a provable lemma), replacing the unknown expression with $\sqrt{n}$ , which is a computable quantity.

What Streams Maximize and Minimize $F = \sum f(i)^2$ ?

The second frequency moment $F = \sum_{i=1}^{n} f(i)^2$ depends on how stream elements are distributed. Which stream structure maximizes $F$ , and which minimizes it?

Answer

Maximized when all $m$ elements are the same item: that item has frequency $m$ and all others have frequency $0$ , giving $F = m^2$ .
Minimized when the stream is spread as evenly as possible across all $n$ items: each item has frequency $m/n$ , giving $F = n \cdot (m/n)^2 = m^2/n$ .

So $F$ is always in the range $[m^2/n,\, m^2]$ , and a high $F$ signals a skewed distribution (few items dominate), while a low $F$ signals a more uniform one.

Counting Distinct Elements Without Space Constraints

If you had no space constraint, how would you count the number of distinct elements in a stream?

Answer

Maintain a running set $S$ . Whenever a new element $x$ arrives, check if $x \in S$ . If it is not, insert $x$ ; if it already is, skip it. At any point, output $|S|$ as the count of distinct elements seen so far.

This works but uses $O(f_0 \log n)$ bits, where $f_0$ is the number of distinct elements and each element is a number between $1$ and $n$ requiring $\log n$ bits — far too much for a streaming setting.

Expected Position of the Minimum of Two Random Darts

You throw two darts independently and uniformly at random on the interval $[0, 1]$ . What is the expected position of the smaller (leftmost) of the two darts?

Answer

\mathbb{E}[\min(X_1, X_2)] = \frac{1}{3}

The intuition: two darts divide the interval into three equal pieces on average, so the left dart lands at $1/3$ and the right dart at $2/3$ .

More generally, if you throw $t$ darts uniformly on $[0,1]$ , the expected minimum is $1/(t+1)$ . This fact is the key to understanding why the Flajolet-Martin ideal algorithm works.

Lecture 20 - Flajolet-Martin Ideal Algorithm

How Many Hash Values Does the FM Ideal Algorithm Compute?

In the Flajolet-Martin ideal algorithm, the stream may have length $m$ with many repetitions. Over the entire stream, how many distinct hash values $h(x_i)$ are actually computed and kept track of?

Answer

Only $t$ distinct hash values ever arise, where $t$ is the number of distinct elements in the stream: one per distinct element, since the hash function is fixed and deterministic. If the same element appears ten times, it hashes to the same value every time. So repeated appearances of an item never produce a new hash value; only a genuinely new element does.

Why Does the FM Algorithm Subtract 1?

The FM ideal algorithm outputs $\frac{1}{X} - 1$ , where $X$ is the minimum hash value. Why the $-1$ ?

Answer

If $t$ is the number of distinct elements, we proved that:

\mathbb{E}[X] = \frac{1}{t + 1}

So $1/X \approx t + 1$ in expectation, and subtracting 1 gives approximately $t$ — the quantity we want to output.
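A runnable approximation of the ideal algorithm is sketched below. Since a true real-valued hash cannot be implemented, SHA-256 scaled into $(0, 1]$ is used as a stand-in; that choice and the function names are assumptions.

```python
import hashlib


def hash01(item) -> float:
    """Stand-in for the ideal hash: map an item to a value in (0, 1]."""
    digest = hashlib.sha256(str(item).encode()).digest()
    # Adding 1 keeps the value strictly positive so 1 / X is always defined.
    return (int.from_bytes(digest, "big") + 1) / 2 ** 256


def fm_ideal(stream) -> float:
    """Track the minimum hash value X and output 1 / X - 1."""
    x_min = 1.0
    for item in stream:
        x_min = min(x_min, hash01(item))
    return 1 / x_min - 1
```

Because the hash is deterministic, `fm_ideal([1, 2, 3, 1, 2, 3, 3])` and `fm_ideal([1, 2, 3])` return the same value: repetitions never change the minimum.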

Exam Question: Tracking the Maximum Hash Instead of the Minimum

You are implementing the FM ideal algorithm, but your friend accidentally tracks the maximum hash value $X'$ instead of the minimum. Is the algorithm salvageable? If so, what output formula should be used?

Answer

The algorithm is not doomed. If $t$ is the number of distinct elements, then:

\mathbb{E}[X'] = \frac{t}{t+1}

So $1 - X' \approx 1/(t+1)$ in expectation, and therefore $1/(1 - X') \approx t + 1$ . The corrected output is:

\frac{1}{1 - X'} - 1

The proof follows the same structure as the minimum case, but uses the complementary probability: the probability that all $t$ hashes land to the left of a line at position $x$ is $x^t$ , and integrating $1 - x^t$ gives the expectation of the max.

Why Is the Continuous-Hash Algorithm Called “Ideal”?

The Flajolet-Martin analysis begins with a hash function mapping items to real numbers in $[0,1]$ . Why is this called the “ideal” algorithm?

Answer

Because such a hash function cannot actually be implemented in finite space. A real number in $[0,1]$ (like $1/\sqrt{2}$ ) may require infinite bits to represent exactly. Since the entire goal of the streaming setting is to save space, an algorithm that requires infinite precision to store a single hash value is impractical — hence “ideal.” It serves only to convey the main idea before showing the practical approximation.

HyperLogLog: Why Output $2^{p+1}$ ?

The HyperLogLog algorithm tracks the position $p$ of the leftmost least-significant-bit across all hash bit vectors. Why is the output $2^{p+1}$ (and not, say, $2^p$ or $p$ itself)?

Answer

For $t$ distinct elements, the expected number of elements whose hash ends in exactly $p$ zeros followed by a 1 (i.e., whose least-significant 1-bit is at position $p$, counting positions from 0) is:

\frac{t}{2^{p+1}}

The largest position $p$ we observe is the one where this count drops to roughly $1$ . Setting $t / 2^{p+1} \approx 1$ and solving gives $t \approx 2^{p+1}$ , which is why we output that quantity.
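The single-register estimator described here can be sketched as follows. The 32-bit hash taken from SHA-256, 0-indexed bit positions, and the function names are assumptions for illustration.

```python
import hashlib


def trailing_zeros(h: int) -> int:
    """Position (0-indexed) of the lowest set bit of a positive integer."""
    return (h & -h).bit_length() - 1


def hash32(item) -> int:
    # Assumed hash: the first 32 bits of SHA-256.
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def estimate_distinct(stream) -> int:
    """Track the largest lowest-set-bit position p and output 2^(p+1)."""
    p = 0
    for x in stream:
        h = hash32(x)
        if h:  # skip the very unlikely all-zero hash
            p = max(p, trailing_zeros(h))
    return 2 ** (p + 1)
```

The only state kept is `p`, and the output is always a power of two, so this estimate is coarse on its own.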

Lecture 21/22 - HyperLogLog Analysis & Online Algorithms

Bernoulli Variance vs. Expectation

In the HyperLogLog analysis, why is the variance of each indicator Bernoulli variable $Y_i$ guaranteed to be no larger than its expectation?

Answer

For a Bernoulli with success probability $p$ :

\text{Var}(Y_i) = p(1-p) \leq p = \mathbb{E}[Y_i]

since $(1-p) \leq 1$ . Multiplying $p$ by something at most 1 can only make it smaller or equal. This lets you upper-bound the variance of indicator sums by their expectation — a convenient shortcut throughout the HyperLogLog proof.

Ski Rental: What Is the Competitive Ratio?

In the ski rental problem, skis cost $\$P$ to buy or $\$1$/day to rent. You don’t know how many days $N$ you’ll ski. The online algorithm rents for the first $P-1$ days, then buys on day $P$ if you’re still there. What is the competitive ratio?

Answer

At most 2. There are two cases:

If $N < P$ : you only rent, spending $N$ total — exactly what OPT spends. Ratio = 1.
If $N \geq P$ : you spend $(P-1) + P = 2P - 1$ (rent days 1 through $P-1$ , buy on day $P$ ). OPT spends $P$ (just buy on day 1). Ratio = $(2P-1)/P < 2$ .

In the worst case, the ratio approaches $2$ , so the competitive ratio is at most $\mathbf{2}$ .
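The two cases can be checked numerically. The function names `online_cost` and `opt_cost` and the price $P = 10$ in the demo are illustrative assumptions.

```python
def online_cost(n_days: int, price: int) -> int:
    """Rent for the first price - 1 days, then buy on day `price`."""
    if n_days < price:
        return n_days             # only ever rented
    return (price - 1) + price    # rented price - 1 days, then bought


def opt_cost(n_days: int, price: int) -> int:
    """Offline optimum: rent throughout or buy on day 1, whichever is cheaper."""
    return min(n_days, price)


P = 10
ratios = [online_cost(n, P) / opt_cost(n, P) for n in range(1, 200)]
```

With $P = 10$ the largest ratio is $19/10 = 1.9$, reached for every $N \geq P$.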

Pizza Finding: Why Does “Walk to One End First” Have a Bad Competitive Ratio?

You are in room 0 of a corridor with $n$ rooms on each side. Pizza is in room $j$ (unknown). One algorithm: walk all the way to one end; if not found, walk to the other. Why is this algorithm’s competitive ratio bad?

Answer

This algorithm’s cost is at most $2n + |j|$ : at most $n$ to reach one end, $n$ to return, then $|j|$ to reach the pizza on the other side.

OPT costs $|j|$ (walk straight there).

The competitive ratio is:

\frac{2n + |j|}{|j|} = \frac{2n}{|j|} + 1

If $|j|$ is small (e.g., pizza is in room 1), this ratio grows like $2n$ — it depends on $n$ , the corridor length. A good online algorithm should have a constant competitive ratio, not one that scales with the input size.

Pizza Finding: Why Does the Zigzag Algorithm Still Fail?

The zigzag algorithm alternates between checking room 1, $-1$ , 2, $-2$ , $3$ , $-3$ , … until the pizza is found. Why does this also have a poor competitive ratio?

Answer

If the pizza is in room $j$ , this algorithm travels roughly $2j^2$ total distance before finding it (you visit $j$ rooms in each direction, multiple times). OPT travels $|j|$ .

The competitive ratio is at least:

\frac{2j^2}{j} = 2j

This still depends on $j$, which could be as large as $n$. A competitive ratio of $2j$ is not a constant, so the zigzag algorithm is not competitive in the desired sense either. The fix is to double the turnaround distance on each sweep while alternating sides (i.e., walk out to room $1$, then $-2$, then $4$, then $-8$, and so on), which achieves a constant competitive ratio of at most $9$.
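The walking distances of both strategies can be simulated. In this sketch the doubling strategy is assumed to alternate sides while doubling its reach (turning at rooms $1, -2, 4, -8, \ldots$); that turn sequence and the function names are assumptions.

```python
def zigzag_cost(j: int) -> int:
    """Distance walked visiting rooms 1, -1, 2, -2, ... until room j (j != 0)."""
    pos, cost, d = 0, 0, 1
    while True:
        for target in (d, -d):
            cost += abs(target - pos)  # each leg reaches exactly one new room
            pos = target
            if pos == j:
                return cost
        d += 1


def doubling_cost(j: int) -> int:
    """Distance walked when turning at rooms 1, -2, 4, -8, ... (j != 0)."""
    pos, cost, reach, side = 0, 0, 1, 1
    while True:
        if j * side > 0 and abs(j) <= reach:
            return cost + abs(j - pos)  # the pizza lies on this leg
        target = side * reach
        cost += abs(target - pos)
        pos = target
        reach *= 2
        side = -side
```

For example, `zigzag_cost(3)` is 15 and `doubling_cost(5)` is 35. The zigzag ratio `zigzag_cost(j) / abs(j)` keeps growing with $|j|$, while the doubling ratio stays below 9.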

Lecture 23 - List Update Problem & Multiplicative Weight Updates

What Is a Bad Sequence for the “Do Nothing” List Algorithm?

You maintain a linked list of $N$ keys. One algorithm never moves any accessed key (it just walks to the key and leaves it in place). What is the worst sequence for this algorithm, and what is its cost?

Answer

The worst sequence is one that repeatedly requests the key at the very end of the list — for example, the sequence $N, N, N, \ldots, N$ of length $M$ .

Since the algorithm never moves anything, the key stays at position $N$ every time. Each access costs $N$ , so the total cost is $M \cdot N$ .

By contrast, an algorithm that moves each accessed key to the front of the list would pay $N$ once and then $1$ for each subsequent access of the same key, a total of $N + M - 1$. For $M \geq N$ this is at most $2M$, so the “do nothing” algorithm’s competitive ratio is at least $M \cdot N / (2M) = N/2 = \Omega(N)$.

Move-to-Front: Cost After the First Access

In the list update problem, the “move to front” heuristic moves an accessed key to the front of the list every time it is accessed. On the sequence $N, N, N, \ldots$ (always requesting the last element), how much does the algorithm pay for the first access of $N$ , and how much for each subsequent access?

Answer

First access: $N$ — the algorithm must walk from the front to position $N$ .
Each subsequent access: $1$ — after the first access, $N$ is moved to the front of the list and stays there.

So the total cost over $M$ accesses is $N + (M - 1) \cdot 1 \approx M$ , which is far cheaper than the “do nothing” algorithm’s $M \cdot N$ .
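The two totals can be verified by simulating the list. The function name `total_cost`, the use of a Python list in place of a linked list, and the values $N = 100$, $M = 1000$ are illustrative assumptions.

```python
def total_cost(initial, requests, move_to_front: bool) -> int:
    """Total access cost, where reaching position p (1-indexed) costs p."""
    lst = list(initial)
    cost = 0
    for key in requests:
        pos = lst.index(key)
        cost += pos + 1
        if move_to_front:
            lst.insert(0, lst.pop(pos))  # move the accessed key to the front
    return cost


N, M = 100, 1000
initial = list(range(1, N + 1))
requests = [N] * M  # always request the last key
do_nothing = total_cost(initial, requests, move_to_front=False)
mtf = total_cost(initial, requests, move_to_front=True)
```

Here `do_nothing` is $M \cdot N = 100000$ and `mtf` is $N + (M - 1) = 1099$.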

What Is a Bad Sequence for the Ordering-by-Frequency Algorithm?

The “order by frequency” algorithm maintains the linked list in decreasing order of how many times each key has been requested so far. What is a bad access sequence for this algorithm?

Answer

Access item $1$ exactly $N$ times, then item $2$ exactly $N$ times, then item $3$ exactly $N$ times, and so on up to item $N$ — a total sequence length of $N^2$ .

After the first half of the sequence (items $1$ through $N/2$ have each been accessed $N$ times), the first half of the list is occupied by items $1, 2, \ldots, N/2$ , all with frequency $N$ . Items $N/2 + 1$ through $N$ have frequency $0$ and sit in the second half.

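From this point on, every requested item (one of items $N/2 + 1$ through $N$) has been requested fewer than $N$ times when its request is served, so it sits behind the $N/2$ items that already have frequency $N$. Each such access therefore costs at least $N/2$, and this holds for all $N^2/2$ remaining accesses. Total cost $\geq (N^2/2) \cdot (N/2) = N^3/4 = \Omega(N^3)$.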

The “move to front” algorithm achieves $O(N^2)$ on the same sequence, so the ordering-by-frequency competitive ratio is at least $N^3 / N^2 = \Omega(N)$ — just as bad as doing nothing.

What Exactly Is the Competitive Ratio?

What is the formal definition of the competitive ratio of an online algorithm?

Answer

The competitive ratio is the maximum over all possible input sequences of the ratio:

\frac{\text{cost of the online algorithm on that sequence}}{\text{cost of OPT on that sequence}}

It is not simply the worst-case cost of the algorithm alone — an algorithm can have a high cost on a sequence where OPT also has a high cost, and the ratio might still be small. What matters is how much worse the online algorithm does compared to the offline optimal on the same input.

Can Experts Gain Weight in the Multiplicative Weight Updates Algorithm?

In the MWU algorithm, experts start with weight $1$ and may be penalized by a factor of $(1 - \varepsilon)$ each time they make a mistake. Can an expert ever have its weight increased?

Answer

No. The algorithm only ever decreases weights (multiplying by $1 - \varepsilon < 1$ on mistakes) or leaves them unchanged (on correct predictions). There is no mechanism for an expert’s weight to increase. Consequently, the total weight $\Phi^t = \sum_i w_i^t$ is non-increasing over time, starting at $N$ and only going down.

Lecture 24 - Experts Theorem Proof & Paging

Simplified Experts Setting: One Expert Is Always Correct

Suppose you have $N$ experts and you are told that one of them is always correct. What strategy guarantees the fewest mistakes, and how many mistakes does it make in the worst case?

Answer

Follow an expert until they make a mistake, then drop them and pick another. Since the one always-correct expert never makes a mistake, you only ever drop the wrong ones. Each expert you drop made at most one mistake (the one that caused you to drop them), and you start with $N$ experts, so you make at most $N - 1$ mistakes total. After that, only the always-correct expert remains, and you make no further mistakes.

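This case is simpler than the general MWU setting, where no expert is guaranteed to be always correct. Note that $N - 1$ is not the best achievable bound: taking a majority vote among the experts that have not yet erred (the halving algorithm) makes at most $\log_2 N$ mistakes, since every mistake eliminates at least half of the remaining experts.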
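A minimal sketch of the follow-until-wrong strategy, assuming binary predictions stored as `predictions[t][i]` (expert `i`'s guess on day `t`) and true outcomes in `outcomes[t]`. These names and the worst-case input are made up for illustration.

```python
def follow_until_wrong(predictions, outcomes):
    """Follow one expert until it errs, then drop it and follow the next one."""
    alive = list(range(len(predictions[0])))  # experts we have not yet caught in a mistake
    mistakes = 0
    for guesses, truth in zip(predictions, outcomes):
        leader = alive[0]
        if guesses[leader] != truth:
            mistakes += 1
            alive.pop(0)  # drop only the expert we were following
    return mistakes


# Worst case for N = 5: expert t is wrong on day t (for t < 4); expert 4 is always right.
n_experts, n_days = 5, 6
outcomes = [1] * n_days
predictions = [
    [0 if (i == t and i < n_experts - 1) else 1 for i in range(n_experts)]
    for t in range(n_days)
]
print(follow_until_wrong(predictions, outcomes))
```

On this input the strategy is wrong once per dropped expert, which matches the $N - 1$ bound.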

Why Is $w_i^t = (1 - \varepsilon)^{m_i^t}$ ?

In the MWU algorithm, expert $i$'s weight on day $t$ is $(1 - \varepsilon)^{m_i^t}$ , where $m_i^t$ is the number of mistakes expert $i$ has made before day $t$ . Why?

Answer

Each expert starts with weight $1$ . Every time expert $i$ makes a mistake, its weight is multiplied by $(1 - \varepsilon)$ . Since expert $i$ has made $m_i^t$ mistakes before day $t$ :

w_i^t = 1 \cdot (1 - \varepsilon)^{m_i^t} = (1 - \varepsilon)^{m_i^t}

The number of mistakes is exactly the exponent because each mistake contributes exactly one factor of $(1 - \varepsilon)$ .
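As a quick sanity check, the following sketch applies the penalty one day at a time and compares the result with the closed form. The mistake pattern and the value of `eps` are arbitrary choices for illustration.

```python
import math


def weight_after(mistake_flags, eps):
    """Apply the (1 - eps) penalty once for every day on which the expert erred."""
    w = 1.0                      # every expert starts with weight 1
    for made_mistake in mistake_flags:
        if made_mistake:
            w *= 1 - eps         # one factor of (1 - eps) per mistake
    return w


flags = [True, False, True, True, False]   # 3 mistakes in 5 days
eps = 0.1
print(weight_after(flags, eps), (1 - eps) ** sum(flags))
```

Both printed values agree up to floating-point rounding, because only the number of mistakes matters, not the days on which they happened.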

Why Does the Potential Function Start at $N$ ?

In the MWU potential function proof, $\Phi^t = \sum_{i=1}^{N} w_i^t$ denotes the total weight on day $t$ . Why is $\Phi^1 = N$ ?

Answer

On day 1, every expert has weight $1$ (no mistakes have been made yet, so no weights have been penalized). There are $N$ experts, so:

\Phi^1 = \sum_{i=1}^{N} w_i^1 = N \cdot 1 = N

How Does $\Phi^t$ Change Over Time?

As days pass, what happens to the potential function $\Phi^t$ ?

Answer

$\Phi^t$ either decreases or stays the same — it never increases. This is because expert weights are only ever penalized (multiplied by $1 - \varepsilon < 1$ ) or left unchanged; they are never increased. Since the total weight is the sum of individual weights that can only go down, the potential can only go down.

More precisely: on any day the algorithm makes a mistake, at least half the total weight was behind the wrong decision, and every wrong expert's weight gets multiplied by $(1 - \varepsilon)$ . If $W \geq \Phi^t / 2$ is the total weight of the experts that were wrong, then:

\Phi^{t+1} = \Phi^t - \varepsilon W \leq \Phi^t \cdot \left(1 - \frac{\varepsilon}{2}\right)

so each mistake by the algorithm multiplies the potential by at most $(1 - \varepsilon/2)$ .
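The behavior of the potential can be observed in a small simulation. This is a sketch under assumptions: predictions are binary, the algorithm follows the weighted majority (ties go to 1), and the inputs are random. The function name and parameters are hypothetical, and `potentials[0]` plays the role of $\Phi^1$.

```python
import random


def weighted_majority(predictions, outcomes, eps):
    """Run weighted majority; return the potential after each day and the mistake days."""
    weights = [1.0] * len(predictions[0])
    potentials = [sum(weights)]            # potentials[0] corresponds to Phi^1 = N
    mistake_days = []
    for day, (guesses, truth) in enumerate(zip(predictions, outcomes)):
        weight_for_1 = sum(w for w, g in zip(weights, guesses) if g == 1)
        decision = 1 if weight_for_1 >= sum(weights) / 2 else 0
        if decision != truth:
            mistake_days.append(day)
        # Penalize every expert that was wrong today; leave the others unchanged.
        weights = [w * (1 - eps) if g != truth else w for w, g in zip(weights, guesses)]
        potentials.append(sum(weights))
    return potentials, mistake_days


random.seed(0)
n_experts, n_days, eps = 8, 40, 0.25
predictions = [[random.randint(0, 1) for _ in range(n_experts)] for _ in range(n_days)]
outcomes = [random.randint(0, 1) for _ in range(n_days)]
potentials, mistake_days = weighted_majority(predictions, outcomes, eps)
print(potentials[0], potentials[-1], len(mistake_days))
```

The potential never rises, and on each day in `mistake_days` it drops by at least the factor $(1 - \varepsilon/2)$.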

LRU vs. FIFO: What Is the Difference?

In the paging problem, both LRU (least recently used) and FIFO (first in, first out) are cache eviction policies. How are they different?

Answer

LRU: when a cache miss forces an eviction, evict the item in cache that was requested least recently — i.e., the item that has gone the longest without being asked for.
FIFO: evict the item that was brought into cache the earliest — the one that has been sitting in cache the longest, regardless of whether it was requested recently.

They differ when an old item is still actively used. For example: a cache of size 3 holds items $1, 2, 3$ , brought in and first requested in that order. Item $1$ is then requested 50 more times. When a cache miss on item $4$ requires an eviction:

FIFO evicts item $1$ (it was brought in first, and later requests do not matter).
LRU evicts item $2$ (it has gone the longest without a request; item $1$ was requested most recently).
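The example can be replayed with a small simulator. This is a sketch, not a reference implementation: the function `simulate` and its policy strings are made up here, and the cache size of 3 is assumed from the example.

```python
from collections import OrderedDict


def simulate(requests, capacity, policy):
    """Return (number of misses, list of evicted items) for policy 'LRU' or 'FIFO'."""
    cache = OrderedDict()          # front of the dict = next item to evict
    misses, evicted = 0, []
    for item in requests:
        if item in cache:
            if policy == "LRU":
                cache.move_to_end(item)   # a hit refreshes recency under LRU only
            continue
        misses += 1
        if len(cache) == capacity:
            victim, _ = cache.popitem(last=False)
            evicted.append(victim)
        cache[item] = True
    return misses, evicted


requests = [1, 2, 3] + [1] * 50 + [4]
print("FIFO:", simulate(requests, 3, "FIFO"))
print("LRU: ", simulate(requests, 3, "LRU"))
```

The only difference between the two policies in this sketch is whether a cache hit moves the item to the back of the eviction order.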

Optimal Paging with Full Future Knowledge

If you could see the entire future request sequence, what is the optimal algorithm for deciding which cached item to evict on a cache miss?

Answer

Farthest in Future (proved optimal by Bélády): when a cache miss occurs, look at every item currently in cache and find the one that will not be requested again until the farthest point in the future. Evict that item.

Intuitively, keeping items you will need soon and evicting items you won’t need for a long time minimizes future cache misses. Counting frequencies ignores the order of future requests and can perform worse — a very frequent item that is only needed at the very end is not worth keeping now.
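A minimal sketch of Farthest in Future, assuming the whole request sequence is available as a Python list. The function name and the sample sequence are illustrative choices, and the linear scan for the next use is written for clarity rather than speed.

```python
def farthest_in_future(requests, capacity):
    """Count cache misses when evicting the item whose next request is farthest away."""
    cache, misses = set(), 0
    for t, item in enumerate(requests):
        if item in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            def next_use(x):
                # Index of the next request for x after time t, or infinity if none.
                for j in range(t + 1, len(requests)):
                    if requests[j] == x:
                        return j
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(item)
    return misses


requests = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(farthest_in_future(requests, 3))
```

On this sequence with a cache of size 3 the rule incurs 7 misses. Items that are never requested again count as infinitely far away, so they are evicted first.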

Is LRU Being $K$ -Competitive a Good Result?

Both LRU and FIFO are proven to be $K$ -competitive, where $K$ is the cache size. Is this a good guarantee?

Answer

No. $K$ can be very large (e.g., thousands for a megabyte-sized cache), making the bound practically useless. Furthermore, it is also proved that no deterministic online algorithm can achieve a competitive ratio better than $K$ , so $K$ -competitiveness is the best possible for deterministic algorithms in the standard online setting. This makes the paging problem look essentially hopeless without additional assumptions.
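To see how a ratio close to $K$ arises, one standard bad input cycles through $K + 1$ distinct pages. The sketch below compares LRU with an offline farthest-in-future eviction rule on such a sequence. The helper names, the choice $K = 4$, and the sequence length are assumptions made for illustration.

```python
from collections import OrderedDict


def lru_misses(requests, k):
    """Count misses for LRU with a cache of size k."""
    cache, misses = OrderedDict(), 0
    for item in requests:
        if item in cache:
            cache.move_to_end(item)        # refresh recency on a hit
        else:
            misses += 1
            if len(cache) == k:
                cache.popitem(last=False)  # evict the least recently used item
            cache[item] = True
    return misses


def offline_misses(requests, k):
    """Count misses when evicting the item whose next request is farthest in the future."""
    cache, misses = set(), 0
    for t, item in enumerate(requests):
        if item in cache:
            continue
        misses += 1
        if len(cache) == k:
            future = requests[t + 1:]
            victim = max(cache, key=lambda x: future.index(x) if x in future else len(future))
            cache.remove(victim)
        cache.add(item)
    return misses


k = 4
requests = [i % (k + 1) for i in range(200)]   # cycle through k + 1 pages
online, offline = lru_misses(requests, k), offline_misses(requests, k)
print(online, offline, online / offline)
```

LRU misses on every request of the cycle, while the offline rule misses roughly once every $K$ requests, so the ratio approaches $K$ as the sequence gets longer.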

Resource Augmentation: LRU with a Larger Cache

In the resource augmentation model, the online algorithm has a cache of size $K$ while the optimal offline algorithm is restricted to a smaller cache of size $K'$ . If $K = 2K'$ , what is LRU’s competitive ratio under this comparison?

Answer

With $K = 2K'$ , the competitive ratio is:

\frac{K}{K - K' + 1} \approx \frac{K}{K - K'} = \frac{K}{K/2} = 2

So LRU is 2-competitive when given a cache twice the size of OPT’s. The intuition: being unable to see the future is a significant handicap, but doubling the cache size roughly compensates for it, bringing the competitive ratio down to a small constant.
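The bound is easy to tabulate. The helper below simply evaluates $K / (K - K' + 1)$ for a few cache sizes; the function name and the sample values are illustrative.

```python
def lru_augmented_ratio(k, k_opt):
    """Evaluate K / (K - K' + 1), where LRU has cache size k and OPT has cache size k_opt."""
    return k / (k - k_opt + 1)


for k_opt in (4, 64, 500):
    print(k_opt, lru_augmented_ratio(k_opt, k_opt), lru_augmented_ratio(2 * k_opt, k_opt))
```

With equal cache sizes the expression gives $K$, and with $K = 2K'$ it stays just below 2 regardless of how large the caches are.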
