
Lecture 14 on 03/16/2026 - Textbook Exercises: Mean Estimation; Median of Weak Estimates; Fraction Estimation

Suppose that we can obtain independent samples $X_1, X_2, \ldots$ of a random variable $X$ and that we want to use these samples to estimate $E[X]$. Using $t$ samples, we use $\left(\sum_{i=1}^t X_i\right)/t$ for our estimate of $E[X]$. We want the estimate to be within $\varepsilon E[X]$ of the true value of $E[X]$ with probability at least $1 - \delta$. We may not be able to use Chernoff's bound directly to bound how good our estimate is if $X$ is not a 0-1 random variable and we do not know its moment generating function. We develop an alternative approach that requires only a bound on the variance of $X$. Let $r = \sqrt{\text{Var}[X]}\ /\ E[X]$.

  • Show using Chebyshev's inequality that $O(r^2/(\varepsilon^2\delta))$ samples are sufficient to solve the problem.

    Solution

    Goal: estimate $E(X)$ by $\left(\sum_{i=1}^t X_i\right)/t$, given $\delta, \varepsilon \in (0, 1)$.

    Let $\hat{X}$ be the sample mean of $t$ samples of $X$:

    $$\hat{X} = \left(\sum_{i=1}^t X_i\right)/t$$

    By our typical $\varepsilon$-$\delta$ guarantee, we want the probability that $\hat{X}$ is within $\varepsilon E(X)$ of the true mean to be at least $1 - \delta$:

    $$\Pr\!\left(|\hat{X} - E(X)| \le \varepsilon E(X)\right) \ge 1 - \delta$$

    Equivalently, taking complements, the probability that $\hat{X}$ is more than $\varepsilon E(X)$ away from the true mean is at most $\delta$:

    $$\Pr\!\left(|\hat{X} - E(X)| \ge \varepsilon E(X)\right) \le \delta$$

    This has a form similar to Chebyshev's inequality. However, we cannot apply Chebyshev's directly yet: it requires the form $\Pr(|Y - E(Y)| \ge k)$, where the random variable inside the absolute value and the one whose expectation is taken are the same. Our inequality has $|\hat{X} - E(X)|$, where $\hat{X}$ and $X$ are different random variables. To remedy this, we first note that the expectation of the sample mean equals the true mean:

    $$E(\hat{X}) = E(X)$$

    So by Chebyshev's inequality:

    $$\Pr\!\left(|\hat{X} - E(\hat{X})| \ge \varepsilon E(\hat{X})\right) \le \frac{\text{Var}(\hat{X})}{\left[\varepsilon E(\hat{X})\right]^2}$$

    Now we express $\text{Var}(\hat{X})$ in terms of $\text{Var}(X)$:

    $$\text{Var}(\hat{X}) = \text{Var}\!\left(\frac{\sum_{i=1}^t X_i}{t}\right)$$

    Applying the scaling rule and the additivity of variance for independent variables:

    $$= \frac{1}{t^2} \cdot \text{Var}\!\left(\sum_{i=1}^{t} X_i\right) = \frac{1}{t^2} \cdot \sum_{i=1}^{t} \text{Var}(X_i)$$

    Since each $X_i$ is an i.i.d. copy of $X$, we have $\text{Var}(X_i) = \text{Var}(X)$, so:

    $$= \frac{1}{t^2} \cdot t \cdot \text{Var}(X) = \frac{\text{Var}(X)}{t}$$

    Substituting $\text{Var}(\hat{X}) = \frac{\text{Var}(X)}{t}$ and $E(\hat{X}) = E(X)$:

    $$\Pr\!\left(|\hat{X} - E(\hat{X})| \ge \varepsilon E(\hat{X})\right) \le \frac{\text{Var}(\hat{X})}{\left[\varepsilon E(\hat{X})\right]^2} = \frac{\text{Var}(X)}{t \cdot \left[\varepsilon E(X)\right]^2}$$

    For this to be $\le \delta$, we solve for $t$:

    $$t \ge \frac{\text{Var}(X)}{\delta \varepsilon^2 E(X)^2}$$

    Substituting $r$ (specifically $r^2 = \text{Var}(X)/E(X)^2$):

    $$t \ge \frac{r^2}{\delta \varepsilon^2}$$
  • Suppose that we need only a weak estimate that is within $\varepsilon E[X]$ of $E[X]$ with probability at least $3/4$. Argue that $O(r^2/\varepsilon^2)$ samples are enough for this weak estimate.

    Solution

    This follows directly from part (a). A "weak estimate" with probability at least $3/4$ means the failure probability is at most $1/4$, i.e. $\delta = 1/4$. Substituting into the bound from (a):

    $$t \ge \frac{r^2}{\delta \varepsilon^2} = \frac{r^2}{(1/4)\varepsilon^2} = \frac{4r^2}{\varepsilon^2} = O\!\left(\frac{r^2}{\varepsilon^2}\right)$$
  • Show that, by taking the median of $O(\log(1/\delta))$ weak estimates, we can obtain an estimate within $\varepsilon E[X]$ of $E[X]$ with probability at least $1 - \delta$. Conclude that we need only $O((r^2 \log(1/\delta))/\varepsilon^2)$ samples.

    Solution

    The Boosting Strategy: Taking the Median of Weak Estimates

    The claim is that taking the median of $O(\log(1/\delta))$ weak estimates is enough to satisfy the $\varepsilon$-$\delta$ guarantee. From part (b), each weak estimate costs $O(r^2/\varepsilon^2)$ samples, so $O(\log(1/\delta))$ of them cost:

    $$O\!\left(\frac{r^2}{\varepsilon^2} \cdot \log(1/\delta)\right)$$

    This is a significant improvement over part (a), which required $O(r^2/(\varepsilon^2\delta))$ samples for the same $1-\delta$ guarantee. Since $\log(1/\delta) \ll 1/\delta$ for small $\delta$, the median trick achieves the same guarantee much more efficiently.

    "Boosting" is a common technique for improving a weak estimator into a strong one: run the weak estimator several times independently and take the median. Why does the median work? Because if most of our independent estimates are good, then the middle estimate will be good too.

    Formalizing with indicator variables

    Run $t' = O(\log(1/\delta))$ independent weak estimates $\hat{X}_1, \ldots, \hat{X}_{t'}$, each using $O(r^2/\varepsilon^2)$ samples and each failing with probability at most $1/4$. Define the indicator

    $$Y_i = \begin{cases} 1 & \text{if } |\hat{X}_i - E[X]| > \varepsilon E[X] \text{ (bad, prob. at most } 1/4\text{)}, \\ 0 & \text{otherwise (good, prob. at least } 3/4\text{)}. \end{cases}$$

    Since each weak estimate fails with probability at most $1/4$, for the bound below we may take the worst case $Y_i \sim \text{Bernoulli}(1/4)$. Since the $t'$ estimates are independent, the $Y_i$ are also mutually independent.

    Let $\hat{m}$ be the median of the $t'$ weak estimates. The median is bad if it falls outside the acceptable range:

    $$\hat{m} > (1+\varepsilon)E[X] \quad \text{or} \quad \hat{m} < (1-\varepsilon)E[X]$$

    If $\hat{m}$ is bad, say $\hat{m} > (1+\varepsilon)E[X]$, then by definition of the median, at least $t'/2$ of the estimates are also $> (1+\varepsilon)E[X]$, meaning at least $t'/2$ estimates are bad. The same argument applies if $\hat{m} < (1-\varepsilon)E[X]$. So:

    $$\hat{m} \text{ is bad} \implies \sum_{i=1}^{t'} Y_i \ge \frac{t'}{2}$$

    In both cases (upper and lower), the bad event for the median is a subset of $\left\{\sum_i Y_i \ge t'/2\right\}$, so:

    $$\Pr(\hat{m} \text{ is bad}) \le \Pr\!\left(\sum_i Y_i \ge \frac{t'}{2}\right)$$

    Since each $Y_i$ is a Bernoulli random variable, we can apply Chernoff's bound. To do so, we need to express the threshold $t'/2$ as some multiple of the expectation of $\sum_i Y_i$. Since $Y_i \sim \text{Bernoulli}(1/4)$, by linearity of expectation:

    $$E\!\left[\sum_i Y_i\right] = \sum_i E[Y_i] = \sum_{i=1}^{t'} \frac{1}{4} = \frac{t'}{4}$$

    Now that we have $E[\sum_i Y_i] = t'/4$, we can express the threshold $t'/2$ as a multiple of this expectation: $t'/2 = 2 \cdot (t'/4)$.

    With $2 \cdot (t'/4)$ as our threshold, the deviation parameter in the Chernoff bound is $\delta' = 1$:

    $$\begin{aligned} \Pr(\hat{m} \text{ is bad}) &\le \Pr\!\left(\sum_i Y_i \ge 2 \cdot \frac{t'}{4}\right) \\ &= \Pr\!\left(\sum_i Y_i \ge (1+1) \cdot E\!\left[\sum_i Y_i\right]\right) \le e^{-\frac{1}{3} \cdot 1^2 \cdot \frac{t'}{4}} < \delta \end{aligned}$$

    The last inequality $< \delta$ holds once $t' > 12\ln(1/\delta)$, since:

    $$e^{-t'/12} < \delta \iff t' > 12\ln(1/\delta) = O(\ln(1/\delta))$$

    Therefore, $t' = O(\log(1/\delta))$ weak estimates suffice. Since each weak estimate requires $O(r^2/\varepsilon^2)$ samples, the total number of samples needed is:

    $$O\!\left(\frac{r^2}{\varepsilon^2} \cdot \log\frac{1}{\delta}\right)$$

    Compared with the naive approach of part (a), which required $O(r^2/(\varepsilon^2\delta))$ samples for the same guarantee, this is a dramatic saving. To see just how much better $\log(1/\delta)$ is than $1/\delta$:

    | $\delta$ | $1/\delta$ | $\log(1/\delta)$ |
    | --- | --- | --- |
    | 0.01 | 100 | $\approx 7$ |
    | 0.001 | 1000 | $\approx 10$ |
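The three parts above combine into a "median of means" estimator. Below is a minimal Python sketch, using the constants derived above ($t = 4r^2/\varepsilon^2$ samples per weak estimate from part (b), $t' = 12\ln(1/\delta)$ weak estimates from part (c)); the function names and the exponential test distribution are our own illustration, not from the text:

```python
import math
import random
import statistics

def weak_estimate(sample, t):
    """Part (b): average of t fresh samples; within eps*E[X] w.p. >= 3/4."""
    return sum(sample() for _ in range(t)) / t

def median_of_means(sample, eps, delta, r_squared):
    """Parts (a)-(c): estimate E[X] within eps*E[X] w.p. >= 1 - delta.

    sample:    zero-argument function returning one draw of X
    r_squared: assumed upper bound on Var[X] / E[X]^2
    """
    t = math.ceil(4 * r_squared / eps ** 2)        # samples per weak estimate
    t_prime = math.ceil(12 * math.log(1 / delta))  # number of weak estimates
    if t_prime % 2 == 0:
        t_prime += 1                               # odd count -> unique median
    return statistics.median(weak_estimate(sample, t) for _ in range(t_prime))

# Example: X ~ Exponential(1), so E[X] = 1, Var[X] = 1, and r = 1.
random.seed(0)
estimate = median_of_means(lambda: random.expovariate(1.0),
                           eps=0.1, delta=0.01, r_squared=1.0)
```

With these parameters the estimator draws $t \cdot t' = 400 \times 57$ samples and, by the analysis above, lands within $0.1\,E[X]$ of the true mean with probability at least $0.99$.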

We plan to conduct an opinion poll to find out the percentage of people in a community who want its president impeached. Assume that every person answers either yes or no. If the actual fraction of people who want the president impeached is $p$, we want to find an estimate $X$ of $p$ such that

$$\Pr(|X - p| \le \varepsilon p) > 1 - \delta$$

for a given $\varepsilon$ and $\delta$, with $0 < \varepsilon, \delta < 1$.

We query $N$ people chosen independently and uniformly at random from the community and output the fraction of them who want the president impeached. How large should $N$ be for our result to be a suitable estimator of $p$? Use Chernoff bounds, and express $N$ in terms of $p$, $\varepsilon$, and $\delta$. Calculate the value of $N$ from your bound if $\varepsilon = 0.1$ and $\delta = 0.05$ and if you know that $p$ is between 0.2 and 0.8.

Solution

Step 1: Express the estimate in terms of a count.

Let $S$ be the number of "yes" answers in our sample of $N$ people. Our estimate is $X = S/N$. So the desired guarantee becomes:

$$\Pr\!\left(\left|\frac{S}{N} - p\right| \le \varepsilon p\right) > 1 - \delta$$

Step 2: Multiply through by $N$ to get a count-based condition.

Multiplying both sides of the inner inequality by $N$:

$$\Pr\!\left(|S - Np| \le \varepsilon \cdot Np\right) > 1 - \delta$$

Step 3: Recognize $Np$ as $E[S]$.

Since each of the $N$ sampled people independently answers yes with probability $p$, we have $S \sim \text{Binomial}(N, p)$, and so $E[S] = Np$. The condition becomes:

$$\Pr\!\left(|S - E[S]| \le \varepsilon \cdot E[S]\right) > 1 - \delta$$

This is exactly the Chernoff bound form. We now bound the complementary "bad" probability.

Step 4: Apply Chernoff bounds to the bad probability.

Each response is an independent Bernoulli trial (yes with probability $p$, no otherwise), so we can apply a two-sided Chernoff bound (combining both tails with the factor of 2).

Setting $X = S$, $E[X] = E[S] = Np$, and deviation parameter $\delta' = \varepsilon$:

$$\Pr\!\left(|S - E[S]| > \varepsilon \cdot E[S]\right) \le 2e^{-\frac{1}{3}\varepsilon^2 Np}$$
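As a quick empirical sanity check (our own illustration, not part of the derivation), we can simulate the binomial count $S$ and compare the observed frequency of large deviations against the bound $2e^{-\varepsilon^2 Np/3}$:

```python
import math
import random

def empirical_tail(N, p, eps, trials=1000):
    """Fraction of trials in which |S - Np| > eps * Np, for S ~ Binomial(N, p)."""
    bad = 0
    for _ in range(trials):
        s = sum(random.random() < p for _ in range(N))  # one binomial draw
        if abs(s - N * p) > eps * N * p:
            bad += 1
    return bad / trials

random.seed(1)
N, p, eps = 2000, 0.3, 0.1
bound = 2 * math.exp(-eps ** 2 * N * p / 3)  # 2e^{-2}, about 0.27
observed = empirical_tail(N, p, eps)         # comes out well below the bound
```

The Chernoff bound is loose here: the observed tail frequency is typically far smaller than the bound, which is what makes the bound safe for choosing $N$.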

Step 5: Solve for $N$.

We want this bad probability to be less than $\delta$, and solve for $N$:

$$\begin{aligned} & 2e^{-\frac{1}{3}\varepsilon^2 Np} < \delta \\ \implies\ & e^{-\frac{1}{3}\varepsilon^2 Np} < \frac{\delta}{2} \\ \implies\ & -\frac{1}{3}\varepsilon^2 Np < \ln\frac{\delta}{2} \\ \implies\ & N > \frac{3}{p\varepsilon^2} \cdot \ln\frac{2}{\delta} \end{aligned}$$

So if $N > \frac{3}{p\varepsilon^2} \cdot \ln\frac{2}{\delta}$, we obtain a good estimate of the true fraction $p$ of "yes" answers.

Numerical calculation for $\varepsilon = 0.1$, $\delta = 0.05$, $p \in [0.2, 0.8]$.

Since the bound $\frac{3}{p\varepsilon^2} \cdot \ln\frac{2}{\delta}$ decreases as $p$ increases, the worst case (largest required $N$) is at $p = 0.2$:

$$N > \frac{3}{0.2 \cdot (0.1)^2} \cdot \ln\frac{2}{0.05} = \frac{3}{0.002} \cdot \ln 40 \approx 1500 \times 3.689 \approx 5534$$
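The final calculation can be checked mechanically. A small helper (the function name is our own) that evaluates the bound at the worst-case $p$ and rounds up to the next integer:

```python
import math

def poll_sample_size(p_min, eps, delta):
    """Smallest integer N with N > (3 / (p_min * eps^2)) * ln(2 / delta).

    The bound shrinks as p grows, so the smallest plausible p is the worst case.
    """
    return math.floor(3 / (p_min * eps ** 2) * math.log(2 / delta)) + 1

N = poll_sample_size(p_min=0.2, eps=0.1, delta=0.05)  # N = 5534, matching the estimate above
```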