
Lecture 14 on 03/16/2026 - Textbook Exercises: Mean Estimation; Median of Weak Estimates; Fraction Estimation

Suppose that we can obtain independent samples $X_1, X_2, \ldots$ of a random variable $X$ and that we want to use these samples to estimate $E[X]$. Using $t$ samples, we use $\left(\sum_{i=1}^t X_i\right)/t$ for our estimate of $E[X]$. We want the estimate to be within $\varepsilon E[X]$ of the true value of $E[X]$ with probability at least $1 - \delta$. We may not be able to use Chernoff's bound directly to bound how good our estimate is if $X$ is not a 0-1 random variable and we do not know its moment generating function. We develop an alternative approach that requires only a bound on the variance of $X$. Let $r = \sqrt{\text{Var}[X]}\ /\ E[X]$.

  • Show using Chebyshev's inequality that $O(r^2/(\varepsilon^2\delta))$ samples are sufficient to solve the problem.

    Solution

    Goal: estimate $E(X)$ by $\left(\sum_{i=1}^t X_i\right)/t$, given $\delta, \varepsilon \in (0, 1)$.

    Let $\hat{X}$ be the sample mean of $t$ samples of $X$:

    $$\hat{X} = \left(\sum_{i=1}^t X_i\right)/t$$

    By our typical $\varepsilon$-$\delta$ guarantee, we want the probability that $\hat{X}$ is within $\varepsilon E(X)$ of the true mean to be at least $1 - \delta$:

    $$\Pr\!\left(|\hat{X} - E(X)| \le \varepsilon E(X)\right) \ge 1 - \delta$$

    Equivalently, taking complements, the probability that $\hat{X}$ is more than $\varepsilon E(X)$ away from the true mean is at most $\delta$:

    $$\Pr\!\left(|\hat{X} - E(X)| \ge \varepsilon E(X)\right) \le \delta$$

    This has a form similar to Chebyshev's inequality. However, we cannot apply Chebyshev's directly yet: it requires the form $\Pr(|Y - E(Y)| \ge k)$, where the random variable inside the absolute value and the one whose expectation is taken are the same. Our inequality has $|\hat{X} - E(X)|$, where $\hat{X}$ and $X$ are different random variables. To remedy this, we first note that the expectation of the sample mean equals the true mean:

    $$E(\hat{X}) = E(X)$$

    So by Chebyshev's inequality:

    $$\Pr\!\left(|\hat{X} - E(\hat{X})| \ge \varepsilon E(\hat{X})\right) \le \frac{\text{Var}(\hat{X})}{\left[\varepsilon E(\hat{X})\right]^2}$$

    Now we express $\text{Var}(\hat{X})$ in terms of $\text{Var}(X)$:

    $$\text{Var}(\hat{X}) = \text{Var}\!\left(\frac{\sum_{i=1}^t X_i}{t}\right)$$

    Applying the scaling rule and the additivity of variance for independent variables:

    $$= \frac{1}{t^2} \cdot \text{Var}\!\left(\sum_{i=1}^{t} X_i\right) = \frac{1}{t^2} \cdot \sum_{i=1}^{t} \text{Var}(X_i)$$

    Since each $X_i$ is an i.i.d. copy of $X$, we have $\text{Var}(X_i) = \text{Var}(X)$, so:

    $$= \frac{1}{t^2} \cdot t \cdot \text{Var}(X) = \frac{\text{Var}(X)}{t}$$

    Substituting $\text{Var}(\hat{X}) = \frac{\text{Var}(X)}{t}$ and $E(\hat{X}) = E(X)$:

    $$\Pr\!\left(|\hat{X} - E(\hat{X})| \ge \varepsilon E(\hat{X})\right) \le \frac{\text{Var}(\hat{X})}{\left[\varepsilon E(\hat{X})\right]^2} = \frac{\text{Var}(X)}{t \cdot \left[\varepsilon E(X)\right]^2}$$

    For this to be $\le \delta$, we solve for $t$:

    $$t \ge \frac{\text{Var}(X)}{\delta \varepsilon^2 E(X)^2}$$

    Substituting $r$ (specifically $r^2 = \text{Var}(X)/E(X)^2$):

    $$t \ge \frac{r^2}{\delta \varepsilon^2}$$
  • Suppose that we need only a weak estimate that is within $\varepsilon E[X]$ of $E[X]$ with probability at least $3/4$. Argue that $O(r^2/\varepsilon^2)$ samples are enough for this weak estimate.

    Solution

    This follows directly from part (a). A "weak estimate" with probability at least $3/4$ means the failure probability is at most $1/4$, i.e. $\delta = 1/4$. Substituting into the bound from (a):

    $$t \ge \frac{r^2}{\delta \varepsilon^2} = \frac{r^2}{(1/4)\varepsilon^2} = \frac{4r^2}{\varepsilon^2} = O\!\left(\frac{r^2}{\varepsilon^2}\right)$$
  • Show that, by taking the median of $O(\log(1/\delta))$ weak estimates, we can obtain an estimate within $\varepsilon E[X]$ of $E[X]$ with probability at least $1 - \delta$. Conclude that we need only $O((r^2 \log(1/\delta))/\varepsilon^2)$ samples.

    Solution

    The Boosting Strategy: Taking the Median of Weak Estimates

    The claim is that taking the median of $O(\log(1/\delta))$ weak estimates is enough to satisfy the $\varepsilon$-$\delta$ guarantee. From part (b), each weak estimate costs $O(r^2/\varepsilon^2)$ samples, so $O(\log(1/\delta))$ of them cost:

    $$O\!\left(\frac{r^2}{\varepsilon^2} \cdot \log(1/\delta)\right)$$

    This is a significant improvement over part (a), which required $O(r^2/(\varepsilon^2\delta))$ samples for the same $1-\delta$ guarantee. Since $\log(1/\delta) \ll 1/\delta$ for small $\delta$, the median trick achieves the same guarantee much more efficiently.

    "Boosting" is a common technique for improving a weak estimator into a strong one: run the weak estimator several times independently and take the median. Why does the median work? Because if most of our independent estimates are good, then the middle estimate will be good too.

    Formalizing with indicator variables

    Run $t' = O(\log(1/\delta))$ independent weak estimates $\hat{X}_1, \ldots, \hat{X}_{t'}$, each using $O(r^2/\varepsilon^2)$ samples and each failing with probability at most $1/4$. Define the indicator

    $$Y_i = \begin{cases} 1 & \text{if } |\hat{X}_i - E[X]| > \varepsilon E[X] \text{ (bad, prob. at most } 1/4\text{)}, \\ 0 & \text{otherwise (good, prob. at least } 3/4\text{)}. \end{cases}$$

    Since each weak estimate fails with probability at most $1/4$, for the bound below we may take the worst case $Y_i \sim \text{Bernoulli}(1/4)$. Since the $t'$ estimates are independent, the $Y_i$ are also mutually independent.

    Let $\hat{m}$ be the median of the $t'$ weak estimates. The median is bad if it falls outside the acceptable range:

    $$\hat{m} > (1+\varepsilon)E[X] \quad \text{or} \quad \hat{m} < (1-\varepsilon)E[X]$$

    If $\hat{m}$ is bad, say $\hat{m} > (1+\varepsilon)E[X]$, then by definition of the median, at least $t'/2$ of the estimates are also $> (1+\varepsilon)E[X]$, meaning at least $t'/2$ estimates are bad. The same argument applies if $\hat{m} < (1-\varepsilon)E[X]$. So:

    $$\hat{m} \text{ is bad} \implies \sum_{i=1}^{t'} Y_i \ge \frac{t'}{2}$$

    In both cases (upper and lower), the bad event for the median is a subset of $\left\{\sum_i Y_i \ge t'/2\right\}$, so:

    $$\Pr(\hat{m} \text{ is bad}) \le \Pr\!\left(\sum_i Y_i \ge \frac{t'}{2}\right)$$

    Since each $Y_i$ is a Bernoulli random variable, we can apply Chernoff's bound. To do so, we need to express the threshold $t'/2$ as some multiple of the expectation of $\sum_i Y_i$. Since $Y_i \sim \text{Bernoulli}(1/4)$, by linearity of expectation:

    $$E\!\left[\sum_i Y_i\right] = \sum_i E[Y_i] = \sum_{i=1}^{t'} \frac{1}{4} = \frac{t'}{4}$$

    Now that we have $E[\sum_i Y_i] = t'/4$, we can express the threshold $t'/2$ as a multiple of this expectation: $t'/2 = 2 \cdot (t'/4)$.

    With $2 \cdot (t'/4)$ as our threshold, the deviation parameter in the Chernoff bound is $\delta' = 1$:

    $$\begin{aligned} \Pr(\hat{m} \text{ is bad}) &\le \Pr\!\left(\sum_i Y_i \ge 2 \cdot \frac{t'}{4}\right) \\ &= \Pr\!\left(\sum_i Y_i \ge (1+1) \cdot E\!\left[\sum_i Y_i\right]\right) \le e^{-\frac{1}{3} \cdot 1^2 \cdot \frac{t'}{4}} < \delta \end{aligned}$$

    The last inequality $< \delta$ holds once $t' > 12\ln(1/\delta)$, since:

    $$e^{-t'/12} < \delta \iff t' > 12\ln(1/\delta) = O(\ln(1/\delta))$$

    Therefore, $t' = O(\log(1/\delta))$ weak estimates suffice. Since each weak estimate requires $O(r^2/\varepsilon^2)$ samples, the total number of samples needed is:

    $$O\!\left(\frac{r^2}{\varepsilon^2} \cdot \log\frac{1}{\delta}\right)$$

    Compared with the naive approach of part (a), which required $O(r^2/(\varepsilon^2\delta))$ samples for the same guarantee, this is a dramatic saving. To see just how much better $\log(1/\delta)$ is than $1/\delta$:

    | $\delta$ | $1/\delta$ | $\log(1/\delta)$ |
    | --- | --- | --- |
    | 0.01 | 100 | $\approx 7$ |
    | 0.001 | 1000 | $\approx 10$ |
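The three parts above combine into a "median of means" estimator. Below is a minimal Python sketch, using the constants derived above ($t = 4r^2/\varepsilon^2$ samples per weak estimate from part (b), $t' = 12\ln(1/\delta)$ weak estimates from part (c)); the function names and the exponential test distribution are our own illustration, not from the text:

```python
import math
import random
import statistics

def weak_estimate(sample, t):
    """Part (b): average of t fresh samples; within eps*E[X] w.p. >= 3/4."""
    return sum(sample() for _ in range(t)) / t

def median_of_means(sample, eps, delta, r_squared):
    """Parts (a)-(c): estimate E[X] within eps*E[X] w.p. >= 1 - delta.

    sample:    zero-argument function returning one draw of X
    r_squared: assumed upper bound on Var[X] / E[X]^2
    """
    t = math.ceil(4 * r_squared / eps ** 2)        # samples per weak estimate
    t_prime = math.ceil(12 * math.log(1 / delta))  # number of weak estimates
    if t_prime % 2 == 0:
        t_prime += 1                               # odd count -> unique median
    return statistics.median(weak_estimate(sample, t) for _ in range(t_prime))

# Example: X ~ Exponential(1), so E[X] = 1, Var[X] = 1, and r = 1.
random.seed(0)
estimate = median_of_means(lambda: random.expovariate(1.0),
                           eps=0.1, delta=0.01, r_squared=1.0)
```

With these parameters the estimator draws $t \cdot t' = 400 \times 57$ samples and, by the analysis above, lands within $0.1\,E[X]$ of the true mean with probability at least $0.99$.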

We plan to conduct an opinion poll to find out the percentage of people in a community who want its president impeached. Assume that every person answers either yes or no. If the actual fraction of people who want the president impeached is $p$, we want to find an estimate $X$ of $p$ such that

$$\Pr(|X - p| \le \varepsilon p) > 1 - \delta$$

for a given $\varepsilon$ and $\delta$, with $0 < \varepsilon, \delta < 1$.

We query $N$ people chosen independently and uniformly at random from the community and output the fraction of them who want the president impeached. How large should $N$ be for our result to be a suitable estimator of $p$? Use Chernoff bounds, and express $N$ in terms of $p$, $\varepsilon$, and $\delta$. Calculate the value of $N$ from your bound if $\varepsilon = 0.1$ and $\delta = 0.05$ and if you know that $p$ is between 0.2 and 0.8.

Solution

Step 1: Express the estimate in terms of a count.

Let $S$ be the number of "yes" answers in our sample of $N$ people. Our estimate is $X = S/N$. So the desired guarantee becomes:

$$\Pr\!\left(\left|\frac{S}{N} - p\right| \le \varepsilon p\right) > 1 - \delta$$

Step 2: Multiply through by $N$ to get a count-based condition.

Multiplying both sides of the inner inequality by $N$:

$$\Pr\!\left(|S - Np| \le \varepsilon \cdot Np\right) > 1 - \delta$$

Step 3: Recognize $Np$ as $E[S]$.

Since each of the $N$ sampled people independently answers yes with probability $p$, we have $S \sim \text{Binomial}(N, p)$, and so $E[S] = Np$. The condition becomes:

$$\Pr\!\left(|S - E[S]| \le \varepsilon \cdot E[S]\right) > 1 - \delta$$

This is exactly the Chernoff bound form. We now bound the complementary "bad" probability.

Step 4: Apply Chernoff bounds to the bad probability.

Each response is an independent Bernoulli trial (yes with probability $p$, no otherwise), so we can apply a two-sided Chernoff bound (combining both tails with the factor of 2).

Setting $X = S$, $E[X] = E[S] = Np$, and deviation parameter $\delta' = \varepsilon$:

$$\Pr\!\left(|S - E[S]| > \varepsilon \cdot E[S]\right) \le 2e^{-\frac{1}{3}\varepsilon^2 Np}$$
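As a quick empirical sanity check (our own illustration, not part of the derivation), we can simulate the binomial count $S$ and compare the observed frequency of large deviations against the bound $2e^{-\varepsilon^2 Np/3}$:

```python
import math
import random

def empirical_tail(N, p, eps, trials=1000):
    """Fraction of trials in which |S - Np| > eps * Np, for S ~ Binomial(N, p)."""
    bad = 0
    for _ in range(trials):
        s = sum(random.random() < p for _ in range(N))  # one binomial draw
        if abs(s - N * p) > eps * N * p:
            bad += 1
    return bad / trials

random.seed(1)
N, p, eps = 2000, 0.3, 0.1
bound = 2 * math.exp(-eps ** 2 * N * p / 3)  # 2e^{-2}, about 0.27
observed = empirical_tail(N, p, eps)         # comes out well below the bound
```

The Chernoff bound is loose here: the observed tail frequency is typically far smaller than the bound, which is what makes the bound safe for choosing $N$.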

Step 5: Solve for $N$.

We want this bad probability to be less than $\delta$, and solve for $N$:

$$\begin{aligned} & 2e^{-\frac{1}{3}\varepsilon^2 Np} < \delta \\ \implies\ & e^{-\frac{1}{3}\varepsilon^2 Np} < \frac{\delta}{2} \\ \implies\ & -\frac{1}{3}\varepsilon^2 Np < \ln\frac{\delta}{2} \\ \implies\ & N > \frac{3}{p\varepsilon^2} \cdot \ln\frac{2}{\delta} \end{aligned}$$

So if $N > \frac{3}{p\varepsilon^2} \cdot \ln\frac{2}{\delta}$, we obtain a good estimate of the true fraction $p$ of "yes" answers.

Numerical calculation for $\varepsilon = 0.1$, $\delta = 0.05$, $p \in [0.2, 0.8]$.

Since the bound $\frac{3}{p\varepsilon^2} \cdot \ln\frac{2}{\delta}$ decreases as $p$ increases, the worst case (largest required $N$) is at $p = 0.2$:

$$N > \frac{3}{0.2 \cdot (0.1)^2} \cdot \ln\frac{2}{0.05} = \frac{3}{0.002} \cdot \ln 40 \approx 1500 \times 3.689 \approx 5534$$
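The final calculation can be checked mechanically. A small helper (the function name is our own) that evaluates the bound at the worst-case $p$ and rounds up to the next integer:

```python
import math

def poll_sample_size(p_min, eps, delta):
    """Smallest integer N with N > (3 / (p_min * eps^2)) * ln(2 / delta).

    The bound shrinks as p grows, so the smallest plausible p is the worst case.
    """
    return math.floor(3 / (p_min * eps ** 2) * math.log(2 / delta)) + 1

N = poll_sample_size(p_min=0.2, eps=0.1, delta=0.05)  # N = 5534, matching the estimate above
```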