Suppose that we can obtain independent samples X₁, X₂, … of a random variable X, and that we want to use these samples to estimate E[X]. Using t samples, we take (∑_{i=1}^t X_i)/t as our estimate of E[X]. We want the estimate to be within εE[X] of the true value with probability at least 1 − δ. We may not be able to use Chernoff's bound directly to bound how good our estimate is if X is not a 0-1 random variable and we do not know its moment generating function. We develop an alternative approach that requires only a bound on the variance of X. Let r = √Var[X] / E[X].
Show using Chebyshev's inequality that O(r²/(ε²δ)) samples are sufficient to solve the problem.
Solution
Goal: estimate E[X] by (∑_{i=1}^t X_i)/t, given ε, δ ∈ (0, 1).
Let X̂ be the sample mean of t samples of X:
X̂ = (∑_{i=1}^t X_i) / t
Following the usual epsilon-delta guarantee, we want the probability that X̂ is within εE[X] of the true mean to be at least 1 − δ:
Pr(|X̂ − E[X]| ≤ εE[X]) ≥ 1 − δ
Equivalently, taking the complement, the probability that X̂ deviates from the true mean by more than εE[X] must be at most δ:
Pr(|X̂ − E[X]| ≥ εE[X]) ≤ δ
We notice that this has a form similar to Chebyshev's inequality. However, we cannot apply Chebyshev's inequality directly yet: it requires the form Pr(|Y − E[Y]| ≥ k), where the random variable inside the absolute value and the one whose expectation is taken are the same. Our inequality has |X̂ − E[X]|, where X̂ and X are different random variables. To remedy this, we first note that by linearity of expectation the sample mean is unbiased:
E[X̂] = E[X]
So we can now say that by Chebyshev’s inequality:
Pr(|X̂ − E[X̂]| ≥ εE[X̂]) ≤ Var(X̂) / (εE[X̂])²
Now we have to express Var(X^) in terms of Var(X).
Var(X̂) = Var((∑_{i=1}^t X_i) / t)
Applying the scaling rule Var(aY) = a²·Var(Y) and additivity of variance for independent variables:
= (1/t²) · Var(∑_{i=1}^t X_i) = (1/t²) · ∑_{i=1}^t Var(X_i)
Since each X_i is an i.i.d. copy of X, Var(X_i) = Var(X), so Var(X̂) = Var(X)/t. Substituting into the Chebyshev bound above:
Pr(|X̂ − E[X]| ≥ εE[X]) ≤ Var(X) / (t·ε²·E[X]²) = r²/(t·ε²)
Requiring this failure probability to be at most δ gives t ≥ r²/(ε²δ), so O(r²/(ε²δ)) samples are sufficient.
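As a quick sanity check of the bound t ≥ r²/(ε²δ), the sketch below simulates the estimator for an Exp(1) random variable (a hypothetical test distribution of my choosing, with E[X] = 1 and Var[X] = 1, hence r = 1) and confirms the empirical failure rate stays below δ:

```python
import math
import random

def estimate_mean(sample, t):
    """Average of t independent samples: the estimator X-hat."""
    return sum(sample() for _ in range(t)) / t

# Hypothetical test distribution: Exp(1), so E[X] = 1, Var[X] = 1, r = 1.
random.seed(0)
eps, delta, r = 0.2, 0.1, 1.0
t = math.ceil(r**2 / (eps**2 * delta))   # Chebyshev sample bound from part (a)

trials = 2000
failures = sum(
    abs(estimate_mean(lambda: random.expovariate(1.0), t) - 1.0) > eps
    for _ in range(trials)
)
print(t, failures / trials)   # empirical failure rate should be at most delta
```

In practice the observed failure rate is far below δ, since Chebyshev's inequality is quite loose; the bound only guarantees sufficiency, not tightness.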
Suppose that we need only a weak estimate that is within εE[X] of E[X] with probability at least 3/4. Argue that O(r²/ε²) samples are enough for this weak estimate.
Solution
This follows directly from part (a). A “weak estimate” with probability at least 3/4 means the failure probability is at most 1/4, i.e. δ=1/4. Substituting into the bound from (a):
t ≥ r²/(δε²) = r²/((1/4)ε²) = 4r²/ε² = O(r²/ε²)
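A hedged empirical check of the weak estimator: using the same hypothetical Exp(1) distribution (E[X] = 1, r = 1), t = 4r²/ε² samples should land within εE[X] of the true mean in at least 3/4 of trials:

```python
import math
import random

random.seed(1)
eps, r = 0.2, 1.0
t = math.ceil(4 * r**2 / eps**2)   # weak-estimate sample size from part (b)

trials = 2000
successes = sum(
    abs(sum(random.expovariate(1.0) for _ in range(t)) / t - 1.0) <= eps
    for _ in range(trials)
)
print(t, successes / trials)   # success rate should be at least 3/4
```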
Show that, by taking the median of O(log(1/δ)) weak estimates, we can obtain an estimate within εE[X] of E[X] with probability at least 1 − δ. Conclude that we need only O((r² log(1/δ))/ε²) samples.
Solution
The Boosting Strategy: Taking the Median of Weak Estimates
The claim is that taking the median of O(log(1/δ)) weak estimates is enough to satisfy the epsilon-delta guarantee. From part (b), each weak estimate costs O(r²/ε²) samples, so O(log(1/δ)) of them would cost:
O((r²/ε²) · log(1/δ))
This is a significant improvement over part (a), which required O(r²/(ε²δ)) samples for the same 1 − δ guarantee. Since log(1/δ) ≪ 1/δ for small δ, the median trick achieves the same guarantee much more efficiently.
“Boosting” is a common technique to improve a weak estimate to a strong estimate by running the weak estimator multiple times independently and taking the median. Why does the median work? Because if most of our independent estimates are good, then the middle estimate will be good too.
Formalizing with indicator variables
Run t′ = O(log(1/δ)) independent weak estimates X̂₁, …, X̂_{t′}, each using O(r²/ε²) samples and each failing with probability at most 1/4. Define:
Y_i = 1 if |X̂_i − E[X]| > εE[X] (the i-th weak estimate is bad), and Y_i = 0 otherwise (good).
Since each weak estimate fails with probability at most 1/4, each Y_i is stochastically dominated by a Bernoulli(1/4) random variable; for the bound it suffices to treat the worst case Y_i ∼ Bernoulli(1/4). Since the t′ estimates are independent, the Y_i are also mutually independent.
Let m^ be the median of the t′ weak estimates. The median is bad if it falls outside the acceptable range:
m̂ > (1 + ε)E[X]   or   m̂ < (1 − ε)E[X]
If m^ is bad, say m^>(1+ε)E[X], then by definition of the median, at least t′/2 of the estimates are also >(1+ε)E[X], meaning at least t′/2 estimates are bad. The same argument applies if m^<(1−ε)E[X]. So:
m̂ is bad ⟹ ∑_{i=1}^{t′} Y_i ≥ t′/2
In both cases (upper and lower), the bad event for the median is a subset of {∑iYi≥t′/2}, so:
Pr(m̂ is bad) ≤ Pr(∑_i Y_i ≥ t′/2)
Since the Y_i are independent Bernoulli random variables, we can apply Chernoff's bound. To do so, we need to express the threshold t′/2 as some multiple of the expectation of ∑_i Y_i. Since Y_i ∼ Bernoulli(1/4), by linearity of expectation:
E[∑_i Y_i] = ∑_i E[Y_i] = ∑_{i=1}^{t′} (1/4) = t′/4
Now that we have E[∑iYi]=t′/4, we can express the threshold t′/2 as a multiple of this expectation: t′/2=2⋅(t′/4)
With 2 · (t′/4) as our threshold, we take δ′ = 1 in the Chernoff bound Pr(Y ≥ (1 + δ′)E[Y]) ≤ e^{−δ′²·E[Y]/3}:
Pr(m̂ is bad) ≤ Pr(∑_i Y_i ≥ 2 · (t′/4)) = Pr(∑_i Y_i ≥ (1 + 1) · E[∑_i Y_i]) ≤ e^{−(1/3)·1²·(t′/4)} = e^{−t′/12} < δ
The last inequality e^{−t′/12} < δ holds when t′ > 12 ln(1/δ), since:
e^{−t′/12} < δ ⟺ t′ > 12 ln(1/δ) = O(ln(1/δ))
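To illustrate, the sketch below simulates Pr(∑_i Y_i ≥ t′/2) for worst-case Bernoulli(1/4) indicators and compares it against the e^{−t′/12} bound (the choice t′ = 48 is arbitrary, purely for illustration):

```python
import math
import random

random.seed(2)
t_prime = 48                      # arbitrary illustrative number of weak estimates
bound = math.exp(-t_prime / 12)   # Chernoff bound e^{-t'/12}

trials = 100_000
bad = sum(
    sum(random.random() < 0.25 for _ in range(t_prime)) >= t_prime / 2
    for _ in range(trials)
)
print(bad / trials, bound)   # empirical tail probability sits below the bound
```

The empirical tail is typically orders of magnitude smaller than the bound, which is expected: the e^{−t′/12} form trades tightness for simplicity.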
Therefore, t′ = O(log(1/δ)) weak estimates suffice. Since each weak estimate requires O(r²/ε²) samples, the total number of samples needed is:
O((r²/ε²) · log(1/δ))
This is a significant improvement over part (a), which used a naive approach and required O(r²/(ε²δ)) samples for the same guarantee. To see just how much better log(1/δ) is than 1/δ: for δ = 10⁻⁶, 1/δ is 1,000,000 while ln(1/δ) ≈ 13.8.
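Putting parts (b) and (c) together yields the classic median-of-means estimator. Below is a minimal sketch (the function name, the constants taken from the derivation above, and the Exp(1) test distribution are my own illustrative choices):

```python
import math
import random

def median_of_means(sample, r, eps, delta):
    """Estimate E[X] to within eps*E[X] with probability >= 1 - delta.

    sample: zero-argument function returning one draw of X
    r: sqrt(Var[X]) / E[X]
    """
    t = math.ceil(4 * r**2 / eps**2)          # samples per weak estimate (part b)
    k = math.ceil(12 * math.log(1 / delta))   # number of weak estimates (part c)
    k += 1 - k % 2                            # make k odd so the median is unambiguous
    means = sorted(
        sum(sample() for _ in range(t)) / t for _ in range(k)
    )
    return means[k // 2]

# Example: Exp(1) has E[X] = 1 and r = 1.
random.seed(3)
est = median_of_means(lambda: random.expovariate(1.0), r=1.0, eps=0.2, delta=0.01)
print(est)   # within 0.2 of the true mean 1 with probability >= 0.99
```

Taking k odd is a convenience: it makes the median a single group mean rather than an average of two, so the "at least half the estimates are bad" argument applies verbatim.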
We plan to conduct an opinion poll to find out the percentage of people in a community who want its president impeached. Assume that every person answers either yes or no. If the actual fraction of people who want the president impeached is p, we want to find an estimate X of p such that
Pr(∣X−p∣≤εp)>1−δ
for a given ε and δ, with 0<ε,δ<1.
We query N people chosen independently and uniformly at random from the community and output the fraction of them who want the president impeached. How large should N be for our result to be a suitable estimator of p? Use Chernoff bounds, and express N in terms of p, ε, and δ. Calculate the value of N from your bound if ε=0.1 and δ=0.05 and if you know that p is between 0.2 and 0.8.
Solution
Step 1: Express the estimate in terms of a count.
Let S be the number of people in our sample of N who answer yes. Our estimate is X = S/N, so the desired guarantee becomes:
Pr(|S/N − p| ≤ εp) > 1 − δ
Step 2: Multiply through by N to get a count-based condition.
Multiplying both sides of the inner inequality by N:
Pr(∣S−Np∣≤ε⋅Np)>1−δ
Step 3: Recognize Np as E[S].
Since each of the N sampled people independently answers yes with probability p, we have S ∼ Binomial(N, p), and so E[S] = Np. The condition becomes:
Pr(∣S−E[S]∣≤ε⋅E[S])>1−δ
This is exactly the Chernoff bound form. We now bound the complementary “bad” probability.
Step 4: Apply Chernoff bounds to the bad probability.
Each response is an independent Bernoulli trial (yes with probability p, no otherwise), so we can apply a two-sided Chernoff bound (combining both tails with the factor of 2).
Setting X=S, E[X]=E[S]=Np, and δ′=ε:
Pr(|S − E[S]| > ε·E[S]) ≤ 2e^{−ε²Np/3}
Step 5: Solve for N.
We want this bad probability to be less than δ, and solve for N:
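For reference, solving 2e^{−ε²Np/3} ≤ δ for N gives N ≥ (3/(ε²p)) · ln(2/δ). A sketch of the numeric case from the problem (ε = 0.1, δ = 0.05, and the worst case p = 0.2, since the required N grows as p shrinks):

```python
import math

def required_samples(p, eps, delta):
    """Smallest N satisfying 2 * exp(-eps^2 * N * p / 3) <= delta."""
    return math.ceil(3 * math.log(2 / delta) / (eps**2 * p))

# Worst case over p in [0.2, 0.8] is p = 0.2 (the exponent shrinks with p).
N = required_samples(p=0.2, eps=0.1, delta=0.05)
print(N)   # prints 5534
```

Since we only know p ∈ [0.2, 0.8], we must size the poll for the worst case p = 0.2; roughly 5,500 respondents suffice under this bound.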