
Lecture 4 on 02/04/2026 - Query Time Analysis and Tail Bounds

Scribes: Ayesha Jamal and Kyoshi Noda

  • Query time analysis ($E[Q]$)
  • Variance analysis ($\mathrm{Var}(Q)$)
  • Tail Bounds (Concept)
  • Markov’s Inequality
  • Chebyshev’s Inequality

We analyze the query time $Q$ for a specific element $q$. Let $X_i$ be an indicator random variable for the $i$-th element in the database ($1 \le i \le n$).

$$X_i = \begin{cases} 1 & \text{if } h(q) = h(x_i) \quad \text{(collision occurs)} \\ 0 & \text{otherwise} \end{cases}$$

Since we assume the hash function $h$ is perfectly random, each element is equally likely to land in any of the $m$ buckets, so the probability that it collides with $q$ is:

$$P(h(q) = h(x_i)) = \frac{1}{m}$$

The total query time $Q$ is the sum of all collisions:

$$Q = X_1 + X_2 + \dots + X_n$$

Using Linearity of Expectation:

$$\begin{aligned} E[Q] &= \sum_{i=1}^{n} E[X_i] \\ &= \sum_{i=1}^{n} \left( 1 \cdot \Pr(X_i=1) + 0 \cdot \Pr(X_i=0) \right) \\ &= \sum_{i=1}^{n} \frac{1}{m} \\ &= n \cdot \frac{1}{m} = \frac{n}{m} \end{aligned}$$

Conclusion: To achieve a constant expected query time $E[Q] = O(1)$, we set the number of buckets $m$ proportional to the number of items $n$ (i.e., $m = \Theta(n)$). Taking $m = n$ exactly, this results in:

$$E[Q] = \frac{n}{n} = 1$$
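This prediction is easy to check empirically. Below is a minimal simulation sketch (the function name `expected_collisions` and its parameters are illustrative, not from the lecture) that hashes $n$ keys and one query key uniformly at random into $m$ buckets and averages the number of collisions:

```python
import random

def expected_collisions(n, m, trials=2000):
    """Estimate E[Q] by hashing n keys and one query key into m buckets
    uniformly at random, counting collisions with the query each trial."""
    total = 0
    for _ in range(trials):
        h_q = random.randrange(m)  # bucket of the query element q
        # X_i = 1 iff the i-th key lands in the same bucket as q
        total += sum(random.randrange(m) == h_q for _ in range(n))
    return total / trials

# With m = n buckets the theory predicts E[Q] = n/m = 1.
print(expected_collisions(n=200, m=200))
```

The printed estimate should hover near 1, matching $E[Q] = n/m$.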

To understand how much the query time fluctuates from the average, we calculate the Variance.

Computing $\mathrm{Var}(X_i)$ for a Single Item


For a Bernoulli (indicator) variable $X_i$ with probability $p = 1/m$:

$$\begin{aligned} \mathrm{Var}(X_i) &= E[X_i^2] - (E[X_i])^2 \\ &= p - p^2 \quad \text{(since $1^2=1$ and $0^2=0$, $E[X_i^2] = E[X_i] = p$)} \\ &= p(1 - p) \\ &= pq \quad \text{(where $q = 1-p$)} \end{aligned}$$

Since the keys are hashed independently:

$$\mathrm{Var}(Q) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = n \cdot \frac{1}{m} \left( 1 - \frac{1}{m} \right)$$

If we assume $n = m$, then $\frac{n}{m} = 1$ and the factor $\left(1 - \frac{1}{m}\right) < 1$. Therefore:

$$\mathrm{Var}(Q) \le \frac{n}{m} = 1$$

Knowing the variance is small ($\le 1$) is important for using Chebyshev’s Inequality later.
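The variance formula can likewise be sanity-checked by simulation. This sketch (helper name `query_time_samples` is illustrative) draws many samples of $Q$ and compares the empirical mean and variance to $n/m$ and $n \cdot \frac{1}{m}(1 - \frac{1}{m})$:

```python
import random

def query_time_samples(n, m, trials=5000):
    """Draw samples of Q = number of keys colliding with the query."""
    samples = []
    for _ in range(trials):
        h_q = random.randrange(m)
        samples.append(sum(random.randrange(m) == h_q for _ in range(n)))
    return samples

qs = query_time_samples(n=100, m=100)
mean = sum(qs) / len(qs)
var = sum((q - mean) ** 2 for q in qs) / len(qs)
# Theory: E[Q] = 1 and Var(Q) = n * (1/m) * (1 - 1/m) = 0.99, just below 1.
print(mean, var)
```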

Although the expected query time is constant, the worst-case query time is still $O(n)$. The natural question arises: how often will the query time significantly exceed its expectation?

In computer science, we typically focus on the upper tail, since we want to bound how long an algorithm can take.

In finance, the lower tail is often more important, as it represents potential losses.

While $E[Q] = O(1)$, a user might be concerned about the worst case.

  • Worst Case: All $n$ items hash to the same bucket, so $Q = O(n)$.
  • Tail Bounds: The “tail” refers to the region of extreme outcomes in the probability distribution (e.g., $Q > 50$).

Ideally, we would calculate the exact probability:

$$\Pr(X \ge j) = \sum_{k=j}^{\infty} \Pr(X=k)$$

However, this often has no nice closed form. Instead, we use inequalities to bound this probability.
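For our hashing example the distribution happens to be known ($Q$ is $\mathrm{Binomial}(n, 1/m)$ under the independence assumption), so as a point of comparison the tail can be summed numerically; the helper name below is illustrative:

```python
from math import comb

def binomial_tail(n, p, j):
    """Exact Pr(Q >= j) for Q ~ Binomial(n, p), by summing the pmf.
    Feasible here because Q takes finitely many values; in general
    such tail sums have no convenient closed form."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(j, n + 1))

# Tail of the query time with n = m = 100, i.e. p = 1/100:
print(binomial_tail(100, 0.01, 5))  # Pr(Q >= 5) is already well below 1%
```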

Markov’s inequality is one of the most fundamental results in probability theory. It provides a simple bound on the tail probability of a non-negative random variable using only its expectation.

Requirement: You only need to know the expectation $E[X]$, and $X$ must be non-negative ($X \ge 0$).

Formula:

$$\Pr(X > t) \le \frac{E[X]}{t}$$

If we want to know the chance the query takes more than 50 steps ($t=50$) given $E[Q]=1$:

$$P(Q > 50) \le \frac{1}{50} = 2\%$$

This gives a “loose” bound. It guarantees the failure rate is at most 2%.
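Just how loose the bound is can be seen by simulation. The sketch below (names are illustrative) computes the Markov bound and compares it to the empirical frequency of $Q > 50$ in a hash table with $n = m = 1000$:

```python
import random

def markov_bound(expectation, t):
    """Markov's inequality: Pr(X > t) <= E[X] / t for X >= 0."""
    return expectation / t

print(markov_bound(1.0, 50))  # -> 0.02, i.e. at most a 2% chance

# Compare against simulation: with n = m = 1000 (so E[Q] = 1), the
# event Q > 50 essentially never occurs, so the 2% bound is very loose.
n = m = 1000
hits = 0
for _ in range(1000):
    h_q = random.randrange(m)
    q = sum(random.randrange(m) == h_q for _ in range(n))
    hits += (q > 50)
print(hits / 1000)  # empirical Pr(Q > 50), essentially 0
```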

Setup: Consider a class with 33 students. After a midterm exam, we are told that the average score is 60. No other information about the distribution of scores is provided. At most how many students could have scored at least 90?

Let $X$ denote the score of a randomly selected student. Then $E[X] = 60$. Applying Markov’s inequality with threshold $T = 90$:

$$P(X \geq 90) \leq \frac{E[X]}{T} = \frac{60}{90} = \frac{2}{3}$$

So at most a $\frac{2}{3}$ fraction of the class can score at least 90, meaning the number of such students is at most:

$$33 \cdot \frac{2}{3} = 22$$

Conclusion: At most 22 students can have scored 90 or higher.
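The bound is in fact tight here. A quick check (an illustrative scenario, not from the lecture): 22 students scoring exactly 90 and 11 scoring 0 achieve the stated average of 60, so 22 is attainable:

```python
# Tightness check: 22 students at 90 and 11 at 0 give average 60,
# so Markov's bound of 22 students cannot be improved in general.
scores = [90] * 22 + [0] * 11
assert len(scores) == 33
print(sum(scores) / len(scores))     # -> 60.0
print(sum(s >= 90 for s in scores))  # -> 22
```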

In simple words, Markov’s inequality basically says if the expectation of a non-negative random variable is fixed, then there is a maximum possible probability that the variable can exceed any threshold $t$. If too much probability were above $t$, the average would have to be larger than it actually is.

Requirement: You must know both the expectation $E[X]$ and the variance $\mathrm{Var}(X)$. This provides a “tighter” bound because it accounts for how spread out the data is.

Formula:

$$P(|X - E[X]| \geq t) \le \frac{\mathrm{Var}(X)}{t^2}$$

Equivalently, this can be written as:

$$P\left(X \geq E[X] + t \ \text{ or } \ X \leq E[X] - t\right) \leq \frac{\mathrm{Var}(X)}{t^2}$$

The left-hand side of Chebyshev’s inequality measures the probability that $X$ deviates from its mean $E[X]$ by at least $t$ in either direction. This is called a two-sided bound because it accounts for both the upper tail and the lower tail.

We can decompose the event $|X - E[X]| \geq t$ into two cases:

Case 1: $X - E[X] \geq 0$ (positive deviation)

$$|X - E[X]| = X - E[X] \geq t \implies X \geq E[X] + t$$

This is the right tail (upper tail).

Case 2: $X - E[X] < 0$ (negative deviation)

$$|X - E[X]| = E[X] - X \geq t \implies X \leq E[X] - t$$

This is the left tail (lower tail).

Together, these two cases cover all outcomes where $X$ deviates from its mean by at least $t$.

Using our calculated $\mathrm{Var}(Q) \le 1$ and $E[Q] \approx 1$ (so exceeding $t$ means deviating from the mean by roughly $t$):

$$P(Q > t) \leq \frac{1}{t^2}$$

For the same example of $t=50$:

$$P(Q > 50) \le \frac{1}{50^2} = \frac{1}{2500} = 0.04\%$$
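The two bounds for $t = 50$ can be computed side by side; the helper names below are illustrative:

```python
def markov_bound(expectation, t):
    """Markov: Pr(X > t) <= E[X] / t for non-negative X."""
    return expectation / t

def chebyshev_bound(variance, t):
    """Chebyshev: Pr(|X - E[X]| >= t) <= Var(X) / t^2."""
    return variance / t**2

t = 50
print(markov_bound(1.0, t))     # -> 0.02   (2%)
print(chebyshev_bound(1.0, t))  # -> 0.0004 (0.04%)
```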

In simple terms, Chebyshev’s inequality basically says if the variance of a random variable is fixed, then there is a maximum possible probability that the variable can be far away from its mean by more than some amount $t$. If too much probability were far from the mean, the spread (variance) would have to be larger than it actually is.

  • Markov: $\le 1/50$ (linear decay) $\to$ 2% chance.
  • Chebyshev: $\le 1/50^2$ (quadratic decay) $\to$ 0.04% chance.

By using more information (Variance), we proved the probability of a slow query is significantly lower than Markov suggested.