
Lecture 4 on 02/04/2026 - Query Time Analysis and Tail Bounds

Scribes: Ayesha Jamal and Kyoshi Noda

  • Query time analysis ($E[Q]$)
  • Variance analysis ($\mathrm{Var}(Q)$)
  • Tail Bounds (Concept)
  • Markov’s Inequality
  • Chebyshev’s Inequality

We analyze the query time $Q$ for a specific element $q$. Let $X_i$ be an indicator random variable for the $i$-th element in the database ($1 \le i \le n$).

$$X_i = \begin{cases} 1 & \text{if } h(q) = h(x_i) \quad \text{(collision occurs)} \\ 0 & \text{otherwise} \end{cases}$$

Since we assume the hash function $h$ is perfectly random, each element is equally likely to land in any of the $m$ buckets, so the probability that it collides with $q$ is:

$$P(h(q) = h(x_i)) = \frac{1}{m}$$

The total query time $Q$ is the sum of all collisions:

$$Q = X_1 + X_2 + \dots + X_n$$

Using Linearity of Expectation:

$$\begin{aligned} E[Q] &= \sum_{i=1}^{n} E[X_i] \\ &= \sum_{i=1}^{n} \left( 1 \cdot \Pr(X_i=1) + 0 \cdot \Pr(X_i=0) \right) \\ &= \sum_{i=1}^{n} \frac{1}{m} \\ &= n \cdot \frac{1}{m} = \frac{n}{m} \end{aligned}$$

Conclusion: To achieve a constant expected query time $E[Q] = O(1)$, we set the number of buckets $m$ proportional to the number of items $n$ (i.e., $m = \Theta(n)$). Taking $m = n$ exactly, this results in:

$$E[Q] = \frac{n}{n} = 1$$
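This prediction is easy to check empirically. Below is a minimal simulation sketch (the function name `expected_collisions` and its parameters are illustrative, not from the lecture) that hashes $n$ keys and one query key uniformly at random into $m$ buckets and averages the number of collisions:

```python
import random

def expected_collisions(n, m, trials=2000):
    """Estimate E[Q] by hashing n keys and one query key into m buckets
    uniformly at random, counting collisions with the query each trial."""
    total = 0
    for _ in range(trials):
        h_q = random.randrange(m)  # bucket of the query element q
        # X_i = 1 iff the i-th key lands in the same bucket as q
        total += sum(random.randrange(m) == h_q for _ in range(n))
    return total / trials

# With m = n buckets the theory predicts E[Q] = n/m = 1.
print(expected_collisions(n=200, m=200))
```

The printed estimate should hover near 1, matching $E[Q] = n/m$.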

To understand how much the query time fluctuates from the average, we calculate the Variance.

Computing $\mathrm{Var}(X_i)$ for a Single Item


For a Bernoulli (indicator) variable $X_i$ with probability $p = 1/m$:

$$\begin{aligned} \mathrm{Var}(X_i) &= E[X_i^2] - (E[X_i])^2 \\ &= p - p^2 \quad \text{(since $1^2=1$ and $0^2=0$, $E[X_i^2] = E[X_i] = p$)} \\ &= p(1 - p) \\ &= pq \quad \text{(where $q = 1-p$)} \end{aligned}$$

Since the keys are hashed independently:

$$\mathrm{Var}(Q) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = n \cdot \frac{1}{m} \left( 1 - \frac{1}{m} \right)$$

If we assume $n = m$, then $\frac{n}{m} = 1$ and the factor $\left(1 - \frac{1}{m}\right) < 1$. Therefore:

$$\mathrm{Var}(Q) \le \frac{n}{m} = 1$$

Knowing the variance is small ($\le 1$) is important for using Chebyshev’s Inequality later.
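The variance formula can likewise be sanity-checked by simulation. This sketch (helper name `query_time_samples` is illustrative) draws many samples of $Q$ and compares the empirical mean and variance to $n/m$ and $n \cdot \frac{1}{m}(1 - \frac{1}{m})$:

```python
import random

def query_time_samples(n, m, trials=5000):
    """Draw samples of Q = number of keys colliding with the query."""
    samples = []
    for _ in range(trials):
        h_q = random.randrange(m)
        samples.append(sum(random.randrange(m) == h_q for _ in range(n)))
    return samples

qs = query_time_samples(n=100, m=100)
mean = sum(qs) / len(qs)
var = sum((q - mean) ** 2 for q in qs) / len(qs)
# Theory: E[Q] = 1 and Var(Q) = n * (1/m) * (1 - 1/m) = 0.99, just below 1.
print(mean, var)
```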

Although the expected query time is constant, the worst-case query time is still $O(n)$. The natural question arises: how often will the query time significantly exceed its expectation?

In computer science, we typically focus on the upper tail, since we want to bound how long an algorithm can take.

In finance, the lower tail is often more important, as it represents potential losses.

While $E[Q] = O(1)$, a user might be concerned about the worst case.

  • Worst Case: All $n$ items hash to the same bucket, so $Q = O(n)$.
  • Tail Bounds: The “tail” refers to the region of extreme outcomes in the probability distribution (e.g., $Q > 50$).

Ideally, we would calculate the exact probability:

$$\Pr(X \ge j) = \sum_{k=j}^{\infty} \Pr(X=k)$$

However, this often has no nice closed form. Instead, we use inequalities to bound this probability.
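For our hashing example the distribution happens to be known ($Q$ is $\mathrm{Binomial}(n, 1/m)$ under the independence assumption), so as a point of comparison the tail can be summed numerically; the helper name below is illustrative:

```python
from math import comb

def binomial_tail(n, p, j):
    """Exact Pr(Q >= j) for Q ~ Binomial(n, p), by summing the pmf.
    Feasible here because Q takes finitely many values; in general
    such tail sums have no convenient closed form."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(j, n + 1))

# Tail of the query time with n = m = 100, i.e. p = 1/100:
print(binomial_tail(100, 0.01, 5))  # Pr(Q >= 5) is already well below 1%
```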

Markov’s inequality is one of the most fundamental results in probability theory. It provides a simple bound on the tail probability of a non-negative random variable using only its expectation.

Requirement: You only need to know the expectation $E[X]$, and $X$ must be non-negative ($X \ge 0$).

Formula:

$$\Pr(X > t) \le \frac{E[X]}{t}$$

If we want to know the chance the query takes more than 50 steps ($t=50$) given $E[Q]=1$:

$$P(Q > 50) \le \frac{1}{50} = 2\%$$

This gives a “loose” bound. It guarantees the failure rate is at most 2%.
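Just how loose the bound is can be seen by simulation. The sketch below (names are illustrative) computes the Markov bound and compares it to the empirical frequency of $Q > 50$ in a hash table with $n = m = 1000$:

```python
import random

def markov_bound(expectation, t):
    """Markov's inequality: Pr(X > t) <= E[X] / t for X >= 0."""
    return expectation / t

print(markov_bound(1.0, 50))  # -> 0.02, i.e. at most a 2% chance

# Compare against simulation: with n = m = 1000 (so E[Q] = 1), the
# event Q > 50 essentially never occurs, so the 2% bound is very loose.
n = m = 1000
hits = 0
for _ in range(1000):
    h_q = random.randrange(m)
    q = sum(random.randrange(m) == h_q for _ in range(n))
    hits += (q > 50)
print(hits / 1000)  # empirical Pr(Q > 50), essentially 0
```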

Setup: Consider a class with 33 students. After a midterm exam, we are told that the average score is 60. No other information about the distribution of scores is provided. At most how many students could have scored at least 90?

Let $X$ denote the score of a randomly selected student. Then $E[X] = 60$. Applying Markov’s inequality with threshold $T = 90$:

$$P(X \geq 90) \leq \frac{E[X]}{T} = \frac{60}{90} = \frac{2}{3}$$

So at most a $\frac{2}{3}$ fraction of the class can score at least 90, meaning the number of such students is at most:

$$33 \cdot \frac{2}{3} = 22$$

Conclusion: At most 22 students can have scored 90 or higher.
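The bound is in fact tight here. A quick check (an illustrative scenario, not from the lecture): 22 students scoring exactly 90 and 11 scoring 0 achieve the stated average of 60, so 22 is attainable:

```python
# Tightness check: 22 students at 90 and 11 at 0 give average 60,
# so Markov's bound of 22 students cannot be improved in general.
scores = [90] * 22 + [0] * 11
assert len(scores) == 33
print(sum(scores) / len(scores))     # -> 60.0
print(sum(s >= 90 for s in scores))  # -> 22
```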

In simple words, Markov’s inequality basically says if the expectation of a non-negative random variable is fixed, then there is a maximum possible probability that the variable can exceed any threshold $t$. If too much probability were above $t$, the average would have to be larger than it actually is.

Requirement: You must know both the expectation $E[X]$ and the variance $\mathrm{Var}(X)$. This provides a “tighter” bound because it accounts for how spread out the data is.

Formula:

$$P(|X - E[X]| \geq t) \le \frac{\mathrm{Var}(X)}{t^2}$$

Equivalently, this can be written as:

$$P\left(X \geq E[X] + t \ \text{ or } \ X \leq E[X] - t\right) \leq \frac{\mathrm{Var}(X)}{t^2}$$

The left-hand side of Chebyshev’s inequality measures the probability that $X$ deviates from its mean $E[X]$ by at least $t$ in either direction. This is called a two-sided bound because it accounts for both the upper tail and the lower tail.

We can decompose the event $|X - E[X]| \geq t$ into two cases:

Case 1: $X - E[X] \geq 0$ (positive deviation)

$$|X - E[X]| = X - E[X] \geq t \implies X \geq E[X] + t$$

This is the right tail (upper tail).

Case 2: $X - E[X] < 0$ (negative deviation)

$$|X - E[X]| = E[X] - X \geq t \implies X \leq E[X] - t$$

This is the left tail (lower tail).

Together, these two cases cover all outcomes where $X$ deviates from its mean by at least $t$.

Using our calculated $\mathrm{Var}(Q) \le 1$ and $E[Q] \approx 1$ (so exceeding $t$ means deviating from the mean by roughly $t$):

$$P(Q > t) \leq \frac{1}{t^2}$$

For the same example of $t=50$:

$$P(Q > 50) \le \frac{1}{50^2} = \frac{1}{2500} = 0.04\%$$
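The two bounds for $t = 50$ can be computed side by side; the helper names below are illustrative:

```python
def markov_bound(expectation, t):
    """Markov: Pr(X > t) <= E[X] / t for non-negative X."""
    return expectation / t

def chebyshev_bound(variance, t):
    """Chebyshev: Pr(|X - E[X]| >= t) <= Var(X) / t^2."""
    return variance / t**2

t = 50
print(markov_bound(1.0, t))     # -> 0.02   (2%)
print(chebyshev_bound(1.0, t))  # -> 0.0004 (0.04%)
```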

In simple terms, Chebyshev’s inequality basically says if the variance of a random variable is fixed, then there is a maximum possible probability that the variable can be far away from its mean by more than some amount $t$. If too much probability were far from the mean, the spread (variance) would have to be larger than it actually is.

  • Markov: $\le 1/50$ (linear decay) $\to$ 2% chance.
  • Chebyshev: $\le 1/50^2$ (quadratic decay) $\to$ 0.04% chance.

By using more information (Variance), we proved the probability of a slow query is significantly lower than Markov suggested.