
Lecture 02/04/2026 - Query Time Analysis and Tail Bounds

Scribes: Ayesha Jamal and Kyoshi Noda

  • Query time analysis
  • Variance analysis
  • Tail Bounds (Concept)
  • Markov’s Inequality
  • Chebyshev’s Inequality

We analyze the query time for a specific query element $q$. Let $X_i$ be an indicator random variable for the $i$-th element $x_i$ in the database ($i = 1, \dots, n$): $X_i = 1$ if $h(x_i) = h(q)$, and $X_i = 0$ otherwise.

Since we assume the hash function $h$ is perfectly random, each element is equally likely to land in any of the $m$ buckets, so the probability of any element colliding with $q$ is uniform over the buckets:

$$\Pr[X_i = 1] = \frac{1}{m}.$$

The total query time $T$ is the sum of all collisions:

$$T = \sum_{i=1}^{n} X_i.$$

Using linearity of expectation:

$$\mathbb{E}[T] = \sum_{i=1}^{n} \mathbb{E}[X_i] = \sum_{i=1}^{n} \frac{1}{m} = \frac{n}{m}.$$

Conclusion: To achieve a constant expected query time $O(1)$, we set the number of buckets proportional to the number of items (i.e., $m = \Theta(n)$). This results in:

$$\mathbb{E}[T] = \frac{n}{m} = O(1).$$
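As a sanity check, here is a small simulation (a sketch, not from the lecture; the function name and the parameters `n`, `m`, `trials` are illustrative) that models a perfectly random hash function by drawing uniform random buckets and estimates the expected number of collisions with a fixed query:

```python
import random

def expected_query_time(n, m, trials=200, seed=0):
    """Estimate E[T]: the number of keys landing in the same bucket
    as a fixed query, under a perfectly random hash (uniform buckets)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        query_bucket = rng.randrange(m)  # the query's bucket
        # Count keys whose (uniform random) bucket collides with the query's.
        total += sum(1 for _ in range(n) if rng.randrange(m) == query_bucket)
    return total / trials

# With m = n buckets the estimate should be close to n/m = 1.
print(round(expected_query_time(n=1000, m=1000), 2))
```

Halving the number of buckets (`m = n/2`) should roughly double the estimate, matching $\mathbb{E}[T] = n/m$.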

To understand how much the query time fluctuates from the average, we calculate the Variance.

For a Bernoulli (indicator) variable $X_i$ with success probability $p = \frac{1}{m}$:

$$\operatorname{Var}[X_i] = p(1 - p) = \frac{1}{m}\left(1 - \frac{1}{m}\right).$$

Since the keys are hashed independently, the variances add:

$$\operatorname{Var}[T] = \sum_{i=1}^{n} \operatorname{Var}[X_i] = n \cdot \frac{1}{m}\left(1 - \frac{1}{m}\right).$$

If we assume $m = n$, then $p = \frac{1}{n}$ and the term $\left(1 - \frac{1}{n}\right) \le 1$. Therefore:

$$\operatorname{Var}[T] = n \cdot \frac{1}{n}\left(1 - \frac{1}{n}\right) \le 1.$$

Knowing the variance is small ($\operatorname{Var}[T] \le 1$) is important for using Chebyshev’s inequality later.
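A simulation can also check the variance bound empirically (a sketch; the function name and parameters are illustrative, and a perfectly random hash is again modeled by uniform random bucket draws):

```python
import random

def query_time_samples(n, m, trials, seed=0):
    """Sample T = number of keys that land in the query's bucket,
    modeling a perfectly random hash by uniform bucket draws."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        bucket = rng.randrange(m)  # the query's bucket
        samples.append(sum(1 for _ in range(n) if rng.randrange(m) == bucket))
    return samples

n = m = 1000
xs = query_time_samples(n, m, trials=500)
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
# Theory: Var[T] = n * (1/m) * (1 - 1/m), which is just under 1 when m = n.
print(round(mean, 2), round(var, 2))
```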

Although the expected query time is constant, the worst-case query time is still $O(n)$. The natural question arises: how often will the query time significantly exceed its expectation?

In computer science, we typically focus on the upper tail, since we want to bound how long an algorithm can take.

In finance, the lower tail is often more important, as it represents potential losses.

While $\mathbb{E}[T] = O(1)$, a user might be concerned about the worst case.

  • Worst Case: All items hash to the same bucket, giving $T = n$, i.e. $O(n)$ time.
  • Tail Bounds: The “tail” refers to the region of extreme outcomes in the probability distribution (e.g., $T \ge 50$).

Ideally, we would calculate the exact probability

$$\Pr[T \ge t].$$

However, this often has no nice closed form. Instead, we use inequalities to bound this probability.

Markov’s inequality is one of the most fundamental results in probability theory. It provides a simple bound on the tail probability of a non-negative random variable using only its expectation.

Requirement: You only need to know the expectation $\mathbb{E}[X]$, and $X$ must be non-negative ($X \ge 0$).

Formula:

$$\Pr[X \ge a] \le \frac{\mathbb{E}[X]}{a}.$$

If we want to know the chance the query takes at least 50 steps ($T \ge 50$) given $\mathbb{E}[T] = 1$:

$$\Pr[T \ge 50] \le \frac{\mathbb{E}[T]}{50} = \frac{1}{50} = 2\%.$$

This gives a “loose” bound. It guarantees the failure rate is at most 2%.
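In code the bound is a one-liner; the helper below is a hypothetical sketch (its name is mine, not the lecture's):

```python
def markov_bound(expectation, threshold):
    """Markov's inequality: Pr[X >= a] <= E[X] / a for non-negative X."""
    assert expectation >= 0 and threshold > 0
    # A probability can never exceed 1, so cap the bound there.
    return min(1.0, expectation / threshold)

# Chance the query takes at least 50 steps when E[T] = 1:
print(markov_bound(1, 50))  # 0.02, i.e. at most a 2% failure rate
```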

Setup: Consider a class with 33 students. After a midterm exam, we are told that the average score is 60. No other information about the distribution of scores is provided. At most how many students could have scored at least 90?

Let $S$ denote the score of a randomly selected student. Then $\mathbb{E}[S] = 60$. Applying Markov’s inequality with $a = 90$:

$$\Pr[S \ge 90] \le \frac{\mathbb{E}[S]}{90} = \frac{60}{90} = \frac{2}{3}.$$

Since the student is chosen uniformly at random, $\Pr[S \ge 90]$ equals the fraction of students scoring at least 90, so the number of such students is at most:

$$33 \cdot \frac{2}{3} = 22.$$

Conclusion: At most 22 students can have scored 90 or higher.

In simple words, Markov’s inequality says that if the expectation of a non-negative random variable is fixed, then there is a maximum possible probability that the variable can exceed any threshold $a$. If too much probability mass were above $a$, the average would have to be larger than it actually is.

Requirement: You must know both the expectation $\mathbb{E}[X]$ and the variance $\operatorname{Var}[X]$. This provides a “tighter” bound because it accounts for how spread out the data is.

Formula:

$$\Pr\big[|X - \mathbb{E}[X]| \ge a\big] \le \frac{\operatorname{Var}[X]}{a^2}.$$

Equivalently, writing $\sigma = \sqrt{\operatorname{Var}[X]}$ for the standard deviation and setting $a = k\sigma$, this can be written as:

$$\Pr\big[|X - \mathbb{E}[X]| \ge k\sigma\big] \le \frac{1}{k^2}.$$

The left-hand side of Chebyshev’s inequality measures the probability that $X$ deviates from its mean by at least $a$ in either direction. This is called a two-sided bound because it accounts for both the upper tail and the lower tail.

We can decompose the event $|X - \mathbb{E}[X]| \ge a$ into two cases:

Case 1: $X \ge \mathbb{E}[X] + a$ (positive deviation).

This is the right tail (upper tail).

Case 2: $X \le \mathbb{E}[X] - a$ (negative deviation).

This is the left tail (lower tail).

Together, these two cases cover all outcomes where $X$ deviates from its mean by at least $a$.

Using our calculated $\operatorname{Var}[T] \le 1$ and expected value $\mathbb{E}[T] \approx 1$, consider the same example of $T \ge 50$. The event $T \ge 50$ implies $|T - 1| \ge 49$, so:

$$\Pr[T \ge 50] \le \Pr\big[|T - 1| \ge 49\big] \le \frac{\operatorname{Var}[T]}{49^2} \le \frac{1}{2401} \approx 0.04\%.$$

In simple terms, Chebyshev’s inequality says that if the variance of a random variable is fixed, then there is a maximum possible probability that the variable is far from its mean by more than some amount $a$. If too much probability mass were far from the mean, the spread (variance) would have to be larger than it actually is.
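As with Markov, the bound is a one-liner in code; the helper below is a hypothetical sketch (its name is mine, not the lecture's):

```python
def chebyshev_bound(variance, deviation):
    """Chebyshev's inequality: Pr[|X - E[X]| >= a] <= Var[X] / a**2."""
    assert variance >= 0 and deviation > 0
    # A probability can never exceed 1, so cap the bound there.
    return min(1.0, variance / deviation ** 2)

# Query time: E[T] ~ 1 and Var[T] <= 1, so T >= 50 implies |T - 1| >= 49.
print(chebyshev_bound(1, 49))  # about 0.000416, i.e. roughly 0.04%
```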

  • Markov: $\Pr[T \ge 50] \le \frac{1}{50}$ (linear decay in the threshold), a 2% chance.
  • Chebyshev: $\Pr[T \ge 50] \le \frac{1}{49^2}$ (quadratic decay), a 0.04% chance.

By using more information (the variance), we proved that the probability of a slow query is significantly lower than Markov’s inequality suggested.
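Both bounds are still loose. For $m = n$ the distribution of $T$ is approximately Poisson with mean 1, whose tail decays much faster than either bound; a quick empirical check (a sketch with illustrative parameters, using the same uniform-bucket model of a random hash) makes this concrete:

```python
import random

def empirical_tail(n, m, threshold, trials=5000, seed=1):
    """Estimate Pr[T >= threshold] under a perfectly random hash,
    where T is the number of keys landing in the query's bucket."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        bucket = rng.randrange(m)
        t = sum(1 for _ in range(n) if rng.randrange(m) == bucket)
        if t >= threshold:
            hits += 1
    return hits / trials

# For T >= 5 with E[T] = 1: Markov gives 1/5 = 20%, Chebyshev gives
# 1/16 ~ 6%, but the empirical tail for n = m = 100 is well under 1%.
print(empirical_tail(100, 100, threshold=5))
```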