
CSCI 328 - Algorithms for Big Data

Scribes: Allen Singleton and Hanan Latiff

  • What is Big Data?
  • Basic Probability Concepts
  • Discrete Random Variables

In modern computing, many problems involve big data, meaning datasets that are extremely large in size, often too large to store entirely in memory or to process using traditional deterministic algorithms.

Big data problems are characterized by:

  • Massive input sizes (millions or billions of elements)
  • Limited memory and time constraints
  • Data arriving in streams rather than all at once

Because of these constraints, classical algorithms that require multiple passes over the data or exact computations may be infeasible.

Randomized algorithms provide a powerful tool for dealing with big data. These algorithms use randomness in their logic, typically by making random choices during execution.

The key idea is that randomness allows us to:

  • Process only a small portion of the data
  • Approximate answers instead of computing exact results
  • Achieve good performance with high probability

Instead of guaranteeing correctness in every execution, randomized algorithms guarantee correctness with high probability. This tradeoff is acceptable in many big data applications where speed and scalability are more important than absolute precision.

Randomized algorithms are especially useful when:

  • Exact solutions are too slow or memory-intensive
  • Approximate answers are sufficient
  • The data is noisy or inherently uncertain

Before analyzing randomized algorithms, it is important to review basic probability concepts that will be used throughout the course. These concepts help us reason about uncertainty and quantify the likelihood of different outcomes.

A probability space consists of:

  • A sample space, which is the set of all possible outcomes
  • Events, which are subsets of the sample space
  • A probability measure that assigns a value between 0 and 1 to each event

The probability of an event represents how likely it is to occur. Probabilities satisfy the following basic properties:

  • The probability of any event is between 0 and 1
  • The probability of the entire sample space is 1
  • The probability of an impossible event is 0

In the context of randomized algorithms, probabilities are used to analyze:

  • The likelihood that an algorithm produces a correct result
  • The expected behavior of an algorithm over random choices
  • The chance that a rare or undesirable outcome occurs

Rather than guaranteeing deterministic outcomes, randomized algorithms rely on probabilistic guarantees. This means we analyze how often an algorithm succeeds or fails over many possible random executions.

Let $A$ be an event in a probability space. The complement of $A$, denoted by $\bar{A}$, represents the event that $A$ does not occur.

The complement rule states that the probability of an event not occurring is equal to one minus the probability that the event occurs:

$\Pr[\bar{A}] = 1 - \Pr[A]$

This rule follows directly from the fact that an event and its complement together cover the entire sample space. Since the probability of the sample space is $1$, the probabilities of an event and its complement must add up to $1$.

In practice, the complement rule is often useful when computing $\Pr[A]$ directly is difficult, but computing $\Pr[\bar{A}]$ is easier. In such cases, we compute the probability of the complement first and subtract it from $1$.

In the analysis of randomized algorithms, the complement rule is frequently used to:

  • Bound the probability that an algorithm fails
  • Analyze rare or undesirable events
  • Convert success probabilities into failure probabilities, or vice versa

Two events $A$ and $B$ are said to be independent if the occurrence of one event does not affect the probability of the other event.

Formally, events $A$ and $B$ are independent if:

$\Pr[A \cap B] = \Pr[A] \cdot \Pr[B]$

This definition captures the idea that knowing whether $A$ occurs gives no information about whether $B$ occurs, and vice versa. If the equality above does not hold, then the events are dependent.

Independence is a fundamental assumption in many randomized algorithms. Random choices made by an algorithm are often designed to be independent so that probabilities can be multiplied and analyzed more easily.

It is important to note that independence is a strong condition. Even if two events seem unrelated, they may still be dependent unless the product rule above holds exactly.

In algorithm analysis, independence allows us to:

  • Compute probabilities of multiple events occurring together
  • Analyze repeated random trials
  • Simplify probability calculations by separating events

The binomial coefficient is used to count the number of ways to choose a fixed number of elements from a larger set, without regard to order.

For integers $n \ge 0$ and $0 \le k \le n$, the binomial coefficient is denoted by:

$\binom{n}{k}$

and represents the number of ways to choose $k$ elements from a set of $n$ elements.

The binomial coefficient is defined as:

$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$
In probability, binomial coefficients arise naturally when analyzing repeated independent trials, where each trial has two possible outcomes, such as success or failure.

They are especially useful when computing probabilities involving:

  • The number of ways a certain outcome can occur
  • Multiple independent random choices
  • Counting events before assigning probabilities

In the context of randomized algorithms, binomial coefficients help quantify how many different ways an algorithm’s random decisions can lead to the same result. This allows us to combine counting arguments with probability calculations.

A random variable is a function $X : \Omega \to \mathbb{R}$, where $\Omega$ is the sample space consisting of all the possible outcomes of the experiment that $X$ models.

A random variable can be discrete, meaning that its support is finite or countably infinite, or continuous, meaning that its support is uncountable. In this course, we will primarily focus on discrete random variables, as they arise naturally in the analysis of randomized algorithms.

A discrete random variable takes on a finite or countably infinite set of values, each with an associated probability.

Example (Rolling a fair die): Consider the experiment of rolling a fair six-sided die. The sample space $\Omega = \{1, 2, 3, 4, 5, 6\}$ consists of the six possible outcomes. The random variable $X$ maps each outcome to a numerical value in its support:

$X(\omega) = \omega$ for each $\omega \in \Omega$

Since the die is fair, each outcome occurs with equal probability. We can describe the random variable as follows:

$\Pr[X = i] = \frac{1}{6}$ for each $i \in \{1, 2, 3, 4, 5, 6\}$

This notation makes explicit both the possible values of the random variable and the probability with which each value occurs.

A discrete random variable is fully described by its probability mass function (PMF), which assigns a probability to each value in its support.

The probabilities assigned to a random variable satisfy the following basic properties:

  • For every value $x$ in the support, $0 \le \Pr[X = x] \le 1$
  • The sum of the probabilities over all possible values is equal to 1

That is,

$\sum_{x} \Pr[X = x] = 1$
In randomized algorithms, random variables are used to model quantities such as running time, the number of correct outputs, or whether an algorithm succeeds or fails. By defining appropriate random variables, we can analyze the behavior of an algorithm using probability.

This perspective allows us to reason about algorithm performance in terms of likelihood and expectation, rather than exact deterministic outcomes.

The expectation of a discrete random variable $X$, denoted as $\mathbb{E}[X]$, is a weighted average of all the possible values that $X$ takes on. The expectation is given by:

$\mathbb{E}[X] = \sum_{x} x \cdot \Pr[X = x]$

Property 1 (Shift and Scale of Expectation): If $X$ is a random variable with finite expectation, then:

$\mathbb{E}[aX + b] = a\,\mathbb{E}[X] + b$

where $a, b \in \mathbb{R}$.

Property 2 (Linearity of Expectation): If $X_1, X_2, \ldots, X_n$ are random variables, each with finite expectation, then:

$\mathbb{E}[X_1 + X_2 + \cdots + X_n] = \mathbb{E}[X_1] + \mathbb{E}[X_2] + \cdots + \mathbb{E}[X_n]$
The variance of a discrete random variable $X$, denoted as $\mathrm{Var}(X)$, is the average squared distance by which the values that $X$ takes on deviate from $\mathbb{E}[X]$. The variance is given by:

$\mathrm{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big]$

Equivalently, the variance can be computed as:

$\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$

Property 1 (Shift and Scale of Variance): If $X$ is a random variable with finite variance, then:

$\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$

where $a, b \in \mathbb{R}$.

Property 2 (Linearity of Variance): If $X_1, X_2, \ldots, X_n$ are independent random variables, each with finite variance, then:

$\mathrm{Var}(X_1 + X_2 + \cdots + X_n) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \cdots + \mathrm{Var}(X_n)$

A Bernoulli random variable models the number of successes that occur in a single trial with a fixed probability of success, $p$.

By convention, we say that the Bernoulli random variable can take on either $0$ (indicating a failure) or $1$ (indicating a success). Thus, the support of the Bernoulli distribution is $\{0, 1\}$.

Example (Flipping a fair coin once): If we define a success as landing heads and a failure as landing tails, then a single flip of a fair coin is an example of a Bernoulli trial with a fixed probability of success, $p = \frac{1}{2}$. We denote this random variable as follows:

$X \sim \mathrm{Bernoulli}\!\left(\tfrac{1}{2}\right)$

The probability mass function of the Bernoulli distribution is given by:

$\Pr[X = 1] = p \qquad \Pr[X = 0] = 1 - p$

Equivalently, we can express the Bernoulli distribution in closed form as:

$\Pr[X = x] = p^x (1 - p)^{1 - x} \quad \text{for } x \in \{0, 1\}$

If $X \sim \mathrm{Bernoulli}(p)$, then the expectation of $X$ is given by:

$\mathbb{E}[X] = p$

Proof

Using the closed-form definition of the Bernoulli distribution, we have:

$\mathbb{E}[X] = 0 \cdot (1 - p) + 1 \cdot p = p$

If $X \sim \mathrm{Bernoulli}(p)$, then the variance of $X$ is given by:

$\mathrm{Var}(X) = p(1 - p)$

Proof

Using the closed-form definition of the Bernoulli distribution, we have:

$\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = p - p^2 = p(1 - p)$

A Binomial random variable models the number of successes that occur in $n$ independent Bernoulli trials with a fixed probability of success, $p$.

Since it models a count of the number of successes in $n$ trials, the Binomial random variable can take on any whole number between $0$ and $n$. Thus, the support of the Binomial distribution is $\{0, 1, \ldots, n\}$.

Example (Flipping a fair coin 50 times): If we define a success as landing heads and a failure as landing tails, then 50 flips of a fair coin is an example of a Binomial procedure with a fixed probability of success in each trial, $p = \frac{1}{2}$. We denote this random variable as follows:

$X \sim \mathrm{Bin}\!\left(50, \tfrac{1}{2}\right)$

The probability mass function of the Binomial distribution is given by:

$\Pr[X = k] = \binom{n}{k} p^k (1 - p)^{n - k} \quad \text{for } k \in \{0, 1, \ldots, n\}$

Alternative Definition of the Binomial Distribution

Very often, it is useful to interpret a Binomial random variable as the sum of $n$ independent Bernoulli random variables, each with probability of success $p$. In other words:

If $X_1, X_2, \ldots, X_n \sim \mathrm{Bernoulli}(p)$ are independent, then $X = X_1 + X_2 + \cdots + X_n \sim \mathrm{Bin}(n, p)$.

Expressing the Binomial random variable in this way allows for certain expressions to be simplified, such as computing the expectation and the variance of the Binomial distribution.

If $X \sim \mathrm{Bin}(n, p)$, then the expectation of $X$ is given by:

$\mathbb{E}[X] = np$

Proof

Using the alternative definition of the Binomial distribution, we can express:

$\mathbb{E}[X] = \mathbb{E}[X_1] + \cdots + \mathbb{E}[X_n] = np$

If $X \sim \mathrm{Bin}(n, p)$, then the variance of $X$ is given by:

$\mathrm{Var}(X) = np(1 - p)$

Proof

Using the alternative definition of the Binomial distribution and the independence of the trials, we can express:

$\mathrm{Var}(X) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n) = np(1 - p)$
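As a sanity check on the formulas $\mathbb{E}[X] = np$ and $\mathrm{Var}(X) = np(1-p)$, here is a small Python simulation (our own sketch, not part of the lecture) that draws $\mathrm{Bin}(50, \frac{1}{2})$ as a sum of Bernoulli trials:

```python
import random

def bernoulli(p):
    """One Bernoulli(p) trial: returns 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

def binomial_sample(n, p):
    """Draw Bin(n, p) as a sum of n independent Bernoulli(p) trials."""
    return sum(bernoulli(p) for _ in range(n))

random.seed(0)
n, p, trials = 50, 0.5, 100_000
samples = [binomial_sample(n, p) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
# Theory: E[X] = np = 25 and Var(X) = np(1-p) = 12.5
```

With 100,000 trials, the empirical mean and variance land very close to the theoretical values 25 and 12.5.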

Scribes: Joshua Sin and Ye Htut Muang

  • Geometric Random Variables: what they are and how they are useful
  • The Coupon Collector Problem: its problem statement and how we can solve it using geometric random variables
  • The Membership/Dictionary Problem: its problem statement and how we can solve it using hashing with chaining

Random variables $X$ and $Y$ are independent if, for all values $x$ and $y$:

$\Pr[X = x,\, Y = y] = \Pr[X = x] \cdot \Pr[Y = y]$

Here $\Pr[X = x, Y = y]$ denotes the joint distribution of $X$ and $Y$. Independence of random variables means that the value of one r.v. gives no information about the other.

For a sequence of random variables $X_1, X_2, \ldots, X_n$, there are two notions of independence:

(1) Pairwise independence

(2) Mutual independence

Pairwise Independence / 2-wise Independence

Every pair $X_i, X_j$ with $i \ne j$ is independent. This means knowing $X_i$ does not help in finding $X_j$. However, knowing two of the variables together (say $X_i$ and $X_j$) can still help in finding a third variable $X_k$.

Mutual Independence

Knowing all of the variables except $X_i$ (or any subset of them) gives no information about $X_i$. This type of independence is very rare: in computer science, obtaining truly mutually independent random variables from perfect randomness is essentially impossible in practice, which is why weaker notions like pairwise independence are so useful.

Linearity of Variance for a Sequence of Random Variables

For any independent random variables $X$ and $Y$:

$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$

Suppose $X_1, \ldots, X_n$ are pairwise independent (pairwise independence is all we need to use this property):

$\mathrm{Var}(X_1 + \cdots + X_n) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n)$

Expanding the variance of the sum, the cross terms are covariances $\mathrm{Cov}(X_i, X_j)$, which vanish whenever $X_i$ and $X_j$ are independent; this is why pairwise independence suffices:

$\mathrm{Var}\!\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i)$

Say we have a coin which turns up heads with probability $p$. We toss it until we get a heads. Let $X$ be the number of tosses; $X$ is random with $\Pr[X = k] = (1 - p)^{k - 1} p$. Then $X \sim \mathrm{Geom}(p)$.

Expectation and Variance of Geometric Random Variable

$\mathbb{E}[X] = \frac{1}{p} \qquad \mathrm{Var}(X) = \frac{1 - p}{p^2}$
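The tossing process above is easy to simulate; a short Python sketch (ours, not the lecture's) that estimates $\mathbb{E}[X]$ for $p = \frac{1}{4}$:

```python
import random

def geometric_sample(p):
    """Number of coin tosses up to and including the first heads."""
    tosses = 1
    while random.random() >= p:  # tails: keep tossing
        tosses += 1
    return tosses

random.seed(1)
p, trials = 0.25, 200_000
samples = [geometric_sample(p) for _ in range(trials)]
mean = sum(samples) / trials
# Theory: E[X] = 1/p = 4 and Var(X) = (1-p)/p^2 = 12
```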

Let there be 20 unique characters in Pokemon. Pokemon is collaborating with a cereal company, and when you buy cereal, you get one Pokemon character randomly. How many boxes of cereal in expectation do you need to buy to get all 20 Pokemon characters?

Let’s start with a simple question: If I already have 19 characters, what is the chance that I get the 20th unique character in my next cereal box? The answer is $\frac{1}{20}$, because we are choosing the 1 unseen character out of 20 total characters.

Now, let’s find the pattern of the probability of each case, as the probability will change every time we pick a new character:

  • If we have 0 characters, we have a $\frac{20}{20}$ or 100% chance that we get a new character.
  • If we have 1 character, we have a $\frac{19}{20}$ chance that we get a new character.
  • If we have 2 characters, we have an $\frac{18}{20}$ chance that we get a new character.
  • ...
  • If we have 19 characters, we have a $\frac{1}{20}$ chance that we get a new character.

Let $X$ be the number of cereal boxes bought until we collect all 20 Pokemon characters. We break $X$ down into $X = X_1 + X_2 + \cdots + X_{20}$. Consider $X_i$ to be the number of boxes bought after getting $i - 1$ characters, up to and including the box with the $i$-th new character. The variables $X_i$ are geometric random variables.

  • $X_1$ = the number of boxes bought to get the 1st new character; $X_1 \sim \mathrm{Geom}\!\left(\frac{20}{20}\right)$
  • $X_2$ = the number of boxes bought after getting the 1st character up to the 2nd new character; $X_2 \sim \mathrm{Geom}\!\left(\frac{19}{20}\right)$
  • $X_3$ = the number of boxes bought after getting the 2nd character up to the 3rd new character; $X_3 \sim \mathrm{Geom}\!\left(\frac{18}{20}\right)$
  • ...
  • $X_{20}$ = the number of boxes bought after getting the 19th character up to the 20th new character; $X_{20} \sim \mathrm{Geom}\!\left(\frac{1}{20}\right)$

Now, we find the expectation of the total number of cereal boxes we need to buy to get all 20 Pokemon characters:

$\mathbb{E}[X] = \sum_{i=1}^{20} \mathbb{E}[X_i] = \sum_{i=1}^{20} \frac{20}{21 - i} = 20 \sum_{j=1}^{20} \frac{1}{j} = 20 \cdot H_{20} \approx 72$

Since $H_n = \sum_{j=1}^{n} \frac{1}{j} \approx \ln n$:

$\mathbb{E}[X] \approx 20 \ln 20$

When we talk about expected runtime, it will be $O(n \log n)$.

Generalization of Coupon Collector Problem

With $n$ distinct coupons, let $X_i \sim \mathrm{Geom}\!\left(\frac{n - i + 1}{n}\right)$ be the number of boxes to get the $i$-th new coupon. By the linearity of expectation, the total expected number of boxes is:

$\mathbb{E}[X] = \sum_{i=1}^{n} \frac{n}{n - i + 1} = n \sum_{j=1}^{n} \frac{1}{j} = n \cdot H_n$

Since $H_n \approx \ln n$:

$\mathbb{E}[X] \approx n \ln n = O(n \log n)$
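The coupon collector argument can be checked empirically; a small Python simulation (our own sketch) of the $n = 20$ Pokemon example:

```python
import random

def coupon_collector(n):
    """Boxes bought until all n distinct characters have been collected."""
    seen = set()
    boxes = 0
    while len(seen) < n:
        boxes += 1
        seen.add(random.randrange(n))  # each box holds a uniform random character
    return boxes

random.seed(2)
n, trials = 20, 20_000
avg = sum(coupon_collector(n) for _ in range(trials)) / trials
h_n = sum(1 / j for j in range(1, n + 1))
# Theory: E[X] = n * H_n = 20 * H_20, roughly 72 boxes
```

The empirical average comes out near $20 \cdot H_{20} \approx 72$ boxes.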

In the membership/dictionary problem, we have a set of $n$ keys, $S \subseteq U$, from a universe $U$. The goal is to store $S$ in a data structure such that, given a query element $q \in U$, we can quickly determine whether $q \in S$ or not.

Brute Force: We take $q$ and compare it with every key in $S$. If it exists, we return the key.

  • Runtime: $O(n)$

Binary Search: First, we sort $S$, which takes $O(n \log n)$ preprocessing time, and then we answer the query using binary search. If the element exists, we return the key.

  • Runtime: $O(\log n)$ per query

Is it possible to solve this problem and achieve constant querying time, $O(1)$?

Given a hash function $h : U \to \{1, \ldots, m\}$ (where $m$ is the amount of space we will use), we assume $h$ is perfectly random (meaning every possible hash value is equally likely, as if $h$ were drawn uniformly from a large hash family). For any key $x \in U$ and any bucket $i$:

$\Pr[h(x) = i] = \frac{1}{m}$

Algorithm:

(1) Initialize $m$ empty linked lists, one for each bucket

(2) For $i = 1$ to $n$:

  • compute $h(k_i)$
  • append $k_i$ to the linked list in bucket $h(k_i)$

Note: Query time depends on which bucket you have to search.
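The two-step algorithm above can be sketched in Python as follows; the random linear function standing in for the lecture's perfectly random hash is our own assumption:

```python
import random

class ChainedHashTable:
    """Hashing with chaining: m buckets, each holding a chain (Python list).

    The 'perfectly random' hash function is approximated here by a random
    linear function h(x) = ((a*x + b) mod p) mod m -- an assumption, not
    part of the lecture notes."""

    def __init__(self, m):
        self.m = m
        self.buckets = [[] for _ in range(m)]
        self.p = (1 << 31) - 1  # a large prime
        self.a = random.randrange(1, self.p)
        self.b = random.randrange(self.p)

    def _h(self, x):
        return ((self.a * x + self.b) % self.p) % self.m

    def insert(self, key):
        self.buckets[self._h(key)].append(key)

    def contains(self, query):
        # Search only the chain in the query's bucket.
        return query in self.buckets[self._h(query)]

random.seed(3)
S = [5, 22, 37, 41, 58]
table = ChainedHashTable(m=len(S))
for key in S:
    table.insert(key)
```

With $m = n$ buckets, each chain holds one key in expectation, which is exactly the expected $O(1)$ query time analyzed in the next lecture.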

Suppose we want to support membership queries on the set

$S = \{5, 22\}$

using a hash table of size $m = 5$ with hash function

$h(x) = x \bmod 5$

Each table entry stores a chain of elements that hash to the same index.

Index | Stored Keys
  0   | 5
  1   | (empty)
  2   | 22
  3   | (empty)
  4   | (empty)

To answer a membership query for a key $q$, we compute $h(q)$ and search only the corresponding chain.

  • Query $q = 22$: $h(22) = 2$. Searching the chain at index 2 finds 22, so $22 \in S$.
  • Query a key $q$ with $h(q) = 4$: the chain at index 4 is empty, so $q \notin S$.

Scribes: Ayesha Jamal and Kyoshi Noda

  • Query time analysis ($\mathbb{E}[T]$)
  • Variance analysis ($\mathrm{Var}(T)$)
  • Tail Bounds (Concept)
  • Markov’s Inequality
  • Chebyshev’s Inequality

We analyze the query time for a specific element $q$. Let $X_i$ be an indicator random variable for the $i$-th element in the database ($1 \le i \le n$):

$X_i = 1$ if $h(k_i) = h(q)$, and $X_i = 0$ otherwise.

Since we assume the hash function is perfectly random, the probability of any element colliding with $q$ is uniform over the $m$ buckets:

$\Pr[X_i = 1] = \frac{1}{m}$

The total query time is the sum of all collisions:

$T = \sum_{i=1}^{n} X_i$

Using Linearity of Expectation:

$\mathbb{E}[T] = \sum_{i=1}^{n} \mathbb{E}[X_i] = \frac{n}{m}$

Conclusion: To achieve a constant expected query time $O(1)$, we set the number of buckets proportional to the number of items (i.e., $m = \Theta(n)$). This results in:

$\mathbb{E}[T] = \frac{n}{m} = O(1)$

To understand how much the query time fluctuates from the average, we calculate the Variance.

For a Bernoulli (indicator) variable with probability $p$:

$\mathrm{Var}(X_i) = p(1 - p)$

Since the keys are hashed independently:

$\mathrm{Var}(T) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = n \cdot \frac{1}{m}\left(1 - \frac{1}{m}\right)$

If we assume $m = n$, then $p = \frac{1}{n}$ and the term $\left(1 - \frac{1}{m}\right) \le 1$. Therefore:

$\mathrm{Var}(T) \le \frac{n}{m} = 1$

Knowing the Variance is small ($O(1)$) is important for using Chebyshev’s Inequality later.

Although the expected query time is constant, the worst-case query time is still $O(n)$. The natural question arises: how often will the query time significantly exceed its expectation?

In computer science, we typically focus on the upper tail, since we want to bound how long an algorithm can take.

In finance, the lower tail is often more important, as it represents potential losses.

While $\mathbb{E}[T] = O(1)$, a user might be concerned about the worst case.

  • Worst Case: All items hash to the same bucket, so $T = n$.
  • Tail Bounds: The “tail” refers to the region of extreme outcomes in the probability distribution (e.g., $T \ge 50$).

Ideally, we would calculate the exact probability:

$\Pr[T \ge t]$

However, this often has no nice closed form. Instead, we use inequalities to bound this probability.

Markov’s inequality is one of the most fundamental results in probability theory. It provides a simple bound on the tail probability of a non-negative random variable using only its expectation.

Requirement: You only need to know the expectation $\mathbb{E}[X]$, and $X$ must be non-negative ($X \ge 0$).

Formula:

$\Pr[X \ge a] \le \frac{\mathbb{E}[X]}{a}$

If we want to know the chance the query takes more than 50 steps ($a = 50$) given $\mathbb{E}[T] = 1$:

$\Pr[T \ge 50] \le \frac{1}{50} = 2\%$

This gives a “loose” bound. It guarantees the failure rate is at most 2%.

Setup: Consider a class with 33 students. After a midterm exam, we are told that the average score is 60. No other information about the distribution of scores is provided. How many students could have scored at least 90?

Let $X$ denote the score of a randomly selected student. Then $\mathbb{E}[X] = 60$. Applying Markov’s inequality to $X$:

$\Pr[X \ge 90] \le \frac{60}{90} = \frac{2}{3}$

The expected number of students scoring at least 90 is at most:

$33 \cdot \frac{2}{3} = 22$

Conclusion: At most 22 students can have scored 90 or higher.

In simple words, Markov’s inequality basically says that if the expectation of a non-negative random variable is fixed, then there is a maximum possible probability that the variable can exceed any threshold $a$. If too much probability were above $a$, the average would have to be larger than it actually is.

Requirement: You must know both the Expectation $\mathbb{E}[X]$ and the Variance $\mathrm{Var}(X)$. This provides a “tighter” bound because it accounts for how spread out the data is.

Formula:

$\Pr\big[|X - \mathbb{E}[X]| \ge a\big] \le \frac{\mathrm{Var}(X)}{a^2}$

Equivalently, with $\sigma = \sqrt{\mathrm{Var}(X)}$, this can be written as:

$\Pr\big[|X - \mathbb{E}[X]| \ge k\sigma\big] \le \frac{1}{k^2}$

The left-hand side of Chebyshev’s inequality measures the probability that $X$ deviates from its mean by at least $a$ in either direction. This is called a two-sided bound because it accounts for both the upper tail and the lower tail.

We can decompose the event $|X - \mathbb{E}[X]| \ge a$ into two cases:

Case 1: $X - \mathbb{E}[X] \ge a$ (positive deviation)

This is the right tail (upper tail).

Case 2: $X - \mathbb{E}[X] \le -a$ (negative deviation)

This is the left tail (lower tail).

Together, these two cases cover all outcomes where $X$ deviates from its mean by at least $a$.

Using our calculated $\mathrm{Var}(T) \le 1$ and expected value of roughly 1:

$\Pr\big[|T - \mathbb{E}[T]| \ge a\big] \le \frac{1}{a^2}$

For the same example of $t = 50$:

$\Pr[T \ge 50] \le \Pr\big[|T - 1| \ge 49\big] \le \frac{1}{49^2} = \frac{1}{2401} \approx 0.04\%$

In simple terms, Chebyshev’s inequality basically says that if the variance of a random variable is fixed, then there is a maximum possible probability that the variable can be far away from its mean by more than some amount $a$. If too much probability were far from the mean, the spread (variance) would have to be larger than it actually is.

  • Markov: $\Pr[T \ge 50] \le \frac{1}{50}$ (linear decay), a 2% chance.
  • Chebyshev: $\Pr[T \ge 50] \le \frac{1}{49^2}$ (quadratic decay), a 0.04% chance.

By using more information (Variance), we proved the probability of a slow query is significantly lower than Markov suggested.
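The two bounds are simple enough to compute directly; a tiny Python sketch using the numbers from this section ($\mathbb{E}[T] = 1$, $\mathrm{Var}(T) \le 1$, threshold $t = 50$):

```python
def markov_bound(expectation, t):
    """Markov: Pr[X >= t] <= E[X] / t for non-negative X."""
    return expectation / t

def chebyshev_bound(expectation, variance, t):
    """Chebyshev: Pr[X >= t] <= Pr[|X - E[X]| >= t - E[X]] <= Var / (t - E[X])^2."""
    return variance / (t - expectation) ** 2

# Query-time example from the notes: E[T] = 1, Var(T) <= 1, t = 50.
loose = markov_bound(1, 50)        # 1/50 = 0.02, i.e. 2%
tight = chebyshev_bound(1, 1, 50)  # 1/49^2, i.e. about 0.04%
```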

Scribes: Carlos Aucacama and Mohammed Zaid

  • Hashing with Chaining has expected constant query time.
  • Using stronger inequalities gives dramatically better tail bounds.
  • Recognizing when a random variable is a sum of independent Bernoullis is powerful.

Professor Goswami began by discussing the 3 guarantees of hashing with chaining:

1. Expected Query Time

Let $T$ denote the query time. Then

$\mathbb{E}[T] = \frac{n}{m}$

If $m = n$, then

$\mathbb{E}[T] = 1 = O(1)$

2. Markov Inequality Guarantee

$\Pr[T \ge t] \le \frac{\mathbb{E}[T]}{t} = \frac{1}{t}$

3. Chebyshev Inequality Guarantee

Assuming $m = n$, we have

$\mathrm{Var}(T) \le 1$

Applying Chebyshev:

$\Pr[T \ge t] \le \Pr\big[|T - \mathbb{E}[T]| \ge t - 1\big] \le \frac{1}{(t - 1)^2}$

Variance Analysis of Query Time in Hashing with Chaining

Query Time as a Sum of Bernoulli Variables


Let $T$ denote the query time. We write

$T = \sum_{i=1}^{n} X_i$

where

$X_i = 1$ if $h(k_i) = h(q)$, and $X_i = 0$ otherwise.

Thus, $T$ counts the number of keys that hash to the same bucket as the query.

  • The chance that the $i$-th key hashes to the same bucket as the query is $p = \frac{1}{m}$.
  • The variance of the query time is the sum of the variance of each $X_i$.
  • Variance of a Bernoulli is $p(1 - p)$, where $p = \frac{1}{m}$. Summing over $n$ variables gives

$\mathrm{Var}(T) = n \cdot \frac{1}{m}\left(1 - \frac{1}{m}\right)$

Since $\left(1 - \frac{1}{m}\right) \le 1$, replacing it by $1$ gives an upper bound:

$\mathrm{Var}(T) \le \frac{n}{m}$

If $m = n$, then

$\mathrm{Var}(T) \le 1$

We compute the variance of $T$ to apply Chebyshev’s inequality. Markov’s inequality only requires the expectation, but Chebyshev requires both expectation and variance. Assuming $m = n$, we have

$\mathbb{E}[T] = 1 \quad \text{and} \quad \mathrm{Var}(T) \le 1$

Applying Chebyshev:

$\Pr\big[|T - \mathbb{E}[T]| \ge a\big] \le \frac{1}{a^2}$

Since $T \ge t$ implies $|T - \mathbb{E}[T]| \ge t - 1$, we get

$\Pr[T \ge t] \le \frac{1}{(t - 1)^2}$

This is much stronger than the Markov bound,

$\Pr[T \ge t] \le \frac{1}{t}$

For example, when $t = 50$, Markov gives approximately 2%, while Chebyshev gives 0.04%. There is no contradiction: Chebyshev uses more information (variance), so it gives a tighter bound.

Tail Bounds for Random Variables in Hashing with Chaining


Markov’s Inequality:

  • Applies to any non-negative random variable.
  • Only requires the expectation $\mathbb{E}[X]$.

Chebyshev’s Inequality:

  • Applies to any random variable.
  • Requires expectation $\mathbb{E}[X]$ and variance $\mathrm{Var}(X)$.

Chernoff Bound:

  • Only applies to sums of independent Bernoulli random variables.
  • Gives a much tighter bound than Markov or Chebyshev for large deviations.

Tail bounds let us estimate extreme events (tails) when exact probabilities are difficult. The choice depends on the type of random variable:

  • Markov: expectation only
  • Chebyshev: expectation + variance
  • Chernoff: sum of independent Bernoullis

Binary search has query time

$O(\log n)$

which increases as $n$ increases.

Hashing with chaining improves this:

  • Preprocessing time: $O(n)$ (always)
  • Query time: $O(1)$ in expectation

However, the worst-case query time is not constant because collisions may occur.

We now design a hashing scheme with:

  • worst-case $O(1)$ query time
  • $O(n)$ space
  • expected $O(n)$ preprocessing time

This scheme is called FKS hashing (Fredman–Komlós–Szemerédi). The randomness is moved entirely into preprocessing.

FKS Hashing Two-Level Hashing Construction


Let $S = \{k_1, \ldots, k_n\}$ be the set of keys.

Pick a perfectly random hash function

$h : U \to \{1, \ldots, n\}$

Hash all $n$ keys into $n$ buckets.

For each bucket $i$, define

$n_i$ = the number of keys that hash to bucket $i$

The sum of the bucket sizes is $n$:

$\sum_{i=1}^{n} n_i = n$

Square each bucket size and sum:

$\sum_{i=1}^{n} n_i^2$

If the sum of squares is greater than $4n$:

$\sum_{i=1}^{n} n_i^2 > 4n$

Discard the hash function and choose a new hash function. Repeat this process until the sum of squares is less than or equal to $4n$:

$\sum_{i=1}^{n} n_i^2 \le 4n$

This completes Step 1.

For each bucket $i$:

  • There are $n_i$ keys in bucket $i$.
  • Allocate a second-level table of size $n_i^2$ for bucket $i$.
  • Choose a random hash function $h_i$ mapping those $n_i$ keys into this table.
  • If any collision occurs, discard $h_i$ and choose another hash function.

Repeat until all $n_i$ keys map to distinct cells.

Thus, every second-level table satisfies:

(1) Each cell contains at most one key.

(2) No collisions occur.

This completes preprocessing.

First-level table uses:

$O(n)$ space

Second-level tables use:

$\sum_{i=1}^{n} n_i^2$ space

From Step 1:

$\sum_{i=1}^{n} n_i^2 \le 4n$

Therefore:

Total space: $O(n) + O(4n) = O(n)$

Thus total space is $O(n)$.

Given a query key $q$:

  • Compute $i = h(q)$.
  • Compute $j = h_i(q)$.
  • Inspect the cell in bucket $i$’s table at position $j$.

Since second-level tables contain no collisions:

  • If the cell contains $q$, return YES.
  • Otherwise, return NO.

The query performs:

  • One evaluation of $h$
  • One evaluation of $h_i$
  • One table lookup

Therefore, the query time is $O(1)$ in the worst case.
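The whole two-level construction fits in a short Python sketch; the random linear hash family used here is an assumption standing in for the ideal random hash functions of the lecture:

```python
import random

P = (1 << 31) - 1  # a large prime for the illustrative linear hash family

def random_hash(m):
    """Draw h(x) = ((a*x + b) mod P) mod m at random (stand-in for a
    perfectly random hash function)."""
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda x: ((a * x + b) % P) % m

def fks_build(keys):
    """FKS preprocessing: Step 1 resamples the first-level hash until the
    sum of squared bucket sizes is at most 4n; Step 2 builds a
    collision-free second-level table of size n_i^2 per bucket."""
    n = len(keys)
    while True:  # Step 1
        h = random_hash(n)
        buckets = [[] for _ in range(n)]
        for k in keys:
            buckets[h(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    tables = []
    for b in buckets:  # Step 2
        size = len(b) ** 2
        while True:
            hi = random_hash(size) if size else None
            cells = [None] * size
            collision = False
            for k in b:
                j = hi(k)
                if cells[j] is not None:
                    collision = True  # discard hi and try again
                    break
                cells[j] = k
            if not collision:
                tables.append((hi, cells))
                break
    return h, tables

def fks_query(h, tables, q):
    """Worst-case O(1): two hash evaluations and one cell inspection."""
    hi, cells = tables[h(q)]
    return bool(cells) and cells[hi(q)] == q

random.seed(4)
S = [3, 17, 29, 44, 61, 85, 102, 256]
h, tables = fks_build(S)
```

Both resampling loops terminate quickly in expectation, which is exactly the geometric-trials argument made in the next lecture's analysis.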

Scribes: Olivia Xu and Laura Torres

  • Comparison of FKS Hashing vs. Chaining (Worst-case vs. Expected)
  • Mathematical Intuition: Minimizing Sum of Squares for Equality
  • Analysis of Step 2: Probability of Second-level Collisions
  • Analysis of Step 1: Expected Collisions and Markov Bound

Professor Goswami began by distinguishing the guarantees:

  • Hashing with Chaining: Query time is $O(1)$ expected. Preprocessing is always $O(n)$.
  • FKS Hashing: Query time is $O(1)$ worst-case. Preprocessing is $O(n)$ expected.

First a Brief Summary of FKS Hashing Preprocessing Phase


Before analyzing the FKS hashing preprocessing, let’s review its steps.

  • Step 1: We hash all $n$ keys into $n$ buckets with a single first-level hash function, resampling it until the sum of squared bucket sizes is at most $4n$.
  • Step 2: We take each row from the first step and find a hash function for each row that distributes its $n_i$ keys into a row that is $n_i^2$ in length such that there are no collisions.
  • By the end, we will have used $n + 1$ hash functions: 1 in step 1 and $n$ in step 2.

To understand why Step 1 limits the sum of squares, we consider a calculus problem:

Problem: Given $x + y = n$ with $x, y \ge 0$, minimize $x^2 + y^2$.

  • Substituting $y = n - x$, we get $f(x) = x^2 + (n - x)^2$.
  • To minimize it, we differentiate once and set the derivative equal to zero:

$f'(x) = 2x - 2(n - x) = 4x - 2n = 0 \implies x = \frac{n}{2}$

  • Since the second derivative is positive, $f''(x) = 4 > 0$, the point $x = y = \frac{n}{2}$ is a minimum.
  • Insight: Minimizing the sum of squares is a mathematical way to enforce an equal distribution.
  • In FKS, $\sum_i n_i = n$. The sum of squares ranges from $n$ (perfectly equal) to $n^2$ (all in one bucket). FKS accepts any $h$ where $\sum_i n_i^2 \le 4n$.
    • The most equal distribution occurs when there are no collisions and each $n_i$ is equal to 1. Then:

$\sum_{i=1}^{n} n_i^2 = n \cdot 1^2 = n$

  • The most unequal distribution occurs when all $n$ keys go into the same bucket, so all the other buckets have 0 keys:

$\sum_{i=1}^{n} n_i^2 = n^2$

Step 2 Analysis: Second-level Collision Probability

We need to prove that Step 2 terminates quickly.

Scenario: We hash $n_i$ keys into $n_i^2$ cells.

  • Let $C$ be the total number of collisions in a bucket.
  • Using a Universal Hash Family, for any pair of keys $x \ne y$, $\Pr[h_i(x) = h_i(y)] \le \frac{1}{n_i^2}$.
  • The total number of pairs is $\binom{n_i}{2}$.

$\mathbb{E}[C] \le \binom{n_i}{2} \cdot \frac{1}{n_i^2} < \frac{1}{2}$

  • By Markov’s Inequality: $\Pr[C \ge 1] \le \mathbb{E}[C] < \frac{1}{2}$.
  • Conclusion: Since the failure probability is less than $\frac{1}{2}$, the number of tries follows a Geometric Random Variable with success probability greater than $\frac{1}{2}$. Expected tries $\le 2$.

We prove that we don’t need to resample too many times in Step 1 either.

  • Total collisions: $C = \sum_{i} \binom{n_i}{2}$.
  • It can be shown that $\sum_i n_i^2 = n + 2C$.
  • As shown in the “pink fact”: when hashing $n$ keys into $n$ buckets, the expected number of collisions is $\mathbb{E}[C] \approx \frac{n}{2}$.
  • (More accurately, $\mathbb{E}[C] = \binom{n}{2} \cdot \frac{1}{n} = \frac{n - 1}{2}$.)
  • We know $\mathbb{E}\big[\sum_i n_i^2\big] = n + 2\,\mathbb{E}[C]$.
  • Thus, $\mathbb{E}\big[\sum_i n_i^2\big] \le n + 2 \cdot \frac{n}{2} = 2n$.
  • We want to find the probability that Step 1 fails, i.e., $\Pr\big[\sum_i n_i^2 > 4n\big]$.
  • By Markov: $\Pr\big[\sum_i n_i^2 > 4n\big] \le \frac{2n}{4n} = \frac{1}{2}$.
  • Conclusion: Expected number of tries for Step 1 is at most $2$. Total expected work is $O(n)$.

To summarize the FKS guarantees:

  • Query: $O(1)$ worst-case (two hash function evaluations).
  • Space: $O(n)$ (since $\sum_i n_i^2 \le 4n$ from Step 1’s condition).
  • Preprocessing: $O(n)$ expected (Geometric trials for both levels).

Scribe: Mauricio Monje

  • Recap FKS Hashing and HWC Analysis
  • The Chernoff Bound
  • The Birthday Paradox
  • Balls and Bins Framework

Review Previous Lecture: FKS Hashing Analysis


Previously, we concluded that steps 1 and 2 of FKS Hashing take $O(n)$ time in expectation.

Operation     | FKS               | HWC
Preprocessing | $O(n)$ expected   | $O(n)$ always
Query         | $O(1)$ worst-case | $O(1)$ expected

We’ve proved that FKS hashing has $O(1)$ worst-case query time and $O(n)$ expected preprocessing time.

We also saw that, by Markov:

$\Pr[T \ge t] \le \frac{1}{t}$

And that, by Chebyshev:

$\Pr[T \ge t] \le \frac{1}{(t - 1)^2}$

We also briefly discussed that, by Chernoff:

$\Pr[T \ge t] \le 2^{-\Omega(t)}$

If we were to concisely describe the Chernoff bound, we would say that it bounds tail probabilities by an exponentially small quantity, $2^{-\Omega(t)}$. We should also note that the Chernoff bound is only applicable when the random variable is a sum of independent random variables.

In a room we have 24 people. What is the probability that some 2 people share the same birthday?

When an event’s probability is difficult to compute directly, we can instead compute the probability of its complement and subtract from 1.

We consider the complement, that no two people share a birthday, meaning that everyone has a different birthday:

$\Pr[\text{no shared birthday}] = \frac{365}{365} \cdot \frac{364}{365} \cdot \frac{363}{365} \cdots \frac{342}{365} \approx 0.4616$

Therefore:

$\Pr[\text{some shared birthday}] = 1 - 0.4616 = 0.5384$

So in a room with just 24 people, there is more than a 50% chance that some two will share a birthday.
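The product above is easy to evaluate exactly; a small Python sketch (our own):

```python
def no_shared_birthday(k, n=365):
    """Exact probability that k people all have distinct birthdays."""
    prob = 1.0
    for i in range(k):
        prob *= (n - i) / n  # the (i+1)-th person avoids the i taken days
    return prob

p_distinct = no_shared_birthday(24)
p_shared = 1 - p_distinct
# p_distinct is about 0.4617, so p_shared is about 0.538 -- over 50%
```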

Imagine you are a scheduler with 10 machines. People come to you with different tasks (codes they want to run), and your job is to assign a machine to each task. These are called scheduling problems: you have a certain number of jobs and a certain number of machines, and you want to assign jobs to machines to finish them in the least amount of time (or optimize some other objective).

One of the simplest approaches is random scheduling: when someone comes with a task, you just randomly pick one of your machines and assign the task to it.

To analyze these randomized algorithms, we use a framework called balls and bins. This framework is also useful for understanding hashing algorithms, since a hash function takes keys and hashes them into cells of a hash table, where all cells are equally likely.

When we throw $k$ balls into $n$ bins, the probability that no bin has 2 or more balls is:

$\Pr[\text{no collision}] = \left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right) \cdots \left(1 - \frac{k-1}{n}\right) = \prod_{i=1}^{k-1}\left(1 - \frac{i}{n}\right)$

Interpretation of Each Term: Each term in the product represents the probability that the next ball goes into a new bin. The second ball has probability $1 - \frac{1}{n}$ of not going into the first ball’s bin, which would be a collision. The third ball has probability $1 - \frac{2}{n}$ of not going into either of the first two balls’ bins. This continues until the $k$-th ball, which has probability $1 - \frac{k-1}{n}$ of going into a new bin, since $k - 1$ bins are already occupied by the previous balls.

Note: This formula assumes $k \le n$ (the number of balls does not exceed the number of bins). If $k > n$, then by the pigeonhole principle, we are guaranteed to have at least one collision, so $\Pr[\text{no collision}] = 0$.

Connection to Birthday Paradox: If we relate people to balls and birthdays to bins, then throwing 24 balls into 365 bins gives us the birthday paradox. Here, 365 corresponds to the number of days in a year, and we’re asking for the probability that no date gets assigned more than one person, which is the same as everyone having a different birthday. When we plug $k = 24$ and $n = 365$ into the product above, we get the same 0.4616 that we calculated earlier for the birthday paradox.

To simplify this product, we use the following limits:

$\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = \frac{1}{e} \qquad \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n = e$

Note: We won’t go into the proof of these limits, but for now we’ll just take them as given.

For $n$ large enough, we can approximate:

$\left(1 - \frac{1}{n}\right)^n \approx e^{-1}$

If we raise both sides to the power of $\frac{1}{n}$, we get:

$1 - \frac{1}{n} \approx e^{-1/n}$

We can generalize the above to:

$1 - \frac{i}{n} \approx e^{-i/n}$

This is an important approximation to remember, as it applies to any term in our product formula.

Now we apply this to the product formula we saw earlier. The probability that no bin has 2 or more balls is:

$\prod_{i=1}^{k-1}\left(1 - \frac{i}{n}\right)$

Using our approximation, we replace each term:

$\prod_{i=1}^{k-1}\left(1 - \frac{i}{n}\right) \approx \prod_{i=1}^{k-1} e^{-i/n}$

When we multiply exponentials with the same base, we add the exponents:

$\prod_{i=1}^{k-1} e^{-i/n} = e^{-\frac{1}{n}\sum_{i=1}^{k-1} i} = e^{-\frac{k(k-1)}{2n}}$

The sum $\sum_{i=1}^{k-1} i = \frac{k(k-1)}{2} \approx \frac{k^2}{2}$ (for large $k$), so we get:

$\Pr[\text{no collision}] \approx e^{-k^2/(2n)}$

Therefore, the probability that some bin has at least 2 balls is:

$\Pr[\text{collision}] \approx 1 - e^{-k^2/(2n)}$

So, instead of manually calculating the probability for specific values of $k$ and $n$ in the original product, we can use this approximation to quickly estimate the probability of collisions in the balls and bins framework.

For $k = 23$ and $n = 365$:

$1 - e^{-23^2/(2 \cdot 365)} = 1 - e^{-529/730} \approx 0.52$

This confirms that the approximation works well for the birthday paradox with 23 people.

If we flip the question around, then instead of being given the number of people and asked for the probability, we instead ask:

How many people should we have in a room so that $\Pr[\text{some shared birthday}] \ge 0.8$?

In this case, $k$ is unknown and $n = 365$. We know that:

$1 - e^{-k^2/(2n)} = 0.8$

If we rearrange this equation by isolating the exponential term, we get:

$e^{-k^2/(2n)} = 0.2$

Taking the natural logarithm of both sides and multiplying by $-1$, we have:

$\frac{k^2}{2n} = \ln 5 \implies k = \sqrt{2n \ln 5} = \sqrt{2 \cdot 365 \cdot \ln 5} \approx 34.3$

Since we cannot have a fractional number of people, we can say that if we have 35 people in a room, there is (approximately) an 80% chance that some two will share a birthday.
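The rearranged formula can be turned into a one-line calculation; a Python sketch of the inversion (using the $e^{-k^2/(2n)}$ approximation):

```python
import math

def people_for_probability(target, n=365):
    """Smallest k with 1 - e^(-k^2/(2n)) >= target, via the approximation."""
    k = math.sqrt(-2 * n * math.log(1 - target))
    return math.ceil(k)

k80 = people_for_probability(0.8)
# sqrt(2 * 365 * ln 5) is about 34.3, so 35 people give roughly an 80% chance
```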

  • Chernoff bounds for sums of independent Bernoulli random variables
  • Expectation and concentration (tail bounds intuition)
  • Applying Chernoff bounds to hashing with chaining
  • High-probability guarantees on bucket sizes (chain lengths)

Let $X_1, X_2, \ldots, X_n$ be independent Bernoulli random variables, where

$\Pr[X_i = 1] = p_i \quad \text{and} \quad \Pr[X_i = 0] = 1 - p_i$

Define the sum

$X = \sum_{i=1}^{n} X_i$

This represents the number of successes (e.g., heads in $n$ biased coin tosses).

By linearity of expectation,

$\mu = \mathbb{E}[X] = \sum_{i=1}^{n} p_i$

The Chernoff bounds apply because the variables are independent. Independence is essential for obtaining strong concentration guarantees.

The goal is to bound the probability that deviates significantly from its expectation (tail probabilities). Chernoff bounds provide exponentially small probabilities for large deviations.

For $0 < \delta \le 1$ (upper tail),

$\Pr[X \ge (1 + \delta)\mu] \le e^{-\delta^2 \mu / 3}$

For $0 < \delta < 1$ (lower tail),

$\Pr[X \le (1 - \delta)\mu] \le e^{-\delta^2 \mu / 2}$

These inequalities show that $X$ is highly concentrated around its expectation $\mu$. As $\mu$ grows, the probability of large deviations decreases exponentially.

Intuitively, although each trial is random, averaging many independent trials makes extreme outcomes extremely unlikely.

An event occurs with high probability if it happens with probability at least $1 - \frac{1}{n}$ (or another quantity approaching $1$ as $n$ grows).

We hash $n$ elements into $m$ buckets using a random hash function, where each element independently lands in any bucket with probability $\frac{1}{m}$.

Fix a particular bucket. For each element $i$, define

$X_i = 1$ if element $i$ lands in this bucket, and $X_i = 0$ otherwise.

Then the bucket size is

$L = \sum_{i=1}^{n} X_i$

Since $\Pr[X_i = 1] = \frac{1}{m}$, we obtain

$\mathbb{E}[L] = \frac{n}{m}$

The quantity $\alpha = \frac{n}{m}$ is called the load factor (expected chain length).

Because the bucket size is a sum of independent Bernoulli indicator variables, Chernoff bounds can be applied directly to analyze load balance.

Using the upper-tail Chernoff bound,

$\Pr[L \ge (1 + \delta)\alpha] \le e^{-\delta^2 \alpha / 3}$

Thus, the probability that a bucket becomes much larger than its expected size decreases exponentially in the deviation $\delta$.

From One Bucket to All Buckets (Union Bound)


The Chernoff bound controls the size of a fixed bucket. To guarantee that no bucket becomes too large, we apply a union bound over all $m$ buckets:

$\Pr[\text{some bucket is too large}] \le \sum_{j=1}^{m} \Pr[\text{bucket } j \text{ is too large}] = m \cdot \Pr[\text{a fixed bucket is too large}]$

This shows that, with high probability, every bucket size remains close to its expectation.

A classical result states that when $n$ elements are hashed into $n$ buckets, the maximum bucket size is

$O\!\left(\frac{\log n}{\log \log n}\right)$

with high probability. This implies that hashing with chaining supports operations in nearly constant time with high probability.
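The small maximum load is visible in simulation; a short Python sketch (our own) throwing $n$ balls into $n$ bins:

```python
import random

def max_load(n_balls, n_bins):
    """Throw n_balls uniformly at random into n_bins; return the fullest bin."""
    counts = [0] * n_bins
    for _ in range(n_balls):
        counts[random.randrange(n_bins)] += 1
    return max(counts)

random.seed(5)
n = 10_000
loads = [max_load(n, n) for _ in range(20)]
# The classical O(log n / log log n) result predicts a small single-digit
# maximum here, far below the worst case of n.
```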

With high probability:

  • Bucket sizes remain close to the expected load
  • Very long chains are unlikely
  • Hash table operations (search, insert, delete) remain efficient

Chernoff bounds are a central tool in analyzing randomized algorithms and data structures. They show strong concentration for sums of independent random variables and provide high-probability guarantees, such as bounding chain lengths in hashing with chaining.

Randomized algorithms often achieve near-deterministic performance: undesirable outcomes occur only with exponentially small probability.