In modern computing, many problems involve big data, meaning datasets that are extremely large in size, often too large to store entirely in memory or to process using traditional deterministic algorithms.
Big data problems are characterized by:
Massive input sizes (millions or billions of elements)
Limited memory and time constraints
Data arriving in streams rather than all at once
Because of these constraints, classical algorithms that require multiple passes over the data or exact computations may be infeasible.
Randomized algorithms provide a powerful tool for dealing with big data. These algorithms use randomness in their logic, typically by making random choices during execution.
The key idea is that randomness allows us to:
Process only a small portion of the data
Approximate answers instead of computing exact results
Achieve good performance with high probability
Instead of guaranteeing correctness in every execution, randomized algorithms guarantee correctness with high probability. This tradeoff is acceptable in many big data applications where speed and scalability are more important than absolute precision.
Before analyzing randomized algorithms, it is important to review basic probability concepts that will be used throughout the course. These concepts help us reason about uncertainty and quantify the likelihood of different outcomes.
A probability space consists of:
A sample space, which is the set of all possible outcomes
Events, which are subsets of the sample space
A probability measure that assigns a value between 0 and 1 to each event
The probability of an event represents how likely it is to occur. Probabilities satisfy the following basic properties:
The probability of any event is between 0 and 1
The probability of the entire sample space is 1
The probability of an impossible event is 0
In the context of randomized algorithms, probabilities are used to analyze:
The likelihood that an algorithm produces a correct result
The expected behavior of an algorithm over random choices
The chance that a rare or undesirable outcome occurs
Rather than guaranteeing deterministic outcomes, randomized algorithms rely on probabilistic guarantees. This means we analyze how often an algorithm succeeds or fails over many possible random executions.
Let $A$ be an event in a probability space. The complement of $A$, denoted by $\bar{A}$, represents the event that $A$ does not occur.
The complement rule states that the probability of an event not occurring is equal to one minus the probability that the event occurs:
$$\Pr[\bar{A}] = 1 - \Pr[A]$$
This rule follows directly from the fact that an event and its complement together cover the entire sample space. Since the probability of the sample space is $1$, the probabilities of an event and its complement must add up to $1$.
In practice, the complement rule is often useful when computing $\Pr[A]$ directly is difficult, but computing $\Pr[\bar{A}]$ is easier. In such cases, we compute the probability of the complement first and subtract it from $1$.
In the analysis of randomized algorithms, the complement rule is frequently used to:
Bound the probability that an algorithm fails
Analyze rare or undesirable events
Convert success probabilities into failure probabilities, or vice versa
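As a small sketch of the last point, suppose a randomized algorithm succeeds with some probability on each run and we repeat it independently; the complement rule converts "at least one success" into "not all failures". The success probability and run count below are illustrative assumptions, not values from the notes.

```python
# Sketch (illustrative numbers): using the complement rule to convert
# between success and failure probabilities of repeated independent runs.

p_success = 0.6   # assumed success probability of one run
k = 10            # assumed number of independent runs

# P(at least one success) = 1 - P(all k runs fail)
p_all_fail = (1 - p_success) ** k
p_at_least_one = 1 - p_all_fail

print(round(p_all_fail, 6))      # probability every run fails
print(round(p_at_least_one, 6))  # probability of at least one success
```

Computing the complement $\Pr[\text{all } k \text{ runs fail}]$ is a single product, whereas summing over "exactly one success, exactly two successes, ..." would be far more work.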
Two events $A$ and $B$ are said to be independent if the occurrence of one event does not affect the probability of the other event.
Formally, events $A$ and $B$ are independent if:
$$\Pr[A \cap B] = \Pr[A] \cdot \Pr[B]$$
This definition captures the idea that knowing whether $A$ occurs gives no information about whether $B$ occurs, and vice versa. If the equality above does not hold, then the events are dependent.
Independence is a fundamental assumption in many randomized algorithms. Random choices made by an algorithm are often designed to be independent so that probabilities can be multiplied and analyzed more easily.
It is important to note that independence is a strong condition. Even if two events seem unrelated, they may still be dependent unless the product rule above holds exactly.
In algorithm analysis, independence allows us to:
Compute probabilities of multiple events occurring together
Analyze repeated random trials
Simplify probability calculations by separating events
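The product rule can be checked exactly on a small sample space. The die events below are an illustrative example (not from the notes): for one roll of a fair die, "the roll is even" and "the roll is at most 2" turn out to be independent.

```python
from fractions import Fraction

# Sketch: checking the product rule Pr[A and B] = Pr[A] * Pr[B] exactly
# for one roll of a fair six-sided die (illustrative events).

omega = range(1, 7)                    # sample space {1, ..., 6}
A = {w for w in omega if w % 2 == 0}   # event: roll is even
B = {1, 2}                             # event: roll is at most 2

def prob(event):
    # Uniform probability measure on the six outcomes.
    return Fraction(len(event), 6)

lhs = prob(A & B)        # Pr[A and B] = Pr[{2}] = 1/6
rhs = prob(A) * prob(B)  # (1/2) * (1/3) = 1/6

print(lhs == rhs)  # True: A and B are independent
```

Note that seemingly unrelated events can fail this test, which is why the equality must be verified rather than assumed.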
The binomial coefficient is used to count the number of ways to choose a fixed number of elements from a larger set, without regard to order.
For integers $n$ and $k$ with $0 \le k \le n$, the binomial coefficient is denoted by:
$$\binom{n}{k}$$
and represents the number of ways to choose $k$ elements from a set of $n$ elements.
The binomial coefficient is defined as:
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
In probability, binomial coefficients arise naturally when analyzing repeated independent trials, where each trial has two possible outcomes, such as success or failure.
They are especially useful when computing probabilities involving:
The number of ways a certain outcome can occur
Multiple independent random choices
Counting events before assigning probabilities
In the context of randomized algorithms, binomial coefficients help quantify how many different ways an algorithm’s random decisions can lead to the same result. This allows us to combine counting arguments with probability calculations.
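A minimal sketch of combining a counting argument with a probability calculation, using illustrative numbers: the probability of exactly $k$ heads in $n$ fair coin flips is the number of ways to place the heads, $\binom{n}{k}$, times the probability $(1/2)^n$ of any one sequence.

```python
import math
from fractions import Fraction

# Sketch (illustrative numbers): counting before assigning probabilities.
# Probability of exactly k heads in n fair coin flips.

n, k = 5, 2
ways = math.comb(n, k)      # C(5, 2) = 10 ways to place the heads
p = Fraction(ways, 2 ** n)  # each of the 2^5 sequences is equally likely

print(ways)  # 10
print(p)     # 5/16
```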
A random variable is a function $X : \Omega \to \mathbb{R}$, where $\Omega$ is the sample space consisting of all the possible outcomes of the experiment that $X$ models.
A random variable can be discrete, meaning that its support is finite or countably infinite, or continuous, meaning that its support is uncountable. In this course, we will primarily focus on discrete random variables, as they arise naturally in the analysis of randomized algorithms.
A discrete random variable takes on a finite or countably infinite set of values, each with an associated probability.
Example (Rolling a fair die): Consider the experiment of rolling a fair six-sided die. The sample space $\Omega = \{1, 2, 3, 4, 5, 6\}$ consists of the six possible outcomes. The random variable $X$ maps each outcome to a numerical value in its support: $X(\omega) = \omega$ for each $\omega \in \Omega$.
Since the die is fair, each outcome occurs with equal probability. We can describe the random variable as follows:
$$\Pr[X = x] = \frac{1}{6} \quad \text{for each } x \in \{1, 2, 3, 4, 5, 6\}$$
This notation makes explicit both the possible values of the random variable and the probability with which each value occurs.
A discrete random variable is fully described by its probability mass function (PMF), which assigns a probability to each value in its support.
The probabilities assigned to a random variable satisfy the following basic properties:
For every value $x$ in the support, $0 \le \Pr[X = x] \le 1$
The sum of the probabilities over all possible values is equal to 1
That is,
$$\sum_{x} \Pr[X = x] = 1$$
In randomized algorithms, random variables are used to model quantities such as running time, the number of correct outputs, or whether an algorithm succeeds or fails. By defining appropriate random variables, we can analyze the behavior of an algorithm using probability.
This perspective allows us to reason about algorithm performance in terms of likelihood and expectation, rather than exact deterministic outcomes.
The expectation of a discrete random variable $X$, denoted as $\mathbb{E}[X]$, is a weighted average of all the possible values that $X$ takes on. The expectation is given by:
$$\mathbb{E}[X] = \sum_{x} x \cdot \Pr[X = x]$$
Property 1 (Shift and Scale of Expectation): If $X$ is a random variable with finite expectation, then:
$$\mathbb{E}[aX + b] = a\,\mathbb{E}[X] + b$$
where $a, b \in \mathbb{R}$.
Property 2 (Linearity of Expectation): If $X_1, X_2, \ldots, X_n$ are random variables, each with finite expectation $\mathbb{E}[X_i]$, then:
$$\mathbb{E}[X_1 + X_2 + \cdots + X_n] = \mathbb{E}[X_1] + \mathbb{E}[X_2] + \cdots + \mathbb{E}[X_n]$$
The variance of a discrete random variable $X$, denoted as $\mathrm{Var}(X)$, is the average squared distance by which the values that $X$ takes on deviate from $\mathbb{E}[X]$. The variance is given by:
$$\mathrm{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$
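The expectation and variance of the fair-die roll from the earlier example can be computed exactly from its PMF; this is a small sketch of those two definitions.

```python
from fractions import Fraction

# Sketch: expectation and variance of a fair six-sided die roll,
# computed exactly from its PMF.

pmf = {x: Fraction(1, 6) for x in range(1, 7)}

expectation = sum(x * p for x, p in pmf.items())       # E[X]
second_moment = sum(x**2 * p for x, p in pmf.items())  # E[X^2]
variance = second_moment - expectation**2              # E[X^2] - E[X]^2

print(expectation)  # 7/2
print(variance)     # 35/12
```

Using exact fractions avoids floating-point noise and makes the identity $\mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$ verifiable term by term.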
A Bernoulli random variable models the number of successes that occur in a single trial with a fixed probability of success, $p$.
By convention, we say that the Bernoulli random variable can take on either $0$ (indicating a failure) or $1$ (indicating a success). Thus, the support of the Bernoulli distribution is $\{0, 1\}$.
Example (Flipping a fair coin once): If we define a success as landing heads and a failure as landing tails, then a single flip of a fair coin is an example of a Bernoulli trial with a fixed probability of success, $p = \frac{1}{2}$. We denote this random variable as follows:
$$X \sim \mathrm{Bernoulli}\left(\tfrac{1}{2}\right)$$
A Binomial random variable models the number of successes that occur in $n$ independent Bernoulli trials with a fixed probability of success, $p$.
Since it models a count of the number of successes in $n$ trials, the Binomial random variable can take on any whole number between $0$ and $n$. Thus, the support of the Binomial distribution is $\{0, 1, \ldots, n\}$.
Example (Flipping a fair coin 50 times): If we define a success as landing heads and a failure as landing tails, then 50 flips of a fair coin is an example of a Binomial procedure with a fixed probability of success in each trial, $p = \frac{1}{2}$. We denote this random variable as follows:
$$X \sim \mathrm{Binomial}\left(50, \tfrac{1}{2}\right)$$
Very often, it is useful to interpret a Binomial random variable as the sum of $n$ independent Bernoulli random variables, each with probability of success $p$. In other words:
If $X \sim \mathrm{Binomial}(n, p)$, then $X = X_1 + X_2 + \cdots + X_n$, where each $X_i \sim \mathrm{Bernoulli}(p)$ independently.
Expressing the Binomial random variable in this way allows for certain expressions to be simplified, such as computing the expectation and the variance of the Binomial distribution.
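For instance, the decomposition gives the Binomial expectation and variance immediately: each Bernoulli contributes $p$ to the mean and $p(1-p)$ to the variance, and (for independent trials) both add across the $n$ trials. A sketch with the coin-flip numbers from the example:

```python
from fractions import Fraction

# Sketch: E and Var of Binomial(n, p) via the Bernoulli decomposition
# X = X_1 + ... + X_n, using the n = 50, p = 1/2 example.

n = 50
p = Fraction(1, 2)

e_bernoulli = p                # E[X_i] = p
var_bernoulli = p * (1 - p)    # Var(X_i) = p(1 - p)

# Linearity of expectation gives E[X] = n*p; for independent trials,
# variances also add, giving Var(X) = n*p*(1 - p).
e_binomial = n * e_bernoulli
var_binomial = n * var_bernoulli

print(e_binomial)    # 25
print(var_binomial)  # 25/2
```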
Mutual independence means that knowing the values of all but one of the variables (say, all of them except $X_i$) does not give any information about $X_i$. This type of independence is very rare. In computer science, obtaining purely independent random variables from perfect randomness is impossible in practice.
Linearity of Variance for a Sequence of Random Variables: If $X_1, \ldots, X_n$ are independent random variables, then
$$\mathrm{Var}(X_1 + X_2 + \cdots + X_n) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \cdots + \mathrm{Var}(X_n)$$
Unlike linearity of expectation, this property requires independence.
Let there be 20 unique characters in Pokemon. Pokemon is collaborating with a cereal company, and when you buy cereal, you get one Pokemon character randomly. How many boxes of cereal in expectation do you need to buy to get all 20 Pokemon characters?
Let’s start with a simple question: If I already have 19 characters, what is the chance that I get the 20th unique character in my next cereal box? The answer is $\frac{1}{20}$, because we are choosing the 1 unseen character out of 20 total characters.
Now, let’s find the pattern of the probability of each case, as the probability will change every time we pick a new character:
If we have 0 characters, we have a $\frac{20}{20} = 1$, or 100%, chance that we get a new character.
If we have 1 character, we have a $\frac{19}{20}$ chance that we get a new character.
If we have 2 characters, we have an $\frac{18}{20}$ chance that we get a new character.
...
If we have 19 characters, we have a $\frac{1}{20}$ chance that we get a new character.
Let $X$ be the number of cereal boxes bought until we collect all 20 Pokemon characters. So, $X = X_1 + X_2 + \cdots + X_{20}$. We break down $X$ into $X_1, X_2, \ldots, X_{20}$. Consider $X_i$ to be the number of boxes bought after getting $i - 1$ characters but before getting the $i$-th character. The variables $X_i$ are geometric random variables with success probability $p_i = \frac{20 - (i - 1)}{20}$.
$X_1$ = the number of boxes bought before the 1st new character; $X_1 \sim \mathrm{Geometric}\left(\frac{20}{20}\right)$
$X_2$ = the number of boxes bought after getting the 1st character but before the 2nd new character; $X_2 \sim \mathrm{Geometric}\left(\frac{19}{20}\right)$
$X_3$ = the number of boxes bought after getting the 2nd character but before the 3rd new character; $X_3 \sim \mathrm{Geometric}\left(\frac{18}{20}\right)$
...
$X_{20}$ = the number of boxes bought after getting the 19th character but before the 20th new character; $X_{20} \sim \mathrm{Geometric}\left(\frac{1}{20}\right)$
Now, we find the expectation of the total number of cereal boxes we need to buy to get all 20 Pokemon characters. Since a geometric random variable with success probability $p_i$ has $\mathbb{E}[X_i] = \frac{1}{p_i} = \frac{20}{21 - i}$, linearity of expectation gives:
$$\mathbb{E}[X] = \sum_{i=1}^{20} \mathbb{E}[X_i] = \frac{20}{20} + \frac{20}{19} + \cdots + \frac{20}{1} = 20 \sum_{k=1}^{20} \frac{1}{k} \approx 71.95$$
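The coupon-collector expectation can be checked numerically: the exact sum $20 \cdot H_{20}$ is computed directly, and a quick simulation (a sketch, with an arbitrary random seed) lands close to it.

```python
import random
random.seed(0)  # arbitrary seed for reproducibility

# Sketch: exact coupon-collector expectation vs. a simulation.

N = 20  # number of distinct Pokemon characters (one random one per box)

# Exact expectation: 20/20 + 20/19 + ... + 20/1 = 20 * H_20.
expected = sum(N / k for k in range(1, N + 1))
print(round(expected, 2))  # 71.95

def boxes_until_complete():
    seen = set()
    boxes = 0
    while len(seen) < N:
        seen.add(random.randrange(N))  # one uniformly random character
        boxes += 1
    return boxes

trials = 10_000
avg = sum(boxes_until_complete() for _ in range(trials)) / trials
# avg should land near 71.95 (it varies by a fraction of a box)
```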
In the membership/dictionary problem, we have a set of $n$ keys, $S$, from a universe $U$. The goal is to store $S$ in a data structure such that, given a query element $x \in U$, we can quickly determine whether $x \in S$ or not.
Brute Force: We take $x$ and compare it with every key in $S$. If it exists, we return the key.
Runtime: $O(n)$
Binary Search: First, we sort $S$, which takes $O(n \log n)$ preprocessing time, and then we do the query using binary search. If the element exists, we return the key.
Runtime: $O(\log n)$ per query
Is it possible to solve this problem and achieve constant querying time, $O(1)$?
Given a hash function $h : U \to \{0, 1, \ldots, m-1\}$ (where $m$ is the size of the table we will use), we assume $h$ is perfectly random (meaning every possible hash value is equally likely, as if $h$ were selected from a large hash family). For any key $x \in U$ and any bucket $i \in \{0, 1, \ldots, m-1\}$:
$$\Pr[h(x) = i] = \frac{1}{m}$$
Algorithm:
(1) Initialize $m$ linked lists, one for each bucket
(2) For $i = 1$ to $n$:
    compute $h(x_i)$
    append $x_i$ to the linked list in bucket $h(x_i)$
Note: Query time depends on which bucket you have to search.
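The algorithm above can be sketched as follows. This is an illustrative implementation, with Python lists standing in for linked lists and Python's built-in `hash` standing in for the perfectly random hash function assumed in the notes.

```python
# Sketch of hashing with chaining (illustrative; not the course's code).

class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.buckets = [[] for _ in range(m)]  # one chain per bucket

    def _bucket(self, key):
        # Built-in hash stands in for the assumed random hash function h.
        return hash(key) % self.m

    def insert(self, key):
        self.buckets[self._bucket(key)].append(key)

    def query(self, key):
        # Cost is proportional to the length of this one chain.
        return key in self.buckets[self._bucket(key)]

table = ChainedHashTable(m=8)
for k in ["apple", "banana", "cherry"]:
    table.insert(k)

print(table.query("banana"))  # True
print(table.query("durian"))  # False
```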
Although the expected query time is constant, the worst-case query time is still $O(n)$. The natural question arises: how often will the query time significantly exceed its expectation?
In computer science, we typically focus on the upper tail, since we want to bound how long an algorithm can take.
In finance, the lower tail is often more important, as it represents potential losses.
While the expected query time is $O(1)$, a user might be concerned about the worst case.
Worst Case: All items hash to the same bucket. The query time is then $O(n)$.
Tail Bounds: The “tail” refers to the region of extreme outcomes in the probability distribution (e.g., the event $X \ge t$ for a large threshold $t$).
Ideally, we would calculate the exact probability:
$$\Pr[X \ge t]$$
However, this often has no nice closed form. Instead, we use inequalities to bound this probability.
Markov’s inequality is one of the most fundamental results in probability theory. It provides a simple bound on the tail probability of a non-negative random variable using only its expectation: for any $t > 0$,
$$\Pr[X \ge t] \le \frac{\mathbb{E}[X]}{t}$$
Requirement: You only need to know the expectation $\mathbb{E}[X]$, and $X$ must be non-negative ($X \ge 0$).
Setup: Consider a class with 33 students. After a midterm exam, we are told that the average score is 60. No other information about the distribution of scores is provided. At most how many students could have scored at least 90?
Let $X$ denote the score of a randomly selected student. Then $\mathbb{E}[X] = 60$. Applying Markov’s inequality to $X$:
$$\Pr[X \ge 90] \le \frac{\mathbb{E}[X]}{90} = \frac{60}{90} = \frac{2}{3}$$
The expected number of students scoring at least 90 is at most:
$$33 \cdot \frac{2}{3} = 22$$
Conclusion: At most 22 students can have scored 90 or higher.
In simple words, Markov’s inequality basically says if the expectation of a non-negative random variable is fixed, then there is a maximum possible probability that the variable can exceed any threshold $t$. If too much probability were above $t$, the average would have to be larger than it actually is.
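The midterm example can be worked through as exact arithmetic:

```python
from fractions import Fraction
import math

# Sketch: the midterm example. Markov bounds Pr[X >= 90] by E[X]/90
# for the non-negative score X.

num_students = 33
mean_score = 60
threshold = 90

markov_bound = Fraction(mean_score, threshold)  # 60/90 reduces to 2/3
max_students = math.floor(num_students * markov_bound)

print(markov_bound)  # 2/3
print(max_students)  # 22
```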
Chebyshev’s inequality bounds how far a random variable can stray from its mean: for any $t > 0$,
$$\Pr\big[\,|X - \mathbb{E}[X]| \ge t\,\big] \le \frac{\mathrm{Var}(X)}{t^2}$$
The left-hand side of Chebyshev’s inequality measures the probability that $X$ deviates from its mean by at least $t$ in either direction. This is called a two-sided bound because it accounts for both the upper tail and the lower tail.
We can decompose the event $|X - \mathbb{E}[X]| \ge t$ into two cases:
Case 1: $X - \mathbb{E}[X] \ge t$ (positive deviation)
This is the right tail (upper tail).
Case 2: $X - \mathbb{E}[X] \le -t$ (negative deviation)
This is the left tail (lower tail).
Together, these two cases cover all outcomes where $X$ deviates from its mean by at least $t$.
Using our calculated variance and an expected value of roughly 1:
For the same example of $X$:
In simple terms, Chebyshev’s inequality basically says if the variance of a random variable is fixed, then there is a maximum possible probability that the variable can be far away from its mean by more than some amount $t$. If too much probability were far from the mean, the spread (variance) would have to be larger than it actually is.
Define an indicator variable $X_i$ for each key, where $X_i = 1$ if the $i$-th key hashes to the same bucket as the query and $X_i = 0$ otherwise. Thus, $X = \sum_{i=1}^{n} X_i$ counts the number of keys that hash to the same bucket as the query.
The chance that the $i$-th key hashes to the same bucket as the query is $\frac{1}{m}$.
The variance of the query time is the sum of the variance of each $X_i$ (since the hash values are independent).
Variance of a Bernoulli is $p(1 - p)$, where $p = \frac{1}{m}$. Summing over $n$ variables gives
$$\mathrm{Var}(X) = n \cdot \frac{1}{m}\left(1 - \frac{1}{m}\right)$$
Since $1 - \frac{1}{m} \le 1$, replacing it by $1$ gives an upper bound:
$$\mathrm{Var}(X) \le \frac{n}{m}$$
If $m = n$, then $\mathrm{Var}(X) \le 1$.
We compute the variance of $X$ to apply Chebyshev’s inequality. Markov’s inequality only requires expectation, but Chebyshev requires both expectation and variance. Assuming $m = n$, we have $\mathbb{E}[X] = 1$ and $\mathrm{Var}(X) \le 1$.
Applying Chebyshev:
$$\Pr\big[\,|X - 1| \ge t - 1\,\big] \le \frac{1}{(t-1)^2}$$
Since $X \ge t$ is equivalent to $X - 1 \ge t - 1$, we get
$$\Pr[X \ge t] \le \frac{1}{(t-1)^2}$$
This is much stronger than the Markov bound, $\Pr[X \ge t] \le \frac{1}{t}$.
For example, when $t = 50$, Markov gives approximately 2%, while Chebyshev gives about 0.04%. There is no contradiction: Chebyshev uses more information (variance), so it gives a tighter bound.
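The two bounds can be compared directly in the $m = n$ setting above (where $\mathbb{E}[X] = 1$ and $\mathrm{Var}(X) \le 1$), at the threshold $t = 50$:

```python
# Sketch: Markov vs. Chebyshev for the chain length X with E[X] = 1
# and Var(X) <= 1, at the threshold t = 50.

t = 50

markov = 1 / t                 # Pr[X >= t] <= E[X]/t = 1/50
chebyshev = 1 / (t - 1) ** 2   # Pr[X >= t] <= 1/(t-1)^2 = 1/2401

print(f"{markov:.2%}")     # 2.00%
print(f"{chebyshev:.4%}")  # 0.0416%
```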
Tail Bounds for Random Variables in Hashing with Chaining
Before analyzing the FKS hashing preprocessing, let’s review its steps.
Step 2: We take each row from the first step and find a hash function for each row that distributes its keys into a row of length $n_i^2$ (where $n_i$ is the number of keys in row $i$) such that there are no collisions.
By the end, we will have used $m + 1$ hash functions: 1 in step 1 and $m$ in step 2, one per row.
Previously, we concluded that steps 1 and 2 of FKS Hashing take $O(n)$ time in expectation.
Operation        FKS                  HWC
Preprocessing    $O(n)$ expected      $O(n)$
Query            $O(1)$ worst case    $O(1)$ expected
We’ve proved that FKS hashing has $O(1)$ worst-case query time and $O(n)$ expected preprocessing time.
We also saw that, by Markov:
$$\Pr[X \ge t] \le \frac{\mathbb{E}[X]}{t}$$
And that by Chebyshev:
$$\Pr\big[\,|X - \mathbb{E}[X]| \ge t\,\big] \le \frac{\mathrm{Var}(X)}{t^2}$$
We also briefly discussed that, by Chernoff, tail probabilities decay exponentially; one common form, for $0 < \delta \le 1$ and $\mu = \mathbb{E}[X]$, is:
$$\Pr[X \ge (1 + \delta)\mu] \le e^{-\delta^2 \mu / 3}$$
If we were to concisely describe the Chernoff bound, we would say that it bounds tail probabilities by an exponentially small quantity, $e^{-\Omega(t)}$. We should also note that the Chernoff bound is only applicable when the random variable is a sum of independent random variables.
Imagine you are a scheduler with 10 machines. People come to you with different tasks (code they want to run), and your job is to assign a machine to each task. These are called scheduling problems: you have a certain number of jobs and a certain number of machines, and you want to assign jobs to machines to finish them in the least amount of time (or optimize some other objective).
One of the simplest approaches is random scheduling: when someone comes with a task, you just randomly pick one of your machines and assign the task to it.
To analyze these randomized algorithms, we use a framework called balls and bins. This framework is also useful for understanding hashing algorithms, since a hash function takes keys and hashes them into cells of a hash table, where all cells are equally likely.
When we throw $n$ balls into $m$ bins, the probability that no bin has 2 or more balls is:
$$\prod_{i=1}^{n-1}\left(1 - \frac{i}{m}\right) = \left(1 - \frac{1}{m}\right)\left(1 - \frac{2}{m}\right)\cdots\left(1 - \frac{n-1}{m}\right)$$
Interpretation of Each Term: Each term in the product represents the probability that the next ball goes into a new bin. The second ball has probability $1 - \frac{1}{m}$ of not going into the first ball’s bin, which would be a collision. The third ball has probability $1 - \frac{2}{m}$ of not going into either of the first two balls’ bins. This continues until the $n$-th ball, which has probability $1 - \frac{n-1}{m}$ of going into a new bin, since $n - 1$ bins are already occupied by the previous balls.
Note: This formula assumes $n \le m$ (the number of balls does not exceed the number of bins). If $n > m$, then by the pigeonhole principle, we are guaranteed to have at least one collision, so the probability of no collision is $0$.
Connection to Birthday Paradox: If we relate people to balls and birthdays to bins, then throwing 24 balls into 365 bins gives us the birthday paradox. Here, 365 corresponds to the number of days in a year, and we’re asking for the probability that no date gets assigned more than one person, which is the same as everyone having a different birthday. When we plug in $n = 24$ and $m = 365$ into the product above, we get the same 0.4616 that we calculated earlier for the birthday paradox.
For small $x$, we have the approximation $1 - x \approx e^{-x}$. This is an important approximation to remember, as it applies to any term $1 - \frac{i}{m}$ in our product formula.
Now we apply this to the product formula we saw earlier. The probability that no bin has 2 or more balls is:
$$\prod_{i=1}^{n-1}\left(1 - \frac{i}{m}\right)$$
Using our approximation, we replace each term $1 - \frac{i}{m}$ by $e^{-i/m}$:
$$\prod_{i=1}^{n-1} e^{-i/m}$$
When we multiply exponentials with the same base, we add the exponents:
$$\prod_{i=1}^{n-1} e^{-i/m} = e^{-\frac{1}{m}\sum_{i=1}^{n-1} i}$$
The sum $\sum_{i=1}^{n-1} i = \frac{n(n-1)}{2} \approx \frac{n^2}{2}$ (for large $n$), so we get:
$$\Pr[\text{no collision}] \approx e^{-n^2/(2m)}$$
Therefore, the probability that some bin has at least 2 balls is:
$$\Pr[\text{collision}] \approx 1 - e^{-n^2/(2m)}$$
So, instead of manually calculating the probability for specific values of and in the original product, we can use this approximation to quickly estimate the probability of collisions in the balls and bins framework.
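As a sanity check, the exact product and the $e^{-n^2/(2m)}$ approximation can be compared for the birthday-paradox numbers:

```python
import math

# Sketch: exact no-collision probability vs. the e^{-n^2/(2m)}
# approximation, for n = 24 balls (people) and m = 365 bins (days).

n, m = 24, 365

exact = 1.0
for i in range(1, n):
    exact *= 1 - i / m               # product of (1 - i/m)

approx = math.exp(-n**2 / (2 * m))   # e^{-n^2/(2m)}

print(round(exact, 4))   # 0.4617
print(round(approx, 4))  # 0.4543
```

The approximation is a few percent off for these small values of $n$ and $m$, but it captures the right order of magnitude and becomes more accurate as $m$ grows relative to each $i/m$ term.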
If we flip the question around, then instead of being given the number of people and asked for the probability, we instead ask:
How many people should we have in a room so that the probability that some two share a birthday is $0.8$?
In this case, $n$ is unknown and $m = 365$. We know that:
$$1 - e^{-n^2/(2 \cdot 365)} = 0.8$$
If we rearrange this equation by isolating the exponential term, we get:
$$e^{-n^2/730} = 0.2$$
Taking the natural logarithm of both sides and dividing by $-1$, we have:
$$\frac{n^2}{730} = \ln 5 \quad \Longrightarrow \quad n = \sqrt{730 \ln 5} \approx 34.3$$
Since we cannot have a fractional number of people, we round up and say that if we have 35 people in a room, there is (approximately) an 80% chance that some two will share a birthday.
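The inverted calculation is a one-liner:

```python
import math

# Sketch: solving 1 - e^{-n^2/(2m)} = 0.8 for n, with m = 365.

m = 365
target = 0.8

n = math.sqrt(-2 * m * math.log(1 - target))  # n = sqrt(730 * ln 5)
print(round(n, 2))   # 34.28
print(math.ceil(n))  # 35 people, rounding up
```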
Let $X_1, X_2, \ldots, X_n$ be independent Bernoulli random variables, where
$$\Pr[X_i = 1] = p \quad \text{and} \quad \Pr[X_i = 0] = 1 - p$$
Define the sum
$$X = X_1 + X_2 + \cdots + X_n$$
This represents the number of successes (e.g., heads in $n$ biased coin tosses).
By linearity of expectation,
$$\mathbb{E}[X] = np$$
The Chernoff bounds apply because the variables $X_1, \ldots, X_n$ are independent. Independence is essential for obtaining strong concentration guarantees.
The goal is to bound the probability that $X$ deviates significantly from its expectation (tail probabilities). Chernoff bounds provide exponentially small probabilities for large deviations.
Chernoff bounds are a central tool in analyzing randomized algorithms and data structures. They show strong concentration for sums of independent random variables and provide high-probability guarantees, such as bounding chain lengths in hashing with chaining.