Lecture 3 on 02/02/2026 - Independence, Geometric Random Variables, and Hashing
Scribes: Joshua Sin and Ye Htut Muang
Summary of the Lecture
- Geometric Random Variables: what they are and how they are useful
- The Coupon Collector Problem, its problem statement and how we can solve it using geometric random variables
- The Membership/Dictionary Problem, its problem statement and how we can solve it using hashing with chaining
- Hashing with chaining: what it is and how it gives fast membership queries
Independence
$X$ and $Y$ are independent if
$$\Pr[X = x,\, Y = y] = \Pr[X = x] \cdot \Pr[Y = y] \quad \text{for all } x, y.$$
Here $\Pr[X = x,\, Y = y]$ means the joint distribution of $X$ and $Y$. Independence of random variables means that the value of one r.v. doesn't affect the other.
Types of Independence
For a sequence of random variables $X_1, X_2, \dots, X_n$, there are two notions of independence:
- Pairwise independence
- Mutual independence
Pairwise Independence / 2-wise Independence
Random variables $X_1, \dots, X_n$ are pairwise independent if $X_i$ and $X_j$ are independent for every pair $i \neq j$. This means finding $X_i$ does not help in finding $X_j$. However, finding $X_i$ and $X_j$ together can help in finding a third variable $X_k$.
Mutual Independence / n-wise Independence
Random variables $X_1, \dots, X_n$ are mutually independent if finding all of the variables but $X_i$ (or finding any subset of them) does not give any information about $X_i$. This type of independence is very rare: in computer science, obtaining perfectly mutually independent random variables from real sources of randomness is essentially impossible.
Linearity of Variance for a Sequence of Random Variables
For any two independent random variables $X$ and $Y$:
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).$$
This is different from linearity of expectation, which holds even without independence. For variances, we do need independence—but it turns out we only need pairwise independence, not the stronger mutual independence. This is a useful fact because pairwise independence is much easier to achieve in practice.
Suppose $X_1, \dots, X_n$ are pairwise independent. By applying linearity of variance with $X = X_1 + \dots + X_{n-1}$ and $Y = X_n$, and repeating, we get:
$$\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i).$$
(More precisely, expanding $\mathrm{Var}\left(\sum_i X_i\right) = \sum_i \mathrm{Var}(X_i) + \sum_{i \neq j} \mathrm{Cov}(X_i, X_j)$ shows that pairwise independence is enough, since each covariance term vanishes.)
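To see that pairwise independence is genuinely weaker than mutual independence, and that variance is still additive under it, here is a small sanity check (the XOR construction is a standard textbook example, not something from this lecture): take independent fair bits $X$ and $Y$ and set $Z = X \oplus Y$.

```python
from itertools import product

# Standard XOR example (not from the lecture): X, Y independent fair bits,
# Z = X xor Y. The triple is pairwise independent but NOT mutually
# independent, yet variances still add up.
outcomes = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]  # each w.p. 1/4

def var(values):
    """Variance of the uniform distribution over `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Pairwise independence: every pair of coordinates is uniform on {0,1}^2.
for i, j in [(0, 1), (0, 2), (1, 2)]:
    assert sorted((o[i], o[j]) for o in outcomes) == sorted(product([0, 1], repeat=2))

# Not mutually independent: X and Y together determine Z. Still:
sums = [x + y + z for x, y, z in outcomes]
singles = [[o[k] for o in outcomes] for k in range(3)]
assert var(sums) == sum(var(s) for s in singles)  # 0.75 == 3 * 0.25
```

Knowing $X$ and $Y$ pins down $Z$ exactly, so the triple is not mutually independent, yet the variance of the sum still splits into the sum of variances.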
Geometric Random Variable
Say we have a coin which turns up heads with probability $p$. We toss it repeatedly until we get a heads. The number of tosses $X$ is random, taking values in $\{1, 2, 3, \dots\}$. Then,
$$\Pr[X = k] = (1 - p)^{k-1}\, p.$$
This tells us: to get heads on exactly the $k$-th toss, we must get tails on the first $k - 1$ tosses (each with probability $1 - p$) and then heads on the $k$-th toss (with probability $p$).
Expectation - Alternative Definition
Computing the expectation using the standard definition $\mathbb{E}[X] = \sum_k k \cdot \Pr[X = k]$ would require complicated algebraic manipulation. Instead, there is a much more useful alternative formula for random variables taking values in $\{1, 2, 3, \dots\}$:
$$\mathbb{E}[X] = \sum_{k=1}^{\infty} \Pr[X \geq k].$$
Using this formula and the geometric series, we can derive that the expectation simplifies to:
$$\mathbb{E}[X] = \sum_{k=1}^{\infty} \Pr[X \geq k] = \sum_{k=1}^{\infty} (1 - p)^{k-1} = \frac{1}{1 - (1 - p)} = \frac{1}{p}.$$
Expectation and Variance of Geometric Random Variable
For a geometric random variable with success probability $p$:
$$\mathbb{E}[X] = \frac{1}{p}, \qquad \mathrm{Var}(X) = \frac{1 - p}{p^2}.$$
Intuitively, the expectation makes sense: if the probability of success on each trial is $p$, then on average we need $1/p$ trials to see a success. For example, with a fair coin ($p = 1/2$), we expect to need 2 tosses to see a heads.
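As a numerical sanity check (an added sketch, not part of the original notes), we can truncate the infinite sums that define the mean and variance and compare them against the closed forms $1/p$ and $(1-p)/p^2$:

```python
# Sanity check (an added sketch, not from the notes): truncate the infinite
# sums for a geometric random variable with Pr[X = k] = (1-p)^(k-1) * p and
# compare against the closed forms E[X] = 1/p and Var(X) = (1-p)/p^2.

def geometric_moments(p, terms=1000):
    """Truncated mean and variance of a Geometric(p) random variable."""
    mean = sum(k * (1 - p) ** (k - 1) * p for k in range(1, terms))
    second = sum(k ** 2 * (1 - p) ** (k - 1) * p for k in range(1, terms))
    return mean, second - mean ** 2

p = 0.5
mean, variance = geometric_moments(p)
assert abs(mean - 1 / p) < 1e-9                 # E[X] = 2 for a fair coin
assert abs(variance - (1 - p) / p ** 2) < 1e-9  # Var(X) = 2

# The tail-sum formula gives the same expectation: E[X] = sum_k Pr[X >= k].
tail = sum((1 - p) ** (k - 1) for k in range(1, 1000))
assert abs(tail - mean) < 1e-9
```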
Coupon Collector Problem
Section titled “Coupon Collector Problem”Pokemon Character Example
Section titled “Pokemon Character Example”Let there be 20 unique characters in Pokemon. Pokemon is collaborating with a cereal company, and when you buy cereal, you get one Pokemon character randomly. How many boxes of cereal in expectation do you need to buy to get all 20 Pokemon characters?
Let’s start with a simple question: If I already have 19 characters, what is the chance that I get the 20th unique character in my next cereal box? The answer is $\frac{1}{20}$, because we are choosing the 1 unseen character out of 20 total characters.
Now, let’s find the pattern of the probability of each case, as the probability will change every time we pick a new character:
- If we have 0 characters, we have $\frac{20}{20}$ or 100% chance that we get a new character.
- If we have 1 character, we have $\frac{19}{20}$ chance that we get a new character.
- If we have 2 characters, we have $\frac{18}{20}$ chance that we get a new character.
- …
- If we have 19 characters, we have $\frac{1}{20}$ chance that we get a new character.
Let $X$ be the number of cereal boxes bought until we collect all 20 Pokemon characters; we want $\mathbb{E}[X]$. We break $X$ down into $X = X_1 + X_2 + \dots + X_{20}$. Consider $X_i$ to be the number of boxes bought after getting $i - 1$ characters but before getting the $i$-th character. The variables $X_1, \dots, X_{20}$ are geometric random variables.
- $X_1$ = the number of boxes bought before the 1st new character; geometric with $p_1 = \frac{20}{20}$ and $\mathbb{E}[X_1] = \frac{20}{20} = 1$
- $X_2$ = the number of boxes bought after getting the 1st character but before the 2nd new character; geometric with $p_2 = \frac{19}{20}$ and $\mathbb{E}[X_2] = \frac{20}{19}$
- $X_3$ = the number of boxes bought after getting the 2nd character but before the 3rd new character; geometric with $p_3 = \frac{18}{20}$ and $\mathbb{E}[X_3] = \frac{20}{18}$
- …
- $X_{20}$ = the number of boxes bought after getting the 19th character but before the 20th new character; geometric with $p_{20} = \frac{1}{20}$ and $\mathbb{E}[X_{20}] = \frac{20}{1} = 20$
Now, by linearity of expectation, we find the expectation of the total number of cereal boxes we need to buy to get all 20 Pokemon characters:
$$\mathbb{E}[X] = \sum_{i=1}^{20} \mathbb{E}[X_i] = \frac{20}{20} + \frac{20}{19} + \dots + \frac{20}{1} = 20 \left(1 + \frac{1}{2} + \dots + \frac{1}{20}\right).$$
Now comes an important observation. The sum inside the parentheses is a harmonic series. It turns out (and you may recall from calculus) that:
$$H_n = \sum_{k=1}^{n} \frac{1}{k} \approx \ln n.$$
This is not a coincidence: the derivative of $\ln x$ is $\frac{1}{x}$, so $\int_1^n \frac{1}{x}\, dx = \ln n$, and the harmonic series is just the discrete version of this integral. Therefore:
$$\mathbb{E}[X] = 20 \cdot H_{20} \approx 20 \ln 20.$$
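The estimate can be checked directly (a small illustrative computation, not part of the original notes): the exact expectation is $20 \cdot H_{20}$, and the $n \ln n$ estimate sits a bit below it because $H_n = \ln n + \gamma + o(1)$.

```python
import math

# Illustrative computation (not from the notes): the exact expected number
# of boxes is 20 * H_20; compare with the n*ln(n) estimate. The two differ
# by roughly Euler's constant times n, but grow at the same rate.
n = 20
harmonic = sum(1 / k for k in range(1, n + 1))  # H_20 ≈ 3.598
expected_boxes = n * harmonic                   # 20 * H_20 ≈ 71.95

print(expected_boxes)   # ≈ 71.95, so about 72 boxes on average
print(n * math.log(n))  # ≈ 59.91, the ln-based estimate
```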
When we talk about the expected runtime for $n$ coupons, it will be $O(n \log n)$.
Logarithms appear in computer science for two fundamental reasons. The first is divide-and-conquer algorithms like binary search or merge sort, which repeatedly divide the problem in half. The second is through harmonic series like this one, which naturally arise when analyzing algorithms involving probabilities and geometric random variables.
Generalization of Coupon Collector Problem
For $n$ distinct coupons, the same argument applies: $X_i$ is geometric with success probability $p_i = \frac{n - i + 1}{n}$, so $\mathbb{E}[X_i] = \frac{n}{n - i + 1}$. By the linearity of expectation, the total expected number of boxes is:
$$\mathbb{E}[X] = \sum_{i=1}^{n} \frac{n}{n - i + 1} = n \sum_{k=1}^{n} \frac{1}{k} = n \cdot H_n.$$
Since $H_n \approx \ln n$:
$$\mathbb{E}[X] \approx n \ln n = O(n \log n).$$
Membership/Dictionary Problem
Section titled “Membership/Dictionary Problem”Problem Statement
In the membership/dictionary problem, we have a set of $n$ keys, $S = \{x_1, \dots, x_n\}$, from a universe $U$. The goal is to store $S$ in a data structure such that, given a query element $q \in U$, we can quickly determine whether $q \in S$ or not.
Methods to Solve
Brute Force: We take $q$ and compare it with every key in $S$. If it exists, we return the key.
- Runtime: $O(n)$ per query
Binary Search: First, we sort $S$, which takes $O(n \log n)$ preprocessing time, and then we answer the query using binary search. If the element exists, we return the key.
- Runtime: $O(\log n)$ per query
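The binary-search method can be sketched with Python's standard `bisect` module (an illustrative sketch with made-up function names; the lecture does not prescribe an implementation):

```python
import bisect

# A minimal sketch of the binary-search method (function names are ours,
# not the lecture's): sort once, then answer each query in O(log n).

def build(keys):
    """O(n log n) preprocessing: sort the keys."""
    return sorted(keys)

def member(sorted_keys, q):
    """O(log n) query: binary-search for q."""
    i = bisect.bisect_left(sorted_keys, q)
    return i < len(sorted_keys) and sorted_keys[i] == q

table = build([17, 5, 22, 9])
assert member(table, 22)     # 22 is in the set
assert not member(table, 4)  # 4 is not
```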
Is it possible to solve this problem and achieve constant query time, $O(1)$?
Algorithm 1 - Hashing with Chaining
Section titled “Algorithm 1 - Hashing with Chaining”To solve the membership problem efficiently, we use a technique called hashing. The idea is to use a hash function to distribute keys across multiple buckets (or cells), reducing the amount of searching needed.
A hash function $h$ maps the universe to a set of $m$ buckets:
$$h : U \to \{0, 1, \dots, m - 1\},$$
where $m$ is chosen to be much smaller than $|U|$.
The Concept of a Hash Family
Strictly speaking, we don't just have a single hash function. Instead, we work with a hash family $\mathcal{H}$: a collection of many different hash functions. We assume this family is perfectly random, meaning that if we pick a random function $h$ from this family, then for any key $x \in U$ and any bucket $i$, the probability that $h$ hashes $x$ to bucket $i$ is uniform:
$$\Pr_{h \in \mathcal{H}}[h(x) = i] = \frac{1}{m}.$$
This probability is averaged over all functions in the family. Any fixed function deterministically maps $x$ to some bucket, but on average (across the family), all buckets are equally likely.
Preprocessing Phase
Section titled “Preprocessing Phase”-
- Initialize linked lists, one for each bucket
- For each key in our input set: - Compute the bucket index
- Append to the linked list in that bucket
This phase takes $O(n)$ time and creates the hash table.
Query Phase
When a query $q$ arrives, we want to know if $q \in S$:
- Compute the bucket index $h(q)$.
- Search through the linked list in bucket $h(q)$.
- Return “yes” if $q$ is found, “no” otherwise.
The beauty of this approach is that if $q$ was in our original set, it must hash to the same bucket by definition (since the hash function is deterministic). So we only need to search that specific bucket, rather than comparing against all keys.
Note: The query time depends on how many collisions occur in the bucket we search. With the right choice of $m$ and a good hash function, we can keep the expected number of collisions small, leading to fast queries.
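The two phases above can be sketched as follows (an illustrative implementation; the class and method names are ours, and Python's built-in `hash` applied to a salted tuple merely stands in for drawing $h$ from a perfectly random hash family):

```python
import random

# Illustrative sketch of hashing with chaining (names are ours, not the
# lecture's). A random salt plus Python's built-in hash stands in for
# picking a random function h from a hash family.

class ChainedHashTable:
    def __init__(self, keys, m):
        self.m = m
        self.salt = random.randrange(2 ** 31)  # "draw h from the family"
        self.buckets = [[] for _ in range(m)]  # m empty chains
        for x in keys:                         # O(n) preprocessing
            self.buckets[self._h(x)].append(x)

    def _h(self, x):
        """Bucket index in {0, ..., m-1}; deterministic once salt is fixed."""
        return hash((self.salt, x)) % self.m

    def query(self, q):
        """Return True iff q is among the stored keys: search one chain."""
        return q in self.buckets[self._h(q)]

table = ChainedHashTable([5, 22, 17, 9], m=8)
assert table.query(22) and table.query(5)
assert not table.query(4)
```

Because `_h` is deterministic for a fixed salt, a stored key always lands in the same chain it was inserted into, which is exactly why searching a single bucket suffices.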
Example
Suppose we want to support membership queries on the set
$$S = \{5, 22\}$$
using a hash table of size $m = 5$ with hash function
$$h(x) = x \bmod 5.$$
Each table entry stores a chain of elements that hash to the same index.
| Index | Stored Keys |
|---|---|
| 0 | 5 |
| 1 | (empty) |
| 2 | 22 |
| 3 | (empty) |
| 4 | (empty) |
To answer a membership query for a key $q$, we compute $h(q)$ and search only the corresponding chain.
Example Queries
- Query $q = 22$: $h(22) = 22 \bmod 5 = 2$. Searching the chain at index 2 finds 22, so $22 \in S$.
- Query $q = 14$: $h(14) = 14 \bmod 5 = 4$. The chain at index 4 is empty, so $14 \notin S$.
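The worked example can be replayed in code (assuming, as the table suggests, the stored set is $S = \{5, 22\}$ with $m = 5$ and $h(x) = x \bmod 5$; the query value 14 is just one key that hashes to the empty bucket 4):

```python
# Rebuild the example table: m = 5 buckets, h(x) = x mod 5 (assumed above).
m = 5
S = [5, 22]
chains = [[] for _ in range(m)]
for x in S:
    chains[x % m].append(x)

assert chains[0] == [5]      # index 0 stores 5
assert chains[2] == [22]     # index 2 stores 22
assert 22 in chains[22 % m]  # query 22: found in the chain at index 2
assert chains[14 % m] == []  # query 14: chain at index 4 is empty
```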