
Lecture 3 on 02/02/2026 - Independence, Geometric Random Variables, and Hashing

Scribes: Joshua Sin and Ye Htut Muang

  • Geometric random variables: what they are and why they are useful
  • The coupon collector problem: its statement and how to solve it using geometric random variables
  • The membership/dictionary problem: its statement and how to solve it using hashing with chaining

$X$ and $Y$ are independent if

$$\forall x, y : \mathbb{P}(X = x, Y = y) = \mathbb{P}(X = x)\,\mathbb{P}(Y = y)$$

Here $\mathbb{P}(X = x, Y = y)$ denotes the joint distribution of $X$ and $Y$. Independence means that the value of one random variable gives no information about the other.

For a sequence of random variables $\{X_1, \dots, X_n\}$, there are two notions of independence:

  • Pairwise independence
  • Mutual independence

Pairwise Independence / 2-wise Independence

For all $1 \le i \neq j \le n$, $X_i$ and $X_j$ are independent. Knowing $X_i$ alone gives no information about $X_j$. However, knowing $X_i$ and $X_j$ together may still reveal information about some other $X_k$.

Mutual Independence

Knowing all variables except $X_j$ (that is, knowing $X_1, X_2, \dots, X_{j-1}, X_{j+1}, \dots, X_n$) gives no information about $X_j$. This type of independence is very rare: in computer science, mutually independent random variables require perfect randomness, which is essentially impossible to obtain in practice.
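A standard illustration of the gap between the two notions (not from the lecture, but a classic example): take two independent fair bits and their XOR. Any two of the three bits are independent, yet the first two together determine the third exactly. The sketch below verifies this by exhaustive enumeration.

```python
from itertools import product

# Sample space: two independent fair bits (x1, x2); define x3 = x1 XOR x2.
# Each of the 4 outcomes (x1, x2) is equally likely.
outcomes = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

def prob(event):
    """Probability of an event (a predicate on an outcome) under the uniform measure."""
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

# Pairwise independence: for every pair of variables and every pair of values,
# the joint probability factors into the product of marginals (1/2 * 1/2 = 1/4).
for i, j in [(0, 1), (0, 2), (1, 2)]:
    for a, b in product([0, 1], repeat=2):
        joint = prob(lambda o: o[i] == a and o[j] == b)
        assert joint == prob(lambda o: o[i] == a) * prob(lambda o: o[j] == b)

# Not mutually independent: X1 = X2 = 0 forces X3 = 0,
# so the triple probability is 1/4, not 1/2 * 1/2 * 1/2 = 1/8.
assert prob(lambda o: o[0] == 0 and o[1] == 0 and o[2] == 0) == 1 / 4
```

Each $X_i$ is uniform on $\{0,1\}$ and any pair is independent, but the three together are not.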

Linearity of Variance for a Sequence of Random Variables

For any two independent random variables $X$ and $Y$:

$$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$$

This is different from linearity of expectation, which holds even without independence. For variances, we do need independence—but it turns out we only need pairwise independence, not the stronger mutual independence. This is a useful fact because pairwise independence is much easier to achieve in practice.

Suppose $X_1, \dots, X_n$ are pairwise independent:

$$\mathrm{Var}[X_1 + \dots + X_n] = \mathrm{Var}[X_1] + \dots + \mathrm{Var}[X_n]$$

As an application of linearity of variance, let $X \sim \text{Bin}(n, p)$ be a sum of independent indicators $X_i \sim \text{Bern}(p)$:

$$X = X_1 + \dots + X_n, \qquad \mathrm{Var}[X] = \sum_{i=1}^{n} \mathrm{Var}[X_i] = \sum_{i=1}^{n} p(1-p) = np(1-p)$$
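The identity $\mathrm{Var}[X] = np(1-p)$ can be checked empirically. The following sketch (illustrative only; the seed, $n$, and $p$ are arbitrary choices) simulates a binomial as a sum of Bernoulli trials and compares the sample variance to the formula.

```python
import random

random.seed(0)
n, p, trials = 50, 0.3, 200_000

# Each sample is X = X_1 + ... + X_n with X_i ~ Bern(p).
samples = [sum(1 for _ in range(n) if random.random() < p) for _ in range(trials)]

mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials

print("empirical variance:", var)
print("np(1-p):", n * p * (1 - p))  # 10.5 for these parameters
```

The empirical variance should land very close to $np(1-p) = 10.5$.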

Geometric Random Variables

Say we have a coin that turns up heads with probability $p$. Let $X$ be the number of tosses until we first get heads, so $X$ takes values $1, 2, 3, \dots$. Then $X \sim \text{Geometric}(p)$.

$$\mathbb{P}(X = j) = (1-p)^{j-1}\,p \quad \text{for } j \ge 1$$

This tells us: to get heads on exactly the jj-th toss, we must get tails on the first j1j-1 tosses (each with probability 1p1-p) and then heads on the jj-th toss (with probability pp).

Computing the expectation from the standard definition $\mathbb{E}[X] = \sum_{j=1}^{\infty} j \cdot \mathbb{P}(X = j)$ would require tedious algebraic manipulation. Instead, there is a much more useful alternative formula (valid for any non-negative integer-valued random variable):

$$\mathbb{E}[X] = \sum_{j=1}^{\infty} \mathbb{P}(X \ge j)$$

Here $\mathbb{P}(X \ge j) = (1-p)^{j-1}$, since $X \ge j$ means the first $j-1$ tosses all came up tails. Summing this geometric series shows that the expectation simplifies to:

Expectation and Variance of Geometric Random Variable

$$\mathbb{E}[X] = \frac{1}{p}$$

Intuitively, this makes sense: if the probability of success on each trial is $p$, then on average we need $\frac{1}{p}$ trials to see a success. For example, with a fair coin ($p = \frac{1}{2}$), we expect to need 2 tosses to see a heads.

$$\mathrm{Var}[X] = \frac{1-p}{p^2}$$
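Both formulas are easy to sanity-check by simulation. This sketch (parameters and seed are arbitrary illustrative choices) tosses a $p$-coin until the first heads and compares the sample mean and variance against $\frac{1}{p}$ and $\frac{1-p}{p^2}$.

```python
import random

random.seed(1)
p, trials = 0.25, 200_000

def geometric(p):
    """Number of coin tosses until the first heads (success probability p)."""
    tosses = 1
    while random.random() >= p:  # each failure has probability 1 - p
        tosses += 1
    return tosses

samples = [geometric(p) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials

print("empirical mean:", mean, "vs 1/p =", 1 / p)                  # ~4
print("empirical variance:", var, "vs (1-p)/p^2 =", (1 - p) / p**2)  # ~12
```

With $p = \frac{1}{4}$, the expected number of tosses is 4 and the variance is 12.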

The Coupon Collector Problem

Let there be 20 unique characters in Pokemon. Pokemon is collaborating with a cereal company, and each box of cereal contains one Pokemon character chosen uniformly at random. How many boxes of cereal do we need to buy, in expectation, to collect all 20 characters?

Let’s start with a simple question: if we already have 19 characters, what is the chance that the next box contains the 20th unique character? The answer is $\frac{1}{20}$, because exactly 1 of the 20 equally likely characters is still unseen.

Now, let’s find the pattern of the probability of each case, as the probability will change every time we pick a new character:

  • If we have 0 characters, there is a $\frac{20}{20}$ (100%) chance that we get a new character.
  • If we have 1 character, there is a $\frac{19}{20}$ chance that we get a new character.
  • If we have 2 characters, there is a $\frac{18}{20}$ chance that we get a new character.
  • $\vdots$
  • If we have 19 characters, there is a $\frac{1}{20}$ chance that we get a new character.

Let $X$ be the number of cereal boxes bought until we collect all 20 Pokemon characters, so $X \ge 20$. We break $X$ down as $X = X_1 + X_2 + \dots + X_{20}$, where $X_i$ is the number of boxes bought after obtaining the $(i-1)$-th distinct character, up to and including the box that yields the $i$-th. Each $X_i$ is a geometric random variable.

  • $X_1$ = boxes bought to get the 1st character; success probability $p_1 = 1$, so $\mathbb{E}[X_1] = 1$
  • $X_2$ = boxes bought after the 1st character to get the 2nd; success probability $p_2 = \frac{19}{20}$, so $\mathbb{E}[X_2] = \frac{20}{19}$
  • $X_3$ = boxes bought after the 2nd character to get the 3rd; success probability $p_3 = \frac{18}{20}$, so $\mathbb{E}[X_3] = \frac{20}{18}$
  • $\vdots$
  • $X_{20}$ = boxes bought after the 19th character to get the 20th; success probability $p_{20} = \frac{1}{20}$, so $\mathbb{E}[X_{20}] = 20$

Now, we find the expectation of the total number of cereal boxes we need to buy to get all 20 Pokemon characters:

$$\mathbb{E}[X] = \mathbb{E}[X_1] + \mathbb{E}[X_2] + \dots + \mathbb{E}[X_{20}] = \frac{20}{20} + \frac{20}{19} + \frac{20}{18} + \dots + \frac{20}{1} = 20 \left(\frac{1}{20} + \frac{1}{19} + \frac{1}{18} + \dots + 1\right)$$

Now comes an important observation. The sum inside the parentheses is a harmonic series. It turns out (and you may recall from calculus) that:

$$1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n} \approx \ln(n)$$

This is not a coincidence: the derivative of $\ln(x)$ is $\frac{1}{x}$, so $\int_1^n \frac{1}{x}\,dx = \ln(n)$. The harmonic sum is just the discrete version of this integral. Therefore:

$$\mathbb{E}[X] \approx 20 \ln(20) \approx 60$$

(The exact value is $20 \left(1 + \frac{1}{2} + \dots + \frac{1}{20}\right) \approx 72$; the $\ln$ approximation drops the Euler–Mascheroni constant $\gamma \approx 0.577$, so it undercounts by roughly $20\gamma \approx 11.5$ boxes.)
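A quick simulation of the cereal-box process makes the expectation concrete (a sketch, not lecture code; the seed and trial count are arbitrary). Note the empirical average lands near the exact value $20 \cdot H_{20} \approx 72$, since $H_n \approx \ln(n) + 0.577$.

```python
import random

random.seed(2)
n, trials = 20, 20_000

def boxes_to_collect_all(n):
    """Buy boxes (each containing a uniformly random character) until all n are seen."""
    seen, boxes = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        boxes += 1
    return boxes

avg = sum(boxes_to_collect_all(n) for _ in range(trials)) / trials
exact = n * sum(1 / k for k in range(1, n + 1))  # n * H_n

print("empirical average:", avg)
print("n * H_n:", exact)  # about 71.95 for n = 20
```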

When we talk about expected runtime, it will be $O(20 \ln(20))$, or $O(n \ln n)$ for $n$ characters in general.

Logarithms appear in computer science for two fundamental reasons. The first is divide-and-conquer algorithms like binary search or merge sort, which repeatedly divide the problem in half. The second is through harmonic series like this one, which naturally arise when analyzing algorithms involving probabilities and geometric random variables.

Generalization of Coupon Collector Problem

With $n$ characters, the success probability for $X_i$ is

$$p_i = \frac{n - (i-1)}{n} = 1 - \frac{i-1}{n}, \qquad X_i \sim \text{Geometric}\left(1 - \frac{i-1}{n}\right), \qquad \mathbb{E}[X_i] = \frac{1}{1 - \frac{i-1}{n}} = \frac{n}{n - i + 1}$$

By the linearity of expectation, the total expected number of boxes E[X]\mathbb{E}[X] is:

$$\mathbb{E}[X] = \sum_{i=1}^{n} \mathbb{E}[X_i] = \sum_{i=1}^{n} \frac{n}{n - i + 1} = n \left( \frac{1}{n} + \frac{1}{n-1} + \dots + \frac{1}{1} \right)$$

Since $1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n} \approx \ln(n)$:

$$\mathbb{E}[X] \approx n \ln(n)$$
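To see how good the $n \ln(n)$ approximation is, we can compare the exact sum $n \cdot H_n$ with $n \ln(n)$ for a few values of $n$ (an illustrative check; the gap is roughly $0.577 \cdot n$, coming from the Euler–Mascheroni constant).

```python
import math

# Exact expectation n * H_n vs. the n * ln(n) approximation.
for n in [10, 100, 1000, 10_000]:
    h_n = sum(1 / k for k in range(1, n + 1))  # harmonic number H_n
    print(n, n * h_n, n * math.log(n))
```

The two columns grow together, with the exact value consistently about $0.577n$ larger.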

The Membership/Dictionary Problem

In the membership/dictionary problem, we have a set $S$ of $n$ keys, $\{x_1, x_2, x_3, \dots, x_n\}$, from a universe $[U] = \{1, 2, 3, \dots, U\}$. The goal is to store $S$ in a data structure such that, given a query element $q \in [U]$, we can quickly determine whether $q \in S$ or not.

Brute Force: We compare $q$ against every key in $\{x_1, x_2, x_3, \dots, x_n\}$. If it exists, we return the key.

  • Runtime: $O(n)$ per query

Binary Search: First we sort $S$, which takes $O(n \log n)$ preprocessing time, and then we answer each query with binary search. If the element exists, we return the key.

  • Runtime: $O(\log n)$ per query

Is it possible to solve this problem with constant query time, $O(1)$?

Hashing with Chaining

To solve the membership problem efficiently, we use a technique called hashing. The idea is to use a hash function to distribute keys across multiple buckets (or cells), reducing the amount of searching needed.

A hash function $h$ maps the universe $[U]$ to a set of $m$ buckets:

$$h: [U] \to \{1, 2, 3, \dots, m\}$$

where $m$ is chosen to be much smaller than $U$.

Strictly speaking, we don’t just have a single hash function. Instead, we work with a hash family, a collection of many different hash functions. We assume this family is perfectly random: if we pick a function from the family uniformly at random, then for any key $x \in [U]$ and any bucket $i \in \{1, 2, 3, \dots, m\}$, the probability that $x$ hashes to bucket $i$ is uniform:

$$\mathbb{P}(h(x) = i) = \frac{1}{m}$$

This probability is over the random choice of function from the family. Any fixed function maps $x$ to one bucket deterministically, but averaged across the family, all buckets are equally likely.

To build the hash table:

  • Initialize $m$ linked lists, one for each bucket
  • For each key $x_i$ in the input set:
    • Compute the bucket index $h(x_i)$
    • Append $x_i$ to the linked list in that bucket

This phase takes O(n)O(n) time and creates the hash table.

When a query $q$ arrives, we want to know if $q \in S$:

  • Compute the bucket index $h(q)$
  • Search through the linked list in bucket $h(q)$
  • Return “yes” if $q$ is found, “no” otherwise

The beauty of this approach is that if $q$ was in our original set, it must hash to the same bucket (since the hash function is deterministic). So we only need to search that specific bucket, rather than comparing $q$ against all $n$ keys.

Note: The query time depends on how many keys collide in the bucket we search. With the right choice of $m$ and a good hash function, the expected number of collisions stays small, giving fast queries.
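The build and query phases above can be sketched in Python (an illustration, not the lecture’s code; the idealized random hash family is simulated here by salting Python’s built-in `hash`, which is an implementation convenience rather than a perfectly random function).

```python
import random

class ChainedHashTable:
    """Hashing with chaining for membership queries: m buckets, each a list (chain)."""

    def __init__(self, keys, m):
        self.m = m
        # A random salt plays the role of choosing one function from the family;
        # once chosen, h is deterministic, as a hash function must be.
        self._salt = random.randrange(2**32)
        self.buckets = [[] for _ in range(m)]
        for x in keys:                       # build phase: O(n) total appends
            self.buckets[self._hash(x)].append(x)

    def _hash(self, x):
        # Deterministic bucket index in {0, ..., m-1}.
        return hash((self._salt, x)) % self.m

    def contains(self, q):
        # Query phase: search only the chain that q hashes to.
        return q in self.buckets[self._hash(q)]

table = ChainedHashTable([12, 7, 22, 17, 5], m=5)
print(table.contains(22))  # True
print(table.contains(9))   # False
```

The expected chain length is $\frac{n}{m}$ under the uniform-hashing assumption, so choosing $m = \Theta(n)$ gives expected $O(1)$ queries.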

Suppose we want to support membership queries on the set

$$S = \{12, 7, 22, 17, 5\}$$

using a hash table of size $m = 5$ with hash function

$$h(x) = x \bmod 5$$

Each table entry stores a chain of elements that hash to the same index.

| Index | Stored Keys |
| --- | --- |
| 0 | 5 |
| 1 | (empty) |
| 2 | $12 \to 7 \to 22 \to 17$ |
| 3 | (empty) |
| 4 | (empty) |

To answer a membership query for a key $q$, we compute $h(q)$ and search only the corresponding chain.

  • Query $q = 22$: $h(22) = 2$. Searching the chain at index 2 finds 22, so $22 \in S$.
  • Query $q = 9$: $h(9) = 4$. The chain at index 4 is empty, so $9 \notin S$.
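This worked example can be reproduced in a few lines (a sketch using plain lists as chains, with the lecture’s $h(x) = x \bmod 5$):

```python
S = [12, 7, 22, 17, 5]
m = 5
buckets = [[] for _ in range(m)]
for x in S:
    buckets[x % m].append(x)  # h(x) = x mod 5

print(buckets)  # [[5], [], [12, 7, 22, 17], [], []]

def member(q):
    # Search only the chain at index h(q).
    return q in buckets[q % m]

print(member(22), member(9))  # True False
```

Every key of $S$ except 5 collides in bucket 2, illustrating why query time depends on chain length, not just on $m$.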