
Lecture 3 on 02/02/2026 - Independence, Geometric Random Variables, and Hashing

Scribes: Joshua Sin and Ye Htut Muang

  • Geometric random variables: what they are and why they are useful
  • The coupon collector problem: its statement and how to solve it using geometric random variables
  • The membership/dictionary problem: its statement and how to solve it using hashing with chaining

$X$ and $Y$ are independent if

$$\forall x, y : \mathbb{P}(X = x, Y = y) = \mathbb{P}(X = x)\,\mathbb{P}(Y = y)$$

Here $\mathbb{P}(X = x, Y = y)$ denotes the joint distribution of $X$ and $Y$. Independence means that the value of one random variable gives no information about the other.

For a sequence of random variables $\{X_1, \dots, X_n\}$, there are two notions of independence:

  • Pairwise independence
  • Mutual independence

Pairwise Independence / 2-wise Independence

For all $1 \le i \neq j \le n$, $X_i$ and $X_j$ are independent. Knowing $X_i$ alone gives no information about $X_j$. However, knowing $X_i$ and $X_j$ together may still reveal information about some other $X_k$.

Mutual Independence

Knowing all variables except $X_j$ (that is, knowing $X_1, X_2, \dots, X_{j-1}, X_{j+1}, \dots, X_n$) gives no information about $X_j$. This type of independence is very rare: in computer science, mutually independent random variables require perfect randomness, which is essentially impossible to obtain in practice.
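A standard illustration of the gap between the two notions (not from the lecture, but a classic example): take two independent fair bits and their XOR. Any two of the three bits are independent, yet the first two together determine the third exactly. The sketch below verifies this by exhaustive enumeration.

```python
from itertools import product

# Sample space: two independent fair bits (x1, x2); define x3 = x1 XOR x2.
# Each of the 4 outcomes (x1, x2) is equally likely.
outcomes = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

def prob(event):
    """Probability of an event (a predicate on an outcome) under the uniform measure."""
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

# Pairwise independence: for every pair of variables and every pair of values,
# the joint probability factors into the product of marginals (1/2 * 1/2 = 1/4).
for i, j in [(0, 1), (0, 2), (1, 2)]:
    for a, b in product([0, 1], repeat=2):
        joint = prob(lambda o: o[i] == a and o[j] == b)
        assert joint == prob(lambda o: o[i] == a) * prob(lambda o: o[j] == b)

# Not mutually independent: X1 = X2 = 0 forces X3 = 0,
# so the triple probability is 1/4, not 1/2 * 1/2 * 1/2 = 1/8.
assert prob(lambda o: o[0] == 0 and o[1] == 0 and o[2] == 0) == 1 / 4
```

Each $X_i$ is uniform on $\{0,1\}$ and any pair is independent, but the three together are not.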

Linearity of Variance for a Sequence of Random Variables

For any two independent random variables $X$ and $Y$:

$$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$$

This is different from linearity of expectation, which holds even without independence. For variances, we do need independence—but it turns out we only need pairwise independence, not the stronger mutual independence. This is a useful fact because pairwise independence is much easier to achieve in practice.

Suppose $X_1, \dots, X_n$ are pairwise independent:

$$\mathrm{Var}[X_1 + \dots + X_n] = \mathrm{Var}[X_1] + \dots + \mathrm{Var}[X_n]$$

As an application of linearity of variance, let $X \sim \text{Bin}(n, p)$ be a sum of independent indicators $X_i \sim \text{Bern}(p)$:

$$X = X_1 + \dots + X_n, \qquad \mathrm{Var}[X] = \sum_{i=1}^{n} \mathrm{Var}[X_i] = \sum_{i=1}^{n} p(1-p) = np(1-p)$$
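The identity $\mathrm{Var}[X] = np(1-p)$ can be checked empirically. The following sketch (illustrative only; the seed, $n$, and $p$ are arbitrary choices) simulates a binomial as a sum of Bernoulli trials and compares the sample variance to the formula.

```python
import random

random.seed(0)
n, p, trials = 50, 0.3, 200_000

# Each sample is X = X_1 + ... + X_n with X_i ~ Bern(p).
samples = [sum(1 for _ in range(n) if random.random() < p) for _ in range(trials)]

mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials

print("empirical variance:", var)
print("np(1-p):", n * p * (1 - p))  # 10.5 for these parameters
```

The empirical variance should land very close to $np(1-p) = 10.5$.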

Geometric Random Variables

Say we have a coin that turns up heads with probability $p$. Let $X$ be the number of tosses until we first get heads, so $X$ takes values $1, 2, 3, \dots$. Then $X \sim \text{Geometric}(p)$.

$$\mathbb{P}(X = j) = (1-p)^{j-1}\,p \quad \text{for } j \ge 1$$

This tells us: to get heads on exactly the jj-th toss, we must get tails on the first j1j-1 tosses (each with probability 1p1-p) and then heads on the jj-th toss (with probability pp).

Computing the expectation from the standard definition $\mathbb{E}[X] = \sum_{j=1}^{\infty} j \cdot \mathbb{P}(X = j)$ would require tedious algebraic manipulation. Instead, there is a much more useful alternative formula (valid for any non-negative integer-valued random variable):

$$\mathbb{E}[X] = \sum_{j=1}^{\infty} \mathbb{P}(X \ge j)$$

Here $\mathbb{P}(X \ge j) = (1-p)^{j-1}$, since $X \ge j$ means the first $j-1$ tosses all came up tails. Summing this geometric series shows that the expectation simplifies to:

Expectation and Variance of Geometric Random Variable

$$\mathbb{E}[X] = \frac{1}{p}$$

Intuitively, this makes sense: if the probability of success on each trial is $p$, then on average we need $\frac{1}{p}$ trials to see a success. For example, with a fair coin ($p = \frac{1}{2}$), we expect to need 2 tosses to see a heads.

$$\mathrm{Var}[X] = \frac{1-p}{p^2}$$
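Both formulas are easy to sanity-check by simulation. This sketch (parameters and seed are arbitrary illustrative choices) tosses a $p$-coin until the first heads and compares the sample mean and variance against $\frac{1}{p}$ and $\frac{1-p}{p^2}$.

```python
import random

random.seed(1)
p, trials = 0.25, 200_000

def geometric(p):
    """Number of coin tosses until the first heads (success probability p)."""
    tosses = 1
    while random.random() >= p:  # each failure has probability 1 - p
        tosses += 1
    return tosses

samples = [geometric(p) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials

print("empirical mean:", mean, "vs 1/p =", 1 / p)                  # ~4
print("empirical variance:", var, "vs (1-p)/p^2 =", (1 - p) / p**2)  # ~12
```

With $p = \frac{1}{4}$, the expected number of tosses is 4 and the variance is 12.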

The Coupon Collector Problem

Let there be 20 unique characters in Pokemon. Pokemon is collaborating with a cereal company, and each box of cereal contains one Pokemon character chosen uniformly at random. How many boxes of cereal do we need to buy, in expectation, to collect all 20 characters?

Let’s start with a simple question: if we already have 19 characters, what is the chance that the next box contains the 20th unique character? The answer is $\frac{1}{20}$, because exactly 1 of the 20 equally likely characters is still unseen.

Now, let’s find the pattern of the probability of each case, as the probability will change every time we pick a new character:

  • If we have 0 characters, there is a $\frac{20}{20}$ (100%) chance that we get a new character.
  • If we have 1 character, there is a $\frac{19}{20}$ chance that we get a new character.
  • If we have 2 characters, there is a $\frac{18}{20}$ chance that we get a new character.
  • $\vdots$
  • If we have 19 characters, there is a $\frac{1}{20}$ chance that we get a new character.

Let $X$ be the number of cereal boxes bought until we collect all 20 Pokemon characters, so $X \ge 20$. We break $X$ down as $X = X_1 + X_2 + \dots + X_{20}$, where $X_i$ is the number of boxes bought after obtaining the $(i-1)$-th distinct character, up to and including the box that yields the $i$-th. Each $X_i$ is a geometric random variable.

  • $X_1$ = boxes bought to get the 1st character; success probability $p_1 = 1$, so $\mathbb{E}[X_1] = 1$
  • $X_2$ = boxes bought after the 1st character to get the 2nd; success probability $p_2 = \frac{19}{20}$, so $\mathbb{E}[X_2] = \frac{20}{19}$
  • $X_3$ = boxes bought after the 2nd character to get the 3rd; success probability $p_3 = \frac{18}{20}$, so $\mathbb{E}[X_3] = \frac{20}{18}$
  • $\vdots$
  • $X_{20}$ = boxes bought after the 19th character to get the 20th; success probability $p_{20} = \frac{1}{20}$, so $\mathbb{E}[X_{20}] = 20$

Now, we find the expectation of the total number of cereal boxes we need to buy to get all 20 Pokemon characters:

$$\mathbb{E}[X] = \mathbb{E}[X_1] + \mathbb{E}[X_2] + \dots + \mathbb{E}[X_{20}] = \frac{20}{20} + \frac{20}{19} + \frac{20}{18} + \dots + \frac{20}{1} = 20 \left(\frac{1}{20} + \frac{1}{19} + \frac{1}{18} + \dots + 1\right)$$

Now comes an important observation. The sum inside the parentheses is a harmonic series. It turns out (and you may recall from calculus) that:

$$1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n} \approx \ln(n)$$

This is not a coincidence: the derivative of $\ln(x)$ is $\frac{1}{x}$, so $\int_1^n \frac{1}{x}\,dx = \ln(n)$. The harmonic sum is just the discrete version of this integral. Therefore:

$$\mathbb{E}[X] \approx 20 \ln(20) \approx 60$$

(The exact value is $20 \left(1 + \frac{1}{2} + \dots + \frac{1}{20}\right) \approx 72$; the $\ln$ approximation drops the Euler–Mascheroni constant $\gamma \approx 0.577$, so it undercounts by roughly $20\gamma \approx 11.5$ boxes.)
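A quick simulation of the cereal-box process makes the expectation concrete (a sketch, not lecture code; the seed and trial count are arbitrary). Note the empirical average lands near the exact value $20 \cdot H_{20} \approx 72$, since $H_n \approx \ln(n) + 0.577$.

```python
import random

random.seed(2)
n, trials = 20, 20_000

def boxes_to_collect_all(n):
    """Buy boxes (each containing a uniformly random character) until all n are seen."""
    seen, boxes = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        boxes += 1
    return boxes

avg = sum(boxes_to_collect_all(n) for _ in range(trials)) / trials
exact = n * sum(1 / k for k in range(1, n + 1))  # n * H_n

print("empirical average:", avg)
print("n * H_n:", exact)  # about 71.95 for n = 20
```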

When we talk about expected runtime, it will be $O(20 \ln(20))$, or $O(n \ln n)$ for $n$ characters in general.

Logarithms appear in computer science for two fundamental reasons. The first is divide-and-conquer algorithms like binary search or merge sort, which repeatedly divide the problem in half. The second is through harmonic series like this one, which naturally arise when analyzing algorithms involving probabilities and geometric random variables.

Generalization of Coupon Collector Problem

With $n$ characters, the success probability for $X_i$ is

$$p_i = \frac{n - (i-1)}{n} = 1 - \frac{i-1}{n}, \qquad X_i \sim \text{Geometric}\left(1 - \frac{i-1}{n}\right), \qquad \mathbb{E}[X_i] = \frac{1}{1 - \frac{i-1}{n}} = \frac{n}{n - i + 1}$$

By the linearity of expectation, the total expected number of boxes E[X]\mathbb{E}[X] is:

$$\mathbb{E}[X] = \sum_{i=1}^{n} \mathbb{E}[X_i] = \sum_{i=1}^{n} \frac{n}{n - i + 1} = n \left( \frac{1}{n} + \frac{1}{n-1} + \dots + \frac{1}{1} \right)$$

Since $1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n} \approx \ln(n)$:

$$\mathbb{E}[X] \approx n \ln(n)$$
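To see how good the $n \ln(n)$ approximation is, we can compare the exact sum $n \cdot H_n$ with $n \ln(n)$ for a few values of $n$ (an illustrative check; the gap is roughly $0.577 \cdot n$, coming from the Euler–Mascheroni constant).

```python
import math

# Exact expectation n * H_n vs. the n * ln(n) approximation.
for n in [10, 100, 1000, 10_000]:
    h_n = sum(1 / k for k in range(1, n + 1))  # harmonic number H_n
    print(n, n * h_n, n * math.log(n))
```

The two columns grow together, with the exact value consistently about $0.577n$ larger.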

The Membership/Dictionary Problem

In the membership/dictionary problem, we have a set $S$ of $n$ keys, $\{x_1, x_2, x_3, \dots, x_n\}$, from a universe $[U] = \{1, 2, 3, \dots, U\}$. The goal is to store $S$ in a data structure such that, given a query element $q \in [U]$, we can quickly determine whether $q \in S$ or not.

Brute Force: We compare $q$ against every key in $\{x_1, x_2, x_3, \dots, x_n\}$. If it exists, we return the key.

  • Runtime: $O(n)$ per query

Binary Search: First we sort $S$, which takes $O(n \log n)$ preprocessing time, and then we answer each query with binary search. If the element exists, we return the key.

  • Runtime: $O(\log n)$ per query

Is it possible to solve this problem with constant query time, $O(1)$?

Hashing with Chaining

To solve the membership problem efficiently, we use a technique called hashing. The idea is to use a hash function to distribute keys across multiple buckets (or cells), reducing the amount of searching needed.

A hash function $h$ maps the universe $[U]$ to a set of $m$ buckets:

$$h: [U] \to \{1, 2, 3, \dots, m\}$$

where $m$ is chosen to be much smaller than $U$.

Strictly speaking, we don’t just have a single hash function. Instead, we work with a hash family, a collection of many different hash functions. We assume this family is perfectly random: if we pick a function from the family uniformly at random, then for any key $x \in [U]$ and any bucket $i \in \{1, 2, 3, \dots, m\}$, the probability that $x$ hashes to bucket $i$ is uniform:

$$\mathbb{P}(h(x) = i) = \frac{1}{m}$$

This probability is over the random choice of function from the family. Any fixed function maps $x$ to one bucket deterministically, but averaged across the family, all buckets are equally likely.

To build the hash table:

  • Initialize $m$ linked lists, one for each bucket
  • For each key $x_i$ in the input set:
    • Compute the bucket index $h(x_i)$
    • Append $x_i$ to the linked list in that bucket

This phase takes O(n)O(n) time and creates the hash table.

When a query $q$ arrives, we want to know if $q \in S$:

  • Compute the bucket index $h(q)$
  • Search through the linked list in bucket $h(q)$
  • Return “yes” if $q$ is found, “no” otherwise

The beauty of this approach is that if $q$ was in our original set, it must hash to the same bucket (since the hash function is deterministic). So we only need to search that specific bucket, rather than comparing $q$ against all $n$ keys.

Note: The query time depends on how many keys collide in the bucket we search. With the right choice of $m$ and a good hash function, the expected number of collisions stays small, giving fast queries.
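The build and query phases above can be sketched in Python (an illustration, not the lecture’s code; the idealized random hash family is simulated here by salting Python’s built-in `hash`, which is an implementation convenience rather than a perfectly random function).

```python
import random

class ChainedHashTable:
    """Hashing with chaining for membership queries: m buckets, each a list (chain)."""

    def __init__(self, keys, m):
        self.m = m
        # A random salt plays the role of choosing one function from the family;
        # once chosen, h is deterministic, as a hash function must be.
        self._salt = random.randrange(2**32)
        self.buckets = [[] for _ in range(m)]
        for x in keys:                       # build phase: O(n) total appends
            self.buckets[self._hash(x)].append(x)

    def _hash(self, x):
        # Deterministic bucket index in {0, ..., m-1}.
        return hash((self._salt, x)) % self.m

    def contains(self, q):
        # Query phase: search only the chain that q hashes to.
        return q in self.buckets[self._hash(q)]

table = ChainedHashTable([12, 7, 22, 17, 5], m=5)
print(table.contains(22))  # True
print(table.contains(9))   # False
```

The expected chain length is $\frac{n}{m}$ under the uniform-hashing assumption, so choosing $m = \Theta(n)$ gives expected $O(1)$ queries.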

Suppose we want to support membership queries on the set

$$S = \{12, 7, 22, 17, 5\}$$

using a hash table of size $m = 5$ with hash function

$$h(x) = x \bmod 5$$

Each table entry stores a chain of elements that hash to the same index.

| Index | Stored Keys |
| --- | --- |
| 0 | 5 |
| 1 | (empty) |
| 2 | $12 \to 7 \to 22 \to 17$ |
| 3 | (empty) |
| 4 | (empty) |

To answer a membership query for a key $q$, we compute $h(q)$ and search only the corresponding chain.

  • Query $q = 22$: $h(22) = 2$. Searching the chain at index 2 finds 22, so $22 \in S$.
  • Query $q = 9$: $h(9) = 4$. The chain at index 4 is empty, so $9 \notin S$.
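This worked example can be reproduced in a few lines (a sketch using plain lists as chains, with the lecture’s $h(x) = x \bmod 5$):

```python
S = [12, 7, 22, 17, 5]
m = 5
buckets = [[] for _ in range(m)]
for x in S:
    buckets[x % m].append(x)  # h(x) = x mod 5

print(buckets)  # [[5], [], [12, 7, 22, 17], [], []]

def member(q):
    # Search only the chain at index h(q).
    return q in buckets[q % m]

print(member(22), member(9))  # True False
```

Every key of $S$ except 5 collides in bucket 2, illustrating why query time depends on chain length, not just on $m$.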