With a finite amount of memory, we want to process an infinite stream of data:
$$x_1, x_2, \ldots, x_n$$
We cannot store the entire stream. Instead, we want to answer queries using as little space as possible.
Goal: Given a sample size $S$, we want to store $S$ keys from the stream such that at any time $t$, our sample contains $S$ keys chosen uniformly at random from $x_1, \ldots, x_t$.
Note: Another way to think of $t$ is as the number of keys we've seen so far in the stream.
Example: $S = 2$
Stream: $5, 8, 2, 3, 1, 4, 9, 10, 6, \ldots$ (at $t = 9$)
$$P(8 \in \text{Sample}) = \frac{2}{9}$$
Setup: Given the sample size $S$, we begin by storing the first $S$ keys from the stream: $x_1, \ldots, x_S$.
When a new key $x_t$ arrives (with $t > S$):
- Select it with probability $\frac{S}{t}$.
- If it is not selected, the sample stays the same.
- If it is selected, add $x_t$ to the sample and remove one of the $S$ existing keys uniformly at random.
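The update rule above can be sketched in Python. This is a minimal sketch, not code from the notes; the function name and stream representation are illustrative:

```python
import random

def reservoir_sample(stream, S, rng=None):
    """Maintain a uniform random sample of S keys from a stream.

    The first S keys are stored directly; afterwards the t-th key
    (1-indexed) is selected with probability S/t and, if selected,
    replaces a uniformly chosen key in the sample.
    """
    rng = rng or random.Random()
    sample = []
    for t, x in enumerate(stream, start=1):
        if t <= S:
            sample.append(x)                 # store the first S keys
        elif rng.random() < S / t:           # select with probability S/t
            sample[rng.randrange(S)] = x     # evict a uniform victim
    return sample
```

Running this repeatedly on the example stream $5, 8, 2, 3, 1, 4, 9, 10, 6$ with $S = 2$, each key should appear in the final sample with empirical frequency close to $\frac{2}{9}$.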
Example with $S = 2$:
Stream: $5, 8, 2, 3, 1, 4, 9, 10, 6, \ldots$
We maintain a sample of size 2 as we process the stream. For each new element at time $t$, we decide whether to include it with probability $\frac{S}{t} = \frac{2}{t}$. If we include it, we randomly evict one of the current sample elements to maintain the sample size.
| Time $t$ | Stream item | Sample before | Selection probability | Decision | Sample after |
|---|---|---|---|---|---|
| 1 | 5 | $\emptyset$ | - | Select with probability 1 | {5} |
| 2 | 8 | {5} | - | Select with probability 1 | {5, 8} |
| 3 | 2 | {5, 8} | $\frac{2}{3}$ | Select; evict 5 w.p. $\frac{1}{2}$ | {2, 8} |
| 4 | 3 | {2, 8} | $\frac{2}{4} = \frac{1}{2}$ | Don't select (w.p. $\frac{1}{2}$) | {2, 8} |
| 5 | 1 | {2, 8} | $\frac{2}{5}$ | - | - |
Conditional Probability Refresher
The conditional probability of $A$ given $B$ (the probability of $A$ occurring given that $B$ has occurred) is:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

The law of total probability states:

$$P(A) = P(A \mid B) \cdot P(B) + P(A \mid B^c) \cdot P(B^c)$$

Note that $P(A \mid B)$ and $P(B \mid A)$ condition in opposite directions; Bayes' theorem relates the two.
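The law of total probability can be checked by exact enumeration. The die-roll events below are a hypothetical example chosen for this refresher, not from the notes:

```python
from fractions import Fraction

# Hypothetical example: one fair die roll.
# A = "the roll is even", B = "the roll is at most 3".
outcomes = [1, 2, 3, 4, 5, 6]

def P(event):
    """Probability of an event (a predicate) under a uniform roll."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def given(a, b):
    """Conditional probability P(a | b) = P(a and b) / P(b)."""
    return P(lambda o: a(o) and b(o)) / P(b)

A = lambda o: o % 2 == 0
B = lambda o: o <= 3
Bc = lambda o: not B(o)

# P(A) = P(A|B) P(B) + P(A|B^c) P(B^c) holds exactly:
assert given(A, B) * P(B) + given(A, Bc) * P(Bc) == P(A) == Fraction(1, 2)
```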
Theorem: At any time $t$, the sampling algorithm maintains a sample of $S$ keys chosen uniformly at random from $x_1, \ldots, x_t$.
Proof: By induction on $t$ (the number of keys seen so far).
Base case ($t = S + 1$): We have seen the first $S + 1$ keys: $x_1, \ldots, x_S, x_{S+1}$.
Let $A(t)$ denote our sample after processing $x_t$. Initially, $A(S) = \{x_1, \ldots, x_S\}$.
For $x_{S+1}$, the algorithm selects it with probability exactly

$$P(x_{S+1} \in A(S+1)) = \frac{S}{S+1}\ \ \checkmark$$
For $x_j$ where $j \in \{1, \ldots, S\}$:

$$\begin{aligned}
P(x_j \in A(S+1)) &= 1 - P(x_j \notin A(S+1)) \\
&= 1 - P(x_{S+1} \text{ selected and evicted } x_j) \\
&= 1 - \frac{S}{S+1} \cdot \frac{1}{S} \\
&= 1 - \frac{1}{S+1} = \frac{S}{S+1}\ \ \checkmark
\end{aligned}$$
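The base-case arithmetic can be verified exactly over an illustrative range of sample sizes $S$ (the range is an assumption; the identity holds for any $S \ge 1$):

```python
from fractions import Fraction

# Exact check of the base case for several sample sizes S.
for S in range(1, 50):
    # P(x_{S+1} is selected AND it evicts x_j) = S/(S+1) * 1/S
    p_evicted = Fraction(S, S + 1) * Fraction(1, S)
    assert 1 - p_evicted == Fraction(S, S + 1)
```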
Inductive Hypothesis: At time $t = k$, all keys $x_1, \ldots, x_k$ have equal probability $\frac{S}{k}$ of being in $A(k)$.
To Prove: At time $t = k + 1$, all keys $x_1, \ldots, x_{k+1}$ have equal probability $\frac{S}{k+1}$ of being in $A(k+1)$.
For $x_{k+1}$, the algorithm selects it with probability $\frac{S}{k+1}$ by construction:

$$P(x_{k+1} \in A(k+1)) = \frac{S}{k+1}\ \ \checkmark$$
For $x_j$ where $j < k + 1$, by the law of total probability:

$$\begin{aligned}
P(x_j \in A(k+1)) &= P(x_j \in A(k+1) \mid x_j \in A(k)) \cdot P(x_j \in A(k)) \\
&\quad + P(x_j \in A(k+1) \mid x_j \notin A(k)) \cdot P(x_j \notin A(k))
\end{aligned}$$
The second term is 0: if $x_j$ is not in the sample at time $k$, it has already been evicted and can never re-enter, so it cannot be in $A(k+1)$.
Thus:
$$P(x_j \in A(k+1)) = P(x_j \in A(k+1) \mid x_j \in A(k)) \cdot P(x_j \in A(k))$$

where, by our inductive hypothesis, $P(x_j \in A(k)) = \frac{S}{k}$.
Therefore, $x_j$ survives step $k+1$ (given it was in the sample) unless the new key is selected and evicts it:

$$\begin{aligned}
P(x_j \in A(k+1) \mid x_j \in A(k)) &= 1 - \underbrace{\frac{S}{k+1}}_{\text{prob. new key is selected}} \cdot \underbrace{\frac{1}{S}}_{\text{prob. it kicks out } x_j} \\
&= 1 - \frac{1}{k+1} = \frac{k}{k+1}
\end{aligned}$$

and so

$$P(x_j \in A(k+1)) = \frac{k}{k+1} \cdot \frac{S}{k} = \frac{S}{k+1}\ \ \checkmark$$
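The inductive-step identity (survival probability times the inductive hypothesis yielding $\frac{S}{k+1}$) can likewise be verified exactly over a small, assumed grid of $S$ and $k$ values:

```python
from fractions import Fraction

# Check: P(survive step k+1 | in A(k)) * P(in A(k)) == S/(k+1), exactly.
for S in range(1, 10):
    for k in range(S, 40):
        survive = 1 - Fraction(S, k + 1) * Fraction(1, S)   # = k/(k+1)
        assert survive * Fraction(S, k) == Fraction(S, k + 1)
```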
By induction, the algorithm maintains a uniformly random sample of size $S$ at all times. $\blacksquare$