With a finite amount of memory, we want to process an infinite stream of data:
$$x_1, x_2, \ldots, x_n$$
We cannot store the entire stream. Instead, we want to answer queries using as little space as possible.
Goal: Given a sample size $S$, we want to store $S$ keys from the stream such that at any time $t$, our sample contains $S$ keys chosen uniformly at random from $x_1, \ldots, x_t$.
Note: Another way to think of $t$ is as the number of keys we've seen so far in the stream.
Example: $S = 2$
Stream: $5, 8, 2, 3, 1, 4, 9, 10, 6, \ldots$ (at $t = 9$)
$$P(8 \in \text{Sample}) = \frac{2}{9}$$
Setup: Given the sample size $S$, we begin by storing the first $S$ keys from the stream: $x_1, \ldots, x_S$.
When a new key $x_t$ arrives (with $t > S$):
- Select it with probability $\frac{S}{t}$.
- If it is not selected, the sample stays the same.
- If it is selected, add $x_t$ to the sample and remove one of the $S$ existing keys uniformly at random.
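The update rule above can be sketched in Python. This is a minimal sketch, not code from the notes; the function name and stream representation are illustrative:

```python
import random

def reservoir_sample(stream, S, rng=None):
    """Maintain a uniform random sample of S keys from a stream.

    The first S keys are stored directly; afterwards the t-th key
    (1-indexed) is selected with probability S/t and, if selected,
    replaces a uniformly chosen key in the sample.
    """
    rng = rng or random.Random()
    sample = []
    for t, x in enumerate(stream, start=1):
        if t <= S:
            sample.append(x)                 # store the first S keys
        elif rng.random() < S / t:           # select with probability S/t
            sample[rng.randrange(S)] = x     # evict a uniform victim
    return sample
```

Running this repeatedly on the example stream $5, 8, 2, 3, 1, 4, 9, 10, 6$ with $S = 2$, each key should appear in the final sample with empirical frequency close to $\frac{2}{9}$.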
Example with $S = 2$:
Stream: $5, 8, 2, 3, 1, 4, 9, 10, 6, \ldots$
We maintain a sample of size 2 as we process the stream. For each new element at time $t$, we decide whether to include it with probability $\frac{S}{t} = \frac{2}{t}$. If we include it, we randomly evict one of the current sample elements to maintain the sample size.
| Time $t$ | Stream item | Sample before | Selection probability | Decision | Sample after |
|---|---|---|---|---|---|
| 1 | 5 | $\emptyset$ | - | Select with probability 1 | {5} |
| 2 | 8 | {5} | - | Select with probability 1 | {5, 8} |
| 3 | 2 | {5, 8} | $\frac{2}{3}$ | Select; evict 5 w.p. $\frac{1}{2}$ | {2, 8} |
| 4 | 3 | {2, 8} | $\frac{2}{4} = \frac{1}{2}$ | Don't select (w.p. $\frac{1}{2}$) | {2, 8} |
| 5 | 1 | {2, 8} | $\frac{2}{5}$ | - | - |
Conditional Probability Refresher
The conditional probability of $A$ given $B$ (the probability of $A$ occurring given that $B$ has occurred) is:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

The law of total probability states:

$$P(A) = P(A \mid B) \cdot P(B) + P(A \mid B^c) \cdot P(B^c)$$

Note that $P(A \mid B)$ and $P(B \mid A)$ condition in opposite directions; Bayes' theorem relates the two.
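The law of total probability can be checked by exact enumeration. The die-roll events below are a hypothetical example chosen for this refresher, not from the notes:

```python
from fractions import Fraction

# Hypothetical example: one fair die roll.
# A = "the roll is even", B = "the roll is at most 3".
outcomes = [1, 2, 3, 4, 5, 6]

def P(event):
    """Probability of an event (a predicate) under a uniform roll."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def given(a, b):
    """Conditional probability P(a | b) = P(a and b) / P(b)."""
    return P(lambda o: a(o) and b(o)) / P(b)

A = lambda o: o % 2 == 0
B = lambda o: o <= 3
Bc = lambda o: not B(o)

# P(A) = P(A|B) P(B) + P(A|B^c) P(B^c) holds exactly:
assert given(A, B) * P(B) + given(A, Bc) * P(Bc) == P(A) == Fraction(1, 2)
```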
Theorem: At any time $t$, the sampling algorithm maintains a sample of $S$ keys chosen uniformly at random from $x_1, \ldots, x_t$.
Proof: By induction on $t$ (the number of keys seen so far).
Base case ($t = S + 1$): We have seen the first $S + 1$ keys: $x_1, \ldots, x_S, x_{S+1}$.
Let $A(t)$ denote our sample after processing $x_t$. Initially, $A(S) = \{x_1, \ldots, x_S\}$.
For $x_{S+1}$, the algorithm selects it with probability exactly

$$P(x_{S+1} \in A(S+1)) = \frac{S}{S+1}\ \ \checkmark$$
For $x_j$ where $j \in \{1, \ldots, S\}$:

$$\begin{aligned}
P(x_j \in A(S+1)) &= 1 - P(x_j \notin A(S+1)) \\
&= 1 - P(x_{S+1} \text{ selected and evicted } x_j) \\
&= 1 - \frac{S}{S+1} \cdot \frac{1}{S} \\
&= 1 - \frac{1}{S+1} = \frac{S}{S+1}\ \ \checkmark
\end{aligned}$$
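The base-case arithmetic can be verified exactly over an illustrative range of sample sizes $S$ (the range is an assumption; the identity holds for any $S \ge 1$):

```python
from fractions import Fraction

# Exact check of the base case for several sample sizes S.
for S in range(1, 50):
    # P(x_{S+1} is selected AND it evicts x_j) = S/(S+1) * 1/S
    p_evicted = Fraction(S, S + 1) * Fraction(1, S)
    assert 1 - p_evicted == Fraction(S, S + 1)
```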
Inductive Hypothesis: At time $t = k$, all keys $x_1, \ldots, x_k$ have equal probability $\frac{S}{k}$ of being in $A(k)$.
To Prove: At time $t = k + 1$, all keys $x_1, \ldots, x_{k+1}$ have equal probability $\frac{S}{k+1}$ of being in $A(k+1)$.
For $x_{k+1}$, the algorithm selects it with probability $\frac{S}{k+1}$ by construction:

$$P(x_{k+1} \in A(k+1)) = \frac{S}{k+1}\ \ \checkmark$$
For $x_j$ where $j < k + 1$, by the law of total probability:

$$\begin{aligned}
P(x_j \in A(k+1)) &= P(x_j \in A(k+1) \mid x_j \in A(k)) \cdot P(x_j \in A(k)) \\
&\quad + P(x_j \in A(k+1) \mid x_j \notin A(k)) \cdot P(x_j \notin A(k))
\end{aligned}$$
The second term is 0: if $x_j$ is not in the sample at time $k$, it has already been evicted and can never re-enter, so it cannot be in $A(k+1)$.
Thus:
$$P(x_j \in A(k+1)) = P(x_j \in A(k+1) \mid x_j \in A(k)) \cdot P(x_j \in A(k))$$

where, by our inductive hypothesis, $P(x_j \in A(k)) = \frac{S}{k}$.
Therefore, $x_j$ survives step $k+1$ (given it was in the sample) unless the new key is selected and evicts it:

$$\begin{aligned}
P(x_j \in A(k+1) \mid x_j \in A(k)) &= 1 - \underbrace{\frac{S}{k+1}}_{\text{prob. new key is selected}} \cdot \underbrace{\frac{1}{S}}_{\text{prob. it kicks out } x_j} \\
&= 1 - \frac{1}{k+1} = \frac{k}{k+1}
\end{aligned}$$

and so

$$P(x_j \in A(k+1)) = \frac{k}{k+1} \cdot \frac{S}{k} = \frac{S}{k+1}\ \ \checkmark$$
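The inductive-step identity (survival probability times the inductive hypothesis yielding $\frac{S}{k+1}$) can likewise be verified exactly over a small, assumed grid of $S$ and $k$ values:

```python
from fractions import Fraction

# Check: P(survive step k+1 | in A(k)) * P(in A(k)) == S/(k+1), exactly.
for S in range(1, 10):
    for k in range(S, 40):
        survive = 1 - Fraction(S, k + 1) * Fraction(1, S)   # = k/(k+1)
        assert survive * Fraction(S, k) == Fraction(S, k + 1)
```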
By induction, the algorithm maintains a uniformly random sample of size $S$ at all times. $\blacksquare$