
Lecture 12 on 03/09/2026 - Finish Bloom Filter; Introduce Streaming Algorithms & Uniform Sampling

With a finite amount of memory, we want to process an infinite stream of data:

$$x_1, x_2, \ldots, x_n$$

We cannot store the entire stream. Instead, we want to answer queries using as little space as possible.

Goal: Given a sample size $S$, we want to store $S$ keys from the stream such that at any time $t$, our sample contains $S$ keys chosen uniformly at random from $x_1, \ldots, x_t$.

Note: Another way to think of $t$ is as the number of keys we’ve seen so far in the stream.

Example: $S = 2$

Stream: $5, 8, 2, 3, 1, 4, 9, 10, 6, \ldots$ (at $t = 9$)

$$P(8 \in \text{Sample}) = \frac{2}{9}$$
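As a sanity check on the target probability, the short sketch below enumerates every possible sample of size $S = 2$ drawn from the first nine keys and counts how many contain the key 8. The variable names (`keys`, `subsets`, `with_8`) are illustrative and not part of the lecture; the sketch describes the goal distribution only, not the streaming algorithm.

```python
from fractions import Fraction
from itertools import combinations

keys = [5, 8, 2, 3, 1, 4, 9, 10, 6]  # the first t = 9 keys of the example stream
S = 2

# Under a uniform choice, every size-S subset of the keys is equally likely.
subsets = list(combinations(keys, S))      # C(9, 2) = 36 possible samples
with_8 = [s for s in subsets if 8 in s]    # the 8 samples that contain key 8

print(Fraction(len(with_8), len(subsets)))  # 2/9
```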

Assumption: The sample size is $S$. We’ll store the first $S$ keys from the stream.

  1. Store the first $S$ keys: $x_1, \ldots, x_S$
  2. When a new key $x_t$ arrives (with $t > S$):
    • Select it with probability $\frac{S}{t}$
    • If not selected, the sample stays the same
    • If selected, we add $x_t$ and remove one of the $S$ keys currently in the sample, chosen uniformly at random
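The steps above can be sketched in Python as follows. This is a minimal sketch rather than a reference implementation: the function name `reservoir_sample`, its parameters, and the optional `rng` argument are assumptions introduced here for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], S: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return up to S keys from `stream` using the select-then-evict rule."""
    rng = rng or random.Random()
    sample: List[T] = []
    for t, key in enumerate(stream, start=1):  # t = number of keys seen so far
        if t <= S:
            # Step 1: the first S keys are always stored.
            sample.append(key)
        elif rng.random() < S / t:
            # Step 2: the new key is selected with probability S/t, so it
            # replaces one of the S stored keys chosen uniformly at random.
            sample[rng.randrange(S)] = key
        # Otherwise the sample stays the same.
    return sample


# Usage: sample 2 keys from the example stream (the result varies per run).
print(reservoir_sample([5, 8, 2, 3, 1, 4, 9, 10, 6], S=2))
```

Only the `S` stored keys and the counter `t` are kept in memory, so the space used does not grow with the length of the stream.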

Example with $S = 2$:

Stream: $5, 8, 2, 3, 1, 4, 9, 10, 6, \ldots$

We maintain a sample of size 2 as we process the stream. For each new element arriving at time $t > 2$, we decide whether to include it with probability $\frac{S}{t} = \frac{2}{t}$. If we include it, we evict one of the current sample elements, chosen uniformly at random, to maintain the sample size.

| Time $t$ | Stream Item | Sample Before | Select Probability | Decision | Sample After |
|---|---|---|---|---|---|
| 1 | 5 | $\emptyset$ | - | Select with probability 1 | {5} |
| 2 | 8 | {5} | - | Select with probability 1 | {5, 8} |
| 3 | 2 | {5, 8} | $\frac{2}{3}$ | Select; evict 5 w.p. $\frac{1}{2}$ | {2, 8} |
| 4 | 3 | {2, 8} | $\frac{2}{4} = \frac{1}{2}$ | Don’t select (w.p. $\frac{1}{2}$) | {2, 8} |
| 5 | 1 | {2, 8} | $\frac{2}{5}$ | - | - |
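To make the table easier to follow, here is a small sketch that replays one fixed sequence of decisions instead of drawing random numbers. The helper `replay` and its `decisions` argument (mapping a time $t$ to the key that gets evicted, or to `None` when the new key is not selected) are hypothetical names introduced only for this illustration.

```python
def replay(stream, S, decisions):
    """Replay fixed select/evict decisions and record the sample after each step."""
    sample = []
    history = []
    for t, key in enumerate(stream, start=1):
        if t <= S:
            sample.append(key)                 # the first S keys are always stored
        else:
            evicted = decisions.get(t)         # None means "don't select"
            if evicted is not None:
                sample[sample.index(evicted)] = key
        history.append(sorted(sample))
    return history


# Rows t = 1..4 of the table: at t = 3 select and evict 5; at t = 4 don't select.
history = replay([5, 8, 2, 3], S=2, decisions={3: 5, 4: None})
print(history)  # [[5], [5, 8], [2, 8], [2, 8]]
```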

Theorem: At every time $t \ge S$, the sampling algorithm maintains a uniformly random sample of size $S$: each of the keys $x_1, \ldots, x_t$ is in the sample with probability $\frac{S}{t}$.

Proof: By induction on $t$ (the number of keys seen so far).

Base Case ($t = S + 1$): We have seen the first $S + 1$ keys: $x_1, \ldots, x_S, x_{S+1}$.

Let $A(t)$ denote our sample after processing $x_t$. Initially, $A(S) = \{x_1, \ldots, x_S\}$.

For $x_{S+1}$:

$$P(x_{S+1} \in A(S+1)) = \frac{S}{S+1}\ \ \checkmark$$

For $x_j$ where $j \in \{1, \ldots, S\}$:

$$
\begin{align}
P(x_j \in A(S+1)) &= 1 - P(x_j \notin A(S+1)) \\
&= 1 - P(\text{$x_{S+1}$ selected and evicted } x_j) \\
&= 1 - \left(\frac{S}{S+1} \cdot \frac{1}{S}\right) \\
&= 1 - \frac{1}{S+1} = \frac{S}{S+1}\ \ \checkmark
\end{align}
$$

Inductive Hypothesis: At time $t = k$, all keys $x_1, \ldots, x_k$ have equal probability $\frac{S}{k}$ of being in $A(k)$.

To Prove: At time $t = k+1$, all keys $x_1, \ldots, x_{k+1}$ have equal probability $\frac{S}{k+1}$ of being in $A(k+1)$.

For $x_{k+1}$:

$$P(x_{k+1} \in A(k+1)) = \frac{S}{k+1}\ \ \checkmark$$

For $x_j$ where $j < k + 1$:

By the law of total probability:

$$
\begin{align}
P(x_j \in A(k+1)) &= P(x_j \in A(k+1) \mid x_j \in A(k)) \cdot P(x_j \in A(k)) \\
&\quad + P(x_j \in A(k+1) \mid x_j \notin A(k)) \cdot P(x_j \notin A(k))
\end{align}
$$

The second term is 0: if $x_j$ is not in the sample at time $k$, then it was either never selected or has already been evicted, and a key can only enter the sample at the moment it arrives. So it cannot be in $A(k+1)$.

Thus:

$$P(x_j \in A(k+1)) = P(x_j \in A(k+1) \mid x_j \in A(k)) \cdot P(x_j \in A(k))$$

where $P(x_j \in A(k))$, by our inductive hypothesis, is $\frac{S}{k}$.

Therefore:

$$
\begin{align}
P(x_j \in A(k+1)) &= P(x_j \in A(k+1) \mid x_j \in A(k)) \times \frac{S}{k} \\
&= \left(1 - P(x_j \notin A(k+1) \mid x_j \in A(k))\right) \times \frac{S}{k} \\
&= \left(1 - \underbrace{\frac{S}{k+1}}_{\text{prob. new key is selected}} \times \underbrace{\frac{1}{S}}_{\text{prob. it kicks out } x_j}\right) \times \frac{S}{k} \\
&= \left(1 - \frac{1}{k+1}\right) \times \frac{S}{k} = \frac{k}{k+1} \times \frac{S}{k} = \frac{S}{k+1}\ \ \checkmark
\end{align}
$$
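The inductive step can also be checked with exact arithmetic. The sketch below follows one of the first $S$ keys and multiplies its inclusion probability by the survival factor $1 - \frac{S}{k+1} \cdot \frac{1}{S}$ at every step; the function name `inclusion_probabilities` and its parameters are assumptions made for this illustration.

```python
from fractions import Fraction


def inclusion_probabilities(S, n):
    """Return {t: P(x_j in A(t))} for t = S..n, for a fixed key x_j with j <= S."""
    p = Fraction(1)  # at time t = S, each of the first S keys is stored for sure
    probs = {S: p}
    for k in range(S, n):
        # x_j survives step k + 1 unless the new key is selected (S / (k + 1))
        # and x_j is the one evicted (1 / S).
        survive = 1 - Fraction(S, k + 1) * Fraction(1, S)
        p = survive * p
        probs[k + 1] = p
    return probs


probs = inclusion_probabilities(S=2, n=9)
assert all(p == Fraction(2, t) for t, p in probs.items())
print(probs[9])  # 2/9
```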

By induction, the algorithm maintains a uniformly random sample of size $S$ at all times.
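For a small stream, the claim can be verified exhaustively rather than by simulation. The sketch below tracks the exact probability of every possible sample as the algorithm runs, assuming the keys are distinct and $S \ge 1$. The function name `exact_sample_distribution` is hypothetical. Under these assumptions, every size-$S$ subset of the nine example keys comes out with the same probability $1 / \binom{9}{2}$.

```python
from fractions import Fraction
from math import comb


def exact_sample_distribution(stream, S):
    """Map each reachable sample (a frozenset) to its exact probability."""
    dist = {frozenset(stream[:S]): Fraction(1)}  # the first S keys are stored
    for t in range(S + 1, len(stream) + 1):
        key = stream[t - 1]
        new = {}
        for sample, p in dist.items():
            # Not selected (probability 1 - S/t): the sample is unchanged.
            new[sample] = new.get(sample, 0) + p * (1 - Fraction(S, t))
            # Selected (S/t), then each stored key is evicted with probability 1/S.
            for victim in sample:
                nxt = (sample - {victim}) | {key}
                new[nxt] = new.get(nxt, 0) + p * Fraction(S, t) * Fraction(1, S)
        dist = new
    return dist


stream = [5, 8, 2, 3, 1, 4, 9, 10, 6]
dist = exact_sample_distribution(stream, S=2)
assert len(dist) == comb(9, 2)
assert all(p == Fraction(1, comb(9, 2)) for p in dist.values())
print(sum(p for s, p in dist.items() if 8 in s))  # 2/9
```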