
Lecture 11 on 03/04/2026 - Introduce Bloom Filters

Core properties:

  • $O(n)$ total preprocessing time
  • Query time: $O(1)$ worst-case

Disadvantage: $h_1$ and $h_2$ need to be from a $k$-independent family, but $k$ is high ($\approx \log n$)

Space constraint: All hashing methods use $O(n)$ words of space.

To understand why space is an issue, let’s think about the universe of possible keys:

  • Keys come from some keyspace or universe $U$ - the set of all possible keys
  • Example: If your keys are 10-digit phone numbers, the universe is $U = [10^{10}]$ (all possible phone numbers), so $|U| = 10^{10}$

Now, how many bits are needed to represent a single key? If we want to distinguish between all possible keys in $U$, we need enough bits so that $2^b \geq |U|$, which means $b \geq \log_2 |U|$. Rounding up, each key requires $\lceil \log_2 |U| \rceil$ bits.

For $n$ keys, the total space is $n \cdot \lceil \log_2 |U| \rceil = O(n \log |U|)$ bits. This can be enormous:

  • For $n = 1$ million phone numbers where $|U| = 10^{10}$: we need $10^6 \times 34 \approx 34$ million bits ≈ 4 MB
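This arithmetic is easy to check (a quick sketch; the variable names are ours, not from the lecture):

```python
import math

# Bits needed to distinguish every key in a universe of size |U| = 10^10
U_size = 10**10
bits_per_key = math.ceil(math.log2(U_size))   # ceil(33.2...) = 34

n = 10**6                                     # one million stored phone numbers
total_bits = n * bits_per_key
total_mb = total_bits / 8 / 10**6             # bits -> megabytes

print(bits_per_key, total_bits, round(total_mb, 2))
```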

The problem: The universe $U$ could be very large (potentially infinite). What if we don’t have $n \log |U|$ bits of space?

Information-Theoretic Lower Bound:

To exactly answer a membership query “is $q$ in $S = \{x_1, \ldots, x_n\}$?”, we need to store the keys exactly. If we don’t have enough space to distinguish which $n$-element subset of $U$ we are holding, we cannot answer membership queries exactly: this is an information-theoretic limitation.

Example: If we have 30 students in a course with full names 10–20 characters long, but only 2 characters to represent each student, there is no way to determine membership correctly all the time. This is the fundamental trade-off: if the available space is less than the information needed to uniquely represent all keys, errors are inevitable.

Bloom Filter (input: $S = \{x_1, \ldots, x_n\}$, parameter $0 < \varepsilon \leq 1$)

Here $\varepsilon$ is the error rate: the smaller $\varepsilon$, the more space the Bloom filter takes.

Guarantees:

  • If $q \in S$, then BF always answers yes (no false negatives)

  • If $q \notin S$, then BF answers no with probability $\geq 1 - \varepsilon$

    • So BF may answer yes with probability $\leq \varepsilon$
    • $\varepsilon$ is therefore the false positive rate (FPR)
  • Space: $1.44\, n \log_2(1/\varepsilon)$ bits

  • $\varepsilon$ is something we choose

Example: If we are fine with 1 in 32 queries being a false positive:

  • Space used by such a BF: $1.44\, n \log_2(32) \approx 7n$ bits
  • This is only ~7 bits per key, versus 34 bits per key (for a universe of size $10^{10}$)
  • Saves ~80% space compared to exact membership data structures
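The comparison above can be reproduced numerically (a sketch with our own variable names):

```python
import math

n = 10**6
eps = 1 / 32                                   # tolerate 1 FP in 32 queries

bloom_bits = 1.44 * n * math.log2(1 / eps)     # 1.44 * n * log2(32) = 7.2 n
exact_bits = n * math.ceil(math.log2(10**10))  # 34 n for the phone-number universe

print(bloom_bits / n, exact_bits / n)          # ~7.2 vs 34 bits per key
```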

Why No False Negatives, But False Positives?

In many practical applications, false negatives are far worse than false positives:

  • Medical testing: Better to flag people without a virus (false positive) than miss people who actually have it (false negative)
  • Airport security: Better to briefly question someone who isn’t in the risk database (false positive) than let through someone who is (false negative)
  • Plagiarism detection: Better to double-check a non-plagiarized article (false positive) than miss actual plagiarism (false negative)

For these applications, having no false negatives but accepting a small false positive rate is a reasonable trade-off.

Memory Hierarchy Application:

Bloom filters are especially useful when data is stored in a slow device (disk) but with limited fast memory (RAM):

  • Keep the full dataset on disk (slow but all data)
  • Keep a compact Bloom filter in RAM (fast, uses little space)
  • For negative queries (not in database): Bloom filter quickly filters out most in RAM → avoid expensive disk access
  • For positive queries: Accept occasional false positives as cost of fast filtering on negatives

Why this matters: I/O block transfers between disk and main memory are extremely expensive (this is a major bottleneck in memory hierarchies). By using a compact Bloom filter in RAM, we can filter out most negative queries without triggering expensive disk I/O, making this approach practical for real-world systems with large datasets.
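A minimal sketch of this pattern, assuming a toy Bloom filter built from salted SHA-256 hashes and a Python dict standing in for the on-disk store (all names here are illustrative, not a real storage API):

```python
import hashlib

# Sketch of the RAM-filter-in-front-of-disk pattern. The tiny Bloom filter and
# the DISK dict below are illustrative stand-ins, not a real storage system.
M, K = 64, 3                      # toy bit-array size and hash count
bits = [0] * M

def positions(key):
    # K salted hashes -> K bit positions (one common way to derive h_1..h_K)
    return [int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % M
            for i in range(K)]

def bf_insert(key):
    for p in positions(key):
        bits[p] = 1

def bf_maybe_contains(key):
    return all(bits[p] for p in positions(key))

DISK = {"alice": 1, "bob": 2}     # pretend this dict lives on a slow disk
disk_reads = 0

def lookup(key):
    global disk_reads
    if not bf_maybe_contains(key):  # cheap RAM check filters most negatives
        return None                 # answered without touching disk
    disk_reads += 1                 # only now pay for the expensive I/O
    return DISK.get(key)            # may still be None on a false positive

for key in DISK:
    bf_insert(key)

print(lookup("alice"), lookup("zelda"))
```

Note that a false positive only costs one wasted disk read; correctness is preserved because the disk is still the source of truth.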

Bloom Filter parameters:

A Bloom filter is defined by two parameters:

  • $m$ = number of bits in the bit array: $m = \lceil 1.44\, n \log_2(1/\varepsilon) \rceil$

    • This is chosen so that the false positive rate is exactly $\varepsilon$
    • The constant 1.44 ($\approx 1/\ln 2$) comes from the math - it’s the optimal value that minimizes space for a given FPR
    • Smaller $\varepsilon$ (fewer false positives) → larger $m$ (more bits needed)
  • $k$ = number of hash functions: $k = \lceil \ln 2 \cdot \frac{m}{n} \rceil \approx 0.693 \cdot \frac{m}{n}$

    • This is the optimal number of hash functions for the chosen $m$ and $n$
    • Note that $k$ depends on $\varepsilon$ only through the ratio $m/n$; plugging in $m$ gives $k \approx \log_2(1/\varepsilon)$
  • Hash functions: $h_1, h_2, \ldots, h_k : U \to \{0, 1, \ldots, m-1\}$ (each function maps keys to bit positions)

Data structure: Bit array of $m$ bits (initially all 0)
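The two parameter formulas can be wrapped in a small helper (a sketch; `bloom_parameters` is our name, and the rounding choices are ours):

```python
import math

def bloom_parameters(n, eps):
    # m = ceil(1.44 n log2(1/eps)) bits, k ~= ln(2) * m / n hash functions
    m = math.ceil(1.44 * n * math.log2(1 / eps))
    k = max(1, round(math.log(2) * m / n))
    return m, k

print(bloom_parameters(10**6, 1 / 32))   # (7200000, 5)
```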


Insertion algorithm:

To insert $n$ keys into the Bloom filter:

  1. Start with a bit array of $m$ bits, all initialized to 0
  2. For each key $x_i$ (in order $x_1, x_2, \ldots, x_n$):
    • Apply all $k$ hash functions to $x_i$: compute $h_1(x_i), h_2(x_i), \ldots, h_k(x_i)$
    • Each hash function returns a bit position in $\{0, 1, \ldots, m-1\}$
    • Set each of those $k$ bits to 1 in the bit array
    • If a bit was already 1, it stays 1

Core property: Bits are monotonic — once set to 1, they never go back to 0. This is the crucial property that prevents false negatives.
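The insertion steps above can be sketched as follows (assuming salted SHA-256 hashes as a stand-in for $h_1, \ldots, h_k$; all names are illustrative):

```python
import hashlib

m, k = 32, 3                       # toy parameters; real ones come from n, eps
bits = [0] * m                     # step 1: all bits start at 0

def h(i, key):
    # i-th hash function: salted SHA-256 reduced mod m (illustrative stand-in)
    return int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % m

def insert(key):
    for i in range(k):             # step 2: apply all k hash functions
        bits[h(i, key)] = 1        # monotonic: bits only ever go 0 -> 1

for key in ["x1", "x2", "x3"]:
    insert(key)

print(sum(bits))                   # at most k * 3 = 9 bits are set
```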

Query algorithm: To check if element $q$ is in the Bloom filter:

  1. Apply all $k$ hash functions to $q$: compute $h_1(q), h_2(q), \ldots, h_k(q)$
  2. Check the $k$ bits at those positions in the bit array
  3. Answer “yes” if all $k$ bits are 1
  4. Answer “no” if any of the $k$ bits is 0
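A matching sketch of the query procedure (self-contained, using the same illustrative salted-hash construction):

```python
import hashlib

m, k = 32, 3
bits = [0] * m

def h(i, key):
    # salted SHA-256 mod m as an illustrative hash family
    return int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % m

def insert(key):
    for i in range(k):
        bits[h(i, key)] = 1

def query(key):
    # yes iff ALL k bit positions for `key` are set; any 0 means a certain no
    return all(bits[h(i, key)] == 1 for i in range(k))

insert("alice")
print(query("alice"))   # True: inserted keys are always reported present
```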

Why No False Negatives:

If a key $q$ was in the original set $S$, then during insertion we set all $k$ bits corresponding to $q$ to 1. Since bits are never reset to 0, when we later query $q$, all $k$ bits will still be 1, so we answer yes. False negatives are impossible.

Why False Positives Can Occur:

A false positive occurs when a key $q$ that was NOT in $S$ has all $k$ of its bits set to 1. This happens due to hash collisions: other keys in $S$ may have hashed to the same bit positions, setting those bits to 1 even though $q$ was never inserted.

Example: Suppose $q$ has three hash locations (positions 5, 12, and 18). If three different keys in $S$ separately hashed to positions 5, 12, and 18 (due to collisions), those bits would all be set to 1. When we query $q$, all its bits appear to be 1, so we incorrectly answer yes: a false positive.
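The collision scenario can be made concrete with hand-picked (purely hypothetical) hash positions:

```python
# Hand-constructed example: a 20-bit filter with k = 3 and hypothetical
# hash positions, chosen to force the collision described above.
m = 20
bits = [0] * m

H = {                   # pretend hash positions for each key
    "a": [5, 1, 2],     # key a (in S) sets bit 5, among others
    "b": [12, 3, 4],    # key b (in S) sets bit 12
    "c": [18, 6, 7],    # key c (in S) sets bit 18
    "q": [5, 12, 18],   # q was never inserted, but collides on all 3 bits
}

def insert(key):
    for p in H[key]:
        bits[p] = 1

def query(key):
    return all(bits[p] for p in H[key])

for key in ["a", "b", "c"]:   # insert S = {a, b, c}; q is never inserted
    insert(key)

print(query("q"))   # True: a false positive
```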

Effect of increasing $k$:

  • Each extra hash function sets more bits per insertion, so the array fills up faster and each individual bit is more likely to be 1 (bad for false positives)
  • But a query only answers yes if all $k$ of its bits are 1, so each extra hash function is one more chance to catch a non-member (good for false positives)
  • The optimal $k$ balances these two effects; up to that point, more hash functions → fewer false positives overall

Analysis: Proving False Positive Rate is ≤ ε


Strategy: We’ll compute the false positive rate mathematically by calculating the probability that a queried element’s $k$ bits all happen to be 1 by accident (due to other keys setting them).

Intuition: For a false positive to occur on a query for element $q$ (which was never inserted), all $k$ of its hash positions must have been set to 1 by some combination of the $n$ inserted keys. This is a probability problem.

Detailed calculation:

Consider a single bit $i$ in the Bloom filter. We’ll calculate the probability that this bit is 0 after all insertions (and thus the probability it’s 1).

Step 1: Probability a single insertion touches bit $i$

When we insert a key $x$, we set $k$ (essentially random) bits to 1. The probability that bit $i$ is one of the chosen $k$ bits is about $k/m$ (a union bound over the $k$ hash functions, which is nearly tight when $k \ll m$). Therefore, the probability that bit $i$ is NOT touched is $1 - k/m$.

Step 2: Probability bit $i$ survives one insertion as 0

If bit $i$ starts as 0, it remains 0 only if we don’t touch it:

$$\Pr(\text{bit } i \text{ stays } 0 \text{ after inserting one key}) = 1 - \frac{k}{m}$$

Step 3: Probability bit $i$ survives all $n$ insertions as 0

The $n$ insertions are independent. Bit $i$ remains 0 only if none of the $n$ keys touch it:

$$\Pr(\text{bit } i \text{ is } 0 \text{ after all } n \text{ keys}) = \left(1 - \frac{k}{m}\right)^{n}$$

Step 4: Apply a limit approximation

For large $m$, we can use the well-known limit $\lim_{m \to \infty} \left(1 - \frac{1}{m}\right)^m = e^{-1}$.

Rewriting our expression:

$$\left(1 - \frac{k}{m}\right)^{n} = \left[\left(1 - \frac{k}{m}\right)^{m/k}\right]^{kn/m}$$

For large $m$, the inner term approximates $e^{-1}$:

$$\approx (e^{-1})^{kn/m} = e^{-kn/m}$$
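The quality of this approximation is easy to check numerically (sizes taken from the earlier 1-in-32 example; variable names are ours):

```python
import math

# Sanity check: (1 - k/m)^n vs e^{-kn/m} at realistic sizes
n, m, k = 10**6, 7_200_000, 5

exact = (1 - k / m) ** n          # the Step 3 expression
approx = math.exp(-k * n / m)     # the Step 4 approximation

print(abs(exact - approx) < 1e-3)  # True: the two agree closely
```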

Step 5: Probability that bit $i$ is 1

Therefore, the probability that bit $i$ is set to 1 after all insertions is:

$$\Pr(\text{bit } i \text{ is } 1) = 1 - e^{-kn/m}$$

This is the probability that a single random bit in the filter is 1.

Step 6: False positive probability

For a queried element $q$ (not in the database), a false positive occurs if all $k$ of its hash positions happen to be set to 1 by some combination of the inserted keys.

Treating the $k$ bits as independent (a standard heuristic in this analysis; the dependence between bits is negligible for large $m$):

$$\Pr(\text{FP on } q) = [\Pr(\text{bit is } 1)]^k = \left(1 - e^{-kn/m}\right)^k$$

This is the false positive rate for the Bloom filter.
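Plugging in the parameters from the earlier example shows the formula landing near the target $\varepsilon$ (a sketch; `predicted_fpr` is our name):

```python
import math

def predicted_fpr(n, m, k):
    # (1 - e^{-kn/m})^k, the false positive rate from the analysis
    return (1 - math.exp(-k * n / m)) ** k

# m = 7.2n and k = 5: the parameters for a target of eps = 1/32 = 0.03125
fpr = predicted_fpr(n=10**6, m=7_200_000, k=5)
print(fpr)
```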

Step 7: Choosing parameters to achieve target $\varepsilon$

To make this analysis concrete, we choose:

  • $m = 1.44\, n \log_2(1/\varepsilon)$ (number of bits)
  • $k = \ln 2 \cdot \frac{m}{n} \approx 0.693 \cdot \frac{m}{n}$ (number of hash functions)

With these choices, the false positive rate works out to:

$$\text{FPR} = \left(1 - e^{-kn/m}\right)^k \approx \varepsilon$$

The derivation is algebraically involved, but the important takeaway is that we can tune mm and kk to achieve any desired false positive rate ε\varepsilon.
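Sketching that algebra (writing $\log$ for $\log_2$ and using $1.44 \approx 1/\ln 2$):

$$k = \ln 2 \cdot \frac{m}{n} \;\Longrightarrow\; e^{-kn/m} = e^{-\ln 2} = \frac{1}{2} \;\Longrightarrow\; \text{FPR} = \left(\frac{1}{2}\right)^k = 2^{-k}$$

$$\frac{m}{n} = 1.44 \log_2(1/\varepsilon) = \frac{\log_2(1/\varepsilon)}{\ln 2} \;\Longrightarrow\; k = \ln 2 \cdot \frac{m}{n} = \log_2(1/\varepsilon) \;\Longrightarrow\; \text{FPR} = 2^{-\log_2(1/\varepsilon)} = \varepsilon$$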