
Lecture 11 on 03/04/2026 - Introduce Bloom Filters

Core properties:

  • $O(n)$ total preprocessing time
  • Query time: $O(1)$ worst-case

Disadvantage: $h_1$ and $h_2$ need to be from a $k$-independent family, but $k$ is high ($\approx \log n$)

Space constraint: All hashing methods use $O(n)$ words of space.

To understand why space is an issue, let’s think about the universe of possible keys:

  • Keys come from some keyspace or universe $U$ - the set of all possible keys
  • Example: If your keys are 10-digit phone numbers, the universe is $U = [10^{10}]$ (all possible phone numbers), so $|U| = 10^{10}$

Now, how many bits are needed to represent a single key? If we want to distinguish between all possible keys in $U$, we need enough bits so that $2^b \geq |U|$, which means $b \geq \log_2 |U|$. Rounding up, each key requires $\lceil \log_2 |U| \rceil$ bits.

For $n$ keys, the total space is $n \cdot \lceil \log_2 |U| \rceil = O(n \log |U|)$ bits. This can be enormous:

  • For $n = 1$ million phone numbers where $|U| = 10^{10}$: we need $10^6 \times 34 \approx 34$ million bits ≈ 4 MB
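This arithmetic is easy to check (a quick sketch; the variable names are ours, not from the lecture):

```python
import math

# Bits needed to distinguish every key in a universe of size |U| = 10^10
U_size = 10**10
bits_per_key = math.ceil(math.log2(U_size))   # ceil(33.2...) = 34

n = 10**6                                     # one million stored phone numbers
total_bits = n * bits_per_key
total_mb = total_bits / 8 / 10**6             # bits -> megabytes

print(bits_per_key, total_bits, round(total_mb, 2))
```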

The problem: The universe $U$ could be very large (potentially infinite). What if we don’t have $n \log |U|$ bits of space?

Information-Theoretic Lower Bound:

To exactly answer a membership query “is $q$ in $S = \{x_1, \ldots, x_n\}$?”, we need to store the keys exactly. If we don’t have enough space to distinguish which $n$-element subset of $U$ we are holding, we cannot answer membership queries exactly: this is an information-theoretic limitation.

Example: If we have 30 students in a course with full names 10–20 characters long, but only 2 characters to represent each student, there is no way to determine membership correctly all the time. This is the fundamental trade-off: if the available space is less than the information needed to uniquely represent all keys, errors are inevitable.

Bloom Filter (input: $S = \{x_1, \ldots, x_n\}$, parameter $0 < \varepsilon \leq 1$)

Here $\varepsilon$ is the error rate: the smaller $\varepsilon$, the more space the Bloom filter takes.

Guarantees:

  • If $q \in S$, then BF always answers yes (no false negatives)

  • If $q \notin S$, then BF answers no with probability $\geq 1 - \varepsilon$

    • So BF may answer yes with probability $\leq \varepsilon$
    • $\varepsilon$ is therefore the false positive rate (FPR)
  • Space: $1.44\, n \log_2(1/\varepsilon)$ bits

  • $\varepsilon$ is something we choose

Example: If we are fine with 1 in 32 queries being a false positive:

  • Space used by such a BF: $1.44\, n \log_2(32) \approx 7n$ bits
  • This is only ~7 bits per key, versus 34 bits per key (for a universe of size $10^{10}$)
  • Saves ~80% space compared to exact membership data structures
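The comparison above can be reproduced numerically (a sketch with our own variable names):

```python
import math

n = 10**6
eps = 1 / 32                                   # tolerate 1 FP in 32 queries

bloom_bits = 1.44 * n * math.log2(1 / eps)     # 1.44 * n * log2(32) = 7.2 n
exact_bits = n * math.ceil(math.log2(10**10))  # 34 n for the phone-number universe

print(bloom_bits / n, exact_bits / n)          # ~7.2 vs 34 bits per key
```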

Why No False Negatives, But False Positives?

In many practical applications, false negatives are far worse than false positives:

  • Medical testing: Better to flag people without a virus (false positive) than miss people who actually have it (false negative)
  • Airport security: Better to briefly question someone who isn’t in the risk database (false positive) than let through someone who is (false negative)
  • Plagiarism detection: Better to double-check a non-plagiarized article (false positive) than miss actual plagiarism (false negative)

For these applications, having no false negatives but accepting a small false positive rate is a reasonable trade-off.

Memory Hierarchy Application:

Bloom filters are especially useful when data is stored in a slow device (disk) but with limited fast memory (RAM):

  • Keep the full dataset on disk (slow but all data)
  • Keep a compact Bloom filter in RAM (fast, uses little space)
  • For negative queries (not in database): Bloom filter quickly filters out most in RAM → avoid expensive disk access
  • For positive queries: Accept occasional false positives as cost of fast filtering on negatives

Why this matters: I/O block transfers between disk and main memory are extremely expensive (this is a major bottleneck in memory hierarchies). By using a compact Bloom filter in RAM, we can filter out most negative queries without triggering expensive disk I/O, making this approach practical for real-world systems with large datasets.
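A minimal sketch of this pattern, assuming a toy Bloom filter built from salted SHA-256 hashes and a Python dict standing in for the on-disk store (all names here are illustrative, not a real storage API):

```python
import hashlib

# Sketch of the RAM-filter-in-front-of-disk pattern. The tiny Bloom filter and
# the DISK dict below are illustrative stand-ins, not a real storage system.
M, K = 64, 3                      # toy bit-array size and hash count
bits = [0] * M

def positions(key):
    # K salted hashes -> K bit positions (one common way to derive h_1..h_K)
    return [int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % M
            for i in range(K)]

def bf_insert(key):
    for p in positions(key):
        bits[p] = 1

def bf_maybe_contains(key):
    return all(bits[p] for p in positions(key))

DISK = {"alice": 1, "bob": 2}     # pretend this dict lives on a slow disk
disk_reads = 0

def lookup(key):
    global disk_reads
    if not bf_maybe_contains(key):  # cheap RAM check filters most negatives
        return None                 # answered without touching disk
    disk_reads += 1                 # only now pay for the expensive I/O
    return DISK.get(key)            # may still be None on a false positive

for key in DISK:
    bf_insert(key)

print(lookup("alice"), lookup("zelda"))
```

Note that a false positive only costs one wasted disk read; correctness is preserved because the disk is still the source of truth.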

Bloom Filter parameters:

A Bloom filter is defined by two parameters:

  • $m$ = number of bits in the bit array: $m = \lceil 1.44\, n \log_2(1/\varepsilon) \rceil$

    • This is chosen so that the false positive rate is exactly $\varepsilon$
    • The constant 1.44 ($\approx 1/\ln 2$) comes from the math - it’s the optimal value that minimizes space for a given FPR
    • Smaller $\varepsilon$ (fewer false positives) → larger $m$ (more bits needed)
  • $k$ = number of hash functions: $k = \lceil \ln 2 \cdot \frac{m}{n} \rceil \approx 0.693 \cdot \frac{m}{n}$

    • This is the optimal number of hash functions for the chosen $m$ and $n$
    • Note that $k$ depends on $\varepsilon$ only through the ratio $m/n$; plugging in $m$ gives $k \approx \log_2(1/\varepsilon)$
  • Hash functions: $h_1, h_2, \ldots, h_k : U \to \{0, 1, \ldots, m-1\}$ (each function maps keys to bit positions)

Data structure: Bit array of $m$ bits (initially all 0)
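The two parameter formulas can be wrapped in a small helper (a sketch; `bloom_parameters` is our name, and the rounding choices are ours):

```python
import math

def bloom_parameters(n, eps):
    # m = ceil(1.44 n log2(1/eps)) bits, k ~= ln(2) * m / n hash functions
    m = math.ceil(1.44 * n * math.log2(1 / eps))
    k = max(1, round(math.log(2) * m / n))
    return m, k

print(bloom_parameters(10**6, 1 / 32))   # (7200000, 5)
```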


Insertion algorithm:

To insert $n$ keys into the Bloom filter:

  1. Start with a bit array of $m$ bits, all initialized to 0
  2. For each key $x_i$ (in order $x_1, x_2, \ldots, x_n$):
    • Apply all $k$ hash functions to $x_i$: compute $h_1(x_i), h_2(x_i), \ldots, h_k(x_i)$
    • Each hash function returns a bit position in $\{0, 1, \ldots, m-1\}$
    • Set each of those $k$ bits to 1 in the bit array
    • If a bit was already 1, it stays 1

Core property: Bits are monotonic — once set to 1, they never go back to 0. This is the crucial property that prevents false negatives.
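The insertion steps above can be sketched as follows (assuming salted SHA-256 hashes as a stand-in for $h_1, \ldots, h_k$; all names are illustrative):

```python
import hashlib

m, k = 32, 3                       # toy parameters; real ones come from n, eps
bits = [0] * m                     # step 1: all bits start at 0

def h(i, key):
    # i-th hash function: salted SHA-256 reduced mod m (illustrative stand-in)
    return int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % m

def insert(key):
    for i in range(k):             # step 2: apply all k hash functions
        bits[h(i, key)] = 1        # monotonic: bits only ever go 0 -> 1

for key in ["x1", "x2", "x3"]:
    insert(key)

print(sum(bits))                   # at most k * 3 = 9 bits are set
```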

Query algorithm: To check if element $q$ is in the Bloom filter:

  1. Apply all $k$ hash functions to $q$: compute $h_1(q), h_2(q), \ldots, h_k(q)$
  2. Check the $k$ bits at those positions in the bit array
  3. Answer “yes” if all $k$ bits are 1
  4. Answer “no” if any of the $k$ bits is 0
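A matching sketch of the query procedure (self-contained, using the same illustrative salted-hash construction):

```python
import hashlib

m, k = 32, 3
bits = [0] * m

def h(i, key):
    # salted SHA-256 mod m as an illustrative hash family
    return int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % m

def insert(key):
    for i in range(k):
        bits[h(i, key)] = 1

def query(key):
    # yes iff ALL k bit positions for `key` are set; any 0 means a certain no
    return all(bits[h(i, key)] == 1 for i in range(k))

insert("alice")
print(query("alice"))   # True: inserted keys are always reported present
```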

Why No False Negatives:

If a key $q$ was in the original set $S$, then during insertion we set all $k$ bits corresponding to $q$ to 1. Since bits are never reset to 0, when we later query $q$, all $k$ bits will still be 1, so we answer yes. False negatives are impossible.

Why False Positives Can Occur:

A false positive occurs when a key $q$ that was NOT in $S$ has all $k$ of its bits set to 1. This happens due to hash collisions: other keys in $S$ may have hashed to the same bit positions, setting those bits to 1 even though $q$ was never inserted.

Example: Suppose $q$ has three hash locations (positions 5, 12, and 18). If three different keys in $S$ separately hashed to positions 5, 12, and 18 (due to collisions), those bits would all be set to 1. When we query $q$, all its bits appear to be 1, so we incorrectly answer yes: a false positive.
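The collision scenario can be made concrete with hand-picked (purely hypothetical) hash positions:

```python
# Hand-constructed example: a 20-bit filter with k = 3 and hypothetical
# hash positions, chosen to force the collision described above.
m = 20
bits = [0] * m

H = {                   # pretend hash positions for each key
    "a": [5, 1, 2],     # key a (in S) sets bit 5, among others
    "b": [12, 3, 4],    # key b (in S) sets bit 12
    "c": [18, 6, 7],    # key c (in S) sets bit 18
    "q": [5, 12, 18],   # q was never inserted, but collides on all 3 bits
}

def insert(key):
    for p in H[key]:
        bits[p] = 1

def query(key):
    return all(bits[p] for p in H[key])

for key in ["a", "b", "c"]:   # insert S = {a, b, c}; q is never inserted
    insert(key)

print(query("q"))   # True: a false positive
```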

Effect of increasing $k$:

  • Each extra hash function sets more bits per insertion, so the array fills up faster and each individual bit is more likely to be 1 (bad for false positives)
  • But a query only answers yes if all $k$ of its bits are 1, so each extra hash function is one more chance to catch a non-member (good for false positives)
  • The optimal $k$ balances these two effects; up to that point, more hash functions → fewer false positives overall

Analysis: Proving False Positive Rate is ≤ ε


Strategy: We’ll compute the false positive rate mathematically by calculating the probability that a queried element’s $k$ bits all happen to be 1 by accident (due to other keys setting them).

Intuition: For a false positive to occur on a query for element $q$ (which was never inserted), all $k$ of its hash positions must have been set to 1 by some combination of the $n$ inserted keys. This is a probability problem.

Detailed calculation:

Consider a single bit $i$ in the Bloom filter. We’ll calculate the probability that this bit is 0 after all insertions (and thus the probability it’s 1).

Step 1: Probability a single insertion touches bit $i$

When we insert a key $x$, we set $k$ (essentially random) bits to 1. The probability that bit $i$ is one of the chosen $k$ bits is about $k/m$ (a union bound over the $k$ hash functions, which is nearly tight when $k \ll m$). Therefore, the probability that bit $i$ is NOT touched is $1 - k/m$.

Step 2: Probability bit $i$ survives one insertion as 0

If bit $i$ starts as 0, it remains 0 only if we don’t touch it:

$$\Pr(\text{bit } i \text{ stays } 0 \text{ after inserting one key}) = 1 - \frac{k}{m}$$

Step 3: Probability bit $i$ survives all $n$ insertions as 0

The $n$ insertions are independent. Bit $i$ remains 0 only if none of the $n$ keys touch it:

$$\Pr(\text{bit } i \text{ is } 0 \text{ after all } n \text{ keys}) = \left(1 - \frac{k}{m}\right)^{n}$$

Step 4: Apply a limit approximation

For large $m$, we can use the well-known limit $\lim_{m \to \infty} \left(1 - \frac{1}{m}\right)^m = e^{-1}$.

Rewriting our expression:

$$\left(1 - \frac{k}{m}\right)^{n} = \left[\left(1 - \frac{k}{m}\right)^{m/k}\right]^{kn/m}$$

For large $m$, the inner term approximates $e^{-1}$:

$$\approx (e^{-1})^{kn/m} = e^{-kn/m}$$
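The quality of this approximation is easy to check numerically (sizes taken from the earlier 1-in-32 example; variable names are ours):

```python
import math

# Sanity check: (1 - k/m)^n vs e^{-kn/m} at realistic sizes
n, m, k = 10**6, 7_200_000, 5

exact = (1 - k / m) ** n          # the Step 3 expression
approx = math.exp(-k * n / m)     # the Step 4 approximation

print(abs(exact - approx) < 1e-3)  # True: the two agree closely
```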

Step 5: Probability that bit $i$ is 1

Therefore, the probability that bit $i$ is set to 1 after all insertions is:

$$\Pr(\text{bit } i \text{ is } 1) = 1 - e^{-kn/m}$$

This is the probability that a single random bit in the filter is 1.

Step 6: False positive probability

For a queried element $q$ (not in the database), a false positive occurs if all $k$ of its hash positions happen to be set to 1 by some combination of the inserted keys.

Treating the $k$ bits as independent (a standard heuristic in this analysis; the dependence between bits is negligible for large $m$):

$$\Pr(\text{FP on } q) = [\Pr(\text{bit is } 1)]^k = \left(1 - e^{-kn/m}\right)^k$$

This is the false positive rate for the Bloom filter.
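Plugging in the parameters from the earlier example shows the formula landing near the target $\varepsilon$ (a sketch; `predicted_fpr` is our name):

```python
import math

def predicted_fpr(n, m, k):
    # (1 - e^{-kn/m})^k, the false positive rate from the analysis
    return (1 - math.exp(-k * n / m)) ** k

# m = 7.2n and k = 5: the parameters for a target of eps = 1/32 = 0.03125
fpr = predicted_fpr(n=10**6, m=7_200_000, k=5)
print(fpr)
```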

Step 7: Choosing parameters to achieve target $\varepsilon$

To make this analysis concrete, we choose:

  • $m = 1.44\, n \log_2(1/\varepsilon)$ (number of bits)
  • $k = \ln 2 \cdot \frac{m}{n} \approx 0.693 \cdot \frac{m}{n}$ (number of hash functions)

With these choices, the false positive rate works out to:

$$\text{FPR} = \left(1 - e^{-kn/m}\right)^k \approx \varepsilon$$

The derivation is algebraically involved, but the important takeaway is that we can tune mm and kk to achieve any desired false positive rate ε\varepsilon.
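Sketching that algebra (writing $\log$ for $\log_2$ and using $1.44 \approx 1/\ln 2$):

$$k = \ln 2 \cdot \frac{m}{n} \;\Longrightarrow\; e^{-kn/m} = e^{-\ln 2} = \frac{1}{2} \;\Longrightarrow\; \text{FPR} = \left(\frac{1}{2}\right)^k = 2^{-k}$$

$$\frac{m}{n} = 1.44 \log_2(1/\varepsilon) = \frac{\log_2(1/\varepsilon)}{\ln 2} \;\Longrightarrow\; k = \ln 2 \cdot \frac{m}{n} = \log_2(1/\varepsilon) \;\Longrightarrow\; \text{FPR} = 2^{-\log_2(1/\varepsilon)} = \varepsilon$$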