Lecture 11 on 03/04/2026 - Introduce Bloom Filters
Recap: Cuckoo Hashing
Core properties:
- $O(n)$ expected total preprocessing time
- Query time: $O(1)$ worst-case
Disadvantage: the two hash functions $h_1$ and $h_2$ need to be from a $k$-independent family, but the required $k$ is high ($O(\log n)$)
Space constraint: All hashing methods use $O(n)$ words of space.
To understand why space is an issue, let’s think about the universe of possible keys:
- Keys come from some keyspace or universe $U$, the set of all possible keys
- Example: If your keys are 10-digit phone numbers, the universe is $U = \{0, 1, \ldots, 10^{10} - 1\}$ (all possible phone numbers), so $|U| = 10^{10}$
Now, how many bits are needed to represent a single key? If we want to distinguish between all possible keys in $U$, we need enough bits $b$ so that $2^b \ge |U|$, which means $b \ge \log_2 |U|$. Rounding up, each key requires $\lceil \log_2 |U| \rceil$ bits.
For $n$ keys, the total space is $n \lceil \log_2 |U| \rceil$ bits. This can be enormous:
- For $n = 1$ million phone numbers where $\lceil \log_2 |U| \rceil = 34$: we need 34 million bits ≈ 4 MB
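The arithmetic for the phone-number example can be checked with a few lines of Python. This is only a sketch; the helper name `bits_per_key` is made up for illustration.

```python
def bits_per_key(universe_size: int) -> int:
    """Smallest b such that 2**b >= universe_size."""
    return max(1, (universe_size - 1).bit_length())

universe_size = 10**10      # all 10-digit phone numbers
n = 1_000_000               # number of stored keys

b = bits_per_key(universe_size)
total_bits = n * b
print(b)                          # 34
print(total_bits / 8 / 10**6)     # 4.25 (megabytes, using 10**6 bytes per MB)
```

So the "≈ 4 MB" figure is 34 million bits, or about 4.25 MB, for the keys alone.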
The problem: The universe $U$ could be very large (potentially infinite). What if we don’t have $n \lceil \log_2 |U| \rceil$ bits of space?
Information-Theoretic Lower Bound:
To exactly answer a membership query “is $x$ in $S$?”, we need to store all keys of $S$ exactly. If we don’t have enough space to store $S$, we cannot answer membership queries exactly; this is an information-theoretic limitation.
Example: If we have 30 students in a course with full names 10–20 characters long, but we only have 2 characters to represent each student, there’s no way to determine membership correctly all the time. This is the fundamental trade-off: if the available space is less than the information needed to uniquely represent all keys, errors are inevitable.
Approximate Membership Problem
Bloom Filter (input: set $S$ of $n$ keys, parameter $\varepsilon$)
Where $\varepsilon$ is the error rate. A smaller $\varepsilon$ means the Bloom filter takes more space.
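Guarantees:
- If $x \in S$, then BF always answers yes (no false negatives)
- If $x \notin S$, then BF answers no with probability $\ge 1 - \varepsilon$
  - Then BF may answer yes with probability $\le \varepsilon$
  - So $\varepsilon$ is the false positive rate (FPR)
- Space: $1.44\, n \log_2(1/\varepsilon)$ bits
  - $\varepsilon$ is something we choose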
Example: If we are fine with 1 in 32 queries being a false positive ($\varepsilon = 1/32$):
- Space used by such a BF: $1.44\, n \log_2 32 = 7.2\,n$ bits
- This is only about 7 bits per key, versus 34 bits per key (for a universe of size $|U| = 10^{10}$)
- Saves ~80% space compared to exact membership data structures
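The comparison above can be reproduced numerically. The sketch below assumes the space formula $1.44 \log_2(1/\varepsilon)$ bits per key and the 34-bit exact representation from the phone-number example; the function name is illustrative only.

```python
import math

def bloom_bits_per_key(eps: float) -> float:
    """Bits per key used by a Bloom filter with target FPR eps."""
    return 1.44 * math.log2(1 / eps)

eps = 1 / 32
bloom = bloom_bits_per_key(eps)    # 1.44 * 5
exact = 34                         # bits per key for 10-digit phone numbers
saving = 1 - bloom / exact
print(f"{bloom:.1f} bits/key, saving {saving:.0%}")   # 7.2 bits/key, saving 79%
```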
Why No False Negatives, But False Positives?
In many practical applications, false negatives are far worse than false positives:
- Medical testing: Better to flag people without a virus (false positive) than miss people who actually have it (false negative)
- Airport security: Better to briefly question someone who isn’t in the risk database (false positive) than let through someone who is (false negative)
- Plagiarism detection: Better to double-check a non-plagiarized article (false positive) than miss actual plagiarism (false negative)
For these applications, having no false negatives but accepting a small false positive rate is a reasonable trade-off.
Memory Hierarchy Application:
Bloom filters are especially useful when the data lives on a slow device (disk) and only a limited amount of fast memory (RAM) is available:
- Keep the full dataset on disk (slow but all data)
- Keep a compact Bloom filter in RAM (fast, uses little space)
- For negative queries (not in database): Bloom filter quickly filters out most in RAM → avoid expensive disk access
- For positive queries: Accept occasional false positives as cost of fast filtering on negatives
Why this matters: I/O block transfers between disk and main memory are extremely expensive (this is a major bottleneck in memory hierarchies). By using a compact Bloom filter in RAM, we can filter out most negative queries without triggering expensive disk I/O, making this approach practical for real-world systems with large datasets.
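A rough sketch of this pattern follows. `SlowStore` and `lookup` are hypothetical names, and an exact Python set stands in for the filter so the example stays short; a real Bloom filter in its place would occasionally let a non-member through to the store.

```python
class SlowStore:
    """Stand-in for an on-disk table; counts how often it is read."""
    def __init__(self, records: dict):
        self._records = records
        self.reads = 0

    def get(self, key):
        self.reads += 1          # each call models one expensive disk access
        return self._records.get(key)

def lookup(key, might_contain, store: SlowStore):
    # Consult the in-RAM filter first; go to the store only on a "yes".
    if not might_contain(key):
        return None
    return store.get(key)

store = SlowStore({"alice": 1, "bob": 2})
in_ram = {"alice", "bob"}        # placeholder filter (exact, no false positives)
for key in ["alice", "zed", "yan", "bob", "xia"]:
    lookup(key, in_ram.__contains__, store)
print(store.reads)               # 2: the three absent keys never reach the store
```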
Bloom Filter Implementation
Bloom Filter parameters:
A Bloom filter is defined by two parameters:
- $m$ = number of bits in the bit array: $m = 1.44\, n \log_2(1/\varepsilon)$
  - This is chosen so that the false positive rate comes out to $\varepsilon$
  - The constant 1.44 comes from the math: $1.44 \approx \log_2 e = 1/\ln 2$, the value that minimizes space for a given FPR
  - Smaller $\varepsilon$ (fewer false positives) → larger $m$ (more bits needed)
- $k$ = number of hash functions: $k = \frac{m}{n} \ln 2 \approx \log_2(1/\varepsilon)$
  - This is the optimal number of hash functions for the chosen $m$ and $n$
  - Interestingly, $k$ doesn’t depend on $n$ directly, but rather on the ratio $m/n$
- Hash functions: $h_1, \ldots, h_k : U \to \{0, 1, \ldots, m-1\}$ (each function maps keys to bit positions)
Data structure: Bit array of $m$ bits (initially all 0)
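As a small sketch, the two parameters can be computed from $n$ and $\varepsilon$ as follows. Rounding $m$ up and $k$ to the nearest integer is an assumption made here so that both are usable as integers; the notes do not specify a rounding rule.

```python
import math

def bloom_parameters(n: int, eps: float) -> tuple[int, int]:
    """Return (m, k) for n keys and target false positive rate eps."""
    log_inv_eps = math.log2(1 / eps)
    m = math.ceil(1.44 * n * log_inv_eps)   # bits in the array (rounded up)
    k = max(1, round(log_inv_eps))          # number of hash functions
    return m, k

m, k = bloom_parameters(n=1_000_000, eps=1 / 32)
print(m, k)   # roughly 7.2 million bits and 5 hash functions
```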
Insertion algorithm:
To insert the $n$ keys $x_1, \ldots, x_n$ into the Bloom filter:
- Start with a bit array of $m$ bits, all initialized to 0
- For each key $x_i$ (in order $i = 1, \ldots, n$):
  - Apply all $k$ hash functions to $x_i$: compute $h_1(x_i), \ldots, h_k(x_i)$
  - Each hash function returns a bit position in $\{0, 1, \ldots, m-1\}$
  - Set each of those bits to 1 in the bit array
  - If a bit was already 1, it stays 1
Core property: Bits are monotonic — once set to 1, they never go back to 0. This is the crucial property that prevents false negatives.
Query algorithm: To check if element $x$ is in the Bloom filter:
- Apply all $k$ hash functions to $x$: compute $h_1(x), \ldots, h_k(x)$
- Check the bits at those $k$ positions in the bit array
- Answer “yes” if all $k$ bits are 1
- Answer “no” if any of the bits is 0
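The insertion and query procedures can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: salting SHA-256 with the index `i` is just one assumed way to obtain $k$ hash functions, the parameter rounding is an assumption, and one byte is used per bit for readability.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n: int, eps: float):
        self.m = max(1, math.ceil(1.44 * n * math.log2(1 / eps)))
        self.k = max(1, round(math.log2(1 / eps)))
        self.bits = bytearray(self.m)   # one byte per bit, for clarity

    def _positions(self, key: str):
        # Hypothetical stand-in for h_1..h_k: salt SHA-256 with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1          # bits only ever go from 0 to 1

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter(n=3, eps=1 / 32)
for name in ["alice", "bob", "carol"]:
    bf.add(name)
print("alice" in bf)   # True: an inserted key is never reported absent
```

A query for a key that was never added usually prints `False`, but may print `True` with probability around $\varepsilon$.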
Why No False Negatives:
If a key $x$ was in the original set $S$, then during insertion we set all $k$ bits at positions $h_1(x), \ldots, h_k(x)$ to 1. Since bits are never reset to 0, when we later query $x$, all $k$ bits will still be 1, so we answer yes. False negatives are impossible.
Why False Positives Can Occur:
A false positive occurs when a key $y$ that was NOT in $S$ has all $k$ of its bits set to 1. This happens due to hash collisions: other keys in $S$ may have hashed to the same bit positions, setting those bits to 1 even though $y$ was never inserted.
Example: Suppose $y \notin S$ has three hash locations (positions 5, 12, and 18). If three different keys in $S$ separately hashed to positions 5, 12, and 18 (due to collisions), those bits would all be set to 1. When we query $y$, all its bits appear to be 1, so we incorrectly answer yes: a false positive.
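The scenario in this example can be replayed with hand-picked positions. The table below is entirely made up (it replaces real hash functions) so that the collision is visible: keys `a`, `b`, and `c` each cover one of the positions 5, 12, and 18.

```python
# Hypothetical hash positions, fixed by hand instead of computed by hashing.
positions = {
    "a": [5, 1, 9],
    "b": [12, 3, 7],
    "c": [18, 2, 14],
    "y": [5, 12, 18],    # y is never inserted
}

m = 20
bits = [0] * m
for key in ["a", "b", "c"]:          # insert only a, b, c
    for pos in positions[key]:
        bits[pos] = 1

def query(key: str) -> bool:
    return all(bits[pos] == 1 for pos in positions[key])

print(query("y"))   # True, although y was never inserted
```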
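Effect of increasing $k$:
- With more hash functions (larger $k$), a query must find all $k$ bits set, so the filter gets more tries to detect a non-member
- More tries → fewer false positives, as long as the bit array is not too full
- But each insertion also sets up to $k$ bits, so a $k$ that is too large fills the array and pushes the false positive rate back up; the optimal $k$ balances these two effects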
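To see both effects at once, the sketch below evaluates the standard approximation $\text{FPR} \approx (1 - e^{-kn/m})^k$ for a range of $k$, with $m/n = 7.2$ bits per key taken from the 1-in-32 example. The function name and the range of $k$ are arbitrary choices for illustration.

```python
import math

def fpr(k: int, bits_per_key: float) -> float:
    """Approximate FPR (1 - e^{-k n / m})^k, where m / n = bits_per_key."""
    return (1 - math.exp(-k / bits_per_key)) ** k

bits_per_key = 7.2
rates = {k: fpr(k, bits_per_key) for k in range(1, 13)}
best_k = min(rates, key=rates.get)
for k, rate in rates.items():
    print(k, f"{rate:.4f}")
print("best k:", best_k)   # best k: 5
```

The rate first drops as $k$ grows and then rises again, with the minimum near $k = (m/n) \ln 2 \approx 5$.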
Analysis: Proving False Positive Rate is ≤ ε
Strategy: We’ll compute the false positive rate mathematically by calculating the probability that a queried element’s bits all happen to be 1 by accident (due to other keys setting them).
Intuition: For a false positive to occur on a query for element $y$ (which was never inserted), all $k$ of its hash positions must have been set to 1 by some combination of the $n$ inserted keys. This is a probability problem.
Detailed calculation:
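Consider a single bit $j$ in the Bloom filter. We’ll calculate the probability that this bit is 0 after all $n$ insertions (and thus the probability it’s 1).
Step 1: Probability a single insertion touches bit $j$
When we insert a key $x$, we set $k$ random bits to 1. Treating these as $k$ distinct, uniformly random positions, the probability that bit $j$ is one of the chosen bits is $k/m$. Therefore, the probability that bit $j$ is NOT touched is $1 - k/m$.
Step 2: Probability bit $j$ survives one insertion as 0
If bit $j$ starts as 0, it remains 0 only if we don’t touch it:
$$\Pr[\text{bit } j \text{ is still 0 after one insertion}] = 1 - \frac{k}{m}$$
Step 3: Probability bit $j$ survives all $n$ insertions as 0
The $n$ insertions are independent. Bit $j$ remains 0 only if none of the $n$ keys touch it:
$$\Pr[\text{bit } j \text{ is 0 after } n \text{ insertions}] = \left(1 - \frac{k}{m}\right)^n$$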
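Step 4: Apply a limit approximation
For large $t$, we can use the well-known approximation:
$$\left(1 - \frac{1}{t}\right)^t \approx \frac{1}{e}$$
Rewriting our expression with $t = m/k$:
$$\left(1 - \frac{k}{m}\right)^n = \left(\left(1 - \frac{1}{m/k}\right)^{m/k}\right)^{kn/m}$$
For large $m$ (so that $m/k$ is large), the inner term approximates $1/e$:
$$\left(1 - \frac{k}{m}\right)^n \approx e^{-kn/m}$$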
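A quick numerical check shows how close the approximation is. It compares $(1 - k/m)^n$ with $e^{-kn/m}$ for one arbitrarily chosen parameter set (the values of `n`, `m`, and `k` are illustrative only).

```python
import math

n, m, k = 1_000, 7_200, 5
exact = (1 - k / m) ** n           # probability that a fixed bit is still 0
approx = math.exp(-k * n / m)      # limit form e^{-kn/m}
print(f"{exact:.5f} {approx:.5f}")
```

The two values agree to about three decimal places here, and the gap shrinks as $m/k$ grows.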
Step 5: Probability that bit $j$ is 1
Therefore, the probability that bit $j$ is set to 1 after all $n$ insertions is:
$$\Pr[\text{bit } j = 1] \approx 1 - e^{-kn/m}$$
This is the probability that a single random bit in the filter is 1.
Step 6: False positive probability
For a queried element $y$ (not in the database), a false positive occurs if all $k$ of its hash positions happen to be set to 1 by some combination of the $n$ inserted keys.
Treating the $k$ bits as independent (a standard simplifying approximation):
$$\Pr[\text{false positive}] \approx \left(1 - e^{-kn/m}\right)^k$$
This is the false positive rate for the Bloom filter.
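The formula $(1 - e^{-kn/m})^k$ can be compared against a small simulation. The sketch below models ideal hash functions by drawing uniformly random positions (an assumption, since no real keys are hashed); the parameter values and the seed are arbitrary.

```python
import math
import random

def simulate_fpr(n: int, m: int, k: int, queries: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    bits = bytearray(m)
    for _ in range(n):                   # insert n keys, k random bits each
        for _ in range(k):
            bits[rng.randrange(m)] = 1
    hits = 0
    for _ in range(queries):             # query keys that were never inserted
        if all(bits[rng.randrange(m)] for _ in range(k)):
            hits += 1
    return hits / queries

n, m, k = 1_000, 7_200, 5
predicted = (1 - math.exp(-k * n / m)) ** k
observed = simulate_fpr(n, m, k, queries=20_000)
print(f"predicted {predicted:.4f}, observed {observed:.4f}")
```

Both numbers should land close to $1/32 \approx 0.031$, with the observed rate fluctuating slightly from run to run if the seed is changed.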
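Step 7: Choosing parameters to achieve target $\varepsilon$
To make this analysis concrete, the Bloom filter designers chose:
- $m = 1.44\, n \log_2(1/\varepsilon)$ (number of bits)
- $k = \log_2(1/\varepsilon)$ (number of hash functions)
With these choices, $kn/m = 1/1.44 \approx \ln 2$, so $e^{-kn/m} \approx 1/2$ and the false positive rate works out to:
$$\left(1 - e^{-kn/m}\right)^k \approx \left(\frac{1}{2}\right)^{\log_2(1/\varepsilon)} = \varepsilon$$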
The derivation is algebraically involved, but the important takeaway is that we can tune and to achieve any desired false positive rate .
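As a sanity check on this parameter choice, the sketch below plugs $m/n = 1.44 \log_2(1/\varepsilon)$ and $k = \log_2(1/\varepsilon)$ into the approximation $(1 - e^{-kn/m})^k$ for a few targets. Leaving $k$ unrounded is a simplification made here; an implementation would need an integer $k$.

```python
import math

def predicted_fpr(eps: float) -> float:
    bits_per_key = 1.44 * math.log2(1 / eps)   # m / n
    k = math.log2(1 / eps)                     # left unrounded in this check
    return (1 - math.exp(-k / bits_per_key)) ** k

for eps in [1 / 8, 1 / 32, 1 / 1024]:
    print(f"target {eps:.5f}  predicted {predicted_fpr(eps):.5f}")
```

The predicted rate matches the target to within a couple of percent; the small excess comes from rounding $1/\ln 2 \approx 1.4427$ down to 1.44.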