Lecture 26 (05/13/2026) - Approximate Nearest Neighbor; Locality Sensitive Hashing; Course Review
Scribes: Naufil Faruqi and Samuel Sokol
Summary of Lecture
- Introducing the Nearest Neighbor Search (NNS) problem by comparing exact membership to similarity queries
- Defining Exact NNS and demonstrating that the trivial brute force algorithm takes $O(nd)$ query time for $n$ points in $d$ dimensions
- Discussing the hardness of Exact NNS: achieving strictly sub-linear query time would violate the Strong Exponential Time Hypothesis (SETH)
- Relaxing the problem to Approximate NNS using an approximation factor $c$ to bypass the exact NNS bottleneck
- Briefly introducing K-Nearest Neighbors (K-NN) and its approximate variants for retrieving multiple similar data points
Introduction to Nearest Neighbor Search
When dealing with large datasets, a fundamental problem is answering queries against a stored set. Suppose we have a set $S$. A basic membership query asks: “Is the query $q$ in $S$?”
However, in many real-world applications (like image search or recommendation systems), we are more interested in similarity. Instead of exact matches, the question becomes: “Is the query $q$ similar to some key in $S$?” This gives rise to the Nearest Neighbor Search (NNS) problem.
Exact Nearest Neighbor Search (NNS)
In the formal Nearest Neighbor Search problem, we are given a query point $q$ and want to return the nearest neighbor of $q$ in the set $S$. Both the query and the data points live in $d$-dimensional real coordinate space, meaning $q \in \mathbb{R}^d$ and $S \subseteq \mathbb{R}^d$, with $|S| = n$.
The goal is to find the point $x^* \in S$ that minimizes the distance to $q$:
$$x^* = \arg\min_{x \in S} \mathrm{dist}(x, q)$$
Trivial Solution (Brute Force): The most straightforward approach is to compute the exact distance $\mathrm{dist}(x, q)$ for all $x \in S$ and output the point that attains the smallest value. Because computing the distance between two $d$-dimensional vectors takes $O(d)$ time, and we must do this for $n$ points, the brute force query time is $O(nd)$.
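The brute force scan can be sketched in a few lines. This is a minimal illustration that assumes Euclidean distance and points stored as tuples; the function name `brute_force_nn` and the sample data are made up for this example.

```python
import math

def brute_force_nn(points, query):
    """Return (index, distance) of the point in `points` closest to `query`."""
    best_index, best_dist = -1, math.inf
    for i, p in enumerate(points):      # one pass over all n points
        dist = math.dist(p, query)      # O(d) work per point (Euclidean, assumed)
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index, best_dist

# Hypothetical sample data
S = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
q = (0.9, 1.2)
index, distance = brute_force_nn(S, q)
print(S[index], distance)
```

The loop touches every point once and does $O(d)$ work each time, which is where the $O(nd)$ bound comes from.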
Hardness of Exact NNS
A natural question is whether an exact NNS algorithm can run significantly faster than the brute force $O(nd)$ time, perhaps avoiding a linear scan over every point.
Theorem (Ryan Williams and Alman): Any algorithm that solves exact NNS in strictly sub-linear time in $n$, specifically with $O(n^{1-\epsilon})$ query time for some constant $\epsilon > 0$, would violate the Strong Exponential Time Hypothesis (SETH).
Analysis: SETH is a major conjecture in computational complexity stating, roughly, that Boolean satisfiability cannot be solved significantly faster than exhaustive search. Because SETH is widely believed to hold, this result suggests that a fast, exact algorithm for NNS in high dimensions is very unlikely to exist. This inherent bottleneck forces us to consider relaxations of the problem.
Approximate Nearest Neighbor Search
Approximate NNS
Since exact NNS is computationally prohibitive, we relax the requirement and allow the algorithm to return an approximate nearest neighbor.
For a given approximation factor $c > 1$, the Approximate NNS problem requires us to return a point $x' \in S$ whose distance to the query $q$ is no more than $c$ times the distance to the true nearest neighbor. Mathematically, we must find an $x' \in S$ such that:
$$\mathrm{dist}(x', q) \le c \cdot \min_{x \in S} \mathrm{dist}(x, q)$$
By sacrificing exactness (controlled by the parameter $c$), we can drastically improve query times.
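To make the approximation guarantee concrete, the following sketch checks whether a candidate answer satisfies it. It assumes Euclidean distance and calls the approximation factor `c`; the function name and sample values are hypothetical, and the check itself uses a brute force scan, so it is only useful for validating answers, not for fast queries.

```python
import math

def is_c_approximate(points, query, candidate, c):
    """True if `candidate` is within c times the true nearest-neighbor distance."""
    nearest = min(math.dist(p, query) for p in points)
    return math.dist(candidate, query) <= c * nearest

# Hypothetical sample data
S = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
q = (0.9, 1.2)
print(is_c_approximate(S, q, (0.0, 0.0), c=2))
print(is_c_approximate(S, q, (0.0, 0.0), c=7))
```

Here the true nearest neighbor is about 0.22 away and the candidate `(0.0, 0.0)` is 1.5 away, so the candidate is acceptable for a loose factor but not for a tight one.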
K-Nearest Neighbors (K-NN)
A closely related variant is K-NN. In the exact K-NN problem, we return the $k$ nearest neighbors to the query rather than just the single closest point. Similarly, there is an Approximate K-NN variant, which has been a topic of recent research (e.g., PODS 2020), offering similar time-accuracy tradeoffs for retrieving multiple similar points.
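A brute force version of exact K-NN is a small change to the single-neighbor scan. The sketch below assumes Euclidean distance and uses the standard library's `heapq.nsmallest`; the function name and sample data are illustrative only.

```python
import heapq
import math

def brute_force_knn(points, query, k):
    """Return the k points closest to `query`, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Hypothetical sample data
S = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (2.0, 2.0)]
q = (0.9, 1.2)
print(brute_force_knn(S, q, k=2))
```

This still computes all $n$ distances, so it has the same linear-scan cost that the approximate variants aim to avoid.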
Sets and Jaccard Distance
While Euclidean and Hamming distances are useful for vector spaces, we often need to measure the similarity between collections of items or sets.
Jaccard Similarity / Distance: Given two sets $A$ and $B$, the Jaccard similarity is defined as the size of their intersection divided by the size of their union:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The similarity takes a value between 0 and 1, where 1 indicates the sets are identical and 0 indicates they are completely disjoint. The corresponding Jaccard distance is $1 - J(A, B)$. Min-Hash is a standard LSH technique built specifically to estimate Jaccard similarity for large sets.
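The sketch below computes the Jaccard similarity exactly and then estimates it with Min-Hash. It is a toy version under stated assumptions: the universe is small enough to store explicit random permutations (practical implementations use hash functions instead), and all names, the universe size, and the number of permutations are chosen for illustration.

```python
import random

def jaccard_similarity(a, b):
    """|A intersect B| / |A union B|; taken to be 1.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def minhash_signature(items, permutations):
    """For each permutation, record the smallest rank of any item in the set."""
    return [min(rank[x] for x in items) for rank in permutations]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of permutations on which the two min-hashes agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Assumed toy setup: universe {0, ..., 19} and 500 random permutations
rng = random.Random(0)
universe = list(range(20))
permutations = []
for _ in range(500):
    order = universe[:]
    rng.shuffle(order)
    permutations.append({item: rank for rank, item in enumerate(order)})

A = set(range(0, 12))
B = set(range(4, 16))
estimate = estimate_jaccard(minhash_signature(A, permutations),
                            minhash_signature(B, permutations))
print(jaccard_similarity(A, B), estimate)
```

For a single random permutation, the two min-hashes agree exactly when the lowest-ranked element of $A \cup B$ lies in $A \cap B$, which happens with probability $J(A, B)$. Averaging over many permutations therefore estimates the similarity.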
Locality Sensitive Hashing (LSH)
Data Structure
The hash function depends on the definition of distance. For Hamming distance (see Hamming Distance), a hash samples a single coordinate:
$$h_i(x) = x_i$$
For Euclidean distance (see LSH for Euclidean Space), a hash is a random projection of the point.
LSH Family
Related: See Alexandr Andoni’s work on data-dependent LSH. Linked here: https://arxiv.org/pdf/1501.01062
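LSH family:
A family of hash functions
$$\mathcal{H} = \{h : X \to U\}$$
where $X$ is the space of points and $U$ is the set of hash values (buckets),
is called $(r, cr, p_1, p_2)$-sensitive if for any two points $x, y \in X$ and a random $h$ from $\mathcal{H}$,
- If $\mathrm{dist}(x, y) \le r$, then $\Pr[h(x) = h(y)] \ge p_1$
- If $\mathrm{dist}(x, y) \ge cr$, then $\Pr[h(x) = h(y)] \le p_2$
A family is interesting if $p_1 > p_2$.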
ANN Algorithm
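- Find an LSH family $\mathcal{H}$ for the problem distance.
- For $j = 1, \dots, L$, sample $g_j = (h_{j,1}, \dots, h_{j,k})$ with each $h_{j,i}$ drawn independently from $\mathcal{H}$.
Each $g_j$ defines one hash table, so there are $L$ tables total.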
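The table construction can be sketched as a small index class. This is an illustrative sketch, not the lecture's exact data structure: the class name `LSHIndex`, the parameter names `k` (hashes concatenated per table) and `L` (number of tables), and the query rule (scan every point that shares a bucket with the query) are assumptions. The demo plugs in bit sampling over binary tuples as the hash family.

```python
import random
from collections import defaultdict

class LSHIndex:
    """L hash tables, each keyed by a concatenation of k sampled hashes."""

    def __init__(self, sample_hash, k, L, dist):
        self.dist = dist
        # One list of k hash functions per table
        self.g = [[sample_hash() for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, j, x):
        return tuple(h(x) for h in self.g[j])

    def insert(self, x):
        for j, table in enumerate(self.tables):
            table[self._key(j, x)].append(x)

    def query(self, q):
        """Closest point among those sharing a bucket with q, or None."""
        candidates = set()
        for j, table in enumerate(self.tables):
            candidates.update(table.get(self._key(j, q), []))
        return min(candidates, key=lambda x: self.dist(x, q), default=None)

# Demo with bit sampling on binary tuples (parameters chosen arbitrarily)
d = 16
rng = random.Random(1)

def sample_bit_hash():
    i = rng.randrange(d)        # pick a random coordinate
    return lambda x: x[i]

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

index = LSHIndex(sample_bit_hash, k=4, L=8, dist=hamming)
points = [tuple(rng.randrange(2) for _ in range(d)) for _ in range(50)]
for p in points:
    index.insert(p)

q = points[0]
print(index.query(q))
```

Larger `k` makes each bucket more selective, while larger `L` gives a true near neighbor more chances to collide with the query in at least one table.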
Hamming Distance
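Consider binary strings in $\{0,1\}^d$, where the Hamming distance counts the positions in which two strings differ:
$$\mathrm{dist}(x, y) = |\{i : x_i \ne y_i\}|$$
Let $h_i$ sample the $i$-th bit, with $i$ chosen uniformly at random from $\{1, \dots, d\}$:
$$h_i(x) = x_i$$
- If $\mathrm{dist}(x, y) \le r$, then $\Pr[h_i(x) \ne h_i(y)] \le r/d$. Therefore, $\Pr[h_i(x) = h_i(y)] \ge 1 - r/d$.
- If $\mathrm{dist}(x, y) \ge cr$, then $\Pr[h_i(x) \ne h_i(y)] \ge cr/d$. Therefore, $\Pr[h_i(x) = h_i(y)] \le 1 - cr/d$.
Sampling is a $(r, cr, 1 - r/d, 1 - cr/d)$-sensitive hash family for Hamming distance.
So,
$$g(x) = (h_{i_1}(x), \dots, h_{i_k}(x))$$
where each $h_{i_j}$ is chosen independently from the above LSH family.
Thus,
$$\Pr[g(x) = g(y)] \ge (1 - r/d)^k \text{ if } \mathrm{dist}(x, y) \le r, \qquad \Pr[g(x) = g(y)] \le (1 - cr/d)^k \text{ if } \mathrm{dist}(x, y) \ge cr$$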
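The collision probabilities for bit sampling can be checked on a small example. The sketch below computes them exactly with fractions; the function names and the two sample strings are made up for illustration, and the concatenated case assumes the sampled indices are chosen independently.

```python
from fractions import Fraction

def bit_sampling_collision(x, y):
    """Exact probability, over a uniformly random index i, that x[i] == y[i]."""
    d = len(x)
    agree = sum(1 for a, b in zip(x, y) if a == b)
    return Fraction(agree, d)

def concatenated_collision(x, y, k):
    """Probability that k independently sampled bits all agree."""
    return bit_sampling_collision(x, y) ** k

x = (0, 1, 1, 0, 1, 0, 0, 1)
y = (0, 1, 0, 0, 1, 1, 0, 1)   # differs from x in 2 of 8 positions
print(bit_sampling_collision(x, y))      # 3/4
print(concatenated_collision(x, y, 3))   # 27/64
```

With Hamming distance 2 in dimension 8, a single sampled bit agrees with probability $1 - 2/8 = 3/4$, and concatenating three independent samples drives the collision probability down to $(3/4)^3$.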
LSH for Euclidean Space
A random projection onto a lower-dimensional hyperplane gives an LSH.
For Euclidean distance, the “bucket” of a point is where its projection lands on the hyperplane.
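One common way to turn a random projection into discrete buckets is to project onto a random line and cut the line into segments of a fixed width. The sketch below uses that construction as an assumption; it is not necessarily the one presented in lecture. The bucket width `w`, the Gaussian projection vector, and all names and sample points are illustrative.

```python
import math
import random

def sample_projection_hash(d, w, rng):
    """Sample h(x) = floor((a . x + b) / w) with Gaussian a and uniform offset b."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)
    def h(x):
        projection = sum(ai * xi for ai, xi in zip(a, x))
        return math.floor((projection + b) / w)
    return h

# Assumed parameters: 3 dimensions, bucket width 4.0, 2000 sampled hashes
rng = random.Random(0)
hashes = [sample_projection_hash(3, 4.0, rng) for _ in range(2000)]

def collision_rate(x, y):
    """Empirical fraction of sampled hashes that put x and y in the same bucket."""
    return sum(h(x) == h(y) for h in hashes) / len(hashes)

q = (1.0, 2.0, 3.0)
near = (1.1, 2.0, 3.1)
far = (9.0, -4.0, 7.0)
print(collision_rate(q, near), collision_rate(q, far))
```

Points that are close in Euclidean distance project to nearby positions on the line and usually share a bucket, while distant points are separated far more often, which is the locality-sensitive behavior the definition asks for.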
Final Topics Overview
In Class
- Basic Probability:
  - Expectation, Variance, Markov, Chebyshev, Chernoff
- Streaming Algorithms
  - Uniform Sampling
  - Counting (Morris Counter)
  - Approximate Median
  - Count-min Sketch, Frequency Approximation, Heavy Hitters
  - Frequency Moment Estimation (AMS Sampling)
  - Distinct Elements
- Online Algorithms
  - Toy
  - List Update
  - Expert’s MWU
  - Caching, Paging, Resource Augmentation
- JL Lemma, NNS
On Brightspace
- Basic Probability: Expectation, variance, Markov, Chebyshev, Chernoff
- Streaming Algorithms
- Online Algorithms
- Dimension Reduction
- Nearest Neighbor Search