
Lecture 26 (05/13/2026) - Approximate Nearest Neighbor; Locality Sensitive Hashing; Course Review

Scribes: Naufil Faruqi and Samuel Sokol

  • Introducing the Nearest Neighbor Search (NNS) problem by comparing exact membership to similarity queries
  • Defining Exact NNS and demonstrating that the trivial brute force algorithm takes $O(nd)$ query time
  • Discussing the hardness of Exact NNS: achieving strictly sub-linear query time would violate the Strong Exponential Time Hypothesis (SETH)
  • Relaxing the problem to Approximate NNS using an approximation factor $c > 1$ to bypass the exact NNS bottleneck
  • Briefly introducing K-Nearest Neighbors (K-NN) and its approximate variants for retrieving multiple similar data points

When dealing with large datasets, a fundamental problem is evaluating queries against a membership set. Suppose we have a set $S = \{x_1, \dots, x_n\}$. A basic membership query asks: “Is the query $q$ in $S$?”

However, in many real-world applications (like image search or recommendation systems), we are more interested in similarity. Instead of exact matches, the question becomes: “Is the query $q$ similar to some key in $S$?” This gives rise to the Nearest Neighbor Search (NNS) problem.

In the formal Nearest Neighbor Search problem, we are given a query point $q$ and we want to return the nearest neighbor of $q$ in the set $S$. Both the query $q$ and the data points $x_i$ lie in a $d$-dimensional real coordinate space, meaning $x_i \in \mathbb{R}^d$ and $q \in \mathbb{R}^d$.

The goal is to find the point $x_i$ that minimizes the distance to $q$:

$$\min_{1 \le i \le n} d(q, x_i)$$

Trivial Solution (Brute Force): The most straightforward approach is to compute the exact distance $d(q, x_i)$ for all $i \in [n]$ and output the point attaining the smallest distance. Because computing the distance between two $d$-dimensional vectors takes $O(d)$ time, and we must do this for $n$ points, the brute force query time is $O(nd)$.
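The linear scan described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and points stored as plain tuples; the names `euclidean` and `brute_force_nn` are hypothetical and not from the lecture.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors; O(d) time."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def brute_force_nn(points, q, dist=euclidean):
    """Scan all n points and return (index, distance) of the closest one: O(nd)."""
    best_i, best_d = -1, math.inf
    for i, x in enumerate(points):
        d_i = dist(q, x)          # one O(d) distance computation per point
        if d_i < best_d:
            best_i, best_d = i, d_i
    return best_i, best_d

S = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
idx, d = brute_force_nn(S, (0.9, 1.2))
print(idx, d)
```

Every query touches every stored point, which is exactly the cost the rest of the lecture tries to avoid.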

A natural question is whether we can achieve an exact NNS algorithm that operates significantly faster than the brute force $O(nd)$ time, perhaps avoiding the need to scan every single point linearly.

Theorem (Ryan Williams and Alman): Any algorithm that solves exact NNS in time strictly sub-linear in $n$, specifically one running in $O(n^{0.999} d)$ query time, would violate the Strong Exponential Time Hypothesis (SETH).

Analysis: SETH is a major conjecture in computational complexity asserting, roughly, that boolean satisfiability cannot be solved significantly faster than exhaustive search. Because violating SETH is considered highly unlikely, this result indicates that a fast, exact algorithm for NNS in high dimensions is very unlikely to exist. This inherent bottleneck forces us to consider relaxations of the problem.

Since exact NNS is computationally prohibitive, we relax the requirement and allow the algorithm to return an approximate nearest neighbor.

For a given approximation factor $c > 1$, the Approximate NNS problem requires us to return a point $x_j$ such that its distance to the query is no more than $c$ times the distance to the true nearest neighbor. Mathematically, we must find an $x_j$ such that:

$$d(q, x_j) \le c \cdot \min_{1 \le i \le n} d(q, x_i)$$

By sacrificing exactness (controlled by the parameter $c$), we can drastically improve query times.
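To make the approximation guarantee concrete, the following sketch checks whether a returned point satisfies the $c$-approximate condition. It assumes Euclidean distance via the standard library's `math.dist`; the helper name `is_c_approximate` is hypothetical. The check itself uses a full scan, so it is only useful for validating answers, not for answering queries quickly.

```python
import math

def is_c_approximate(points, q, j, c, dist=math.dist):
    """True if points[j] is within c times the true nearest-neighbor distance of q."""
    nearest = min(dist(q, x) for x in points)   # exact answer, found by full scan
    return dist(q, points[j]) <= c * nearest

S = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
q = (0.4, 0.0)
print(is_c_approximate(S, q, j=1, c=2.0))   # distance 0.6 vs. bound 2 * 0.4
print(is_c_approximate(S, q, j=2, c=2.0))   # distance 4.6 vs. bound 2 * 0.4
```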

A closely related variant is K-NN. In the exact K-NN problem, we return the $k$ nearest neighbors to the query $q$ rather than just the single closest point. Similarly, there exists an Approximate K-NNS variant for $c > 1$, which has been a topic of recent research (e.g., PODS 2020), offering similar time-accuracy tradeoffs for retrieving multiple similar points.
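A brute-force version of exact K-NN can be sketched with the standard library's `heapq.nsmallest`. This is an illustrative baseline only, assuming Euclidean distance; `brute_force_knn` is a hypothetical name.

```python
import heapq
import math

def brute_force_knn(points, q, k, dist=math.dist):
    """Return the indices of the k points closest to q, nearest first."""
    return heapq.nsmallest(k, range(len(points)), key=lambda i: dist(q, points[i]))

S = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (2.0, 2.0)]
print(brute_force_knn(S, (0.9, 1.2), k=2))
```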

While Euclidean and Hamming distances are useful for vector spaces, we often need to measure the similarity between collections of items or sets.

Jaccard Similarity / Distance: Given two sets $A$ and $B$, the Jaccard similarity is defined as the size of their intersection divided by the size of their union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

This similarity takes a value between 0 and 1, where 1 indicates the sets are identical and 0 indicates they are completely disjoint. The corresponding Jaccard distance is $1 - J(A, B)$. Min-Hash is a standard LSH technique built specifically to estimate the Jaccard similarity of large sets.
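The sketch below computes the Jaccard similarity exactly and then estimates it with Min-Hash signatures. It relies on the standard property that, for a random hash function, the probability that two sets share the same minimum hash value equals their Jaccard similarity. The linear hashes $(ax + b) \bmod P$ over integer items are an assumption chosen for simplicity (they are only approximately min-wise independent), and all function names are hypothetical.

```python
import random

P = (1 << 61) - 1  # a large prime modulus (assumption: items are integers below P)

def jaccard(A, B):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    A, B = set(A), set(B)
    if not A and not B:
        return 1.0
    return len(A & B) / len(A | B)

def make_minhash(num_hashes, seed=0):
    """Return a function mapping a non-empty set of ints to its Min-Hash signature."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(num_hashes)]
    def signature(items):
        # For each hash function, keep only the minimum hash value over the set.
        return [min((a * x + b) % P for x in items) for a, b in coeffs]
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)

universe = random.Random(1).sample(range(10**9), 120)
A, B = set(universe[:80]), set(universe[40:])   # 40 shared items out of 120
sig = make_minhash(512)
true_j = jaccard(A, B)
est_j = estimate_jaccard(sig(A), sig(B))
print(true_j, est_j)
```

The signature length (512 here) trades accuracy for space: more hash functions give a tighter estimate.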

Preprocessing / Storage: $O(n^{1+p} d)$

Query time: $O(n^p d)$

The exponent $p$ depends on the definition of distance. For Hamming distance (see Hamming Distance):

$$p = \frac{1}{c}$$

For Euclidean distance (see LSH for Euclidean Space):

$$p = \frac{1}{c^2}$$
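To get a feel for these bounds, the sketch below plugs concrete values into the stated expressions with all hidden constants dropped. The numbers are illustrative only, and `lsh_costs` is a hypothetical helper.

```python
def lsh_costs(n, d, c, metric):
    """Evaluate the stated bounds with constants dropped (illustrative only)."""
    if metric == "hamming":
        p = 1.0 / c
    elif metric == "euclidean":
        p = 1.0 / c ** 2
    else:
        raise ValueError(f"unknown metric: {metric}")
    return {
        "p": p,
        "query": n ** p * d,            # n^p * d
        "storage": n ** (1 + p) * d,    # n^(1+p) * d
        "brute_force_query": n * d,     # baseline linear scan
    }

for metric in ("hamming", "euclidean"):
    print(metric, lsh_costs(n=10**6, d=128, c=2, metric=metric))
```

With $n = 10^6$ and $c = 2$, the query term $n^p$ falls from $10^6$ to about $10^3$ for Hamming distance, at the price of super-linear storage.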

Related: See Alexandr Andoni’s work on data-dependent LSH: https://arxiv.org/pdf/1501.01062

LSH family:

A family $\mathcal{H}$ of hash functions

$$\mathcal{H} = \{h_1, \ldots, h_{|\mathcal{H}|}\}$$

where

$$h_i : U \longrightarrow U'$$

is called $(r, cr, p_1, p_2)$-sensitive if for any two points $x, y \in U$ and a random $h$ drawn from $\mathcal{H}$,

  • If $d(x,y) \le r$, then $h(x) = h(y)$ with probability at least $p_1$
  • If $d(x,y) \ge cr$, then $h(x) = h(y)$ with probability at most $p_2$

A family is interesting if

$$p_1 > p_2.$$
  • Find $\mathcal{H}$ for the problem distance.

  • For $h_{ij}$ drawn independently from $\mathcal{H}$, build $L$ concatenated hash functions and apply each of them to every point $x_1, \ldots, x_n$:

    $$\begin{array}{ccl} x_1 & \Longrightarrow & g_1 = (h_{11}, h_{21}, \ldots, h_{k1}) \\[6pt] \vdots & & g_2 = (h_{12}, h_{22}, \ldots, h_{k2}) \\[6pt] x_n & & \vdots \\[6pt] & & g_L = (h_{1L}, h_{2L}, \ldots, h_{kL}) \end{array}$$

    Each $g_j$ ($1 \le j \le L$) defines one hash table, so there are $L$ tables total.
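The construction above can be sketched as a small index with $L$ tables, each keyed by a concatenation of $k$ hash functions. This is a minimal sketch, not the lecture's exact data structure: the class `LSHIndex`, the helper `sample_bit_hash`, and the parameter values are all hypothetical, and the example uses bit sampling over binary tuples as the assumed family $\mathcal{H}$.

```python
import random
from collections import defaultdict

class LSHIndex:
    def __init__(self, sample_hash, k, L, seed=0):
        rng = random.Random(seed)
        # g_j = (h_{1j}, ..., h_{kj}): k independent draws from the family per table.
        self.gs = [[sample_hash(rng) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _key(self, j, x):
        return tuple(h(x) for h in self.gs[j])

    def add(self, x):
        idx = len(self.points)
        self.points.append(x)
        for j, table in enumerate(self.tables):
            table[self._key(j, x)].append(idx)

    def query(self, q, dist):
        """Return the index of the closest colliding point, or None if no collision."""
        candidates = set()
        for j, table in enumerate(self.tables):
            candidates.update(table.get(self._key(j, q), ()))
        if not candidates:
            return None
        return min(candidates, key=lambda i: dist(q, self.points[i]))

def sample_bit_hash(d):
    """Assumed family: h_i(x) = x[i] for a uniformly random coordinate i."""
    def sample(rng):
        i = rng.randrange(d)
        return lambda x: x[i]
    return sample

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

d = 32
rng = random.Random(1)
data = [tuple(rng.randrange(2) for _ in range(d)) for _ in range(50)]
index = LSHIndex(sample_bit_hash(d), k=8, L=10)
for x in data:
    index.add(x)
hit = index.query(data[7], hamming)
print(hit)
```

Concatenating $k$ functions lowers the collision probability of far pairs to at most $p_2^k$, while using $L$ independent tables gives a near pair a chance of at least $1 - (1 - p_1^k)^L$ to collide in some table.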

Consider binary strings in $\{0,1\}^d$, that is, strings of $d$ bits such as $(0, 1, 1, \ldots)$.

Let $h_i$ sample the $i$-th bit, where $i$ is chosen uniformly at random:

$$h_i : \{0,1\}^d \to \{0,1\}.$$

Since $x$ and $y$ differ in exactly $d(x,y)$ of the $d$ positions, $\Pr(h_i(x) = h_i(y)) = 1 - \frac{d(x,y)}{d}$. Hence:

  • If $d(x,y) \le r$, then $\Pr(h_i(x) = h_i(y)) \ge \frac{d-r}{d} = 1 - \frac{r}{d}$. Therefore, $p_1 = 1 - \frac{r}{d}$.
  • If $d(x,y) \ge cr$, then $\Pr(h_i(x) = h_i(y)) \le \frac{d-cr}{d} = 1 - \frac{cr}{d}$. Therefore, $p_2 = 1 - \frac{cr}{d}$.

Bit sampling is an $(r, cr, 1 - \frac{r}{d}, 1 - \frac{cr}{d})$-sensitive hash family for Hamming distance.
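The collision probability of bit sampling can be verified exactly by enumerating all $d$ coordinates, as in the small sketch below. The two example strings and the function names are made up for illustration.

```python
from fractions import Fraction

def hamming(x, y):
    """Number of positions in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def collision_probability(x, y):
    """Exact probability, over a uniform coordinate i, that x[i] == y[i]."""
    d = len(x)
    return Fraction(sum(x[i] == y[i] for i in range(d)), d)

x = (0, 1, 1, 0, 1, 0, 0, 1)
y = (0, 1, 0, 0, 1, 1, 0, 1)
print(hamming(x, y), collision_probability(x, y))
```

The probability equals one minus the Hamming distance divided by the number of bits, which is where the values of $p_1$ and $p_2$ come from.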

So,

$$g = (h_1, h_2, \ldots, h_k),$$

where each $h_i$ is chosen from the above LSH family.

Thus,

$$g : \{0,1\}^d \to \{0,1\}^k.$$

Let $h$ be a hash function on $\mathbb{R}^d$, applied to a point $p = (p_1, \ldots, p_d)$.

A random projection onto a $d'$-dimensional hyperplane gives an LSH.

For Euclidean distance, the “bucket” is what point on the hyperplane you land on.
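One commonly used way to turn a random projection into buckets is to project onto a random Gaussian direction, add a random shift, and cut the line into cells of width $w$. The notes do not spell out the hash function, so the sketch below is an assumption rather than the lecture's definition; `make_projection_hash`, the width `w`, and the test points are hypothetical.

```python
import math
import random

def make_projection_hash(d, w, rng):
    """Assumed scheme: h(p) = floor((a . p + b) / w), with a Gaussian and b uniform in [0, w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)
    def h(p):
        return math.floor((sum(ai * pi for ai, pi in zip(a, p)) + b) / w)
    return h

rng = random.Random(0)
hashes = [make_projection_hash(d=3, w=1.0, rng=rng) for _ in range(2000)]
origin, near, far = (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (10.0, 0.0, 0.0)

# Empirical collision rates: nearby points should share a bucket far more often.
near_rate = sum(h(origin) == h(near) for h in hashes) / len(hashes)
far_rate = sum(h(origin) == h(far) for h in hashes) / len(hashes)
print(near_rate, far_rate)
```

Here the bucket is the cell of width $w$ that the projected point lands in, so close points usually share a cell while distant points rarely do.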

  • Basic Probability:
    • Expectation, Variance, Markov, Chebyshev, Chernoff
  • Streaming Algorithms
    • Uniform Sampling
    • Counting (Morris Counter)
    • Approximate Median
    • Count-min Sketch, Frequency Approximation, Heavy Hitters
    • Frequency Moment Estimation (AMS Sampling)
    • Distinct Elements
  • Online Algorithms
    • Toy
    • List Update
    • Expert’s MWU
    • Caching, Paging, Resource Augmentation
  • JL Lemma, NNS
  • Basic Probability: Expectation, variance, Markov, Chebyshev, Chernoff
  • Streaming Algorithms
  • Online Algorithms
  • Dimension Reduction
  • Nearest Neighbor Search