Lecture 26 (05/13/2026) - Approximate Nearest Neighbor; Locality Sensitive Hashing; Course Review | CSCI 328

Scribes: Naufil Faruqi and Samuel Sokol

Summary of Lecture

Introducing the Nearest Neighbor Search (NNS) problem by comparing exact membership to similarity queries
Defining Exact NNS and demonstrating that the trivial brute force algorithm takes $O(nd)$ query time
Discussing the hardness of Exact NNS: achieving strictly sub-linear query time would violate the Strong Exponential Time Hypothesis (SETH)
Relaxing the problem to Approximate NNS using an approximation factor $c > 1$ to bypass the exact NNS bottleneck
Briefly introducing K-Nearest Neighbors (K-NN) and its approximate variants for retrieving multiple similar data points

Introduction to Nearest Neighbor Search

When dealing with large datasets, a fundamental problem is evaluating queries against a membership set. Suppose we have a set $S = \{x_1, \dots, x_n\}$. A basic membership query asks: “Is the query $q$ in $S$?”

However, in many real-world applications (like image search or recommendation systems), we are more interested in similarity. Instead of exact matches, the question becomes: “Is the query $q$ similar to some key in $S$?” This gives rise to the Nearest Neighbor Search (NNS) problem.

Exact Nearest Neighbor Search (NNS)

In the formal Nearest Neighbor Search problem, we are given a query point $q$ and want to return the nearest neighbor of $q$ in the set $S$. Both the query $q$ and the data points $x_i$ lie in $d$-dimensional real coordinate space, meaning $x_i \in \mathbb{R}^d$ and $q \in \mathbb{R}^d$.

The goal is to find the point $x_i$ that minimizes the distance to $q$:

\min_{1 \le i \le n} d(q, x_i)

Trivial Solution (Brute Force): The most straightforward approach is to compute the distance $d(q, x_i)$ for every $i \in [n]$ and output the point that attains the smallest value. Computing the distance between two $d$-dimensional vectors takes $O(d)$ time, and we do this for all $n$ points, so the brute force query time is $O(nd)$.
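The brute force scan can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance for $d(\cdot,\cdot)$ (the notes have not fixed a distance at this point); the function name `brute_force_nn` and the sample points are hypothetical.

```python
import math

def brute_force_nn(points, q):
    """Return (index, distance) of the point in `points` closest to `q`.

    Assumes Euclidean distance; any other distance function could be
    substituted for math.dist.
    """
    best_i, best_dist = -1, math.inf
    for i, x in enumerate(points):      # n iterations
        dist = math.dist(q, x)          # O(d) work per point
        if dist < best_dist:
            best_i, best_dist = i, dist
    return best_i, best_dist

# Hypothetical toy data: three points in the plane.
S = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
idx, dist = brute_force_nn(S, (0.9, 1.2))
print(idx, dist)
```

The loop touches all $n$ points and spends $O(d)$ on each, which is where the $O(nd)$ query time comes from.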

Hardness of Exact NNS

A natural question is whether we can achieve an exact NNS algorithm that operates significantly faster than the brute force $O(nd)$ time, perhaps avoiding the need to scan every single point linearly.

Theorem (Ryan Williams and Alman): Any algorithm that solves exact NNS with query time truly sub-linear in $n$, for example $O(n^{0.999}d)$, would refute the Strong Exponential Time Hypothesis (SETH).

Analysis: SETH is a major conjecture in computational complexity. Informally, it states that Boolean satisfiability cannot be solved significantly faster than exhaustive search. Because SETH is widely believed to hold, this result indicates that a fast exact algorithm for NNS in high dimensions is very unlikely to exist. This inherent bottleneck forces us to consider relaxations of the problem.

Approximate Nearest Neighbor Search

Approximate NNS

Since exact NNS is computationally prohibitive, we relax the requirement and allow the algorithm to return an approximate nearest neighbor.

For a given approximation factor $c > 1$, the Approximate NNS problem requires us to return a point $x_j$ whose distance to the query is no more than $c$ times the distance to the true nearest neighbor. Mathematically, we must find an $x_j$ such that:

d(q, x_j) \le c \min_{1 \le i \le n} d(q, x_i)

By sacrificing exactness (controlled by the parameter $c$), we can drastically improve query times.
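To make the guarantee concrete, the following sketch checks whether a candidate answer satisfies the $c$-approximation condition. It assumes Euclidean distance and uses a brute force scan to find the true nearest distance, so it is only a checker for small examples, not an ANN algorithm; the name `is_c_approximate` and the data are hypothetical.

```python
import math

def is_c_approximate(points, q, j, c):
    """True if points[j] is within a factor c of the true nearest distance to q."""
    nearest = min(math.dist(q, x) for x in points)
    return math.dist(q, points[j]) <= c * nearest

# Hypothetical toy data: the true nearest neighbor of q is S[0] at distance 1.
S = [(1.0, 0.0), (0.0, 1.5), (5.0, 5.0)]
q = (0.0, 0.0)

print(is_c_approximate(S, q, 1, c=2.0))   # S[1] is at distance 1.5 <= 2 * 1
print(is_c_approximate(S, q, 1, c=1.2))   # 1.5 > 1.2 * 1
print(is_c_approximate(S, q, 2, c=2.0))   # S[2] is much farther away
```

With $c = 2$, both `S[0]` and `S[1]` are acceptable answers; shrinking $c$ toward 1 leaves only the exact nearest neighbor.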

K-Nearest Neighbors (K-NN)

A closely related variant is K-NN. In the exact K-NN problem, we return the $k$ nearest neighbors of the query $q$ rather than just the single closest point. Similarly, there is an Approximate K-NNS variant for $c > 1$, which has been a topic of recent research (e.g., PODS 2020), offering similar time-accuracy tradeoffs for retrieving multiple similar points.

Sets and Jaccard Distance

While Euclidean and Hamming distances are useful for vector spaces, we often need to measure the similarity between collections of items or sets.

Jaccard Similarity / Distance: Given two sets $A$ and $B$, the Jaccard similarity is defined as the size of their intersection divided by the size of their union:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

The similarity takes a value between 0 and 1, where 1 indicates that the sets are identical and 0 indicates that they are disjoint. The corresponding Jaccard distance is $1 - J(A, B)$. Min-Hash is a standard LSH technique built specifically to estimate Jaccard similarity for large sets.
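A small sketch of $J(A, B)$ and of the Min-Hash idea follows. The notes do not describe how Min-Hash works, so the second function rests on an assumption: it uses the standard fact that, under a uniformly random ordering of the universe, the smallest-ranked elements of $A$ and $B$ coincide with probability $J(A, B)$. The names, the explicit shuffling, and the toy sets are all hypothetical; practical implementations use hash functions rather than full permutations.

```python
import random

def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| for two finite sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)

def minhash_estimate(a, b, universe, trials=500, seed=0):
    """Estimate J(a, b) as the fraction of random orderings in which
    a and b have the same minimum-rank element. Requires non-empty sets."""
    rng = random.Random(seed)
    order = list(universe)
    agree = 0
    for _ in range(trials):
        rng.shuffle(order)  # a fresh uniformly random permutation
        rank = {item: pos for pos, item in enumerate(order)}
        if min(rank[x] for x in a) == min(rank[x] for x in b):
            agree += 1
    return agree / trials

# Hypothetical toy sets: 10 shared elements out of 30 in the union.
A = set(range(0, 20))
B = set(range(10, 30))
exact = jaccard_similarity(A, B)
estimate = minhash_estimate(A, B, universe=range(30))
print(exact, estimate)
```

The exact value here is $10/30$; the estimate fluctuates around it and tightens as `trials` grows.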

Locality Sensitive Hashing (LSH)

Data Structure

\text{Preprocessing / Storage: } O(n^{1+p}d)

\text{Query time: } O(n^p d)

The exponent $p$ depends on the distance function. For Hamming distance (see Hamming Distance):

p = \frac{1}{c}

For Euclidean distance (see LSH for Euclidean Space):

p=\frac{1}{c^2}

LSH Family

Related: See Alexandr Andoni’s work on data-dependent LSH: https://arxiv.org/pdf/1501.01062

LSH family:

A family $\mathcal{H}$ of hash functions

\mathcal{H} = \{h_1,\ldots,h_{|\mathcal{H}|}\}

where

h_i : U \longrightarrow U'

is called $(r,cr,p_1,p_2)$-sensitive if, for any two points $x,y \in U$ and a random $h$ drawn from $\mathcal{H}$:

If $d(x,y) \le r$, then $\Pr(h(x) = h(y)) \ge p_1$.
If $d(x,y) \ge cr$, then $\Pr(h(x) = h(y)) \le p_2$.

A family is interesting if

p_1 > p_2.

ANN Algorithm

Find an LSH family $\mathcal{H}$ for the problem's distance.
Draw hash functions $h_{ij} \in \mathcal{H}$ at random, for $1 \le i \le k$ and $1 \le j \le L$, and group them into $L$ composite hash functions:
$\begin{array}{l} g_1 = (h_{11}, h_{21}, \ldots, h_{k1}) \\[6pt] g_2 = (h_{12}, h_{22}, \ldots, h_{k2}) \\[6pt] \vdots \\[6pt] g_L = (h_{1L}, h_{2L}, \ldots, h_{kL}) \end{array}$
Each $g_j$ ($1 \le j \le L$) defines one hash table, so there are $L$ tables in total. Every data point $x_1, \ldots, x_n$ is stored in each table, in the bucket indexed by $g_j(x_i)$.
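The table construction can be sketched as a small Python class. This is an illustration under stated assumptions, not the lecture's implementation: the class name `LSHIndex`, the `sample_hash` callback, and the query procedure (collect every point that shares a bucket with the query in some table, then return the closest candidate) are hypothetical choices; the notes only describe the preprocessing. The usage example assumes binary vectors with a coordinate-sampling hash.

```python
import random
from collections import defaultdict

class LSHIndex:
    def __init__(self, points, sample_hash, dist, k, L, seed=0):
        rng = random.Random(seed)
        self.points = points
        self.dist = dist
        # g_j = (h_1j, ..., h_kj): k independently sampled hash functions per table.
        self.gs = [[sample_hash(rng) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for idx, x in enumerate(points):
            for g, table in zip(self.gs, self.tables):
                table[self._key(g, x)].append(idx)

    @staticmethod
    def _key(g, x):
        # The bucket key is the tuple of the k individual hash values.
        return tuple(h(x) for h in g)

    def query(self, q):
        """Return the index of the closest candidate, or None if no bucket matches."""
        candidates = set()
        for g, table in zip(self.gs, self.tables):
            candidates.update(table.get(self._key(g, q), ()))
        if not candidates:
            return None
        return min(candidates, key=lambda i: self.dist(q, self.points[i]))

# Hypothetical usage on 8-bit binary vectors.
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def sample_bit_hash(rng, d=8):
    i = rng.randrange(d)          # pick one coordinate at random
    return lambda x: x[i]

points = [(0,) * 8, (1,) * 8, (0, 0, 0, 0, 1, 1, 1, 1)]
index = LSHIndex(points, sample_bit_hash, hamming, k=3, L=10)
print(index.query((0,) * 8))
```

A query that equals a stored point always lands in that point's bucket in every table; for nearby but distinct queries, a collision in at least one of the $L$ tables is only likely, not guaranteed.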

Hamming Distance

Consider binary strings of length $d$:

x = \underbrace{(0,1,1,\ldots)}_{d \text{ bits}} \in \{0,1\}^d.

Let $h_i$ sample the $i$-th bit, that is, $h_i(x) = x_i$, with the index $i$ chosen uniformly at random from $\{1,\ldots,d\}$:

h_i : \{0,1\}^d \to \{0,1\}.

If $d(x,y) \le r$, then $x$ and $y$ agree on at least $d-r$ of the $d$ bits, so $\Pr(h_i(x)=h_i(y)) \ge \frac{d-r}{d} = 1-\frac{r}{d}$. Therefore, $p_1 = 1-\frac{r}{d}$.
If $d(x,y) \ge cr$, then $x$ and $y$ agree on at most $d-cr$ bits, so $\Pr(h_i(x)=h_i(y)) \le \frac{d-cr}{d} = 1-\frac{cr}{d}$. Therefore, $p_2 = 1-\frac{cr}{d}$.

Bit sampling is therefore a $(r,cr,1-\frac{r}{d},1-\frac{cr}{d})$-sensitive hash family for Hamming distance.
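As a numerical sanity check, the sketch below computes the collision probability of bit sampling exactly and relates $p_1, p_2$ to the exponent $p = 1/c$ quoted for Hamming distance. It relies on an assumption not stated in these notes: in the standard LSH analysis the exponent is $\ln(1/p_1) / \ln(1/p_2)$. The function names and the parameter values are hypothetical.

```python
import math

def bit_sampling_collision_prob(x, y):
    """Exact Pr over a uniformly random coordinate i that x[i] == y[i]."""
    d = len(x)
    return sum(a == b for a, b in zip(x, y)) / d

def exponent(p1, p2):
    """ln(1/p1) / ln(1/p2), the exponent from the standard LSH analysis
    (an assumption here; the notes only quote the final value 1/c)."""
    return math.log(1 / p1) / math.log(1 / p2)

# Two 4-bit strings at Hamming distance 1 agree on 3 of 4 coordinates.
print(bit_sampling_collision_prob((0, 1, 1, 0), (0, 1, 0, 0)))

# Hypothetical parameters: d = 1000, r = 10, c = 2.
d, r, c = 1000, 10, 2
p1 = 1 - r / d
p2 = 1 - c * r / d
value = exponent(p1, p2)
print(value, 1 / c)
```

For these parameters the computed exponent is about 0.497, slightly below $1/c = 0.5$, which is consistent with the bound quoted in the Data Structure section.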

So,

g = (h_1,h_2,\ldots,h_k),

where each $h_i$ is chosen from the above LSH family.

Thus,

g : \{0,1\}^d \to \{0,1\}^k.
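The effect of concatenating $k$ sampled bits can be checked by exhaustive enumeration on a tiny example. The sketch assumes the $k$ coordinates are drawn independently and uniformly with replacement, in which case two strings at Hamming distance $t$ collide under $g$ with probability $(1 - t/d)^k$. The function name and the example strings are hypothetical.

```python
from itertools import product

def g_collision_prob(x, y, k):
    """Exact Pr[g(x) == g(y)] when g = (h_1, ..., h_k) samples k coordinates
    independently and uniformly at random (with replacement)."""
    d = len(x)
    agree = 0
    for idxs in product(range(d), repeat=k):   # all d**k choices of coordinates
        if all(x[i] == y[i] for i in idxs):
            agree += 1
    return agree / d ** k

x = (0, 0, 0, 0, 0, 0)
near = (1, 0, 0, 0, 0, 0)   # Hamming distance 1 from x
far = (1, 1, 1, 0, 0, 0)    # Hamming distance 3 from x

for k in (1, 2, 3):
    print(k, g_collision_prob(x, near, k), g_collision_prob(x, far, k))
```

Both probabilities shrink as $k$ grows, but the far pair's shrinks faster, so the ratio between them widens. This is one way to see why $g$ is used for the tables instead of a single $h_i$.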

LSH for Euclidean Space

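The hash function $h$ takes as input a point of $\mathbb{R}^d$:

p = (p_1,\ldots,p_d) \in \mathbb{R}^d

(The coordinates $p_i$ here are unrelated to the collision probabilities $p_1, p_2$ of the LSH family.)

A random projection onto a $d'$-dimensional subspace gives an LSH family.

For Euclidean distance, the “bucket” of a point is determined by where its projection lands in that subspace.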
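The notes do not spell out the hash function, so the following is only a sketch of one common construction, swapped in for illustration: project onto a single random Gaussian direction (so $d' = 1$), add a random shift, and cut the line into cells of width $w$. The cell width `w`, the shift `b`, the function names, and the test points are all assumptions and not part of the lecture.

```python
import math
import random

def sample_projection_hash(d, w, rng):
    """One hash function: the bucket is the width-w cell that the shifted
    projection of p onto a random Gaussian direction falls into."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]   # random direction
    b = rng.uniform(0.0, w)                        # random shift of the grid
    def h(p):
        return math.floor((sum(ai * pi for ai, pi in zip(a, p)) + b) / w)
    return h

def collision_rate(x, y, d, w, trials=2000, seed=0):
    """Empirical fraction of freshly sampled hash functions with h(x) == h(y)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = sample_projection_hash(d, w, rng)
        hits += h(x) == h(y)
    return hits / trials

d, w = 8, 4.0
x = [0.0] * d
near = [0.5] + [0.0] * (d - 1)    # Euclidean distance 0.5 from x
far = [10.0] + [0.0] * (d - 1)    # Euclidean distance 10 from x

near_rate = collision_rate(x, near, d, w)
far_rate = collision_rate(x, far, d, w)
print(near_rate, far_rate)
```

Nearby points should share a cell far more often than distant ones, which is the $p_1 > p_2$ property required of an LSH family.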

Final Topics Overview

In Class

Basic Probability:
- Expectation, Variance, Markov, Chebyshev, Chernoff
Streaming Algorithms
- Uniform Sampling
- Counting (Morris Counter)
- Approximate Median
- Count-min Sketch, Frequency Approximation, Heavy Hitters
- Frequency Moment Estimation (AMS Sampling)
- Distinct Elements
Online Algorithms
- Toy
- List Update
- Experts (MWU)
- Caching, Paging, Resource Augmentation
JL Lemma, NNS

On Brightspace

Basic Probability: Expectation, variance, Markov, Chebyshev, Chernoff
Streaming Algorithms
Online Algorithms
Dimension Reduction
Nearest Neighbor Search