Lecture 25 (05/11/2026) - Finish Paging; Dimensionality Reduction & Johnson-Lindenstrauss Lemma; Introduce Nearest Neighbor Search

Scribes: Alisha Adhikari and Saartaj Alam

  • Review of paging algorithms and competitive analysis
  • Resource augmentation and randomized paging algorithms
  • Introduction to high-dimensional geometry
  • Curse of dimensionality and motivation for dimension reduction

Recall the paging problem discussed in the previous lecture. We are given a cache that can only store a limited number of items. Requests for items arrive one at a time, and whenever the cache becomes full, we must decide which item to evict from the cache.

Several heuristics for solving the paging problem were discussed previously:

  • Least Recently Used (LRU)
  • Least Frequently Used (LFU)
  • First In First Out (FIFO)

If the future request sequence is known in advance, then the optimal strategy is the Farthest-in-Future algorithm. Whenever an eviction is required, this algorithm removes the item whose next occurrence is farthest in the future.

However, in the online setting, future requests are unknown. Online algorithms only know the requests that have already appeared.

A paging algorithm is said to be $k$-competitive if the number of cache misses produced by the algorithm is at most $k$ times the number of cache misses produced by the optimal offline algorithm.

For a cache of size $k$, both LRU and FIFO are known to be $k$-competitive.

A cache miss occurs whenever a requested item is not currently stored in the cache.
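To make the miss counts concrete, here is a minimal Python sketch that simulates LRU, FIFO, and Farthest-in-Future on a request sequence and counts cache misses. The function names and the example sequence are illustrative assumptions, not part of the lecture.

```python
from collections import OrderedDict, deque

def lru_misses(requests, k):
    """Count cache misses for LRU with a cache of size k."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)  # page becomes the most recently used
        else:
            misses += 1
            if len(cache) >= k:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = True
    return misses

def fifo_misses(requests, k):
    """Count cache misses for FIFO with a cache of size k."""
    cache, order, misses = set(), deque(), 0
    for page in requests:
        if page not in cache:
            misses += 1
            if len(cache) >= k:
                cache.discard(order.popleft())  # evict the oldest arrival
            cache.add(page)
            order.append(page)
    return misses

def farthest_in_future_misses(requests, k):
    """Count cache misses for the offline Farthest-in-Future rule."""
    cache, misses = set(), 0
    for i, page in enumerate(requests):
        if page in cache:
            continue
        misses += 1
        if len(cache) >= k:
            def next_use(p):
                # Index of the next request for p, or infinity if none.
                for j in range(i + 1, len(requests)):
                    if requests[j] == p:
                        return j
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(page)
    return misses

requests = [1, 2, 3, 1, 2, 3]
print(lru_misses(requests, 2), fifo_misses(requests, 2),
      farthest_in_future_misses(requests, 2))
```

On this cyclic sequence with a cache of size 2, LRU and FIFO miss on all 6 requests, while Farthest-in-Future misses only 4 times.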

To improve the competitive ratio, the lecture introduced the idea of resource augmentation.

Suppose the online algorithm has cache size $k$, while the optimal offline algorithm has a smaller cache size $k'$, where

$$k' < k$$

In this setting, LRU and FIFO achieve the competitive ratio

$$\frac{k}{k-k'}$$

As an example, suppose

$$k' = \frac{k}{2}$$

Then,

$$\frac{k}{k-k'} = \frac{k}{k-k/2} = \frac{k}{k/2} = 2$$

Thus, if the online algorithm is allowed twice the cache size of the optimal offline algorithm, then LRU and FIFO become $2$-competitive.

Both LRU and FIFO are deterministic algorithms because they do not use randomness.

If randomization is allowed, then significantly better competitive ratios can be achieved. In particular, there exist randomized paging algorithms with competitive ratio

$$O(\log k)$$

This is an exponential improvement over the deterministic $k$-competitive bound.
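The notes do not name a specific randomized algorithm. One well-known example with an $O(\log k)$ competitive ratio is the randomized marking algorithm, sketched below as an illustration under that assumption. The function name and the `rng` parameter are hypothetical.

```python
import random

def marking_misses(requests, k, rng):
    """Count cache misses for the randomized marking algorithm (a sketch)."""
    cache = set()
    marked = set()
    misses = 0
    for page in requests:
        if page not in cache:
            misses += 1
            if len(cache) >= k:
                unmarked = cache - marked
                if not unmarked:
                    # Every cached page is marked: start a new phase.
                    marked.clear()
                    unmarked = set(cache)
                # Evict an unmarked page chosen uniformly at random.
                victim = rng.choice(sorted(unmarked))
                cache.remove(victim)
            cache.add(page)
        marked.add(page)  # a requested page is always marked
    return misses

rng = random.Random(0)
print(marking_misses([1, 2, 3, 4, 1, 2, 3, 4] * 5, 3, rng))
```

Recently requested pages are marked and protected from eviction within a phase, and the random choice among unmarked pages makes it hard for a fixed request sequence to force a miss on every request.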

Dimension reduction is widely used in:

  • Machine learning
  • High-dimensional statistics
  • Big data analysis

The goal is to reduce the dimensionality of data while preserving important geometric properties.

Consider two points in $\mathbb{R}^2$:

$$P_1 = (x_1,y_1), \quad P_2 = (x_2,y_2)$$

The Euclidean ($L_2$) distance between them is

$$d(P_1,P_2) = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}$$

More generally, for points in $\mathbb{R}^d$,

$$P = (p_1,p_2,\dots,p_d), \quad Q = (q_1,q_2,\dots,q_d)$$

the $L_2$ distance is

$$d(P,Q) = \sqrt{ \sum_{i=1}^{d} (p_i-q_i)^2 }$$
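The formula translates directly into code. This is a minimal sketch; the function name is an illustrative choice.

```python
import math

def euclidean_distance(p, q):
    """L2 distance between two points given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```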

Suppose we are given a dataset consisting of $n$ points in $\mathbb{R}^d$:

$$P_1,P_2,\dots,P_n$$

Each point contains $d$ coordinates:

$$P_i = (p_i^{(1)},p_i^{(2)},\dots,p_i^{(d)})$$

Subscripts denote different points, while superscripts denote coordinates within a point.

Many algorithms are polynomial in the number of points $n$ but exponential in the dimension $d$. For example, an algorithm may require runtime such as

$$O(n^2 2^d)$$

When $d$ is small, such runtimes are manageable. However, modern datasets often contain very high-dimensional data.
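A quick calculation shows how fast such a bound grows with the dimension. The sketch below evaluates $n^2 2^d$ for a hypothetical dataset of $n = 1000$ points; the values of $n$ and $d$ are illustrative only.

```python
def exponential_cost(n, d):
    """Evaluate the example bound n^2 * 2^d exactly with Python integers."""
    return n**2 * 2**d

n = 1000
for d in (10, 30, 256):
    print(d, exponential_cost(n, d))
```

For $d = 10$ the count is about $10^9$, which is feasible, while for $d = 256$ it exceeds $10^{80}$.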

For example, a $16 \times 16$ grayscale image contains

$$16 \times 16 = 256$$

pixels. By storing each pixel intensity as a coordinate, the image can be represented as a point in $\mathbb{R}^{256}$.

Thus, even low-resolution images naturally generate high-dimensional vectors.
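As an illustration, the sketch below flattens a $16 \times 16$ image, stored as a list of rows, into a single vector of 256 coordinates. The synthetic pixel values and the function name are assumptions made for the example.

```python
def image_to_vector(image):
    """Flatten a 2D grid of pixel intensities row by row into one vector."""
    return [float(pixel) for row in image for pixel in row]

# A synthetic 16 x 16 grayscale image with intensities in 0..255.
image = [[(r * 16 + c) % 256 for c in range(16)] for r in range(16)]
vector = image_to_vector(image)
print(len(vector))  # 256
```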

Dimension reduction attempts to map points from a high-dimensional space into a lower-dimensional space.

Suppose we define a mapping

$$f : \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$$

where

$$d' \ll d$$

Applying $f$ to every point produces transformed points

$$f(P_1),f(P_2),\dots,f(P_n)$$

in a lower-dimensional space.

The goal is to preserve pairwise distances approximately. This is what the Johnson-Lindenstrauss (JL) Lemma guarantees.

For any $0 < \varepsilon < 1$ and any integer $n > 1$, let

$$D' \geq \frac{C \log n}{\varepsilon^2}$$

for a sufficiently large absolute constant $C$. Then, for any set $S$ of $n$ points in $D$ dimensions, there exists a function

$$F : \mathbb{R}^D \to \mathbb{R}^{D'}$$

such that for any two points $x$ and $y$ in $S$:

$$(1 - \varepsilon) \cdot \|x - y\| \leq \|F(x) - F(y)\| \leq (1 + \varepsilon) \cdot \|x - y\|$$

This means that the function $F$ maps every point to a lower-dimensional space, and all pairwise distances between points of $S$ are preserved up to a factor of $(1 \pm \varepsilon)$.

  • The new dimension is independent of the original dimension. Regardless of how large $D$ is, the target dimension is only $O(\log n / \varepsilon^2)$. Since $\log n$ is much smaller than $n$, and $n$ is generally much smaller than $D$, we see a significant reduction.
  • The number of points doesn’t change. Only the dimension of each point is reduced.
  • Distances are preserved. $\|F(x) - F(y)\|$ is the distance between the images of $x$ and $y$ in the lower-dimensional space. It is within a $(1 \pm \varepsilon)$ factor of the original distance $\|x - y\|$.

Strictly speaking, a hyperplane is a subspace of dimension exactly one less than the space it lives in. For example, in 3D space, a hyperplane is a flat plane (2D). In 2D space, a hyperplane is a line (1D). In what follows, the term is used more loosely for a $D'$-dimensional linear subspace of $\mathbb{R}^D$.

To construct a random $D'$-dimensional hyperplane in $\mathbb{R}^D$, take $D'$ random unit vectors $q_1, \ldots, q_{D'}$ in $\mathbb{R}^D$. Let $H$ be the vector space spanned by $q_1, \ldots, q_{D'}$; as a linear subspace, it passes through the origin.

Example. In $\mathbb{R}^3$, to reduce from 3 dimensions to 2, we pick 2 random unit vectors $q_1$ and $q_2$. With probability 1 they are not parallel, so they span a unique 2-dimensional plane $H$ through the origin.

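Given a hyperplane $H$, the map $F$ is the orthogonal projection onto $H$:

$$F(p) = \text{closest point on } H \text{ to } p$$

It follows that if $p$ already lies on $H$, then $F(p) = p$. Otherwise, drop the perpendicular from $p$ to $H$; the point where it meets $H$ is $F(p)$. Because orthogonal projection never increases distances, the standard analysis additionally multiplies the projected points by a fixed scaling factor of $\sqrt{D/D'}$, so that squared distances are preserved in expectation.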

What makes a hyperplane “bad”? If the hyperplane is nearly perpendicular to the directions in which the data points differ, many points project to nearly the same point, which destroys distance information. A randomly chosen hyperplane is very unlikely to align this poorly with any fixed dataset, but it is not impossible. If the chosen hyperplane fails the distance-preservation guarantee, we can simply repeat the process with a new random hyperplane. Since the probability of success is at least inverse-polynomially large, a “good” hyperplane can be found in expected polynomial time.
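The sample-check-retry loop can be sketched in Python as follows. Note that this sketch swaps in a common variant: instead of an exact orthogonal projection onto a random subspace, it multiplies by a random Gaussian matrix with entries of variance $1/D'$, which preserves squared lengths in expectation. All function names, the `max_tries` parameter, and the test sizes are illustrative assumptions.

```python
import math
import random

def distance(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def sample_projection(D, D_prime, rng):
    """Sample a D' x D matrix with i.i.d. Gaussian entries of variance 1/D'."""
    scale = 1.0 / math.sqrt(D_prime)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(D)]
            for _ in range(D_prime)]

def apply_projection(matrix, point):
    """Map a point in R^D to R^{D'} by matrix-vector multiplication."""
    return [sum(m * x for m, x in zip(row, point)) for row in matrix]

def max_distortion(points, images):
    """Largest |ratio - 1| over all pairs, ratio = new distance / old distance."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            original = distance(points[i], points[j])
            if original == 0.0:
                continue  # identical points stay identical under a linear map
            ratio = distance(images[i], images[j]) / original
            worst = max(worst, abs(ratio - 1.0))
    return worst

def find_good_projection(points, D_prime, eps, rng, max_tries=20):
    """Resample until every pairwise distance is within a (1 +/- eps) factor."""
    D = len(points[0])
    for attempt in range(1, max_tries + 1):
        matrix = sample_projection(D, D_prime, rng)
        images = [apply_projection(matrix, p) for p in points]
        if max_distortion(points, images) <= eps:
            return matrix, images, attempt
    raise RuntimeError("no suitable projection found within max_tries")

rng = random.Random(0)
points = [[rng.uniform(-1, 1) for _ in range(300)] for _ in range(8)]
matrix, images, attempts = find_good_projection(points, 150, 0.5, rng)
print(attempts, max_distortion(points, images))
```

Checking a candidate takes time quadratic in the number of points, so each retry is cheap compared with the cost of working in the original dimension.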

This applies to machine learning as well, abstract as it may seem.

The nearest neighbor search problem is stated as follows:

Given $n$ points $x_1, \ldots, x_n$ in $\mathbb{R}^D$, preprocess them so that, given a query $q \in \mathbb{R}^D$, we can quickly return the closest point in the dataset:

$$x^* = \arg\min_{x_i} \|q - x_i\|$$

Nearest neighbor search forms the basis of the nearest neighbor classifier in machine learning. The idea is that, given a labeled training set, we can classify a new image by finding its nearest neighbor in the training set and using that neighbor’s label. The underlying assumption is that similar images lie close together in Euclidean space.
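A brute-force version of this classifier, with no preprocessing, is sketched below. It scans all $n$ training points for each query, so a query costs on the order of $nD$ operations. The function names and the toy data are illustrative assumptions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance; it has the same minimizer as the distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_neighbor(points, query):
    """Return the index of the training point closest to the query."""
    return min(range(len(points)),
               key=lambda i: squared_distance(points[i], query))

def classify(points, labels, query):
    """Label a query with the label of its nearest training point."""
    return labels[nearest_neighbor(points, query)]

train_points = [(0, 0), (10, 10), (0, 9)]
train_labels = ["a", "b", "c"]
print(classify(train_points, train_labels, (1, 1)))  # a
```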

The search can be very expensive in high dimensions, but with the JL Lemma we can reduce the dimension from $D$ to $O(\log n / \varepsilon^2)$, making each distance computation much cheaper while approximately preserving the identity of the nearest neighbor.