Lecture 25 (05/11/2026) - Finish Paging; Dimensionality Reduction & Johnson-Lindenstrauss Lemma; Introduce Nearest Neighbor Search | CSCI 328

Scribes: Alisha Adhikari and Saartaj Alam

Summary of the lecture

Review of paging algorithms and competitive analysis
Resource augmentation and randomized paging algorithms
Introduction to high-dimensional geometry
Curse of dimensionality and motivation for dimension reduction

Paging Problem

Recall the paging problem discussed in the previous lecture. We are given a cache that can only store a limited number of items. Requests for items arrive one at a time, and whenever the cache becomes full, we must decide which item to evict from the cache.

Several heuristics for solving the paging problem were discussed previously:

Least Recently Used (LRU)
Least Frequently Used (LFU)
First In First Out (FIFO)
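As a rough illustration of two of these heuristics, the sketch below counts cache misses for LRU and FIFO on a request sequence. The function names `lru_misses` and `fifo_misses` are hypothetical and not from the lecture; LFU is omitted for brevity.

```python
from collections import OrderedDict, deque


def lru_misses(requests, k):
    """Count cache misses of LRU with a cache of size k."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)  # page becomes the most recently used
        else:
            misses += 1
            if len(cache) >= k:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = True
    return misses


def fifo_misses(requests, k):
    """Count cache misses of FIFO with a cache of size k."""
    cache = set()
    order = deque()  # pages in the order they entered the cache
    misses = 0
    for page in requests:
        if page not in cache:
            misses += 1
            if len(cache) >= k:
                cache.remove(order.popleft())  # evict the oldest arrival
            cache.add(page)
            order.append(page)
    return misses


requests = [1, 2, 3, 1, 4, 2]
print(lru_misses(requests, 3), fifo_misses(requests, 3))
```

On this small sequence the two policies evict different pages when item 4 arrives, so their miss counts differ.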

If the future request sequence is known in advance, then the optimal strategy is the Farthest-in-Future algorithm. Whenever an eviction is required, this algorithm removes the item whose next occurrence is farthest in the future.
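The eviction rule just described can be sketched as follows. This is a deliberately simple version that rescans the future on every eviction, and the function name `farthest_in_future_misses` is an assumption made for illustration.

```python
def farthest_in_future_misses(requests, k):
    """Count cache misses of Farthest-in-Future with a cache of size k.

    The whole request sequence must be known in advance (offline setting).
    """
    cache = set()
    misses = 0
    for i, page in enumerate(requests):
        if page in cache:
            continue
        misses += 1
        if len(cache) >= k:

            def next_use(p):
                # Position of the next request for p after position i,
                # or infinity if p is never requested again.
                for j in range(i + 1, len(requests)):
                    if requests[j] == p:
                        return j
                return float("inf")

            # Evict the cached page whose next request is farthest away.
            cache.remove(max(cache, key=next_use))
        cache.add(page)
    return misses


print(farthest_in_future_misses([1, 2, 3, 4] * 3, 3))
```

Rescanning the future at each eviction keeps the sketch short; a faster version would precompute next-use positions.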

However, in the online setting, future requests are unknown. Online algorithms only know the requests that have already appeared.

Competitive Analysis

A paging algorithm is said to be $k$-competitive if, on every request sequence, the number of cache misses it incurs is at most $k$ times the number of cache misses incurred by the optimal offline algorithm.

For a cache of size $k$, both LRU and FIFO are known to be $k$-competitive.

A cache miss occurs whenever a requested item is not currently stored in the cache.
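To see how a factor of $k$ can arise, consider the following standard worst-case illustration (added here for concreteness; it is not taken from the lecture). Take $k+1$ distinct items and request them cyclically, for $N$ requests in total. With a cache of size $k$, LRU and FIFO always evict exactly the item that is requested next, so every request is a miss. Farthest-in-Future instead evicts the item whose next request is $k$ steps away, so once the cache is full it misses only about once every $k$ requests:

\text{misses}_{\text{LRU}} = \text{misses}_{\text{FIFO}} = N, \qquad \text{misses}_{\text{OPT}} \approx \frac{N}{k}

The ratio between the two therefore approaches $k$ as $N$ grows.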

Resource Augmentation

To improve the competitive ratio, the lecture introduced the idea of resource augmentation.

Suppose the online algorithm has cache size $k$, while the optimal offline algorithm has a smaller cache size $k'$, where

k' < k

In this setting, LRU and FIFO achieve the competitive ratio

\frac{k}{k-k'}

As an example, suppose

k' = \frac{k}{2}

Then,

\frac{k}{k-k'} = \frac{k}{k-k/2} = \frac{k}{k/2} = 2

Thus, if the online algorithm is allowed twice the cache size of the optimal offline algorithm, then LRU and FIFO become $2$-competitive.
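The arithmetic above can be checked for other cache sizes with a small helper. This is only a sketch of the formula as stated in these notes; the name `augmented_ratio` is hypothetical.

```python
from fractions import Fraction


def augmented_ratio(k, k_prime):
    """Competitive ratio k / (k - k') from the notes, as an exact fraction.

    k is the online cache size, k_prime the (smaller) offline cache size.
    """
    if not 0 < k_prime < k:
        raise ValueError("need 0 < k' < k")
    return Fraction(k, k - k_prime)


for k, k_prime in [(10, 5), (10, 9), (100, 10)]:
    print(k, k_prime, augmented_ratio(k, k_prime))
```

The ratio grows as $k'$ approaches $k$ and shrinks toward 1 as the online cache becomes much larger than the offline one.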

Randomized Paging Algorithms

Both LRU and FIFO are deterministic algorithms because they do not use randomness.

If randomization is allowed, then significantly better competitive ratios can be achieved. In particular, there exist randomized paging algorithms whose expected competitive ratio is

O(\log k)

This is an exponential improvement over the deterministic $k$-competitive bound.
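The notes do not name a specific randomized algorithm. One well-known example with an $O(\log k)$ competitive ratio is the randomized marking algorithm, sketched below under that assumption. The function name `marking_misses` and the `seed` parameter are illustrative choices.

```python
import random


def marking_misses(requests, k, seed=0):
    """Count cache misses of the randomized marking algorithm (cache size k)."""
    rng = random.Random(seed)
    cache = set()
    marked = set()  # always a subset of cache
    misses = 0
    for page in requests:
        if page not in cache:
            misses += 1
            if len(cache) >= k:
                if marked == cache:
                    # Every cached page is marked: a new phase begins.
                    marked.clear()
                # Evict an unmarked page chosen uniformly at random.
                # sorted() only makes the choice reproducible for a given seed.
                victim = rng.choice(sorted(cache - marked))
                cache.remove(victim)
            cache.add(page)
        marked.add(page)  # a requested page is always marked
    return misses


print(marking_misses([1, 2, 3, 4] * 5, 3))
```

Because the evicted page is random, an adversary cannot predict which request will force the next miss, which is the intuition behind the improved ratio.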

Introduction to Dimension Reduction

Dimension reduction is widely used in:

Machine learning
High-dimensional statistics
Big data analysis

The goal is to reduce the dimensionality of data while preserving important geometric properties.

Review of Euclidean Distance

Consider two points in $\mathbb{R}^2$:

P_1 = (x_1,y_1)

P_2 = (x_2,y_2)

The Euclidean ($L_2$) distance between them is

d(P_1,P_2) = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}

More generally, for points in $\mathbb{R}^d$,

P = (p_1,p_2,\dots,p_d)

Q = (q_1,q_2,\dots,q_d)

the $L_2$ distance is

d(P,Q) = \sqrt{ \sum_{i=1}^{d} (p_i-q_i)^2 }
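The general formula translates directly into code. The sketch below represents points as plain sequences of numbers; the function name `euclidean_distance` is an illustrative choice.

```python
import math


def euclidean_distance(p, q):
    """L2 distance between two points given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


print(euclidean_distance((0, 0), (3, 4)))        # two points in R^2
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # two points in R^3
```

Each distance computation touches all $d$ coordinates, so its cost grows linearly with the dimension.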

High-Dimensional Data

Suppose we are given a dataset consisting of $n$ points in $\mathbb{R}^d$:

P_1,P_2,\dots,P_n

Each point contains $d$ coordinates:

P_i = (p_i^{(1)},p_i^{(2)},\dots,p_i^{(d)})

Subscripts denote different points, while superscripts denote coordinates within a point.

Curse of Dimensionality

Many algorithms have running times that are polynomial in the number of points $n$ but exponential in the dimension $d$. For example, an algorithm may require runtime such as

O(n^2 2^d)

When $d$ is small, such runtimes are manageable. However, modern datasets often contain very high-dimensional data.

For example, a $16 \times 16$ grayscale image contains

16 \times 16 = 256

pixels. By storing each pixel intensity as a coordinate, the image can be represented as a point in $\mathbb{R}^{256}$.

Thus, even low-resolution images naturally generate high-dimensional vectors.
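As a small sketch of this representation, the code below flattens a $16 \times 16$ image, stored as a list of rows, into a single vector. The pixel values are made up for illustration, and `image_to_vector` is a hypothetical helper name.

```python
def image_to_vector(image):
    """Flatten a 2D grid of pixel intensities into one list, row by row."""
    return [pixel for row in image for pixel in row]


# A synthetic 16 x 16 grayscale image with intensities in 0..255.
image = [[(r * 16 + c) % 256 for c in range(16)] for r in range(16)]
vector = image_to_vector(image)
print(len(vector))  # number of coordinates of the resulting point
```

The pixel at row $r$ and column $c$ ends up at coordinate $16r + c$ of the vector.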

Dimension Reduction Mapping

Dimension reduction attempts to map points from a high-dimensional space into a lower-dimensional space.

Suppose we define a mapping

f : \mathbb{R}^d \rightarrow \mathbb{R}^{d'}

where

d' \ll d

Applying $f$ to every point produces transformed points

f(P_1),f(P_2),\dots,f(P_n)

in a lower-dimensional space.

The goal is to preserve pairwise distances approximately. This idea leads to the Johnson-Lindenstrauss Lemma.

The Johnson-Lindenstrauss Lemma

Introduction

For any $0 < \varepsilon < 1$ and any integer $n > 1$, let

D' \geq \frac{C \log n}{\varepsilon^2}

for a sufficiently large absolute constant $C$. Then, for any set $S$ of $n$ points in $D$ dimensions, there exists a function

F : \mathbb{R}^D \to \mathbb{R}^{D'}

such that for any two points $x$ and $y$ in $S$:

(1 - \varepsilon) \cdot \|x - y\| \leq \|F(x) - F(y)\| \leq (1 + \varepsilon) \cdot \|x - y\|

This means that the function $F$ maps every point to a lower-dimensional space, and all pairwise distances are preserved up to a factor of $(1 \pm \varepsilon)$.
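To get a feel for the target dimension, the sketch below evaluates a bound of the form $c \log n / \varepsilon^2$. The constant `c = 8.0` and the use of the natural logarithm are assumptions made for illustration; the lecture does not fix a constant.

```python
import math


def jl_target_dim(n, eps, c=8.0):
    """Target dimension ceil(c * ln(n) / eps^2); c is an assumed constant."""
    if n <= 1 or not 0 < eps < 1:
        raise ValueError("need n > 1 and 0 < eps < 1")
    return math.ceil(c * math.log(n) / eps ** 2)


for n in (1_000, 1_000_000):
    for eps in (0.5, 0.1):
        print(n, eps, jl_target_dim(n, eps))
```

Note that the original dimension $D$ does not appear anywhere in the function: only the number of points and the allowed distortion matter.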

Observations

The new dimension is independent of the original dimension. Regardless of how large $D$ is, the target dimension is only $O(\log n / \varepsilon^2)$. Since $\log n$ is much smaller than $n$, this is a significant reduction whenever $D$ is large compared to $\log n / \varepsilon^2$.
The number of points doesn’t change. Only the dimension of each point is reduced.
Distances are preserved. $\|F(x) - F(y)\|$ is the distance between the images of $x$ and $y$ in the lower-dimensional space. It is within a $(1 \pm \varepsilon)$ factor of the original distance $\|x - y\|$.

Random Hyperplanes and Projection

Definition

A hyperplane is a subspace with exactly one fewer dimension than the space that contains it. For example, in 3D space, a hyperplane is a flat plane (2D). In 2D space, a hyperplane is a line (1D).

Constructing a Random Hyperplane

To construct a random $D'$-dimensional subspace of $\mathbb{R}^D$ (loosely called a hyperplane in these notes, even though $D'$ is usually much smaller than $D - 1$), take $D'$ random unit vectors $q_1, \ldots, q_{D'}$ in $\mathbb{R}^D$. Let $H$ be the subspace spanned by $q_1, \ldots, q_{D'}$; like every span, it passes through the origin.

Example. In $\mathbb{R}^3$, to reduce from 3 dimensions to 2, we pick 2 random unit vectors, $q_1$ and $q_2$. With probability 1 they are not parallel, so they span a unique 2-dimensional plane $H$ through the origin.

Maps

Given a hyperplane $H$, the map $F$ is the orthogonal projection onto $H$:

F(p) = \text{closest point on } H \text{ to } p

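It follows that if $p$ already lies on $H$, then $F(p) = p$. Otherwise, drop a perpendicular from $p$ to $H$; the point where it meets $H$ is $F(p)$. Because orthogonal projection can only shrink distances, the standard construction also rescales the projected point by $\sqrt{D/D'}$, so that squared distances are preserved in expectation.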
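A minimal pure-Python sketch of this construction follows: it draws random unit vectors, orthonormalizes them with Gram-Schmidt to get a basis of $H$, and projects a point onto $H$. All function names are hypothetical, and normalizing Gaussian vectors is just one common way to obtain random unit vectors.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def random_unit_vector(dim, rng):
    """Random direction in R^dim, obtained by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def orthonormal_basis(vectors):
    """Gram-Schmidt: an orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-12:  # skip vectors that already lie in the span
            basis.append([wi / norm for wi in w])
    return basis


def project(p, basis):
    """Coordinates of the projection of p in the basis of H (a point in R^D')."""
    return [dot(p, b) for b in basis]


def closest_point(p, basis):
    """The same projection written as a point of R^D: the closest point on H."""
    coords = project(p, basis)
    return [sum(c * b[i] for c, b in zip(coords, basis)) for i in range(len(p))]


rng = random.Random(0)
basis = orthonormal_basis([random_unit_vector(3, rng) for _ in range(2)])
p = [1.0, 2.0, 3.0]
print(project(p, basis))        # low-dimensional image of p
print(closest_point(p, basis))  # closest point on H, still in R^3
```

Here `project` gives the $D'$ coordinates used after dimension reduction, while `closest_point` is the geometric picture of the same map inside the original space.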

Benefits of Randomness

What makes a hyperplane “bad”? If $H$ is nearly perpendicular to the directions along which the data points differ, the points project to nearly the same location, destroying distance information. A randomly chosen hyperplane is very unlikely to align this poorly with any fixed dataset, but it is not impossible. If the chosen hyperplane fails the distance-preservation guarantee, we simply repeat the process with a new random hyperplane. Since each attempt succeeds with probability at least inverse-polynomial in $n$, a “good” hyperplane can be found in expected polynomial time.
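The retry strategy can be sketched as a loop that samples a random map, checks every pair of points, and tries again on failure. To keep the example short, it uses a random Gaussian matrix scaled by $1/\sqrt{D'}$ as a stand-in for projection onto a random subspace; that substitution, the function names, and the `max_tries` cutoff are all assumptions of this sketch.

```python
import math
import random
from itertools import combinations


def sample_gaussian_map(d_in, d_out, rng):
    """Random linear map R^d_in -> R^d_out with N(0, 1/d_out) entries."""
    scale = 1.0 / math.sqrt(d_out)
    rows = [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)]
            for _ in range(d_out)]
    return lambda p: [sum(a * b for a, b in zip(row, p)) for row in rows]


def preserves_distances(points, images, eps):
    """Check the (1 +/- eps) guarantee for every pair of points."""
    for i, j in combinations(range(len(points)), 2):
        original = math.dist(points[i], points[j])
        mapped = math.dist(images[i], images[j])
        if not (1 - eps) * original <= mapped <= (1 + eps) * original:
            return False
    return True


def find_good_map(points, d_out, eps, max_tries=100, seed=0):
    """Resample random maps until one preserves all pairwise distances."""
    rng = random.Random(seed)
    d_in = len(points[0])
    for _ in range(max_tries):
        f = sample_gaussian_map(d_in, d_out, rng)
        if preserves_distances(points, [f(p) for p in points], eps):
            return f
    return None  # no good map found within max_tries attempts


# Five synthetic points in R^50, mapped down to R^30.
points = [[float((i * 7 + j * 3) % 11) for j in range(50)] for i in range(5)]
f = find_good_map(points, d_out=30, eps=0.9)
print(f is not None)
```

Checking all pairs costs $O(n^2 D')$ time per attempt, which is polynomial, so the overall search stays polynomial as long as few attempts are needed.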

Nearest Neighbor Search

Abstract as it may seem, dimension reduction applies directly to machine learning, for example through nearest neighbor search.

The problem statement is the following:

Given $n$ points $x_1, \ldots, x_n$ in $\mathbb{R}^D$, preprocess them so that, given a query $q \in \mathbb{R}^D$, we can quickly return the closest point in the dataset:

x^* = \arg\min_{x_i} \|q - x_i\|
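The simplest way to answer a query is a brute-force scan with no preprocessing at all, sketched below. The function name `nearest_neighbor` is an illustrative choice.

```python
import math


def nearest_neighbor(points, query):
    """Return (index, distance) of the point closest to query.

    Scans all n points and computes a D-dimensional distance for each,
    so a single query costs O(n * D) time.
    """
    if not points:
        raise ValueError("points must be non-empty")
    best_index, best_dist = None, math.inf
    for i, x in enumerate(points):
        d = math.dist(x, query)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist


points = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0)]
index, distance = nearest_neighbor(points, (0.9, 1.2))
print(index, distance)
```

This scan serves as the baseline that any preprocessing scheme is trying to beat.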

Nearest neighbor search forms the basis of the nearest neighbor classifier in machine learning. Given a labeled training set, we classify a new image by finding its nearest neighbor in the training set and using that neighbor’s label. The underlying assumption is that similar images lie close together in Euclidean space.
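A nearest neighbor classifier of this kind fits in a few lines. The sketch below uses tiny two-dimensional "images" and made-up labels purely for illustration; `classify_1nn` is a hypothetical name.

```python
import math


def classify_1nn(train_points, train_labels, query):
    """Return the label of the training point closest to query."""
    nearest = min(range(len(train_points)),
                  key=lambda i: math.dist(train_points[i], query))
    return train_labels[nearest]


# Toy training set: each "image" is just a 2-coordinate vector.
train_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
train_labels = ["dark", "dark", "bright", "bright"]
print(classify_1nn(train_points, train_labels, (9, 9)))
print(classify_1nn(train_points, train_labels, (1, 0)))
```

For real images, each training point would be a flattened pixel vector, so every query involves $n$ distance computations in a very high dimension.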

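A brute-force search compares the query against all $n$ points and costs $O(nD)$ time per query, which is expensive when $D$ is large. With JL, we can reduce the dimension from $D$ to $O(\log n / \varepsilon^2)$, making each distance computation much cheaper. Because distances are preserved up to a $(1 \pm \varepsilon)$ factor, the point returned is an approximate nearest neighbor.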