
Lec 04-30-2026: Lines & Planes (cont.) & Support Vector Machines

(Ending of our Linear Algebra recap)

In our last lecture, we introduced the notion of a hyperplane. Recall that a hyperplane is a flat $(n-1)$-dimensional surface in $\mathbb{R}^n$: a line in $\mathbb{R}^2$, a flat plane in $\mathbb{R}^3$, and so on. Algebraically, a hyperplane is given by $\vec{w}^T\vec{x} + b = 0$:

$$\begin{bmatrix} w_1 & w_2 & \cdots & w_n \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} + b = 0$$

$$w_1x_1 + w_2x_2 + \cdots + w_nx_n + b = 0$$

In 2D, this becomes $w_1x_1 + w_2x_2 + b = 0$. Assuming $w_2 \neq 0$, we can rearrange to isolate $x_2$ (which plays the role of $y$):

$$w_2x_2 = -w_1x_1 - b \implies x_2 = \frac{-w_1}{w_2}x_1 - \frac{b}{w_2} \implies y = mx + b$$

Note: the "$b$" in the slope-intercept form $y = mx + b$ is $-b/w_2$ (and the slope is $m = -w_1/w_2$). It is a different value from the $b$ in the original hyperplane equation; the two just happen to share the same letter.
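As a quick sanity check of the rearrangement above, here is a minimal Python sketch. The function name `slope_intercept` and the example coefficients are made up for illustration, and the intercept is called `c` to avoid the clash with the hyperplane's $b$.

```python
from typing import Tuple


def slope_intercept(w1: float, w2: float, b: float) -> Tuple[float, float]:
    """Rewrite w1*x1 + w2*x2 + b = 0 as x2 = m*x1 + c and return (m, c)."""
    if w2 == 0:
        # With w2 = 0 the line is vertical, so x2 cannot be isolated.
        raise ValueError("w2 must be non-zero to isolate x2")
    m = -w1 / w2   # slope
    c = -b / w2    # intercept (NOT the same number as the hyperplane's b)
    return m, c


# Hypothetical line: 2*x1 + 4*x2 - 8 = 0
m, c = slope_intercept(2.0, 4.0, -8.0)
print(m, c)  # -0.5 2.0
```

Plugging $x_1 = 2$ into $x_2 = -0.5x_1 + 2$ gives $x_2 = 1$, and $(2, 1)$ does satisfy $2(2) + 4(1) - 8 = 0$.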

Theorem: $\vec{w}$ is Orthogonal to the Hyperplane

Proof

To show $\vec{w}$ is orthogonal to the entire hyperplane, it suffices to show it is orthogonal to every vector lying parallel to the hyperplane (since such vectors span all directions within the hyperplane). Let $\vec{v}$ be an arbitrary vector parallel to the hyperplane, drawn with its base & tip on the hyperplane, and let $\vec{v}_1$ & $\vec{v}_2$ be the vectors pointing from the origin to the base & tip of $\vec{v}$ respectively.

Coordinate axes with hyperplane as a diagonal blue line; position vectors v1 and v2 (orange) from origin to the base and tip of v (teal) respectively; w (teal) shown perpendicular to the hyperplane with a right angle marker

Since $\vec{v}_1$ & $\vec{v}_2$ begin from the origin, we have:

$$\vec{v} = \vec{v}_2 - \vec{v}_1 \quad \text{(parallel to hyperplane)}$$

Since the tips of $\vec{v}_1$ & $\vec{v}_2$ lie on the hyperplane $\vec{w}^T\vec{x} + b = 0$, they each satisfy the hyperplane equation:

$$\vec{w}^T\vec{v}_2 + b = 0 \qquad \text{and} \qquad \vec{w}^T\vec{v}_1 + b = 0$$

Subtracting the second equation from the first eliminates $b$, leaving:

$$\vec{w}^T\vec{v}_2 - \vec{w}^T\vec{v}_1 = 0 \implies \vec{w}^T(\vec{v}_2 - \vec{v}_1) = 0 \implies \vec{w}^T\vec{v} = 0$$

Hence $\vec{w}$ is orthogonal to $\vec{v}$. Since $\vec{v}$ was an arbitrary vector parallel to the hyperplane, $\vec{w}$ is orthogonal to the hyperplane. $\square$
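The proof can be checked numerically on a concrete case. The sketch below uses a made-up hyperplane in $\mathbb{R}^3$ and two hand-picked points on it; the helper `dot` and all the numbers are illustrative assumptions, not part of the lecture.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product u^T v of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Hypothetical hyperplane in R^3: x1 + 2*x2 - 2*x3 + 3 = 0
w = [1.0, 2.0, -2.0]
b = 3.0

# Two points chosen by hand so that w^T x + b = 0 holds for each.
v1 = [-3.0, 0.0, 0.0]  # -3 + 0 - 0 + 3 = 0
v2 = [1.0, 1.0, 3.0]   #  1 + 2 - 6 + 3 = 0

# v = v2 - v1 runs from one point on the hyperplane to another,
# so it is parallel to the hyperplane.
v = [t - s for s, t in zip(v1, v2)]

print(dot(w, v1) + b, dot(w, v2) + b)  # 0.0 0.0
print(dot(w, v))                       # 0.0
```

One example proves nothing on its own, but it mirrors the subtraction step: the $b$ terms cancel and only $\vec{w}^T(\vec{v}_2 - \vec{v}_1)$ remains.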

(Another structure of math)

The unit vector $\hat{w} = \frac{\vec{w}}{\|\vec{w}\|}$ is also orthogonal to the hyperplane. This follows immediately from the theorem: $\hat{w}$ is just $\vec{w}$ scaled by the positive scalar $\frac{1}{\|\vec{w}\|}$, so it points in the exact same direction as $\vec{w}$. Since orthogonality depends only on direction, $\hat{w}$ is orthogonal to the hyperplane too.
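A small sketch of this scaling argument, assuming a made-up normal vector $\vec{w} = (3, 4)$ and a direction $\vec{v} = (4, -3)$ that runs along any line of the form $3x_1 + 4x_2 + b = 0$. The function names are hypothetical, and the checks use a tolerance because floating-point arithmetic is not exact.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product u^T v of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def unit_normal(w: Sequence[float]) -> List[float]:
    """Return w scaled by the positive scalar 1/||w||."""
    norm = math.sqrt(dot(w, w))
    if norm == 0:
        raise ValueError("w must be non-zero")
    return [wi / norm for wi in w]


w = [3.0, 4.0]    # hypothetical normal vector, ||w|| = 5
v = [4.0, -3.0]   # direction along the line: w^T v = 12 - 12 = 0

w_hat = unit_normal(w)

# w_hat has length 1 and is still orthogonal to v (up to rounding).
print(math.isclose(math.sqrt(dot(w_hat, w_hat)), 1.0))  # True
print(abs(dot(w_hat, v)) < 1e-12)                       # True
```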

By default, we draw $\vec{w}$ with its tail at the origin. However, since $\vec{w}$ represents a direction (perpendicular to the hyperplane), we can translate it to start from any point and it still correctly describes the orientation of the hyperplane, as the diagram below illustrates.

Diagram: w translated away from the origin, still perpendicular to the hyperplane

Given a point $P$ not on the hyperplane, we want the shortest distance from $P$ to the hyperplane. As before, the hyperplane is defined by $\vec{w}^T\vec{x} + b = 0$.

2D diagram of coordinate axes with the hyperplane as a diagonal line, point P above it, w (orange) perpendicular to the hyperplane at origin, and a blue arrow from P down to the hyperplane asking "can we find this length?"

There are infinitely many line segments connecting $P$ to some point on the hyperplane, but the shortest one is always the perpendicular one. (Any non-perpendicular segment is the hypotenuse of a right triangle that has the perpendicular segment as one of its legs, making it strictly longer.) So we go with the perpendicular segment.

Step 1: Project $P$ onto the hyperplane:

  • Call the vector from the origin to the projection of $P$: $\vec{x}_p$
  • Call the vector from the projection of $P$ to $P$ itself: $\vec{d}$
  • Call the vector from the origin to $P$: $\vec{x}_0$

Coordinate axes with hyperplane; xp (magenta) from origin to projection of P on hyperplane, delta (yellow-green) from projection to P, x0 (teal) from origin to P, w (orange) perpendicular to hyperplane; right angle marker at the projection point

From the diagram:

$$\vec{x}_p + \vec{d} = \vec{x}_0 \implies \vec{x}_p = \vec{x}_0 - \vec{d} \tag{Eq 1}$$

Since the tip of $\vec{x}_p$ lies on the hyperplane, $\vec{w}^T\vec{x}_p + b = 0$.

Since $\vec{d}$ is the perpendicular segment from $P$ to the hyperplane, it is orthogonal to the hyperplane. But $\vec{w}$ is also orthogonal to the hyperplane. Two vectors that are both perpendicular to the same hyperplane must be parallel to each other, so $\vec{d}$ is a scalar multiple of $\vec{w}$:

$$\vec{d} = s\vec{w} \tag{Eq 2}$$

Again, our goal is to use this to find the distance between $P$ & the hyperplane, aka $\|\vec{d}\|$.

Substituting Eq 1 into $\vec{w}^T\vec{x}_p + b = 0$:

$$\begin{aligned}
\vec{w}^T(\vec{x}_0 - \vec{d}) + b &= 0 \\
\vec{w}^T(\vec{x}_0 - s\vec{w}) + b &= 0 && \text{(Eq 2)} \\
\vec{w}^T\vec{x}_0 - s\vec{w}^T\vec{w} + b &= 0 \\
\vec{w}^T\vec{x}_0 + b &= s\vec{w}^T\vec{w} \\
s &= \frac{\vec{w}^T\vec{x}_0 + b}{\vec{w}^T\vec{w}} \tag{Eq 3}
\end{aligned}$$

Since $\vec{d} = s\vec{w}$, by Eq 3:

$$\vec{d} = \left(\frac{\vec{w}^T\vec{x}_0 + b}{\vec{w}^T\vec{w}}\right)\vec{w}$$

$$\begin{aligned}
\|\vec{d}\| &= \sqrt{\vec{d}^T\vec{d}} \\
&= \sqrt{(s\vec{w})^T(s\vec{w})} \\
&= \sqrt{s\vec{w}^T \cdot s\vec{w}} \\
&= \sqrt{s^2\,\vec{w}^T\vec{w}} \\
&= \sqrt{s^2}\,\sqrt{\vec{w}^T\vec{w}} \\
&= |s|\sqrt{\vec{w}^T\vec{w}} \\
&= \left|\frac{\vec{w}^T\vec{x}_0 + b}{\vec{w}^T\vec{w}}\right|\sqrt{\vec{w}^T\vec{w}} \\
&= \frac{|\vec{w}^T\vec{x}_0 + b|}{\vec{w}^T\vec{w}}\,\sqrt{\vec{w}^T\vec{w}} && (\vec{w}^T\vec{w} > 0) \\
&= \frac{|\vec{w}^T\vec{x}_0 + b|}{\sqrt{\vec{w}^T\vec{w}}} = \frac{|\vec{w}^T\vec{x}_0 + b|}{\sqrt{\|\vec{w}\|^2}} = \frac{|\vec{w}^T\vec{x}_0 + b|}{\|\vec{w}\|}
\end{aligned}$$

2D diagram showing the hyperplane, point P, vector x0 (blue) from origin to P, and a right-angle marker at the hyperplane where the perpendicular from P meets it; annotation reads 'Given w^T x + b = 0 & point p, you can get orthogonal distance'
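The derivation above can be traced step by step in code. This is a minimal sketch, not a reference implementation: the function name `project_onto_hyperplane` and the example line and point are made up, chosen so that every intermediate value is a whole number.

```python
import math
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product u^T v of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def project_onto_hyperplane(
    w: Sequence[float], b: float, x0: Sequence[float]
) -> Tuple[List[float], float]:
    """Return (x_p, distance) for the point x0 and hyperplane w^T x + b = 0."""
    ww = dot(w, w)
    if ww == 0:
        raise ValueError("w must be non-zero")
    s = (dot(w, x0) + b) / ww                     # Eq 3
    d = [s * wi for wi in w]                      # Eq 2: d = s * w
    x_p = [xi - di for xi, di in zip(x0, d)]      # Eq 1: x_p = x0 - d
    distance = abs(dot(w, x0) + b) / math.sqrt(ww)  # |w^T x0 + b| / ||w||
    return x_p, distance


# Hypothetical example: the line 3*x1 + 4*x2 - 10 = 0 and the point P = (5, 5)
w = [3.0, 4.0]
b = -10.0
P = [5.0, 5.0]

x_p, distance = project_onto_hyperplane(w, b, P)
print(x_p, distance)  # [2.0, 1.0] 5.0
```

Here $\vec{w}^T\vec{x}_0 + b = 25$ and $\vec{w}^T\vec{w} = 25$, so $s = 1$ and $\vec{d} = (3, 4)$, whose length is $5$. The projection $(2, 1)$ satisfies $3(2) + 4(1) - 10 = 0$, so it does lie on the line.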

Top: nested boxes with DL inside NN inside ML inside AI; Bottom: tree with Machine Learning branching into Supervised Learning, Unsupervised Learning, and Reinforcement Learning

Support vector machines (SVMs) are a type of supervised learning.

Idea: Given a set of data points that each belong to 1 of 2 classes, the goal is to decide which class a new data point belongs to.

Example: In medicine, a tumor is an abnormal growth of cells which may or may not be cancerous.

Let’s develop a model that attempts to tell whether a tumor is cancerous, based on the following training data.

Each patient’s data contains 13 features (size of tumor, cell density, length of tumor, …) and 1 label (C = cancerous, N = non-cancerous), giving 14 dimensions total. The data for patient $i$ is represented as $(\vec{x}_i, y_i)$, where the label $y_i \in \{C, N\}$ is encoded as $+1$ (C) or $-1$ (N).

Three patient cards (1, 2, 569) each with stick figure, feature list (size, density, length, ..., C or N) with orange subscript labels; card 2 annotated with 13-metrics brace, response y-value arrow, and 14 Dimensions note; card 1 shows encoding to 1 or -1

Each data point is a pair $(\vec{x}_i, y_i)$: a feature vector and a label. Since the output is categorical, we encode the label numerically:

$$\left(\underbrace{\begin{bmatrix}\text{tumor size}\\\text{cell density}\\\text{tumor length}\\\vdots\end{bmatrix}}_{\vec{x}_i},\; \underbrace{C \text{ or } N}_{1 \text{ or } -1}\right)$$

Let $X_{ij}$ denote the $j^{\text{th}}$ feature of patient $i$.

  • $X_{11}$ is the first feature of patient (pt) 1.
  • $X_{12}$ is the second feature of patient (pt) 1.
  • $Y_1$ is either C or N.
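The notation above maps naturally onto nested lists. The sketch below is illustrative only: the feature values are invented (they are not taken from the real dataset), only 3 of the 13 features are shown, and the names `LABEL_TO_Y` and `encode_dataset` are hypothetical. It assumes the encoding C → +1, N → −1.

```python
from typing import List, Sequence, Tuple

# Assumed encoding: cancerous -> +1, non-cancerous -> -1
LABEL_TO_Y = {"C": 1, "N": -1}


def encode_dataset(
    records: Sequence[Tuple[Sequence[float], str]]
) -> Tuple[List[List[float]], List[int]]:
    """Split (features, 'C'/'N') records into a feature table X and labels y."""
    X = [list(features) for features, _ in records]
    y = [LABEL_TO_Y[label] for _, label in records]
    return X, y


# Invented toy records: (tumor size, cell density, tumor length), label
raw_records = [
    ([2.0, 0.5, 3.0], "C"),  # patient 1
    ([1.0, 0.1, 1.5], "N"),  # patient 2
]

X, y = encode_dataset(raw_records)

# Python indexes from 0, so X_{12} (patient 1, feature 2) is X[0][1].
print(X[0][1])  # 0.5
print(y)        # [1, -1]
```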

Since cancerous cells divide rapidly & grow uncontrollably, cancerous tumors tend to show measurable differences in properties like tumor size and cell density. The hypothesis is that a few of the 13 features can reliably discriminate between the two classes.

We’ll start with just 2 of the 13 features to visualize the idea, then generalize to all dimensions. The output we predict is the label $y_i \in \{+1, -1\}$ for a new patient.

Scatter plot of the Wisconsin Breast Cancer Dataset: Tumor Size (Mean Radius) on x-axis, Cell Density (Mean Compactness) on y-axis; Cancerous (+1) in red/orange, Non-Cancerous (-1) in green; 569 total data points

With 569 training patients plotted, now imagine a $570^{\text{th}}$ patient walks in. Their features give a new point on this scatter plot. The question is: which class does it belong to, and how do we draw the boundary that separates the two classes?