Skip to content

Lec 05-05-2026: Support Vector Machines: Motivation & Setup

Last lecture we sketched a subset of the Wisconsin Breast Cancer Dataset.

Notation: xjix^i_j where ii = patient #, jj = feature #

Two patients (1 and 569) represented as stick figures, each with a box listing their tumor features and labels

Since the output is not numeric, let’s perform a conversion:

yi={1if person i has no cancer+1if person i has cancery_i = \begin{cases} -1 & \text{if person } i \text{ has no cancer} \\ +1 & \text{if person } i \text{ has cancer} \end{cases}

Patient 570 as a stick figure with a feature box showing size, cell density, and an unknown label (?)

A new patient 570 walks in with features (size…, cell density…, …) but with an unknown label.

Q: A new patient walks in with a tumor. Can we predict if the tumor is/isn’t cancerous?

Q: Can we come up with a line (hyperplane) that does a good job at separating these two classes?

A: We answered this (fully) in 1994.

It is tough to draw a separating line because every line/hyperplane will misclassify some points.

Since this question is hard:

  1. We will pose an easier question
  2. We will investigate if the easier question happens IRL
  3. We will try to answer the easier question
  4. We will try to modify our answer to answer the harder question

Since every line/hyperplane misclassifies some things, it would be easier if our dataset looked like:

Scatter plot with -1 points in the lower-left and +1 points in the upper-right, separated by a red diagonal hyperplane, labeled 'This is called linearly separable'

Scatter plot with +1 and -1 points randomly mixed with no discernible pattern, labeled 'NOT linearly separable'

Def. A set of points is said to be linearly separable if there exists a hyperplane such that one side of the hyperplane consists of all points with label 1-1 and the other side contains all points with label +1+1.

Q: If the data is linearly separable, how many hyperplanes can we draw?

Scatter plot with -1 and +1 points showing three valid separating hyperplanes drawn in pink, blue, and yellow

There are infinitely many hyperplanes.

Q: Which is the “best”?

A: The support vector machine gives one way of finding the “best” hyperplane.

The idea is this. Consider the following 2 hyperplanes imposed upon the same dataset:

Scatter plot with a red hyperplane that has a small margin, positioned close to the data points

Scatter plot with a blue hyperplane that has a large margin, positioned centrally between the two clusters

Claim: The blue hyperplane will generalize better.

Scatter plot with red hyperplane, a new borderline square point misclassified as +1 by the hyperplane, with annotation 'probably -1, but the hyperplane predicts +1'

The blue hyperplane has more “breathing room” — there is a larger margin between the hyperplane and the surrounding points.

We care about maximizing the distance from the hyperplane to the nearest point, i.e. maximizing the margin.

margin=min(1,5)=1\text{margin} = \min(1, 5) = 1

Scatter plot with a red hyperplane positioned close to the -1 cluster; the nearest -1 point (circled) has a short distance (labeled 1) to the hyperplane and the nearest +1 point (circled, far away) has a long distance (labeled 5) to the hyperplane

Our goal is to maximize the margin. SVM will tell us how to do this.

In a linearly separable dataset, how can we find the hyperplane which maximizes the margin?

The hyperplane equation is:

wTx+b=0\vec{w}^T \vec{x} + b = 0

We want to find w\vec{w} and bb.

Here, x\vec{x} is the feature vector of a data point (the inputs), while w\vec{w} and bb are the unknowns that define the hyperplane - w\vec{w} controls its orientation and bb controls its offset from the origin.

Phase 2: Does Linearly Separable Data Happen IRL?

Section titled “Phase 2: Does Linearly Separable Data Happen IRL?”
  • If the # of dimensions is low, no
  • If the # of dimensions is high, it likely is linearly separable

2 reasons for this:

  1. Higher dimensional spaces are “roomier” (more space, so more likely to separate the data)

    Think of being stuck in traffic: in 2D (the road), everyone is clustered and impossible to separate. But introduce a third dimension - the ability to fly - and you can lift some cars up and pass a plane between them. The same idea applies to data: more dimensions give the hyperplane more directions to orient itself, making it easier to slip one between the two classes.

  2. Blessing of non-uniformity - the phenomenon is sometimes not random

    • Ex. If cancer was random, we would see this:

Scatter plot with density (vertical) and size (horizontal) axes; +1 and -1 labels randomly mixed with no discernible pattern

If data were truly random, you couldn’t make predictions - no pattern to exploit, no clustering to speak of. It would be like throwing darts at a board. But IRL we see:

Scatter plot where -1 points are concentrated in the lower-left and +1 points in the upper-right, with some intermixing near the middle boundary

Datapoints are often concentrated in a specific region.

Given a linearly separable dataset, how can we find the hyperplane which maximizes the margin?

We first define what is meant by “margin.”

Def. Given a dataset D\mathbb{D} of the form D={(xi,yi)yi{1,1}}\mathbb{D} = \{(\vec{x}_i, y_i) \mid y_i \in \{-1, 1\}\}, we define the margin of the hyperplane corresponding to wTx+b=0\vec{w}^T \vec{x} + b = 0, denoted by

γ(w,b):=minxiDwTxi+bw\gamma(\vec{w}, b) := \min_{\vec{x}_i \in \mathbb{D}} \frac{\lvert \vec{w}^T \vec{x}_i + b \rvert}{\lVert \vec{w} \rVert}

The expression wTxi+bw\frac{\lvert \vec{w}^T \vec{x}_i + b \rvert}{\lVert \vec{w} \rVert} is the perpendicular distance from point xi\vec{x}_i to the hyperplane wTx+b=0\vec{w}^T \vec{x} + b = 0 - a result from linear algebra (the projection distance formula).

The formula works as follows:

  1. For each point in our training data
  2. Compute the distance from the point to the hyperplane
  3. Take the minimum distance

The formula tells us how to find the margin given a hyperplane & a set of points.

As mentioned previously, our goal is to find the w\vec{w} & bb which maximizes γ(w,b)\gamma(\vec{w}, b):

Goal:

argmaxw,b  γ(w,b)\underset{\vec{w},\, b}{\arg\max} \; \gamma(\vec{w}, b)

Each part of this expression has a specific role:

  • argmax\arg\max (the operator): “Return the argument that produces the maximum, not the maximum value itself.” Contrast with plain max\max: if f(5)=20f(5) = 20 is the peak, then maxf=20\max f = 20 but argmaxf=5\arg\max f = 5.
  • w,b\vec{w}, b (the subscript): What you are free to vary in the search - all possible choices of w\vec{w} and bb, i.e. all possible hyperplanes. These are the unknowns you want to find.
  • γ(w,b)\gamma(\vec{w}, b) (the expression): The thing being maximized - the margin as a function of the hyperplane. For any given w\vec{w} and bb, this evaluates to the size of that hyperplane’s margin.

Put together: “Among all choices of w\vec{w} and bb, find the ones that make the margin γ(w,b)\gamma(\vec{w}, b) as large as possible.”

Original Explanation:

The notation argmax\arg\max means “return the input that produces the maximum, not the maximum value itself.” For example, if f(5)=20f(5) = 20 is the maximum of ff, then maxf=20\max f = 20 but argmaxf=5\arg\max f = 5. Here we want the specific w\vec{w} and bb that yield the largest margin - not the margin value itself.

Scatter plot with -1 and +1 points showing three valid separating hyperplanes drawn in red, blue, and purple

To compute argmaxw,b  γ(w,b)\underset{\vec{w},b}{\arg\max} \; \gamma(\vec{w},b):

  1. Search through each hyperplane
  2. Compute the margin associated to each hyperplane
  3. Return the w\vec{w} and bb that produce the maximum margin

Two panels: left shows midpoint hyperplane separating -1 cluster (lower-left, one circled) and +1 cluster (upper-right, one circled); right shows a larger-margin hyperplane far from the data

There is a subtle issue with the goal as stated. If we simply ask a computer to find the w\vec{w} and bb that maximize γ(w,b)\gamma(\vec{w}, b), it will push the hyperplane off to infinity.

Why? Because γ(w,b)\gamma(\vec{w}, b) is the minimum distance from the hyperplane to any data point. If we slide the hyperplane far away from all the data, that minimum distance keeps growing - there is nothing in the formulation that prevents it. The computer would happily report a margin of 342, then 1000, then \infty.

To fix this, we need to add a constraint that keeps the hyperplane anchored near the data. This turns the problem into a constrained optimization - the same kind encountered in calculus when maximizing the area of a garden subject to a limited amount of fencing material:

argmaxw,b  γ(w,b)subject to a constraint on w,b\underset{\vec{w},\, b}{\arg\max} \; \gamma(\vec{w}, b) \quad \text{subject to a constraint on } \vec{w}, b

Choosing the right constraint, and then simplifying the resulting problem, is what comes next.