Lec 05-05-2026: Support Vector Machines: Motivation & Setup
Wisconsin Breast Cancer Dataset
Section titled “Wisconsin Breast Cancer Dataset”Last lecture we sketched a subset of the Wisconsin Breast Cancer Dataset.
Notation: where = patient #, = feature #
Since the output is not numeric, let’s perform a conversion:
A new patient 570 walks in with features (size…, cell density…, …) but with an unknown label.
The Classification Question
Section titled “The Classification Question”Q: A new patient walks in with a tumor. Can we predict if the tumor is/isn’t cancerous?
Q: Can we come up with a line (hyperplane) that does a good job at separating these two classes?
A: We answered this (fully) in 1994.
It is tough to draw a separating line because every line/hyperplane will misclassify some points.
Since this question is hard:
- We will pose an easier question
- We will investigate if the easier question happens IRL
- We will try to answer the easier question
- We will try to modify our answer to answer the harder question
Phase 1: Pose an Easier Question
Section titled “Phase 1: Pose an Easier Question”Since every line/hyperplane misclassifies some things, it would be easier if our dataset looked like:
Def. A set of points is said to be linearly separable if there exists a hyperplane such that one side of the hyperplane consists of all points with label and the other side contains all points with label .
Q: If the data is linearly separable, how many hyperplanes can we draw?
There are infinitely many hyperplanes.
Q: Which is the “best”?
A: The support vector machine gives one way of finding the “best” hyperplane.
The Best Hyperplane
Section titled “The Best Hyperplane”The idea is this. Consider the following 2 hyperplanes imposed upon the same dataset:
Claim: The blue hyperplane will generalize better.
The blue hyperplane has more “breathing room” — there is a larger margin between the hyperplane and the surrounding points.
We care about maximizing the distance from the hyperplane to the nearest point, i.e. maximizing the margin.
Our goal is to maximize the margin. SVM will tell us how to do this.
The Easier Question
Section titled “The Easier Question”In a linearly separable dataset, how can we find the hyperplane which maximizes the margin?
The hyperplane equation is:
We want to find and .
Here, is the feature vector of a data point (the inputs), while and are the unknowns that define the hyperplane - controls its orientation and controls its offset from the origin.
Phase 2: Does Linearly Separable Data Happen IRL?
Section titled “Phase 2: Does Linearly Separable Data Happen IRL?”- If the # of dimensions is low, no
- If the # of dimensions is high, it likely is linearly separable
2 reasons for this:
-
Higher dimensional spaces are “roomier” (more space, so more likely to separate the data)
Think of being stuck in traffic: in 2D (the road), everyone is clustered and impossible to separate. But introduce a third dimension - the ability to fly - and you can lift some cars up and pass a plane between them. The same idea applies to data: more dimensions give the hyperplane more directions to orient itself, making it easier to slip one between the two classes.
-
Blessing of non-uniformity - the phenomenon is sometimes not random
- Ex. If cancer was random, we would see this:
If data were truly random, you couldn’t make predictions - no pattern to exploit, no clustering to speak of. It would be like throwing darts at a board. But IRL we see:
Datapoints are often concentrated in a specific region.
Phase 3: Finding the Optimal Hyperplane
Section titled “Phase 3: Finding the Optimal Hyperplane”Given a linearly separable dataset, how can we find the hyperplane which maximizes the margin?
We first define what is meant by “margin.”
Def. Given a dataset of the form , we define the margin of the hyperplane corresponding to , denoted by
The expression is the perpendicular distance from point to the hyperplane - a result from linear algebra (the projection distance formula).
The formula works as follows:
- For each point in our training data
- Compute the distance from the point to the hyperplane
- Take the minimum distance
The formula tells us how to find the margin given a hyperplane & a set of points.
As mentioned previously, our goal is to find the & which maximizes :
Goal:
Each part of this expression has a specific role:
- (the operator): “Return the argument that produces the maximum, not the maximum value itself.” Contrast with plain : if is the peak, then but .
- (the subscript): What you are free to vary in the search - all possible choices of and , i.e. all possible hyperplanes. These are the unknowns you want to find.
- (the expression): The thing being maximized - the margin as a function of the hyperplane. For any given and , this evaluates to the size of that hyperplane’s margin.
Put together: “Among all choices of and , find the ones that make the margin as large as possible.”
Original Explanation:
The notation means “return the input that produces the maximum, not the maximum value itself.” For example, if is the maximum of , then but . Here we want the specific and that yield the largest margin - not the margin value itself.
To compute :
- Search through each hyperplane
- Compute the margin associated to each hyperplane
- Return the and that produce the maximum margin
A Problem with This Formulation
Section titled “A Problem with This Formulation”There is a subtle issue with the goal as stated. If we simply ask a computer to find the and that maximize , it will push the hyperplane off to infinity.
Why? Because is the minimum distance from the hyperplane to any data point. If we slide the hyperplane far away from all the data, that minimum distance keeps growing - there is nothing in the formulation that prevents it. The computer would happily report a margin of 342, then 1000, then .
To fix this, we need to add a constraint that keeps the hyperplane anchored near the data. This turns the problem into a constrained optimization - the same kind encountered in calculus when maximizing the area of a garden subject to a limited amount of fencing material:
Choosing the right constraint, and then simplifying the resulting problem, is what comes next.