Lec 05-07-2026: Support Vector Machines: Maximizing the Margin
Recall
Section titled “Recall”Given a linearly separable dataset , our goal is to find the hyperplane which maximizes the margin — the distance from the hyperplane to the closest point.
The distance from a point to the hyperplane is given by:
The Margin
Section titled “The Margin”Think of as a machine: you feed it a hyperplane by specifying and , and it returns a single number — the distance to the closest data point. That number is the margin for that particular hyperplane. Different hyperplanes give different margin values.
Maximizing the Margin
Section titled “Maximizing the Margin”We must consider all possible hyperplanes. For instance, on a given dataset, one hyperplane may give a margin of 1.1 (distance to the closest point, circled):
Since our goal is to maximize the margin, we must find:
where denotes the inputs that achieve the maximum. For example:
Here, the subscript in and means “over all ” - i.e., we range over every possible value of to find the maximum and the input that achieves it.
Concretely: fix one , compute all point-to-hyperplane distances, take the minimum — that is for that hyperplane. Try a different , repeat. After doing this across all hyperplanes, you end up with a collection of margins — say — and return the and that produced the largest one.
Small Hiccup
Section titled “Small Hiccup”If our only goal is to maximize the margin, we run into a problem:
The computer, in attempting to maximize the margin (i.e. to maximize the distance from the hyperplane to the nearest point), will send the hyperplane to infinity — this is bad because we want the hyperplane to separate the two classes.
Fixing the Hiccup
Section titled “Fixing the Hiccup”Q: How do we fix this?
A: Impose a constraint which tells the computer to make sure the hyperplane separates the two classes.
To see what form this constraint should take, recall that is the label of point , and that is the signed output of the hyperplane at — it is positive on one side of the hyperplane and negative on the other. If a point is on the correct side of the hyperplane for its class, then and have the same sign, so their product is positive. If the point is on the wrong side, the signs are opposite and the product is negative.
So requiring:
is a compact way of saying “every data point must be on the correct side of the hyperplane.” Two things still need to be verified:
- Does this actually force the hyperplane to lie between the two classes?
- Does this prevent the hyperplane from flying off to infinity?
Here is Why
Section titled “Here is Why”Updated goal:
Why It Solves the Problem
Section titled “Why It Solves the Problem”For this datapoint (labeled , but on the wrong side of the hyperplane), it must be the case that :
This violates the constraint, so the hyperplane is forced to separate the two classes.
Preventing the Hyperplane from Flying Off to
Section titled “Preventing the Hyperplane from Flying Off to −∞-\infty−∞”For this datapoint with label : since it rests above the hyperplane, :
The constraint is again violated, so the hyperplane cannot fly off to .
One Final Comment on This Constraint
Section titled “One Final Comment on This Constraint”This constraint allows for the midpoint hyperplane to be considered, allowing us to place the hyperplane in between the 2 classes, which is what we want.
For this datapoint (labeled , correctly above the hyperplane): :
The constraint is satisfied.
Similarly, for a -1 point correctly below the hyperplane, :
Hence, the condition is satisfied: places the hyperplane where it is wanted.
Constrained Optimization
Section titled “Constrained Optimization”is a constrained optimization problem. In a constrained optimization problem, we are not free to search over all possible values of our variables — we can only consider values that satisfy the constraint. Think of it as finding the best solution within a restricted set: instead of a global maximum, we find the best answer among the options the constraint allows.
A familiar example from calculus illustrates the idea:
Example from Calculus
Section titled “Example from Calculus”Q: Find the area of the maximum rectangular garden given the farmer only has 1000ft of fencing.
Rewriting the Optimization Problem
Section titled “Rewriting the Optimization Problem”We want to simplify this into a form that is easier to work with. Start by substituting the definition of :
The min is taken over all data points . The denominator doesn’t depend on — once you fix a hyperplane, is just a constant — so it can be factored out:
The Scaling Trick
Section titled “The Scaling Trick”Here is a key observation from algebra: the line and the line are the same geometric line — multiplying both sides by 7 doesn’t change it. The same applies to hyperplanes: you can scale by any non-zero constant and it still describes the same hyperplane.
Since we are searching over all possible hyperplanes anyway, we are free to choose how we write any given hyperplane. We will use this freedom to always normalize so that the minimum numerator equals 1:
To see why this is always possible: suppose you have some and the minimum comes out to 15. Replace with and with — this still describes the same hyperplane, but now the minimum is . You can always rescale to make the minimum 1, so this isn’t an extra restriction — it’s just a choice of normalization.
With the minimum fixed at 1, it drops out and the problem simplifies to:
Turning Maximization into Minimization
Section titled “Turning Maximization into Minimization”is a positive scalar. To maximize , think about how behaves: as gets smaller, gets larger. So the maximum of occurs exactly when is as small as possible. Maximizing and minimizing have the same solution:
Removing the Square Root
Section titled “Removing the Square Root”Minimizing and minimizing produce the same argmin. Squaring is a monotone transformation for non-negative values — if then — so it preserves which value is smallest. Wherever is minimized, is minimized there too. This is the same idea used in calculus when minimizing distances: rather than minimizing , you minimize instead, which avoids dealing with the square root.
The square root is gone, which makes this much cleaner to work with. To be precise: the constraint didn’t disappear during the rewriting above — only the objective was being simplified. The full problem, with the constraint restored, is:
This is the standard form of the SVM optimization problem: minimize a quadratic objective subject to linear constraints. Problems of this form have well-developed solution techniques, which is what comes next.