Lec 05-12-2026: Hard & Soft Margin SVMs | MATH 245

Last Lecture

Given a linearly separable dataset $\mathbb{D}$ , our goal is to find the $\vec{w}$ and $b$ which creates the hyperplane with maximum margin.

To do so, we must solve the following optimization:

\underset{\vec{w},b}{\text{argmax}} \ \gamma(\vec{w}, b)

subject to

\forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 0

Without this constraint, the hyperplane would simply fly off to infinity — the margin can be made arbitrarily large just by moving the hyperplane infinitely far away from all the data. The constraint is what keeps it properly anchored over the dataset.

This is equivalent to:

\underset{\vec{w},b}{\text{argmin}} \ \vec{w}^T \vec{w}

subject to:

$\forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 0$
$\min_{\vec{x}_i \in \mathbb{D}} \left| \vec{w}^T \vec{x}_i + b \right| = 1$

Which is further equivalent to:

\underset{\vec{w},b}{\text{argmin}} \ \vec{w}^T \vec{w}

subject to

\forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1

The condition $\forall i, y_i(\vec{w}^T \vec{x}_i + b) > 0$ serves two purposes: it ensures the hyperplane correctly classifies all points, and it keeps the hyperplane from flying off to infinity. That is, it ensures we get something like this:

$2D coordinate plane with diagonal hyperplane labeled w^T x + b = 0$

For all positive points, since $\min_{\vec{x}_i \in \mathbb{D}} \left| \vec{w}^T \vec{x}_i + b \right| = 1$ :

\Rightarrow \forall i, \ \vec{w}^T \vec{x}_i + b \geq 1

And since $y_i = 1$ :

\Rightarrow y_i(\vec{w}^T \vec{x}_i + b) \geq 1

For all negative labels, since $\min_{\vec{x}_i \in \mathbb{D}} \left| \vec{w}^T \vec{x}_i + b \right| = 1$ :

\Rightarrow \forall i, \ \vec{w}^T \vec{x}_i + b \leq -1

And since $y_i = -1$ :

\Rightarrow \forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1

This is the same constraint as for positive $y_i$ ‘s.

We now have the answer to how to find the $\vec{w}$ and $b$ that maximizes the margin in a linearly separable dataset.

The points that sit exactly on one of the two margin boundaries — satisfying $y_i(\vec{w}^T\vec{x}_i + b) = 1$ with equality — are called support vectors. They are the only points that determine the position and orientation of the hyperplane; every other correctly classified point could be removed from the dataset and the solution would be unchanged. This is where the name “Support Vector Machine” comes from.

Hard Margin Support Vector Machine

\underset{\vec{w},b}{\text{argmin}} \ \frac{1}{2} \vec{w}^T \vec{w} \quad \text{subject to} \quad \forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1

This is a QP (quadratic programming) problem.

$Two parabolas f (black) and ½f (red) sharing the same argmin; dotted line connects both minima vertically, showing the ½ factor does not shift the argmin$

Solving QP Problems

The reason we spent all that time transforming the original problem into this QP form is practical: early ML researchers couldn’t solve the original max-margin formulation directly. Mathematicians, however, had already been studying quadratic programming for years and had well-developed tools for it. Reformulating as a QP opened the door to those existing solutions.

There are several approaches to solving QP problems:

Active set methods (1950s/60s)
Interior point methods (1980s)
Subgradient method (1960s/70s) — we are studying this
Sequential minimal optimization (SMO, 1998)

Subgradient Method Approach

The subgradient method is closely related to the gradient descent we’ve already studied — think of it as gradient descent with an extra constraint-checking step. It is somewhat outdated (SMO is today’s standard), but it works and is much easier to understand without a full course in numerical analysis.

To solve the optimization, we:

Initialize — choose values for $\vec{w}$ , $b$ , $\alpha$
For each datapoint $(x_i, y_i)$ , check if the constraint is satisfied: is $y_i(\vec{w}^T \vec{x}_i + b) \geq 1$ ?

Yes:

$\vec{w}_\text{new} = \vec{w}_\text{old} - \alpha \vec{w}_\text{old}$

$b_\text{new} = b_\text{old}$

No:

$\vec{w}_\text{new} = \vec{w}_\text{old} - \alpha(\vec{w}_\text{old} - y_i \vec{x}_i)$

$b_\text{new} = b_\text{old} + \alpha y_i$
Repeat until convergence — that is, until the constraints are always satisfied and $\vec{w}$ and $b$ stop meaningfully changing between updates.

This completes phase 3.

Phase 4: Modifying the Approach for the Original Problem

Original problem: Given a dataset which is not linearly separable, find the hyperplane which maximizes the margin.

In the case the data is NOT linearly separable, no such hyperplane will satisfy the condition:

\forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1

$2D scatter plot with diagonal hyperplane; -1 points lower-left, +1 points upper-right, one misclassified -1 point circled in red on the +1 side$

For the misclassified point, $\vec{w}^T \vec{x}_i + b$ has the wrong sign relative to $y_i$ , so:

y_i(\vec{w}^T \vec{x}_i + b) < 0

But it needs to be $y_i(\vec{w}^T \vec{x}_i + b) \geq 1$ .

Idea: Relax the condition/constraint by softening it. This gives leeway, and produces the “SVM with soft margins” (also called “SVM with slack variables”).

SVM with Soft Margin

The intuition is that in the case where the dataset is not linearly separable, we aim to satisfy:

\underset{\vec{w},b}{\text{argmin}} \ \frac{1}{2} \vec{w}^T \vec{w} \quad \text{subject to} \quad \forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1

as much as possible, but will allow some violation by adding a little slack $\xi$ .

Consider this data:

$Two parallel lines: hyperplane w^T x+b=0 (lower) and margin boundary w^T x+b=1 (upper); +1 point boxed in red between the lines; misclassified -1 circled in red above the margin; -1 points lower-left$

For the red-boxed $+1$ point, $\vec{w}^T \vec{x}_i + b < 1$ . Since $y_i = 1$ :

\Rightarrow y_i(\vec{w}^T \vec{x}_i + b) < 1

\Rightarrow \forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1 \text{ is violated}

Let’s say $\vec{w}^T \vec{x}_i + b = 0.4$ . Although this condition is not satisfied, we will satisfy:

y_i(\vec{w}^T \vec{x}_i + b) \geq 1 - \xi \quad \text{as long as } \xi \geq 0.6

$Four parallel lines: negative margin wx+b=-1 (solid), decision boundary wx+b=0 (solid), imaginary shifted margin through boxed +1 (dashed), positive margin wx+b=1 (solid); +1 point boxed in red on the dashed line; misclassified -1 circled in red above positive margin; -1 points lower-left; cyan xi arrow perpendicular from positive margin down to dashed line$

For the red-circled $-1$ point (misclassified, past the positive margin boundary), $\vec{w}^T \vec{x}_i + b > 1$ . Since $y_i = -1$ , and multiplying a positive number by $-1$ flips the sign:

\Rightarrow y_i(\vec{w}^T \vec{x}_i + b) < -1 \quad \text{(condition is violated)}

For example, $y_i(\vec{w}^T \vec{x}_i + b) = -3$ . Again, we introduce slack — this point will satisfy:

y_i(\vec{w}^T \vec{x}_i + b) \geq 1 - \xi \quad \text{as long as } \xi \geq 4

The hard margin SVM:

\underset{\vec{w},b}{\text{argmin}} \ \frac{1}{2} \vec{w}^T \vec{w} \quad \text{subject to} \quad \forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1

becomes the Soft Margin SVM:

\underset{\vec{w},b}{\text{argmin}} \ \frac{1}{2} \vec{w}^T \vec{w} + C \sum_{i=1}^{n} \xi_i

subject to

\forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1 - \xi_i

\forall i, \ \xi_i \geq 0

$\xi_i$ is a measure of how much point $i$ violates the margin. Since we want to minimize violations, we want to penalize the total amount of violation across all points.

You might wonder: why $\sum \xi_i$ specifically? Why not count the number of violations, or just penalize the worst one? The count ( $\sum \mathbb{1}[\xi_i > 0]$ ) is not smooth — it jumps discontinuously and is hard to optimize. Taking only the max ignores all the other violations. The sum strikes the right balance: every violation contributes proportionally to how bad it is, and the result is a smooth, convex expression that fits naturally into the QP framework we already have. We also add $\forall i,\; \xi_i \geq 0$ to prevent “negative violations” (i.e., a well-classified point can’t cancel out a misclassified one).

$C$ is a hyperparameter — think of it as a fee charged per unit of total violation. A high $C$ penalizes violations heavily, so the optimizer will work hard to avoid misclassifications (tighter margin, fewer errors). A low $C$ is more tolerant of misclassifications, allowing a wider margin at the cost of some wrongly classified points.