Skip to content

Lec 05-12-2026: Hard & Soft Margin SVMs

Given a linearly separable dataset D\mathbb{D}, our goal is to find the w\vec{w} and bb which creates the hyperplane with maximum margin.

To do so, we must solve the following optimization:

argmaxw,b γ(w,b)\underset{\vec{w},b}{\text{argmax}} \ \gamma(\vec{w}, b)

subject to

i, yi(wTxi+b)0\forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 0

Without this constraint, the hyperplane would simply fly off to infinity — the margin can be made arbitrarily large just by moving the hyperplane infinitely far away from all the data. The constraint is what keeps it properly anchored over the dataset.

This is equivalent to:

argminw,b wTw\underset{\vec{w},b}{\text{argmin}} \ \vec{w}^T \vec{w}

subject to:

  • i, yi(wTxi+b)0\forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 0
  • minxiDwTxi+b=1\min_{\vec{x}_i \in \mathbb{D}} \left| \vec{w}^T \vec{x}_i + b \right| = 1

Which is further equivalent to:

argminw,b wTw\underset{\vec{w},b}{\text{argmin}} \ \vec{w}^T \vec{w}

subject to

i, yi(wTxi+b)1\forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1

The condition i,yi(wTxi+b)>0\forall i, y_i(\vec{w}^T \vec{x}_i + b) > 0 serves two purposes: it ensures the hyperplane correctly classifies all points, and it keeps the hyperplane from flying off to infinity. That is, it ensures we get something like this:

2D coordinate plane with diagonal hyperplane labeled w^T x + b = 0

For all positive points, since minxiDwTxi+b=1\min_{\vec{x}_i \in \mathbb{D}} \left| \vec{w}^T \vec{x}_i + b \right| = 1:

i, wTxi+b1\Rightarrow \forall i, \ \vec{w}^T \vec{x}_i + b \geq 1

And since yi=1y_i = 1:

yi(wTxi+b)1\Rightarrow y_i(\vec{w}^T \vec{x}_i + b) \geq 1

For all negative labels, since minxiDwTxi+b=1\min_{\vec{x}_i \in \mathbb{D}} \left| \vec{w}^T \vec{x}_i + b \right| = 1:

i, wTxi+b1\Rightarrow \forall i, \ \vec{w}^T \vec{x}_i + b \leq -1

And since yi=1y_i = -1:

i, yi(wTxi+b)1\Rightarrow \forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1

This is the same constraint as for positive yiy_i‘s.

We now have the answer to how to find the w\vec{w} and bb that maximizes the margin in a linearly separable dataset.

The points that sit exactly on one of the two margin boundaries — satisfying yi(wTxi+b)=1y_i(\vec{w}^T\vec{x}_i + b) = 1 with equality — are called support vectors. They are the only points that determine the position and orientation of the hyperplane; every other correctly classified point could be removed from the dataset and the solution would be unchanged. This is where the name “Support Vector Machine” comes from.

argminw,b 12wTwsubject toi, yi(wTxi+b)1\underset{\vec{w},b}{\text{argmin}} \ \frac{1}{2} \vec{w}^T \vec{w} \quad \text{subject to} \quad \forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1

This is a QP (quadratic programming) problem.

Two parabolas f (black) and ½f (red) sharing the same argmin; dotted line connects both minima vertically, showing the ½ factor does not shift the argmin

The reason we spent all that time transforming the original problem into this QP form is practical: early ML researchers couldn’t solve the original max-margin formulation directly. Mathematicians, however, had already been studying quadratic programming for years and had well-developed tools for it. Reformulating as a QP opened the door to those existing solutions.

There are several approaches to solving QP problems:

  • Active set methods (1950s/60s)
  • Interior point methods (1980s)
  • Subgradient method (1960s/70s) — we are studying this
  • Sequential minimal optimization (SMO, 1998)

The subgradient method is closely related to the gradient descent we’ve already studied — think of it as gradient descent with an extra constraint-checking step. It is somewhat outdated (SMO is today’s standard), but it works and is much easier to understand without a full course in numerical analysis.

To solve the optimization, we:

  • Initialize — choose values for w\vec{w}, bb, α\alpha

  • For each datapoint (xi,yi)(x_i, y_i), check if the constraint is satisfied: is yi(wTxi+b)1y_i(\vec{w}^T \vec{x}_i + b) \geq 1?

    Yes:

    wnew=woldαwold\vec{w}_\text{new} = \vec{w}_\text{old} - \alpha \vec{w}_\text{old}

    bnew=boldb_\text{new} = b_\text{old}

    No:

    wnew=woldα(woldyixi)\vec{w}_\text{new} = \vec{w}_\text{old} - \alpha(\vec{w}_\text{old} - y_i \vec{x}_i)

    bnew=bold+αyib_\text{new} = b_\text{old} + \alpha y_i

  • Repeat until convergence — that is, until the constraints are always satisfied and w\vec{w} and bb stop meaningfully changing between updates.

This completes phase 3.

Phase 4: Modifying the Approach for the Original Problem

Section titled “Phase 4: Modifying the Approach for the Original Problem”

Original problem: Given a dataset which is not linearly separable, find the hyperplane which maximizes the margin.

In the case the data is NOT linearly separable, no such hyperplane will satisfy the condition:

i, yi(wTxi+b)1\forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1

2D scatter plot with diagonal hyperplane; -1 points lower-left, +1 points upper-right, one misclassified -1 point circled in red on the +1 side

For the misclassified point, wTxi+b\vec{w}^T \vec{x}_i + b has the wrong sign relative to yiy_i, so:

yi(wTxi+b)<0y_i(\vec{w}^T \vec{x}_i + b) < 0

But it needs to be yi(wTxi+b)1y_i(\vec{w}^T \vec{x}_i + b) \geq 1.

Idea: Relax the condition/constraint by softening it. This gives leeway, and produces the “SVM with soft margins” (also called “SVM with slack variables”).

The intuition is that in the case where the dataset is not linearly separable, we aim to satisfy:

argminw,b 12wTwsubject toi, yi(wTxi+b)1\underset{\vec{w},b}{\text{argmin}} \ \frac{1}{2} \vec{w}^T \vec{w} \quad \text{subject to} \quad \forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1

as much as possible, but will allow some violation by adding a little slack ξ\xi.

Consider this data:

Two parallel lines: hyperplane w^T x+b=0 (lower) and margin boundary w^T x+b=1 (upper); +1 point boxed in red between the lines; misclassified -1 circled in red above the margin; -1 points lower-left

For the red-boxed +1+1 point, wTxi+b<1\vec{w}^T \vec{x}_i + b < 1. Since yi=1y_i = 1:

yi(wTxi+b)<1\Rightarrow y_i(\vec{w}^T \vec{x}_i + b) < 1 i, yi(wTxi+b)1 is violated\Rightarrow \forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1 \text{ is violated}

Let’s say wTxi+b=0.4\vec{w}^T \vec{x}_i + b = 0.4. Although this condition is not satisfied, we will satisfy:

yi(wTxi+b)1ξas long as ξ0.6y_i(\vec{w}^T \vec{x}_i + b) \geq 1 - \xi \quad \text{as long as } \xi \geq 0.6

Four parallel lines: negative margin wx+b=-1 (solid), decision boundary wx+b=0 (solid), imaginary shifted margin through boxed +1 (dashed), positive margin wx+b=1 (solid); +1 point boxed in red on the dashed line; misclassified -1 circled in red above positive margin; -1 points lower-left; cyan xi arrow perpendicular from positive margin down to dashed line

For the red-circled 1-1 point (misclassified, past the positive margin boundary), wTxi+b>1\vec{w}^T \vec{x}_i + b > 1. Since yi=1y_i = -1, and multiplying a positive number by 1-1 flips the sign:

yi(wTxi+b)<1(condition is violated)\Rightarrow y_i(\vec{w}^T \vec{x}_i + b) < -1 \quad \text{(condition is violated)}

For example, yi(wTxi+b)=3y_i(\vec{w}^T \vec{x}_i + b) = -3. Again, we introduce slack — this point will satisfy:

yi(wTxi+b)1ξas long as ξ4y_i(\vec{w}^T \vec{x}_i + b) \geq 1 - \xi \quad \text{as long as } \xi \geq 4

The hard margin SVM:

argminw,b 12wTwsubject toi, yi(wTxi+b)1\underset{\vec{w},b}{\text{argmin}} \ \frac{1}{2} \vec{w}^T \vec{w} \quad \text{subject to} \quad \forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1

becomes the Soft Margin SVM:

argminw,b 12wTw+Ci=1nξi\underset{\vec{w},b}{\text{argmin}} \ \frac{1}{2} \vec{w}^T \vec{w} + C \sum_{i=1}^{n} \xi_i

subject to

i, yi(wTxi+b)1ξi\forall i, \ y_i(\vec{w}^T \vec{x}_i + b) \geq 1 - \xi_i i, ξi0\forall i, \ \xi_i \geq 0