Lec 05-14-2026: Soft-Margin SVM & Cross-Validation

Last lecture, we started the soft-margin SVM. We said: given a non-linearly separable dataset $D$, to find the optimal hyperplane $\vec{w}^T \vec{x} + b = 0$ which maximizes the margin, we solve the following optimization problem:

$$\arg\min_{\vec{w}, b} \underbrace{\frac{1}{2} \vec{w}^T \vec{w}}_{\text{how large margin can be}} + \underbrace{C \sum_{i=1}^{n} \xi_i}_{\substack{\text{wants to make margin small} \\ \text{(penalizes constraint violations)}}}$$

Subject to $\forall i$, $y_i (\vec{w}^T \vec{x}_i + b) \geq 1 - \xi_i$ and $\forall i$, $\xi_i \geq 0$.

Because $\xi_i$ is a measure of how much a point violates the margin, we wish to minimize the total violation, and that’s why we are minimizing $\sum_{i=1}^{n} \xi_i$.

  • $\xi_i$ is a measure of how much datapoint $i$ violates the margin (its error).
  • So $\sum \xi_i$ is a measure of the total violation of the margin across all datapoints.
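To make the slack concrete, here is a minimal sketch that computes $\xi_i$ for a fixed hyperplane. It assumes the standard fact that, for given $\vec{w}$ and $b$, the smallest slack satisfying both constraints is $\xi_i = \max(0,\ 1 - y_i(\vec{w}^T \vec{x}_i + b))$. The helper name `slack`, the hyperplane, and the three points are hypothetical values chosen only for illustration.

```python
def slack(w, b, x, y):
    """Smallest xi with y*(w.x + b) >= 1 - xi and xi >= 0."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

# Hypothetical hyperplane and labelled points (illustrative values only).
w, b = [1.0, 0.0], 0.0
points = [
    ([2.0, 1.0], +1),   # outside the margin: no violation
    ([0.5, -1.0], +1),  # inside the margin, but on the correct side
    ([-1.0, 0.0], +1),  # on the wrong side of the hyperplane
]

slacks = [slack(w, b, x, y) for x, y in points]
total_violation = sum(slacks)  # the sum that the objective penalizes
print(slacks, total_violation)
```

A slack between 0 and 1 means the point is inside the margin but still correctly classified; a slack above 1 means it is misclassified.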

$C \sum_{i=1}^{n} \xi_i$ is saying “I’ll allow some points to break the margin but charge a fee of $C$ per unit of violation.”

  • If $C$ is low, it is cheap to break the margin, so we get a larger margin.
  • If $C$ is high, it is expensive to break the margin, so the margin / breathing room will decrease to make sure there are fewer violations.

So we want to maximize the margin and minimize violations at the same time; $C$ is what controls that balance.
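The trade-off can be seen on a tiny 1-D example. The sketch below does not solve the optimization problem; it only evaluates the objective for two hand-picked candidate hyperplanes (both hypothetical) and reports which one the objective prefers for a small and a large $C$. The dataset, the candidates, and the function names are assumptions made for illustration.

```python
def objective(w, b, data, C):
    """Soft-margin objective 0.5*w^2 + C*sum(xi) for 1-D inputs."""
    total_slack = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in data)
    return 0.5 * w * w + C * total_slack

# Hypothetical 1-D dataset: (x, label).
data = [(-3.0, -1), (1.0, +1), (5.0, +1)]

# Two hand-picked candidates (w, b); margin width is 2/|w|.
candidates = {
    "wide": (0.25, -0.25),  # width 8, but the point x=1 has slack 1
    "narrow": (0.5, 0.5),   # width 4, no point violates the margin
}

def preferred(C):
    """Name of the candidate with the lower objective value for this C."""
    return min(candidates, key=lambda name: objective(*candidates[name], data, C))

for C in (0.01, 10.0):
    print(C, preferred(C))
```

With the small $C$ the wide-margin candidate has the lower objective despite its violation; with the large $C$ the violation becomes too expensive and the narrow-margin candidate wins.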

To choose $C$, we split the data as follows:

Dataset split: 60% training, 20% validation, 20% test
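One possible way to produce such a split is sketched below. The function name `split_60_20_20`, the shuffling step, and the fixed seed are assumptions; the lecture only specifies the 60/20/20 proportions.

```python
import random

def split_60_20_20(samples, seed=0):
    """Shuffle, then cut into 60% train, 20% validation, 20% test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = n * 60 // 100
    n_val = n * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains
    return train, val, test

# Stand-in "dataset" of 100 item indices.
train, val, test = split_60_20_20(range(100))
print(len(train), len(val), len(test))
```

The three parts are disjoint, so no test point can leak into training or validation.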

Using the 60% training split, we train 5 models with different values of $C$:

Tree diagram: 60% training data split into 5 models with varying C values

Which $C$ yields a better model?

We use each of the 5 models to predict on the validation data and compute the accuracy.

  • Use the $C$ whose model achieves the highest validation accuracy.

Using this $C$, run the optimization on the training + validation set. This yields $\vec{w}_\text{final}$ and $b_\text{final}$, and our end model is $\vec{w}_\text{final}^T \vec{x} + b_\text{final} = 0$.

We evaluate our end model on the test set, which the model has never seen before, so we get an “honest” reading of its performance.
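The whole procedure (train one model per $C$, pick the best on validation, retrain on training + validation, score once on test) can be sketched as follows. Several things here are assumptions rather than the lecture's method: the five $C$ values, the synthetic two-blob data, and in particular the trainer, which swaps the constrained problem for plain subgradient descent on the equivalent hinge-loss form $\frac{1}{2}\vec{w}^T\vec{w} + C\sum_i \max(0, 1 - y_i(\vec{w}^T\vec{x}_i + b))$ with an ad hoc step size. All function names are hypothetical.

```python
import random

def train_svm(data, C, epochs=200, lr=0.1):
    """Batch subgradient descent on 0.5*||w||^2 + C*sum(hinge). A sketch only."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    step = lr / (1.0 + C * len(data))  # ad hoc scaling so large C stays stable
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of 0.5*||w||^2 is w
        for x, y in data:
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * score < 1:          # point violates the margin
                for j in range(dim):
                    grad_w[j] -= C * y * x[j]
                grad_b -= C * y
        w = [wj - step * gj for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b

def accuracy(w, b, data):
    """Fraction of points whose predicted sign matches the label."""
    correct = 0
    for x, y in data:
        score = sum(wj * xj for wj, xj in zip(w, x)) + b
        correct += (1 if score >= 0 else -1) == y
    return correct / len(data)

# Synthetic, overlapping 2-D blobs (illustrative data, not from the lecture).
rng = random.Random(0)
samples = []
for _ in range(200):
    y = rng.choice([-1, 1])
    samples.append(([rng.gauss(y, 1.0), rng.gauss(y, 1.0)], y))

# 60% / 20% / 20% split (samples are already in random order).
train, val, test = samples[:120], samples[120:160], samples[160:]

# Assumed grid of 5 values; the lecture does not name the values it used.
C_grid = [0.01, 0.1, 1.0, 10.0, 100.0]
val_acc = {C: accuracy(*train_svm(train, C), val) for C in C_grid}
best_C = max(val_acc, key=val_acc.get)

# Retrain on training + validation with the chosen C, then score once on test.
w_final, b_final = train_svm(train + val, best_C)
test_acc = accuracy(w_final, b_final, test)
print(best_C, test_acc)
```

The test set is touched exactly once, after $C$ has been fixed, which is what keeps the final number an unbiased estimate.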