
SVM

  • linear separator (can be made non-linear by using a kernel)
  • not prone to the curse of dimensionality
  • not sensitive to outliers
  • not prone to overfitting
  • we search for the hyperplane that maximizes the margin between two classes
  • margin: distance between the two margin hyperplanes passing through the support vectors, equal to $\frac{2}{\|w\|}$
  • we want to find $w$ and $b$ such that $y_i(w^T x_i + b) \geq 1$ for every training point
  • we want to maximize the margin, that is, minimize $\|w\|$ or $\|w\|^2$ (convex)
  • h(x) is the decision function that determines which side of the hyperplane a point lies on:
  • $h(x) = \sum_{i=1}^{N} \alpha_i y_i \, x_i \cdot x + b$
  • soft margin: relax the constraint to $y_i(w^T x_i + b) \geq 1 - \xi_i$ with $\xi_i \geq 0$
  • $\xi_i$ is a slack variable
  • misclassification when $\xi_i > 1$, inside the margin when $0 < \xi_i \leq 1$
  • or minimize $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$
  • C is the error penalty term (positive); its effect is illustrated in the SVC sketch after this list:
    • C small: soft-SVM
      • large margin
      • tolerates more errors
      • low variance, high bias
    • C big: hard-SVM
      • small margin
      • tolerates no errors (overfitting)
      • low bias, high variance
  • the support vectors are the vectors that define the separating hyperplanes ($\alpha_i \neq 0$)
  • The kernel function $K(x, x')$ computes the inner product in feature space:
  • $K(x, x') = \langle \varphi(x), \varphi(x') \rangle$
  • The problem with this scalar product is that it is computed in a high-dimensional space, which makes the calculation impractical. For example, if φ maps to a 1-million-dimensional space, computing φ(x) requires storing 1 million values for each data point, and the dot product requires 1 million multiplications and additions.
  • The kernel trick therefore replaces a scalar product in a high-dimensional space with a kernel function K that is cheap to evaluate: we compute $K(x, x')$ directly without ever computing $\varphi(x)$ and $\varphi(x')$ (see the kernel sketch after this list). For example, with the Gaussian (RBF) kernel:
  • $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
  • for a function $K(x, x')$ to be a valid kernel, there must exist a feature mapping function $\varphi$ such that:
    • $\varphi$ maps input vectors to some feature space $F$
    • $K(x, x') = \langle \varphi(x), \varphi(x') \rangle$ for all $x, x'$
  • This is known as Mercer's condition: not every function can be a kernel
  • For example:
    • Linear kernel: $K(x, x') = x^T x'$ corresponds to $\varphi(x) = x$ (identity mapping)
    • Polynomial kernel: $K(x, x') = (x^T x' + 1)^2$ corresponds to a $\varphi$ that maps to all degree-2 polynomial features
    • RBF kernel: $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$ corresponds to a $\varphi$ mapping to an infinite-dimensional space
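
A minimal sketch of the effect of C on the soft margin, using scikit-learn (a library choice for this illustration, not prescribed by these notes); the toy dataset and the values of C are purely illustrative.

```python
# Sketch: effect of C on the soft margin (illustrative data and values)
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy 2-class data with some overlap
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):  # small C ~ soft margin, large C ~ hard margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)  # margin width = 2 / ||w||
    print(f"C={C:>6}: margin={margin:.3f}, support vectors={clf.n_support_.sum()}")
```

With a linear kernel, `clf.coef_` gives $w$, so the margin width $\frac{2}{\|w\|}$ can be read off directly; a small C should yield a wider margin and more support vectors, a large C a narrower margin with fewer tolerated errors.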
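
To make the kernel trick concrete, the sketch below (NumPy only; the explicit feature map `phi` and the test points are illustrative assumptions) checks that the degree-2 polynomial kernel $(x^T x' + 1)^2$ returns the same value as the dot product $\langle \varphi(x), \varphi(x') \rangle$ in the mapped space.

```python
# Sketch: the kernel trick for the degree-2 polynomial kernel
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """Polynomial kernel K(x, z) = (x.z + 1)^2, computed without phi."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both quantities are equal: the kernel avoids building phi(x) explicitly
print(poly_kernel(x, z))        # (1*3 + 2*(-1) + 1)^2 = 4
print(np.dot(phi(x), phi(z)))   # same value via the explicit feature map
```

The kernel evaluates one dot product in the 2-D input space, whereas the explicit map first builds 6-dimensional vectors; for the RBF kernel the mapped space is infinite-dimensional, so only the kernel form is computable.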

One class SVM

One-Class SVM is similar to the standard SVM, except that:

  • instead of using a hyperplane to separate two classes of instances
  • it uses a hypersphere to encompass all of the instances.
  • Now think of the "margin" as referring to the outside of the hypersphere
  • so by "the largest possible margin", we mean "the smallest possible hypersphere".

Get a confidence score

https://prateekvjoshi.com/2015/12/15/how-to-compute-confidence-measure-for-svm-classifiers/
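
The linked post discusses how to turn SVM outputs into confidence measures. Below is a minimal scikit-learn sketch of two common options (dataset and parameters are illustrative): the signed distance to the hyperplane via `decision_function`, and Platt-scaled probabilities via `probability=True` / `predict_proba`.

```python
# Sketch: two common confidence measures for an SVM classifier
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# probability=True enables Platt scaling (fitted internally, adds training cost)
clf = SVC(kernel="rbf", C=1.0, probability=True, random_state=0).fit(X, y)

scores = clf.decision_function(X[:3])  # signed distance to the hyperplane: magnitude ~ confidence
probas = clf.predict_proba(X[:3])      # calibrated class probabilities via Platt scaling
print(scores)
print(probas)
```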

Infographic

More