- linear separator (can be made non-linear with a kernel)
- relatively robust to the curse of dimensionality (the model depends only on inner products and on the support vectors)
- relatively robust to outliers (the soft margin limits their influence)
- not very prone to overfitting (maximizing the margin acts as regularization)
- we search for the hyperplane that maximizes the margin between two classes
- margin: distance between the hyperplane and the support vectors
- we want to find $w$ and $b$ that maximize the margin $\frac{2}{\lVert w \rVert}$, subject to $y_i(w \cdot x_i + b) \ge 1$ for all training points
- maximizing the margin is equivalent to minimizing $\lVert w \rVert$, or $\frac{1}{2}\lVert w \rVert^2$ (convex)
- $h(x) = \operatorname{sign}(w \cdot x + b)$ is the decision function that determines which side of the hyperplane a point lies on (see the numerical sketch below)
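A quick numerical illustration of these formulas (a minimal sketch; the vector $w$, the bias $b$ and the points are made up for the example, not taken from a fitted model):

```python
import numpy as np

# Hypothetical hyperplane w.x + b = 0 in 2D (values chosen by hand)
w = np.array([2.0, -1.0])
b = 0.5

def h(x):
    """Decision function: which side of the hyperplane the point x lies on."""
    return np.sign(w @ x + b)

points = np.array([[1.0, 1.0], [-1.0, 2.0], [0.0, 1.0]])
print([h(p) for p in points])            # +1 or -1 for each point

# If the constraints y_i (w.x_i + b) >= 1 hold, the margin width is 2 / ||w||
print("margin width:", 2 / np.linalg.norm(w))
```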
- soft margin: maximize the margin while allowing some violations, under the constraints $y_i(w \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$
- $\xi_i$ is a slack variable: the point is misclassified when $\xi_i > 1$, inside the margin when $0 < \xi_i \le 1$
- equivalently, minimize $\frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i$
- $C$ is the (positive) penalty term for errors (its effect is illustrated in the sketch below):
    - $C$ small: soft-SVM
        - large margin
        - tolerates more errors
        - low variance, high bias
    - $C$ big: hard-SVM
        - small margin
        - tolerates no errors (overfitting)
        - low bias, high variance
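A small sketch of the effect of $C$, assuming scikit-learn's `SVC` and a hypothetical two-blob dataset; the values of $C$ are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping blobs so the margin has to tolerate some errors
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    # small C: wider margin, more tolerated errors; large C: closer to a hard margin
    print(f"C={C:>6}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"support vectors = {len(clf.support_vectors_)}")
```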
- the support vectors are the points that lie on the margin and define the separating hyperplanes ($y_i(w \cdot x_i + b) = 1$)
- the kernel function $K(x, x') = \langle \varphi(x), \varphi(x') \rangle$ computes the inner product in feature space
- the problem with this scalar product is that it is computed in a high-dimensional space, which makes the calculation impractical. For example, if $\varphi$ maps to a 1-million-dimensional space, computing $\varphi(x)$ would require storing 1 million values for each data point, and the dot product would require 1 million multiplications and additions.
- the kernel trick is therefore to replace the scalar product in the high-dimensional space with a kernel function $K$ that is cheap to compute: $K(x, x')$ is evaluated directly, without ever computing $\varphi(x)$ or $\varphi(x')$. For example, the Gaussian (RBF) kernel: $K(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)$ (checked numerically below)
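For instance, the RBF value can be computed directly from $x$ and $x'$ without any feature map; a minimal check against scikit-learn's pairwise helper (the vectors and $\sigma$ below are arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([1.0, 2.0, 3.0])
xp = np.array([2.0, 0.0, 1.0])
sigma = 1.5

# K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); no feature map phi is ever computed
k_direct = np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

# scikit-learn parameterizes the same kernel with gamma = 1 / (2 sigma^2)
k_pairwise = rbf_kernel(x.reshape(1, -1), xp.reshape(1, -1),
                        gamma=1 / (2 * sigma ** 2))[0, 0]

print(k_direct, k_pairwise)   # identical values
```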
- for a function $K(x, x')$ to be a valid kernel, there must exist a feature mapping $\varphi$ into some feature space $F$ such that $K(x, x') = \langle \varphi(x), \varphi(x') \rangle$ for all $x, x'$
- this is known as Mercer's condition: not every function can be a kernel
- for example:
    - linear kernel: $K(x, x') = x \cdot x'$ corresponds to $\varphi(x) = x$ (identity mapping)
    - polynomial kernel: $K(x, x') = (x \cdot x' + c)^2$ corresponds to a $\varphi$ that maps to all degree-2 polynomial features (the kernel trick is verified numerically in the sketch below)
    - RBF kernel: $K(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)$ corresponds to a $\varphi$ mapping to an infinite-dimensional space
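A small numerical check of the kernel trick for the homogeneous degree-2 polynomial kernel $K(x, x') = (x \cdot x')^2$ (the $c = 0$ case) in 2D, using the standard explicit map $\varphi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$; the input vectors are made up:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2D inputs: R^2 -> R^3."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, xp):
    """Kernel trick: the same value, computed directly in input space."""
    return (x @ xp) ** 2

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

print(phi(x) @ phi(xp))    # dot product in the feature space
print(poly_kernel(x, xp))  # kernel in input space -> same number, no phi needed
```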
- One-Class SVM is similar:
    - instead of using a hyperplane to separate two classes of instances, it uses a hypersphere to encompass all of the instances
    - now think of the "margin" as referring to the outside of the hypersphere, so by "the largest possible margin" we mean "the smallest possible hypersphere" (see the sketch below)
- https://prateekvjoshi.com/2015/12/15/how-to-compute-confidence-measure-for-svm-classifiers/
- https://stats.stackexchange.com/questions/23391/how-does-a-support-vector-machine-svm-work
- https://pythonmachinelearning.pro/classification-with-support-vector-machines/
- https://github.com/ujjwalkarn/DataSciencePython#support-vector-machine-in-python
- https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
- https://web.archive.org/web/20230318045102/http://rvlasveld.github.io/blog/2013/07/12/introduction-to-one-class-support-vector-machines/
- http://vxy10.github.io/2016/06/26/lin-svm/
- https://sebastianraschka.com/faq/docs/num-support-vectors.html
- https://chunml.github.io/ChunML.github.io/tutorial/Support-Vector-Machine/
- http://www.statsoft.com/textbook/support-vector-machines
- https://www.svm-tutorial.com/2015/06/svm-understanding-math-part-3/
- https://cs.stanford.edu/people/karpathy/svmjs/demo/
- https://dscm.quora.com/The-Kernel-Trick
- https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
- https://fr.wikipedia.org/wiki/Astuce_du_noyau#Contexte_et_principe
- https://medium.com/@zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d
- https://stats.stackexchange.com/questions/323593/how-does-a-one-class-svm-model-work?rq=1
- https://towardsdatascience.com/support-vector-machines-soft-margin-formulation-and-kernel-trick-4c9729dc8efe
- https://towardsdatascience.com/truly-understanding-the-kernel-trick-1aeb11560769