| binary classification | regression | soft binary classification | note |
|---|---|---|---|
| PLA/pocket | linear ridge regression | logistic regression | worse performance |
| linear soft-margin SVM | linear SVR | regularized logistic regression | |
| / | kernel ridge regression | kernel logistic regression | worse performance |
| SVM | SVR | probabilistic SVM | |
linear hard-margin SVM -> non-linear hard-margin SVM -> kernel SVM -> soft-margin SVM
http://www.bilibili.com/video/av9636299/index_2.html
$\max_{b,w}: margin(b,w)$ # (b,w) defines the separating hyperplane
subject to: every $y_n(w·x_n+b)>0$
$margin(b,w)=\min_{n=1,\dots,N} distance(x_n,b,w)$
where: $distance(x_n,b,w)=\frac{|w·x_n+b|}{\|w\|}=\frac{y_n(w·x_n+b)}{\|w\|}$ (the last step uses the fact that every point is correctly classified)
=>
$\max_{b,w}: \frac{1}{\|w\|} \iff \min_{b,w}: \frac{w·w}{2}$
subject to: $y_n(w·x_n+b) \geq 1$ for all $n$
=> Quadratic Programming (QP)
optimal $u \leftarrow QP(Q,p,A,c)$
$\min_u: \frac{1}{2}u^TQu+p^Tu$
subject to: $a_n^Tu \geq c_n$ for all $n$
here $u=[b\ \ w]^T$, $Q=\begin{bmatrix}0&0_d^T\\0_d&I_d\end{bmatrix}$, $p=0_{d+1}$, $a_n^T=y_n[1\ \ x_n^T]$, $c_n=1$
==> $g_{SVM}(x)=sign(w·x+b)$
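A minimal sketch of feeding this QP to a solver; cvxopt, the toy data, and the tiny ridge on the b entry (used only to keep the solver's KKT system non-singular) are assumptions for illustration, not part of the notes.

```python
# Hard-margin SVM primal as a QP: u = [b, w1..wd], min (1/2) u^T Q u + p^T u  s.t.  a_n^T u >= 1
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    N, d = X.shape
    Q = np.zeros((d + 1, d + 1))
    Q[1:, 1:] = np.eye(d)              # only w is penalized, not b
    Q[0, 0] = 1e-9                     # tiny ridge on b (assumption) to keep the QP non-singular
    p = np.zeros(d + 1)
    # y_n (w·x_n + b) >= 1  <=>  -y_n [1, x_n] u <= -1   (cvxopt uses G u <= h)
    G = -y[:, None] * np.hstack([np.ones((N, 1)), X])
    h = -np.ones(N)
    sol = solvers.qp(matrix(Q), matrix(p), matrix(G), matrix(h))
    u = np.array(sol['x']).ravel()
    return u[0], u[1:]                 # b, w

# usage on a tiny separable toy set (made up)
X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., 1., 1.])
b, w = hard_margin_svm(X, y)
print(np.sign(X @ w + b))              # reproduces y
```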
$z_n=\phi(x_n)$ represents the data in the transformed space where the hyperplane lives
$\phi(x)$ is the feature transform function
$\min_{b,w}: \frac{w·w}{2}$
s.t.: $y_n(w·z_n+b) \geq 1$ for all $n$
with Lagrange multipliers $\alpha_n$
$L(b,w,\alpha)= \frac{w·w}{2}+ \sum_n \alpha_n(1-y_n(w·z_n+b))$
SVM: $\min_{b,w}\max_{\alpha_n\geq 0} L(b,w,\alpha) = \max_{\alpha_n\geq 0}\min_{b,w} L(b,w,\alpha)$ (strong duality holds for this QP)
KKT conditions:
1、$\sum_n \alpha_n y_n=0$
2、$w=\sum_n \alpha_n y_n z_n$
3、$\alpha_n(1-y_n(w·z_n+b))=0$ for all $n$ (complementary slackness)
=> SVM: $\min_\alpha \left[\frac{1}{2}\left\|\sum_n \alpha_n y_n z_n\right\|^2-\sum_n \alpha_n\right]$
=>
standard hard-margin SVM dual
$\min_{\alpha}:\frac{1}{2}\left\|\sum_n\alpha_n y_n z_n\right\|^2-\sum_n\alpha_n$
s.t.: $\sum_n y_n\alpha_n=0$
$\alpha_n \geq 0$ for all $n$
if $\alpha_n>0$ then the point $(z_n,y_n)$ is a support vector (indexed by $s$ below)
=> QP => KKT => $b=y_s-w·z_s$, $w=\sum_n \alpha_n y_n z_n$
=>
$w=\sum_{SV} \alpha_s y_s z_s$ # only the support vectors contribute
$b=y_s-\sum_{SV} \alpha_m y_m z_m·z_s$ # for any support vector $s$
$g_{SVM}(x)=\mathrm{sign}(w·\phi(x)+b)$
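A sketch of the dual route end to end: solve for $\alpha$ with a QP solver, recover $b$ from a support vector, and predict through inner products only. cvxopt and the toy data are assumptions for illustration.

```python
# Hard-margin dual: min (1/2) a^T (yy^T * K) a - 1^T a   s.t.  y^T a = 0,  a >= 0
import numpy as np
from cvxopt import matrix, solvers

def dual_hard_margin(K, y):
    N = len(y)
    P = np.outer(y, y) * K
    q = -np.ones(N)
    G = -np.eye(N)                         # -alpha <= 0, i.e. alpha >= 0
    h = np.zeros(N)
    A = y.reshape(1, N).astype(float)      # equality constraint y^T alpha = 0
    b_eq = np.zeros(1)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h), matrix(A), matrix(b_eq))
    alpha = np.array(sol['x']).ravel()
    sv = alpha > 1e-6                      # support vectors: alpha_n > 0
    s = np.argmax(alpha)                   # any support vector works to recover b
    b = y[s] - np.sum(alpha[sv] * y[sv] * K[sv, s])
    return alpha, b

# made-up toy data with a linear kernel K = X X^T
X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., 1., 1.])
K = X @ X.T
alpha, b = dual_hard_margin(K, y)
score = (alpha * y) @ K + b                # sum_m alpha_m y_m K(x_m, x_n) + b
print(np.sign(score))                      # reproduces y
```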
$z_n·z_m=K(x_n,x_m)$, the kernel function of $x_n$ and $x_m$
kernel models:
linear kernel # degree 1
$K(x_n,x_m)=x_n·x_m$
polynomial kernel # degree Q
$K(x_n,x_m)=(\zeta+\gamma\,x_n·x_m)^Q$, $\gamma>0$, $\zeta\geq0$; the kernel matrix K must be positive semi-definite; Q should be kept small
gaussian kernel # infinite degree (infinite-dimensional transform)
$K(x_n,x_m)=\exp(-\gamma\|x_n-x_m\|^2)$, $\gamma>0$
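These three kernels map directly onto scikit-learn's SVC options (an assumed library choice; `coef0` plays the role of the polynomial constant, and C is the soft-margin parameter introduced next).

```python
# The three kernels, sketched with scikit-learn on made-up data.
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

models = {
    "linear":     SVC(kernel="linear", C=1.0),                                # K = x_n . x_m
    "polynomial": SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0),  # K = (coef0 + gamma x_n.x_m)^degree
    "gaussian":   SVC(kernel="rbf", gamma=1.0, C=1.0),                        # K = exp(-gamma ||x_n - x_m||^2)
}
for name, clf in models.items():
    clf.fit(X, y)
    print(name, "train accuracy:", clf.score(X, y), "#SV:", clf.support_.size)
```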
objective:
$\min_{b,w,\xi}$:
$\frac{1}{2}w·w+C\sum_n\xi_n$  s.t.: $y_n(w·z_n+b)\geq 1-\xi_n$, $\xi_n\geq 0$
=>
$\min_{\alpha}$:
$\frac{1}{2}\left\|\sum_n\alpha_n y_n z_n\right\|^2-\sum_n\alpha_n$  s.t.: $\sum_n y_n\alpha_n=0$, $0\leq\alpha_n\leq C$
solve for $\alpha_n$ with the same method (kernel + QP) as before
KKT conditions:
1、$\sum_n \alpha_n y_n=0$
2、$w=\sum_n \alpha_n y_n z_n$
3、$\alpha_n(1-\xi_n-y_n(w·z_n+b))=0$ and $(C-\alpha_n)\xi_n=0$ for all $n$
case analysis for each point:
·$\alpha_n=0$: $\xi_n=0$, the point is correctly classified, away from the boundary
·$0<\alpha_n<C$: $\xi_n=0$, the point lies exactly on the boundary (free support vector)
·$\alpha_n=C$: $\xi_n=1-y_n(w·z_n+b)$, i.e. $\xi_n$ records how much the point violates the margin
·the larger C, the weaker the regularization, and the more likely overfitting becomes
=> $b=y_s-\sum_n \alpha_n y_n K(x_n,x_s)$ # for any free support vector $s$ ($0<\alpha_s<C$)
$w=\sum_n \alpha_n y_n z_n$
$g_{SVM}(x)=\mathrm{sign}\left(\sum_n \alpha_n y_n K(x_n,x)+b\right)$
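A sketch of reading the case analysis off a trained soft-margin SVM: in scikit-learn (assumed), `dual_coef_` stores $y_n\alpha_n$ for the support vectors, so an absolute value equal to C flags the bounded (margin-violating) SVs and a smaller value flags the free SVs on the boundary.

```python
# Free vs bounded support vectors of a soft-margin SVM (made-up data and C).
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.3, random_state=1)
C = 1.0
clf = SVC(kernel="rbf", gamma=1.0, C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_).ravel()    # alpha_n of the support vectors only
free_sv    = np.sum(alpha < C - 1e-8)     # 0 < alpha_n < C : on the boundary, xi_n = 0
bounded_sv = np.sum(alpha >= C - 1e-8)    # alpha_n = C     : margin violators, xi_n > 0
print("free SVs:", free_sv, "bounded SVs:", bounded_sv)
```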
kernel ridge regression (least-squares SVM): behaves approximately like SVM when used for classification
objective:
$\min_{w}$:
$\frac{\lambda}{N}w·w+\frac{1}{N}\sum_n(y_n-w·z_n)^2$
optimal solution: $w_*=\sum_n\beta_n z_n$ (representer theorem)
so we can solve for the optimal $\beta$ instead of $w$
=>
$\min_{\beta}$:
$\frac{\lambda}{N}\beta^TK\beta+\frac{1}{N}\|y-K\beta\|^2$
$\beta=(\lambda I+K)^{-1}y$, which always exists for $\lambda>0$ (K is positive semi-definite, so $\lambda I+K$ is invertible)
time complexity: $O(N^3)$; the solve is heavy because K is a dense $N\times N$ matrix
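A sketch of the closed form $\beta=(\lambda I+K)^{-1}y$ with a Gaussian kernel; the data, $\lambda$, and $\gamma$ are made up for illustration.

```python
# Kernel ridge regression: solve (lambda*I + K) beta = y, predict with g(x) = sum_n beta_n K(x_n, x).
import numpy as np

def gaussian_kernel(A, B, gamma):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(100)

lam, gamma = 0.1, 0.5
K = gaussian_kernel(X, X, gamma)
beta = np.linalg.solve(lam * np.eye(len(X)) + K, y)   # the O(N^3) step

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(gaussian_kernel(X_test, X, gamma) @ beta)       # predictions on a few test points
```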
objective:
$\min_{w}$:
$\frac{\lambda}{N}w·w+\frac{1}{N}\sum_n\log(1+\exp(-y_n\,w·z_n))$
optimal solution: $w_*=\sum_n\beta_n z_n$ (representer theorem)
so we can solve for the optimal $\beta$ instead of $w$
=>
$\min_{\beta}$:
$\frac{\lambda}{N}\beta^TK\beta+\frac{1}{N}\sum_n\log\left(1+\exp\left(-y_n\sum_m\beta_m K(x_m,x_n)\right)\right)$
almost all of the $\beta_n$ are non-zero (a dense solution, unlike the sparse $\alpha$ of SVM), which is a drawback
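A sketch of minimizing the $\beta$-form objective by plain gradient descent (step size, iteration count, and data are made up); running it also shows the dense-$\beta$ drawback.

```python
# Kernel logistic regression: gradient descent on beta for (lambda/N) b^T K b + (1/N) sum log(1+exp(-y_n (K b)_n)).
import numpy as np

def gaussian_kernel(A, B, gamma):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)        # labels in {-1, +1}

lam, gamma, eta, steps = 0.01, 1.0, 0.05, 2000
K = gaussian_kernel(X, X, gamma)
N = len(y)
beta = np.zeros(N)
for _ in range(steps):
    s = K @ beta                                       # s_n = sum_m beta_m K(x_m, x_n)
    grad = (2 * lam / N) * (K @ beta) - (1.0 / N) * (K @ (y * sigmoid(-y * s)))
    beta -= eta * grad

print("non-zero beta_n:", np.sum(np.abs(beta) > 1e-6), "of", N)   # typically almost all of them
```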
objective:
$\min_{w}$:
$\frac{\lambda}{N}w·w+\frac{1}{N}\sum_n\max(0,|w·z_n-y_n|-\epsilon)$ # L2-regularized tube regression
<==>
$\frac{1}{2}w·w+C\sum_n(\xi_n^{\vee}+\xi_n^{\wedge})$  s.t.: $-\epsilon-\xi_n^{\vee}\leq y_n-w·z_n-b\leq\epsilon+\xi_n^{\wedge}$, $\xi_n^{\vee}\geq0$, $\xi_n^{\wedge}\geq0$
=>
$\min_{\alpha^{\wedge},\alpha^{\vee}}$:
$\frac{1}{2}\left\|\sum_n(\alpha_n^{\wedge}-\alpha_n^{\vee})z_n\right\|^2+\sum_n\left[(\epsilon-y_n)\alpha_n^{\wedge}+(\epsilon+y_n)\alpha_n^{\vee}\right]$  s.t.: $\sum_n(\alpha_n^{\wedge}-\alpha_n^{\vee})=0$, $0\leq\alpha_n^{\wedge}\leq C$, $0\leq\alpha_n^{\vee}\leq C$
KKT conditions:
1、$w=\sum_n(\alpha_n^{\wedge}-\alpha_n^{\vee})z_n$
2、$\sum_n(\alpha_n^{\wedge}-\alpha_n^{\vee})=0$
3、$\alpha_n^{\wedge}(\epsilon+\xi_n^{\wedge}-y_n+w·z_n+b)=0$
4、$\alpha_n^{\vee}(\epsilon+\xi_n^{\vee}+y_n-w·z_n-b)=0$
1、if a point is strictly within the tube, $|w·z_n+b-y_n|<\epsilon$:
then $\xi_n^{\wedge}=\xi_n^{\vee}=0$ and both complementary-slackness factors are non-zero, so $\beta_n=\alpha_n^{\wedge}-\alpha_n^{\vee}=0$
2、$\beta_n\neq 0$: the point is on or outside the tube, i.e. a support vector => the SVR solution is sparse
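A sketch with scikit-learn's SVR (assumed): only the points on or outside the $\epsilon$-tube end up with $\beta_n\neq 0$, so they are the only support vectors.

```python
# epsilon-SVR: the tube produces a sparse solution (made-up data and hyper-parameters).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

reg = SVR(kernel="rbf", gamma=0.5, C=1.0, epsilon=0.1).fit(X, y)
print("support vectors:", reg.support_.size, "of", len(X))   # points on or outside the tube
print("prediction at 0:", reg.predict([[0.0]]))
```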
probabilistic SVM (Platt scaling):
1、run SVM on D to get $(b_{SVM},w_{SVM})$ [or the dual $\alpha$'s], and transform D to $z'_n=w_{SVM}·\phi(x_n)+b_{SVM}$
2、run logistic regression on $\{(z'_n,y_n)\}_{n=1..N}$ to get $(A,B)$
3、return $g(x)=\theta(A·(w_{SVM}·\phi(x)+b_{SVM})+B)$
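A sketch of the two-level procedure with scikit-learn (assumed): `decision_function` plays the role of $w_{SVM}·\phi(x)+b_{SVM}$, and a one-dimensional logistic regression supplies $(A,B)$; fitting it on the training scores, as below, is a simplification of the same idea that `SVC(probability=True)` implements internally.

```python
# Probabilistic SVM (Platt-style two-level learning), sketched with scikit-learn.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# 1) run SVM on D; its decision function is w_SVM . phi(x) + b_SVM
svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
z = svm.decision_function(X).reshape(-1, 1)

# 2) run logistic regression on {(z'_n, y_n)} to get (A, B)
lr = LogisticRegression().fit(z, y)
A, B = lr.coef_[0, 0], lr.intercept_[0]

# 3) g(x) = theta(A * (w_SVM . phi(x) + b_SVM) + B)
def g(X_new):
    s = svm.decision_function(X_new)
    return 1.0 / (1.0 + np.exp(-(A * s + B)))

print(g(X[:5]), y[:5])                      # estimated P(y=+1|x) vs. the labels
```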