MIT+ML04: Kernel Methods and Non-linear Classification

2020-08-10

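Feature engineering has always been a very important part of machine learning.

Whether in spam filtering or in a native OCR system, and in classification problems especially, choosing good features has a major effect on how well a model learns and trains.

The slide screenshots in these notes are kept only as a personal archive!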

Common Feature engineering

CTR:click through rate
- One-hot
- One-of-K encoding
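As a small illustration of the one-hot idea listed above, here is a minimal sketch using NumPy. The helper name `one_hot` and the sample category values are hypothetical and are not taken from the slides.

```python
import numpy as np

def one_hot(values):
    """Encode categorical values as rows containing a single 1.

    Hypothetical helper for illustration: categories are sorted so the
    column order is deterministic.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, v in enumerate(values):
        encoded[row, index[v]] = 1
    return categories, encoded

# Made-up categorical field, e.g. the type of content that was clicked.
categories, encoded = one_hot(["ad", "news", "ad", "video"])
print(categories)
print(encoded)
```

Each distinct category gets its own column, so a field with K possible values becomes a length-K vector with exactly one non-zero entry.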

In current machine learning practice, the trend is increasingly toward field feature + hand field feature.

So the design of good features is very important.

In the complete process:

We need to decide what raw features to collect.

Once we know these raw features, we have to collect them and decide how to encode them.

data cleaning
preprocessing
scikit-learn

Two views, or two directions, for classification problems:

A nonlinear classifier with given features
Nonlinear features with linear classifier

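So what is the benefit of deep neural networks?

A deep neural network is a non-linear classifier that works on raw data.

So a DNN can be understood as: first construct non-linear features, then pass them through a linear classifier to do the classification.

The clearest example: the CNN (convolutional neural network).

Non-linear classifier examples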

Quadratic classifier:

nearest neighbour classifier:
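A nearest neighbour classifier can be sketched in a few lines. This is only an illustration assuming Euclidean distance and a single neighbour; the function name and the toy data are made up for the example.

```python
import numpy as np

def nearest_neighbour_predict(x, X_train, y_train):
    """Return the label of the training point closest to x (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

# Toy training set (assumed for illustration).
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_train = np.array([-1, 1, 1, -1])

label = nearest_neighbour_predict(np.array([0.9, 0.2]), X_train, y_train)
print(label)
```

The decision boundary this produces is piecewise linear rather than a single hyperplane, which is why it counts as a non-linear classifier even though the features are used as given.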

Nonlinear features:

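Definition

A not perfectly true example:

Often, for data in a low-dimensional space, it is hard to find a linear separation; in the terms of Lecture 2, the data is not linearly separable. But when we map the data into a higher-dimensional space, it sometimes becomes linearly separable.

This is the magic of non-linear features: a non-linear feature maps $DataSets=[x^{(1)},x^{(2)},\ldots,x^{(n)}]$ to $[\Phi(x^{(1)}),\Phi(x^{(2)}),\ldots,\Phi(x^{(n)})]$

where $\Phi$ is the non-linear feature, i.e. a non-linear function.

Example

A very famous example: XOR data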
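To make the XOR case concrete, here is a minimal sketch. The feature map `phi` (adding the product $x_1x_2$) and the hand-picked weights `w`, `w0` are assumptions chosen for illustration, not values from the slides.

```python
import numpy as np

# The four XOR points and their labels (+1 when exactly one input is 1).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, -1])

def phi(x):
    """Assumed non-linear feature map: append the product x1 * x2."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x2])

# Hand-picked linear classifier in the 3-dimensional feature space.
w = np.array([1.0, 1.0, -2.0])
w0 = -0.5

predictions = np.array([int(np.sign(w @ phi(x) + w0)) for x in X])
print(predictions)  # matches y, so the mapped data is linearly separable
```

No line in the original two dimensions separates these four points, but a single plane does once the product feature is added.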

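Some common nonlinear features

As we said before: non-linear feature + simple linear classifier.

Once we have features that make data which was not linearly separable become linearly separable, does anything change in the simple linear classifiers we first learned?

In essence, these methods do not change.

But what price do we pay in going from data that is hard to separate linearly to data that is easy to separate linearly?

i.e. How long is the feature vector in each case?

In other words, in terms of length, how much extra computation do we pay?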
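One way to get a feel for the cost is to count features. The sketch below assumes polynomial features, i.e. all monomials of total degree at most p in d input variables (including the constant term), whose count is the binomial coefficient C(d + p, p). The function name is hypothetical.

```python
from math import comb

def num_poly_features(d, p):
    """Number of monomials of total degree <= p in d variables,
    counting the constant term."""
    return comb(d + p, p)

for d, p in [(2, 2), (100, 2), (100, 3)]:
    print(d, p, num_poly_features(d, p))
```

For d = 2 and p = 2 this gives 6 features (1, x1, x2, x1^2, x1*x2, x2^2), while for d = 100 the count is already in the thousands at degree 2 and in the hundreds of thousands at degree 3.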

Kernels

The kernel method gives us a very neat trick.

With kernel methods, we no longer need to construct the non-linear feature function explicitly; we can use these features implicitly.

$w^T\phi(x)$

Decision function

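$h(x)=\operatorname{sgn}(\langle w,\phi(x)\rangle+w_0)$

Note that the angle brackets here are the notation for the dot product, and our classifier's decision depends only on dot products.

Now suppose we can write the best parameter vector as a linear combination of the feature-mapped training data: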

$w=\sum_ia_i\phi(x^{(i)})$

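Why can we write the best parameter vector as a linear combination of the training data? Is it because we are using a linear classifier? One concrete case where it holds: the perceptron starts from $w=0$ and each update adds a multiple of some $\phi(x^{(i)})$, so its $w$ always has this form.

So the inner product that is hard to compute, and that we want to compute, is:

$\langle w,\phi(x)\rangle=\langle\sum_i a_i\phi(x^{(i)}),\phi(x)\rangle$

$=\sum_i a_i\langle\phi(x^{(i)}),\phi(x)\rangle$

That is, we want to replace this inner product with another form:

$\langle\phi(x^{(i)}),\phi(x)\rangle=\mathrm{Kernel}(x^{(i)},x)$

So if we can find a kernel function, we do not need to go to great lengths to find the non-linear feature function.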
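The expansion of $w$ over training points can be turned into a classifier that only ever calls the kernel. The following is a sketch of a kernel perceptron under several assumptions: the kernel $(1 + x^Tz)^2$ is chosen for illustration (its constant term stands in for the offset $w_0$), the coefficients are stored as mistake counts `alpha` so that $a_i = \alpha_i y^{(i)}$, and the XOR points are used as toy data. None of these names come from the slides.

```python
import numpy as np

def quadratic_kernel(x, z):
    """Assumed kernel for illustration: (1 + x.z)^2."""
    return (1.0 + np.dot(x, z)) ** 2

def kernel_perceptron(X, y, kernel, max_epochs=100):
    """Learn mistake counts alpha so that w = sum_i alpha_i * y_i * phi(x_i),
    without ever forming phi explicitly."""
    n = len(X)
    alpha = np.zeros(n)
    # Precompute all pairwise kernel values between training points.
    K = np.array([[kernel(a, b) for b in X] for a in X])
    for _ in range(max_epochs):
        mistakes = 0
        for j in range(n):
            score = np.sum(alpha * y * K[:, j])
            if y[j] * score <= 0:  # misclassified (or on the boundary)
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(x, X, y, alpha, kernel):
    """Sign of sum_i alpha_i * y_i * K(x_i, x)."""
    score = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X))
    return 1 if score > 0 else -1

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, 1, 1, -1])

alpha = kernel_perceptron(X, y, quadratic_kernel)
predictions = [predict(x, X, y, alpha, quadratic_kernel) for x in X]
print(predictions)
```

Both training and prediction touch the data only through `kernel(...)`, which is the point of the substitution above: the feature map never has to be written down.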

A simple example of how cheap a kernel function is to compute

$x=[x_1,x_2],z=[z_1,z_2]$

$\phi(x)=[x_1^2,\sqrt{2} x_1x_2,x_2^2]$ , $\phi(z)=[z_1^2,\sqrt{2} z_1z_2,z_2^2]$

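Compute: $\phi(x)^T\phi(z)$

$\phi(x)^T\phi(z)=x_1^2z_1^2+2x_1x_2z_1z_2+x_2^2z_2^2=(x_1z_1+x_2z_2)^2=(x^Tz)^2$

Computing with the kernel is much cheaper than computing the feature vectors explicitly and then taking their dot product.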
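The identity can be checked numerically. This is a small sketch with made-up input vectors; `phi` follows the feature map defined in this example and `kernel` is the squared dot product.

```python
import numpy as np

def phi(v):
    """Explicit feature map from the example: [v1^2, sqrt(2) v1 v2, v2^2]."""
    v1, v2 = v
    return np.array([v1 ** 2, np.sqrt(2) * v1 * v2, v2 ** 2])

def kernel(x, z):
    """Kernel form of the same quantity: (x.z)^2."""
    return np.dot(x, z) ** 2

# Arbitrary test vectors (assumed for illustration).
x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = phi(x) @ phi(z)   # map to 3 dimensions, then take the dot product
implicit = kernel(x, z)      # one 2-dimensional dot product, then square

print(explicit, implicit, np.isclose(explicit, implicit))
```

The kernel route needs one dot product in the original space plus a squaring, regardless of how many coordinates the explicit feature vector would have.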

The bonus material is archived; see the slides for details. It is also omitted here to avoid copyright infringement.