\[\begin{equation} \frac{\partial L(w,b)}{\partial w}=- \sum_{x_i \in M}y_ix_i\\\\ \frac{\partial L(w,b)}{\partial b}=- \sum_{x_i \in M}y_i \end{equation}\]From these gradients we obtain the update rules for $w$ and $b$:
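\[w_{new} \leftarrow w_{old} - \rho \frac{\partial L(w,b)}{\partial w}=w_{old} + \rho \sum_{x_i \in M}y_ix_i\\\\ b_{new} \leftarrow b_{old} - \rho \frac{\partial L(w,b)}{\partial b}=b_{old} + \rho \sum_{x_i \in M}y_i\]In practice, however, the sum $\sum$ over all misclassified points is not plugged in. Instead, the following single-sample rule is used for the iterative updates (see page 28 of 统计学习方法, Statistical Learning Methods):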
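\[w_{new} \leftarrow w_{old} + \rho y_ix_i\\\\ b_{new} \leftarrow b_{old} + \rho y_i\]That is, during actual training we randomly initialize $w$ and $b$ and then pick one point from the training set. If that point is misclassified, i.e. $y_i(wx_i+b)\leq 0$ (this is why the loss function carries an extra minus sign in front), we update the weights and bias with the rule above.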
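The single-sample update can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the function names `dot` and `train_perceptron`, the parameters `max_epochs` and `seed`, the choice to sweep the training set in order rather than sample points at random, and the three toy data points are all assumptions made for the example.

```python
import random


def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wj * xj for wj, xj in zip(w, x))


def train_perceptron(samples, rho=1.0, max_epochs=1000, seed=0):
    """Train a perceptron with the single-sample update rule.

    samples: list of (x, y) pairs, x a tuple of floats, y in {+1, -1}.
    rho: learning rate, as in the update formulas above.
    max_epochs and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    dim = len(samples[0][0])
    # Random initialization of w and b.
    w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    b = rng.uniform(-1.0, 1.0)

    for _ in range(max_epochs):
        updated = False
        for x, y in samples:
            # A point is misclassified when y_i (w x_i + b) <= 0.
            if y * (dot(w, x) + b) <= 0:
                # w <- w + rho * y_i * x_i,  b <- b + rho * y_i
                w = [wj + rho * y * xj for wj, xj in zip(w, x)]
                b = b + rho * y
                updated = True
        if not updated:
            # A full pass without any update: every point is classified correctly.
            break
    return w, b


# Toy, linearly separable data (made up for illustration).
samples = [((3.0, 3.0), 1), ((4.0, 3.0), 1), ((1.0, 1.0), -1)]
w, b = train_perceptron(samples)
print(w, b)
```

Because the toy data is linearly separable, the loop stops once a full pass produces no update. On data that is not separable it would simply run until `max_epochs` is reached.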
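Why the sum is dropped is explained at 26:38 of Deep Learning, Perceptron, Backpropagation. The main difference lies in the type of gradient descent being used. Depending on how many training samples go into each update, gradient descent comes in three variants: batch gradient descent, stochastic gradient descent (SGD), and mini-batch gradient descent. The difference between batch and stochastic gradient descent is as follows (quoted from An overview of gradient descent optimization algorithms):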
Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online.
Though the sigma gives you the correct direction, you can show that stochastic gradient descent almost all of the time gives you a good direction. It may happen that using only one point you go wrong in one iteration, but overall you generally go in the right direction, and it's faster.
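To make the contrast concrete, the sketch below applies one batch step (using the sum over the misclassified set $M$) and one stochastic step (using a single point) from the same starting point $w=0$, $b=0$. The helper names `batch_step` and `stochastic_step` and the three toy points are assumptions made for illustration only.

```python
def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wj * xj for wj, xj in zip(w, x))


def batch_step(samples, w, b, rho):
    """One batch update: sum y_i x_i over all misclassified points M."""
    M = [(x, y) for x, y in samples if y * (dot(w, x) + b) <= 0]
    new_w = [wj + rho * sum(y * x[j] for x, y in M) for j, wj in enumerate(w)]
    new_b = b + rho * sum(y for _, y in M)
    return new_w, new_b


def stochastic_step(sample, w, b, rho):
    """One stochastic update: use a single point, only if it is misclassified."""
    x, y = sample
    if y * (dot(w, x) + b) <= 0:
        w = [wj + rho * y * xj for wj, xj in zip(w, x)]
        b = b + rho * y
    return w, b


# Toy data (made up for illustration); with w = 0, b = 0 every point
# satisfies y_i (w x_i + b) = 0 <= 0, so all three count as misclassified.
samples = [((3.0, 3.0), 1), ((4.0, 3.0), 1), ((1.0, 1.0), -1)]
w0, b0 = [0.0, 0.0], 0.0

batch_w, batch_b = batch_step(samples, w0, b0, rho=1.0)
sgd_w, sgd_b = stochastic_step(samples[0], w0, b0, rho=1.0)

print(batch_w, batch_b)  # [6.0, 5.0] 1.0
print(sgd_w, sgd_b)      # [3.0, 3.0] 1.0
```

The batch step has to scan the whole training set before it moves once, while the stochastic step moves after looking at a single point. The two directions differ, but on this toy data both push $w$ toward the positive examples.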