Professional English W11: Chapter 1 to the end
Learning with gradient descent
Training-set and test-set samples, and overfitting: keeping a separate test set helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.
In the handwriting-recognition training set, each training image is a $28\times28$-pixel grayscale image, flattened into a $784$-dimensional input vector.
$y = y(x)$, where y is the desired output, a 10-dimensional vector;
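As a concrete sketch of this representation (a minimal numpy illustration; the array names and the sample digit are invented for the example, not taken from the notes):

```python
import numpy as np

# A single 28x28 grayscale training image (pixel values in [0, 1]);
# here it is just random data standing in for a real MNIST image.
image = np.random.rand(28, 28)

# Flatten it into the 784-dimensional input vector x.
x = image.reshape(784, 1)

# The desired output y(x) is 10-dimensional: a one-hot vector with a 1
# in the position of the correct digit.
digit = 6
y = np.zeros((10, 1))
y[digit] = 1.0

print(x.shape, y.shape)  # (784, 1) (10, 1)
```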
Cost function: mean squared error (MSE)
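Written out in the notation above (with $w$ denoting all the weights, $b$ all the biases, $n$ the number of training inputs, and $a$ the network's output when $x$ is the input), the quadratic cost being referred to is:

$$C(w, b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2$$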
Aim of our training algorithm: to minimize the cost $C(w, b)$ as a function of the weights and biases.
Gradient descent: we want to find a set of weights and biases which make the cost as small as possible.
Quadratic cost function
Why the quadratic cost is used to evaluate learning rather than the number of correctly classified images directly:
The number of correct classifications is not a smooth function of the weights and biases; a small change to the weights or biases usually causes no change at all in that count, which makes it very hard to work out how to adjust the weights and biases to improve.
By contrast, with a smooth cost function, small changes in the weights and biases produce small, visible changes in the cost, so it becomes clear how they should be adjusted (see the sketch after this list).
First, focus on minimizing the quadratic cost;
Second, examine the classification accuracy.
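A toy sketch of that contrast (the linear model, names, and data below are invented purely for illustration, not the notes' actual network): nudging the weights slightly normally leaves the classification count unchanged, while the quadratic cost shifts by a small, usable amount.

```python
import numpy as np

np.random.seed(0)

# Toy setup: 100 inputs, a single linear layer with weights w, one-hot targets Y.
X = np.random.rand(100, 784)
Y = np.eye(10)[np.random.randint(0, 10, 100)]
w = np.random.randn(784, 10) * 0.01

def quadratic_cost(w):
    # Smooth: mean squared error over the 100 training inputs.
    return np.mean(np.sum((X @ w - Y) ** 2, axis=1)) / 2

def num_correct(w):
    # Not smooth: the count of correctly classified inputs.
    return np.sum(np.argmax(X @ w, axis=1) == np.argmax(Y, axis=1))

w_nudged = w + 1e-6 * np.random.randn(784, 10)      # a tiny weight change
print(num_correct(w), num_correct(w_nudged))        # usually identical
print(quadratic_cost(w), quadratic_cost(w_nudged))  # differs slightly
```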
The learning logic here: treat this first as the problem of minimizing a given, known function;
use gradient descent to solve such minimization problems;
then come back to the specific cost function we actually need to minimize for the network.
How do we minimize a function of many variables, especially when the function has a very large number of variables?
Picture a ball rolling down to the bottom of a valley; we could do this simulation simply by computing derivatives (and perhaps some second derivatives) of C.
But what's really exciting about the equation $\Delta C \approx \nabla C \cdot \Delta v$ is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative.
$\Delta v = -\eta \nabla C$, where $\eta$ is the learning rate;
The choice of the learning-rate parameter is important: if $\eta$ is too small, gradient descent proceeds very slowly; if it is too large, the step can overshoot and C may not decrease at all.
We first constrain the step length to be a small fixed value, i.e. $\|\Delta v\| = \epsilon$ with $\epsilon > 0$. With the step length fixed, we look for the direction of movement that decreases C the most.
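A minimal sketch of this update rule (the example function $C(v) = v_1^2 + v_2^2$ and all names are placeholders chosen for the illustration):

```python
import numpy as np

def gradient_descent(grad_C, v0, eta=0.1, steps=100):
    """Repeatedly apply the update Δv = -η ∇C, i.e. v -> v - eta * grad_C(v)."""
    v = np.array(v0, dtype=float)
    for _ in range(steps):
        v = v - eta * grad_C(v)
    return v

# Example: C(v) = v1^2 + v2^2 has gradient 2v and its minimum at (0, 0).
print(gradient_descent(lambda v: 2 * v, v0=[3.0, -4.0]))  # approaches [0, 0]
```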
An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.
To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs.
We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch.
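A minimal sketch of one such step (the function and variable names are assumptions made for the example; grad_Cx stands in for whatever computes $\nabla C_x$ for a single training input):

```python
import numpy as np

def sgd_step(w, grad_Cx, training_data, m, eta):
    """One stochastic-gradient-descent step.

    grad_Cx(w, x) returns the gradient of the per-example cost C_x at w;
    the true gradient ∇C is estimated by averaging grad_Cx over a
    mini-batch of m randomly chosen training inputs X_1, ..., X_m.
    """
    indices = np.random.choice(len(training_data), size=m, replace=False)
    mini_batch = [training_data[j] for j in indices]
    grad_estimate = np.mean([grad_Cx(w, x) for x in mini_batch], axis=0)
    return w - eta * grad_estimate

# Toy usage (invented example): per-example cost C_x(w) = (w - x)^2 / 2,
# so grad_Cx = w - x, and C is minimized at the mean of the data.
data = list(np.random.randn(1000))
w = 0.0
for _ in range(200):
    w = sgd_step(w, lambda w, x: w - x, data, m=10, eta=0.1)
print(w)  # close to the mean of `data`
```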
We can think of stochastic gradient descent as being like an opinion poll: sampling a small mini-batch is much easier than applying gradient descent to the full data set, just as carrying out a poll is easier than running a full election.
Of course, the estimate won't be perfect (there are statistical fluctuations), but it doesn't need to be perfect: all we really care about is moving in a general direction that decreases C, so we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques developed in the book.
An extreme version of gradient descent is to use a mini-batch size of just 1: we update the weights and biases according to the rule using a single training input, then pick another training input and update them again.
Repeating this over and over gives a procedure known as online, on-line, or incremental learning.
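A self-contained toy sketch of this extreme case (the linear model and data are invented for illustration; the point is only that the update happens after every single input):

```python
import numpy as np

# Online (incremental) learning: the mini-batch size is 1, so the weights
# are updated after every single training input.
# Toy example: per-example cost C_x(w) = (w·x - y)^2 / 2,
# whose gradient with respect to w is (w·x - y) x.
rng = np.random.default_rng(0)
training_data = [(rng.standard_normal(3), y) for y in rng.standard_normal(50)]
w = np.zeros(3)
eta = 0.05

for x, y in training_data:       # take one training input at a time
    grad_Cx = (w @ x - y) * x    # gradient of the single-example cost
    w = w - eta * grad_Cx        # update immediately, then move on
print(w)
```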