- How to train a NN?
- What is the basic idea behind training?
- Some of the aspects of NN optimization:
- BP
- Mini-batch
- Initialization
- Batch normalization
- Gradient Clipping
- Adaptive methods: Adam & Adagrad
- Momentum
- Other ideas.
Use SGD to train a neural network.
ERM formulation:
$$
\min_{\theta} R_N(\theta)=\frac{1}{N}\sum_{i=1}^{N}\mathrm{Loss}\bigl(y_i,F(x_i;\theta)\bigr)
$$
SGD update:
$$
\theta \leftarrow \theta-\eta\,\frac{\partial \mathrm{Loss}}{\partial\theta}
$$
Q1: How to select $\theta_0$?
Q2: How to control and adjust the step size (a.k.a. learning rate)?
Q3: How to compute this gradient?
Where is a good place to initialize the starting point for SGD?
At random?
All of these refinements need to be considered.
In reality, there are multiple ways to do SGD:
- momentum (see the sketch after this list)
- clipping
- adaptivity
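For momentum, a minimal sketch of the standard heavy-ball update is below; the notes only name the idea, so the form and hyperparameters here are the usual ones, assumed for illustration:

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, eta=0.01, beta=0.9):
    """One heavy-ball step: the velocity accumulates a decaying sum of past gradients."""
    velocity = beta * velocity + grad
    theta = theta - eta * velocity
    return theta, velocity

theta = np.zeros(3)
velocity = np.zeros_like(theta)
theta, velocity = sgd_momentum_step(theta, np.array([1.0, -2.0, 0.5]), velocity)
```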
Step 1: How to start? How to initialize?
Since the NN loss is highly nonconvex, initializing a NN well is important.
Moreover, optimizing it to attain a "good" solution is hard and requires careful tuning.
Don't initialize all weights to the same value.
Reason: if all weights are initialized identically, the neurons in a layer are indistinguishable and stay that way. If all weights are set to 0, the network also cannot gain any expressive power from having multiple units.
Initialize randomly:
via a Gaussian $N(0,\sigma^2)$, where $\sigma$ depends on the number of neurons in a given layer.
For ReLUs: the current recommendation is $\sigma^2=\frac{2}{n}$, where $n$ is the number of inputs to the unit.
[To verify]: this roughly ensures that the scale of the random input to a unit does not depend on the number of inputs it gets. Symmetry breaking?
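A minimal sketch of this random initialization with the $\sigma^2=\frac{2}{n}$ rule (the layer sizes below are hypothetical):

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """He-style init: weights ~ N(0, 2/n_in), so the pre-activation scale
    does not blow up or shrink with the number of inputs a unit receives."""
    sigma = np.sqrt(2.0 / n_in)
    W = rng.normal(0.0, sigma, size=(n_out, n_in))
    b = np.zeros(n_out)   # biases can start at zero; randomness in W breaks symmetry
    return W, b

rng = np.random.default_rng(0)
W1, b1 = init_layer(784, 256, rng)   # hypothetical layer sizes
W2, b2 = init_layer(256, 10, rng)
```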
Step 2: How to set the learning rate?
- Too small: convergence is too slow.
- Too large: unstable, the loss fluctuates.
Numerous heuristics exist for tuning the learning rate:
- Decaying
- Adaptive
- Architecture-sensitive
It is often the most pesky parameter; tuning it well can have a huge impact.
NN toolkits use so-called "step-size" schedulers.
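As one illustrative scheduler, a minimal step-decay sketch (the drop factor and interval are made-up values, not a recommendation):

```python
def step_decay(eta0, epoch, drop=0.5, every=10):
    """Halve the learning rate every `every` epochs, starting from eta0."""
    return eta0 * (drop ** (epoch // every))

for epoch in range(30):
    eta = step_decay(eta0=0.1, epoch=epoch)
    # ... run one epoch of SGD with this eta ...
```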
Adaptive learning rates: the Adam solver.
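A minimal sketch of the Adam update with bias correction, using the commonly quoted default hyperparameters (assumed here, not taken from the notes above):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: per-parameter step sizes from running moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize 0.5 * ||theta - target||^2.
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    grad = theta - np.array([1.0, -2.0, 0.5])
    theta, m, v = adam_step(theta, grad, m, v, t)
```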
Step 3: How to compute a stochastic gradient
Note: make sure you understand the aim of SGD; we need
$$
\frac{\partial \mathrm{Loss}}{\partial \theta}
$$
BP (backpropagation) is simply an efficient way to compute this partial derivative.
$$
\begin{aligned}
z_i &= \sum_{j=1}^{p} w_{ij}x_j+b_i && \text{(input to the $i$-th hidden unit)}\\
f(z_i) &= \mathrm{activation}(z_i) && \text{(hidden activation)}\\
z &= \sum_{i=1}^{m} w_i f(z_i)+b && \text{(hidden to output)}\\
F(\vec x;\theta) &= f(z) = z && \text{(output)}
\end{aligned}
$$
We can see the chain of effects above: changing a weight changes the input to a single unit, which changes that unit's output after the activation function, which in turn affects the final output and ultimately the loss.
Aim: to get $\frac{\partial \mathrm{Loss}}{\partial \theta}$.
For $\theta = w_{ij}$:
$\frac{\partial \mathrm{Loss}}{\partial w_{ij}}=\frac{\partial z_i}{\partial w_{ij}}\,\frac{\partial f(z_i)}{\partial z_i}\,\frac{\partial z}{\partial f(z_i)}\,\frac{\partial \mathrm{Loss}}{\partial z}$
General idea of BP
Deriving BP by hand
Two months later, going through BP again gave me a clearer picture of the overall network structure and of how the feedforward pass and BP fit together.
Note the relationship between BP and dynamic programming: do some extra storage (cache intermediate values) to gain efficiency.
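A minimal "BP by hand" sketch for the one-hidden-layer network above, caching the forward-pass values and reusing them in the backward pass; a ReLU activation and a squared loss are assumed here for concreteness:

```python
import numpy as np

def forward(x, W, b_hidden, w, b):
    """Forward pass; cache every intermediate needed by the backward pass."""
    z_hidden = W @ x + b_hidden          # z_i = sum_j w_ij x_j + b_i
    h = np.maximum(z_hidden, 0.0)        # f(z_i), ReLU assumed
    z = w @ h + b                        # z = sum_i w_i f(z_i) + b
    return z, (x, z_hidden, h)

def backward(z, y, w, cache):
    """Backward pass: chain rule, reusing the cached values (dynamic programming)."""
    x, z_hidden, h = cache
    dLoss_dz = z - y                               # squared loss: 0.5 * (z - y)^2
    dLoss_dw = dLoss_dz * h                        # output-layer weights w_i
    dLoss_dh = dLoss_dz * w
    dLoss_dz_hidden = dLoss_dh * (z_hidden > 0)    # through the ReLU
    dLoss_dW = np.outer(dLoss_dz_hidden, x)        # dLoss/dw_ij
    return dLoss_dW, dLoss_dw

rng = np.random.default_rng(0)
p, m = 4, 3
W, b_hidden = rng.normal(size=(m, p)) * np.sqrt(2 / p), np.zeros(m)
w, b = rng.normal(size=m) * np.sqrt(2 / m), 0.0
x, y = rng.normal(size=p), 1.0

z, cache = forward(x, W, b_hidden, w, b)
dW, dw = backward(z, y, w, cache)
```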
A rough analogy: treat each arrow in the figure above as a debt, i.e. c→e means e owes c money. Take a and b: computing e's partial derivative with respect to each of them directly is like a and b each going out to collect what they are owed. a asks c; c says "e owes me, go ask him", so a goes past c to e. b first asks c and is likewise redirected to e; b then asks d and is once again redirected to e. The collection trips are long and hard, and full of repetition: both a and b are forwarded from c to e.
Author: Anonymous
Source: Zhihu, https://www.zhihu.com/question/27239198/answer/89853077
Automatic Differentiation
Forward-mode AD
Backward (reverse) mode AD (BP is a special case).
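To make the forward-mode idea concrete, here is a minimal dual-number sketch (an illustrative class supporting only + and *; each operation propagates a value together with its derivative):

```python
class Dual:
    """A number carrying its value and its derivative w.r.t. one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

# d/dx of f(x, y) = x*x + x*y at (x, y) = (3, 2): seed x's derivative with 1.
x, y = Dual(3.0, 1.0), Dual(2.0, 0.0)
f = x * x + x * y
print(f.value, f.deriv)   # 15.0, 8.0  (df/dx = 2x + y)
```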
Other innovations:
- Vanishing gradients
- Exploding gradients
- Partial remedies for unstable gradients (e.g. gradient clipping; see the sketch after this list).
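One common partial remedy for exploding gradients, named in the outline above, is gradient clipping; a minimal sketch of norm clipping (the threshold is illustrative):

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm; leave it unchanged otherwise."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = clip_by_norm(np.array([3.0, 4.0]), max_norm=1.0)   # rescaled to norm 1
```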
Residual Networks (ResNets)
Does stacking a neural network ever deeper always lead to better learning?
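ResNets add skip connections of the form y = x + F(x), so each block can at least pass its input through unchanged; a minimal sketch, with F left as a generic layer function (the ReLU layer below is only an assumed example):

```python
import numpy as np

def residual_block(x, layer_fn):
    """y = x + F(x): even if F contributes little, the identity path keeps
    information (and gradients) flowing through very deep stacks."""
    return x + layer_fn(x)

# Illustrative F: a small ReLU layer with an assumed 4x4 weight matrix.
W = 0.1 * np.eye(4)
y = residual_block(np.ones(4), lambda x: np.maximum(W @ x, 0.0))
```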
left to explore.