- How to train a NN?
- What is the basic idea behind training?
- Some of the aspects of NN optimization:
- BP
- Mini-batch
- Initialization
- Batch normalization
- Gradient Clipping
- Adaptive methods: Adam & Adagrad
- Momentum
- Other ideas.
Use SGD to train a neural network.
ERM formulation:
$$
\min_{\theta} R_N(\theta)=\frac{1}{N}\sum_{i=1}^{N}\mathrm{Loss}\bigl(y_i,F(x_i;\theta)\bigr)
$$
SGD update:
$$
\theta \leftarrow \theta-\eta\,\frac{\partial \mathrm{Loss}}{\partial\theta}
$$
Q1: How to select $\theta_0$?
Q2: How to control and adjust the step size (a.k.a. learning rate)?
Q3: How to compute this gradient?
Where is a good place to initialize the starting point for SGD?
At random?
All of these refinements need to be considered.
In reality, there are multiple ways to do SGD:
- momentum (see the sketch after this list)
- clipping
- adaptivity
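For momentum, a minimal sketch of the standard heavy-ball update is below; the notes only name the idea, so the form and hyperparameters here are the usual ones, assumed for illustration:

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, eta=0.01, beta=0.9):
    """One heavy-ball step: the velocity accumulates a decaying sum of past gradients."""
    velocity = beta * velocity + grad
    theta = theta - eta * velocity
    return theta, velocity

theta = np.zeros(3)
velocity = np.zeros_like(theta)
theta, velocity = sgd_momentum_step(theta, np.array([1.0, -2.0, 0.5]), velocity)
```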
Step 1: How to start? How to initialize?
Since the NN loss is highly nonconvex, initializing a NN well is important.
Moreover, optimizing it to attain a "good" solution is hard and requires careful tuning.
Don't initialize all weights to the same value.
Reason: if all weights are initialized identically, the neurons in a layer are indistinguishable and stay that way. If all weights are set to 0, the network also cannot gain any expressive power from having multiple units.
Initialize randomly:
via a Gaussian $N(0,\sigma^2)$, where $\sigma$ depends on the number of neurons in a given layer.
For ReLUs: the current recommendation is $\sigma^2=\frac{2}{n}$, where $n$ is the number of inputs to the unit.
[To verify]: this roughly ensures that the scale of the random input to a unit does not depend on the number of inputs it gets. Symmetry breaking?
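A minimal sketch of this random initialization with the $\sigma^2=\frac{2}{n}$ rule (the layer sizes below are hypothetical):

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """He-style init: weights ~ N(0, 2/n_in), so the pre-activation scale
    does not blow up or shrink with the number of inputs a unit receives."""
    sigma = np.sqrt(2.0 / n_in)
    W = rng.normal(0.0, sigma, size=(n_out, n_in))
    b = np.zeros(n_out)   # biases can start at zero; randomness in W breaks symmetry
    return W, b

rng = np.random.default_rng(0)
W1, b1 = init_layer(784, 256, rng)   # hypothetical layer sizes
W2, b2 = init_layer(256, 10, rng)
```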
Step 2: How to set the learning rate?
- Too small: convergence is too slow.
- Too large: unstable, the loss fluctuates.
Numerous heuristics exist for tuning the learning rate:
- Decaying
- Adaptive
- Architecture-sensitive
It is often the most pesky parameter; tuning it well can have a huge impact.
NN toolkits use so-called "step-size" schedulers.
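As one illustrative scheduler, a minimal step-decay sketch (the drop factor and interval are made-up values, not a recommendation):

```python
def step_decay(eta0, epoch, drop=0.5, every=10):
    """Halve the learning rate every `every` epochs, starting from eta0."""
    return eta0 * (drop ** (epoch // every))

for epoch in range(30):
    eta = step_decay(eta0=0.1, epoch=epoch)
    # ... run one epoch of SGD with this eta ...
```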
Adaptive learning rates: the Adam solver.
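A minimal sketch of the Adam update with bias correction, using the commonly quoted default hyperparameters (assumed here, not taken from the notes above):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: per-parameter step sizes from running moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize 0.5 * ||theta - target||^2.
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    grad = theta - np.array([1.0, -2.0, 0.5])
    theta, m, v = adam_step(theta, grad, m, v, t)
```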
Step 3: How to compute a stochastic gradient
Note: make sure you understand the aim of SGD; we need
$$
\frac{\partial \mathrm{Loss}}{\partial \theta}
$$
BP (backpropagation) is simply an efficient way to compute this partial derivative.
$$
\begin{aligned}
z_i &= \sum_{j=1}^{p} w_{ij}x_j+b_i && \text{(input to the $i$-th hidden unit)}\\
f(z_i) &= \mathrm{activation}(z_i) && \text{(hidden activation)}\\
z &= \sum_{i=1}^{m} w_i f(z_i)+b && \text{(hidden to output)}\\
F(\vec x;\theta) &= f(z) = z && \text{(output)}
\end{aligned}
$$
We can see the chain of effects above: changing a weight changes the input to a single unit, which changes that unit's output after the activation function, which in turn affects the final output and ultimately the loss.
Aim: to get $\frac{\partial \mathrm{Loss}}{\partial \theta}$.
For $\theta = w_{ij}$:
$\frac{\partial \mathrm{Loss}}{\partial w_{ij}}=\frac{\partial z_i}{\partial w_{ij}}\,\frac{\partial f(z_i)}{\partial z_i}\,\frac{\partial z}{\partial f(z_i)}\,\frac{\partial \mathrm{Loss}}{\partial z}$
General idea of BP
Deriving BP by hand
Two months later, going through BP again gave me a clearer picture of the overall network structure and of how the feedforward pass and BP fit together.
Note the relationship between BP and dynamic programming: do some extra storage (cache intermediate values) to gain efficiency.
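A minimal "BP by hand" sketch for the one-hidden-layer network above, caching the forward-pass values and reusing them in the backward pass; a ReLU activation and a squared loss are assumed here for concreteness:

```python
import numpy as np

def forward(x, W, b_hidden, w, b):
    """Forward pass; cache every intermediate needed by the backward pass."""
    z_hidden = W @ x + b_hidden          # z_i = sum_j w_ij x_j + b_i
    h = np.maximum(z_hidden, 0.0)        # f(z_i), ReLU assumed
    z = w @ h + b                        # z = sum_i w_i f(z_i) + b
    return z, (x, z_hidden, h)

def backward(z, y, w, cache):
    """Backward pass: chain rule, reusing the cached values (dynamic programming)."""
    x, z_hidden, h = cache
    dLoss_dz = z - y                               # squared loss: 0.5 * (z - y)^2
    dLoss_dw = dLoss_dz * h                        # output-layer weights w_i
    dLoss_dh = dLoss_dz * w
    dLoss_dz_hidden = dLoss_dh * (z_hidden > 0)    # through the ReLU
    dLoss_dW = np.outer(dLoss_dz_hidden, x)        # dLoss/dw_ij
    return dLoss_dW, dLoss_dw

rng = np.random.default_rng(0)
p, m = 4, 3
W, b_hidden = rng.normal(size=(m, p)) * np.sqrt(2 / p), np.zeros(m)
w, b = rng.normal(size=m) * np.sqrt(2 / m), 0.0
x, y = rng.normal(size=p), 1.0

z, cache = forward(x, W, b_hidden, w, b)
dW, dw = backward(z, y, w, cache)
```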
A rough analogy: treat each arrow in the figure above as a debt, i.e. c→e means e owes c money. Take a and b: computing e's partial derivative with respect to each of them directly is like a and b each going out to collect what they are owed. a asks c; c says "e owes me, go ask him", so a goes past c to e. b first asks c and is likewise redirected to e; b then asks d and is once again redirected to e. The collection trips are long and hard, and full of repetition: both a and b are forwarded from c to e.
Author: Anonymous
Source: Zhihu, https://www.zhihu.com/question/27239198/answer/89853077
Automatic Differentiation
Forward-mode AD
Backward (reverse) mode AD (BP is a special case).
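To make the forward-mode idea concrete, here is a minimal dual-number sketch (an illustrative class supporting only + and *; each operation propagates a value together with its derivative):

```python
class Dual:
    """A number carrying its value and its derivative w.r.t. one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

# d/dx of f(x, y) = x*x + x*y at (x, y) = (3, 2): seed x's derivative with 1.
x, y = Dual(3.0, 1.0), Dual(2.0, 0.0)
f = x * x + x * y
print(f.value, f.deriv)   # 15.0, 8.0  (df/dx = 2x + y)
```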
Other innovations:
- Vanishing gradients
- Exploding gradients
- Partial remedies for unstable gradients (e.g. gradient clipping; see the sketch after this list).
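One common partial remedy for exploding gradients, named in the outline above, is gradient clipping; a minimal sketch of norm clipping (the threshold is illustrative):

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm; leave it unchanged otherwise."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = clip_by_norm(np.array([3.0, 4.0]), max_norm=1.0)   # rescaled to norm 1
```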
Residual Networks (ResNets)
Does stacking a neural network ever deeper always lead to better learning?
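ResNets add skip connections of the form y = x + F(x), so each block can at least pass its input through unchanged; a minimal sketch, with F left as a generic layer function (the ReLU layer below is only an assumed example):

```python
import numpy as np

def residual_block(x, layer_fn):
    """y = x + F(x): even if F contributes little, the identity path keeps
    information (and gradients) flowing through very deep stacks."""
    return x + layer_fn(x)

# Illustrative F: a small ReLU layer with an assumed 4x4 weight matrix.
W = 0.1 * np.eye(4)
y = residual_block(np.ones(4), lambda x: np.maximum(W @ x, 0.0))
```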
left to explore.