3B1B's videos are addictive! Brilliantly made.
But what is a Neural Network?
website: https://www.3blue1brown.com/lessons/neural-networks
video: https://www.youtube.com/watch?v=aircAruvnKk
Plain vanilla: the multilayer perceptron (MLP)
classic example: recognize handwritten digits
neurons: a thing that holds a number (its activation)
layers:
- input layer; output layer
- hidden layers
- why 2 hidden layers and with 16 neurons?
core question: how do the activations of one layer determine the activations of the next layer?
- e.g. recognize the loop on the top: 8,9
- how to recognize these edges, loops, and patterns? break them down into little pieces?
edge detection example
what parameters?
- weights
- calculate the weighted sum of the activations from the input layer
- e.g. assign positive weights to the pixels where the edge should be, negative weights to the surrounding pixels, and zero to all the rest.
requirement: all activations $\in [0,1]$
- functions:
- sigmoid, written $\sigma()$
- but sigmoid is rarely used anymore; ReLU and its variants are more common (in deep NNs)
- ReLU(a)=max(0,a) (inactive below 0)
- bias: written $b$; keeps the neuron from being activated too easily
- $\sigma(\omega_1a_1+\omega_2a_2+…+\omega_na_n+b)$
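A minimal NumPy sketch of the pieces above: sigmoid, ReLU, and a single neuron computing $\sigma(\omega_1a_1+\omega_2a_2+…+\omega_na_n+b)$. The pixel values, weight pattern, and bias below are made-up illustrations of the "edge detector" idea, not numbers from the video.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """ReLU(a) = max(0, a): inactive below zero."""
    return np.maximum(0.0, z)

def neuron(a_prev, w, b):
    """One neuron: sigmoid of the weighted sum of inputs plus a bias."""
    return sigmoid(np.dot(w, a_prev) + b)

# toy "edge detector": positive weights where the edge should be,
# negative weights just around it, zero elsewhere (all values made up)
a_prev = np.array([0.0, 0.9, 0.8, 0.1, 0.0])    # previous-layer activations (pixels)
w      = np.array([-1.0, 2.0, 2.0, -1.0, 0.0])  # hypothetical weight pattern
b      = -2.0                                   # bias: don't fire too easily
print(neuron(a_prev, w, b), relu(np.dot(w, a_prev) + b))
```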
too many weights and biases!
- learning: finding the right weights and biases
The whole first layer in compact form: $a^{(1)}=\sigma(Wa^{(0)}+b)$
reinterpret each neuron as a function: it takes the outputs of the previous layer and produces a single number.
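A hedged sketch of that matrix form, where one whole layer is a function of the previous layer's activation vector. The sizes (784 inputs, 16 hidden neurons) match the video's example network; the random initialization is just for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(a_prev, W, b):
    """One whole layer as a function: a^(1) = sigma(W a^(0) + b)."""
    return sigmoid(W @ a_prev + b)

rng = np.random.default_rng(0)
a0 = rng.random(784)                 # a flattened 28x28 input image
W  = rng.standard_normal((16, 784))  # 16 neurons in the first hidden layer
b  = rng.standard_normal(16)
a1 = layer(a0, W, b)                 # 16 activations, each in (0, 1)
print(a1.shape, float(a1.min()), float(a1.max()))
```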
deep learning
gradient descent
- cost function → average cost
- input: weights & biases
- output: 1 number (the cost)
- parameters: training examples
- $C(\omega)$ : single input. how to find its minimum?
- follow the slope downhill: you end up in a local minimum (which one depends on the random starting point); see the sketch after this list
- local vs. global minimum, the usual story
- $C(x,y)$ : two inputs
- $\nabla C(x,y)$ , gradient, the direction of the steepest increase
- backpropagation
- network learning=minimizing the cost function
- smooth output (activations ranging continuously)
- how to interpret the ~13,000-dimensional gradient of the cost function ($\nabla C$)? Another way to think about it:
- magnitude: tells which dimensions matter most
- sign: tells which direction to move along that dimension
- (each dimension corresponds to one weight or bias)
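A hedged one-dimensional sketch of the "follow the slope downhill" idea. The cost C here is an arbitrary toy function with more than one valley, not the network's real cost; the point is only that repeated steps against the derivative settle into whichever local minimum the random start happens to fall toward.

```python
import numpy as np

def C(w):
    """A made-up smooth single-input cost with several local minima."""
    return w**4 - 3*w**2 + w

def dC(w):
    """Its derivative, i.e. the 1-D gradient."""
    return 4*w**3 - 6*w + 1

w = np.random.uniform(-2, 2)   # random start: which valley you end up in depends on it
lr = 0.01                      # step size
for _ in range(1000):
    w -= lr * dC(w)            # step downhill, against the gradient
print(w, C(w))                 # a local (not necessarily global) minimum
```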
analyze this network
loose patterns, not intelligence…
it is not actually picking up on edges and loops
learn more
To learn more, I highly recommend Michael Nielsen's book
http://neuralnetworksanddeeplearning....
The book walks step by step through the code behind the examples in these videos, which you can find here:
https://github.com/mnielsen/neural-ne...
MNIST database:
http://yann.lecun.com/exdb/mnist/
Also check out Chris Olah's blog:
His articles on neural networks and topology are particularly beautiful, but honestly everything there is great. If you like it, you will _love_ Distill's publications:
research corner
Two papers:
A Closer Look at Memorization in Deep Networks
The Loss Surfaces of Multilayer Networks
backpropagation
intuitive walkthrough
without notation!
- bad network, silly outcome
- we can only adjust the weights & biases
for example, an image of the digit "2".
you want to push the activation of the "2" neuron up from 0.2 toward 1; since that activation $=\sigma(\omega_1a_1+\omega_2a_2+…+\omega_na_n+b)$,
there are 3 ways:
- increase b
- increase $\omega_i$
- in proportion to $a_i$ (bumping the $\omega_i$ attached to the larger $a_i$ gives the most bang for the buck)
- “Neurons that fire together wire together”
- change $a_i$
- in proportion to $\omega_i$ (increase activations with positive weights, decrease those with negative weights)
and at the same time decrease the activations of all the other digit neurons.
So we add up the desired effects from all the last-layer neurons and obtain the wanted nudges for the second-to-last layer.
THAT'S ONE STEP OF PROPAGATING BACKWARDS!
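A small sketch of why those proportionalities hold, using the chain rule on one toy last-layer neuron. All numbers (activations, weights, bias) are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# a toy last-layer "2" neuron with made-up numbers
a = np.array([0.1, 0.9, 0.4])    # activations feeding in from the previous layer
w = np.array([0.5, -0.3, 0.8])   # weights into the "2" neuron
b = -0.5
z = w @ a + b
out = sigmoid(z)                 # current activation, which we would like closer to 1

# sensitivity of the output to each knob (chain rule):
d_out_d_b = d_sigmoid(z)         # nudging the bias
d_out_d_w = d_sigmoid(z) * a     # nudging weights: proportional to a_i ("fire together, wire together")
d_out_d_a = d_sigmoid(z) * w     # nudging activations: proportional to w_i (sign gives the direction)
print(out, d_out_d_b, d_out_d_w, d_out_d_a)
```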
Sum those nudges within each layer, then average the total nudges over all the training examples; the result is (a multiple of) the gradient vector $\nabla C$. (Not an exact quantification.)
How to be lazy about it?
Split the training examples into mini-batches, compute $\nabla C$ on each mini-batch, then combine the results.
This is called stochastic gradient descent (SGD).
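A hedged sketch of that mini-batch loop. `grad_fn` is a hypothetical placeholder standing in for "run backpropagation on one mini-batch and return $\nabla C$"; only the shuffling/batching structure is the point here.

```python
import numpy as np

def sgd(params, grad_fn, examples, lr=0.1, batch_size=32, epochs=1):
    """Stochastic gradient descent: estimate the gradient on small
    mini-batches instead of the full training set at every step."""
    examples = np.array(examples)
    for _ in range(epochs):
        np.random.shuffle(examples)               # new random mini-batches each epoch
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            params = params - lr * grad_fn(params, batch)  # step against the batch gradient
    return params
```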
derivatives in computational graphs
dive a little bit into the calculus!
start simple, then build up!
- one neuron per layer
- $C(\omega_1, b_1,\omega_2,b_2,\omega_3,b_3)$
- neuron indexing:
- the superscript denotes the layer, e.g. $a^{(L)}$ is the activation of the (single) neuron in the last layer
- desired output (target last-layer activation): written $y$ (0 or 1)
- Cost $C_0=(a^{(L)}-y)^2$
- $a^{(L)}=\sigma(\omega^{(L)}a^{(L-1)}+b^{(L)})$
- 令$z^{(L)}=\omega^{(L)}a^{(L-1)}+b^{(L)}$
- 则$a^{(L)}=\sigma(z^{(L)})$
- how sensitive is $C_0$ to a small change in $\omega^{(L)}$?
- $\frac{\partial C_0}{\partial \omega^{(L)}}=\frac{\partial z^{(L)}}{\partial \omega^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial C_0}{\partial a^{(L)}}$
- $C_0=(a^{(L)}-y)^2$, so $\frac{\partial C_0}{\partial a^{(L)}}=2(a^{(L)}-y)$
- $a^{(L)}=\sigma(z^{(L)})$
- $\frac{\partial a^{(L)}}{\partial z^{(L)}}=\sigma'(z^{(L)})$
- $z^{(L)}=\omega^{(L)}a^{(L-1)}+b^{(L)}$
- $\frac{\partial z^{(L)}}{\partial \omega^{(L)}}=a^{(L-1)}$
- $\frac{\partial C_0}{\partial \omega^{(L)}}=\frac{\partial z^{(L)}}{\partial \omega^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial C_0}{\partial a^{(L)}}=a^{(L-1)}\sigma'(z^{(L)})\cdot 2(a^{(L)}-y)$
- the same goes for $b^{(L)}$, for the other weights within a layer, and for the $\omega^{(i)},b^{(i)}$ of every layer (see the sketch below)
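A sketch that checks the chain rule above numerically for the last link of the one-neuron-per-layer chain. The values of $a^{(L-1)}$, $\omega^{(L)}$, $b^{(L)}$, and $y$ are made up; the analytic derivative should match a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# made-up numbers for the last link of the one-neuron-per-layer chain
a_prev, w, b, y = 0.6, 1.5, -0.8, 1.0

def cost(w):
    z = w * a_prev + b
    return (sigmoid(z) - y) ** 2

# chain rule: dC0/dw = (dz/dw)(da/dz)(dC0/da) = a_prev * sigma'(z) * 2(a - y)
z = w * a_prev + b
a = sigmoid(z)
analytic = a_prev * d_sigmoid(z) * 2 * (a - y)

# numerical check with a small finite difference
eps = 1e-6
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)
print(analytic, numeric)   # the two should agree to several decimal places
```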
- more than one neuron per layer: add subscripts
- $a^{(L-1)}_k$ & $a^{(L)}_j$: $\omega^{(L)}_{jk}$
- $C_0=\sum_{j=0}^{n_L-1}(a_j^{(L)}-y_j)^2$
- $\frac{\partial C_0}{\partial a^{(L-1)}_k}$ needs a sum over $j$, since $a^{(L-1)}_k$ feeds into every neuron of layer $L$ (see the sketch below)
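A sketch of that sum for a tiny two-layer slice (3 neurons in layer $L-1$, 2 in layer $L$); sizes and random values are made up. The analytic sum over $j$ is checked against a finite-difference estimate for one component $k$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# made-up sizes: 3 neurons in layer L-1, 2 neurons in layer L
rng = np.random.default_rng(1)
a_prev = rng.random(3)               # a^(L-1)
W = rng.standard_normal((2, 3))      # row j holds the weights w^(L)_{jk} into neuron j
b = rng.standard_normal(2)
y = np.array([1.0, 0.0])

z = W @ a_prev + b
a = sigmoid(z)
C0 = np.sum((a - y) ** 2)

# dC0/da^(L-1)_k sums over every neuron j of layer L that a_k feeds into
dC_da_prev = W.T @ (2 * (a - y) * d_sigmoid(z))

# finite-difference check on component k
k, eps = 0, 1e-6
bumped = a_prev.copy(); bumped[k] += eps
numeric = (np.sum((sigmoid(W @ bumped + b) - y) ** 2) - C0) / eps
print(dC_da_prev[k], numeric)
```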