3B1B's videos are addictive! Superbly made.

But what is a Neural Network?

website: https://www.3blue1brown.com/lessons/neural-networks

video: https://www.youtube.com/watch?v=aircAruvnKk

  • Plain vanilla form: the multilayer perceptron (MLP)

  • classic example: recognize handwritten digits

  • neurons: a thing that holds a number (its activation)

  • layers:

    • input layer; output layer
    • hidden layers
    • why 2 hidden layers, each with 16 neurons?
  • core question: how do the activations of one layer determine the activations of the next layer?

    • e.g. recognize the loop on the top: 8,9
      • how to recognize these edges, loops, and patterns? break them down into little pieces?
  • edge detection example

    • what parameters?

      • weights
      • calculate a weighted sum of the activations from the input layer
        • e.g. assign positive weights to the pixels where the edge lies, negative weights to the surrounding pixels, and zero everywhere else.
    • requirement: all activations $\in [0,1]$

      • functions:
        • sigmoid, written $\sigma()$
        • sigmoid is rarely used anymore, though; ReLU and its variants are more common (in deep NNs)
          • ReLU(a)=max(0,a) (inactive below 0)
      • bias: keeps the neuron from being activated too easily; written $b$
        • $\sigma(\omega_1a_1+\omega_2a_2+…+\omega_na_n+b)$
    • too many weights and biases!

      • learning: finding the right weights and biases

      The first layer simplifies to $a^{(1)}=\sigma(Wa^{(0)}+b)$ in matrix form (see the sketch after this list).

  • re-understand each neuron as a function: it takes the previous layer's outputs as inputs and outputs a single number.
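A minimal NumPy sketch of that matrix form; the 784 → 16 shapes and the random weights are assumptions chosen to match the handwritten-digit example, not values from the video:

```python
import numpy as np

def sigmoid(z):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """ReLU(a) = max(0, a): inactive below 0."""
    return np.maximum(0.0, z)

def layer_forward(W, a_prev, b, activation=sigmoid):
    """One layer in matrix form: a = activation(W @ a_prev + b)."""
    return activation(W @ a_prev + b)

# Toy shapes: 784 input pixels -> 16 hidden neurons, as in the digit example.
rng = np.random.default_rng(0)
a0 = rng.random(784)                 # input activations in [0, 1]
W1 = rng.standard_normal((16, 784))  # weights (randomly initialized here)
b1 = rng.standard_normal(16)         # biases

a1 = layer_forward(W1, a0, b1)       # a^(1) = sigma(W a^(0) + b)
print(a1.shape)                      # -> (16,)
```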

deep learning

gradient descent

  • cost function → average cost
    • input: weights&biases
    • output: 1 number (the cost)
    • parameters: training examples
  • $C(\omega)$: single input. How to find its minimum?
    • detect the slope and flow downhill: a local minimum (depends on the random starting point)
      • local vs global minimum, the usual story
  • $C(x,y)$: two inputs
    • $\nabla C(x,y)$: the gradient, the direction of steepest increase (see the sketch after this list)
    • backpropagation
  • network learning=minimizing the cost function
    • requires smooth outputs (continuously ranging activations)
  • how to interpret the ~13,000-dimensional gradient of the cost function $(\nabla C)$? Another way to think about it:
    • magnitude: which dimensions matter most
    • sign: which direction to move along that dimension
    • (each dimension corresponds to one weight or bias)
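To make "follow the slope downhill" concrete, here is a tiny gradient-descent loop on a made-up two-input cost $C(x,y)$; the function, step size, and starting point are my own illustrative choices:

```python
import numpy as np

def C(v):
    """A toy two-input cost surface with its minimum at (3, -1)."""
    x, y = v
    return (x - 3) ** 2 + 0.5 * (y + 1) ** 2

def grad_C(v):
    """Gradient of C: points in the direction of steepest increase."""
    x, y = v
    return np.array([2 * (x - 3), (y + 1)])

v = np.array([10.0, 10.0])   # arbitrary starting point
lr = 0.1                     # step size
for _ in range(200):
    v -= lr * grad_C(v)      # step *against* the gradient: steepest decrease
print(v, C(v))               # ends up near the minimum at (3, -1)
```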

analyze this network

loose patterns, not intelligence…

it is not picking up on edges and loops

learn more

To learn more, I highly recommend Michael Nielsen's book

http://neuralnetworksanddeeplearning....

The book walks through the code behind the examples in these videos, which you can find here:

https://github.com/mnielsen/neural-ne...

The MNIST database:

http://yann.lecun.com/exdb/mnist/

Also check out Chris Olah's blog:

http://colah.github.io/

His post on neural networks and topology is particularly beautiful, but honestly everything there is great. If you like it, you will _love_ the publications at Distill:

https://distill.pub/

research corner

Two papers:

A Closer Look at Memorization in Deep Networks

The Loss Surfaces of Multilayer Networks

backpropagation

intuitive walkthrough

without notation!

  • bad network, silly outcome
  • only adjust weights & biases

For example, take an image of the digit "2".

  1. you want to increase the activation for "2" from 0.2 to 1, where the value $=\sigma(\omega_1a_1+\omega_2a_2+…+\omega_na_n+b)$

    there are 3 ways:

    • increase b
    • increase $\omega_i$
      • in proportion to $a_i$ (increasing the $\omega_i$ attached to a larger $a_i$ gives more effect per unit change)
      • “Neurons that fire together wire together”
    • change $a_i$
      • in proportion to $\omega_i$ (increase the $a_i$ behind positive weights, decrease those behind negative weights)
  2. and also decrease the activations of the other output neurons.

So we add up all the last-layer neurons' desired effects and get the wanted nudges for the second-to-last layer.

THAT'S THE FIRST STEP OF PROPAGATING BACKWARDS!

Sum the nudges at each layer, then average each training example's total nudges over all examples: the result is (a multiple of) the vector $\nabla C$. (Not an exact quantification.)
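A sketch of that bookkeeping for one training example; the variable names and the simple "desired change = target minus output" rule are assumptions of mine, in keeping with the intuitive (non-exact) spirit of this walkthrough:

```python
import numpy as np

def desired_nudges(W, a_prev, a_out, y):
    """W: (n_out, n_prev) weights into the last layer; a_prev: previous-layer
    activations; a_out: last-layer activations; y: target outputs."""
    delta = y - a_out             # each output neuron's desired change
    # Nudge each weight in proportion to the activation feeding through it
    # ("neurons that fire together wire together").
    dW = np.outer(delta, a_prev)
    db = delta                    # desired bias nudges
    # Wanted nudges for the previous layer: add up every output neuron's
    # request, each in proportion to the weight connecting the two neurons.
    da_prev = W.T @ delta
    return dW, db, da_prev
```

Repeating the same step with `da_prev` as the new set of desired changes is what propagates the nudges backwards through earlier layers.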

How to cut corners?

Split the training examples into mini-batches, compute $\nabla C$ on each mini-batch, then combine the results.

This is called stochastic gradient descent.
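A minimal mini-batch SGD loop; `compute_grad` is a hypothetical function returning the gradient of the average cost over one mini-batch, and `params` is assumed to be a NumPy-style array:

```python
import random

def sgd(params, training_data, compute_grad, lr=0.1, batch_size=32, epochs=5):
    """Stochastic gradient descent: cheap, noisy steps from small random batches."""
    data = list(training_data)
    for _ in range(epochs):
        random.shuffle(data)                    # fresh random mini-batches every epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            grad = compute_grad(params, batch)  # approximates nabla C from this batch only
            params = params - lr * grad         # one downhill step
    return params
```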

derivatives in computational graphs

dive a little bit into the calculus!

Start simple, then build up!

  • one neuron per layer
    • $C(\omega_1, b_1,\omega_2,b_2,\omega_3,b_3)$
    • neuron indexing:
      • the superscript denotes the layer, e.g. $a^{(L)}$ is the (single) neuron in the last layer
    • desired final-layer activation (the target output): denoted $y$ (0 or 1)
    • Cost $C_0=(a^{(L)}-y)^2$
    • $a^{(L)}=\sigma(\omega^{(L)}a^{(L-1)}+b^{(L)})$
      • let $z^{(L)}=\omega^{(L)}a^{(L-1)}+b^{(L)}$
      • then $a^{(L)}=\sigma(z^{(L)})$
    • how sensitive is $C_0$ to a tiny change in $\omega^{(L)}$?
      • $\frac{\partial C_0}{\partial \omega^{(L)}}=\frac{\partial z^{(L)}}{\partial \omega^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial C_0}{\partial a^{(L)}}$
        • $C_0=(a^{(L)}-y)^2$
        • $a^{(L)}=\sigma(z^{(L)})$
          • $\frac{\partial a^{(L)}}{\partial z^{(L)}}=\sigma'(z^{(L)})$
        • $z^{(L)}=\omega^{(L)}a^{(L-1)}+b^{(L)}$
          • $\frac{\partial z^{(L)}}{\partial \omega^{(L)}}=a^{(L-1)}$
    • the same reasoning applies to $b^{(L)}$, to each of a layer's multiple $\omega^{(L)}$, and to the $\omega^{(i)},b^{(i)}$ of every layer (see the sketch after this list)
    • more than one neuron per layer: add subscripts
      • $a^{(L-1)}_k$ & $a^{(L)}_j$: $\omega^{(L)}_{jk}$
      • $C_0=\sum_{j=0}^{n_L-1}(a_j^{(L)}-y_j)^2$
        • $\frac{\partial C_0}{\partial a_k^{(L-1)}}$ requires summing over $j$, since $a_k^{(L-1)}$ influences the cost through every layer-$L$ neuron
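A numeric check of the chain rule above for the one-neuron-per-layer case; the weight, bias, activation, and target values are made-up toy numbers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Toy values for the last link of the chain.
w_L, b_L = 0.7, -0.3        # omega^(L), b^(L)
a_prev = 0.6                # a^(L-1)
y = 1.0                     # desired output

z_L = w_L * a_prev + b_L    # z^(L) = omega^(L) a^(L-1) + b^(L)
a_L = sigmoid(z_L)          # a^(L) = sigma(z^(L))
C0 = (a_L - y) ** 2         # cost for this single example

# dC0/dw^(L) = dz^(L)/dw^(L) * da^(L)/dz^(L) * dC0/da^(L)
dC0_da = 2 * (a_L - y)      # dC0/da^(L)
da_dz = sigmoid_prime(z_L)  # da^(L)/dz^(L) = sigma'(z^(L))
dz_dw = a_prev              # dz^(L)/dw^(L) = a^(L-1)
dz_db = 1.0                 # dz^(L)/db^(L) = 1

dC0_dw = dz_dw * da_dz * dC0_da   # sensitivity of the cost to this weight
dC0_db = dz_db * da_dz * dC0_da   # same chain, first factor swapped
print(dC0_dw, dC0_db)
```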