3B1B's videos are addictive! Incredibly well made.

But what is a Neural Network?

website: https://www.3blue1brown.com/lessons/neural-networks

video: https://www.youtube.com/watch?v=aircAruvnKk

  • Plain vanilla: the multilayer perceptron

  • classic example: recognize handwritten digits

  • neurons: things that hold a number (the activation)

  • layers:

    • input layer; output layer
    • hidden layers
    • why 2 hidden layers, each with 16 neurons?
  • core question: how do the activations of one layer influence the activations of the next layer?

    • e.g. recognize the loop on the top: 8,9
      • how to recognize these edges, loops, and patterns? break them down into little pieces?
  • edge detection example

    • what parameters?

      • weights
      • calculate the weighted sum of the activations from the input layer
        • e.g. assign positive weights to the pixels along the edge, negative weights to the surrounding pixels, and zero everywhere else.
    • requirement: all activations $\in [0,1]$

      • functions:
        • sigmoid, written $\sigma(\cdot)$
        • but sigmoid is rarely used nowadays; ReLU and the like are more common (in deep NNs)
          • $\mathrm{ReLU}(a)=\max(0,a)$ (inactive below 0)
      • bias: written $b$; we don't want the neuron to be activated too easily
        • $\sigma(\omega_1 a_1+\omega_2 a_2+\dots+\omega_n a_n+b)$
    • too many weights and biases!

      • learning: finding the right weights and biases

      The first layer simplifies to $a^{(1)}=\sigma(Wa^{(0)}+b)$ (see the layer sketch after this list).

  • re-understand each neuron as a function: it takes the previous layer's outputs and produces a single number.
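
A minimal sketch of that one-line layer formula $a^{(1)}=\sigma(Wa^{(0)}+b)$, assuming NumPy and the video's shapes (784 input pixels, 16 hidden neurons); the random $W$ and $b$ here only stand in for learned parameters:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """ReLU: inactive (zero) below 0, identity above."""
    return np.maximum(0.0, z)

# Shapes follow the video's network: 784 input pixels, 16 hidden neurons.
rng = np.random.default_rng(0)
a0 = rng.random(784)                     # input-layer activations, each in [0, 1]
W = rng.normal(size=(16, 784)) * 0.05    # one row of weights per hidden neuron (toy values)
b = rng.normal(size=16)                  # one bias per hidden neuron (toy values)

# a^(1) = sigma(W a^(0) + b): the whole layer as one matrix expression.
a1 = sigmoid(W @ a0 + b)
h1 = relu(W @ a0 + b)                    # the ReLU variant mentioned above
print(a1.shape, a1.min(), a1.max())      # (16,) activations, squashed into (0, 1)
```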

deep learning

gradient descent

  • cost function → average cost over all the training examples

    • input: weights&biases
    • output : 1 number (the cost)
    • parameters: training examples
  • $C(\omega)$: a single input. How to find its minimum?

    • detect the slope and roll downhill: ends in a local minimum (depends on the random start)
      • local vs. global: the age-old story
  • $C(x,y)$: two inputs

    • $\nabla C(x,y)$: the gradient, the direction of steepest increase
    • backpropagation
  • network learning=minimizing the cost function

    • smooth output (continuously ranging activations)
  • how does the gradient of the cost function, $\nabla C$, with its ~13,000 dimensions, act on the network? Another way to think about it:

    • magnitude: which dimensions matter most
    • sign: which direction to move along that dimension
    • (each dimension corresponds to one weight or bias; see the descent sketch below)
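
A toy gradient-descent sketch (assuming NumPy); the two-input cost $C(x,y)=(x^2-1)^2+y^2$ is made up here just to show the downhill steps and how the local minimum you reach depends on the starting point:

```python
import numpy as np

def C(v):
    """Toy 2-input cost with two local minima, at (+1, 0) and (-1, 0)."""
    x, y = v
    return (x**2 - 1)**2 + y**2

def grad_C(v):
    """Gradient of C: the direction of steepest increase."""
    x, y = v
    return np.array([4 * x * (x**2 - 1), 2 * y])

def gradient_descent(start, lr=0.05, steps=200):
    v = np.array(start, dtype=float)
    for _ in range(steps):
        v -= lr * grad_C(v)   # step downhill: opposite to the gradient
    return v

# Which minimum you end up in depends on the (random) starting point.
print(gradient_descent([ 0.8,  1.0]))   # -> close to (+1, 0)
print(gradient_descent([-0.3, -2.0]))   # -> close to (-1, 0)
```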

analyze this network

loose patterns, not intelligence…

it is not actually picking up edges and loops

learn more

To learn more, I strongly recommend Michael Nielsen's book:

http://neuralnetworksanddeeplearning…

The book walks step by step through the code behind the examples in these videos, which you can find here:

https://github.com/mnielsen/neural-ne…

The MNIST database:

http://yann.lecun.com/exdb/mnist/

Also check out Chris Olah's blog:

http://colah.github.io/

His posts on neural networks and topology are particularly beautiful, but honestly everything there is great. And if you like that, you will _love_ the publications at distill:

https://distill.pub/

research corner

Two papers:

A Closer Look at Memorization in Deep Networks

The Loss Surfaces of Multilayer Networks

backpropagation

intuitive walkthrough

without notation!

  • bad network, silly outcome

  • we can only adjust the weights & biases

For example, take an image of the digit "2".

  1. you want to increase the activation of the "2" neuron from 0.2 toward 1, where the value $=\sigma(\omega_1 a_1+\omega_2 a_2+\dots+\omega_n a_n+b)$

    there are 3 ways:

    • increase b

    • increase $\omega_i$

      • in proportion to $a_i$ (increasing the $\omega_i$ attached to larger $a_i$ gives more bang for the buck)
      • “Neurons that fire together wire together”
    • change $a_i$

      • in proportion to $\omega_i$ (increase activations with positive weights, decrease those with negative weights)
  2. and also decrease the activations of the other output neurons (see the nudge sketch below).
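
A tiny sketch of those three knobs for a single output neuron, assuming NumPy and arbitrary toy values; the partial derivatives show why the weight nudges scale with $a_i$ and the activation nudges scale with $\omega_i$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

rng = np.random.default_rng(1)
a_prev = rng.random(5)       # toy activations feeding the "2" neuron
w = rng.normal(size=5)       # its incoming weights (toy values)
b = -1.0                     # its bias
z = w @ a_prev + b
a = sigmoid(z)               # current (too small) activation of "2"

# How much each knob moves the activation a = sigma(w . a_prev + b):
d_a_d_b = sigmoid_prime(z)              # 1) nudge the bias
d_a_d_w = a_prev * sigmoid_prime(z)     # 2) proportional to a_i: big a_i -> big payoff
d_a_d_aprev = w * sigmoid_prime(z)      # 3) proportional to w_i: raise a_i where w_i > 0
print(d_a_d_b, d_a_d_w, d_a_d_aprev)
```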

So we add up all the last-layer neurons' desired effects, and get the wanted nudges for the second-to-last layer.

THAT'S THE FIRST STEP OF PROPAGATING BACKWARDS!

Sum the nudges layer by layer, then average the totals over every training example; the result is (a multiple of) the vector $\nabla C$. (Not an exact quantification.)

How to cut corners?

Split the samples into mini-batches, compute $\nabla C$ on each mini-batch, then combine the results.

This is called stochastic gradient descent.
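
A generic mini-batch SGD loop, sketched with NumPy on a made-up one-parameter fitting problem rather than the video's network; `grad_fn`, the learning rate, and the batch size are illustrative placeholders:

```python
import numpy as np

def sgd(params, grad_fn, data, lr=0.1, batch_size=32, epochs=30, seed=0):
    """Mini-batch stochastic gradient descent.

    grad_fn(params, batch) returns the gradient of the average cost over
    that mini-batch -- a cheap, noisy estimate of the full nabla C.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):
        order = rng.permutation(n)                         # shuffle the training examples
        for start in range(0, n, batch_size):
            batch = data[order[start:start + batch_size]]
            params = params - lr * grad_fn(params, batch)  # one noisy downhill step
    return params

# Toy usage: fit a single weight w so that w*x approximates y = 3x.
x = np.linspace(0.0, 1.0, 200)
data = np.column_stack([x, 3.0 * x])   # columns: input, target
grad = lambda w, batch: np.mean(2 * (w * batch[:, 0] - batch[:, 1]) * batch[:, 0])
print(sgd(1.0, grad, data))            # converges near 3.0
```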

derivatives in computational graphs

dive a little bit into the calculus!

Start simple, then go deeper!

  • one neuron per layer

    • $C(\omega_1, b_1, \omega_2, b_2, \omega_3, b_3)$
    • neuron indexing:
      • superscripts denote the layer, e.g. $a^{(L)}$ is the (single) neuron in the last layer
    • desired output (the final-layer activation we want): written $y$ (0 or 1)
    • Cost $C_0=(a^{(L)}-y)^2$
    • $a^{(L)}=\sigma(\omega^{(L)}a^{(L-1)}+b^{(L)})$
      • $z^{(L)}=\omega^{(L)}a^{(L-1)}+b^{(L)}$
      • $a^{(L)}=\sigma(z^{(L)})$
    • how sensitive is $C_0$ to a tiny change in $\omega^{(L)}$?
      • $\frac{\partial C_0}{\partial \omega^{(L)}}=\frac{\partial z^{(L)}}{\partial \omega^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial C_0}{\partial a^{(L)}}$ (see the chain-rule sketch after this list)
        • $C_0=(a^{(L)}-y)^2$
          • $\frac{\partial C_0}{\partial a^{(L)}}=2(a^{(L)}-y)$
        • $a^{(L)}=\sigma(z^{(L)})$
          • $\frac{\partial a^{(L)}}{\partial z^{(L)}}=\sigma'(z^{(L)})$
        • $z^{(L)}=\omega^{(L)}a^{(L-1)}+b^{(L)}$
          • $\frac{\partial z^{(L)}}{\partial \omega^{(L)}}=a^{(L-1)}$
    • the same goes for $b^{(L)}$, for the several $\omega^{(L)}$ within a layer, and for the $\omega^{(i)}, b^{(i)}$ of every layer
    • more than one neuron per layer: add subscripts
      • $a^{(L-1)}_k$ & $a^{(L)}_j$; $\omega^{(L)}_{jk}$
      • $C_0=\sum_{j=0}^{n_L-1}(a_j^{(L)}-y_j)^2$
        • $\frac{\partial C_0}{\partial a_k^{(L-1)}}$ needs a sum over $j$, since $a_k^{(L-1)}$ feeds into every neuron of layer $L$
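
A numerical sketch of this chain rule for the one-neuron-per-layer case, with arbitrary toy values for $a^{(L-1)}$, $\omega^{(L)}$, $b^{(L)}$, $y$; a finite-difference check confirms the product of the three factors:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# One neuron per layer, looking at the last layer only (toy values).
a_prev, w, b, y = 0.7, 1.5, -0.4, 1.0    # a^(L-1), w^(L), b^(L), desired output y
z = w * a_prev + b                       # z^(L)
a = sigmoid(z)                           # a^(L)
C0 = (a - y)**2

# Chain rule: dC0/dw = dz/dw * da/dz * dC0/da
dC0_dw     = a_prev * sigmoid_prime(z) * 2 * (a - y)
dC0_db     = 1.0    * sigmoid_prime(z) * 2 * (a - y)
dC0_daprev = w      * sigmoid_prime(z) * 2 * (a - y)

# Numerical check of dC0/dw with a finite difference.
eps = 1e-6
C0_eps = (sigmoid((w + eps) * a_prev + b) - y)**2
print(dC0_dw, (C0_eps - C0) / eps)       # the two numbers should agree
```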
