3B1B's videos are addictive! Superbly made.

But what is a Neural Network?

website: https://www.3blue1brown.com/lessons/neural-networks

video: https://www.youtube.com/watch?v=aircAruvnKk

  • Plain vanilla form: the multilayer perceptron (MLP)

  • classic example: recognize handwritten digits

  • neurons: a thing that holds a number (its activation)

  • layers:

    • input layer; output layer
    • hidden layers
    • why 2 hidden layers, each with 16 neurons?
  • core question: how do the activations of one layer determine the activations of the next layer?

    • e.g. recognize the loop on the top: 8,9
      • how to recognize these edges, loops, and patterns? break them down into little pieces?
  • edge detection example

    • what parameters?

      • weights
      • calculate a weighted sum of the activations from the input layer
        • e.g. assign positive weights to the pixels where the edge lies, negative weights to the surrounding pixels, and zero everywhere else.
    • requirement: all activations $\in [0,1]$

      • functions:
        • sigmoid, written $\sigma()$
        • sigmoid is rarely used anymore, though; ReLU and its variants are more common (in deep NNs)
          • ReLU(a)=max(0,a) (inactive below 0)
      • bias: keeps the neuron from being activated too easily; written $b$
        • $\sigma(\omega_1a_1+\omega_2a_2+…+\omega_na_n+b)$
    • too many weights and biases!

      • learning: finding the right weights and biases

      The first layer simplifies to $a^{(1)}=\sigma(Wa^{(0)}+b)$ in matrix form (see the sketch after this list).

  • re-understand each neuron as a function: it takes the previous layer's outputs as inputs and outputs a single number.
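A minimal NumPy sketch of that matrix form; the 784 → 16 shapes and the random weights are assumptions chosen to match the handwritten-digit example, not values from the video:

```python
import numpy as np

def sigmoid(z):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """ReLU(a) = max(0, a): inactive below 0."""
    return np.maximum(0.0, z)

def layer_forward(W, a_prev, b, activation=sigmoid):
    """One layer in matrix form: a = activation(W @ a_prev + b)."""
    return activation(W @ a_prev + b)

# Toy shapes: 784 input pixels -> 16 hidden neurons, as in the digit example.
rng = np.random.default_rng(0)
a0 = rng.random(784)                 # input activations in [0, 1]
W1 = rng.standard_normal((16, 784))  # weights (randomly initialized here)
b1 = rng.standard_normal(16)         # biases

a1 = layer_forward(W1, a0, b1)       # a^(1) = sigma(W a^(0) + b)
print(a1.shape)                      # -> (16,)
```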

deep learning

gradient descent

  • cost function → average cost
    • input: weights&biases
    • output: 1 number (the cost)
    • parameters: training examples
  • $C(\omega)$: single input. How to find its minimum?
    • detect the slope and flow downhill: a local minimum (depends on the random starting point)
      • local vs global minimum, the usual story
  • $C(x,y)$: two inputs
    • $\nabla C(x,y)$: the gradient, the direction of steepest increase (see the sketch after this list)
    • backpropagation
  • network learning=minimizing the cost function
    • requires smooth outputs (continuously ranging activations)
  • how to interpret the ~13,000-dimensional gradient of the cost function $(\nabla C)$? Another way to think about it:
    • magnitude: which dimensions matter most
    • sign: which direction to move along that dimension
    • (each dimension corresponds to one weight or bias)
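To make "follow the slope downhill" concrete, here is a tiny gradient-descent loop on a made-up two-input cost $C(x,y)$; the function, step size, and starting point are my own illustrative choices:

```python
import numpy as np

def C(v):
    """A toy two-input cost surface with its minimum at (3, -1)."""
    x, y = v
    return (x - 3) ** 2 + 0.5 * (y + 1) ** 2

def grad_C(v):
    """Gradient of C: points in the direction of steepest increase."""
    x, y = v
    return np.array([2 * (x - 3), (y + 1)])

v = np.array([10.0, 10.0])   # arbitrary starting point
lr = 0.1                     # step size
for _ in range(200):
    v -= lr * grad_C(v)      # step *against* the gradient: steepest decrease
print(v, C(v))               # ends up near the minimum at (3, -1)
```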

analyze this network

loose patterns, not intelligence…

it is not picking up on edges and loops

learn more

To learn more, I highly recommend Michael Nielsen's book

http://neuralnetworksanddeeplearning....

The book walks through the code behind the examples in these videos, which you can find here:

https://github.com/mnielsen/neural-ne...

The MNIST database:

http://yann.lecun.com/exdb/mnist/

Also check out Chris Olah's blog:

http://colah.github.io/

His post on neural networks and topology is particularly beautiful, but honestly everything there is great. If you like it, you will _love_ the publications at Distill:

https://distill.pub/

research corner

Two papers:

A Closer Look at Memorization in Deep Networks

The Loss Surfaces of Multilayer Networks

backpropagation

intuitive walkthrough

without notation!

  • bad network, silly outcome
  • only adjust weights & biases

For example, take an image of the digit "2".

  1. you want to increase the activation for "2" from 0.2 to 1, where the value $=\sigma(\omega_1a_1+\omega_2a_2+…+\omega_na_n+b)$

    there are 3 ways:

    • increase b
    • increase $\omega_i$
      • in proportion to $a_i$ (increasing the $\omega_i$ attached to a larger $a_i$ gives more effect per unit change)
      • “Neurons that fire together wire together”
    • change $a_i$
      • in proportion to $\omega_i$ (increase the $a_i$ behind positive weights, decrease those behind negative weights)
  2. and also decrease the activations of the other output neurons.

So we add up all the last-layer neurons' desired effects and get the wanted nudges for the second-to-last layer.

THAT'S THE FIRST STEP OF PROPAGATING BACKWARDS!

Sum the nudges at each layer, then average each training example's total nudges over all examples: the result is (a multiple of) the vector $\nabla C$. (Not an exact quantification.)
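A sketch of that bookkeeping for one training example; the variable names and the simple "desired change = target minus output" rule are assumptions of mine, in keeping with the intuitive (non-exact) spirit of this walkthrough:

```python
import numpy as np

def desired_nudges(W, a_prev, a_out, y):
    """W: (n_out, n_prev) weights into the last layer; a_prev: previous-layer
    activations; a_out: last-layer activations; y: target outputs."""
    delta = y - a_out             # each output neuron's desired change
    # Nudge each weight in proportion to the activation feeding through it
    # ("neurons that fire together wire together").
    dW = np.outer(delta, a_prev)
    db = delta                    # desired bias nudges
    # Wanted nudges for the previous layer: add up every output neuron's
    # request, each in proportion to the weight connecting the two neurons.
    da_prev = W.T @ delta
    return dW, db, da_prev
```

Repeating the same step with `da_prev` as the new set of desired changes is what propagates the nudges backwards through earlier layers.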

How to cut corners?

Split the training examples into mini-batches, compute $\nabla C$ on each mini-batch, then combine the results.

This is called stochastic gradient descent.
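A minimal mini-batch SGD loop; `compute_grad` is a hypothetical function returning the gradient of the average cost over one mini-batch, and `params` is assumed to be a NumPy-style array:

```python
import random

def sgd(params, training_data, compute_grad, lr=0.1, batch_size=32, epochs=5):
    """Stochastic gradient descent: cheap, noisy steps from small random batches."""
    data = list(training_data)
    for _ in range(epochs):
        random.shuffle(data)                    # fresh random mini-batches every epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            grad = compute_grad(params, batch)  # approximates nabla C from this batch only
            params = params - lr * grad         # one downhill step
    return params
```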

derivatives in computational graphs

dive a little bit into the calculus!

Start simple, then build up!

  • one neuron per layer
    • $C(\omega_1, b_1,\omega_2,b_2,\omega_3,b_3)$
    • neuron indexing:
      • the superscript denotes the layer, e.g. $a^{(L)}$ is the (single) neuron in the last layer
    • desired final-layer activation (the target output): denoted $y$ (0 or 1)
    • Cost $C_0=(a^{(L)}-y)^2$
    • $a^{(L)}=\sigma(\omega^{(L)}a^{(L-1)}+b^{(L)})$
      • let $z^{(L)}=\omega^{(L)}a^{(L-1)}+b^{(L)}$
      • then $a^{(L)}=\sigma(z^{(L)})$
    • how sensitive is $C_0$ to a tiny change in $\omega^{(L)}$?
      • $\frac{\partial C_0}{\partial \omega^{(L)}}=\frac{\partial z^{(L)}}{\partial \omega^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial C_0}{\partial a^{(L)}}$
        • $C_0=(a^{(L)}-y)^2$
        • $a^{(L)}=\sigma(z^{(L)})$
          • $\frac{\partial a^{(L)}}{\partial z^{(L)}}=\sigma'(z^{(L)})$
        • $z^{(L)}=\omega^{(L)}a^{(L-1)}+b^{(L)}$
          • $\frac{\partial z^{(L)}}{\partial \omega^{(L)}}=a^{(L-1)}$
    • the same reasoning applies to $b^{(L)}$, to each of a layer's multiple $\omega^{(L)}$, and to the $\omega^{(i)},b^{(i)}$ of every layer (see the sketch after this list)
    • more than one neuron per layer: add subscripts
      • $a^{(L-1)}_k$ & $a^{(L)}_j$: $\omega^{(L)}_{jk}$
      • $C_0=\sum_{j=0}^{n_L-1}(a_j^{(L)}-y_j)^2$
        • $\frac{\partial C_0}{\partial a_k^{(L-1)}}$ requires summing over $j$, since $a_k^{(L-1)}$ influences the cost through every layer-$L$ neuron
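A numeric check of the chain rule above for the one-neuron-per-layer case; the weight, bias, activation, and target values are made-up toy numbers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Toy values for the last link of the chain.
w_L, b_L = 0.7, -0.3        # omega^(L), b^(L)
a_prev = 0.6                # a^(L-1)
y = 1.0                     # desired output

z_L = w_L * a_prev + b_L    # z^(L) = omega^(L) a^(L-1) + b^(L)
a_L = sigmoid(z_L)          # a^(L) = sigma(z^(L))
C0 = (a_L - y) ** 2         # cost for this single example

# dC0/dw^(L) = dz^(L)/dw^(L) * da^(L)/dz^(L) * dC0/da^(L)
dC0_da = 2 * (a_L - y)      # dC0/da^(L)
da_dz = sigmoid_prime(z_L)  # da^(L)/dz^(L) = sigma'(z^(L))
dz_dw = a_prev              # dz^(L)/dw^(L) = a^(L-1)
dz_db = 1.0                 # dz^(L)/db^(L) = 1

dC0_dw = dz_dw * da_dz * dC0_da   # sensitivity of the cost to this weight
dC0_db = dz_db * da_dz * dC0_da   # same chain, first factor swapped
print(dC0_dw, dC0_db)
```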