1. Forward Propagation

Derivation of the Backpropagation Algorithm (Very Detailed)

Assume $X$ is an $N\times m$ matrix, where $N$ is the number of samples (the batch size) and $m$ is the feature dimension:

$h_1$, $Z_1$ have dimension $m_1 \rightarrow W_1$ is an $m\times m_1$ matrix, $b_1 \in \mathbb{R}^{m_1}$;

$h_2$, $Z_2$ have dimension $m_2 \rightarrow W_2$ is an $m_1\times m_2$ matrix, $b_2 \in \mathbb{R}^{m_2}$;

$\vdots$

$h_L$, $Z_L$ have dimension $m_L \rightarrow W_L$ is an $m_{L-1}\times m_L$ matrix, $b_L \in \mathbb{R}^{m_L}$.

Forward pass:

\begin{array}{l}{h_{1}=X W_{1}+\tilde{b}_{1}, \quad Z_{1}=f_{1}\left(h_{1}\right)} \\ {h_{2}=Z_{1} W_{2}+\tilde{b}_{2}, \quad Z_{2}=f_{2}\left(h_{2}\right)} \\ {\vdots} \\ {h_{L}=Z_{L-1} W_{L}+\tilde{b}_{L}, \quad Z_{L}=f_{L}\left(h_{L}\right)} \\ {\text {out}=Z_{L} W_{L+1}+\tilde{b}_{L+1}}\end{array}

where each $\tilde{b}_{l}$ is $b_{l}^{T}$ tiled along the row direction into $N$ rows.
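The forward pass above can be sketched in a few lines of NumPy; this is a minimal illustration with made-up layer sizes, not the author's code. Note that NumPy's broadcasting of a row vector over the batch plays exactly the role of the tiled matrix $\tilde{b}_l$:

```python
import numpy as np

def forward(X, weights, biases, f):
    """Forward pass h_l = Z_{l-1} W_l + b_l, Z_l = f(h_l); the final
    layer is linear, matching out = Z_L W_{L+1} + b_{L+1} above."""
    Z = X
    for W, b in zip(weights[:-1], biases[:-1]):
        Z = f(Z @ W + b)    # the row vector b broadcasts over the N rows,
                            # playing the role of the tiled matrix b~_l
    return Z @ weights[-1] + biases[-1]

# hypothetical sizes: N=4 samples, m=3 features, hidden m_1=5, output n=2
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
Ws = [rng.normal(size=(3, 5)), rng.normal(size=(5, 2))]
bs = [np.zeros(5), np.zeros(2)]
out = forward(X, Ws, bs, np.tanh)
print(out.shape)   # (4, 2)
```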

Assume the output is $n$-dimensional; then $out$ is an $N\times n$ matrix. From the MSE or CE criterion we can compute $\frac{\partial J}{\partial out}$. For regression and classification problems, $\frac{\partial J}{\partial out}$ is obtained as follows:


For regression, the loss is computed directly on $out$, using MSE. Loss:

J=\frac{1}{2N}\sum_{i=1}^{N}||y_i-\tilde{y}_i||^2

Since only the $i$-th term of the sum depends on $y_i$:

\frac{\partial J}{\partial y_i}=\frac{1}{2N}(y_i-\tilde{y}_i)\times 2=\frac{1}{N}(y_i-\tilde{y}_i)
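This gradient is easy to verify numerically; a quick sketch with made-up shapes ($N=5$ samples, 3 outputs), using a central difference on one entry:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
y = rng.normal(size=(N, 3))        # network outputs y_i (rows)
y_true = rng.normal(size=(N, 3))   # targets

def J(y):
    # J = 1/(2N) * sum_i ||y_i - y~_i||^2
    return 0.5 / N * np.sum((y - y_true) ** 2)

analytic = (y - y_true) / N        # the gradient derived above

# central-difference numerical check on a single entry
eps = 1e-6
e = np.zeros_like(y); e[2, 1] = eps
numeric = (J(y + e) - J(y - e)) / (2 * eps)
print(np.isclose(numeric, analytic[2, 1]))   # True
```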

For classification, $out$ is followed by a softmax, and CE (cross entropy) is used to compute the loss. With 0-based indices (to match the derivation below):

S_k=\frac{e^{y_k}}{\sum_{i=0}^{n-1}e^{y_i}}

For a single sample, the network output $S=(s_0,s_1,...,s_{n-1})$ is a probability distribution, while the sample's label $\tilde{S}$ is typically $(0,0,...,1,0,...,0)$, which can also be viewed as a (hard) probability distribution. Cross entropy can be seen as the KL divergence between $\tilde{S}$ and $S$:

D(\tilde{S}||S)=\sum\tilde{S}\log\frac{\tilde{S}}{S}

Assume $\tilde{S}=(0,0,...,1,0,...,0)$, where the 1 is the $k$-th element (indices starting from 0), and let $S=(s_0,s_1,...,s_k,...,s_{n-1})$. Loss:

\begin{aligned} J=D(\tilde{S}||S)&=1\times \log\frac{1}{s_k}\\&=-\log s_k \quad \text{(the CE loss; minimizing it maximizes the target-class probability)}\\ &=-\log\frac{e^{y_k}}{\sum_{i=0}^{n-1}e^{y_i}} \end{aligned}

\begin{aligned} \frac{\partial J}{\partial y_m}=\frac{\partial}{\partial y_m}\Big(\log \sum_{i=0}^{n-1}e^{y_i}-y_k\Big)=\frac{e^{y_m}}{\sum_{i=0}^{n-1}e^{y_i}}-\delta(m=k)=s_m-\delta(m=k) \end{aligned}

Written in vector form: $\frac{\partial J}{\partial y}=S-\tilde{S}$.
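The clean result $\frac{\partial J}{\partial y}=S-\tilde{S}$ can also be checked numerically; a minimal sketch with made-up logits (6 classes, true class $k=4$):

```python
import numpy as np

def softmax(y):
    z = y - y.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
y = rng.normal(size=6)              # logits for one sample
k = 4                               # index of the true class
S = softmax(y)
S_tilde = np.zeros(6); S_tilde[k] = 1.0

analytic = S - S_tilde              # dJ/dy = S - S~ from the derivation

def J(y):
    return -np.log(softmax(y)[k])   # cross-entropy loss -log s_k

# central-difference check, one coordinate at a time
eps = 1e-6
num = np.array([(J(y + eps * np.eye(6)[m]) - J(y - eps * np.eye(6)[m])) / (2 * eps)
                for m in range(6)])
print(np.allclose(num, analytic, atol=1e-6))   # True
```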

KL divergence (relative entropy): short for Kullback-Leibler Divergence, also called relative entropy. It measures the difference between two probability distributions over the same event space. Its physical meaning: over the same event space, it is the average number of extra bits per basic event (symbol) needed to encode events drawn from distribution P(x) when using a code optimized for distribution Q(x). Writing the KL divergence as $D(P||Q)$, the formula is:

D(P||Q)=\sum_{x\in X}P(x)\log\frac{P(x)}{Q(x)}

When the two distributions are identical, i.e. $P(X)=Q(X)$, the relative entropy is 0.
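A small sketch of this definition (distributions and the `kl` helper are my own illustration), showing that identical distributions give 0 and that the divergence is positive and asymmetric:

```python
import numpy as np

def kl(P, Q):
    # D(P||Q) = sum_x P(x) log(P(x)/Q(x)), with the convention 0*log(0/q) = 0
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

P = np.array([0.5, 0.5])
Q = np.array([0.9, 0.1])
print(kl(P, P))                            # 0.0 -- identical distributions
print(kl(P, Q) > 0, kl(P, Q) != kl(Q, P))  # positive, and not symmetric
```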

2. Backpropagation

Recall that $\text{out}=Z_{L} W_{L+1}+\tilde{b}_{L+1}$. To spell out the backpropagation algorithm in detail, assume $Z_L$ is a $2\times 3$ matrix and $W_{L+1}$ is a $3\times 2$ matrix:

\begin{array}{l}{Z_{L}=\left(\begin{array}{ccc}{z_{11}} & {z_{12}} & {z_{13}} \\ {z_{21}} & {z_{22}} & {z_{23}}\end{array}\right)_{2 \times 3}, W_{L+1}=\left(\begin{array}{cc}{w_{11}} & {w_{12}} \\ {w_{21}} & {w_{22}} \\ {w_{31}} & {w_{32}}\end{array}\right)_{3 \times 2} \tilde{b}_{L+1}=\left(\begin{array}{cc}{b_{1}} & {b_{2}} \\ {b_{1}} & {b_{2}}\end{array}\right)_{2 \times 2}, \text { out }=\left(\begin{array}{cc}{o_{11}} & {o_{12}} \\ {o_{21}} & {o_{22}}\end{array}\right)} \\ \Rightarrow  {Z_{L}W_{L+1}+\tilde{b}_{L+1}=\left(\begin{array}{cc}{z_{11} w_{11}+z_{12} w_{21}+z_{13} w_{31}+b_1} & {z_{11} w_{12}+z_{12} w_{22}+z_{13} w_{32}+b_2} \\ {z_{21} w_{11}+z_{22} w_{21}+z_{23} w_{31}+b_1} & {z_{21} w_{12}+z_{22} w_{22}+z_{23} w_{32}+b_2}\end{array}\right)=\text{out}.}\end{array}

Therefore,

\begin{array}{l}{o_{11}=z_{11} w_{11}+z_{12} w_{21}+z_{13} w_{31}+b_{1}} \\  {o_{12}=z_{11} w_{12}+z_{12} w_{22}+z_{13} w_{32}+b_{2}} \\  {o_{21}=z_{21} w_{11}+z_{22} w_{21}+z_{23} w_{31}+b_{1}} \\  {o_{22}=z_{21} w_{12}+z_{22} w_{22}+z_{23} w_{32}+b_{2}}\end{array}

1) The derivative of the loss $J$ with respect to $W$:

\begin{aligned} \frac{\partial J}{\partial w_{11}} &=\frac{\partial J}{\partial o_{11}} z_{11}+\frac{\partial J}{\partial o_{21}} z_{21}, \frac{\partial J}{\partial w_{12}}=\frac{\partial J}{\partial o_{12}} z_{11}+\frac{\partial J}{\partial o_{22}} z_{21} \\ \frac{\partial J}{\partial w_{21}} &=\frac{\partial J}{\partial o_{11}} z_{12}+\frac{\partial J}{\partial o_{21}} z_{22}, \frac{\partial J}{\partial w_{22}}=\frac{\partial J}{\partial o_{12}} z_{12}+\frac{\partial J}{\partial o_{22}} z_{22} \\ \frac{\partial J}{\partial w_{31}} &=\frac{\partial J}{\partial o_{11}} z_{13}+\frac{\partial J}{\partial o_{21}} z_{23}, \frac{\partial J}{\partial w_{32}}=\frac{\partial J}{\partial o_{12}} z_{13}+\frac{\partial J}{\partial o_{22}} z_{23} \end{aligned}

\Rightarrow \left(\begin{array}{cc}{\frac{\partial J}{\partial w_{11}}} & {\frac{\partial J}{\partial w_{12}}} \\ {\frac{\partial J}{\partial w_{21}}} & {\frac{\partial J}{\partial w_{22}}} \\ {\frac{\partial J}{\partial w_{31}}} & {\frac{\partial J}{\partial w_{32}}}\end{array}\right)=\left(\begin{array}{cc}{z_{11}} & {z_{21}} \\ {z_{12}} & {z_{22}} \\ {z_{13}} & {z_{23}}\end{array}\right)\left(\begin{array}{cc}{\frac{\partial J}{\partial o_{11}}} & {\frac{\partial J}{\partial o_{12}}} \\ {\frac{\partial J}{\partial o_{21}}} & {\frac{\partial J}{\partial o_{22}}}\end{array}\right)

That is,

\frac{\partial J}{\partial W_{L+1}}=Z_L^T\frac{\partial J}{\partial out}
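A quick sketch checking this matrix form against the scalar formulas, using made-up values for the $2\times 3$ / $3\times 2$ example:

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(2, 3))          # Z_L
G = rng.normal(size=(2, 2))          # stand-in for dJ/dout

dW = Z.T @ G                         # dJ/dW_{L+1} = Z_L^T (dJ/dout)

# e.g. dJ/dw_11 = (dJ/do_11) z_11 + (dJ/do_21) z_21
manual = G[0, 0] * Z[0, 0] + G[1, 0] * Z[1, 0]
print(np.isclose(dW[0, 0], manual))  # True
```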

2) The derivative of the loss with respect to the bias $b$ equals the sum over each column of $\frac{\partial J}{\partial out}$:

\left\{\begin{array}{l}{\frac{\partial J}{\partial b_{1}}=\frac{\partial J}{\partial o_{11}}+\frac{\partial J}{\partial o_{21}}} \\ {\frac{\partial J}{\partial b_{2}}=\frac{\partial J}{\partial o_{12}}+\frac{\partial J}{\partial o_{22}}}\end{array}\right. \Rightarrow\left(\frac{\partial J}{\partial b_{L+1}}\right)^{T}=\left(\frac{\partial J}{\partial b_{1}} \quad \frac{\partial J}{\partial b_{2}}\right)=\left(\frac{\partial J}{\partial o_{11}}+\frac{\partial J}{\partial o_{21}} \quad \frac{\partial J}{\partial o_{12}}+\frac{\partial J}{\partial o_{22}}\right)
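In NumPy the column sum is a single `sum(axis=0)`; a tiny sketch with a made-up matrix standing in for $\frac{\partial J}{\partial out}$:

```python
import numpy as np

G = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # stand-in for dJ/dout
db = G.sum(axis=0)           # dJ/db_1 = 1+3, dJ/db_2 = 2+4
print(db)                    # [4. 6.]
```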

3) The derivative of the loss $J$ with respect to $Z$:

\begin{aligned} \frac{\partial J}{\partial z_{11}} &=\frac{\partial J}{\partial o_{11}} w_{11}+\frac{\partial J}{\partial o_{12}} w_{12} ; \frac{\partial J}{\partial z_{12}}=\frac{\partial J}{\partial o_{11}} w_{21}+\frac{\partial J}{\partial o_{12}} w_{22} ; \frac{\partial J}{\partial z_{13}}=\frac{\partial J}{\partial o_{11}} w_{31}+\frac{\partial J}{\partial o_{12}} w_{32} \\ \frac{\partial J}{\partial z_{21}} &=\frac{\partial J}{\partial o_{21}} w_{11}+\frac{\partial J}{\partial o_{22}} w_{12} ; \frac{\partial J}{\partial z_{22}}=\frac{\partial J}{\partial o_{21}} w_{21}+\frac{\partial J}{\partial o_{22}} w_{22} ; \frac{\partial J}{\partial z_{23}}=\frac{\partial J}{\partial o_{21}} w_{31}+\frac{\partial J}{\partial o_{22}} w_{32} \end{aligned}

That is,

\left(\begin{array}{ccc}{\frac{\partial J}{\partial z_{11}}} & {\frac{\partial J}{\partial z_{12}}} & {\frac{\partial J}{\partial z_{13}}} \\ {\frac{\partial J}{\partial z_{21}}} & {\frac{\partial J}{\partial z_{22}}} & {\frac{\partial J}{\partial z_{23}}}\end{array}\right)=\left(\begin{array}{cc}{\frac{\partial J}{\partial o_{11}}} & {\frac{\partial J}{\partial o_{12}}} \\ {\frac{\partial J}{\partial o_{21}}} & {\frac{\partial J}{\partial o_{22}}}\end{array}\right)\left(\begin{array}{ccc}{w_{11}} & {w_{21}} & {w_{31}} \\ {w_{12}} & {w_{22}} & {w_{32}}\end{array}\right)

 \Rightarrow \frac{\partial J}{\partial Z_{L}}=\frac{\partial J}{\partial out}W_{L+1}^T
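The companion check for this matrix form, again with made-up values:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(3, 2))          # W_{L+1}
G = rng.normal(size=(2, 2))          # stand-in for dJ/dout

dZ = G @ W.T                         # dJ/dZ_L = (dJ/dout) W_{L+1}^T, shape 2x3

# e.g. dJ/dz_11 = (dJ/do_11) w_11 + (dJ/do_12) w_12
manual = G[0, 0] * W[0, 0] + G[0, 1] * W[0, 1]
print(np.isclose(dZ[0, 0], manual))  # True
```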

4) The derivative of the loss $J$ with respect to $h$: recall $Z_L = f_L(h_L)$.

When $f_L$ is the sigmoid:

Z_L=\frac{1}{1+e^{-h_L}}

\begin{array}{l}{\frac{\partial J}{\partial h_{L}}=\frac{\partial J}{\partial Z_{L}} \frac{d Z_{L}}{d h_{L}}=\frac{\partial J}{\partial Z_{L}} \frac{e^{-h_{L}}}{\left(1+e^{-h_{L}}\right)^{2}}=\frac{\partial J}{\partial Z_{L}} \frac{1}{1+e^{-h_{L}}} \frac{e^{-h_{L}}}{1+e^{-h_{L}}}} \\ {=\frac{\partial J}{\partial Z_{L}} Z_{L}\left(1-Z_{L}\right)}\end{array}

When $f_L$ is tanh:

Z_{L}=\frac{e^{h_{L}}-e^{-h_{L}}}{e^{h_{L}}+e^{-h_{L}}}

\begin{array}{l} {\frac{\partial J}{\partial h_{L}}=\frac{\partial J}{\partial Z_{L}} \frac{d Z_{L}}{d h_{L}}=\frac{\partial J}{\partial Z_{L}} \frac{4}{\left(e^{h_{L}}+e^{-h_{L}}\right)^{2}}=\frac{\partial J}{\partial Z_{L}}\left[1-\left(\frac{e^{h_{L}}-e^{-h_{L}}}{e^{h_{L}}+e^{-h_{L}}}\right)^{2}\right]} \\ {=\frac{\partial J}{\partial Z_{L}}\left[1-Z_{L}^{2}\right]}\end{array}

When $f_L$ is ReLU:

Z_L=\mathrm{relu}(h_L)=\left\{\begin{matrix}  0,&h_L\leq 0 \\   h_L,&h_L > 0  \end{matrix}\right.

\begin{array}{l}     \frac{\partial J}{\partial h_L}=\frac{\partial J}{\partial Z_L}\frac{\partial Z_L}{\partial h_L}=\left\{\begin{matrix}  0,&h_L\leq 0 \\   \frac{\partial J}{\partial Z_L},&h_L > 0  \end{matrix}\right. \end{array}
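All three activation derivatives can be written directly in terms of $Z$, as in the text; a minimal sketch, with a central-difference check for the sigmoid case:

```python
import numpy as np

h = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

Z_sig = 1.0 / (1.0 + np.exp(-h))
d_sig = Z_sig * (1.0 - Z_sig)        # sigmoid: Z(1 - Z)

Z_tanh = np.tanh(h)
d_tanh = 1.0 - Z_tanh ** 2           # tanh: 1 - Z^2

d_relu = (h > 0).astype(float)       # relu: 1 where h > 0, else 0

# central-difference check for the sigmoid derivative
eps = 1e-6
num = (1/(1 + np.exp(-(h + eps))) - 1/(1 + np.exp(-(h - eps)))) / (2 * eps)
print(np.allclose(num, d_sig))       # True
```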

3. Gradient Update

Layer by layer, the gradients are propagated and the parameters updated as follows:

\frac{\partial J}{\partial out} \Rightarrow \left \{\begin{matrix}     \frac{\partial J}{\partial W_{L+1}}=Z_L^T\frac{\partial J}{\partial out} \\     \frac{\partial J}{\partial Z_{L}}=\frac{\partial J}{\partial out}W_{L+1}^T \\     \left(\frac{\partial J}{\partial b}\right)^{T}=SumCol(\frac{\partial J}{\partial out}) \\     W_{L+1}^{t+1} = W_{L+1}^t-\eta \frac{\partial J}{\partial W_{L+1}} \\     b_{L+1}^{t+1} = b_{L+1}^t-\eta \frac{\partial J}{\partial b_{L+1}} \end{matrix} \right. \Rightarrow \frac{\partial J}{\partial h_L}=\frac{\partial J}{\partial Z_L}\frac{\partial Z_L}{\partial h_L} \Rightarrow \left \{\begin{matrix}      \frac{\partial J}{\partial W_{L}}=Z_{L-1}^T\frac{\partial J}{\partial h_L} \\     \frac{\partial J}{\partial Z_{L-1}}=\frac{\partial J}{\partial h_L}W_{L}^T \\     \vdots \\     \vdots  \end{matrix}\right. \Rightarrow \cdots
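The whole chain above fits in a short training loop; this is a minimal sketch for a one-hidden-layer tanh network on a made-up regression task (sizes, learning rate, and variable names are my own), not a general implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
N, m, m1, n = 8, 4, 6, 2
X = rng.normal(size=(N, m))
Y = rng.normal(size=(N, n))                  # regression targets
W1, b1 = 0.1 * rng.normal(size=(m, m1)), np.zeros(m1)
W2, b2 = 0.1 * rng.normal(size=(m1, n)), np.zeros(n)
eta = 0.1                                    # learning rate

losses = []
for step in range(200):
    # forward
    h1 = X @ W1 + b1
    Z1 = np.tanh(h1)
    out = Z1 @ W2 + b2
    losses.append(0.5 / N * np.sum((out - Y) ** 2))

    # backward, following the chain of formulas in this section
    d_out = (out - Y) / N        # dJ/dout for MSE
    dW2 = Z1.T @ d_out           # dJ/dW_{L+1} = Z_L^T dJ/dout
    db2 = d_out.sum(axis=0)      # column sums
    dZ1 = d_out @ W2.T           # dJ/dZ_L = dJ/dout W_{L+1}^T
    dh1 = dZ1 * (1 - Z1 ** 2)    # tanh: dZ/dh = 1 - Z^2
    dW1 = X.T @ dh1
    db1 = dh1.sum(axis=0)

    # gradient step W^{t+1} = W^t - eta * dJ/dW
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1

print(losses[0] > losses[-1])    # True: the loss decreased
```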