I recently worked through the parameter partial derivatives in neural-network gradient updates by hand, so here are my notes.

Consider a very simple neural network: the input is embedded, passes through a single fully connected layer, and then a softmax produces the predicted probabilities.
Network structure (the original post had a flow diagram here):

Input $X$ ($1 \times n$) → Embedding layer $Z_1 = XW$ ($1 \times N$) → Activation layer (ReLU) $Z_2 = \mathrm{relu}(Z_1)$ ($1 \times N$) → Fully connected layer $Z_3 = Z_2 w + b$ ($1 \times K$) → Activation layer (Softmax) $Y = \sigma(Z_3)$ ($1 \times K$)

For now, consider a single sample $X$ with $n$ feature dimensions. The embedding layer can be viewed as an $n \times N$ matrix $W$, which expands the representation to the $N$-dimensional $Z_1$; this passes through a ReLU activation. Next comes a fully connected layer with $K$ neurons, whose weight $w$ is an $N \times K$ matrix and bias $b$ is a $1 \times K$ vector. Its output $Z_3$ goes through a softmax layer, giving $Y_i$, the probability that the sample belongs to class $i$ among the $K$ classes.
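The forward pass described above can be sketched in NumPy. This is a minimal sketch with made-up dimensions ($n=4$, $N=8$, $K=3$) and random weights, just to make the shapes concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, K = 4, 8, 3                 # input dim, embedding dim, number of classes

X = rng.normal(size=(1, n))       # one sample, 1 x n
W = rng.normal(size=(n, N))       # embedding matrix, n x N
w = rng.normal(size=(N, K))       # fully connected weights, N x K
b = rng.normal(size=(1, K))       # bias, 1 x K

Z1 = X @ W                        # embedding output, 1 x N
Z2 = np.maximum(Z1, 0)            # ReLU activation, 1 x N
Z3 = Z2 @ w + b                   # fully connected output, 1 x K
expZ = np.exp(Z3 - Z3.max())      # subtract max for numerical stability
Y = expZ / expZ.sum()             # softmax, 1 x K; rows sum to 1
```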
The model loss is the standard cross-entropy:

$$L = -\sum\limits_{i=1}^{K} y_i \ln Y_i$$

where $y$ is the ground truth (one-hot), $Y$ is the prediction, and $i = 1, 2, 3, \ldots, K$ indexes the $K$ classes.
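For a one-hot $y$, only the true class contributes to the sum. A tiny numerical illustration (the probability values here are made up):

```python
import numpy as np

Y = np.array([0.7, 0.2, 0.1])   # predicted class probabilities (assumed values)
y = np.array([1.0, 0.0, 0.0])   # one-hot ground truth: true class is class 0
L = -np.sum(y * np.log(Y))      # only the true-class term survives: -ln(0.7)
```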
Compute the gradient with respect to the fully connected layer output $Z_{3i}$:

$$\begin{aligned} \frac{\partial L}{\partial Z_{3i}} = \frac{\partial L}{\partial Y} \times \frac{\partial Y}{\partial Z_{3i}} &= \sum\limits_{j=1}^{K} \frac{\partial L}{\partial Y_j} \times \frac{\partial Y_j}{\partial Z_{3i}} \\ &= \sum\limits_{j=1}^{K} -y_j \times \frac{1}{Y_j} \times \frac{\partial Y_j}{\partial Z_{3i}} \end{aligned}$$

The softmax derivative must be split into the cases $j = i$ and $j \neq i$, which is why the summation index is written as $j$, to distinguish it from $i$.

When $i = j$:

$$\frac{\partial Y_j}{\partial Z_{3i}} = \frac{\partial Y_i}{\partial Z_{3i}} = \frac{\partial }{\partial Z_{3i}} \left[\frac{e^{Z_{3i}}}{\sum\limits_{k=1}^{K} e^{Z_{3k}}}\right] = \frac{e^{Z_{3i}} \times \sum\limits_{k=1}^{K} e^{Z_{3k}} - e^{Z_{3i}} \times e^{Z_{3i}}}{\left(\sum\limits_{k=1}^{K} e^{Z_{3k}}\right)^2} = Y_i(1-Y_i)$$

When $i \neq j$:

$$\frac{\partial Y_j}{\partial Z_{3i}} = \frac{\partial }{\partial Z_{3i}} \left[\frac{e^{Z_{3j}}}{\sum\limits_{k=1}^{K} e^{Z_{3k}}}\right] = \frac{ - e^{Z_{3j}} \times e^{Z_{3i}}}{\left(\sum\limits_{k=1}^{K} e^{Z_{3k}}\right)^2} = -Y_i Y_j$$

Combining the two cases:

$$\begin{aligned} \frac{\partial L}{\partial Y} \times \frac{\partial Y}{\partial Z_{3i}} &= \sum\limits_{j=1}^{K} -y_j \times \frac{1}{Y_j} \times \frac{\partial Y_j}{\partial Z_{3i}} \\ &= \sum\limits_{j=1, j \neq i}^{K} -y_j \times \frac{1}{Y_j} \times (-Y_i Y_j) + (-y_i) \times \frac{1}{Y_i} \times Y_i(1-Y_i) \\ &= \sum\limits_{j=1}^{K} -y_j \times \frac{1}{Y_j} \times (-Y_i Y_j) + (-y_i) \times \frac{1}{Y_i} \times Y_i \\ &= \sum\limits_{j=1}^{K} y_j \times Y_i - y_i \end{aligned}$$

For a classification problem, exactly one class has value 1 and the rest are 0, i.e. $\sum\limits_{i=1}^{K} y_i = 1$, so this simplifies further to

$$\frac{\partial L}{\partial Y} \times \frac{\partial Y}{\partial Z_{3i}} = Y_i - y_i$$
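The clean result $\partial L / \partial Z_{3i} = Y_i - y_i$ is easy to verify numerically. A minimal sketch (logits and label are made-up values) comparing the derived gradient against a central-difference estimate:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

Z3 = np.array([1.0, -0.5, 2.0])   # assumed logits
y = np.array([0.0, 1.0, 0.0])     # one-hot label

analytic = softmax(Z3) - y        # the derived gradient: Y - y

# central-difference numerical gradient of L = -sum(y * ln(softmax(Z3)))
eps = 1e-6
numeric = np.zeros_like(Z3)
for i in range(len(Z3)):
    zp, zm = Z3.copy(), Z3.copy()
    zp[i] += eps
    zm[i] -= eps
    Lp = -np.sum(y * np.log(softmax(zp)))
    Lm = -np.sum(y * np.log(softmax(zm)))
    numeric[i] = (Lp - Lm) / (2 * eps)

match = np.allclose(analytic, numeric, atol=1e-5)  # should agree
```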
Consider the simplest binary classification problem:

$$L = - \sum\limits_{i=1}^{2} y_i \ln Y_i = - y \ln Y - (1-y) \ln(1-Y)$$

$$Y = \frac{1}{1+e^{-Z}}$$

$$\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial Y} \times \frac{\partial Y}{\partial Z} = \left[- y \times \frac{1}{Y} + (1-y) \times \frac{1}{1-Y} \right] \times Y(1-Y) = Y - y$$
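The same $Y - y$ form for the sigmoid case can also be checked numerically. A quick sketch with made-up values for $z$ and $y$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, y = 0.8, 1.0                  # assumed logit and label
Y = sigmoid(z)
analytic = Y - y                 # the derived gradient dL/dZ

def loss(z):
    Y = sigmoid(z)
    return -y * np.log(Y) - (1 - y) * np.log(1 - Y)

# central-difference estimate of dL/dZ
eps = 1e-6
numeric = (loss(z + eps) - loss(z - eps)) / (2 * eps)
match = abs(analytic - numeric) < 1e-6  # should agree
```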
Compute the gradient with respect to the embedding matrix entry $W_{lm}$:

$$\begin{aligned} \frac{\partial L}{\partial W_{lm}} &= \frac{\partial L}{\partial Y} \times \frac{\partial Y}{\partial Z_3} \times \frac{\partial Z_3}{\partial Z_2} \times \frac{\partial Z_2}{\partial Z_1} \times \frac{\partial Z_1}{\partial W_{lm}} \\ &= \sum\limits_{i=1}^{K} \left[\frac{\partial L}{\partial Y} \times \frac{\partial Y}{\partial Z_{3i}}\right]_1 \times \left[\frac{\partial Z_{3i}}{\partial Z_2} \times \frac{\partial Z_2}{\partial Z_1} \times \frac{\partial Z_1}{\partial W_{lm}}\right]_2 \end{aligned}$$

We already obtained $[\;]_1$ above, so only $[\;]_2$ remains. Since $W_{lm}$ affects only $Z_{1m}$ (and hence only $Z_{2m}$),

$$\begin{aligned} \frac{\partial Z_{3i}}{\partial Z_2} \times \frac{\partial Z_2}{\partial Z_1} \times \frac{\partial Z_1}{\partial W_{lm}} &= \frac{\partial Z_{3i}}{\partial Z_{2m}} \times \frac{\partial Z_{2m}}{\partial Z_{1m}} \times \frac{\partial Z_{1m}}{\partial W_{lm}} \\ &= w_{mi} \times \max\left(\frac{Z_{1m}}{|Z_{1m}|}, 0\right) \times X_l \end{aligned}$$

Note:

$$Z_{3i} = \sum\limits_{j=1}^N Z_{2j} w_{ji}, \qquad Z_{1m} = \sum\limits_{j=1}^n X_{j} W_{jm}$$

(The factor $\max(Z_{1m}/|Z_{1m}|, 0)$ is just the ReLU derivative: 1 when $Z_{1m} > 0$ and 0 otherwise.)

Putting it all together:

$$\frac{\partial L}{\partial W_{lm}} = \sum\limits_{i=1}^{K} (Y_i - y_i) \times w_{mi} \times \max\left(\frac{Z_{1m}}{|Z_{1m}|}, 0\right) \times X_l$$

As an aside, I recommend this post on why softmax is used rather than MSE: https://blog.csdn.net/xg123321123/article/details/80781611
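The final formula can be written in vectorized form and checked against a numerical gradient. A minimal sketch with made-up dimensions and random weights (verifying only the single entry $W_{00}$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, K = 4, 8, 3
X = rng.normal(size=n)                 # one sample
W = rng.normal(size=(n, N))            # embedding matrix
w = rng.normal(size=(N, K))            # fully connected weights
b = rng.normal(size=K)                 # bias
y = np.eye(K)[0]                       # one-hot label, true class 0

def forward(W):
    Z1 = X @ W
    Z2 = np.maximum(Z1, 0)             # ReLU
    Z3 = Z2 @ w + b
    e = np.exp(Z3 - Z3.max())
    return Z1, e / e.sum()             # return Z1 and softmax output Y

Z1, Y = forward(W)

# derived formula: dL/dW_lm = sum_i (Y_i - y_i) * w_mi * 1[Z1_m > 0] * X_l
grad = np.outer(X, (Z1 > 0) * ((Y - y) @ w.T))

# central-difference check on entry (l, m) = (0, 0)
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
Lp = -np.sum(y * np.log(forward(Wp)[1]))
Lm = -np.sum(y * np.log(forward(Wm)[1]))
match = abs(grad[0, 0] - (Lp - Lm) / (2 * eps)) < 1e-5  # should agree
```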