For a single sample $x$, the model is defined as follows:
$$\hat{y}=a=\sigma(z)=\sigma(w^Tx+b)$$
$$\sigma(z)=\frac{1}{1+e^{-z}}$$
$$z=w^Tx+b$$
where $\sigma(z)$ is the sigmoid function, used here as the activation function, as shown in the figure below. Sample $x$ can be classified according to the magnitude of its output. Since its range is $(0,1)$, the output of this activation function can be interpreted as the probability of a class.
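A minimal sketch of this model in NumPy (the function names `sigmoid`, `predict_proba`, and `predict` are illustrative choices, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Model output y_hat = sigma(w^T x + b) for a single sample x."""
    z = np.dot(w, x) + b
    return sigmoid(z)

def predict(w, b, x, threshold=0.5):
    """Classify x by thresholding the predicted probability."""
    return 1 if predict_proba(w, b, x) >= threshold else 0
```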
loss function:
$$l(\hat{y},y)=-y\log\hat{y}-(1-y)\log(1-\hat{y})$$
cost function:
$$J(w,b)=\frac{1}{m}\sum_{i=1}^{m}l(\hat{y}^{(i)},y^{(i)})=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right]$$
where $m$ is the number of samples and $i$ indexes the samples. The loss function $l$ measures the gap between the predicted and actual value for a single sample; the cost function $J$ measures the average gap over all samples.
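A vectorized sketch of these two functions (names are illustrative; a small epsilon clip keeps the logs finite when $\hat{y}$ hits 0 or 1, an implementation detail not in the formulas above):

```python
import numpy as np

def loss(y_hat, y, eps=1e-12):
    """Per-sample binary cross-entropy loss l(y_hat, y)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def cost(y_hat, y):
    """Cost J(w, b): average loss over all m samples."""
    return np.mean(loss(y_hat, y))
```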
For a hypothesized distribution $q$ and a fixed true distribution $p$, the cross-entropy is:
$$CEH(p,q)=E_p[-\log(q)]=-\sum_{x\in X}p(x)\log q(x)$$
Treat the sample data as the known distribution $p$; then for sample $x$ the probability is:
$$p(x)=\begin{cases} y, & y=1 \\ 1-y, & y=0 \end{cases}$$
The predicted distribution is $q$; then for sample $x$ the probability is:
$$q(x)=q(y|w,b,x)=\begin{cases} \hat{y}, & y=1 \\ 1-\hat{y}, & y=0 \end{cases}$$
Substituting these two probabilities into the cross-entropy formula and averaging over the $m$ samples gives:
$$J(w,b)=\frac{1}{m}\sum_{i=1}^{m}CEH(p^{(i)},q^{(i)})=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right]$$
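Expanding the cross-entropy for one sample makes the link to the loss explicit: summing over the two outcomes $y\in\{0,1\}$ collapses to exactly $l(\hat{y},y)$:
$$CEH(p,q)=-\left[p(1)\log q(1)+p(0)\log q(0)\right]=-\left[y\log\hat{y}+(1-y)\log(1-\hat{y})\right]=l(\hat{y},y)$$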
From the maximum-likelihood view, the likelihood of the data is:
$$L=\prod_{i=1}^{m}(\hat{y}^{(i)})^{y^{(i)}}(1-\hat{y}^{(i)})^{(1-y^{(i)})}$$
That is, the goal is $\max_{w,b}L$, which is equivalent to $\min_{w,b}(-\log L)$. Taking the logarithm turns the product into a log-likelihood; negating and averaging it yields the cost function.
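Carrying that step out explicitly (superscripts added per sample):
$$-\frac{1}{m}\log L=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right]=J(w,b)$$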
Basic solution method: gradient descent
1. The goal is to find $w,b$ that minimize the cost function, so we need the partial derivatives of the loss with respect to $w$ and $b$ (differentiating a composite function: the chain rule):
$$\frac{\partial l}{\partial w}=\frac{\partial l}{\partial \hat{y}}\cdot\frac{\partial \hat{y}}{\partial z}\cdot\frac{\partial z}{\partial w}$$
$$\frac{\partial l}{\partial b}=\frac{\partial l}{\partial \hat{y}}\cdot\frac{\partial \hat{y}}{\partial z}\cdot\frac{\partial z}{\partial b}$$
2. The derivative of the sigmoid activation function $\sigma(z)=\frac{1}{1+e^{-z}}$ is:
$$\sigma'(z)=\frac{d\hat{y}}{dz}=\sigma(z)(1-\sigma(z))=\hat{y}(1-\hat{y})$$
3. The partial derivative of the loss with respect to $\hat{y}$:
$$\frac{\partial l}{\partial \hat{y}}=-\frac{y}{\hat{y}}+\frac{1-y}{1-\hat{y}}$$
4. With $x$ an $n$-dimensional vector:
$$\frac{\partial z}{\partial w_1}=x_1,\quad \frac{\partial z}{\partial w_2}=x_2,\quad \dots,\quad \frac{\partial z}{\partial w_n}=x_n,\quad \frac{\partial z}{\partial b}=1$$
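Multiplying the factors from steps 2 and 3 fills in the intermediate step the next line relies on; the $\hat{y}(1-\hat{y})$ factor cancels both denominators, leaving a very simple expression:
$$\frac{\partial l}{\partial z}=\frac{\partial l}{\partial \hat{y}}\cdot\frac{\partial \hat{y}}{\partial z}=\left(-\frac{y}{\hat{y}}+\frac{1-y}{1-\hat{y}}\right)\hat{y}(1-\hat{y})=\hat{y}-y$$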
For a sample $x$, by the chain rule:
$$\frac{\partial l}{\partial w}=(\hat{y}-y)x$$
Therefore:
$$\frac{\partial J}{\partial w}=\frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})x^{(i)}=\frac{1}{m}X(\hat{Y}-Y)^{T}$$
$$\frac{\partial J}{\partial b}=\frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})$$
Note: $X$ is an $n\times m$ matrix, where $n$ is the feature dimension and $m$ is the number of samples.
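A vectorized NumPy sketch of these gradients, following the $n\times m$ layout of $X$ stated in the note (names are illustrative):

```python
import numpy as np

def gradients(w, b, X, Y):
    """Gradients of J with respect to w and b.

    X: (n, m) matrix, one sample per column.
    Y: (m,) vector of labels in {0, 1}.
    """
    m = X.shape[1]
    Y_hat = 1.0 / (1.0 + np.exp(-(w @ X + b)))  # predictions, shape (m,)
    dw = (X @ (Y_hat - Y)) / m                  # (1/m) X (Y_hat - Y)^T, shape (n,)
    db = np.sum(Y_hat - Y) / m                  # scalar
    return dw, db
```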
repeat: {
$$w := w-\alpha\frac{\partial J}{\partial w}$$
$$b := b-\alpha\frac{\partial J}{\partial b}$$
}
where $\alpha$ is the learning rate.
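Putting the pieces together, a minimal end-to-end training loop (a sketch under simple assumptions, not a tuned implementation; the learning rate, iteration count, and toy data are arbitrary illustrative values):

```python
import numpy as np

def train(X, Y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for logistic regression.

    X: (n, m) feature matrix, Y: (m,) labels in {0, 1}.
    Returns learned parameters w (n,) and b (scalar).
    """
    n, m = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        Y_hat = 1.0 / (1.0 + np.exp(-(w @ X + b)))  # forward pass
        dw = (X @ (Y_hat - Y)) / m                  # dJ/dw
        db = np.sum(Y_hat - Y) / m                  # dJ/db
        w -= alpha * dw                             # update step
        b -= alpha * db
    return w, b

# Toy usage: two 1-D clusters centered at -2 and +2.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)]).reshape(1, -1)
Y = np.concatenate([np.zeros(50), np.ones(50)])
w, b = train(X, Y)
```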