Both k-means clustering and the Gaussian mixture model can be viewed as applications of the EM algorithm. A clustering model tries to infer, from the internal structure of the data, how many groups the data falls into and which group each sample belongs to.
Let $\theta$ denote the model parameters to be estimated. For k-means, $\theta$ is the set of cluster centers $\mu_1,\cdots,\mu_k$, and the latent variable $Z$ is the final cluster assignment, taking values $1,\cdots,k$. For a Gaussian mixture, $\theta$ consists of the component parameters $\alpha_i,\mu_i,\Sigma_i$, and $Z$ again indicates which of the $K$ components generated the sample, taking values $1,\cdots,k$.
The true class $z_i$ of each sample $x_i$ is a latent random variable and is unknown, so the EM algorithm proceeds as follows:

- Initialize $\theta^{(0)}$.
- E-step: compute $E(z_i)$, which can be expressed in terms of $\theta^{(n)}$, i.e. the expectation of each sample's true class given the current parameter value $\theta^{(n)}$. For k-means, this step computes the likely assignment $\hat z_i$ of each sample given the current cluster centers $\mu_1^{(n)},\cdots,\mu_k^{(n)}$.
- M-step: substitute $E(z_i)$ for $z_i$ in $L(\theta)$, and find the $\theta$ that maximizes the likelihood in this iteration, $\theta^{(n+1)}=\arg \max_{\theta}L(\theta)$. For k-means, this means partitioning the samples by the previous round's cluster centers and then updating each center to the centroid of its assigned subset.
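The k-means special case of these two steps can be sketched in a few lines (a minimal illustration, assuming samples in a NumPy array; the function name `kmeans_em` and the first-$k$-samples initialization are choices made here, not part of the text — k-means++ initialization is better in practice):

```python
import numpy as np

def kmeans_em(X, k, n_iter=100):
    """Minimal k-means viewed as EM: E-step = hard assignment to the
    nearest center, M-step = recompute each center as a centroid."""
    # initialize theta^(0) from the first k samples (a simple choice)
    centers = X[:k].astype(float).copy()
    for _ in range(n_iter):
        # E-step: z_hat_i = index of the nearest center under theta^(n)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        z = d.argmin(axis=1)
        # M-step: theta^(n+1) = centroid of each assigned subset
        new_centers = np.array([X[z == j].mean(axis=0) if np.any(z == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, z
```

The hard assignment in the E-step replaces the full posterior $P(z_i|x_i,\theta^{(n)})$ with a point estimate, which is why k-means is sometimes called "hard EM".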
When the data are complete, the joint probability distribution of $X$ and $Z$ is $P(x,z|\theta)$, and the likelihood is

$$P(X,Z|\theta) = \prod_{i=1}^m P(x_i,z_i|\theta)$$

with log-likelihood

$$L(\theta)=\log P(X,Z|\theta)=\log \prod_{i=1}^m P(x_i,z_i|\theta) = \sum_{i=1}^m\log P(x_i,z_i|\theta)$$
When the data are incomplete, the likelihood over the $m$ samples is

$$l = P(X|\theta) = \prod_{i=1}^m P(x_i|\theta) =\prod_{i=1}^m\Big( \sum_{z_i = 1}^k P(x_i,z_i|\theta)\Big)$$

and the log-likelihood is

$$L(\theta)=\log P(X|\theta)= \sum_{i=1}^m\log P(x_i|\theta) = \sum_{i=1}^m\log \sum_{z_i = 1}^k P(x_i,z_i|\theta)$$

EM maximizes this likelihood by iterating on $\theta$. Suppose the value at the $n$-th iteration is $\theta^{(n)}$; when updating $\theta$ we want $L(\theta) - L(\theta^{(n)})>0$, so that the likelihood increases step by step. Consider the difference

$$\begin{aligned} L(\theta)-L(\theta^{(n)}) &= \sum_{i=1}^m\log \sum_{z_i = 1}^k P(x_i,z_i|\theta) - \sum_{i=1}^m\log P(x_i|\theta^{(n)})\\ &=\sum_{i=1}^m\log\Big[\sum_{z_i = 1}^k P(z_i|x_i,\theta^{(n)})\frac{P(x_i,z_i|\theta)}{P(z_i|x_i,\theta^{(n)})}\Big] -\sum_{i=1}^m\log P(x_i|\theta^{(n)}) \end{aligned}$$
By Jensen's inequality, when $f(x)$ is concave, $f(\sum_i \alpha_i x_i) \geqslant \sum_i \alpha_i f(x_i)$ for weights $\alpha_i \geqslant 0$ with $\sum_i \alpha_i = 1$. Here $f(x)=\log x$ is concave, and $\sum_{z_i=1}^k P(z_i|x_i,\theta^{(n)})=1$. Using these two facts:

$$\begin{aligned} L(\theta)-L(\theta^{(n)}) &= \sum_{i=1}^m\log\Big[\sum_{z_i = 1}^k P(z_i|x_i,\theta^{(n)})\frac{P(x_i,z_i|\theta)}{P(z_i|x_i,\theta^{(n)})}\Big] -\sum_{i=1}^m\log P(x_i|\theta^{(n)})\\ &\geqslant \sum_{i=1}^m \Big[\sum_{z_i = 1}^k P(z_i|x_i,\theta^{(n)})\log\frac{P(x_i,z_i|\theta)}{P(z_i|x_i,\theta^{(n)})} - \sum_{z_i = 1}^k P(z_i|x_i,\theta^{(n)})\log P(x_i|\theta^{(n)})\Big]\\ &= \sum_{i=1}^m \Big[\sum_{z_i = 1}^k P(z_i|x_i,\theta^{(n)})\log\frac{P(x_i,z_i|\theta)}{P(z_i|x_i,\theta^{(n)})P(x_i|\theta^{(n)})} \Big]\\ & = \sum_{i=1}^m \Big[\sum_{z_i = 1}^k P(z_i|x_i,\theta^{(n)})\log\frac{P(x_i,z_i|\theta)}{P(x_i,z_i|\theta^{(n)})} \Big] \end{aligned}$$

Let

$$B(\theta,\theta^{(n)}) = L(\theta^{(n)}) + \sum_{i=1}^m \Big[\sum_{z_i = 1}^k P(z_i|x_i,\theta^{(n)})\log\frac{P(x_i,z_i|\theta)}{P(x_i,z_i|\theta^{(n)})} \Big]$$

so that $L(\theta) \geqslant B(\theta,\theta^{(n)})$.
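Jensen's inequality for the (concave) logarithm can be spot-checked numerically: for any non-negative weights $\alpha_i$ summing to 1 and positive $x_i$, the gap $\log(\sum_i \alpha_i x_i) - \sum_i \alpha_i \log x_i$ should never be negative (a small sketch, not part of the derivation itself):

```python
import math
import random

def jensen_gap(alpha, x):
    """log(sum_i alpha_i * x_i) - sum_i alpha_i * log(x_i);
    non-negative when the alpha_i are non-negative and sum to 1."""
    lhs = math.log(sum(a * xi for a, xi in zip(alpha, x)))
    rhs = sum(a * math.log(xi) for a, xi in zip(alpha, x))
    return lhs - rhs

# random spot-check over many weight/value combinations
random.seed(0)
for _ in range(1000):
    w = [random.random() + 1e-9 for _ in range(5)]
    alpha = [wi / sum(w) for wi in w]
    x = [random.uniform(0.1, 10.0) for _ in range(5)]
    assert jensen_gap(alpha, x) >= -1e-12
```

Equality holds when all $x_i$ are equal, matching the equality case $\theta=\theta^{(n)}$ used below.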
Equality holds when $\theta = \theta^{(n)}$:

$$\begin{aligned} B(\theta^{(n)},\theta^{(n)}) & = L(\theta^{(n)}) + \sum_{i=1}^m \Big[\sum_{z_i = 1}^k P(z_i|x_i,\theta^{(n)})\log\frac{P(x_i,z_i|\theta^{(n)})}{P(x_i,z_i|\theta^{(n)})} \Big]\\ &= L(\theta^{(n)}) \end{aligned}$$

Thus $B(\theta,\theta^{(n)})$ is a lower bound of $L(\theta)$: if we can maximize $B(\theta,\theta^{(n)})$, we also make $L(\theta)$ larger. Update $\theta^{(n+1)}$ to the value that maximizes $B(\theta,\theta^{(n)})$:

$$\begin{aligned} \theta^{(n+1)} &=\arg \max_{\theta} B(\theta,\theta^{(n)}) \\ &=\arg \max_{\theta}\; L(\theta^{(n)}) + \sum_{i=1}^m \Big[\sum_{z_i = 1}^k P(z_i|x_i,\theta^{(n)})\log\frac{P(x_i,z_i|\theta)}{P(x_i,z_i|\theta^{(n)})} \Big]\\ &= \arg \max_{\theta} \sum_{i=1}^m \Big[\sum_{z_i = 1}^k P(z_i|x_i,\theta^{(n)})\log P(x_i,z_i|\theta)\Big]+\mathrm{constant} \\ &= \arg \max_{\theta} Q(\theta,\theta^{(n)}) \end{aligned}$$
where

$$Q(\theta,\theta^{(n)})=\sum_{i=1}^m \Big[\sum_{z_i = 1}^k P(z_i|x_i,\theta^{(n)})\log P(x_i,z_i|\theta)\Big] = \sum_{i=1}^m E\big(\log P(x_i,z_i|\theta)\,\big|\,x_i,\theta^{(n)}\big)$$

Here $P(z_i|x_i,\theta^{(n)})$ is the conditional distribution of the latent variable $z_i$ given the observed data $x_i$ and the current parameters $\theta^{(n)}$. The figure below shows the relationship between the incomplete-data log-likelihood $L(\theta)$ and $B(\theta,\theta^{(n)})$.
At $\theta = \theta^{(n)}$, construct the curve $B(\theta,\theta^{(n)})$ and find its maximizer $\theta^{(n+1)}$; then, at $\theta^{(n+1)}$, construct the curve $B(\theta,\theta^{(n+1)})$, and repeat.
(Different initial values may lead EM to different parameter estimates.)
Input: observed data $X$, latent data $Z$, joint distribution $P(X,Z|\theta)$, and conditional distribution $P(Z|X,\theta)$ (used to compute the expectation over $Z$, which is then substituted into the joint distribution to maximize over the unknowns). Output: model parameters $\theta$.
1) Choose an initial value $\theta^{(0)}$ and start iterating.

2) E-step: with $\theta^{(n)}$ the parameter estimate from the $n$-th iteration, compute at iteration $n+1$

$$Q(\theta,\theta^{(n)})=\sum_{i=1}^m \Big[\sum_{z_i = 1}^k P(z_i|x_i,\theta^{(n)})\log P(x_i,z_i|\theta)\Big]$$

where $P(z_i|x_i,\theta^{(n)})$ is the conditional distribution of the latent variable $z_i$ given the observed data $x_i$ and the current estimate $\theta^{(n)}$.

3) M-step: find the $\theta$ that maximizes $Q(\theta,\theta^{(n)})$, giving the estimate for iteration $n+1$:

$$\theta^{(n+1)} = \arg \max_{\theta}Q(\theta,\theta^{(n)})$$

4) Repeat steps 2 and 3 until convergence (a stopping criterion such as $||\theta^{(n+1)} - \theta^{(n)}|| \leqslant \epsilon_1$).

Gaussian mixture model: each sample comes from one of several Gaussian distributions, with different probabilities.

$$\begin{aligned} P(x|\theta) &= \sum_z p(x,z|\theta) = \sum_z p(z|\theta)\,p(x|z,\theta) \\ &=\sum_{k=1}^K\alpha_k \phi(x|\theta_k) \end{aligned}$$
where $z \in \{1,\cdots,K\}$ is the latent variable indicating which Gaussian the sample was drawn from; $p(z=k|\theta)=\alpha_k$ is the weight of the $k$-th component, with $\alpha_k\geqslant 0$ and $\sum_{k=1}^K \alpha_k=1$;
$p(x|z=k,\theta)= \phi(x|\theta_k)$, where $\phi(x|\theta_k)$ is the Gaussian density with $\theta_k=(\mu_k,\sigma_k^2)$; the $k$-th component is

$$\phi(x|\theta_k) = \frac{1}{\sqrt{2\pi}\sigma_k}\exp\Big(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\Big)$$
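The component density and the mixture $P(x|\theta)=\sum_k \alpha_k\phi(x|\theta_k)$ translate directly into code (a univariate sketch; the function names are illustrative, not from the text):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """phi(x | theta_k) with theta_k = (mu_k, sigma_k^2), univariate."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def mixture_pdf(x, alpha, mu, sigma2):
    """P(x | theta) = sum_k alpha_k * phi(x | theta_k)."""
    return sum(a * gaussian_pdf(x, m, s)
               for a, m, s in zip(alpha, mu, sigma2))
```

For example, `mixture_pdf(x, [0.3, 0.7], [0.0, 4.0], [1.0, 2.0])` evaluates a two-component mixture at `x`.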
$\theta=(\alpha_1,\cdots,\alpha_K;\,\mu_1,\cdots,\mu_K;\,\sigma_1^2,\cdots,\sigma_K^2)$
Observed data: $x_j$, $j = 1,2,\cdots,N$.
Latent variables $\gamma_{jk}$:

$$\gamma_{jk}=\begin{cases}1,& \text{observation } j \text{ comes from component } k\\0,& \text{otherwise}\end{cases}$$

so $E\gamma_{jk} = P(\gamma_{jk}=1)$. (The sample contains only the observations $x_j$; which component generated each observation is unknown.)
Each observation $x_j$ has a corresponding unobserved vector $\gamma_j = (\gamma_{j1},\gamma_{j2},\cdots,\gamma_{jK})$ (also written $r_j=(r_{j1},r_{j2},\cdots,r_{jK})$ below), whose possible values are $(1,0,\cdots,0),(0,1,\cdots,0),\cdots,(0,0,\cdots,1)$: exactly one entry is 1 and the rest are 0.
The EM Q-function is

$$Q(\theta,\theta^{(n)})=\sum_{i=1}^m \Big[\sum_{z_i = 1}^K P(z_i|x_i,\theta^{(n)})\log P(x_i,z_i|\theta)\Big]$$

For the Gaussian mixture model (where $\theta_{z_i}$ is the parameter of the $z_i$-th Gaussian within the overall $\theta$, and $\alpha_{z_i}$ is the weight of the $z_i$-th Gaussian):

$$\begin{aligned} P(x_i,z_i|\theta) &= P(x_i|z_i,\theta)P(z_i|\theta)= \alpha_{z_i}\phi(x_i|\theta_{z_i})\\ P(z_i|x_i,\theta^{(n)}) &= \frac{\alpha_{z_i}^{(n)}\phi(x_i|\theta_{z_i}^{(n)})}{\sum_{k=1}^K \alpha_k^{(n)}\phi(x_i|\theta_k^{(n)})}\\ Q(\theta,\theta^{(n)})&=\sum_{i=1}^m\Big[ \sum_{z_i = 1}^K \frac{\alpha_{z_i}^{(n)}\phi(x_i|\theta_{z_i}^{(n)})}{\sum_{k=1}^K \alpha_k^{(n)}\phi(x_i|\theta_k^{(n)})}\log \alpha_{z_i}\phi(x_i|\theta_{z_i})\Big] \end{aligned}$$
In the formulas below, for each sample $x_j$ exactly one of $r_{j1},r_{j2},\cdots,r_{jK}$ equals 1 and the rest are 0. The probability of each sample can therefore be written as

$$p(x_j,r_j|\theta)=\prod_{k=1}^K \big[\alpha_k^{r_{jk}}\phi(x_j|\theta_k)^{r_{jk}}\big]$$

and the complete-data likelihood over all samples is

$$\begin{aligned} P(x,\gamma|\theta) &= \prod_{j=1}^N\prod_{k=1}^K\big[\alpha_k^{r_{jk}}\phi(x_j|\theta_k)^{r_{jk}}\big] \\ \prod_{j=1}^N\prod_{k=1}^K\alpha_k^{r_{jk}} &= \prod_{k=1}^K\prod_{j=1}^N\alpha_k^{r_{jk}} = \prod_{k=1}^K\alpha_k^{\sum_{j=1}^N r_{jk}} =\prod_{k=1}^K\alpha_k^{n_k}\\ P(x,\gamma|\theta)&= \prod_{k=1}^K\Big[\alpha_k^{n_k}\prod_{j=1}^N\phi(x_j|\theta_k)^{r_{jk}}\Big] \\ &= \prod_{k=1}^K\Big[\alpha_k^{n_k}\prod_{j=1}^N\Big[\frac{1}{\sqrt{2\pi}\sigma_k}\exp\Big(-\frac{(x_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big]^{r_{jk}} \Big] \end{aligned}$$

where $n_k = \sum_{j=1}^N r_{jk}$ is the number of observations generated by the $k$-th component, and $\sum_{k=1}^K n_k = N$.
The log-likelihood is

$$\log P(x,\gamma|\theta)= \sum_{k=1}^K \Big\{ n_k\log \alpha_k + \sum_{j=1}^N r_{jk}\Big[\log\frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{(x_j-\mu_k)^2}{2\sigma_k^2}\Big] \Big\}$$

E-step of the algorithm:

$$\begin{aligned} Q(\theta,\theta^{(i)}) &= E_{\gamma}\big[\log P(x,\gamma|\theta)\,\big|\,x,\theta^{(i)}\big] \\ &= E_{\gamma} \Big\{ \sum_{k=1}^K \Big[n_k\log \alpha_k + \sum_{j=1}^N r_{jk}\big[\log\frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{(x_j-\mu_k)^2}{2\sigma_k^2}\big] \Big]\Big\} \\ &= \sum_{k=1}^K\Big[E(n_k)\log \alpha_k + \sum_{j=1}^N E(r_{jk}) \big[\log\frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{(x_j-\mu_k)^2}{2\sigma_k^2}\big]\Big] \end{aligned}$$

Compute the expectations:

$$\begin{aligned} E(n_k) &= E\Big(\sum_{j=1}^N r_{jk}\Big) = \sum_{j=1}^N E(r_{jk}) \\ E(r_{jk}) &= E(r_{jk}|x,\theta^{(i)}) = P(r_{jk}=1|x,\theta^{(i)}) = \frac{P(r_{jk}=1,x_j|\theta^{(i)})}{P(x_j|\theta^{(i)})} \\ &=\frac{P(r_{jk}=1,x_j|\theta^{(i)})}{\sum_{l=1}^K P(r_{jl}=1,x_j|\theta^{(i)})} \\ &= \frac{P(r_{jk}=1|\theta^{(i)})P(x_j|r_{jk}=1,\theta^{(i)})}{\sum_{l=1}^K P(r_{jl}=1|\theta^{(i)})P(x_j|r_{jl}=1,\theta^{(i)})} \\ &= \frac{\alpha_k^{(i)}\phi(x_j|\theta_k^{(i)})}{\sum_{l=1}^K\alpha_l^{(i)}\phi(x_j|\theta_l^{(i)})} \end{aligned}$$

Substituting into the Q-function gives

$$Q(\theta,\theta^{(i)}) = \sum_{k=1}^K\Big[\sum_{j=1}^N E(r_{jk})\log \alpha_k + \sum_{j=1}^N E(r_{jk}) \big[\log\frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{(x_j-\mu_k)^2}{2\sigma_k^2}\big]\Big]$$
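The final expression for $E(r_{jk})$ is a normalized weighted density, which can be computed for all $j,k$ at once (a vectorized sketch for the univariate case; the name `e_step` is illustrative):

```python
import numpy as np

def e_step(x, alpha, mu, sigma2):
    """E(r_jk): responsibility of component k for observation j,
    computed as alpha_k * phi(x_j | theta_k), normalized over k."""
    x = np.asarray(x, dtype=float)[:, None]        # shape (N, 1)
    alpha = np.asarray(alpha, dtype=float)[None, :]  # shape (1, K)
    mu = np.asarray(mu, dtype=float)[None, :]
    sigma2 = np.asarray(sigma2, dtype=float)[None, :]
    phi = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    num = alpha * phi                              # alpha_k * phi(x_j | theta_k)
    return num / num.sum(axis=1, keepdims=True)    # rows sum to 1
```

Each row of the result is a probability distribution over the $K$ components for one observation.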
The two derivations give equivalent results:

$$\begin{aligned} Q(\theta,\theta^{(i)}) &= \sum_{k=1}^K\Big[\sum_{j=1}^N E(r_{jk})\log \alpha_k + \sum_{j=1}^N E(r_{jk}) \big[\log\frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{(x_j-\mu_k)^2}{2\sigma_k^2}\big]\Big] \\ &= \sum_{k=1}^K\Big[\sum_{j=1}^N E(r_{jk}) \log \big[\alpha_k \phi(x_j|\theta_k)\big] \Big] \\ &= \sum_{j=1}^N \Big[\sum_{k=1}^K \frac{\alpha_k^{(i)}\phi(x_j|\theta_k^{(i)})}{\sum_{l=1}^K\alpha_l^{(i)}\phi(x_j|\theta_l^{(i)})} \log \big[\alpha_k \phi(x_j|\theta_k)\big]\Big] \end{aligned}$$
M-step of the algorithm:
For $\mu_k$ and $\sigma_k^2$ there are no constraints, so set the derivatives of the Q-function to zero:

$$\begin{aligned} \frac{\partial Q(\theta,\theta^{(i)})}{\partial \mu_k} & = \sum_{j=1}^N E(r_{jk})\,\frac{x_j-\mu_k}{\sigma_k^2}=0 \\ \frac{\partial Q(\theta,\theta^{(i)})}{\partial \sigma_k^2}&= \sum_{j=1}^N E(r_{jk})\Big[-\frac{1}{2\sigma_k^2} + \frac{(x_j-\mu_k)^2}{2\sigma_k^4}\Big]=0 \end{aligned}$$

which give

$$\begin{aligned} \mu_k &=\frac{\sum_{j=1}^N E(r_{jk})\,x_j}{\sum_{j=1}^N E(r_{jk})} \\ \sigma_k^2 &= \frac{\sum_{j=1}^N E(r_{jk})(x_j-\mu_k)^2}{\sum_{j=1}^N E(r_{jk})} \end{aligned}$$

For $\alpha_k$, the constraint $\sum_{k=1}^K\alpha_k=1$ applies, so form the Lagrangian (where $G(\mu,\sigma^2)$ collects the terms of $Q$ that do not involve $\alpha$):

$$\begin{aligned} F(\alpha) &= Q + \beta\Big(\sum_{k=1}^K\alpha_k-1\Big)= \sum_{k=1}^K\sum_{j=1}^N E(r_{jk})\log \alpha_k + G(\mu,\sigma^2) + \beta\Big(\sum_{k=1}^K\alpha_k-1\Big) \\ \frac{\partial F(\alpha)}{\partial \alpha_k} &= \frac{\sum_{j=1}^N E(r_{jk})}{\alpha_k} + \beta = 0 \;\rightarrow\; \alpha_k = -\frac{\sum_{j=1}^N E(r_{jk})}{\beta} \\ \sum_{k=1}^K\alpha_k &= -\frac{\sum_{k=1}^K\sum_{j=1}^N E(r_{jk})}{\beta}=1 \;\rightarrow\; \beta=-\sum_{k=1}^K\sum_{j=1}^N E(r_{jk})=-N \end{aligned}$$
so that

$$\alpha_k = \frac{\sum_{j=1}^N E(r_{jk})}{N}$$

The "watermelon book" (Zhou Zhihua's *Machine Learning*) does not use the Q-function; it differentiates the Lagrangian of the incomplete-data likelihood directly:

$$\begin{aligned} P(x_j|\theta)&= \sum_{k=1}^K\alpha_k\phi(x_j|\theta_k)\\ L &=\sum_{j=1}^N\log \sum_{k=1}^K\alpha_k\phi(x_j|\theta_k) \\ \text{s.t.} &\quad \sum_{k=1}^K \alpha_k=1 \end{aligned}$$

The Lagrangian is

$$l = \sum_{j=1}^N\log \sum_{k=1}^K\alpha_k\phi(x_j|\theta_k) + \lambda\Big(\sum_{k=1}^K \alpha_k -1\Big)$$

Differentiating with respect to $\alpha_k,\mu_k,\Sigma_k$ yields the same results as the EM algorithm.
Input: observed data $x_1,\cdots,x_N$ and a Gaussian mixture model. Output: the Gaussian mixture model parameters.
1) Choose initial parameter values and start iterating.

2) E-step: with the current model parameters, compute the responsibility of component $k$ for each observation:

$$\hat{\gamma}_{jk} = E(\gamma_{jk})=\frac{\alpha_k^{(i)}\phi(x_j|\theta_k^{(i)})}{\sum_{l=1}^K\alpha_l^{(i)}\phi(x_j|\theta_l^{(i)})}, \quad j=1,2,\cdots,N;\; k=1,2,\cdots,K$$

3) M-step: compute the new round of model parameters:

$$\mu_k =\frac{\sum_{j=1}^N \hat{\gamma}_{jk}\,x_j}{\sum_{j=1}^N \hat{\gamma}_{jk}} \qquad \sigma_k^2 = \frac{\sum_{j=1}^N \hat{\gamma}_{jk}(x_j-\mu_k)^2}{\sum_{j=1}^N \hat{\gamma}_{jk}} \qquad \alpha_k = \frac{\sum_{j=1}^N \hat{\gamma}_{jk}}{N}$$

4) Repeat steps 2 and 3 until convergence.
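Putting the E-step and M-step together gives a complete, if minimal, EM loop for a univariate Gaussian mixture. This is a sketch under stated assumptions: the quantile-based initialization and the mean-change stopping rule are choices made here, not part of the algorithm statement above, and no variance floor is applied.

```python
import numpy as np

def gmm_em(x, K, n_iter=200):
    """EM for a univariate Gaussian mixture, following the closed-form
    updates derived above (minimal sketch)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    alpha = np.full(K, 1.0 / K)
    # spread-out initial means via quantiles (a simple, deterministic choice)
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)
    sigma2 = np.full(K, x.var())
    for _ in range(n_iter):
        # E-step: responsibilities gamma_hat_jk, rows normalized over k
        phi = (np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2))
               / np.sqrt(2 * np.pi * sigma2))
        g = alpha * phi
        g /= g.sum(axis=1, keepdims=True)
        nk = g.sum(axis=0)                          # effective counts E(n_k)
        # M-step: closed-form updates for mu_k, sigma_k^2, alpha_k
        new_mu = (g * x[:, None]).sum(axis=0) / nk
        sigma2 = (g * (x[:, None] - new_mu) ** 2).sum(axis=0) / nk
        alpha = nk / N
        if np.allclose(new_mu, mu, atol=1e-8):
            mu = new_mu
            break
        mu = new_mu
    return alpha, mu, sigma2
```

On data drawn from two well-separated Gaussians, the loop recovers means, variances, and weights close to the generating values within a few dozen iterations.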