前面已经求得的 L ( t ∣ σ 2 , α ) = − 1 2 log ∣ σ 2 I + Φ Λ α Φ T ∣ − 1 2 t T ( σ 2 I + Φ Λ α Φ T ) − 1 t \boxed{ \begin{aligned} \mathcal L( \mathbf t \vert \sigma^2, \boldsymbol \alpha) = -\frac{1}{2} \log\Big\vert \sigma^2 \mathbf{I}+\mathbf{\Phi}\mathbf{\Lambda}_{\alpha}\mathbf{\Phi}^{\rm T} \Big\vert -\frac{1}{2} \mathbf{t}^{\rm T} \Big( \sigma^2 \mathbf{I}+\mathbf{\Phi}\mathbf{\Lambda}_{\alpha}\mathbf{\Phi}^{\rm T} \Big)^{-1} \mathbf{t} \end{aligned} } L(t∣σ2,α)=−21log∣∣∣σ2I+ΦΛαΦT∣∣∣−21tT(σ2I+ΦΛαΦT)−1t
Σ w − 1 = 1 σ 2 Φ T Φ + Λ α − 1 μ w = σ − 2 Σ w Φ T t \boxed{ \begin{aligned} \mathbf \Sigma_w^{-1} &= \frac{1}{\sigma^2}\mathbf{\Phi}^{\rm T}\mathbf{\Phi}+\mathbf{\Lambda}_{\alpha}^{-1} \\ \bm\mu_w &= \sigma^{-2} \mathbf \Sigma_w \mathbf{\Phi}^{\rm T}\mathbf t \end{aligned} } Σw−1μw=σ21ΦTΦ+Λα−1=σ−2ΣwΦTt
稀疏贝叶斯学习过程如下: α i new = γ i μ i 2 ( σ 2 ) new = ∥ t − Φ μ w ∥ 2 N − ∑ i γ i γ i = 1 − [ Σ w ] i , i ⋅ α i \begin{aligned} \alpha_i^{\text{new}} &= \frac{\gamma_i}{\mu_i^2} \\ (\sigma^2)^{\text{new}} &= \frac{\Vert \mathbf t - \mathbf{\Phi}\bm\mu_w\Vert^2}{N-\sum_i \gamma_i} \\ \gamma_i &= 1- [\mathbf\Sigma_w]_{i,i} \cdot \alpha_i \end{aligned} αinew(σ2)newγi=μi2γi=N−∑iγi∥t−Φμw∥2=1−[Σw]i,i⋅αi 实际操作中, α \bm \alpha α 中大部分元素将变得很大,这在迭代更新过程中会引入较大的运算量,如果仅仅将无限大的元素停止更新不删除,导致的(ill-conditioned) Σ \bf \Sigma Σ 矩阵来更新 α \bm \alpha α 将毫无意义。
上一节中得到的对数,为 L ( t ∣ σ 2 , α ) = − N 2 log ( 2 π ) − 1 2 log ∣ Σ t ∣ − 1 2 t T Σ t − 1 t \begin{aligned} \mathcal L( \mathbf t \vert \sigma^2, \boldsymbol \alpha) = {-\frac{N}{2}} \log (2\pi)-\frac{1}{2} \log\Big\vert \mathbf{\Sigma}_t \Big\vert -\frac{1}{2} \mathbf{t}^{\rm T} \mathbf{\Sigma}^{-1}_t \mathbf{t} \end{aligned} L(t∣σ2,α)=−2Nlog(2π)−21log∣∣∣Σt∣∣∣−21tTΣt−1t 其中 Σ t = σ 2 I + Φ Λ α Φ T \mathbf{\Sigma}_t=\sigma^2 \mathbf{I}+\mathbf{\Phi}\mathbf{\Lambda}_{\alpha}\mathbf{\Phi}^{\rm T} Σt=σ2I+ΦΛαΦT,令 C = Σ t \mathbf C=\mathbf{\Sigma}_t C=Σt,以及 A − 1 = Λ α \mathbf A^{-1} = \mathbf{\Lambda}_{\alpha} A−1=Λα。
重新整理得到 L ( t ∣ σ 2 , α ) = − 1 2 [ N log ( 2 π ) + log ∣ C ∣ + t T C − 1 t ] \boxed{ \begin{aligned} \mathcal L( \mathbf t \vert \sigma^2, \boldsymbol \alpha) = -\frac{1}{2} \bigg[ N\log (2\pi)+ \log\Big\vert \mathbf C \Big\vert + \mathbf{t}^{\rm T} \mathbf{C}^{-1} \mathbf{t} \bigg] \end{aligned} } L(t∣σ2,α)=−21[Nlog(2π)+log∣∣∣C∣∣∣+tTC−1t] 其中, C = σ 2 I + Φ A − 1 Φ T \mathbf C=\sigma^2 \mathbf{I}+\mathbf{\Phi}\mathbf A^{-1}\mathbf{\Phi}^{\rm T} C=σ2I+ΦA−1ΦT。
为了减小运算量,我们做以下分解:
C = σ 2 I + Φ A − 1 Φ T = σ 2 I + ∑ m ≠ i α m − 1 ϕ m ϕ m T + α i − 1 ϕ i ϕ i T = C − i + α i − 1 ϕ i ϕ i T \begin{aligned} \mathbf C& = \sigma^2 \mathbf{I}+\mathbf{\Phi}\mathbf A^{-1}\mathbf{\Phi}^{\rm T} \\ &= \sigma^2 \mathbf{I}+\sum_{m\neq i}\alpha_m^{-1}{\bm \phi_m} \bm \phi_m^{\rm T}+ \alpha_i^{-1}{\bm \phi_i} \bm \phi_i^{\rm T}\\ &=\mathbf C_{-i}+\alpha_i^{-1}{\bm \phi_i} \bm \phi_i^{\rm T} \end{aligned} C=σ2I+ΦA−1ΦT=σ2I+m=i∑αm−1ϕmϕmT+αi−1ϕiϕiT=C−i+αi−1ϕiϕiT 其中 C − i \mathbf C_{-i} C−i 为 C \mathbf C C 去掉基向量 i i i 贡献后的结果。给定 L ( t ∣ σ 2 , α ) \mathcal L( \mathbf t \vert \sigma^2, \boldsymbol \alpha) L(t∣σ2,α) 中的行列式计算(大名鼎鼎的 The Matrix Determinant Lemma 定理来啦)为 ∣ C ∣ = ∣ C − i + α i − 1 ϕ i ϕ i T ∣ = ∣ C − i ∣ ⋅ ∣ 1 + α i − 1 ϕ i T C − i − 1 ϕ i ∣ \begin{aligned} \Big\vert \mathbf C \Big\vert &= \Big\vert \mathbf C_{-i}+\alpha_i^{-1}{\bm \phi_i} \bm \phi_i^{\rm T} \Big\vert \\ &= \Big\vert \mathbf C_{-i}\Big\vert \cdot \Big\vert 1 +\alpha_i^{-1} \bm \phi_i^{\rm T } \mathbf C_{-i}^{-1} {\bm \phi_i} \Big\vert \end{aligned} ∣∣∣C∣∣∣=∣∣∣C−i+αi−1ϕiϕiT∣∣∣=∣∣∣C−i∣∣∣⋅∣∣∣1+αi−1ϕiTC−i−1ϕi∣∣∣ 那么接下来去计算逆矩阵,利用 ( P + ρ a a H ) − 1 = P − 1 − ρ P − 1 a ( 1 + ρ a H P − 1 a ) − 1 a H P − 1 (\mathbf P+\rho \mathbf a\mathbf a^{H})^{-1} =\mathbf P^{-1} -\rho \mathbf P^{-1}\mathbf a ( 1+ \rho\mathbf a^{H} \mathbf P^{-1}\mathbf a )^{-1} \mathbf a^{H} \mathbf P^{-1} (P+ρaaH)−1=P−1−ρP−1a(1+ρaHP−1a)−1aHP−1 因此可以得到 C − 1 = ( C − i + α i − 1 ϕ i ϕ i T ) − 1 = C − i − 1 − α i − 1 C − i − 1 ϕ i ( 1 + α i − 1 ϕ i T C − i − 1 ϕ i ) − 1 ϕ i T C − i − 1 = C − i − 1 − C − i − 1 ϕ i ϕ i T C − i − 1 α i + ϕ i T C − i − 1 ϕ i \begin{aligned} \mathbf C^{-1} &= \bigg(\mathbf C_{-i}+\alpha_i^{-1}{\bm \phi_i} \bm \phi_i^{\rm T}\bigg)^{-1} \\ &=\mathbf C_{-i}^{-1}- \alpha_i^{-1}\mathbf C_{-i}^{-1} \bm \phi_i \bigg(1+\alpha_i^{-1}\bm \phi_i^{\rm T}\mathbf C_{-i}^{-1}\bm \phi_i \bigg)^{-1} \bm \phi_i^{\rm T}\mathbf C_{-i}^{-1} \\ &=\mathbf C_{-i}^{-1}- \frac{\mathbf C_{-i}^{-1} \bm \phi_i \bm \phi_i^{\rm T}\mathbf C_{-i}^{-1} }{\alpha_i+\bm \phi_i^{\rm T}\mathbf C_{-i}^{-1}\bm \phi_i} \end{aligned} C−1=(C−i+αi−1ϕiϕiT)−1=C−i−1−αi−1C−i−1ϕi(1+αi−1ϕiTC−i−1ϕi)−1ϕiTC−i−1=C−i−1−αi+ϕiTC−i−1ϕiC−i−1ϕiϕiTC−i−1 于是令 1 + α i − 1 ϕ i T C − i − 1 ϕ i = α i − 1 ( α i + ϕ i T C − i − 1 ϕ i ) 1 +\alpha_i^{-1} \bm \phi_i^{\rm T } \mathbf C_{-i}^{-1} {\bm \phi_i} =\alpha_i^{-1}(\alpha_i+\bm \phi_i^{\rm T } \mathbf C_{-i}^{-1} {\bm \phi_i} ) 1+αi−1ϕiTC−i−1ϕi=αi−1(αi+ϕiTC−i−1ϕi), L ( α ) \mathcal L( \boldsymbol \alpha) L(α) 可以写为 L ( α ) = − 1 2 [ N log ( 2 π ) + log ∣ C ∣ + t T C − 1 t ] = − 1 2 [ N log ( 2 π ) + log ∣ C − i ∣ + log ( 1 + α i − 1 ϕ i T C − i − 1 ϕ i ) + t T ( C − i − 1 − C − i − 1 ϕ i ϕ i T C − i − 1 α i + ϕ i T C − i − 1 ϕ i ) t ] = − 1 2 [ N log ( 2 π ) + log ∣ C − i ∣ − log α i + log ( α i + ϕ i T C − i − 1 ϕ i ) + t T C − i − 1 t − t T C − i − 1 ϕ i ϕ i T C − i − 1 t α i + ϕ i T C − i − 1 ϕ i ] = − 1 2 [ N log ( 2 π ) + log ∣ C − i ∣ − log α i + log ( α i + ϕ i T C − i − 1 ϕ i ) + t T C − i − 1 t − ( ϕ i T C − i − 1 t ) 2 α i + ϕ i T C − i − 1 ϕ i ] \begin{aligned} \mathcal L( \boldsymbol \alpha) &= -\frac{1}{2} \bigg[ N\log (2\pi)+ \log\Big\vert \mathbf C \Big\vert + \mathbf{t}^{\rm T} \mathbf{C}^{-1} \mathbf{t} \bigg] \\ &= -\frac{1}{2} \bigg[ N\log (2\pi)+ \log \Big\vert \mathbf C_{-i}\Big\vert + \log \Big( 1 +\alpha_i^{-1} \bm \phi_i^{\rm T } \mathbf C_{-i}^{-1} {\bm \phi_i} \Big)+ \mathbf{t}^{\rm T} \bigg(\mathbf C_{-i}^{-1}- \frac{\mathbf C_{-i}^{-1} \bm \phi_i \bm \phi_i^{\rm T}\mathbf C_{-i}^{-1} }{\alpha_i+\bm \phi_i^{\rm T}\mathbf C_{-i}^{-1}\bm \phi_i} \bigg) \mathbf{t} \bigg] \\ &= -\frac{1}{2} \bigg[ N\log (2\pi)+ \log \Big\vert \mathbf C_{-i}\Big\vert - \log\alpha_i+\log \Big( \alpha_i+\bm \phi_i^{\rm T } \mathbf C_{-i}^{-1} {\bm \phi_i} \Big) + \mathbf{t}^{\rm T} \mathbf C_{-i}^{-1}\mathbf{t}- \frac{\mathbf{t}^{\rm T} \mathbf C_{-i}^{-1} \bm \phi_i \bm \phi_i^{\rm T}\mathbf C_{-i}^{-1} \mathbf{t}}{\alpha_i+\bm \phi_i^{\rm T}\mathbf C_{-i}^{-1}\bm \phi_i } \bigg] \\ &= -\frac{1}{2} \bigg[ N\log (2\pi)+ \log \Big\vert \mathbf C_{-i}\Big\vert - \log\alpha_i+\log \Big( \alpha_i+\bm \phi_i^{\rm T } \mathbf C_{-i}^{-1} {\bm \phi_i} \Big) + \mathbf{t}^{\rm T} \mathbf C_{-i}^{-1}\mathbf{t} - \frac{(\bm \phi_i^{\rm T}\mathbf C_{-i}^{-1} \mathbf{t})^2}{\alpha_i+\bm \phi_i^{\rm T}\mathbf C_{-i}^{-1}\bm \phi_i } \bigg] \end{aligned} L(α)=−21[Nlog(2π)+log∣∣∣C∣∣∣+tTC−1t]=−21[Nlog(2π)+log∣∣∣C−i∣∣∣+log(1+αi−1ϕiTC−i−1ϕi)+tT(C−i−1−αi+ϕiTC−i−1ϕiC−i−1ϕiϕiTC−i−1)t]=−21[Nlog(2π)+log∣∣∣C−i∣∣∣−logαi+log(αi+ϕiTC−i−1ϕi)+tTC−i−1t−αi+ϕiTC−i−1ϕitTC−i−1ϕiϕiTC−i−1t]=−21[Nlog(2π)+log∣∣∣C−i∣∣∣−logαi+log(αi+ϕiTC−i−1ϕi)+tTC−i−1t−αi+ϕiTC−i−1ϕi(ϕiTC−i−1t)2] 我们令 s i = ϕ i T C − i − 1 ϕ i q i = ϕ i T C − i − 1 t \begin{aligned} s_i &= \bm \phi_i^{\rm T } \mathbf C_{-i}^{-1} {\bm \phi_i} \\ q_i &= \bm \phi_i^{\rm T}\mathbf C_{-i}^{-1} \mathbf{t} \end{aligned} siqi=ϕiTC−i−1ϕi=ϕiTC−i−1t 从而 L ( α ) \mathcal L( \boldsymbol \alpha) L(α) 可以写为 L ( α ) = − 1 2 [ N log ( 2 π ) + log ∣ C − i ∣ + t T C − i − 1 t − log α i + log ( α i + s i ) − q i 2 α i + s i ] = L ( α − i ) + 1 2 [ log α i − log ( α i + s i ) + q i 2 α i + s i ] = L ( α − i ) + ℓ ( α i ) \begin{aligned} \mathcal L( \boldsymbol \alpha) &= -\frac{1}{2} \bigg[ N\log (2\pi)+ \log \Big\vert \mathbf C_{-i}\Big\vert + \mathbf{t}^{\rm T} \mathbf C_{-i}^{-1}\mathbf{t} - \log\alpha_i+\log \Big( \alpha_i+s_i \Big) - \frac{q_i^2}{\alpha_i+ s_i } \bigg] \\ &=\mathcal L( \boldsymbol \alpha_{-i})+ \frac{1}{2} \bigg[ \log\alpha_i-\log \Big( \alpha_i+s_i \Big) + \frac{q_i^2}{\alpha_i+ s_i } \bigg] \\ &= \mathcal L( \boldsymbol \alpha_{-i})+ \ell(\alpha_i) \end{aligned} L(α)=−21[Nlog(2π)+log∣∣∣C−i∣∣∣+tTC−i−1t−logαi+log(αi+si)−αi+siqi2]=L(α−i)+21[logαi−log(αi+si)+αi+siqi2]=L(α−i)+ℓ(αi)
其中两个常数系数:
稀疏系数 s i s_i si 可以被视为衡量基向量 ϕ i {\bm \phi_i} ϕi 与模型中已经存在的 ϕ i {\bm \phi_i} ϕi 重叠的程度。质量系数 q i q_i qi 是 ϕ i {\bm \phi_i} ϕi 与不包括该向量的模型误差对齐的一种度量,是衡量 ϕ i {\bm \phi_i} ϕi 增加 L ( α ) \mathcal L( \boldsymbol \alpha) L(α) 的效果。上面的分析(求导,通分,令分子为 0,注意结果应该是合理的方差取值范围)表明 L ( α ) \mathcal L( \boldsymbol \alpha) L(α) 相对于 α i \alpha_i αi 具有唯一的最大值:
α i = s i 2 q i 2 − s i , q i 2 > s i α i = ∞ , q i 2 ≤ s i \begin{aligned} \alpha_i &= \frac{s_i^2}{q_i^2-s_i},\quad \quad q_i^2>s_i \\ \alpha_i &= \infty,\quad \quad\quad\quad q_i^2\leq s_i \end{aligned} αiαi=qi2−sisi2,qi2>si=∞,qi2≤si
通过迭代过程,计算所有的基向量 ϕ i \bm \phi_i ϕi 对应的 s i s_i si 和 q i q_i qi,更新参数向量 α \bm \alpha α,总结基于快速 RVM 的贝叶算法如下:
1) 对于稀疏回归问题,初始化 σ 2 \sigma^2 σ2。
2) 初始化一个基向量 ϕ i \bm \phi_i ϕi,根据 α i = s i 2 q i 2 − s i , q i 2 > s i \alpha_i = \frac{s_i^2}{q_i^2-s_i},\quad q_i^2>s_i αi=qi2−sisi2,qi2>si 这时候初始化的 C = σ 2 I + α i − 1 ϕ i ϕ i T \mathbf C =\sigma^2\mathbf I+ \alpha_i^{-1}{\bm \phi_i} \bm \phi_i^{\rm T} C=σ2I+αi−1ϕiϕiT,因此 C − i = σ 2 I → C − i − 1 = σ − 2 I \mathbf C_{-i}=\sigma^2\mathbf I \rightarrow \mathbf C_{-i}^{-1}=\sigma^{-2}\mathbf I C−i=σ2I→C−i−1=σ−2I。使超参数 α i = s i 2 q i 2 − s i = ∥ σ − 2 ϕ i T ϕ i ∥ 2 ∥ σ − 2 ϕ i T t ∥ 2 − σ − 2 ϕ i T ϕ i = ∥ ϕ i ∥ 2 ∥ ϕ i T t ∥ 2 / ∥ ϕ i ∥ 2 − σ 2 \begin{aligned} \alpha_i &= \frac{s_i^2}{q_i^2-s_i}=\frac{\Vert \sigma^{-2}\bm \phi_i^{\rm T } {\bm \phi_i} \Vert^2}{\Vert \sigma^{-2}\bm \phi_i^{\rm T } \mathbf{t} \Vert^2- \sigma^{-2}\bm \phi_i^{\rm T } {\bm \phi_i}} \\ &=\frac{\Vert {\bm \phi_i} \Vert^2}{\Vert \bm \phi_i^{\rm T } \mathbf{t} \Vert^2/\Vert \bm \phi_i \Vert^2- \sigma^{2}} \end{aligned} αi=qi2−sisi2=∥σ−2ϕiTt∥2−σ−2ϕiTϕi∥σ−2ϕiTϕi∥2=∥ϕiTt∥2/∥ϕi∥2−σ2∥ϕi∥2 其他 α m \alpha_m αm 理论上为无穷大。
3)显示计算初始化的均值 μ \bm \mu μ 和方差 Σ \mathbf \Sigma Σ(最初是标量): Σ = [ σ − 2 Φ T Φ + A ] − 1 μ = σ − 2 Σ Φ T t \begin{aligned} \mathbf \Sigma &= \Big[ {\sigma^{-2}}\mathbf{\Phi}^{\rm T}\mathbf{\Phi}+\mathbf{A}\Big] ^{-1}\\ \bm\mu &= \sigma^{-2} \mathbf \Sigma \mathbf{\Phi}^{\rm T}\mathbf t \end{aligned} Σμ=[σ−2ΦTΦ+A]−1=σ−2ΣΦTt 连同一起初始化其他基向量 ϕ m \bm \phi_m ϕm 以及对应的 s m s_m sm 和 q m q_m qm。
4)从具有 M M M 个基向量的全部集合中选择候选基 ϕ i \bm \phi_i ϕi。
5)计算 θ i = q i 2 − s i \theta_i =q_i^2-s_i θi=qi2−si:
如果 θ i > 0 \theta_i >0 θi>0 且 α i < ∞ \alpha_i < \infty αi<∞,更新估计 α i \alpha_i αi。如果 θ i > 0 \theta_i >0 θi>0 且 α i = ∞ \alpha_i = \infty αi=∞,添加基向量 ϕ i \bm \phi_i ϕi 并更新估计 α i \alpha_i αi。如果 θ i ≤ 0 \theta_i \leq 0 θi≤0 且 α i < ∞ \alpha_i < \infty αi<∞,删除基向量 ϕ i \bm \phi_i ϕi 并设置 α i = ∞ \alpha_i =\infty αi=∞。6) 估计噪声水平,更新其方差值 σ 2 = ∥ t − Φ μ ∥ 2 N − ∑ i γ i = ∥ t − Φ μ ∥ 2 N − M + ∑ m α m [ Σ ] m , m \sigma^2= \frac{\Vert \mathbf t - \mathbf{\Phi}\bm\mu\Vert^2}{N-\sum_i \gamma_i} =\frac{\Vert \mathbf t - \mathbf{\Phi}\bm\mu\Vert^2}{N-M+\sum_m \alpha_m [\mathbf\Sigma]_{m,m}} σ2=N−∑iγi∥t−Φμ∥2=N−M+∑mαm[Σ]m,m∥t−Φμ∥2
7)再次更新均值 μ \bm \mu μ 和方差 Σ \mathbf \Sigma Σ,及所有的 s m s_m sm 和 q m q_m qm。
8)判断是否收敛,收敛则算法结束,否则转到步骤(4)继续迭代。