Q函数
Q ( θ , θ ( i ) ) = E Z [ log P ( Y , Z ∣ θ ) ∣ Y , θ ( i ) ] = ∑ Z log P ( Y , Z ∣ θ ) ⋅ P ( Z ∣ Y , θ ( i ) ) \begin{aligned} Q\left(\theta, \theta^{(i)}\right) & =E_Z \left[\log P(Y, Z \mid \theta) \mid Y, \theta^{(i)}\right] \\ & =\sum_Z \log P(Y, Z \mid \theta) \cdot P\left(Z \mid Y, \theta^{(i)}\right) \end{aligned} Q(θ,θ(i))=EZ[logP(Y,Z∣θ)∣Y,θ(i)]=Z∑logP(Y,Z∣θ)⋅P(Z∣Y,θ(i))
Q函数是EM算法中的一个重要函数,全称为“期望完全数据对数似然函数”。它的作用是在E步中计算出完全数据的对数似然函数的期望值,以便在M步中求出模型参数的最大似然估计值。
在之前的一篇文章(EM算法求解三硬币模型参数推导)中,为大家介绍了李航教授《统计学习方法》中求解三硬币模型的参数推导过程,其中使用的EM算法是从一个Q函数直接展开求解的,限于篇幅,文章并未展示证明过程,本篇文章作为上一篇文章以及《统计学习方法-第九章-179页》推导的补充,详细推导Q函数的由来。
Q函数推导证明
我们已知关于参数 θ \theta θ的似然函数
L ( θ ) = log P ( Y ∣ θ ) = log P ( Y , θ ) P ( θ ) = log ∑ Z P ( Y , θ , Z ) P ( θ ) = log ∑ Z P ( Y , θ , Z ) P ( θ ) = log ∑ Z P ( Y , Z ∣ θ ) = log ∑ Z P ( Y , Z , θ ) P ( Z , θ ) ⋅ P ( Z , θ ) P ( θ ) L(\theta)=\log P(Y \mid \theta) \\ =\log \frac{P(Y, \theta)}{ P(\theta)} \\=\log \frac{ \sum_Z P(Y, \theta,Z)}{ P(\theta)} \\=\log \sum_Z \frac{ P(Y, \theta,Z)}{ P(\theta)}=\log \sum_Z P(Y, Z \mid \theta) \\=\log \sum_Z \frac{ P(Y, Z, \theta)}{ P(Z,\theta) }\cdot \frac{ P(Z, \theta)}{ P(\theta) } L(θ)=logP(Y∣θ)=logP(θ)P(Y,θ)=logP(θ)∑ZP(Y,θ,Z)=logZ∑P(θ)P(Y,θ,Z)=logZ∑P(Y,Z∣θ)=logZ∑P(Z,θ)P(Y,Z,θ)⋅P(θ)P(Z,θ)
即
L ( θ ) = log ∑ Z P ( Y ∣ Z , θ ) ⋅ P ( Z ∣ θ ) L(\theta)=\log \sum_Z P(Y \mid Z, \theta) \cdot P(Z \mid \theta) L(θ)=logZ∑P(Y∣Z,θ)⋅P(Z∣θ)
假设第i次参数取 θ ( i ) \theta^{(i)} θ(i) ,我们希望优化后 L ( θ ) > L ( θ ( i ) ) L(\theta)>L(\theta^{(i)}) L(θ)>L(θ(i))
于是可以作差
即
L ( θ ) − L ( θ ( i ) ) = log Σ Z P ( Y ∣ Z , θ ) ⋅ P ( Z ∣ θ ) − log P ( Y ∣ θ ( i ) ) L(\theta)-L\left(\theta^{(i)}\right)=\log \Sigma_Z P(Y \mid Z, \theta) \cdot P(Z \mid \theta)-\log P\left(Y \mid \theta^{(i)}\right) L(θ)−L(θ(i))=logΣZP(Y∣Z,θ)⋅P(Z∣θ)−logP(Y∣θ(i))
第一项可以凑一个分式出来
L ( θ ) − L ( θ ( i ) ) = log ( Σ Z P ( Z ∣ Y , θ ( i ) ) ⋅ P ( Y ∣ Z , θ ) ⋅ P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) ) − log P ( Y ∣ θ ( i ) ) L(\theta)-L\left(\theta^{(i)}\right)=\log \left(\Sigma_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right)}\right)-\log P\left(Y \mid \theta^{(i)}\right) L(θ)−L(θ(i))=log(ΣZP(Z∣Y,θ(i))⋅P(Z∣Y,θ(i))P(Y∣Z,θ)⋅P(Z∣θ))−logP(Y∣θ(i))
利用 ∑ Z P ( Z ∣ Y , θ ( i ) ) = 1 \sum_Z P \left(Z \mid Y, \theta^{(i)}\right)=1 ∑ZP(Z∣Y,θ(i))=1的特性,第二项乘以这一串,可以得到
L ( θ ) − L ( θ ( i ) ) = log ( Σ Z P ( Z ∣ Y , θ ( i ) ) ⋅ P ( Y ∣ Z , θ ) ⋅ P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) ) − log P ( Y ∣ θ ( i ) ) ⋅ Σ Z P ( Z ∣ Y , θ ( i ) ) L(\theta)-L\left(\theta^{(i)}\right)=\log \left(\Sigma_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right)}\right)-\log P\left(Y \mid \theta^{(i)}\right) \cdot \Sigma_Z P\left(Z \mid Y, \theta^{(i)}\right) L(θ)−L(θ(i))=log(ΣZP(Z∣Y,θ(i))⋅P(Z∣Y,θ(i))P(Y∣Z,θ)⋅P(Z∣θ))−logP(Y∣θ(i))⋅ΣZP(Z∣Y,θ(i))
利用 J e n s e n Jensen Jensen不等式
l o g ∑ j λ j ⋅ y j ⩾ ∑ j λ j ⋅ l o g y j log\sum_{j}\lambda_j \cdot y_j \geqslant \sum_j \lambda_j \cdot log y_j log∑jλj⋅yj⩾∑jλj⋅logyj,其中 λ ⩾ 0 , ∑ j λ j = 1 \lambda \geqslant 0,\sum_j \lambda_j =1 λ⩾0,∑jλj=1
可知
⩾ ∑ Z P ( Z ∣ Y , θ ( i ) ) ⋅ log P ( Y ∣ Z , θ ) ⋅ P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) − log P ( Y ∣ θ ( i ) ) ⋅ Σ Z P ( Z ∣ Y , θ ( i ) ) \geqslant \sum_Z P \left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right)}-\log P\left(Y \mid \theta^{(i)}\right) \cdot \Sigma_Z P\left(Z \mid Y, \theta^{(i)}\right) ⩾Z∑P(Z∣Y,θ(i))⋅logP(Z∣Y,θ(i))P(Y∣Z,θ)⋅P(Z∣θ)−logP(Y∣θ(i))⋅ΣZP(Z∣Y,θ(i))
= ∑ Z P ( Z ∣ Y , θ ( i ) ) ⋅ [ log P ( Y ∣ Z , θ ) ⋅ P ( Z ∣ θ ) p ( Z ∣ Y , θ ( i ) ) − log P ( Y ∣ θ ( i ) ) ] =\sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot\left[\log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{p\left(Z \mid Y, \theta^{(i)}\right)}-\log P\left(Y \mid \theta^{(i)}\right)\right] =Z∑P(Z∣Y,θ(i))⋅[logp(Z∣Y,θ(i))P(Y∣Z,θ)⋅P(Z∣θ)−logP(Y∣θ(i))]
= Σ Z P ( Z ∣ Y , θ ( i ) ) ⋅ log P ( Y ∣ Z , θ ) ⋅ P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) ⋅ P ( Y ∣ θ ( i ) ) =\Sigma_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{ P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)} \right) \cdot P\left(Y \mid \theta^{(i)} \right) } =ΣZP(Z∣Y,θ(i))⋅logP(Z∣Y,θ(i))⋅P(Y∣θ(i))P(Y∣Z,θ)⋅P(Z∣θ)
即此时 L ( θ ) − L ( θ ( i ) ) ⩾ Σ Z P ( Z ∣ Y , θ ( i ) ) ⋅ log P ( Y ∣ Z , θ ) ⋅ P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) ⋅ P ( Y ∣ θ ( i ) ) L(\theta)-L\left(\theta^{(i)}\right) \geqslant \Sigma_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{ P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)} \right) \cdot P\left(Y \mid \theta^{(i)} \right) } L(θ)−L(θ(i))⩾ΣZP(Z∣Y,θ(i))⋅logP(Z∣Y,θ(i))⋅P(Y∣θ(i))P(Y∣Z,θ)⋅P(Z∣θ)
即 L ( θ ) ⩾ L ( θ ( i ) ) + Σ Z P ( Z ∣ Y , θ ( i ) ) ⋅ log P ( Y ∣ Z , θ ) ⋅ P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) ⋅ P ( Y ∣ θ ( i ) ) L(\theta) \geqslant L\left(\theta^{(i)}\right)+\Sigma_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{ P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)} \right) \cdot P\left(Y \mid \theta^{(i)} \right) } L(θ)⩾L(θ(i))+ΣZP(Z∣Y,θ(i))⋅logP(Z∣Y,θ(i))⋅P(Y∣θ(i))P(Y∣Z,θ)⋅P(Z∣θ)
令 B ( θ , θ ( i ) ) = L ( θ ( i ) ) + Σ Z P ( Z ∣ Y , θ ( i ) ) ⋅ log P ( Y ∣ Z , θ ) ⋅ P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) ⋅ P ( Y ∣ θ ( i ) ) B\left(\theta, \theta^{(i)}\right)=L\left(\theta^{(i)}\right)+\Sigma_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{ P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)} \right) \cdot P\left(Y \mid \theta^{(i)} \right) } B(θ,θ(i))=L(θ(i))+ΣZP(Z∣Y,θ(i))⋅logP(Z∣Y,θ(i))⋅P(Y∣θ(i))P(Y∣Z,θ)⋅P(Z∣θ)
此时 B ( θ , θ ( i ) ) B\left(\theta, \theta^{(i)}\right) B(θ,θ(i)) 是 L ( θ ) L(\theta) L(θ) 的下界,使 B ( θ , θ ( i ) ) B\left(\theta, \theta^{(i)}\right) B(θ,θ(i)) 最大化的 θ \theta θ 也可使 L ( θ ) L\left( \theta\right) L(θ) 最大
于是我们的目标是 θ ( i + 1 ) = argmax θ B ( θ , θ ( i ) ) \theta^{(i+1)}=\underset{\theta}{\operatorname{argmax}} B\left(\theta, \theta^{(i)}\right) θ(i+1)=θargmaxB(θ,θ(i))
也即
θ ( i + 1 ) = argmax θ [ L ( θ ( i ) ) + Σ Z P ( Z ∣ Y , θ ( i ) ) ⋅ log P ( Y ∣ Z , θ ) ⋅ P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) ⋅ P ( Y ∣ θ ( i ) ) ] \theta^{(i+1)}=\underset{\theta}{\operatorname{argmax}} \left[L\left(\theta^{(i)}\right)+\Sigma_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{\left.P(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)}\right] θ(i+1)=θargmax[L(θ(i))+ΣZP(Z∣Y,θ(i))⋅logP(Z∣Y,θ(i))⋅P(Y∣θ(i))P(Y∣Z,θ)⋅P(Z∣θ)]
可把 L ( θ ( i ) ) 、 P ( Z ∣ Y , θ ( i ) ) 、 P ( Z ∣ Y , θ ( i ) ) ⋅ P ( Y ∣ θ ( i ) ) L\left( \theta^{(i)}\right)、P\left(Z \mid Y, \theta^{(i)}\right)、 P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right) L(θ(i))、P(Z∣Y,θ(i))、P(Z∣Y,θ(i))⋅P(Y∣θ(i)) 三项视为常数
且已知 P ( Z ∣ Y , θ ( i ) ) ⋅ P ( Y ∣ θ ( i ) ) > 0 P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)>0 P(Z∣Y,θ(i))⋅P(Y∣θ(i))>0 ,这一项从分母去掉,不影响求最大值,注意这里的 P ( Z ∣ Y , θ ( i ) ) \left.P(Z \mid Y, \theta^{(i)}\right) P(Z∣Y,θ(i)) 不能省略,因为它是 ∑ \sum ∑ 后面中的每一项的系数
于是
θ ( i + 1 ) = argmax θ [ Σ Z P ( Z ∣ Y , θ ( i ) ) ⋅ log P ( Y ∣ Z , θ ) ⋅ P ( Z ∣ θ ) ] \theta^{(i+1)}=\underset{\theta}{\operatorname{argmax}}\left[\Sigma_Z P\left(Z \mid Y,{ \theta}^{(i)}\right) \cdot \log P(Y \mid Z, \theta) \cdot P(Z \mid \theta)\right] θ(i+1)=θargmax[ΣZP(Z∣Y,θ(i))⋅logP(Y∣Z,θ)⋅P(Z∣θ)]
我们令 Q ( θ , θ ( i ) ) = Σ Z P ( Z ∣ Y , θ ( i ) ) ⋅ log P ( Y ∣ Z , θ ) ⋅ P ( Z ∣ θ ) Q\left(\theta, \theta^{(i)}\right)=\Sigma_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log P(Y \mid Z, \theta) \cdot P(Z \mid \theta) Q(θ,θ(i))=ΣZP(Z∣Y,θ(i))⋅logP(Y∣Z,θ)⋅P(Z∣θ)
即
θ ( i + 1 ) = argmax θ Q ( θ , θ ( i + 1 ) ) \theta^{(i+1)}=\underset{\theta}{\operatorname{argmax}} Q\left(\theta, \theta^{(i+1)}\right) θ(i+1)=θargmaxQ(θ,θ(i+1))
其中 Q ( θ , θ ( i ) ) Q\left(\theta, \theta^{(i)}\right) Q(θ,θ(i)) 就是所谓的 Q Q Q 函数
参考资料
[1].EM算法求解三硬币模型参数推导