Machine Learning Fundamentals: Gaussian Discriminant Analysis

Gaussian Discriminant Analysis

The full name is Gaussian Discriminant Analysis (GDA). Don't be misled by the word "discriminant" in the name: GDA is a probabilistic generative model.

Principle

We model the joint probability $P(x, y)$, assuming

$$y \sim \mathrm{Bernoulli}(\Phi), \qquad x \mid y = 1 \sim N(\mu_1, \Sigma), \qquad x \mid y = 0 \sim N(\mu_0, \Sigma)$$

so the two classes share one covariance matrix $\Sigma$ and differ only in their means.
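Prediction follows from Bayes' rule, a direct consequence of the assumptions above:

$$P(y = 1 \mid x) = \frac{N(x; \mu_1, \Sigma)\,\Phi}{N(x; \mu_1, \Sigma)\,\Phi + N(x; \mu_0, \Sigma)\,(1-\Phi)}$$

and we predict class 1 whenever this posterior exceeds $1/2$. Because the two classes share $\Sigma$, the posterior reduces to a logistic function of a linear score in $x$, so GDA yields a linear decision boundary.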

Input

Training set $D = \{(x_1, y_1), \dots, (x_M, y_M)\}$, with $x_i \in \mathcal{X} \subseteq \mathbb{R}^n$ and $y_i \in \mathcal{Y}$. For binary classification we encode the labels as $y_i \in \{0, 1\}$; the derivation below relies on this 0/1 encoding.

$X_k$, $\mu_k$, and $\Sigma_k$ denote, respectively, the set of examples, the mean vector, and the covariance matrix of class $k$.

Output

The probability of each class; for binary classification, $P(y=1 \mid x)$ and $P(y=0 \mid x)$.

Loss Function

Let $\theta = (\mu_0, \mu_1, \Sigma, \Phi)$. We compute the log-likelihood, which we will maximize:
$$
\begin{aligned}
L(\theta) &= \sum_{i=1}^M \log\big[P(x_i \mid y_i)\,P(y_i)\big] \\
&= \sum_{i=1}^M \Big( \log P(x_i \mid y_i) + \log P(y_i) \Big) \\
&= \sum_{i=1}^M \Big( \log\big[N(\mu_1, \Sigma)^{y_i}\, N(\mu_0, \Sigma)^{1-y_i}\big] + \log\big[\Phi^{y_i} (1-\Phi)^{1-y_i}\big] \Big) \\
&= \sum_{i=1}^M \Big( \log\big[N(\mu_1, \Sigma)^{y_i}\big] + \log\big[N(\mu_0, \Sigma)^{1-y_i}\big] + \log\big[\Phi^{y_i} (1-\Phi)^{1-y_i}\big] \Big)
\end{aligned}
$$

Here $N(\mu_k, \Sigma)$ is shorthand for the Gaussian density $N(x_i; \mu_k, \Sigma)$ evaluated at $x_i$.
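As a sanity check, the log-likelihood above is easy to evaluate numerically. The sketch below is our own illustration (the function and variable names are assumptions, not code from this post), using `scipy.stats.multivariate_normal` for the Gaussian densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, y, mu0, mu1, Sigma, phi):
    """L(theta) = sum_i [log P(x_i | y_i) + log P(y_i)] for 0/1 labels y."""
    log_px = np.where(
        y == 1,
        multivariate_normal.logpdf(X, mean=mu1, cov=Sigma),
        multivariate_normal.logpdf(X, mean=mu0, cov=Sigma),
    )
    log_py = y * np.log(phi) + (1 - y) * np.log(1 - phi)
    return float(np.sum(log_px + log_py))
```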
Φ \Phi Φ进行求偏导,
$$
\begin{aligned}
\frac{\partial L(\theta)}{\partial \Phi} &= \sum_{i=1}^M \frac{y_i}{\Phi} - \frac{1-y_i}{1-\Phi} = 0 \\
\Rightarrow\quad & \sum_{i=1}^M \big[ (1-\Phi)\,y_i - \Phi\,(1-y_i) \big] = 0 \\
\Rightarrow\quad & \sum_{i=1}^M \big[ y_i - \Phi \big] = 0 \\
\Rightarrow\quad & \Phi^* = \frac{1}{M} \sum_{i=1}^M y_i
\end{aligned}
$$
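The closed form says $\Phi^*$ is just the fraction of positive examples. In NumPy, assuming a 0/1 label vector `y`:

```python
phi = y.mean()  # MLE of the Bernoulli parameter: fraction of examples with y_i == 1
```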
μ 1 \mu_1 μ1进行求偏导,

$$
\begin{aligned}
\sum_{i=1}^M \log\big[N(\mu_1, \Sigma)^{y_i}\big] &= \sum_{i=1}^M y_i \log\left[ \frac{\exp\!\big(-\frac{1}{2}(x_i-\mu_1)^T \Sigma^{-1} (x_i-\mu_1)\big)}{(2\pi)^{n/2}\, |\Sigma|^{1/2}} \right] \\
\mu_1^* &= \arg\max_{\mu_1} \sum_{i=1}^M y_i \Big( -\tfrac{1}{2} (x_i-\mu_1)^T \Sigma^{-1} (x_i-\mu_1) \Big) \\
&= \arg\max_{\mu_1} -\tfrac{1}{2} \sum_{i=1}^M y_i\, (x_i^T \Sigma^{-1} - \mu_1^T \Sigma^{-1})(x_i - \mu_1) \\
\frac{\partial L(\theta)}{\partial \mu_1} &= \sum_{i=1}^M y_i\, \Sigma^{-1} (x_i - \mu_1) = 0 \\
\Rightarrow\quad \mu_1^* &= \frac{\sum_{i=1}^M y_i\, x_i}{\sum_{i=1}^M y_i}
\end{aligned}
$$
Similarly, we can derive
$$\mu_0^* = \frac{\sum_{i=1}^M (1-y_i)\, x_i}{\sum_{i=1}^M (1-y_i)}$$
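Both mean estimates are per-class sample averages. A minimal NumPy sketch, with `X` of shape (M, n) and `y` as above:

```python
mu1 = X[y == 1].mean(axis=0)  # mu1*: average of the class-1 examples
mu0 = X[y == 0].mean(axis=0)  # mu0*: average of the class-0 examples
```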
S S S记录矩阵的方差,我们先对 L ( θ ) L(\theta) L(θ)与之相关的部分进行化简,
$$
\begin{aligned}
L(\theta) &= \sum_{i=1}^M \log\big[N(\mu_1, \Sigma)^{y_i}\big] + \log\big[N(\mu_0, \Sigma)^{1-y_i}\big] + \log\big[\Phi^{y_i}(1-\Phi)^{1-y_i}\big] \\
&= \sum_{i:\,y_i=1} \log N(\mu_1, \Sigma) + \sum_{i:\,y_i=0} \log N(\mu_0, \Sigma) + \sum_{i=1}^M \log\big[\Phi^{y_i}(1-\Phi)^{1-y_i}\big]
\end{aligned}
$$

For a generic class with mean $\mu$ and $M$ examples (the same calculation applies to each class with its own count):

$$
\begin{aligned}
\sum_{i=1}^M \log N(\mu, \Sigma) &= \sum_{i=1}^M \log\left[ \frac{\exp\!\big(-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\big)}{(2\pi)^{n/2}\, |\Sigma|^{1/2}} \right] \\
&= \sum_{i=1}^M \Big( C - \tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \Big) \\
&= MC - \tfrac{M}{2}\log|\Sigma| - \tfrac{1}{2}\sum_{i=1}^M \mathrm{tr}\big[(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\big] \\
&= MC - \tfrac{M}{2}\log|\Sigma| - \tfrac{1}{2}\sum_{i=1}^M \mathrm{tr}\big[(x_i-\mu)(x_i-\mu)^T\, \Sigma^{-1}\big] \\
&= MC - \tfrac{M}{2}\log|\Sigma| - \tfrac{1}{2}\,\mathrm{tr}\Big[\sum_{i=1}^M (x_i-\mu)(x_i-\mu)^T\, \Sigma^{-1}\Big] \\
&= MC - \tfrac{M}{2}\log|\Sigma| - \tfrac{M}{2}\,\mathrm{tr}\big[S\, \Sigma^{-1}\big]
\end{aligned}
$$

where $C = -\frac{n}{2}\log(2\pi)$ is a constant, the quadratic form (a scalar) equals its own trace, the trace is invariant under cyclic permutation, and $S = \frac{1}{M}\sum_{i=1}^M (x_i-\mu)(x_i-\mu)^T$.
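The trace step above, $(x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \mathrm{tr}[(x_i-\mu)(x_i-\mu)^T \Sigma^{-1}]$, is easy to verify numerically. A quick self-contained check (random data; the names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.standard_normal(3)                  # plays the role of x_i - mu
A = rng.standard_normal((3, 3))
Sigma_inv = A @ A.T + np.eye(3)             # symmetric matrix standing in for Sigma^{-1}

quad = d @ Sigma_inv @ d                    # scalar quadratic form
tr = np.trace(np.outer(d, d) @ Sigma_inv)   # same value via the trace identity
assert np.isclose(quad, tr)
```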

M k M_k Mk表示各类数据的个数,对 Σ \Sigma Σ进行求偏导,
$$
\begin{aligned}
L(\theta) &= -\frac{M_1}{2}\big[\log|\Sigma| + \mathrm{tr}(S_1\, \Sigma^{-1})\big] - \frac{M_0}{2}\big[\log|\Sigma| + \mathrm{tr}(S_0\, \Sigma^{-1})\big] + \text{const} \\
&= -\frac{M}{2}\log|\Sigma| - \frac{M_1}{2}\mathrm{tr}(S_1\, \Sigma^{-1}) - \frac{M_0}{2}\mathrm{tr}(S_0\, \Sigma^{-1}) + \text{const}
\end{aligned}
$$

Using $\frac{\partial \log|\Sigma|}{\partial \Sigma} = \Sigma^{-1}$ and $\frac{\partial\, \mathrm{tr}(S\,\Sigma^{-1})}{\partial \Sigma} = -\Sigma^{-1} S\, \Sigma^{-1}$ for symmetric $S$ and $\Sigma$:

$$
\begin{aligned}
\frac{\partial L(\theta)}{\partial \Sigma} &= -\frac{1}{2}\big[M\,\Sigma^{-1} - M_0\,\Sigma^{-1} S_0\, \Sigma^{-1} - M_1\,\Sigma^{-1} S_1\, \Sigma^{-1}\big] = 0 \\
\Rightarrow\quad & M\,\Sigma = M_0 S_0 + M_1 S_1 \\
\Rightarrow\quad & \Sigma^* = \frac{M_0 S_0 + M_1 S_1}{M}
\end{aligned}
$$
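Putting the four closed-form estimators together gives a complete fit/predict procedure. The class below is a minimal sketch of our own (the class and method names are assumptions, not the original author's code); it implements exactly the formulas derived above:

```python
import numpy as np
from scipy.stats import multivariate_normal

class GDA:
    """Binary Gaussian Discriminant Analysis with a shared covariance; labels in {0, 1}."""

    def fit(self, X, y):
        M = len(y)
        self.phi = y.mean()                # Phi* = (1/M) sum_i y_i
        self.mu1 = X[y == 1].mean(axis=0)  # mu1*: class-1 mean
        self.mu0 = X[y == 0].mean(axis=0)  # mu0*: class-0 mean
        d1 = X[y == 1] - self.mu1
        d0 = X[y == 0] - self.mu0
        # Sigma* = (M0*S0 + M1*S1) / M, the pooled within-class covariance
        self.Sigma = (d0.T @ d0 + d1.T @ d1) / M
        return self

    def predict_proba(self, X):
        # Bayes' rule: P(y=1|x) is proportional to N(x; mu1, Sigma) * Phi
        p1 = multivariate_normal.pdf(X, mean=self.mu1, cov=self.Sigma) * self.phi
        p0 = multivariate_normal.pdf(X, mean=self.mu0, cov=self.Sigma) * (1 - self.phi)
        return p1 / (p1 + p0)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)
```

A quick smoke test could be `GDA().fit(X, y).predict(X)` on any two-class dataset with continuous features.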

Applicable Scenarios

Continuous features whose class-conditional distributions are approximately Gaussian, per the model assumption above.

Advantages

1. Relatively robust.

Disadvantages

1. Requires the data to follow the assumed Gaussian distributions (a strict assumption).
2. Has relatively many parameters, so computation is comparatively complex.


Reposted from blog.csdn.net/qq_40136685/article/details/108746735