Data Learning (3): Generative Learning Algorithms

Copyright: wang, https://blog.csdn.net/m0_37846020/article/details/83932054

Compiled from the author's lecture notes. Contact: [email protected]

Preview

  • Discriminative vs. generative models
  • Gaussian discriminant analysis
  • Naive Bayes

Two Learning Approaches

The task: classify input data $x$ into one of two classes, $y\in\{0,1\}$.

Discriminative learning algorithms

These algorithms learn the conditional probability $p(y|x)$, or learn a mapping from inputs directly to labels.
Examples: linear regression, logistic regression, k-nearest neighbors, ...

Generative learning algorithms

These algorithms model the joint probability $p(x,y)$.

  • Generative algorithms learn $p(x|y)$ and $p(y)$.
  • $p(y)$ is called the prior probability.
  • Bayes' rule then converts these into $p(y|x)$.

Bayes' Rule

p(y|x)=\frac{p(x|y)p(y)}{p(x)}
\arg\max_y p(y|x)=\arg\max_y\frac{p(x|y)p(y)}{p(x)}=\arg\max_y p(x|y)p(y)
Since $p(x)$ does not depend on $y$, there is no need to compute it. A toy check of this is sketched below.
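A toy numeric check (all numbers here are made up for illustration): the evidence $p(x)$ rescales every class score by the same factor, so dropping it cannot change which class wins.

```python
# Toy check that argmax_y p(y|x) = argmax_y p(x|y)p(y): the evidence p(x)
# divides every class score equally, so it cannot change the winner.
p_x_given_y = {0: 0.02, 1: 0.05}   # hypothetical class-conditional likelihoods p(x|y)
p_y         = {0: 0.70, 1: 0.30}   # hypothetical class priors p(y)

scores = {y: p_x_given_y[y] * p_y[y] for y in (0, 1)}   # unnormalized p(x|y)p(y)
p_x = sum(scores.values())                              # p(x) by total probability
posterior = {y: scores[y] / p_x for y in (0, 1)}        # full Bayes rule

assert max(scores, key=scores.get) == max(posterior, key=posterior.get)
print(posterior)  # {0: 0.483..., 1: 0.517...}
```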

1 Generative Models

Generative classification algorithms:

  • Continuous inputs: Gaussian discriminant analysis (GDA)
  • Discrete inputs: Naive Bayes

1.1 The Multivariate Gaussian Distribution $N(\mu,\Sigma)$

  • $\mu\in\mathbb{R}^n$ is the mean vector.
  • $\Sigma\in\mathbb{R}^{n\times n}$ is the covariance matrix; $\Sigma$ is symmetric and positive definite (SPD). The density is
    p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}
    E[X]=\mu,\quad \mathrm{Cov}(X)=E[(X-E[X])(X-E[X])^T]=\Sigma
    (A numeric sanity check of this density follows the list.)
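As a sanity check on the density formula, a minimal NumPy sketch (the function name `gaussian_pdf` is my own) that evaluates $p(x;\mu,\Sigma)$ and compares it against `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density p(x; mu, Sigma) directly."""
    n = mu.shape[0]
    diff = x - mu
    norm_const = (2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])   # symmetric positive definite
x = np.array([0.5, 0.5])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should match
```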

1.2 Gaussian Discriminant Analysis (GDA)

The model is parameterized by $\phi,\mu_0,\mu_1,\Sigma$:
y\sim \mathrm{Bernoulli}(\phi)\quad x|y=0\sim N(\mu_0,\Sigma)\quad x|y=1\sim N(\mu_1,\Sigma)
p(y)=\phi^y(1-\phi)^{1-y}
p(x|y=0)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)}
p(x|y=1)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)}
The log-likelihood of the data is:
\ell(\phi,\mu_0,\mu_1,\Sigma)=\log\prod_{i=1}^{m}p(x^{(i)},y^{(i)};\phi,\mu_0,\mu_1,\Sigma)=\log\prod_{i=1}^{m}p(x^{(i)}|y^{(i)};\mu_0,\mu_1,\Sigma)\,p(y^{(i)};\phi)
To find the maximum likelihood estimates, first expand the log-likelihood:

\begin{aligned} \ell(\phi,\mu_0,\mu_1,\Sigma) &=\log\prod_{i=1}^m p(x^{(i)},y^{(i)})=\log\prod_{i=1}^m p(x^{(i)}|y^{(i)})\,p(y^{(i)}) \\ &=\sum_{i=1}^m \log p(x^{(i)}|y^{(i)})+\sum_{i=1}^m \log p(y^{(i)}) \\ &=\sum_{i=1}^m \log\left(p(x^{(i)}|y^{(i)}=0)^{1-y^{(i)}}\,p(x^{(i)}|y^{(i)}=1)^{y^{(i)}}\right)+\sum_{i=1}^m \log p(y^{(i)}) \\ &=\sum_{i=1}^m (1-y^{(i)})\log p(x^{(i)}|y^{(i)}=0)+\sum_{i=1}^m y^{(i)}\log p(x^{(i)}|y^{(i)}=1)+\sum_{i=1}^m \log p(y^{(i)}) \end{aligned}
Differentiating with respect to $\phi$ (writing $I(\cdot)$ for the indicator function):
\begin{aligned} \frac{\partial\,\ell(\phi,\mu_0,\mu_1,\Sigma)}{\partial\phi}&=\frac{\partial\sum_{i=1}^m \log p(y^{(i)})}{\partial\phi} \\&=\frac{\partial\sum_{i=1}^m \log\left(\phi^{y^{(i)}}(1-\phi)^{1-y^{(i)}}\right)}{\partial\phi} \\&=\frac{\partial\sum_{i=1}^m \left(y^{(i)}\log\phi+(1-y^{(i)})\log(1-\phi)\right)}{\partial\phi} \\&=\sum_{i=1}^m\left(y^{(i)}\frac{1}{\phi}-(1-y^{(i)})\frac{1}{1-\phi}\right) \\&=\sum_{i=1}^m\left(I(y^{(i)}=1)\frac{1}{\phi}-I(y^{(i)}=0)\frac{1}{1-\phi}\right) \end{aligned}
Setting this to zero gives:
\begin{aligned} \phi=\frac{\sum_{i=1}^m I(y^{(i)}=1)}{\sum_{i=1}^m I(y^{(i)}=0)+\sum_{i=1}^m I(y^{(i)}=1)}=\frac{\sum_{i=1}^m I(y^{(i)}=1)}{m}\end{aligned}
Differentiating with respect to $\mu_0$:
\begin{aligned} \frac{\partial\,\ell(\phi,\mu_0,\mu_1,\Sigma)}{\partial\mu_0}&=\frac{\partial\sum_{i=1}^m(1-y^{(i)})\log p(x^{(i)}|y^{(i)}=0)}{\partial\mu_0} \\&=\frac{\partial\sum_{i=1}^m(1-y^{(i)})\left(\log\frac{1}{\sqrt{(2\pi)^n|\Sigma|}}-\frac{1}{2}(x^{(i)}-\mu_0)^T\Sigma^{-1}(x^{(i)}-\mu_0)\right)}{\partial\mu_0} \\&=\sum_{i=1}^m(1-y^{(i)})\,\Sigma^{-1}(x^{(i)}-\mu_0) \\&=\sum_{i=1}^m I(y^{(i)}=0)\,\Sigma^{-1}(x^{(i)}-\mu_0) \end{aligned}
Setting this to zero:
\begin{aligned} \mu_0=\frac{\sum_{i=1}^m I(y^{(i)}=0)\,x^{(i)}}{\sum_{i=1}^m I(y^{(i)}=0)} \end{aligned}
By the symmetric argument for the $y^{(i)}=1$ terms:
\begin{aligned} \mu_1=\frac{\sum_{i=1}^m I(y^{(i)}=1)\,x^{(i)}}{\sum_{i=1}^m I(y^{(i)}=1)} \end{aligned}
For $\Sigma$, first rewrite the likelihood terms that involve it:
\begin{aligned} &\sum_{i=1}^m(1-y^{(i)})\log p(x^{(i)}|y^{(i)}=0)+\sum_{i=1}^m y^{(i)}\log p(x^{(i)}|y^{(i)}=1)\\&=\sum_{i=1}^m(1-y^{(i)})\left(\log\frac{1}{\sqrt{(2\pi)^n|\Sigma|}}-\frac{1}{2}(x^{(i)}-\mu_0)^T\Sigma^{-1}(x^{(i)}-\mu_0)\right)+\sum_{i=1}^m y^{(i)}\left(\log\frac{1}{\sqrt{(2\pi)^n|\Sigma|}}-\frac{1}{2}(x^{(i)}-\mu_1)^T\Sigma^{-1}(x^{(i)}-\mu_1)\right)\\&=\sum_{i=1}^m\log\frac{1}{\sqrt{(2\pi)^n|\Sigma|}}-\frac{1}{2}\sum_{i=1}^m(x^{(i)}-\mu_{y^{(i)}})^T\Sigma^{-1}(x^{(i)}-\mu_{y^{(i)}})\\&=\sum_{i=1}^m\left(-\frac{n}{2}\log(2\pi)-\frac{1}{2}\log|\Sigma|\right)-\frac{1}{2}\sum_{i=1}^m(x^{(i)}-\mu_{y^{(i)}})^T\Sigma^{-1}(x^{(i)}-\mu_{y^{(i)}}) \end{aligned}
Then differentiate with respect to $\Sigma$, treating the matrix derivatives informally via the identities
\begin{aligned} \frac{\partial|\Sigma|}{\partial\Sigma}=|\Sigma|\Sigma^{-1},\qquad \frac{\partial\Sigma^{-1}}{\partial\Sigma}=-\Sigma^{-2}\end{aligned}
(the second is a scalar-style shorthand; the full matrix calculus leads to the same stationary point):
\begin{aligned} \frac{\partial\,\ell(\phi,\mu_0,\mu_1,\Sigma)}{\partial\Sigma}&=-\frac{1}{2}\sum_{i=1}^m\frac{1}{|\Sigma|}|\Sigma|\Sigma^{-1}-\frac{1}{2}\sum_{i=1}^m(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T\,\frac{\partial\Sigma^{-1}}{\partial\Sigma}\\&=-\frac{m}{2}\Sigma^{-1}+\frac{1}{2}\sum_{i=1}^m(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T\,\Sigma^{-2} \end{aligned}
Setting this to zero yields:
\begin{aligned} \Sigma=\frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T\end{aligned}
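The four estimators above translate directly into code. A minimal NumPy sketch with hypothetical synthetic data; the naming (`fit_gda`) is my own, not from the original post:

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLEs for GDA: phi, mu0, mu1, and the shared Sigma.

    X: (m, n) design matrix; y: (m,) array of 0/1 labels.
    """
    m = X.shape[0]
    phi = np.mean(y == 1)                 # fraction of positive examples
    mu0 = X[y == 0].mean(axis=0)          # class-conditional means
    mu1 = X[y == 1].mean(axis=0)
    diff = X - np.where(y[:, None] == 1, mu1, mu0)   # x^(i) - mu_{y^(i)}
    Sigma = diff.T @ diff / m             # pooled covariance estimate
    return phi, mu0, mu1, Sigma

# Hypothetical data drawn from the GDA model itself, to check the fit.
rng = np.random.default_rng(0)
y = (rng.random(500) < 0.4).astype(int)
X = np.where(y[:, None] == 1, 2.0, -1.0) + rng.normal(size=(500, 2))
print(fit_gda(X, y))   # phi near 0.4, means near (2,2) and (-1,-1)
```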


1.3 GDA and Logistic Regression

Viewing $x$ as an $(n+1)$-dimensional vector with the convention $x_0=1$, the posterior $p(y=1|x;\phi,\mu_0,\mu_1,\Sigma)$ can be written in logistic form:
p(y=1|x;\phi,\Sigma,\mu_0,\mu_1)=\frac{1}{1+e^{-\theta^Tx}}
\theta=\begin{bmatrix}\log\frac{\phi}{1-\phi}+\frac{1}{2}(\mu_0^T\Sigma^{-1}\mu_0-\mu_1^T\Sigma^{-1}\mu_1)\\\Sigma^{-1}(\mu_1-\mu_0)\end{bmatrix}
(Expanding the two Gaussian densities in Bayes' rule, the shared $\Sigma$ cancels the quadratic terms, leaving a linear function of $x$.)
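A short numeric check of this identity (my own code; the bias component of $\theta$ is kept separate, which is equivalent to the $x_0=1$ convention): convert hypothetical GDA parameters into $\theta$ and compare the logistic form against the posterior computed directly from Bayes' rule.

```python
import numpy as np
from scipy.stats import multivariate_normal

phi = 0.3
mu0, mu1 = np.array([-1.0, 0.0]), np.array([1.5, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 2.0]])
Sinv = np.linalg.inv(Sigma)

# theta as derived above: bias term first, then the linear weights.
bias = np.log(phi / (1 - phi)) + 0.5 * (mu0 @ Sinv @ mu0 - mu1 @ Sinv @ mu1)
w = Sinv @ (mu1 - mu0)

x = np.array([0.7, -0.3])
logistic = 1.0 / (1.0 + np.exp(-(bias + w @ x)))

# Direct posterior via Bayes' rule.
p1 = multivariate_normal(mu1, Sigma).pdf(x) * phi
p0 = multivariate_normal(mu0, Sigma).pdf(x) * (1 - phi)
print(logistic, p1 / (p0 + p1))  # the two values should agree
```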


GDA:

  • Maximizes the joint likelihood $\prod_{i=1}^m p(x^{(i)},y^{(i)})$.
  • Model assumptions: $x|y=b\sim N(\mu_b,\Sigma)$, $y\sim \mathrm{Bernoulli}(\phi)$.
  • When these assumptions are correct, GDA is asymptotically efficient and makes more efficient use of the data.

Logistic regression:

  • Maximizes the conditional likelihood $\prod_{i=1}^m p(y^{(i)}|x^{(i)})$.
  • Model assumption: $p(y|x)$ is a logistic function.
  • More robust and less sensitive to modeling assumptions that are not quite correct.

1.4 Naive Bayes

Naive Bayes is a simple generative learning algorithm for discrete inputs.
Worked example: spam classification.

Project source code:
Given an email $x$, decide whether or not it is spam ($y=0$ or $y=1$).
Each email is represented by a vector whose dimension is the size of the dictionary: $x_i=1$ means the $i$-th word appears in the email, and $x_i=0$ means it does not.

The Naive Bayes Model

The Naive Bayes assumption: the features $x_i$ are conditionally independent given the class,
p(x_1,x_2,...,x_n|y)=\prod_{i=1}^np(x_i|y)
Parameter learning
The multivariate Bernoulli event model:
p(x,y)=p(y)\prod_{i=1}^np(x_i|y)

  • Whether an email is spam is modeled as a Bernoulli draw: $p(y=1)=\phi_y$.
  • Given $y$, the presence of each word in an email is modeled by $p(x_i=1|y)=\phi_{i|y}$.

Maximum likelihood:
L=\prod_{i=1}^mp(x^{(i)},y^{(i)})
Maximizing $L$ yields the intuitive frequency estimates: $\phi_y=\frac{\sum_{i=1}^m I(y^{(i)}=1)}{m}$, $\phi_{j|y=1}=\frac{\sum_{i=1}^m I(x_j^{(i)}=1\wedge y^{(i)}=1)}{\sum_{i=1}^m I(y^{(i)}=1)}$, and similarly for $\phi_{j|y=0}$.
Making predictions
Given a new example, compute $p(y=1|x)$:
p(y=1|x)=\frac{p(x|y=1)p(y=1)}{p(x)}=\frac{p(x|y=1)p(y=1)}{p(x|y=1)p(y=1)+p(x|y=0)p(y=0)}
=\frac{\prod_{i=1}^np(x_i|y=1)p(y=1)}{\prod_{i=1}^np(x_i|y=1)p(y=1)+\prod_{i=1}^np(x_i|y=0)p(y=0)}
Predict $y=1$ if $p(y=1|x)>0.5$. (A runnable sketch combining estimation, smoothing, and prediction follows the next subsection.)


Laplace Smoothing

  • If a word in a new email never appeared in the training set, its estimates are $\phi_{i|y=1}=\phi_{i|y=0}=0$, and the posterior degenerates to the indeterminate form $p(y=1|x)=\frac{0}{0}$.
    Notation (using word counts, i.e. the multinomial event model):
    $n_1$: the number of times word $x_1$ occurs across all spam emails; if $x_1$ never occurs, $n_1=0$.
    $n$: the total number of word occurrences in all documents of class $c_1$.
    The plain estimate is $p(x_1|c_1)=n_1/n$.
    Laplace smoothing modifies this to:
    $p(x_1|c_1)=(n_1+1)/(n+N)$
    $p(x_2|c_1)=(n_2+1)/(n+N)$
    where $N$ is the vocabulary size. Correcting the denominator keeps the probabilities summing to 1. For the binary presence/absence model used above, the analogous fix adds 1 to the numerator and 2 to the denominator.
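Putting the pieces together, a minimal Bernoulli Naive Bayes sketch with Laplace smoothing (all names are my own). Since each feature is binary, smoothing adds 1 to the numerator and 2 to the denominator, as noted above:

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """MLE with Laplace smoothing for the multivariate Bernoulli model.

    X: (m, n) binary matrix, X[i, j] = 1 if word j appears in document i.
    y: (m,) array of 0/1 labels (1 = spam).
    """
    phi_y = y.mean()
    # +1 / +2 smoothing: one pseudo-count per possible outcome of x_j.
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return phi_y, phi_j_y0, phi_j_y1

def predict(x, phi_y, phi_j_y0, phi_j_y1):
    """Return p(y=1|x), using log-probabilities for numerical stability."""
    log_p1 = np.log(phi_y) + np.sum(
        np.where(x == 1, np.log(phi_j_y1), np.log(1 - phi_j_y1)))
    log_p0 = np.log(1 - phi_y) + np.sum(
        np.where(x == 1, np.log(phi_j_y0), np.log(1 - phi_j_y0)))
    return 1.0 / (1.0 + np.exp(log_p0 - log_p1))

# Tiny hypothetical corpus: 4 documents over a 3-word vocabulary.
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
params = fit_bernoulli_nb(X, y)
print(predict(np.array([1, 1, 0]), *params))  # high p(y=1|x), about 0.9
```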

2 Exercise: Quadratic Discriminant Analysis

Source code: https://github.com/Miraclemin/Quadratic-Discriminant-Analysis
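As a rough sketch of what the exercise involves (my own code, not taken from the linked repository): QDA is GDA with a separate covariance $\Sigma_b$ per class, which makes the decision boundary quadratic in $x$ rather than linear.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_qda(X, y):
    """Like GDA, but with a separate covariance matrix per class."""
    phi = (y == 1).mean()
    params = {}
    for b in (0, 1):
        Xb = X[y == b]
        mu = Xb.mean(axis=0)
        Sigma = (Xb - mu).T @ (Xb - mu) / Xb.shape[0]   # per-class covariance
        params[b] = (mu, Sigma)
    return phi, params

def posterior_y1(x, phi, params):
    """p(y=1|x) via Bayes' rule; the quadratic terms no longer cancel."""
    p1 = multivariate_normal(*params[1]).pdf(x) * phi
    p0 = multivariate_normal(*params[0]).pdf(x) * (1 - phi)
    return p1 / (p0 + p1)

# Usage mirrors the GDA sketch in 1.2:
#   phi, params = fit_qda(X, y); posterior_y1(x_new, phi, params)
```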
