Applied Machine Learning (3): The Naive Bayes Classifier

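Naive Bayes classifiers are a family of simple probabilistic classifiers obtained by applying Bayes' theorem under a strong independence assumption on the features (variables).

The Naive Bayes classifier assumes:

Given the class variable, each feature is independent of every other feature. This kind of independence is called conditional independence. For example, $X$ is conditionally independent of $Y$ given $Z$ if and only if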

$P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z)\, P(Y = y \mid Z = z), \quad \forall\, x, y, z$

This is abbreviated as $X \perp\!\!\!\perp Y \mid Z$. For example, once a person's age is known, that person's height and hair greyness are conditionally independent.
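The definition of conditional independence can be checked numerically. The following is a minimal sketch in R; the probabilities for the binary variables $X$, $Y$, $Z$ are hypothetical numbers chosen only for illustration. It builds a joint distribution that factorizes given $Z$, confirms the defining identity, and shows that $X$ and $Y$ need not be independent once $Z$ is no longer conditioned on.

```r
# Hypothetical distributions for binary X, Y, Z (illustrative numbers only)
p_z   <- c(0.3, 0.7)                        # P(Z = z)
p_x_z <- rbind(c(0.9, 0.1), c(0.2, 0.8))    # row z: P(X = x | Z = z)
p_y_z <- rbind(c(0.6, 0.4), c(0.1, 0.9))    # row z: P(Y = y | Z = z)

# Joint P(X = x, Y = y, Z = z), stored as an array indexed [x, y, z]
joint <- array(0, dim = c(2, 2, 2))
for (z in 1:2) {
  joint[, , z] <- p_z[z] * outer(p_x_z[z, ], p_y_z[z, ])
}

# Check P(x, y | z) == P(x | z) P(y | z) for every z
cond_indep <- TRUE
for (z in 1:2) {
  p_xy_given_z <- joint[, , z] / sum(joint[, , z])
  p_x_given_z  <- rowSums(p_xy_given_z)
  p_y_given_z  <- colSums(p_xy_given_z)
  cond_indep <- cond_indep &&
    isTRUE(all.equal(p_xy_given_z, outer(p_x_given_z, p_y_given_z)))
}

# Without conditioning on Z, X and Y are dependent in this example
p_xy <- apply(joint, c(1, 2), sum)
marg_indep <- isTRUE(all.equal(p_xy, outer(rowSums(p_xy), colSums(p_xy))))

print(cond_indep)
print(marg_indep)
```

Conditional independence given $Z$ therefore does not imply marginal independence; this is the same pattern as the height and hair greyness example, where both features depend on age.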

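Probabilistic model

Suppose there are $K$ classes, denoted $\mathcal{C}_1,\mathcal{C}_2,\dots,\mathcal{C}_K$. An object to be classified is represented by an $n$-dimensional vector $\mathbf{x}=(x_1, x_2,\dots,x_n)$ whose $n$ components are $n$ features (variables), assumed conditionally independent given the class. Naive Bayes is essentially a conditional probability model: it computes the probability $\mathcal{P}(\mathcal{C}_k\,|\,\mathbf{x}), \, k=1,2,\dots,K$ that $\mathbf{x}$ belongs to each class. By Bayes' formula, this conditional probability decomposes as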

$\mathcal{P}(\mathcal{C}_k\,|\,\mathbf{x}) = \dfrac{\mathcal{P}(\mathbf{x}\,|\,\mathcal{C}_k)\mathcal{P}(\mathcal{C}_k)} {\mathcal{P}(\mathbf{x})}$

In Bayesian statistics, $\mathcal{P}(\mathcal{C}_k\,|\,\mathbf{x})$ is called the posterior probability, $\mathcal{P}(\mathcal{C}_k)$ the prior probability of class $\mathcal{C}_k$, $\mathcal{P}(\mathbf{x}\,|\,\mathcal{C}_k)$ the likelihood, and $\mathcal{P}(\mathbf{x})$ the joint probability of the sample's features (the evidence). For a given $\mathbf{x}$, the denominator in Bayes' formula is a constant, and the numerator equals the joint probability of $\mathbf{x}$ and $\mathcal{C}_k$. Applying the definition of conditional probability repeatedly gives

$\begin{align*} \mathcal{P}(\mathbf{x},\,\mathcal{C}_k) &= \mathcal{P}(x_1, x_2, \dots, x_n,\,\mathcal{C}_k) \\ &= \mathcal{P}(x_1\, | \, x_2, \dots, x_n,\,\mathcal{C}_k)\mathcal{P}(x_2, \dots, x_n,\,\mathcal{C}_k) \\ &= \mathcal{P}(x_1\, | \, x_2, \dots, x_n,\,\mathcal{C}_k) \mathcal{P}(x_2\, | \, x_3, \dots, x_n,\,\mathcal{C}_k) \mathcal{P}(x_3, \dots, x_n,\,\mathcal{C}_k) \\ &= \dots \\ &= \mathcal{P}(x_1\, | \, x_2, \dots, x_n,\,\mathcal{C}_k) \mathcal{P}(x_2\, | \, x_3, \dots, x_n,\,\mathcal{C}_k) \cdots \mathcal{P}(x_{n-1}\,|\,x_n, \mathcal{C}_k) \mathcal{P}(x_n\,|\,\mathcal{C}_k)\mathcal{P}(\mathcal{C}_k) \end{align*}$

The "naive" conditional independence assumption:

Given the class $\mathcal{C}_k$, every feature $x_i$ is conditionally independent of every other feature $x_j, \, \forall j \ne i$. Consequently,

$\mathcal{P}(x_i \, |\, x_{i+1},\dots, x_{n}, \mathcal{C}_k)=\mathcal{P}(x_i\,|\,\mathcal{C}_k)$
The posterior probability can therefore be written as

$\begin{align*} \mathcal{P}(\mathcal{C}_k\,|\,x_1, x_2, \dots,x_n) &\propto \mathcal{P}(x_1, x_2, \dots,x_n,\,\mathcal{C}_k) \\ &= \mathcal{P}(\mathcal{C}_k) \mathcal{P}(x_1\,|\,\mathcal{C}_k) \mathcal{P}(x_2\,|\,\mathcal{C}_k)\cdots\mathcal{P}(x_n\,|\,\mathcal{C}_k) \\ &= \mathcal{P}(\mathcal{C}_k)\prod_{i=1}^n\mathcal{P}(x_i\,|\,\mathcal{C}_k) \end{align*}$
Hence

$\mathcal{P}(\mathcal{C}_k\,|\,x_1, x_2, \dots,x_n)= \dfrac{1}{Z} \mathcal{P}(\mathcal{C}_k)\prod_{i=1}^n\mathcal{P}(x_i\,|\,\mathcal{C}_k)$

Here $Z=\mathcal{P}(\mathbf{x}) = \sum_{k=1}^K \mathcal{P}(\mathcal{C}_k)\prod_{i=1}^n\mathcal{P}(x_i\,|\,\mathcal{C}_k)$ depends only on $x_1, x_2, \dots, x_n$; in other words, it is a constant once the feature values are known.
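As an illustration of this formula, here is a minimal R sketch that turns priors and per-feature likelihoods into normalized posteriors. The function name `nb_posterior` and the numbers are hypothetical, not taken from any package.

```r
# Posterior P(C_k | x) under the naive independence assumption.
# prior:      length-K vector with P(C_k)
# likelihood: K x n matrix, entry [k, i] = P(x_i | C_k) for the observed x
nb_posterior <- function(prior, likelihood) {
  # Numerator: P(C_k) * prod_i P(x_i | C_k), one value per class
  unnormalized <- prior * apply(likelihood, 1, prod)
  # Normalizing constant Z = P(x): the sum of the numerators over all classes
  Z <- sum(unnormalized)
  unnormalized / Z
}

# Hypothetical example with K = 2 classes and n = 3 features
prior <- c(0.5, 0.5)
likelihood <- rbind(c(0.8, 0.1, 0.5),
                    c(0.3, 0.4, 0.5))
post <- nb_posterior(prior, likelihood)
print(post)
```

Dividing by $Z$ only rescales the scores so that they sum to 1; it does not change which class has the largest score.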

Bayes classifier

Under the maximum a posteriori (MAP) decision rule, the vector $\mathbf{x}$ to be classified is assigned to the class with the largest posterior probability. The resulting classifier is called the Bayes classifier, and the predicted class label of $\mathbf{x}$ is

$\hat{y}=\mathop{\arg\max}_{k\in \{1, 2, \dots, K\}} \mathcal{P} (\mathcal{C}_k)\prod_{i=1}^n\mathcal{P} (x_i\,|\,\mathcal{C}_k)$
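In practice, the product of many small probabilities can underflow to zero in floating-point arithmetic. Because the logarithm is monotone, the same arg max is obtained from the sum of log-probabilities. The sketch below shows this in R; the function name `nb_map` and the inputs are hypothetical.

```r
# MAP decision computed in log space.
# prior:      length-K vector with P(C_k)
# likelihood: K x n matrix, entry [k, i] = P(x_i | C_k) for the observed x
nb_map <- function(prior, likelihood) {
  # log P(C_k) + sum_i log P(x_i | C_k) has the same arg max as the product
  log_score <- log(prior) + rowSums(log(likelihood))
  which.max(log_score)
}

# Hypothetical case with n = 2000 features: class 2 is more likely for every feature
prior <- c(0.5, 0.5)
likelihood <- rbind(rep(0.1, 2000), rep(0.2, 2000))

# The raw products underflow to 0 for both classes, so they cannot be compared
raw_score <- prior * apply(likelihood, 1, prod)
print(raw_score)

# The log-space scores remain finite and pick class 2
y_hat <- nb_map(prior, likelihood)
print(y_hat)
```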

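Parameter estimation and event models

When no prior knowledge about the classes is available, the class priors are usually taken to be uniform, i.e. $\mathcal{P}(\mathcal{C}_k) = 1/K,\,k=1, 2, \dots, K$. They can also be estimated from the training set by relative frequency: the prior of a given class $=$ the number of samples in that class $/$ the total number of samples. To estimate the parameters of the feature distributions $\mathcal{P} (x\,|\,\mathcal{C})$, one must either assume a distribution family for the features or build a nonparametric model of the feature distributions from the training set. The assumed feature distribution is called the event model of the Naive Bayes classifier. Continuous features are usually assumed to be normally distributed; discrete features are commonly assumed to follow a multinomial or Bernoulli distribution.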
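The estimates described above can be sketched in a few lines of R. The toy data frame `train` below is hypothetical; `x1` is treated as a continuous feature with a normal event model and `x2` as a binary feature with a Bernoulli event model.

```r
# Hypothetical toy training set (illustrative values only)
train <- data.frame(
  class = factor(c("a", "a", "a", "b", "b")),
  x1    = c(1.0, 2.0, 3.0, 10.0, 12.0),   # continuous feature
  x2    = c(1, 0, 1, 0, 1)                # binary feature
)

# Uniform prior: P(C_k) = 1 / K
K <- nlevels(train$class)
prior_uniform <- rep(1 / K, K)

# Frequency-based prior: class count / total sample count
prior_freq <- prop.table(table(train$class))

# Normal event model: per-class sample mean and standard deviation of x1
mu    <- tapply(train$x1, train$class, mean)
sigma <- tapply(train$x1, train$class, sd)

# Bernoulli event model: per-class relative frequency of x2 == 1
theta <- tapply(train$x2, train$class, mean)

print(prior_freq)
print(rbind(mu, sigma))
print(theta)
```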

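A sex classification example

Problem: predict a person's sex from three features, namely height, weight, and foot size.

Training

The example training set is shown in the table below:

Assume that every feature is normally distributed. Using the samples of each sex separately, estimate the mean and variance of each of the three features; this determines the normal distributions.

Assume the class priors are $\mathcal{P}(male) = \mathcal{P}(female)=0.5$. The priors could instead be taken from historical knowledge about the population, or from the sex frequencies in the training set.

Testing

Given a sample to be classified:

compute the posterior probability that the sample belongs to each of the two classes: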

$\mathcal{P}(male\,|\,sample)= \dfrac{\mathcal{P}(male) \mathcal{P}(height\,|\,male)\mathcal{P}(weight\,|\,male)\mathcal{P}(foot\,size\,|\,male)} {\mathcal{P}(sample)}$

where the denominator is

$\begin{align*} \mathcal{P}(sample) &= \mathcal{P}(male) \mathcal{P}(height\,|\,male)\mathcal{P}(weight\,|\,male)\mathcal{P}(foot\,size\,|\,male) \\ &+ \mathcal{P}(female) \mathcal{P}(height\,|\,female)\mathcal{P}(weight\,|\,female)\mathcal{P}(foot\,size\,|\,female) \end{align*}$

Once the sample is given, the denominator is a constant, so it does not affect the classification and can be ignored. The factors in the numerator are computed as follows:

$\mathcal{P}(male)=0.5$

$\mathcal{P}(height\,|\,male)=\dfrac{1}{\sqrt{2\pi}\,\hat{\sigma}}\exp\left(-\dfrac{(6-\hat{\mu})^2}{2\hat{\sigma}^2}\right)\approx 1.5789$

where $\hat{\mu}$ and $\hat{\sigma}$ are the mean and standard deviation of height estimated from the male samples, and 6 is the height of the sample. This is the value of a probability density, which is why it can exceed 1.

$\mathcal{P}(weight\,|\,male)\approx 5.9881 \times 10^{-6}$

$\mathcal{P}(foot\,size\,|\,male)\approx 1.3112 \times 10^{-3}$

Therefore

$\mathcal{P}(male\,|\,sample)\propto 6.1984 \times 10^{-9}$

In the same way,

$\mathcal{P}(female\,|\,sample)\propto 5.3778 \times 10^{-4}$

The posterior for the female class is clearly larger than that for the male class, so the sample is predicted to be female.
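The calculation can be reproduced with a short R sketch. The training table and the test sample are not reproduced in the text, so the data below are an assumption: they are the values from the widely circulated version of this example (four males, four females, and a test sample with height 6, weight 130, foot size 8), chosen because they agree with the height of 6 and the densities quoted above. The variable names are hypothetical.

```r
# ASSUMED training data (not given in the text; see the note above)
train <- data.frame(
  sex      = c(rep("male", 4), rep("female", 4)),
  height   = c(6.00, 5.92, 5.58, 5.92, 5.00, 5.50, 5.42, 5.75),
  weight   = c(180, 190, 170, 165, 100, 150, 130, 150),
  footsize = c(12, 11, 12, 10, 6, 8, 7, 9)
)

# ASSUMED sample to classify, and the equal priors used in the text
sample_x <- c(height = 6, weight = 130, footsize = 8)
prior    <- c(male = 0.5, female = 0.5)

# Numerator of the posterior for each class:
# prior times the product of normal densities, one per feature
features  <- names(sample_x)
numerator <- sapply(names(prior), function(cls) {
  rows <- train[train$sex == cls, ]
  lik <- sapply(features, function(f) {
    # sd() uses the sample (n - 1) estimate of the variance
    dnorm(sample_x[[f]], mean = mean(rows[[f]]), sd = sd(rows[[f]]))
  })
  prior[[cls]] * prod(lik)
})

print(numerator)                    # unnormalized posteriors
print(numerator / sum(numerator))   # normalized posteriors
print(names(which.max(numerator)))  # MAP class
```

Under these assumed data the unnormalized scores agree with the values quoted above, and the MAP rule selects the female class.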

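Data experiment

We train a Naive Bayes classifier on the machine learning benchmark data set HouseVotes84 and evaluate its classification performance on the same data set. HouseVotes84 records how the 435 members of the U.S. House of Representatives voted on 16 bills in 1984. On each of the 16 bills, a member voted yea (y), voted nay (n), or took a neutral position. The data set ships with the R package mlbench as a data frame with 435 observations (rows) and 17 variables (columns). The 17 variables, each with its number of levels, are: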

Class Name: 2 (democrat, republican)
handicapped-infants: 2 (y,n)
water-project-cost-sharing: 2 (y,n)
adoption-of-the-budget-resolution: 2 (y,n)
physician-fee-freeze: 2 (y,n)
el-salvador-aid: 2 (y,n)
religious-groups-in-schools: 2 (y,n)
anti-satellite-test-ban: 2 (y,n)
aid-to-nicaraguan-contras: 2 (y,n)
mx-missile: 2 (y,n)
immigration: 2 (y,n)
synfuels-corporation-cutback: 2 (y,n)
education-spending: 2 (y,n)
superfund-right-to-sue: 2 (y,n)
crime: 2 (y,n)
duty-free-exports: 2 (y,n)
export-administration-act-south-africa: 2 (y,n)

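Load the HouseVotes84 data set in R and display its first 6 rows:

library(mlbench)       # provides the HouseVotes84 data set
data(HouseVotes84)     # load the data frame into the workspace
head(HouseVotes84)     # show the first 6 rows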

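An NA in the data set stands for a "neutral" vote; NAs are ignored when the classifier is trained.

Now, using HouseVotes84 as the training set, build a Naive Bayes classifier with the function naiveBayes from the e1071 package, predict the classes of the first 10 rows of the data set, and compare the predictions with the true classes.

library(e1071)                             # provides naiveBayes()
data(HouseVotes84, package = "mlbench")
# Fit the classifier: Class is the response, the 16 votes are the predictors
model <- naiveBayes(Class ~ ., data = HouseVotes84)
# Predict the classes of the first 10 rows
pred <- predict(model, HouseVotes84[1:10,])
# Cross-tabulate the predicted classes against the true classes
table(pred, HouseVotes84$Class[1:10])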
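The same comparison can be extended from the first 10 rows to all 435 rows. The sketch below assumes the e1071 and mlbench packages are installed; the names `pred_all`, `confusion`, and `accuracy` are hypothetical. Because the model is evaluated on the data it was trained on, the resulting accuracy is likely to be optimistic.

```r
library(e1071)
data(HouseVotes84, package = "mlbench")

# Fit on the full data set, as in the text
model <- naiveBayes(Class ~ ., data = HouseVotes84)

# Predict every row and cross-tabulate predictions against the true classes
pred_all  <- predict(model, HouseVotes84)
confusion <- table(predicted = pred_all, actual = HouseVotes84$Class)

# Training-set accuracy: the share of rows on the diagonal of the confusion matrix
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

For a less optimistic estimate, one option is to fit the model on a random subset of the rows and compute the same table on the rows that were held out.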