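Algorithm steps
- Binarize the data
- Compute the prior probability of each digit class
- Compute the class-conditional probabilities
- Compute the posterior probabilities
(For the detailed calculation, see page 77 of the book.)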
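The four steps above can be written out as formulas. This is a sketch inferred from the implementation later in this section, not a transcription of the book. The notation is an assumption: $N$ is the number of training samples, $N_i$ the number in class $\omega_i$, $x_{kj} \in \{0, 1\}$ the binarized feature $j$ of training sample $k$, and $x_j$ the binarized feature $j$ of the sample $X$ to classify. The $+1$ and $+2$ terms are the smoothing used in the code, which keeps every estimate strictly between 0 and 1.

```latex
\begin{align*}
P(\omega_i) &= \frac{N_i}{N}, \\
P_j(\omega_i) &= \frac{\sum_{k \in \omega_i} x_{kj} + 1}{N_i + 2}, \\
P(X \mid \omega_i) &= \prod_{j=1}^{n} P_j(\omega_i)^{x_j} \bigl(1 - P_j(\omega_i)\bigr)^{1 - x_j}, \\
P(\omega_i \mid X) &= \frac{P(\omega_i)\, P(X \mid \omega_i)}{\sum_{m} P(\omega_m)\, P(X \mid \omega_m)}.
\end{align*}
```

The predicted class is the $\omega_i$ with the largest posterior $P(\omega_i \mid X)$.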
Implementation
Bayes classifier
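import numpy as np

def bayeserzhi(x_train, y_train, sample):
    """
    Bayes classifier for binarized data.
    :param x_train: training set, M*N (M samples, N features)
    :param y_train: training labels, length M
    :param sample: sample to classify, length N
    :return: predicted class label
    """
    # Unnormalized posteriors P(x|w) * P(w), one entry per class
    pwx = []
    target = np.unique(y_train)
    # Binarize at the midpoint of the training value range
    spit = 0.5 * (np.max(x_train) + np.min(x_train))
    train = np.where(x_train > spit, 1, 0)
    sample = np.where(sample > spit, 1, 0)
    for i in target:
        # Row indices of the training samples that belong to class i
        trainIndex = np.flatnonzero(y_train == i)
        trainNum = len(trainIndex)
        # Prior probability P(w_i)
        pw = trainNum / x_train.shape[0]
        # Class-conditional probability P(x_j = 1 | w_i), smoothed with +1 / +2
        p = (np.sum(train[trainIndex], axis=0) + 1) / (trainNum + 2)
        pxw = 1
        for j in range(train.shape[1]):
            if sample[j]:
                pxw *= p[j]
            else:
                pxw *= (1 - p[j])
        # Store P(x|w_i) * P(w_i)
        pwx.append(pxw * pw)
    # Normalize to obtain the posterior probabilities
    pwx = np.array(pwx) / np.sum(pwx)
    maxId = np.argmax(pwx)
    label = target[maxId]
    return label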
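Multiplying many probabilities together can underflow when the number of features is large. The following is a minimal sketch of the same decision rule computed in log space, which avoids that problem. The function name `bayes_binary_log` is hypothetical and is not part of the original module; the midpoint threshold is likewise an assumption. Because normalization does not change which class is largest, the sketch skips it and compares log posteriors directly.

```python
import numpy as np

def bayes_binary_log(x_train, y_train, sample):
    """Hypothetical log-space variant of the binarized-data Bayes classifier."""
    classes = np.unique(y_train)
    # Assumption: binarize at the midpoint of the training value range
    threshold = 0.5 * (np.max(x_train) + np.min(x_train))
    train = (x_train > threshold).astype(int)
    s = (np.asarray(sample) > threshold).astype(int)
    log_post = []
    for c in classes:
        rows = train[y_train == c]
        n_c = rows.shape[0]
        # Log prior: log(N_c / N)
        log_prior = np.log(n_c / train.shape[0])
        # Smoothed estimate of P(x_j = 1 | class c); always strictly in (0, 1)
        p = (rows.sum(axis=0) + 1) / (n_c + 2)
        # Log likelihood of the binarized sample under independent features
        log_lik = np.sum(s * np.log(p) + (1 - s) * np.log(1 - p))
        log_post.append(log_prior + log_lik)
    # The class with the largest unnormalized log posterior wins
    return classes[int(np.argmax(log_post))]
```

On data where the plain product does not underflow, this should select the same class as `bayeserzhi`, since the logarithm is monotonic.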
Splitting the dataset
import random
import numpy as np

def train_test_split(x, y, ratio=3):
    """
    Split a dataset into a training set and a test set.
    :param x: data, m*n (m samples, n features)
    :param y: labels
    :param ratio: split ratio, train:test = ratio:1 (default 3:1)
    :return: x_train, y_train, x_test, y_test
    """
    n_samples = x.shape[0]
    n_train = int(n_samples * ratio / (1 + ratio))
    # Draw n_train distinct row indices at random for the training set
    train_id = random.sample(range(n_samples), n_train)
    x_train = x[train_id, :]
    y_train = y[train_id]
    # The remaining rows form the test set
    x_test = np.delete(x, train_id, axis=0)
    y_test = np.delete(y, train_id, axis=0)
    return x_train, y_train, x_test, y_test
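Because the split above draws indices with `random.sample`, each run produces a different partition. If repeatable experiments are wanted, one option is a seeded permutation. The sketch below is an alternative under that assumption, not the original author's method; the name `permutation_split` and the `seed` parameter are hypothetical.

```python
import numpy as np

def permutation_split(x, y, ratio=3, seed=0):
    """Hypothetical reproducible split with train:test = ratio:1."""
    rng = np.random.default_rng(seed)
    n_samples = x.shape[0]
    n_train = int(n_samples * ratio / (1 + ratio))
    # Shuffle all row indices once, then cut the shuffled order in two
    order = rng.permutation(n_samples)
    train_id, test_id = order[:n_train], order[n_train:]
    return x[train_id], y[train_id], x[test_id], y[test_id]
```

Calling it twice with the same `seed` returns the same partition, which makes results comparable across runs.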
Test code
from sklearn import datasets
from Include.chapter4 import function
import numpy as np

# Load the data
digits = datasets.load_digits()
x, y = digits.data, digits.target
# Split the dataset
x_train, y_train, x_test, y_test = function.train_test_split(x, y)
# Pick one test sample at random
testId = np.random.randint(0, x_test.shape[0])
sample = x_test[testId, :]
# Classify the sample with the Bayes classifier
ans = function.bayeserzhi(x_train, y_train, sample)
print("Predicted digit:", ans)
print("True digit:", y_test[testId])
Results
Predicted digit: 0
True digit: 0
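A single random sample says little about overall quality. A fuller check would classify every test sample and report the fraction that is correct. The helper below is a minimal sketch of that idea; the name `accuracy` is hypothetical, and it assumes the classifier is a callable with the same `(x_train, y_train, sample)` signature as `bayeserzhi`.

```python
import numpy as np

def accuracy(classify, x_train, y_train, x_test, y_test):
    """Fraction of test samples for which classify(...) returns the true label."""
    # Classify each test row independently with the supplied callable
    predictions = np.array([classify(x_train, y_train, s) for s in x_test])
    return float(np.mean(predictions == np.asarray(y_test)))
```

With the variables from the test code, a call such as `accuracy(function.bayeserzhi, x_train, y_train, x_test, y_test)` would give the accuracy over the whole test set. No figure is quoted here because it depends on the random split.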