Case study: binary classification of spam email

Reference blog: http://blog.csdn.net/u013508213/article/details/52326420

【Email Preprocessing】

% regexprep searches a string for a pattern and replaces every match

A. Input: email_contents

Convert the whole email to lowercase: lower(email_contents)
Strip all HTML tags (<...>): regexprep(email_contents,'<[^<>]+>', ' ')
Replace numbers with 'number': regexprep(email_contents, '[0-9]+','number')
Replace URLs with 'httpaddr': regexprep(email_contents, '(http|https)://[^\s]*','httpaddr')
Replace email addresses with 'emailaddr': regexprep(email_contents,'[^\s]+@[^\s]+', 'emailaddr')
Replace the money symbol $ with 'dollar': regexprep(email_contents, '[$]+','dollar')
Reduce each word to its stem (stemming), e.g. "discount, discounts, discounted" -> "discount"; "include, including, includes" -> "includ"
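
The steps above can be chained on a sample message as follows. This is only a sketch: the sample text is made up, stemming is skipped because it needs a separate stemmer implementation, and the final tokenization with regexp is an assumption that is not part of the list above.

```matlab
% Hypothetical sample message, used only for illustration.
email_contents = 'Visit <b>http://example.com</b> NOW, save $100! Mail me@example.com';

% Apply the substitutions in the same order as the list above.
email_contents = lower(email_contents);
email_contents = regexprep(email_contents, '<[^<>]+>', ' ');
email_contents = regexprep(email_contents, '[0-9]+', 'number');
email_contents = regexprep(email_contents, '(http|https)://[^\s]*', 'httpaddr');
email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');
email_contents = regexprep(email_contents, '[$]+', 'dollar');

% Assumed tokenization step: keep runs of letters/digits, drop punctuation.
tokens = regexp(email_contents, '[a-z0-9]+', 'match');
disp(tokens);
```

Note that '$100' ends up as the single token 'dollarnumber', because the digits are replaced before the dollar sign.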

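B. Map each word to its index in the vocabulary

After preprocessing, each word of the email is looked up in a vocabulary list, vocabList, which is built from words that occur frequently in spam. The result is word_indices, a vector holding the vocabulary index of every word that was found.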

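% str is the current word of the email; search vocabList for it
for i = 1:length(vocabList)
    if strcmp(str, vocabList{i})
        word_indices = [word_indices i];  % record the index of the match
        break;                            % vocabulary entries are assumed unique
    end
end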
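
The loop above handles one word at a time. As an alternative sketch, the built-in ismember can look up all words in a single call. The tiny vocabList and tokens below are hypothetical; the real vocabulary is much larger.

```matlab
% Hypothetical miniature vocabulary and token list, for illustration only.
vocabList = {'dollar', 'emailaddr', 'httpaddr', 'save', 'visit'};
tokens    = {'visit', 'httpaddr', 'now', 'save', 'dollarnumber', 'emailaddr'};

% found(k) is true if tokens{k} is in vocabList; idx(k) is its position (0 if absent).
[found, idx] = ismember(tokens, vocabList);

% Keep only the indices of words that are actually in the vocabulary.
word_indices = idx(found);
disp(word_indices);
```

Words that are not in the vocabulary ('now' and 'dollarnumber' here) are simply dropped, which matches the behaviour of the loop.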

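C. Feature extraction

For an input email, word_indices tells us which vocabulary words it contains. The output is a feature vector x with x(i) = 1 if the i-th vocabulary word appears in the email and 0 otherwise, e.g. x = [0 0 0 0 1 0 0 0 ... 0 0 0 0 1 ... 0 0 0 1 0 ...]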

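function x = emailFeatures(word_indices)
    n = 1899;                 % number of entries in the vocabulary list
    x = zeros(n, 1);          % one slot per vocabulary word, all initially 0
    x(word_indices) = 1;      % mark every vocabulary word that occurs in the email
end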
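
A quick way to check the behaviour is to run the same two assignments on a small case. The sketch below assumes a hypothetical 5-word vocabulary instead of 1899 and shows that a repeated index still produces a single 1, so x records presence, not counts.

```matlab
n = 5;                          % hypothetical vocabulary size for the demo
word_indices = [5 3 4 2 3];     % index 3 occurs twice on purpose

x = zeros(n, 1);
x(word_indices) = 1;            % repeated indices just set the same entry again

disp(x');                       % show the vector as a row
```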

【Method 1: Naive Bayes】

Train

Classify

For the algorithm and code, see http://blog.csdn.net/a786150017/article/details/78618365
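
For orientation, here is a minimal sketch of what the Train and Classify steps can look like on binary feature vectors. It is not the code from the linked post: it assumes a Bernoulli naive Bayes model with Laplace smoothing, and the toy data and all variable names (X, y, phi_spam, phi_ham, x_new) are made up for illustration.

```matlab
% Toy data: each row is an email, each column a vocabulary word (hypothetical).
X = [1 1 0 0;
     1 0 1 0;
     0 0 1 1;
     0 1 0 1];
y = [1; 1; 0; 0];                 % 1 = spam, 0 = non-spam

% Train: class prior and per-word presence probabilities.
% Laplace smoothing (+1 / +2) keeps every probability strictly between 0 and 1.
prior_spam = mean(y == 1);
phi_spam = (sum(X(y == 1, :), 1) + 1) / (sum(y == 1) + 2);
phi_ham  = (sum(X(y == 0, :), 1) + 1) / (sum(y == 0) + 2);

% Classify: compare log-posteriors (up to a shared constant) for a new email.
x_new = [1 1 0 0];
log_spam = log(prior_spam) + ...
    sum(x_new .* log(phi_spam) + (1 - x_new) .* log(1 - phi_spam));
log_ham = log(1 - prior_spam) + ...
    sum(x_new .* log(phi_ham) + (1 - x_new) .* log(1 - phi_ham));
is_spam = log_spam > log_ham;

fprintf('Classified as spam: %d\n', is_spam);
```

Working in log space avoids numerical underflow when the product runs over a long feature vector such as the 1899-entry one used here.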

【Method 2: Support Vector Machine】

The SVM is trained with MATLAB library functions.

Train

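1) Training with a linear kernel: the model makes no errors on the training set, but its error on the test set is very large, i.e. it overfits.

Run the evaluation separately on the training set and on the test set: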

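% This call signature matches helper functions named svmTrain, svmPredict and
% linearKernel (assumed to be on the path), not the toolbox function svmtrain.
model = svmTrain(X, y, C, @linearKernel);
p = svmPredict(model, X);   % pass the test set instead of X to get test accuracy
fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);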

2) Training with an RBF kernel: with the default σ = 1 the model underfits. After tuning the parameter, the results are considerably better.

model = svmtrain(X, y, 'kernel_function', 'rbf', 'rbf_sigma', 70);
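
To get a feeling for why the value of σ matters so much here, the sketch below evaluates the Gaussian kernel for two binary feature vectors. It assumes the kernel has the form exp(-||u - v||^2 / (2σ^2)); the two emails and the helper name rbf are hypothetical.

```matlab
n = 1899;                        % length of the feature vector
u = zeros(n, 1);
v = zeros(n, 1);
u(1:60)  = 1;                    % hypothetical email containing words 1..60
v(31:90) = 1;                    % hypothetical email sharing 30 of those words

% For 0/1 vectors the squared distance is the number of words that differ.
sq_dist = sum((u - v) .^ 2);

% Assumed kernel form; sigma plays the role of rbf_sigma.
rbf = @(sigma) exp(-sq_dist / (2 * sigma ^ 2));

k1  = rbf(1);
k70 = rbf(70);
fprintf('sq_dist = %d, K(sigma=1) = %g, K(sigma=70) = %g\n', sq_dist, k1, k70);
```

With σ = 1 the kernel value for two emails that share half of their words is practically zero, while σ = 70 yields a value close to 1, so the scale of σ has to be matched to the typical distance between feature vectors.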

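【Summary】

The SVM reached 92% accuracy on the test set; tuning it requires trying different kernels and kernel parameters.

Naive Bayes reached 98% accuracy on the test set, and its training cost is lower than the SVM's.

So, as far as this experiment goes, the Naive Bayes classifier is simpler in principle and more efficient to implement and use than the SVM.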