Java Learning (Day 29)

Learning source: 日撸 Java 三百行 (Days 51-60, kNN and NB), 闵帆's blog on CSDN

The NB Algorithm for Symbolic Data

Symbolic data means that the values in a dataset are characters or strings. The NB (Naive Bayes) algorithm is based on Bayes' theorem and is commonly used for classification problems. It is also the simplest and most common of the Bayesian methods.

I. The Symbolic Dataset

The weather.arff file can be obtained from https://gitee.com/fansmale/javasampledata.

@relation weather
@attribute Outlook {Sunny, Overcast, Rain}
@attribute Temperature {Hot, Mild, Cool}
@attribute Humidity {High, Normal, Low}
@attribute Windy {FALSE, TRUE}
@attribute Play {N, P}
@data
Sunny,Hot,High,FALSE,N
Sunny,Hot,High,TRUE,N
Overcast,Hot,High,FALSE,P
Rain,Mild,High,FALSE,P
Rain,Cool,Normal,FALSE,P
Rain,Cool,Normal,TRUE,N
Overcast,Cool,Normal,TRUE,P
Sunny,Mild,High,FALSE,N
Sunny,Cool,Normal,FALSE,P
Rain,Mild,Normal,FALSE,P
Sunny,Mild,Normal,TRUE,P
Overcast,Mild,High,TRUE,P
Overcast,Hot,Normal,FALSE,P
Rain,Mild,High,TRUE,N

The file describes weather with five attributes: Outlook, Temperature, Humidity, Windy, and Play, each taking one of a small set of symbolic values. Our task is to predict the value of Play from all of the other attributes. In other words, this is still a classification problem, but the data are no longer numeric as before: each value is a descriptive string.
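
Before moving to the algorithm, it is worth noting how Weka represents such a file: each nominal value is stored internally as the index of its position in the @attribute declaration (e.g. Outlook: Sunny is 0, Overcast is 1, Rain is 2). This is why the classifier later in this post can cast attribute values to int and use them as array indices. Below is a minimal loading sketch, assuming weka.jar is on the classpath and weather.arff sits in the working directory; the class name LoadWeather is only for illustration.

import java.io.FileReader;

import weka.core.Instances;

public class LoadWeather {
    public static void main(String[] args) throws Exception {
        // Read the ARFF file into a Weka Instances object.
        FileReader fileReader = new FileReader("weather.arff");
        Instances dataset = new Instances(fileReader);
        fileReader.close();

        // The last attribute (Play) is the class label.
        dataset.setClassIndex(dataset.numAttributes() - 1);

        System.out.println(dataset.instance(0)); // Sunny,Hot,High,FALSE,N
        System.out.println((int) dataset.instance(0).value(0)); // 0, the index of Sunny
    }
}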

II. Theoretical Derivation

1. Conditional Probability

$$P(AB) = P(A)P(B|A) \tag{1}$$

  • $P(A)$ is the probability that event $A$ occurs.
  • $P(AB)$ is the probability that events $A$ and $B$ occur together.
  • $P(B|A)$ is the probability that event $B$ occurs given that event $A$ has occurred.

Example: let $A$ denote a sunny day, i.e. Outlook = Sunny; let $B$ denote high humidity, i.e. Humidity = High.

Of the 14 days, 5 are Sunny, so $P(A) = P(\text{Outlook} = \text{Sunny}) = 5/14$.

Of those 5 Sunny days, 3 have high humidity, so $P(B|A) = P(\text{Humidity} = \text{High} \mid \text{Outlook} = \text{Sunny}) = 3/5$.

Finally, the probability that a day is both sunny and highly humid is $P(AB) = P(\text{Outlook} = \text{Sunny} \wedge \text{Humidity} = \text{High}) = P(A)P(B|A) = 3/14$.

2. The Independence Assumption

Let $\mathrm{x} = x_1 \wedge x_2 \wedge \dots \wedge x_m$ denote a combination of conditions, e.g. Outlook = Sunny $\wedge$ Temperature = Hot $\wedge$ Humidity = High $\wedge$ Windy = FALSE.

Let $D$ denote an event, e.g. Play = N. By Equation (1):

$$P(D|\mathrm{x}) = \frac{P(\mathrm{x}D)}{P(\mathrm{x})} = \frac{P(D)P(\mathrm{x}|D)}{P(\mathrm{x})} \tag{2}$$

Now comes a very elegant step: we assume that $x_1, x_2, \dots, x_m$ are mutually independent. In the real world this is somewhat absurd (for example, overcast days tend to be windy, and sunny days tend to have low humidity), but we accept the assumption for now.

$$P(\mathrm{x}|D) = P(x_1|D)P(x_2|D) \cdots P(x_m|D) = \prod_{i=1}^{m}P(x_i|D) \tag{3}$$

Combining Equations (2) and (3) gives:

$$P(D|\mathrm{x}) = \frac{P(\mathrm{x}D)}{P(\mathrm{x})} = \frac{P(D) \prod_{i=1}^{m}P(x_i|D)}{P(\mathrm{x})} \tag{4}$$

Since the denominator $P(\mathrm{x})$ is the same for every class (and hard to estimate anyway), we only need to compare the numerators: the instance is assigned to whichever class yields the largest one.
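
In the implementation below this comparison is carried out in log space, which turns the product into a sum and avoids floating-point underflow when $m$ is large. The predicted class is then

$$d^* = \arg\max_{D} \left( \log P^{L}(D) + \sum_{i=1}^{m} \log P^{L}(x_i|D) \right)$$

where $P^{L}$ denotes the Laplacian-smoothed estimates introduced in the next subsection.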

3. Laplacian Smoothing

Since not every attribute-value/class combination is guaranteed to appear in the dataset, the estimates above can come out as 0, and a single zero factor wipes out the whole product in Equation (4). Such combinations may well occur in the real world, so we do not want one missing combination to veto a class. Laplacian smoothing was created for exactly this situation.

$$P^{L}(x_i|D) = \frac{nP(x_iD) + 1}{nP(D) + v_i} \tag{5}$$

Here $n$ is the number of instances in the dataset, so $nP(x_iD)$ and $nP(D)$ are the corresponding counts, and $v_i$ is the number of possible values of the $i$-th attribute; e.g. when $i$ refers to the Humidity attribute, $v_i = 3$.
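
A concrete case from the weather data: Humidity = Low never appears together with Play = N among the 14 instances, so the raw estimate of $P(\text{Humidity} = \text{Low} \mid \text{Play} = N)$ would be 0 and would zero out the whole numerator of Equation (4). With smoothing, using $nP(x_iD) = 0$, $nP(D) = 5$ and $v_i = 3$:

$$P^{L}(\text{Humidity} = \text{Low} \mid \text{Play} = N) = \frac{0 + 1}{5 + 3} = \frac{1}{8}$$

The class prior is smoothed in the same spirit, $P^{L}(D) = \frac{nP(D) + 1}{n + k}$ with $k$ the number of classes, which is exactly what calculateClassDistribution computes in the code below.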

III. Code Walkthrough

1. Full Code

package bayes;

import java.io.FileReader;
import java.util.Arrays;

import weka.core.*;

/**
 * The Naive Bayes algorithm.
 *
 * @author Shi-Huai Wen Email: [email protected].
 */
public class NaiveBayes {

    /**
     * The data.
     */
    Instances dataset;

    /**
     * The number of classes. For binary classification it is 2.
     */
    int numClasses;

    /**
     * The number of instances.
     */
    int numInstances;

    /**
     * The number of conditional attributes.
     */
    int numConditions;

    /**
     * The prediction, including queried and predicted labels.
     */
    int[] predicts;

    /**
     * Class distribution.
     */
    double[] classDistribution;

    /**
     * Class distribution with Laplacian smooth.
     */
    double[] classDistributionLaplacian;

    /**
     * Counts for each class, attribute, and attribute value, used to
     * calculate the conditional probabilities.
     */
    double[][][] conditionalCounts;

    /**
     * The conditional probabilities with Laplacian smooth.
     */
    double[][][] conditionalProbabilitiesLaplacian;

    /**
     * Data type.
     */
    int dataType;

    /**
     * Nominal.
     */
    public static final int NOMINAL = 0;

    /**
     * *******************
     * The constructor.
     *
     * @param paraFilename The given file.
     * *******************
     */
    public NaiveBayes(String paraFilename) {
        dataset = null;
        try {
            FileReader fileReader = new FileReader(paraFilename);
            dataset = new Instances(fileReader);
            fileReader.close();
        } catch (Exception ee) {
            System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
            System.exit(0);
        } // Of try

        dataset.setClassIndex(dataset.numAttributes() - 1);
        numConditions = dataset.numAttributes() - 1;
        numInstances = dataset.numInstances();
        numClasses = dataset.attribute(numConditions).numValues();
    }// Of the constructor

    /**
     * *******************
     * Set the data type.
     * *******************
     */
    public void setDataType(int paraDataType) {
        dataType = paraDataType;
    }// Of setDataType

    /**
     * *******************
     * Calculate the class distribution with Laplacian smooth.
     * *******************
     */
    public void calculateClassDistribution() {
        classDistribution = new double[numClasses];
        classDistributionLaplacian = new double[numClasses];

        double[] tempCounts = new double[numClasses];
        for (int i = 0; i < numInstances; i++) {
            int tempClassValue = (int) dataset.instance(i).classValue();
            tempCounts[tempClassValue]++;
        } // Of for i

        for (int i = 0; i < numClasses; i++) {
            classDistribution[i] = tempCounts[i] / numInstances;
            classDistributionLaplacian[i] = (tempCounts[i] + 1) / (numInstances + numClasses);
        } // Of for i

        System.out.println("Class distribution: " + Arrays.toString(classDistribution));
        System.out.println(
                "Class distribution Laplacian: " + Arrays.toString(classDistributionLaplacian));
    }// Of calculateClassDistribution

    /**
     * *******************
     * Calculate the conditional probabilities with Laplacian smooth. ONLY scan
     * the dataset once. There was a simpler version, but I removed it because
     * its time complexity was higher.
     * *******************
     */
    public void calculateConditionalProbabilities() {
        conditionalCounts = new double[numClasses][numConditions][];
        conditionalProbabilitiesLaplacian = new double[numClasses][numConditions][];

        // Allocate space
        for (int i = 0; i < numClasses; i++) {
            for (int j = 0; j < numConditions; j++) {
                int tempNumValues = dataset.attribute(j).numValues();
                conditionalCounts[i][j] = new double[tempNumValues];
                conditionalProbabilitiesLaplacian[i][j] = new double[tempNumValues];
            } // Of for j
        } // Of for i

        // Count the numbers
        int[] tempClassCounts = new int[numClasses];
        for (int i = 0; i < numInstances; i++) {
            int tempClass = (int) dataset.instance(i).classValue();
            tempClassCounts[tempClass]++;
            for (int j = 0; j < numConditions; j++) {
                int tempValue = (int) dataset.instance(i).value(j);
                conditionalCounts[tempClass][j][tempValue]++;
            } // Of for j
        } // Of for i

        // Now for the real probability with Laplacian
        for (int i = 0; i < numClasses; i++) {
            for (int j = 0; j < numConditions; j++) {
                int tempNumValues = dataset.attribute(j).numValues();
                for (int k = 0; k < tempNumValues; k++) {
                    conditionalProbabilitiesLaplacian[i][j][k] = (conditionalCounts[i][j][k] + 1)
                            / (tempClassCounts[i] + tempNumValues);
                } // Of for k
            } // Of for j
        } // Of for i

        System.out.println("Conditional probabilities with Laplacian: "
                + Arrays.deepToString(conditionalProbabilitiesLaplacian));
    }// Of calculateConditionalProbabilities
    }// Of calculateConditionalProbabilities

    /**
     * *******************
     * Classify all instances, the results are stored in predicts[].
     * *******************
     */
    public void classify() {
        predicts = new int[numInstances];
        for (int i = 0; i < numInstances; i++) {
            predicts[i] = classify(dataset.instance(i));
        } // Of for i
    }// Of classify

    /**
     * *******************
     * Classify an instance.
     * *******************
     */
    public int classify(Instance paraInstance) {
        if (dataType == NOMINAL) {
            return classifyNominal(paraInstance);
        }
        return -1;
    }// Of classify

    /**
     * *******************
     * Classify an instance with nominal data.
     * *******************
     */
    public int classifyNominal(Instance paraInstance) {
        // Find the biggest one
        double tempBiggest = -10000;
        int resultBestIndex = 0;
        for (int i = 0; i < numClasses; i++) {
            double tempClassProbabilityLaplacian = Math.log(classDistributionLaplacian[i]);
            double tempPseudoProbability = tempClassProbabilityLaplacian;
            for (int j = 0; j < numConditions; j++) {
                int tempAttributeValue = (int) paraInstance.value(j);

                // Laplacian smooth. Sum log probabilities to avoid underflow.
                tempPseudoProbability += Math
                        .log(conditionalProbabilitiesLaplacian[i][j][tempAttributeValue]);
            } // Of for j

            if (tempBiggest < tempPseudoProbability) {
                tempBiggest = tempPseudoProbability;
                resultBestIndex = i;
            } // Of if
        } // Of for i

        return resultBestIndex;
    }// Of classifyNominal

    /**
     * *******************
     * Compute accuracy.
     * *******************
     */
    public double computeAccuracy() {
        double tempCorrect = 0;
        for (int i = 0; i < numInstances; i++) {
            if (predicts[i] == (int) dataset.instance(i).classValue()) {
                tempCorrect++;
            } // Of if
        } // Of for i

        return tempCorrect / numInstances;
    }// Of computeAccuracy

    /**
     * ************************
     * Test nominal data.
     * ************************
     */
    public static void testNominal() {
        System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
        String tempFilename = "D:/Work/sampledata/mushroom.arff";

        NaiveBayes tempLearner = new NaiveBayes(tempFilename);
        tempLearner.setDataType(NOMINAL);
        tempLearner.calculateClassDistribution();
        tempLearner.calculateConditionalProbabilities();
        tempLearner.classify();

        System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
    }// Of testNominal


    /**
     * ************************
     * Test this class.
     *
     * @param args Not used now.
     * ************************
     */
    public static void main(String[] args) {
        testNominal();
    }// Of main
} // Of class NaiveBayes
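
The test above runs on the mushroom dataset, but nothing in the class is specific to it. To reproduce the walkthrough on the weather data from Section I, only tempFilename in testNominal needs to point at a local copy of weather.arff (the path below is an assumption about where the file is saved):

        String tempFilename = "D:/Work/sampledata/weather.arff";

With the 14 instances listed in Section I, calculateClassDistribution should then report a class distribution of 5/14 for Play = N and 9/14 for Play = P.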

2. Run Screenshot

(The original post shows a screenshot of the program output here.)

Summary

Advantages:

(1) The algorithm's logic is simple and it is easy to implement.

(2) Classification incurs little time and space overhead.

Disadvantages:

In theory, the naive Bayes model has the smallest error rate compared with other classification methods. In practice this does not always hold, because the model assumes that attributes are mutually independent, and that assumption often fails in real applications; when there are many attributes, or the attributes are strongly correlated, classification performance suffers.

Reprinted from blog.csdn.net/qq_44309220/article/details/124597956