1. Overview

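Spark wraps the common machine learning algorithms well. This post uses a text classification example to show briefly how MLlib is used.
Raw feature data is usually stored as Vector objects, and labeled or rated data as LabeledPoint or Rating objects. These objects are held in RDDs, which MLlib models then use for training and prediction.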

2. Example

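Read the input files (sc is the SparkContext that the pyspark shell provides; each line is treated as one document):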

spam = sc.textFile("spam.txt")
normal = sc.textFile("normal.txt")

Build the term-frequency vectors:

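from pyspark.mllib.feature import HashingTF
# Hash every word into one of 100 feature dimensions
tf = HashingTF(numFeatures=100)
# Split each line on spaces and turn the word list into a term-frequency vector
spamFeatures = spam.map(lambda x: tf.transform(x.split(" ")))
normalFeatures = normal.map(lambda x: tf.transform(x.split(" ")))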

Let's look at what the term-frequency vectors contain:

>>> spamFeatures.collect()
[SparseVector(100, {36: 1.0, 40: 1.0, 49: 1.0}), SparseVector(100, {36: 1.0, 42: 1.0, 63: 1.0}), SparseVector(100, {26: 1.0, 46: 1.0, 65: 1.0})]

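Roughly speaking, each document becomes a 100-dimensional sparse vector. Every word is hashed to one dimension, which is recorded as the key, and the value stores how many times the words in that dimension occur. With only 100 dimensions, different words can land on the same index.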
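To make the hashing idea concrete, here is a minimal pure-Python sketch that needs no Spark. The function name hashing_tf and the use of zlib.crc32 are assumptions made for illustration; Spark uses its own hash function, so the indices will not match the output shown above.

```python
import zlib
from collections import Counter


def hashing_tf(words, num_features=100):
    """Return a sparse term-frequency vector as a dict {index: count}."""
    counts = Counter()
    for word in words:
        # crc32 is a stand-in hash picked for reproducibility (an assumption);
        # it is not the hash function Spark's HashingTF uses.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(sorted(counts.items()))


vec = hashing_tf("spark spark spark".split(" "))
print(vec)  # a single index mapped to 3.0
```

The same word always hashes to the same index, so repeated words accumulate in one entry, while two different words may share an index when num_features is small.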
Next, attach labels and build the training data:

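from pyspark.mllib.regression import LabeledPoint
# Label spam as 1 and normal messages as 0
p = spamFeatures.map(lambda x: LabeledPoint(1, x))
n = normalFeatures.map(lambda x: LabeledPoint(0, x))
trainData = p.union(n)
# Cache the data because the training algorithm iterates over it
trainData.cache()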

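Train a logistic regression model with stochastic gradient descent (LogisticRegressionWithSGD has been deprecated since Spark 2.0 in favor of LogisticRegressionWithLBFGS or the ml package):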

from pyspark.mllib.classification import LogisticRegressionWithSGD
model = LogisticRegressionWithSGD.train(trainData)
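For intuition about what training does, the following is a minimal pure-Python sketch of logistic regression fitted by stochastic gradient descent on sparse {index: value} features. It is not Spark's implementation: the names train_sgd, predict_proba and predict_label, the step size, the epoch count, the toy data and the 0.5 decision threshold are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of label 1 for a sparse feature dict {index: value}."""
    z = bias + sum(weights[i] * v for i, v in features.items())
    return sigmoid(z)


def train_sgd(data, num_features, epochs=200, step=0.5, seed=0):
    """Fit weights and bias by updating after every single example."""
    rng = random.Random(seed)
    weights = [0.0] * num_features
    bias = 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for label, features in data:
            # Gradient of the log loss with respect to z is (p - label)
            error = predict_proba(weights, bias, features) - label
            for i, v in features.items():
                weights[i] -= step * error * v
            bias -= step * error
    return weights, bias


# Hypothetical toy data: label 1 uses indices 0 and 1, label 0 uses 2 and 3
toy = [
    (1, {0: 1.0, 1: 1.0}),
    (1, {1: 2.0}),
    (0, {2: 1.0, 3: 1.0}),
    (0, {3: 3.0}),
]
weights, bias = train_sgd(toy, num_features=4)


def predict_label(features, threshold=0.5):
    return 1 if predict_proba(weights, bias, features) > threshold else 0


print([predict_label(f) for _, f in toy])
```

Each update nudges the weights of the features present in one example toward its label, which is why sparse vectors keep the per-example cost low.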

Then run a quick test:

>>> pt = tf.transform("Get cheap stuff by sending money to me".split(" "))
>>> nt = tf.transform("Spark spark spark".split(" "))
>>> model.predict(pt)
1
>>> model.predict(nt)
0
