Clustering
K-means
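K-means is one of the most commonly used clustering algorithms. It groups data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called k-means||.
KMeans is implemented as an Estimator and generates a KMeansModel as the base model.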
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)
# Make predictions
predictions = model.transform(dataset)
# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
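To make the assignment and update loop behind k-means concrete, here is a minimal pure-Python sketch of Lloyd's iteration. It is not MLlib's implementation: it uses plain random initialization instead of k-means||, and the names `lloyd_kmeans`, `squared_distance`, and the toy points are assumptions made up for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, k, max_iter=20, seed=1):
    """Hypothetical helper: plain Lloyd's iteration with random initialization."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random init, not k-means||
    assignments = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, assignments


# Toy data: two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers, assignments = lloyd_kmeans(points, k=2)
print(centers)
print(assignments)
```

On this toy data the two centers should settle near (0.1, 0.1) and (9.1, 9.1), although which cluster gets index 0 depends on the random initialization.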
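Latent Dirichlet allocation (LDA)
LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model. Expert users may cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.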
from pyspark.ml.clustering import LDA
# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")
# Trains an LDA model.
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)
ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))
# Describe topics.
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)
# Shows the result
transformed = model.transform(dataset)
transformed.show(truncate=False)
Bisecting k-means
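Bisecting k-means is a kind of hierarchical clustering that uses a divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.
BisectingKMeans is implemented as an Estimator and generates a BisectingKMeansModel as the base model.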
from pyspark.ml.clustering import BisectingKMeans
# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
# Trains a bisecting k-means model.
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dataset)
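# Evaluate clustering.
# Note: computeCost is deprecated since Spark 3.0.0; ClusteringEvaluator is the suggested replacement.
cost = model.computeCost(dataset)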
print("Within Set Sum of Squared Errors = " + str(cost))
# Shows the result.
print("Cluster Centers: ")
centers = model.clusterCenters()
for center in centers:
    print(center)
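The top-down splitting described in this section can be sketched in pure Python as follows. This is a simplified illustration, not Spark's algorithm: it splits one cluster at a time, always choosing the cluster with the largest sum of squared errors, and Spark's BisectingKMeans may differ in how it selects and splits clusters. All function names (`two_means`, `bisecting_kmeans`, `sse`, and so on) and the toy data are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def two_means(points, rng, max_iter=20):
    """Split one cluster in two with a small 2-means loop."""
    centers = rng.sample(points, 2)
    left, right = [], []
    for _ in range(max_iter):
        left = [p for p in points if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        right = [p for p in points if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        if not left or not right:
            break  # degenerate split (for example, duplicate points)
        new_centers = [mean(left), mean(right)]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return left, right


def bisecting_kmeans(points, k, seed=1):
    """Start with one cluster and bisect until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        # Assumption: split the cluster with the largest SSE next.
        target = max(splittable, key=sse)
        left, right = two_means(target, rng)
        if not left or not right:
            break
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters


# Toy data: three well-separated groups along the diagonal.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2),
          (4.0, 4.0), (4.1, 4.1), (4.2, 4.2),
          (10.0, 10.0), (10.1, 10.1), (10.2, 10.2)]
clusters = bisecting_kmeans(points, k=3)
for cluster in clusters:
    print(sorted(cluster))
```

The first bisection separates one outer group from the other two, and the second bisection splits the remaining pair, which is the "all observations start in one cluster" behavior in miniature.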
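Gaussian Mixture Model (GMM)
A Gaussian mixture model represents a composite distribution in which points are drawn from one of k Gaussian sub-distributions, each with its own probability. The spark.ml implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples. GaussianMixture is implemented as an Estimator and generates a GaussianMixtureModel as the base model.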
from pyspark.ml.clustering import GaussianMixture
# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
gmm = GaussianMixture().setK(2).setSeed(538009335)
model = gmm.fit(dataset)
print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=False)
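For intuition about the expectation-maximization step mentioned above, here is a minimal one-dimensional sketch in pure Python. It is only an illustration under simplifying assumptions (scalar data, a fixed number of iterations, hand-picked starting values, and a small variance floor that I added as a safeguard), not the spark.ml implementation. The names `em_gmm_1d` and `gaussian_pdf` and the toy data are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, means, variances, weights, n_iter=50):
    """Hypothetical helper: EM for a 1-D Gaussian mixture, fixed iteration count."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # assumption: floor to avoid collapse
    return weights, means, variances


# Toy data: two groups centered at 0 and 5.
xs = [-0.2, 0.0, 0.2, 4.8, 5.0, 5.2]
weights, means, variances = em_gmm_1d(
    xs, means=[1.0, 4.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print(weights, means, variances)
```

Each component ends up with its own weight, which corresponds to the per-sub-distribution probability described in the text, plus a mean and a variance, which play the role of the parameters shown in `gaussiansDF`.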