If you want to define your own distance function, scikit-learn's k-means won't help: it only supports Euclidean distance.
If you were hoping Spark would do it, Spark's k-means can't either; it is only slightly better in that it supports cosine distance in addition to Euclidean distance.
If you want to plug in a custom distance function, there are two approaches:
1. Implement the algorithm yourself. References worth looking at:
a simple explanation
a simple implementation
a repository with many implementations that do not depend on third-party libraries
a highly upvoted Stack Overflow answer that also implements a simple version
The problem with rolling your own: even setting scalability aside, you still need a multithreaded version, and you have to verify the training accuracy.
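If you do decide to roll your own, the core loop is short. Below is a minimal stdlib-only sketch of Lloyd's iteration with a pluggable distance function (the function name `kmeans_custom` and the toy data are mine, not from any library). One caveat worth knowing: the coordinate-wise mean update is only strictly correct for squared Euclidean distance; for an arbitrary metric it is a heuristic, and k-medoids is the principled alternative.

```python
import random

def kmeans_custom(points, k, distance, iters=20, seed=0):
    """Lloyd-style k-means with a user-supplied distance(p, q) function.

    Note: averaging points as the centroid update is only exact for
    squared Euclidean distance; for other metrics it is a heuristic.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: distance(p, centers[i]))
            clusters[i].append(p)
        # update step: coordinate-wise mean of each non-empty cluster
        for i, members in enumerate(clusters):
            if members:
                centers[i] = [sum(xs) / len(members) for xs in zip(*members)]
    labels = [min(range(k), key=lambda i: distance(p, centers[i])) for p in points]
    return labels, centers

def manhattan(a, b):
    # L1 distance, as an example of a non-Euclidean metric
    return sum(abs(x - y) for x, y in zip(a, b))

pts = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]]
labels, centers = kmeans_custom(pts, 2, manhattan)
```

On well-separated data like this the two groups end up in different clusters regardless of which points are drawn as the initial centers.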
2. Use another open-source project.
You can use nltk:
import nltk
from nltk.cluster.kmeans import KMeansClusterer

NUM_CLUSTERS = <choose a value>
# KMeansClusterer needs a dense array, not a sparse matrix
data = <sparse matrix that you would normally give to scikit>.toarray()

kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(data, assign_clusters=True)
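For reference, `nltk.cluster.util.cosine_distance` returns 1 minus the cosine similarity of the two vectors. A stdlib-only equivalent, in case you want to sanity-check what the clusterer is measuring (assumes non-zero vectors):

```python
import math

def cosine_distance(u, v):
    # 1 - cos(angle between u and v): 0 for parallel vectors, 1 for orthogonal
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

cosine_distance([1, 0], [1, 0])  # parallel -> 0.0
cosine_distance([1, 0], [0, 1])  # orthogonal -> 1.0
```

Note that cosine distance ignores vector length, which is usually what you want for text vectors.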
You can use pyclustering:
from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import type_metric, distance_metric

# any function taking two points and returning a number can serve as the metric
user_function = lambda point1, point2: point1[0] + point2[0] + 2
metric = distance_metric(type_metric.USER_DEFINED, func=user_function)

# create K-Means algorithm with specific distance metric
# ("sample" is your data: a list of points, e.g. loaded with pyclustering.utils.read_sample)
start_centers = [[4.7, 5.9], [5.7, 6.5]]
kmeans_instance = kmeans(sample, start_centers, metric=metric)

# run cluster analysis and obtain results
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()
GitHub repository: https://github.com/annoviko/pyclustering
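Note that the `user_function` above (sum of first coordinates plus 2) is only a placeholder to show the plumbing, not a sensible distance. A realistic custom metric you could pass as `func` would be, for example, Manhattan (L1) distance, sketched here in plain Python:

```python
# Manhattan (L1) distance between two points of equal dimension:
# suitable as the func argument of pyclustering's distance_metric
manhattan = lambda point1, point2: sum(abs(a - b) for a, b in zip(point1, point2))

manhattan([4.7, 5.9], [5.7, 6.5])  # |4.7-5.7| + |5.9-6.5| = 1.6
```

Any function with this shape (two points in, non-negative number out) plugs into `type_metric.USER_DEFINED` the same way.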
You can use tslearn: if you're clustering time series, the tslearn Python package lets you specify a metric (dtw, softdtw, euclidean).
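To give an idea of what the dtw metric does, here is a minimal stdlib-only sketch of dynamic time warping between two 1-D series (no warping window or other tslearn options; the function name `dtw` and the toy series are mine):

```python
import math

def dtw(s, t):
    """DTW distance between two 1-D series: square root of the accumulated
    squared differences along the cheapest monotone alignment path."""
    n, m = len(s), len(t)
    # D[i][j] = cost of the best alignment of s[:i] with t[:j]
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (s[i - 1] - t[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j],      # skip a step in t
                                 D[i][j - 1],      # skip a step in s
                                 D[i - 1][j - 1])  # advance both
    return math.sqrt(D[n][m])

dtw([0, 1, 2], [0, 1, 2])     # identical series -> 0.0
dtw([0, 0, 1, 2], [0, 1, 2])  # time-shifted but same shape -> 0.0
```

This is exactly why DTW beats plain Euclidean distance for time series: two series with the same shape but slightly different timing still get distance 0.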