Sklearn - Clustering

Sklearn - 2.3. Clustering
https://scikit-learn.org/stable/modules/clustering.html

from sklearn import datasets 
from sklearn.preprocessing import StandardScaler 
from sklearn.cluster import KMeans

iris = datasets.load_iris() 
iris_features = iris.data 
iris_target = iris.target

Using the K-Means clustering algorithm

# Standardize the features
scaler = StandardScaler() 
features_std = scaler.fit_transform(iris_features) 

# Create the K-Means object
cluster = KMeans(n_clusters=3, random_state=0)

model = cluster.fit(features_std)

# View the predicted cluster labels
model.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2,
       0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int32)

# True classes
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
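Cluster ids are arbitrary (cluster 1 above mostly corresponds to true class 0), so the two arrays cannot be compared element by element. One way to quantify the agreement is the adjusted Rand index, which ignores how the labels are numbered. The following is a minimal sketch; the choice of this metric and the explicit n_init=10 are assumptions that do not appear in the original code.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_std = StandardScaler().fit_transform(iris.data)

# n_init is passed explicitly so the behaviour does not depend on the library version
model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(features_std)

# The score is invariant to permutations of the label ids:
# 1.0 means identical partitions, values near 0 mean chance-level agreement
ari = adjusted_rand_score(iris.target, model.labels_)
print(f"Adjusted Rand index: {ari:.3f}")
```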


# Predict the cluster of a new observation
new_ob = [[0.8, 0.8, 0.8, 0.8]]
model.predict(new_ob)

array([2], dtype=int32)
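Because the model was fitted on standardized features, an observation measured in the original units would normally be passed through the same fitted scaler before calling predict. A minimal sketch of that step follows; the measurement new_raw is a hypothetical example (its values happen to equal the first row of the iris data), and n_init=10 is an added assumption.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
scaler = StandardScaler().fit(iris.data)
features_std = scaler.transform(iris.data)
model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(features_std)

# Hypothetical measurement in centimetres:
# sepal length, sepal width, petal length, petal width
new_raw = [[5.1, 3.5, 1.4, 0.2]]

# Apply the mean and scale learned from the training data, then predict
new_std = scaler.transform(new_raw)
label = model.predict(new_std)
print(label)
```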

# View the cluster centers
model.cluster_centers_

array([[-0.05021989, -0.88337647,  0.34773781,  0.2815273 ],
       [-1.01457897,  0.85326268, -1.30498732, -1.25489349],
       [ 1.13597027,  0.08842168,  0.99615451,  1.01752612]])
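The centers above are expressed in standardized units. To read them in centimetres, they can be mapped back with the scaler's inverse_transform. This is a small sketch of that idea; n_init=10 is an added assumption.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
scaler = StandardScaler()
features_std = scaler.fit_transform(iris.data)
model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(features_std)

# Undo the standardization so each center is in the original feature units
centers_original = scaler.inverse_transform(model.cluster_centers_)

for cluster_id, center in enumerate(centers_original):
    described = ", ".join(
        f"{name}={value:.2f}" for name, value in zip(iris.feature_names, center)
    )
    print(f"cluster {cluster_id}: {described}")
```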

Speeding up K-Means clustering with MiniBatchKMeans

batch_size controls the number of observations randomly sampled in each batch; the larger the batch, the more computation each training step requires.

from sklearn.cluster import MiniBatchKMeans

cluster = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100)

model = cluster.fit(features_std)
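The mini-batch variant trades some clustering quality for speed. One way to see the cost of that trade is to compare the inertia (the within-cluster sum of squared distances) of both estimators on the same data. The sketch below assumes explicit n_init values, which are not in the original code; on a dataset as small as iris the speed difference itself is negligible.

```python
from sklearn import datasets
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_std = StandardScaler().fit_transform(iris.data)

full = KMeans(n_clusters=3, random_state=0, n_init=10).fit(features_std)
mini = MiniBatchKMeans(
    n_clusters=3, random_state=0, batch_size=100, n_init=3
).fit(features_std)

# Lower inertia means tighter clusters
print(f"KMeans inertia:          {full.inertia_:.2f}")
print(f"MiniBatchKMeans inertia: {mini.inertia_:.2f}")
```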

Using the MeanShift clustering algorithm

A limitation of KMeans: the number of clusters K must be set in advance, and the shape of the clusters is assumed; MeanShift has neither limitation.

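MeanShift parameters

bandwidth: the radius of the region (kernel) that decides which observations influence each shift; if it is not set, it is estimated automatically from the data.

cluster_all = False: discard orphan observations, that is, those that fall within no kernel; they receive the label -1.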

from sklearn.cluster import MeanShift

cluster = MeanShift(n_jobs=-1)
model = cluster.fit(features_std)
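The two parameters described above can be set explicitly. The sketch below estimates a bandwidth with sklearn.cluster.estimate_bandwidth and turns off cluster_all so that orphans are labeled -1. The quantile value 0.2 is an arbitrary assumption for illustration, not a recommendation from the original text.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_std = StandardScaler().fit_transform(iris.data)

# Smaller quantiles give a smaller bandwidth and usually more clusters
bandwidth = estimate_bandwidth(features_std, quantile=0.2, random_state=0)

model = MeanShift(bandwidth=bandwidth, cluster_all=False).fit(features_std)

n_clusters = len(model.cluster_centers_)
n_orphans = int(np.sum(model.labels_ == -1))
print(f"bandwidth={bandwidth:.3f}, clusters={n_clusters}, orphans={n_orphans}")
```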

Using the DBSCAN clustering algorithm

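Main parameters

eps : the maximum distance between two observations for them to be considered neighbors;
min_samples : the minimum number of neighbors an observation needs within distance eps to count as a core observation;
metric : the distance metric used by eps.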

from sklearn.cluster import DBSCAN

cluster = DBSCAN(n_jobs=-1) 
model = cluster.fit(features_std)

model.labels_

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,
        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  1,
        1,  1,  1,  1,  1, -1, -1,  1, -1, -1,  1, -1,  1,  1,  1,  1,  1,
       -1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1, -1,  1,
        1,  1,  1, -1, -1, -1, -1, -1,  1,  1,  1,  1, -1,  1,  1, -1, -1,
       -1,  1,  1, -1,  1,  1, -1,  1,  1,  1, -1, -1, -1,  1,  1,  1, -1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1])
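In the labels above, -1 marks observations that DBSCAN treats as noise rather than assigning to a cluster. A short sketch for summarizing the result follows; it spells out the default values eps=0.5 and min_samples=5 for readability.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_std = StandardScaler().fit_transform(iris.data)

labels = DBSCAN(eps=0.5, min_samples=5).fit(features_std).labels_

# Noise points carry the label -1 and are not a cluster of their own
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels) - {-1})
cluster_sizes = [int(np.sum(labels == k)) for k in range(n_clusters)]

print(f"clusters: {n_clusters}, sizes: {cluster_sizes}, noise points: {n_noise}")
```

Raising eps or lowering min_samples generally reduces the number of noise points, at the risk of merging clusters that should stay separate.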

Using agglomerative hierarchical clustering with AgglomerativeClustering

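Agglomerative clustering is a powerful, flexible hierarchical clustering algorithm.
In agglomerative clustering, every observation starts out as its own cluster.
Next, clusters that meet some criterion are merged; this merging is repeated, growing the clusters, until some stopping point is reached.
In sklearn, AgglomerativeClustering uses the linkage parameter to choose the merging strategy, which minimizes one of the following:
- ward: the variance of the merged clusters
- average: the average distance between observations of the two clusters
- complete: the maximum distance between observations of the two clusters
Other parameters
- affinity (named metric in scikit-learn 1.2 and later): the distance metric that linkage uses, such as euclidean or manhattan
- n_clusters: the number of clusters the algorithm tries to find; merging stops once n_clusters clusters remain.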

from sklearn.cluster import AgglomerativeClustering 

cluster = AgglomerativeClustering(n_clusters=3)
model = cluster.fit(features_std)

model.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0,
       2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2,
       2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
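To get a feel for how the linkage strategies described above differ, the cluster sizes they produce on the same data can be compared. This is only an illustrative sketch; comparing by cluster size is an assumption made here, not something the original text does.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_std = StandardScaler().fit_transform(iris.data)

sizes = {}
for linkage in ("ward", "average", "complete"):
    labels = AgglomerativeClustering(
        n_clusters=3, linkage=linkage
    ).fit_predict(features_std)
    # Number of observations assigned to each of the three clusters
    sizes[linkage] = np.bincount(labels)

for linkage, counts in sizes.items():
    print(f"{linkage:>8}: {counts.tolist()}")
```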
