K-means

Randomly initialize the centroids of the $K$ clusters: $\mu = [\mu_1, \mu_2, \ldots, \mu_K]$

In each iteration:

  1. Assign each sample to the nearest cluster: $c^{(i)} = \arg\min_j \|x^{(i)} - \mu_j\|^2$

  2. Update the centroid of each cluster: $\mu_j = \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$
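A minimal vectorized sketch of these two steps in NumPy (kmeans_step is a hypothetical helper name; it assumes X is an m×n data matrix, mu is a K×n centroid matrix, and no cluster ever goes empty):

import numpy as np

def kmeans_step(X, mu):
    # Step 1: squared distance from every sample to every centroid,
    # then assign each sample to its nearest centroid
    dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    c = dists.argmin(axis=1)
    # Step 2: move each centroid to the mean of its assigned samples
    mu_new = np.array([X[c == j].mean(axis=0) for j in range(len(mu))])
    return c, mu_new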

Define the distortion function $J(c, \mu) = \sum_{i=1}^{m} \|x^{(i)} - \mu_{c^{(i)}}\|^2$, i.e., the sum of squared distances from every sample to the centroid of its assigned cluster.
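The distortion is straightforward to compute from the assignments; a minimal sketch reusing X, mu, and c from the step above (distortion is a hypothetical helper name):

def distortion(X, mu, c):
    # J(c, mu): sum of squared distances from each sample
    # to the centroid of its assigned cluster
    return ((X - mu[c]) ** 2).sum()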

The K-means procedure is in fact coordinate descent on $J$: the assignment step minimizes $J$ over $c$ with $\mu$ fixed, and the update step minimizes $J$ over $\mu$ with $c$ fixed, so every step is guaranteed not to increase $J(c, \mu)$, and the algorithm therefore converges.
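The full descent loop can then be sketched from the helpers above, stopping when $J$ stops decreasing (run_kmeans is a hypothetical name; the assert merely illustrates the monotonicity claim):

def run_kmeans(X, mu):
    J_prev = np.inf
    while True:
        c, mu = kmeans_step(X, mu)   # one coordinate-descent sweep
        J = distortion(X, mu, c)
        assert J <= J_prev + 1e-9    # J never increases
        if J_prev - J < 1e-12:       # converged
            return c, mu, J
        J_prev = J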

$J$ is not a convex function, however, so K-means may end up in a local minimum.

The usual remedy is to run K-means several times with different initializations and keep the run with the smallest $J(c, \mu)$ (with $K$ held fixed).
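Such restarts can be sketched as follows, reusing run_kmeans above (the restart count of 10 is arbitrary, and X is again an assumed data matrix; each run initializes the centroids as K random samples drawn from the data):

K = 2
runs = [run_kmeans(X, X[np.random.choice(len(X), K, replace=False)])
        for _ in range(10)]
c_best, mu_best, J_best = min(runs, key=lambda r: r[2])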

A simple implementation:

import numpy as np

def compute_distance(p1, p2):
    # Squared Euclidean distance between two 2-D points
    d1 = p1[0] - p2[0]
    d2 = p1[1] - p2[1]
    return d1 ** 2 + d2 ** 2


X = np.array([
    [0, 2], [2, 2], [2, 4], [0, 4],
    [0, -2], [-2, -2], [-2, -4], [0, -4],
    [-1, 1], [-1, 0], [-1, -1]
])

# Initial centroids (picked from the data)
c1 = np.array([0, 2])
c2 = np.array([2, 2])

cluster = np.zeros(len(X))
cluster_last = np.zeros(len(X))

epoch = 1
while True:
    print('\n\n[epoch %d]' % epoch)
    epoch += 1

    # Step 1: assign each point to the nearer of c1 and c2
    for i in range(len(X)):
        d1 = compute_distance(X[i], c1)
        d2 = compute_distance(X[i], c2)
        cluster[i] = 1 if d1 <= d2 else 2

    print('Assignments:', list(cluster.astype(int)))

    if list(cluster) == list(cluster_last):
        print('Assignments no longer change; stopping.')
        break

    cluster_last = cluster.copy()

    # Step 2: move each centroid to the mean of its assigned points
    # (assumes neither cluster is empty)
    temp1 = np.zeros(2)
    temp2 = np.zeros(2)

    for i in range(len(X)):
        if cluster[i] == 1:
            temp1 += X[i]
        else:
            temp2 += X[i]

    c1 = temp1 / (cluster == 1).sum()
    c2 = temp2 / (cluster == 2).sum()

    print('Updated centroids:', c1, c2)

Reprinted from blog.csdn.net/o0Helloworld0o/article/details/81381291