K-means

Randomly initialize the centroids of the $K$ clusters: $\mu = [\mu_1, \mu_2, \ldots, \mu_K]$

In each iteration:

  1. Assign each sample to the nearest cluster: $c^{(i)} = \arg\min_j \|x^{(i)} - \mu_j\|^2$

  2. Update the centroid of each cluster: $\mu_j = \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$
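A minimal vectorized sketch of these two steps in NumPy (kmeans_step is a hypothetical helper name; it assumes X is an m×n data matrix, mu is a K×n centroid matrix, and no cluster ever goes empty):

import numpy as np

def kmeans_step(X, mu):
    # Step 1: squared distance from every sample to every centroid,
    # then assign each sample to its nearest centroid
    dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    c = dists.argmin(axis=1)
    # Step 2: move each centroid to the mean of its assigned samples
    mu_new = np.array([X[c == j].mean(axis=0) for j in range(len(mu))])
    return c, mu_new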

Define the distortion function $J(c, \mu) = \sum_{i=1}^{m} \|x^{(i)} - \mu_{c^{(i)}}\|^2$, i.e., the sum of squared distances from every sample to the centroid of its assigned cluster.
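The distortion is straightforward to compute from the assignments; a minimal sketch reusing X, mu, and c from the step above (distortion is a hypothetical helper name):

def distortion(X, mu, c):
    # J(c, mu): sum of squared distances from each sample
    # to the centroid of its assigned cluster
    return ((X - mu[c]) ** 2).sum()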

The K-means procedure is in fact coordinate descent on $J$: the assignment step minimizes $J$ over $c$ with $\mu$ fixed, and the update step minimizes $J$ over $\mu$ with $c$ fixed, so every step is guaranteed not to increase $J(c, \mu)$, and the algorithm therefore converges.
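The full descent loop can then be sketched from the helpers above, stopping when $J$ stops decreasing (run_kmeans is a hypothetical name; the assert merely illustrates the monotonicity claim):

def run_kmeans(X, mu):
    J_prev = np.inf
    while True:
        c, mu = kmeans_step(X, mu)   # one coordinate-descent sweep
        J = distortion(X, mu, c)
        assert J <= J_prev + 1e-9    # J never increases
        if J_prev - J < 1e-12:       # converged
            return c, mu, J
        J_prev = J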

$J$ is not a convex function, however, so K-means may end up in a local minimum.

The usual remedy is to run K-means several times with different initializations and keep the run with the smallest $J(c, \mu)$ (with $K$ held fixed).
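Such restarts can be sketched as follows, reusing run_kmeans above (the restart count of 10 is arbitrary, and X is again an assumed data matrix; each run initializes the centroids as K random samples drawn from the data):

K = 2
runs = [run_kmeans(X, X[np.random.choice(len(X), K, replace=False)])
        for _ in range(10)]
c_best, mu_best, J_best = min(runs, key=lambda r: r[2])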

A simple implementation:

import numpy as np

def compute_distance(p1, p2):
    # Squared Euclidean distance between two 2-D points
    d1 = p1[0] - p2[0]
    d2 = p1[1] - p2[1]
    return d1 ** 2 + d2 ** 2


X = np.array([
    [0, 2], [2, 2], [2, 4], [0, 4],
    [0, -2], [-2, -2], [-2, -4], [0, -4],
    [-1, 1], [-1, 0], [-1, -1]
])

# Initial centroids (picked from the data)
c1 = np.array([0, 2])
c2 = np.array([2, 2])

cluster = np.zeros(len(X))
cluster_last = np.zeros(len(X))

epoch = 1
while True:
    print('\n\n[epoch %d]' % epoch)
    epoch += 1

    # Step 1: assign each point to the nearer of c1 and c2
    for i in range(len(X)):
        d1 = compute_distance(X[i], c1)
        d2 = compute_distance(X[i], c2)
        cluster[i] = 1 if d1 <= d2 else 2

    print('Assignments:', list(cluster.astype(int)))

    if list(cluster) == list(cluster_last):
        print('Assignments no longer change; stopping.')
        break

    cluster_last = cluster.copy()

    # Step 2: move each centroid to the mean of its assigned points
    # (assumes neither cluster is empty)
    temp1 = np.zeros(2)
    temp2 = np.zeros(2)

    for i in range(len(X)):
        if cluster[i] == 1:
            temp1 += X[i]
        else:
            temp2 += X[i]

    c1 = temp1 / (cluster == 1).sum()
    c2 = temp2 / (cluster == 2).sum()

    print('Updated centroids:', c1, c2)

Reprinted from blog.csdn.net/o0Helloworld0o/article/details/81381291