At the beginning of this year I studied the algorithms in the book Machine Learning (《机器学习》) and implemented some of them. I am now organizing my notes so I can refer back to them later.
Today I will start with a simple clustering algorithm.
1. K-means Clustering
K-means is a classic distance-based clustering algorithm. It uses distance as the measure of similarity: the closer two objects are, the more similar they are considered to be. The algorithm assumes that a cluster consists of objects that are close to one another, so its goal is to produce clusters that are compact and well separated. The choice of the k initial cluster centers has a large influence on the result, because in its first step the algorithm picks k arbitrary objects at random as the initial centers, each of which initially represents one cluster. In every iteration, each remaining object in the data set is reassigned to the cluster whose center it is nearest to. Once all data objects have been examined, one iteration is complete and new cluster centers are computed. If the value of the cost function J does not change between two consecutive iterations, the algorithm has converged.
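Since distance is the similarity measure, the core operation is the Euclidean distance between two feature vectors. A minimal sketch (the function name `euclidean` is my own, for illustration only):

```python
from math import sqrt

# Euclidean distance between two equal-length vectors:
# the smaller the distance, the more similar the objects.
def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean([0, 0], [3, 4]))  # -> 5.0
```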
The algorithm proceeds as follows:
1) Randomly select K of the N documents as the initial centroids.
2) For each remaining document, measure its distance to every centroid and assign it to the class of the nearest one.
3) Recompute the centroid of each class so obtained.
4) Iterate steps 2 and 3 until the new centroids equal the old ones or move less than a given threshold, then stop.
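The four steps above can be sketched as a generic loop. This is a hedged sketch in plain Python with helper names (`dist`, `centroid`, `kmeans`) of my own choosing; unlike the full program later in this post, it works for any K:

```python
import random
from math import sqrt

def dist(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, tol=1e-6, max_iter=100):
    centers = random.sample(points, k)            # step 1: random initial centroids
    for _ in range(max_iter):
        # step 2: assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[i].append(p)
        # step 3: recompute centroids (keep the old one if a cluster is empty)
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # step 4: stop when the centroids no longer move
        if max(dist(a, b) for a, b in zip(centers, new_centers)) < tol:
            break
        centers = new_centers
    return clusters, centers
```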
In concrete terms:
Input: k, data[n];
(1) Choose k initial centers, e.g. c[0]=data[0], …, c[k-1]=data[k-1];
(2) Compare each of data[0]…data[n-1] with c[0]…c[k-1]; if its distance to c[i] is smallest, label the point i;
(3) For all points labeled i, recompute c[i] = { sum of all data[j] labeled i } / number of points labeled i;
(4) Repeat (2) and (3) until the change in every c[i] is below a given threshold.
Advantages of the algorithm
The main strengths of K-means clustering are:
1. It is fast and simple;
2. It is efficient and scalable on large data sets;
3. Its time complexity is close to linear, which makes it suitable for mining large-scale data sets. The time complexity of K-means is O(nkt), where n is the number of objects in the data set, t is the number of iterations, and k is the number of clusters.
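The O(nkt) bound is easy to see: every iteration computes the distance from each of the n objects to each of the k centers. A small counting sketch (the `dist` helper here is my own, purely for illustration):

```python
from math import sqrt

calls = 0  # count how many distance evaluations happen

def dist(a, b):
    global calls
    calls += 1
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# One assignment step over n points and k centers.
points = [[float(i)] for i in range(12)]   # n = 12
centers = [[0.0], [5.0], [10.0]]           # k = 3
labels = [min(range(3), key=lambda j: dist(p, centers[j])) for p in points]

print(calls)  # -> 36, i.e. n * k distance computations per iteration
```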
The example below uses the gross domestic product of China's major cities from 2011 to 2015 and clusters the cities into three groups: first-, second-, and third-tier cities.
```python
# coding: utf-8
from math import sqrt
import random
import numpy
import xlrd

# Data-loading function: returns a dict mapping each city name
# to its list of five yearly GDP values.
def storage():
    table = xlrd.open_workbook('E:/citydata.xls')
    sheet = table.sheets()[0]
    dict1 = {}
    for i in range(4, 40):
        row = sheet.row_values(i)
        dict1[row[0]] = row[1:]   # city name -> 5-year data
    return dict1

# Euclidean distance between two city vectors.
def distance(elem, K_mean):
    total = 0.0
    for i in range(len(elem)):
        total += (elem[i] - K_mean[i]) ** 2
    return sqrt(total)

# Centroid (mean vector) of a cluster.
def mean(cluster):
    S = numpy.zeros(len(cluster[0]))
    for every in cluster:
        S = S + numpy.array(every)
    return list(S / len(cluster))

# Element-wise comparison of two vectors.
def compare(list1, list2):
    if len(list1) != len(list2):
        return False
    for x, y in zip(list1, list2):
        if x != y:
            return False
    return True

# Cost function: sum of distances from each point to its cluster center.
def Cost_function(K1_cluster, K2_cluster, K3_cluster, K1_mean, K2_mean, K3_mean):
    cost = 0.0
    for each in K1_cluster:
        cost += distance(each, K1_mean)
    for every in K2_cluster:
        cost += distance(every, K2_mean)
    for single in K3_cluster:
        cost += distance(single, K3_mean)
    return cost

# Assignment step: put each city into the cluster of its nearest center.
def Step_one(dict1, K1, K2, K3):
    K1_cluster, K2_cluster, K3_cluster = [], [], []
    for each in dict1:
        dist1 = distance(dict1[each], K1)
        dist2 = distance(dict1[each], K2)
        dist3 = distance(dict1[each], K3)
        Min = min(dist1, dist2, dist3)
        if Min == dist1:
            K1_cluster.append(dict1[each])
        elif Min == dist2:
            K2_cluster.append(dict1[each])
        else:
            K3_cluster.append(dict1[each])
    return K1_cluster, K2_cluster, K3_cluster

# Update step: recompute the center vector of each cluster.
def Step_two(K1_cluster, K2_cluster, K3_cluster):
    return mean(K1_cluster), mean(K2_cluster), mean(K3_cluster)

# Clustering function: alternate the two steps until the cost
# decreases by less than the threshold, then return the clusters.
def K_means(dict1, K):
    keys = random.sample(list(dict1), K)   # pick K cities as initial centers
    K1, K2, K3 = dict1[keys[0]], dict1[keys[1]], dict1[keys[2]]
    clu1, clu2, clu3 = Step_one(dict1, K1, K2, K3)            # first pass
    K1_mean, K2_mean, K3_mean = Step_two(clu1, clu2, clu3)
    cost1 = Cost_function(clu1, clu2, clu3, K1_mean, K2_mean, K3_mean)
    new_clu1, new_clu2, new_clu3 = Step_one(dict1, K1_mean, K2_mean, K3_mean)  # second pass
    K11_mean, K22_mean, K33_mean = Step_two(new_clu1, new_clu2, new_clu3)
    cost2 = Cost_function(new_clu1, new_clu2, new_clu3, K11_mean, K22_mean, K33_mean)
    error = cost1 - cost2
    while error > 0.5:   # iterate until the cost stops improving
        cost1 = cost2
        new_clu1, new_clu2, new_clu3 = Step_one(dict1, K11_mean, K22_mean, K33_mean)
        K11_mean, K22_mean, K33_mean = Step_two(new_clu1, new_clu2, new_clu3)
        cost2 = Cost_function(new_clu1, new_clu2, new_clu3, K11_mean, K22_mean, K33_mean)
        error = cost1 - cost2
    return new_clu1, new_clu2, new_clu3

if __name__ == '__main__':
    dict1 = storage()
    K = 3
    K1_cluster, K2_cluster, K3_cluster = K_means(dict1, K)
    list1, list2, list3 = [], [], []
    for each in dict1:
        for i in K1_cluster:
            if compare(dict1[each], i):
                list1.append(each)
        for j in K2_cluster:
            if compare(dict1[each], j):
                list2.append(each)
        for k in K3_cluster:
            if compare(dict1[each], k):
                list3.append(each)
    print(' '.join(list1))
    print(' '.join(list2))
    print(' '.join(list3))
```
The results are shown below. I used an error threshold of 0.5; a fixed number of iterations could be used instead.