At the beginning of this year I studied the algorithms in the book Machine Learning (《机器学习》) and implemented some of them. I am now organizing my notes so I can refer back to them later.
Today I will start with a simple clustering algorithm.
1. K-means Clustering
K-means is a classic distance-based clustering algorithm. It uses distance as the measure of similarity: the closer two objects are, the more similar they are considered to be. The algorithm assumes that a cluster consists of objects that are close to one another, so its goal is to produce clusters that are compact and well separated. The choice of the k initial cluster centers has a large influence on the result, because in its first step the algorithm picks k arbitrary objects at random as the initial centers, each of which initially represents one cluster. In every iteration, each remaining object in the data set is reassigned to the cluster whose center it is nearest to. Once all data objects have been examined, one iteration is complete and new cluster centers are computed. If the value of the cost function J does not change between two consecutive iterations, the algorithm has converged.
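Since distance is the similarity measure, the core operation is the Euclidean distance between two feature vectors. A minimal sketch (the function name `euclidean` is my own, for illustration only):

```python
from math import sqrt

# Euclidean distance between two equal-length vectors:
# the smaller the distance, the more similar the objects.
def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean([0, 0], [3, 4]))  # -> 5.0
```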
The algorithm proceeds as follows:
1) Randomly select K of the N documents as the initial centroids.
2) For each remaining document, measure its distance to every centroid and assign it to the class of the nearest one.
3) Recompute the centroid of each class so obtained.
4) Iterate steps 2 and 3 until the new centroids equal the old ones or move less than a given threshold, then stop.
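The four steps above can be sketched as a generic loop. This is a hedged sketch in plain Python with helper names (`dist`, `centroid`, `kmeans`) of my own choosing; unlike the full program later in this post, it works for any K:

```python
import random
from math import sqrt

def dist(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, tol=1e-6, max_iter=100):
    centers = random.sample(points, k)            # step 1: random initial centroids
    for _ in range(max_iter):
        # step 2: assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[i].append(p)
        # step 3: recompute centroids (keep the old one if a cluster is empty)
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # step 4: stop when the centroids no longer move
        if max(dist(a, b) for a, b in zip(centers, new_centers)) < tol:
            break
        centers = new_centers
    return clusters, centers
```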
In concrete terms:
Input: k, data[n];
(1) Choose k initial centers, e.g. c[0]=data[0], …, c[k-1]=data[k-1];
(2) Compare each of data[0]…data[n-1] with c[0]…c[k-1]; if its distance to c[i] is smallest, label the point i;
(3) For all points labeled i, recompute c[i] = { sum of all data[j] labeled i } / number of points labeled i;
(4) Repeat (2) and (3) until the change in every c[i] is below a given threshold.
Advantages of the algorithm
The main strengths of K-means clustering are:
1. It is fast and simple;
2. It is efficient and scalable on large data sets;
3. Its time complexity is close to linear, which makes it suitable for mining large-scale data sets. The time complexity of K-means is O(nkt), where n is the number of objects in the data set, t is the number of iterations, and k is the number of clusters.
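The O(nkt) bound is easy to see: every iteration computes the distance from each of the n objects to each of the k centers. A small counting sketch (the `dist` helper here is my own, purely for illustration):

```python
from math import sqrt

calls = 0  # count how many distance evaluations happen

def dist(a, b):
    global calls
    calls += 1
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# One assignment step over n points and k centers.
points = [[float(i)] for i in range(12)]   # n = 12
centers = [[0.0], [5.0], [10.0]]           # k = 3
labels = [min(range(3), key=lambda j: dist(p, centers[j])) for p in points]

print(calls)  # -> 36, i.e. n * k distance computations per iteration
```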
The example below uses the gross domestic product of China's major cities from 2011 to 2015 and clusters the cities into three groups: first-, second-, and third-tier cities.
```python
# coding: utf-8
from math import sqrt
import random
import numpy
import xlrd

# Data-loading function: returns a dict mapping each city name
# to its list of five yearly GDP values.
def storage():
    table = xlrd.open_workbook('E:/citydata.xls')
    sheet = table.sheets()[0]
    dict1 = {}
    for i in range(4, 40):
        row = sheet.row_values(i)
        dict1[row[0]] = row[1:]   # city name -> 5-year data
    return dict1

# Euclidean distance between two city vectors.
def distance(elem, K_mean):
    total = 0.0
    for i in range(len(elem)):
        total += (elem[i] - K_mean[i]) ** 2
    return sqrt(total)

# Centroid (mean vector) of a cluster.
def mean(cluster):
    S = numpy.zeros(len(cluster[0]))
    for every in cluster:
        S = S + numpy.array(every)
    return list(S / len(cluster))

# Element-wise comparison of two vectors.
def compare(list1, list2):
    if len(list1) != len(list2):
        return False
    for x, y in zip(list1, list2):
        if x != y:
            return False
    return True

# Cost function: sum of distances from each point to its cluster center.
def Cost_function(K1_cluster, K2_cluster, K3_cluster, K1_mean, K2_mean, K3_mean):
    cost = 0.0
    for each in K1_cluster:
        cost += distance(each, K1_mean)
    for every in K2_cluster:
        cost += distance(every, K2_mean)
    for single in K3_cluster:
        cost += distance(single, K3_mean)
    return cost

# Assignment step: put each city into the cluster of its nearest center.
def Step_one(dict1, K1, K2, K3):
    K1_cluster, K2_cluster, K3_cluster = [], [], []
    for each in dict1:
        dist1 = distance(dict1[each], K1)
        dist2 = distance(dict1[each], K2)
        dist3 = distance(dict1[each], K3)
        Min = min(dist1, dist2, dist3)
        if Min == dist1:
            K1_cluster.append(dict1[each])
        elif Min == dist2:
            K2_cluster.append(dict1[each])
        else:
            K3_cluster.append(dict1[each])
    return K1_cluster, K2_cluster, K3_cluster

# Update step: recompute the center vector of each cluster.
def Step_two(K1_cluster, K2_cluster, K3_cluster):
    return mean(K1_cluster), mean(K2_cluster), mean(K3_cluster)

# Clustering function: alternate the two steps until the cost
# decreases by less than the threshold, then return the clusters.
def K_means(dict1, K):
    keys = random.sample(list(dict1), K)   # pick K cities as initial centers
    K1, K2, K3 = dict1[keys[0]], dict1[keys[1]], dict1[keys[2]]
    clu1, clu2, clu3 = Step_one(dict1, K1, K2, K3)            # first pass
    K1_mean, K2_mean, K3_mean = Step_two(clu1, clu2, clu3)
    cost1 = Cost_function(clu1, clu2, clu3, K1_mean, K2_mean, K3_mean)
    new_clu1, new_clu2, new_clu3 = Step_one(dict1, K1_mean, K2_mean, K3_mean)  # second pass
    K11_mean, K22_mean, K33_mean = Step_two(new_clu1, new_clu2, new_clu3)
    cost2 = Cost_function(new_clu1, new_clu2, new_clu3, K11_mean, K22_mean, K33_mean)
    error = cost1 - cost2
    while error > 0.5:   # iterate until the cost stops improving
        cost1 = cost2
        new_clu1, new_clu2, new_clu3 = Step_one(dict1, K11_mean, K22_mean, K33_mean)
        K11_mean, K22_mean, K33_mean = Step_two(new_clu1, new_clu2, new_clu3)
        cost2 = Cost_function(new_clu1, new_clu2, new_clu3, K11_mean, K22_mean, K33_mean)
        error = cost1 - cost2
    return new_clu1, new_clu2, new_clu3

if __name__ == '__main__':
    dict1 = storage()
    K = 3
    K1_cluster, K2_cluster, K3_cluster = K_means(dict1, K)
    list1, list2, list3 = [], [], []
    for each in dict1:
        for i in K1_cluster:
            if compare(dict1[each], i):
                list1.append(each)
        for j in K2_cluster:
            if compare(dict1[each], j):
                list2.append(each)
        for k in K3_cluster:
            if compare(dict1[each], k):
                list3.append(each)
    print(' '.join(list1))
    print(' '.join(list2))
    print(' '.join(list3))
```
The results are shown below. I used an error threshold of 0.5; a fixed number of iterations could be used instead.