python-DBSCAN Density Clustering

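1. DBSCAN is a density-based clustering algorithm:

The number of clusters does not need to be specified in advance.
The final number of clusters is not fixed; it depends on the data and on the parameters.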
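To illustrate the point about cluster counts, here is a minimal sketch using scikit-learn's `DBSCAN`. The helper `count_clusters` and the toy coordinates are hypothetical and chosen only for illustration; the same `eps` and `min_samples` are used for both datasets, and the number of clusters found follows from the data alone.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def count_clusters(points, eps=0.5, min_samples=3):
    """Run DBSCAN and return the number of clusters, excluding noise (-1)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(points).labels_
    return len(set(labels)) - (1 if -1 in labels else 0)


# Toy data (made up for this sketch): tight groups that lie far apart.
two_groups = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
three_groups = np.vstack([two_groups, [[10.0, 0.0], [10.1, 0.0], [10.0, 0.1]]])

# The cluster count is never passed in; it is an output of the algorithm.
print(count_clusters(two_groups))    # 2
print(count_clusters(three_groups))  # 3
```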

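2. DBSCAN divides data points into three types:

Core point: has at least MinPts points within radius Eps (scikit-learn counts the point itself)
Border point: has fewer than MinPts points within radius Eps, but lies in the neighborhood of a core point
Noise point: neither a core point nor a border point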
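The three point types can be computed directly from pairwise distances. The sketch below uses only NumPy; `classify_points` is a hypothetical helper, not part of any library. It assumes Euclidean distance and the scikit-learn convention that a point counts as its own neighbor and is a core point when it has at least `min_pts` neighbors.

```python
import numpy as np


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (illustrative helper)."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Neighborhood membership; every point is its own neighbor (distance 0).
    neighbors = dist <= eps
    # Assumption: "at least min_pts", counting the point itself.
    is_core = neighbors.sum(axis=1) >= min_pts
    # A non-core point is a border point if some core point lies within eps.
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))


# Made-up points on a line, for illustration only.
demo = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.8, 0.0], [5.0, 0.0]]
print(classify_points(demo, eps=1.0, min_pts=3).tolist())
# ['core', 'core', 'core', 'border', 'noise']
```

The point at 1.8 has only two neighbors (itself and the point at 1.0), so it is not a core point, but it lies within Eps of a core point and is therefore a border point. The point at 5.0 has no core point nearby and is noise.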

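3. Algorithm steps

Label every point as a core point, a border point, or a noise point;
Remove the noise points;
Connect every pair of core points that lie within Eps of each other with an edge;
Each group of connected core points forms a cluster;
Assign each border point to the cluster of one of its associated core points.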
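The steps above can be sketched in plain NumPy as follows. This is a simplified illustration, not scikit-learn's implementation: `simple_dbscan` is a hypothetical name, distances are Euclidean, a point counts as its own neighbor, and a border point is assigned to the cluster of its lowest-indexed core neighbor (an arbitrary tie-break chosen for this sketch). Noise points are not deleted but keep the label -1, matching the convention of `labels_` in scikit-learn.

```python
import numpy as np


def simple_dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise (illustrative sketch)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    neighbors = np.sqrt((diff ** 2).sum(axis=-1)) <= eps
    # Step 1: find the core points (border/noise are resolved below).
    is_core = neighbors.sum(axis=1) >= min_pts

    labels = np.full(n, -1)  # Step 2: anything never assigned stays noise.
    cluster_id = 0
    for start in range(n):
        if not is_core[start] or labels[start] != -1:
            continue
        # Steps 3-4: follow edges between core points within eps of each
        # other; each connected group of core points becomes one cluster.
        labels[start] = cluster_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(neighbors[i] & is_core):
                if labels[j] == -1:
                    labels[j] = cluster_id
                    stack.append(j)
        cluster_id += 1

    # Step 5: attach each border point to the cluster of a nearby core point.
    for i in range(n):
        if not is_core[i]:
            core_neighbors = np.flatnonzero(neighbors[i] & is_core)
            if core_neighbors.size > 0:
                labels[i] = labels[core_neighbors[0]]
    return labels


# Made-up points on a line, for illustration only.
pts = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.8, 0.0],
       [5.0, 0.0],
       [10.0, 0.0], [10.5, 0.0], [11.0, 0.0]]
print(simple_dbscan(pts, eps=1.0, min_pts=3).tolist())
# [0, 0, 0, 0, -1, 1, 1, 1]
```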

Example: analyzing students' online time:

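'''
    Main DBSCAN parameters:
        eps: maximum distance between two samples for them to be considered neighbors
        min_samples: number of samples in a point's neighborhood (including itself) for it to be a core point
        metric: distance metric
'''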
import numpy as np
from sklearn.cluster import DBSCAN
import sklearn.cluster as skc
from sklearn import metrics
import matplotlib.pyplot as plt

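mac2id=dict()
onlinetimes=[]
# Raw string so the backslashes in the Windows path are not treated as escapes
with open(r"E:\python\online.txt") as f:
    for line in f:
        # Read the MAC address, start hour, and online duration from each record
        print(line)  # the raw record

        fields=line.split(',')
        mac=fields[2]
        print(mac)

        onlinetime=int(fields[6])
        print("Online duration:",onlinetime)

        starttime=int(fields[4].split(' ')[1].split(':')[0])
        print("Start hour:",starttime)

        # mac2id maps each MAC address to its index in onlinetimes,
        # which stores (start hour, online duration) for that address
        if mac not in mac2id:
            mac2id[mac]=len(onlinetimes)
            onlinetimes.append((starttime,onlinetime))
        else:
            # Keep only the latest record for a MAC address already seen
            onlinetimes[mac2id[mac]]=(starttime,onlinetime)
real_x=np.array(onlinetimes).reshape((-1,2))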

# Fit DBSCAN on the start hour; labels holds the cluster label of each sample
x=real_x[:,0:1]
db=skc.DBSCAN(eps=0.01,min_samples=20).fit(x)
labels=db.labels_

# Print the assigned labels and compute the ratio of noise samples (label -1)
print("Labels:")
print(labels)
ratio=len(labels[labels[:]==-1])/len(labels)
print("Noise ratio:",format(ratio,".2%"))


# Count and print the number of clusters, then evaluate the clustering
n_cluster_ = len(set(labels))-(1 if -1 in labels else 0)
print("Estimated number of clusters: %d"%n_cluster_)
print("Silhouette Coefficient : %0.3f"%metrics.silhouette_score(x,labels))

# Print each cluster's label and the data in it
for i in range(n_cluster_):
    print("cluster ",i,":")
    print(list(x[labels==i].flatten()))

# Plot a histogram of the start hours with 24 bins
plt.hist(x,24)
plt.show()



# Cluster again, this time on the log-transformed online duration
x=np.log(1+real_x[:,1:])
db=skc.DBSCAN(eps=0.14,min_samples=10).fit(x)
labels=db.labels_

print("Labels:")
print(labels)
ratio=len(labels[labels[:]==-1])/len(labels)
print("Noise ratio:",format(ratio,".2%"))

n_cluster_=len(set(labels))-(1 if -1 in labels else 0)

print("Estimated number of cluster: %d"%n_cluster_)
print("Silhouette Coefficient : %0.3f"%metrics.silhouette_score(x,labels))


# For each cluster, report the sample count, mean, and standard deviation
for i in range(n_cluster_):
    print("cluster",i,':')
    count=len(x[labels==i])
    mean=np.mean(real_x[labels==i][:,1])
    std=np.std(real_x[labels==i][:,1])
    print("\t number of sample:",count)
    print("\t mean of sample:",format(mean,'.1f'))
    print("\t std of sample:",format(std,'.1f'))
    

plt.hist(x,24)
plt.show()

One day of records extracted from the data file:

Visualization results:

Processed results:


TxyITxs
