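Requirement: use a clustering algorithm to obtain a center point from a batch of latitude/longitude coordinates and to filter out the outliers.
Clustering principle: at each step, add the point closest to the boundary of the current cluster.
Algorithm flow: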
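Core point: a point with at least the minimum number of samples (min points) within its radius.
Border point: a point with fewer than the minimum number of samples within its radius, but which lies within the radius of at least one core point.
Noise point: a point that is neither a core point nor a border point, i.e. it has no core point within its radius.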
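The three definitions above can be sketched in plain Python. This is only an illustration: the helper name classify_points and the toy coordinates are made up for this example, and the neighborhood count includes the point itself (the convention scikit-learn uses for min_samples).

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (hypothetical helper)."""
    n = len(points)
    # Indices of all points within eps of point i, including i itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            # Not dense enough itself, but inside a core point's radius.
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds

# Toy data, not taken from the latitude/longitude set used later.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
kinds = classify_points(points, eps=1.0, min_pts=3)
print(kinds)  # ['core', 'border', 'core', 'border', 'noise']
```

The pairwise scan is O(n^2), which is fine for a few dozen points; a spatial index would be the usual choice for larger sets.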
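1. Using the radius and the minimum number of points, classify every point as a core point, border point, or noise point.
2. Assign the core points to clusters first:
    randomly select a core point to start the current cluster
    while unassigned core points remain:
        find the unassigned core point nearest to the current cluster, i.e. nearest to a core point on the cluster's edge
        if that core point is within the radius of an edge core point of the current cluster:
            assign it to the current cluster
        if every remaining unassigned core point is farther than the radius from the current cluster's boundary:
            the current cluster (cluster numC) is complete, because even its nearest core point lies outside its neighborhood;
            start a new cluster by randomly selecting one of the remaining unassigned core points
    once all core points have been assigned, stop iterating
3. Then assign each border point to a cluster, based on the radius and the already-assigned core points.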
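The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the scikit-learn implementation: dbscan_sketch and near are hypothetical names, the seed of each cluster is taken in index order instead of at random so the output is reproducible, and the cluster is grown with a breadth-first search, which reaches the same core points as repeatedly adding the nearest core point within the radius.

```python
import math
from collections import deque

def dbscan_sketch(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)

    def near(i, j):
        return math.dist(points[i], points[j]) <= eps

    # Step 1: core points have at least min_pts samples (self included) within eps.
    core = [i for i in range(n)
            if sum(near(i, j) for j in range(n)) >= min_pts]

    labels = [-1] * n
    cluster = 0
    # Step 2: group core points; a cluster keeps growing while some
    # unassigned core point lies within eps of one of its core points.
    for seed in core:
        if labels[seed] != -1:
            continue
        labels[seed] = cluster
        queue = deque([seed])
        while queue:
            cur = queue.popleft()
            for other in core:
                if labels[other] == -1 and near(cur, other):
                    labels[other] = cluster
                    queue.append(other)
        cluster += 1  # no reachable core point left: start a new cluster

    # Step 3: give each border point the label of a core point within eps.
    core_set = set(core)
    for i in range(n):
        if i in core_set:
            continue
        for c in core:
            if near(i, c):
                labels[i] = labels[c]
                break
    return labels

# Toy data: two small groups and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (2.0, 0.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (20.0, 20.0)]
labels = dbscan_sketch(points, eps=1.0, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

A border point within the radius of core points from two different clusters is given to whichever core point is found first here; which cluster it ends up in therefore depends on processing order.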
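Advantages:
1. The number of clusters is determined automatically; it does not have to be assumed in advance.
2. It can find clusters of arbitrary shape, unlike K-means, which tends to find roughly spherical (convex) clusters.
3. It identifies noise points, so it is fairly robust to noise.
Disadvantages:
1. It does not work well on high-dimensional data.
2. It performs poorly when the density of the sample set is uneven, because a single radius and minimum point count apply to every cluster.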
from sklearn.cluster import DBSCAN
import numpy as np
# Latitude/longitude data (degrees)
data = np.array([[44.34855, 129.4578], [44.34855, 129.45781], [44.348557, 129.45782], [44.348526, 129.4578], [44.34851, 129.4578], [44.348545, 129.45784], [44.348526, 129.4578],
[44.348545, 129.45782], [44.348537, 129.4578], [44.348537, 129.45781], [44.348534, 129.4578], [44.34853, 129.4578], [44.348534, 129.4578],
[44.348553, 129.45789], [44.348545, 129.45781], [44.348526, 129.4578], [44.34851, 129.45778], [44.34855, 129.4578], [44.34853, 129.45784], [44.34852, 129.45778],
[44.348534, 129.45784], [44.34852, 129.45778], [44.34853, 129.45784], [44.348526, 129.45778], [44.348537, 129.45778], [44.348522, 129.45782],
[44.348553, 129.45782], [44.348526, 129.45776], [44.348534, 129.45776], [44.348534, 129.45778], [44.34853, 129.45782], [44.348534, 129.45782],
[44.348522, 129.45782], [44.34851, 129.45782], [44.348534, 129.45781], [44.348537, 129.4578], [44.34855, 129.45782], [44.348537, 129.45782],
[44.348526, 129.45781], [44.348553, 129.45781],[44.34853, 129.45784], [44.34855, 129.4578], [44.348545, 129.45781], [44.348537, 129.4578],
[44.34853, 129.45781],[50.3,112.2]])
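# Cluster (eps is in the same units as the data, here degrees)
dbscan = DBSCAN(eps=0.001, min_samples=2).fit(data)
# Cluster label for each point; -1 marks noise
labels = dbscan.labels_
# Compute the center of each cluster and collect the outliers
centers = []
outliers = []
for i in set(labels):
    if i == -1:  # noise points
        outliers.extend(data[labels == i])
        continue
    centers.append(np.mean(data[labels == i], axis=0))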
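# For each cluster center, compute the distance to its nearest data point
distances = []
for center in centers:
    distance = np.sqrt(np.sum(np.square(data - center), axis=1))
    distances.append(np.min(distance))
# Take the cluster center with the smallest such distance as the most central point
index = np.argmin(distances)
result = centers[index]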
print("Most central point:", result)
print("Outliers:", outliers)
Result:
Most central point: [ 44.34853458 129.45780822]
Outliers: [array([ 50.3, 112.2])]
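One thing to keep in mind: the example runs DBSCAN with its default Euclidean metric directly on degrees, so eps=0.001 is a radius in degrees, not in meters. The sketch below uses the haversine formula to estimate what that radius means on the ground near the computed center. The names haversine_m and EARTH_RADIUS_M are introduced here for illustration, and the mean Earth radius is an approximation.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

center = (44.34853458, 129.45780822)  # center printed in the result above
outlier = (50.3, 112.2)               # outlier printed in the result above

# What 0.001 degrees corresponds to, north-south and east-west, at this latitude.
eps_north_south_m = haversine_m(center[0], center[1], center[0] + 0.001, center[1])
eps_east_west_m = haversine_m(center[0], center[1], center[0], center[1] + 0.001)
outlier_distance_m = haversine_m(*center, *outlier)

print(f"0.001 deg north-south: {eps_north_south_m:.1f} m")
print(f"0.001 deg east-west:   {eps_east_west_m:.1f} m")
print(f"center to outlier:     {outlier_distance_m / 1000:.0f} km")
```

At this latitude 0.001 degrees is roughly 111 m north-south but only about 80 m east-west, so the neighborhood is not a true circle on the ground. For tightly packed points like these the difference hardly matters. If it does matter, scikit-learn's DBSCAN also accepts metric='haversine', which expects coordinates in radians and eps as distance divided by the Earth radius.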