This post is my own original work and serves only as a personal study record.
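The idea behind the KNN algorithm is as follows: compute the Euclidean distance (other distance metrics also work) from the unknown data point to every known data point, sort those distances in ascending order, and look at the first K entries of the sorted list. The class that appears most often among those K nearest points is the class the unknown point is most similar to.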
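As a small illustration of the distance step, here is a sketch of the Euclidean distance next to one possible alternative, the Manhattan distance. The function names `euclidean` and `manhattan` and the sample points are my own choices for illustration, not part of the original code.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences, one possible alternative metric."""
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```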
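Here is a worked example. Suppose there are three categories of goods: snacks, bags, and appliances. Each item has two attributes, price and rating. (I stored the data ahead of time in three separate TXT files; in practice the data would first have to be extracted from a large raw dataset.) Now there is a new item with price 10 and rating 10, and we want to predict which category it belongs to.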
Below are the contents of the three txt files. Each line has the form price,rating
1. 零食.txt (snacks)
0.5,10
1,20
2,30
2. 包.txt (bags)
100,3
300,2
500,1
3. 电器.txt (appliances)
50,5
40,8
30,9
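The procedure is: first read the three classes of data, building one two-dimensional list per class; then compute the distance from every sample to the new item's coordinates (10, 10) and collect all the distances in one list; sort that list in ascending order and keep the first K entries; finally, find the class that occurs most often among those K entries. The new item most likely belongs to that class.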
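The steps above can be sketched compactly with the data held in memory. This is only a sketch under a few assumptions: the names `samples` and `knn_predict` and the English class labels are mine, `collections.Counter` stands in for manual vote counting, and K is fixed at 5 for the nine samples.

```python
import math
from collections import Counter

# Training samples copied from the three files above: (price, rating) per class.
samples = {
    "snack": [(0.5, 10), (1, 20), (2, 30)],
    "bag": [(100, 3), (300, 2), (500, 1)],
    "appliance": [(50, 5), (40, 8), (30, 9)],
}

def knn_predict(query, samples, k):
    # Pair every distance with its class label so the label survives sorting.
    labelled = [
        (math.hypot(query[0] - p[0], query[1] - p[1]), label)
        for label, points in samples.items()
        for p in points
    ]
    labelled.sort()
    # Count the labels of the k nearest samples and return the most common one.
    votes = Counter(label for _, label in labelled[:k])
    return votes.most_common(1)[0][0]

k = 5  # assumed value for this sketch
print(knn_predict((10, 10), samples, k))  # snack
```

Among the five nearest samples, three are snacks and two are appliances, so the vote goes to the snack class.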
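This post uses K = (total number of samples) / 2 + 1, with the division rounded down, which gives K = 5 for the nine samples here. In practice the choice of K is the thing to be most careful about in KNN, because it directly affects the final result.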
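To see how K can change the outcome, here is a small sketch on the same nine samples. The query point (17, 10) is a hypothetical item I picked for illustration; it does not appear in the original post, and the names `data` and `predict` are likewise my own.

```python
import math
from collections import Counter

# (price, rating, label) rows copied from the sample data in this post.
data = [
    (0.5, 10, "snack"), (1, 20, "snack"), (2, 30, "snack"),
    (100, 3, "bag"), (300, 2, "bag"), (500, 1, "bag"),
    (50, 5, "appliance"), (40, 8, "appliance"), (30, 9, "appliance"),
]

def predict(query, k):
    # Rank all samples by Euclidean distance to the query, nearest first.
    ranked = sorted(
        data,
        key=lambda row: math.hypot(query[0] - row[0], query[1] - row[1]),
    )
    # Majority vote among the k nearest samples.
    votes = Counter(row[2] for row in ranked[:k])
    return votes.most_common(1)[0][0]

query = (17, 10)  # hypothetical item, not from the original post
print(predict(query, 1))  # appliance
print(predict(query, 3))  # snack
```

With K = 1 the single nearest sample (30, 9) decides the class; with K = 3 the two next-nearest snack samples outvote it.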
Here is the code:
#coding=utf-8
from __future__ import division, print_function
import math
import numpy

def getfooddata(food_file):
    # Read the snack file: one "price,rating" pair per line, converted to floats.
    fooddata=[]
    with open(food_file,'r') as foodlines:
        for line in foodlines:
            line=line.strip()
            if line:
                fooddata.append([float(x) for x in line.split(',')])
    return fooddata

def getpacketdata(pack_file):
    # Read the bag file in the same format.
    packetdata=[]
    with open(pack_file,'r') as packetlines:
        for line in packetlines:
            line=line.strip()
            if line:
                packetdata.append([float(x) for x in line.split(',')])
    return packetdata

def getelcdata(elc_file):
    # Read the appliance file in the same format.
    elcdata=[]
    with open(elc_file,'r') as elclines:
        for line in elclines:
            line=line.strip()
            if line:
                elcdata.append([float(x) for x in line.split(',')])
    return elcdata

def KNN_arithm(fooddata,packetdata,elcdata,expected_data):
    # Each entry is (distance, class index): 0 = snack, 1 = bag, 2 = appliance.
    # Storing the class next to the distance avoids looking the class up by
    # distance value afterwards, which goes wrong when two classes tie.
    distance=[]
    for label,data in enumerate([fooddata,packetdata,elcdata]):
        for point in data:
            d=math.sqrt(numpy.square(expected_data[0]-point[0])+numpy.square(expected_data[1]-point[1]))
            distance.append((d,label))
    # Sort ascending by distance and keep the K nearest samples.
    distance.sort()
    k=int(len(distance)/2)+1
    distance=distance[0:k]
    # Count the votes per class: [snack, bag, appliance].
    final=[0,0,0]
    for d,label in distance:
        final[label]=final[label]+1
    final_index=final.index(max(final))
    if final_index==0:
        print("该类别为零食")  # "this item is a snack"
    elif final_index==1:
        print("该类别为包")  # "this item is a bag"
    else:
        print("该类别为电器")  # "this item is an appliance"

if __name__ =="__main__":
    expected_data=[10,10]  # the new item: price 10, rating 10
    food_file=u"D:\\eclipse\\eclipse_workplace\\yuntu\\src\\零食.txt"
    pack_file=u"D:\\eclipse\\eclipse_workplace\\yuntu\\src\\包.txt"
    elc_file=u"D:\\eclipse\\eclipse_workplace\\yuntu\\src\\电器.txt"
    fooddata=getfooddata(food_file)
    packetdata=getpacketdata(pack_file)
    elcdata=getelcdata(elc_file)
    print(fooddata,packetdata,elcdata)
    KNN_arithm(fooddata, packetdata, elcdata, expected_data)
Here is the output of my run (the last line, 该类别为零食, means "this item is a snack"):
[[0.5, 10.0], [1.0, 20.0], [2.0, 30.0]] [[100.0, 3.0], [300.0, 2.0], [500.0, 1.0]] [[50.0, 5.0], [40.0, 8.0], [30.0, 9.0]]
该类别为零食
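The three loader functions in the listing differ only in their variable names, so one function could serve all three files. Below is a sketch of such a loader; the name `load_points` is hypothetical, and the temporary file merely stands in for 零食.txt so that the example runs anywhere.

```python
import os
import tempfile

def load_points(path):
    """Read 'price,rating' lines into a list of [float, float] rows."""
    points = []
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines such as a trailing empty line
                points.append([float(value) for value in line.split(",")])
    return points

# Demo with a temporary file holding the snack data from this post.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("0.5,10\n1,20\n2,30\n")
    demo_path = tmp.name

snack_points = load_points(demo_path)
os.remove(demo_path)
print(snack_points)  # [[0.5, 10.0], [1.0, 20.0], [2.0, 30.0]]
```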
This post is only a study record, and it has many shortcomings.