Copyright notice: This is an original article by the blogger; reproduction without permission is prohibited. https://blog.csdn.net/lx_ros/article/details/81358198
GREAT THANKS TO: http://cs231n.github.io/classification/
- 1. The k-nearest neighbor (kNN) algorithm
- Given a training data set, for a new input instance, find the k instances in the training set nearest to it; the new instance is assigned to the class held by the majority of those k instances. When k = 1 this is called the nearest neighbor (NN) algorithm.
- The three basic elements of kNN:
    - Choice of k: decreasing k makes the overall model more complex and prone to overfitting; increasing k makes the model simpler. k = N discards the large amount of useful information in the training instances entirely: every new input is simply assigned to the most common class in the training set, and classifying purely by label frequency is clearly unreliable.
    In the figure above, the NN classifier (k = 1) also carves out the few green outliers inside the blue region; such a model overfits easily and generalizes poorly. The 5-NN classifier on the far right smooths those green outliers away and generalizes better.
    - Distance metric: when the inputs are two images, they can be flattened into two vectors I_1 and I_2. A first choice is the L1 distance, d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|, whose computation (pixel-wise absolute differences, summed) is visualized in the figure. The L2 distance can also be used; its geometric meaning is the Euclidean distance between the two vectors, d_2(I_1, I_2) = \sqrt{\sum_p (I_1^p - I_2^p)^2}.
    - Decision rule: majority vote.
- Pros and cons of the nearest neighbor algorithm
    - Pros: easy to implement and understand; requires no training.
    - Cons: prediction (test time) is expensive.
    - Cons: when the input is high-dimensional, e.g. a large image, metrics such as the L2 distance have no direct perceptual meaning.
    As the figure below shows, distances on high-dimensional pixel data are very unintuitive: the leftmost image is the original, and the three images to its right all have the same L2 distance to it, yet visually and semantically the three have nothing in common with it. L1 and L2 distances correlate mainly with an image's background and color distribution.
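The two metrics above can be sketched in a few lines of NumPy (the 2x2 "images" below are made-up illustrative values, not data from this post):

```python
import numpy as np

# Two tiny 2x2 "images", flattened into vectors (made-up values).
I1 = np.array([10, 20, 30, 40], dtype=float)
I2 = np.array([12, 18, 33, 44], dtype=float)

# L1 distance: sum of absolute pixel-wise differences.
d1 = np.sum(np.abs(I1 - I2))              # 2 + 2 + 3 + 4 = 11

# L2 distance: Euclidean distance between the two vectors.
d2 = np.sqrt(np.sum(np.square(I1 - I2)))  # sqrt(4 + 4 + 9 + 16) = sqrt(33)

print(d1, d2)
```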
- 2. Code implementation
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Aug 2 09:46:44 2018
@author: rd
"""
import numpy as np


class KNearestNeighbor(object):
    """A kNN classifier using the (squared) L2 distance."""

    def __init__(self):
        pass

    def train(self, X, Y):
        """In kNN, "training" just means storing the training data."""
        self.X_train = X
        self.Y_train = Y

    def predict(self, X, k=1, num_loops=0):
        if num_loops == 0:
            dists = self.compute_distances_no_loops(X)
        elif num_loops == 1:
            dists = self.compute_distances_one_loop(X)
        elif num_loops == 2:
            dists = self.compute_distances_two_loops(X)
        else:
            raise ValueError('Invalid value %d for num_loops' % num_loops)
        return self.predict_labels(dists, k=k)

    def compute_distances_two_loops(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                # Squared L2 distance; the square root is omitted since it
                # does not change the ordering of the neighbors.
                dists[i, j] = np.sum(np.square(self.X_train[j, :] - X[i, :]))
        return dists

    def compute_distances_one_loop(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            # Broadcasting subtracts X[i] from every training row at once.
            dists[i] = np.sum(np.square(self.X_train - X[i]), axis=1)
        return dists

    def compute_distances_no_loops(self, X):
        # Expand ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y and broadcast.
        squa_sum_X = np.sum(np.square(X), axis=1).reshape(-1, 1)
        squa_sum_Xtr = np.sum(np.square(self.X_train), axis=1)
        inner_prod = np.dot(X, self.X_train.T)
        dists = -2 * inner_prod + squa_sum_X + squa_sum_Xtr
        return dists

    def predict_labels(self, dists, k=1):
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            # Indices of the k nearest training points.
            pos = np.argsort(dists[i])[:k]
            closest_y = self.Y_train[pos]
            # Majority vote; ties are broken in favor of the smaller label.
            y_pred[i] = np.argmax(np.bincount(closest_y.astype(int)))
        return y_pred


"""
This dataset is part of the MNIST dataset, but with only 3 classes,
classes = {0:'0', 1:'1', 2:'2'}. The images are compressed to 14*14
pixels and stored in a matrix together with the corresponding label,
so the shape of the data matrix is
num_of_images x (14*14 pixels + 1 label).
"""
def load_data(split_ratio):
    tmp = np.load("data216x197.npy")
    data = tmp[:, :-1]
    label = tmp[:, -1]
    # Zero-center the data using the mean image.
    mean_data = np.mean(data, axis=0)
    train_data = data[int(split_ratio * data.shape[0]):] - mean_data
    train_label = label[int(split_ratio * data.shape[0]):]
    test_data = data[:int(split_ratio * data.shape[0])] - mean_data
    test_label = label[:int(split_ratio * data.shape[0])]
    return train_data, train_label, test_data, test_label


def main():
    train_data, train_label, test_data, test_label = load_data(0.4)
    knn = KNearestNeighbor()
    knn.train(train_data, train_label)
    Yte = knn.predict(test_data, k=2)
    print("The accuracy is {}".format(np.mean(Yte == test_label)))


if __name__ == "__main__":
    main()
>>>python knn.py
The accuracy is 0.976744186047
# The dataset is small and the images are single-channel and tiny, so the classification result is quite good
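The fully vectorized `compute_distances_no_loops` relies on the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y. A quick sanity check on random data (a standalone sketch, independent of the dataset used above) confirms that it produces the same distance matrix as the naive double loop:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.standard_normal((5, 3))  # 5 "training" vectors
X_test = rng.standard_normal((4, 3))   # 4 "test" vectors

# Naive squared L2 distances, one pair at a time.
naive = np.zeros((4, 5))
for i in range(4):
    for j in range(5):
        naive[i, j] = np.sum(np.square(X_test[i] - X_train[j]))

# Vectorized version via ||x||^2 + ||y||^2 - 2 x.y, using broadcasting.
vec = (np.sum(X_test ** 2, axis=1).reshape(-1, 1)
       + np.sum(X_train ** 2, axis=1)
       - 2 * X_test @ X_train.T)

print(np.allclose(naive, vec))  # True
```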
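The script above fixes k = 2. As noted in the discussion of the choice of k, a common way to pick it (not shown in the original post) is to hold out part of the training data as a validation set and select the k with the best validation accuracy. A minimal sketch on synthetic two-blob data (made-up, standing in for the MNIST subset):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 2-class data: two well-separated Gaussian blobs.
X = np.vstack([rng.standard_normal((100, 2)),
               rng.standard_normal((100, 2)) + [4.0, 4.0]])
y = np.array([0] * 100 + [1] * 100)

# Shuffle, then split into train / validation.
idx = rng.permutation(200)
X, y = X[idx], y[idx]
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

def knn_predict(X_tr, y_tr, X_val, k):
    # Squared L2 distances between every validation and training point.
    d = np.sum((X_val[:, None, :] - X_tr[None, :, :]) ** 2, axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]  # k nearest training indices
    votes = y_tr[nearest]                   # their labels
    return np.array([np.argmax(np.bincount(v)) for v in votes])

# Try several values of k and report validation accuracy for each.
for k in (1, 3, 5, 7):
    acc = np.mean(knn_predict(X_tr, y_tr, X_val, k) == y_val)
    print("k=%d  validation accuracy=%.3f" % (k, acc))
```

With real data one would typically repeat this over several folds (cross-validation) rather than a single split.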