Problem 1: as the position the ball is dropped from (the feature variable) changes, the prize slot it finally lands in (the label) changes too
- feature ---- input
- f(x) ---- the model / algorithm
- label ---- output
Use a large amount of known data to train f(x)
- The usual three steps
- Collect the data: features and labels
- Choose a model to capture the relationship between feature and label, i.e. f(x)
- Use the chosen model to make predictions
- How humans build models: repeated trips build up experience, improving the prediction of when you will arrive at the office
- Machine learning: use mathematical methods from statistics, probability theory, and information theory to build a model from known data, then use that model to make predictions (preparing for the gaokao works much the same way)
- A classic machine-learning model: KNN (k-nearest neighbors; "birds of a feather flock together")
- Basic steps
- 1. Collect the data
- 2. Compute the distance between each data record and the prediction point
- 3. Sort by distance, ascending
- 4. Take the top K records to get the most probable landing slot
- 5. Predict the landing slot for the prediction point
- Python implementation
```python
import numpy as np
import collections as c

# 1. collect the data
data = np.array([
    [154, 1],
    [126, 2],
    [70, 2],
    [196, 2],
    [161, 2],
    [371, 4],
])
# feature, label
feature = data[:, 0]
label = data[:, -1]

# 2. compute the distance between each record and the prediction point
predictPoint = 200
distance = list(map(lambda x: abs(x - predictPoint), feature))

# 3. sort by distance, ascending (argsort returns the sorted indices)
sortindex = np.argsort(distance)

# 4. take the top K records (here K = 3) to find the most probable slot
sortedlabel = label[sortindex]
k = 3

# 5. predict the slot for the prediction point; most_common gives [(label, count)]
predictlabel = c.Counter(sortedlabel[0:k]).most_common(1)[0][0]
print(predictlabel)
```
- Wrapping the code in a function
```python
import numpy as np
import collections as c

# KNN wrapped in a reusable function
def knn(k, predictPoint, feature, label):
    distance = list(map(lambda x: abs(x - predictPoint), feature))
    sortindex = np.argsort(distance)
    sortedlabel = label[sortindex]
    predictlabel = c.Counter(sortedlabel[0:k]).most_common(1)[0][0]
    return predictlabel

if __name__ == '__main__':
    data = np.array([
        [154, 1],
        [126, 2],
        [70, 2],
        [196, 2],
        [161, 2],
        [371, 4],
    ])
    # feature, label
    feature = data[:, 0]
    label = data[:, -1]
    k = 3
    predictPoint = 200
    res = knn(k, predictPoint, feature, label)
    print(res)
```
- Validating on real data
- Dataset: https://pan.baidu.com/s/1rw5aFlVzZb6SBdY09Rckhg
```python
import numpy as np
import collections as c

def knn(k, predictPoint, feature, label):
    distance = list(map(lambda x: abs(x - predictPoint), feature))
    sortindex = np.argsort(distance)
    sortedlabel = label[sortindex]
    predictlabel = c.Counter(sortedlabel[0:k]).most_common(1)[0][0]
    return predictlabel

if __name__ == '__main__':
    # load the real dataset instead of the hand-written sample
    data = np.loadtxt("数据集/data0.csv", delimiter=",")
    # feature, label
    feature = data[:, 0]
    label = data[:, -1]
    k = 3
    predictPoint = 300
    res = knn(k, predictPoint, feature, label)
    print(res)
```
- Shuffle the data into a training set and a test set
```python
import numpy as np
import collections as c

def knn(k, predictPoint, feature, label):
    distance = list(map(lambda x: abs(x - predictPoint), feature))
    sortindex = np.argsort(distance)
    sortedlabel = label[sortindex]
    predictlabel = c.Counter(sortedlabel[0:k]).most_common(1)[0][0]
    return predictlabel

if __name__ == '__main__':
    data = np.loadtxt("数据集/data0.csv", delimiter=",")
    # shuffle the rows in place
    np.random.shuffle(data)
    # ~90% for training (data[100:-1] would silently drop the last row)
    traindata = data[100:]
    # ~10% for testing
    testdata = data[0:100]
    # save the training and test data
    np.savetxt("data0-test.csv", testdata, delimiter=",", fmt="%d")
    np.savetxt("data0-train.csv", traindata, delimiter=",", fmt="%d")

    traindata = np.loadtxt("data0-train.csv", delimiter=",")
    # feature, label
    feature = traindata[:, 0]
    label = traindata[:, -1]

    max_accuracy = 0
    max_k = []
    # the prediction points come from each record of the test data
    testdata = np.loadtxt("data0-test.csv", delimiter=",")
    for k in range(1, 100):
        count = 0
        for item in testdata:
            predict = knn(k, item[0], feature, label)
            real = item[1]
            if predict == real:
                count += 1
        accuracy = count * 100.0 / len(testdata)
        # keep every k that ties the best accuracy; reset the list on a new best
        if accuracy > max_accuracy:
            max_accuracy = accuracy
            max_k = [k]
        elif accuracy == max_accuracy:
            max_k.append(k)
        print("k = {}, accuracy: {}%".format(k, accuracy))
    print(set(max_k))
```
Choosing a distance metric for multi-dimensional data
- Euclidean distance is sufficient
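The 1-D `abs(x - predictPoint)` above generalizes directly. A minimal sketch of a Euclidean-distance version of `knn` for multi-dimensional features (the name `knn_nd`, the sample rows, and the query point are illustrative, not taken from the dataset):

```python
import numpy as np
import collections as c

def knn_nd(k, predict_point, feature, label):
    # Euclidean distance between every sample row and the query point
    distance = np.sqrt(np.sum((feature - predict_point) ** 2, axis=1))
    sortindex = np.argsort(distance)   # indices sorted by distance, ascending
    sortedlabel = label[sortindex]
    # majority vote among the K nearest neighbours
    return c.Counter(sortedlabel[0:k]).most_common(1)[0][0]

if __name__ == '__main__':
    # two features per sample, e.g. drop position and a color code
    feature = np.array([[154, 0.50], [126, 0.51], [70, 0.52],
                        [196, 0.52], [161, 0.51], [371, 0.54]])
    label = np.array([1, 2, 2, 2, 2, 4])
    print(knn_nd(3, np.array([200, 0.52]), feature, label))
```

Broadcasting subtracts the query point from every row at once, so the loop over samples disappears.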
- The predictions are not very accurate; why?
- 1. The model's parameters were not chosen well; tuning them is a job in itself ("parameter-tuning engineer")
- If K is too small, noise interferes too much
- If K is too large, records from other classes get swept in
- How to evaluate a value of K
- Training set (trainData): used to build the model
- Test set (testData): used to validate the model
- Rule of thumb: K is usually taken near the square root of the sample-set size
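That rule of thumb fits in a couple of lines (the sample size of 900 is made up for illustration):

```python
import math

# rule of thumb: K is taken near the square root of the training-set size
n_samples = 900            # hypothetical training-set size
k = int(math.sqrt(n_samples))
print(k)  # 30
```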
- 2. Not enough contributing factors
- Add more dimensions to the data
- For example, the ball's color changes its elasticity
- Importing data with non-numeric columns
```python
import numpy as np

# map a color name to a numeric code so it can be used as a feature
def colornum(s):
    colors = {"红": 0.50, "黄": 0.51, "蓝": 0.52, "绿": 0.53, "紫": 0.54, "粉": 0.55}
    return colors[s]

# convert column 1 (the color) on the fly while loading
data = np.loadtxt('数据集/data1.csv', delimiter=",", converters={1: colornum}, encoding="gbk")
print(data)
```
- Add more dimensions to the data
- 3. Not enough sample data
- 4. The chosen model isn't good enough; try other machine-learning models