day05 - silhouette coefficient - density-based clustering with noise - Euclidean distance score - Pearson score - ranking by similarity - recommendation lists - natural language - term frequency - text classification - gender classification
5. Silhouette coefficient: an evaluation metric for clustering
1) Concept:
A good clustering is dense on the inside and well separated on the outside.
For each sample, compute its internal distance a and external distance b;
the sample's silhouette coefficient is s = (b - a) / max(a, b).
Averaging the silhouette coefficients of all samples gives the silhouette
coefficient of the whole sample space, S = ave(s).
Internal distance a: the average Euclidean distance between a sample and the other samples in its own cluster.
External distance b: the average Euclidean distance between a sample and all samples in the cluster nearest to its own.
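A minimal sketch of the formula above, computed by hand for a single sample with NumPy (the toy points and cluster labels are made up for illustration):

import numpy as np

# toy data: cluster 0 = first three points, cluster 1 = last three
pts = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                [3.0, 3.0], [3.2, 2.9], [2.9, 3.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
i = 0  # evaluate the first sample
d_same = np.linalg.norm(pts[labels == labels[i]] - pts[i], axis=1)
a = d_same[d_same > 0].mean()  # internal distance a (excludes the sample itself)
# external distance b (with two clusters, the other cluster is trivially the nearest)
b = np.linalg.norm(pts[labels != labels[i]] - pts[i], axis=1).mean()
s = (b - a) / max(a, b)  # this sample's silhouette coefficient
print(a, b, s)  # s near 1 means the sample is well clustered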
2) API:
import sklearn.metrics as sm  # metrics module
score = sm.silhouette_score(inputs, predicted labels,
    sample_size=number of samples, metric=distance metric)
Distance metric: euclidean (Euclidean distance)
3) Code: score.py
import numpy as np
import sklearn.cluster as sc
import sklearn.metrics as sm

x = []
with open('../../day01/data/multiple3.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data)
x = np.array(x)
model = sc.KMeans(n_clusters=4)  # K-means clusterer
model.fit(x)
pred_y = model.predict(x)  # predict the cluster labels
score = sm.silhouette_score(x, pred_y,
                            sample_size=len(x), metric='euclidean')
print(score)  # print the clustering quality metric: the silhouette coefficient
6. Density-based clustering with noise (DBSCAN)
1) Concept: a friend of my friend is also my friend.
Pick any sample from the sample space and draw a circle around it with a
radius given in advance; every sample inside that circle is considered to
belong to the same cluster as that sample. Using each newly captured sample
as a center, draw circles again and repeat, growing the set of captured
samples until no new sample can be added; this yields one cluster. Repeat
the process on the remaining samples until the sample space is exhausted.
The "circle" here is the general concept: a circle in 2D, a sphere in 3D, and so on.
2) Characteristics:
1. The radius given in advance affects the final clustering;
the silhouette coefficient can be used to pick a good value.
2. Based on how clusters form, samples fall into three types:
Peripheral samples: pulled into some cluster by other samples, but unable to bring in new samples themselves.
Isolated samples: if a group contains fewer samples than the given lower bound,
it is not a cluster; its members are called isolated samples (noise).
Core samples: every sample that is neither peripheral nor isolated.
3) Application:
In sales or e-commerce, for example, the customers who are peripheral or isolated samples are the ones worth digging into.
4) API:
import sklearn.cluster as sc
model = sc.DBSCAN(eps=epsilon, min_samples=5)
eps: the radius  min_samples: the minimum number of samples, i.e. the lower bound
5) Code example: dbscan.py
import numpy as np
import sklearn.cluster as sc
import matplotlib.pyplot as mp
import sklearn.metrics as sm

x = []
with open('../../day01/data/perf.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data)
x = np.array(x)
epsilons, scores, models = np.linspace(0.3, 1.2, 10), [], []
for epsilon in epsilons:
    # build a DBSCAN (density-based, noise-aware) clusterer
    model = sc.DBSCAN(eps=epsilon, min_samples=5)
    model.fit(x)
    score = sm.silhouette_score(x, model.labels_, sample_size=len(x),
                                metric='euclidean')  # silhouette coefficient
    scores.append(score)
    models.append(model)
scores = np.array(scores)
# print(epsilons)
# print(scores)  # would also show that radius 0.8 scores best, but needs a human to judge
best_index = scores.argmax()  # index of the best silhouette coefficient
best_epsilon = epsilons[best_index]  # radius with the best silhouette coefficient
print(best_epsilon)  # automatically pick the radius with the best silhouette
best_model = models[best_index]  # the best model
pred_y = best_model.fit_predict(x)
core_mask = np.zeros(len(x), dtype=bool)  # core-sample mask
core_mask[best_model.core_sample_indices_] = True  # mark the core samples
offset_mask = best_model.labels_ == -1  # isolated-sample mask: labels_ == -1 means noise
periphery_mask = ~(core_mask | offset_mask)  # peripheral-sample mask
mp.figure('DBSCAN', facecolor='lightgray')
mp.title('DBSCAN', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
labels = set(pred_y)  # set removes duplicate labels
# map range(len(labels)) to colors from get_cmap
cs = mp.get_cmap('brg', len(labels))(range(len(labels)))
mp.scatter(x[core_mask][:, 0], x[core_mask][:, 1],
           c=cs[pred_y[core_mask]], s=80, label='Core')
mp.scatter(x[offset_mask][:, 0], x[offset_mask][:, 1],
           edgecolor=cs[pred_y[offset_mask]],
           facecolor='none', s=80, label='Offset')
mp.scatter(x[periphery_mask][:, 0], x[periphery_mask][:, 1],
           c=cs[pred_y[periphery_mask]],
           marker='*', s=80, label='Periphery')
mp.legend()
mp.tight_layout()
mp.show()
XIII. Recommendation engines
Horizontal recommendation: things the user may need in the future;
vertical recommendation: things the user has already used and may not need again.
1. Euclidean distance score
Concept: Euclidean distance score = 1 / (1 + Euclidean distance)
As the Euclidean distance approaches 0, the score approaches 1;
as the Euclidean distance approaches +∞, the score approaches 0.
A score near 1 means very similar; a score near 0 means very dissimilar.
Euclidean distance score matrix:
      a  b  c ...
  a   1  x  x
  b   x  1  x
  c   x  x  1
  ...
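A minimal sketch of the score formula on two toy rating vectors (the numbers are made up):

import numpy as np

x = np.array([3.0, 4.0, 5.0])  # user a's ratings of the shared movies
y = np.array([2.5, 4.0, 4.5])  # user b's ratings of the same movies
score = 1 / (1 + np.sqrt(((x - y) ** 2).sum()))  # Euclidean distance score
print(score)  # close to 1: the two users rate similarly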
Code example: recommending movies to similar users
import json
import numpy as np

with open('../../day01/data/ratings.json', 'r') as f:
    ratings = json.loads(f.read())
users, scmat = list(ratings.keys()), []  # scmat will hold the Euclidean score matrix
for user1 in users:
    scrow = []
    for user2 in users:
        movies = set()
        for movie in ratings[user1]:
            if movie in ratings[user2]:
                movies.add(movie)  # movies rated by both users
        if len(movies) == 0:
            score = 0
        else:
            x, y = [], []
            for movie in movies:
                x.append(ratings[user1][movie])
                y.append(ratings[user2][movie])
            x = np.array(x)
            y = np.array(y)
            score = 1 / (1 + np.sqrt(((x - y) ** 2).sum()))  # Euclidean distance score
        scrow.append(score)
    scmat.append(scrow)
users = np.array(users)
scmat = np.array(scmat)
for scrow in scmat:
    # print the Euclidean distance score matrix
    print(' '.join('{:>5.2f}'.format(score) for score in scrow))
2. Pearson score
Concept:
Pearson: best known for the correlation matrix.
Correlation coefficient = covariance / product of the standard deviations.
It lies in [-1, 1]: -1 means negatively correlated, 1 positively correlated, 0 uncorrelated.
The Pearson score is simply the value of the correlation coefficient.
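A minimal sketch of np.corrcoef on toy vectors (numbers made up), showing the [-1, 1] range:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.corrcoef(x, x * 2 + 1)[0, 1])  #  1.0: perfectly positively correlated
print(np.corrcoef(x, -x)[0, 1])         # -1.0: perfectly negatively correlated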
Code example: ps.py (somewhat more objective)
import json
import numpy as np

with open('../../day01/data/ratings.json', 'r') as f:
    ratings = json.loads(f.read())
users, scmat = list(ratings.keys()), []
for user1 in users:
    scrow = []
    for user2 in users:
        movies = set()
        for movie in ratings[user1]:
            if movie in ratings[user2]:
                movies.add(movie)
        if len(movies) == 0:
            score = 0
        else:
            x, y = [], []
            for movie in movies:
                x.append(ratings[user1][movie])
                y.append(ratings[user2][movie])
            x = np.array(x)
            y = np.array(y)
            score = np.corrcoef(x, y)[0, 1]  # Pearson score
        scrow.append(score)
    scmat.append(scrow)
users = np.array(users)
scmat = np.array(scmat)
for scrow in scmat:
    # print the Pearson score matrix
    print(' '.join('{:>5.2f}'.format(score) for score in scrow))
3. Ranking users by similarity
Code example: sim.py
import json
import numpy as np

with open('../../day01/data/ratings.json', 'r') as f:
    ratings = json.loads(f.read())
users, scmat = list(ratings.keys()), []
for user1 in users:
    scrow = []
    for user2 in users:
        movies = set()
        for movie in ratings[user1]:
            if movie in ratings[user2]:
                movies.add(movie)
        if len(movies) == 0:
            score = 0
        else:
            x, y = [], []
            for movie in movies:
                x.append(ratings[user1][movie])
                y.append(ratings[user2][movie])
            x = np.array(x)
            y = np.array(y)
            score = np.corrcoef(x, y)[0, 1]  # Pearson score
        scrow.append(score)
    scmat.append(scrow)
users = np.array(users)
scmat = np.array(scmat)
# sort by similarity
for i, user in enumerate(users):
    sorted_indices = scmat[i].argsort()[::-1]  # indices in descending score order
    sorted_indices = sorted_indices[sorted_indices != i]  # drop the user itself
    similar_users = users[sorted_indices]  # users in similarity order
    similar_scores = scmat[i, sorted_indices]  # scores in similarity order
    print(user, '-->', similar_users, similar_scores, sep='\n')
4. Generating the recommendation list
Computing the recommendation score takes into account:
1. only positively correlated users, i.e. Pearson score > 0;
2. how high those users rated the movie;
3. the similarity as a weight.
A candidate movie's recommendation score is the weighted average of the similar users' ratings,
each rating weighted by that user's similarity score, as in the worked example below.
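For example (numbers made up): if two positively correlated users rated a movie 4 and 5, with similarity scores 0.9 and 0.5, the movie's recommendation score is (4*0.9 + 5*0.5) / (0.9 + 0.5) ≈ 4.36.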
Code example: rcm.py
import json
import numpy as np

with open('../../day01/data/ratings.json', 'r') as f:
    ratings = json.loads(f.read())
users, scmat = list(ratings.keys()), []
for user1 in users:
    scrow = []
    for user2 in users:
        movies = set()
        for movie in ratings[user1]:
            if movie in ratings[user2]:
                movies.add(movie)
        if len(movies) == 0:
            score = 0
        else:
            x, y = [], []
            for movie in movies:
                x.append(ratings[user1][movie])
                y.append(ratings[user2][movie])
            x = np.array(x)
            y = np.array(y)
            score = np.corrcoef(x, y)[0, 1]  # Pearson score
        scrow.append(score)
    scmat.append(scrow)
users = np.array(users)
scmat = np.array(scmat)
# sort by similarity
for i, user in enumerate(users):
    sorted_indices = scmat[i].argsort()[::-1]  # indices in descending score order
    sorted_indices = sorted_indices[sorted_indices != i]  # drop the user itself
    similar_users = users[sorted_indices]  # users in similarity order
    similar_scores = scmat[i, sorted_indices]  # scores in similarity order
    positive_mask = similar_scores > 0  # mask out uncorrelated users, i.e. drop scores <= 0
    similar_users = similar_users[positive_mask]
    similar_scores = similar_scores[positive_mask]
    score_sums, weight_sums = {}, {}
    for similar_user, similar_score in zip(similar_users, similar_scores):
        for movie, score in ratings[similar_user].items():
            if movie not in ratings[user].keys():  # only movies this user has not rated
                if movie not in score_sums.keys():
                    score_sums[movie] = 0
                score_sums[movie] += score * similar_score
                if movie not in weight_sums.keys():
                    weight_sums[movie] = 0
                weight_sums[movie] += similar_score
    movie_ranks = {}  # recommendation scores
    for movie, score_sum in score_sums.items():
        movie_ranks[movie] = score_sum / weight_sums[movie]
    sorted_indices = np.array(list(movie_ranks.values())).argsort()[::-1]
    recomms = np.array(list(movie_ranks.keys()))[sorted_indices]  # the recommendation list
    print(user, '-->', recomms, sep='\n')
XIV. Natural language
1. Concept
    +--<------------------------------------------------------<--+
    |                                                             |
    human -> speech -> text -> semantics -> logic -> text -> speech
    ----------------   -------------------------------------------
         human              machine speech-processing pipeline
Human speech -> text: speech recognition
Text -> semantics: natural language understanding
Semantics -> logic: business processing
Logic -> text: natural language generation
Text -> machine speech: speech synthesis
2. Installation
Fresh install of NLTK, the Natural Language Toolkit:
1. python -m pip install nltk
2. open a Python shell
3. import nltk
4. nltk.download()
5. select "all" and click Download
6. set an environment variable:
Computer -> Properties -> Advanced system settings -> Advanced -> Environment Variables -> System variables -> New:
variable name: NLTK_DATA, variable value: C:\nltk_data
7. if the environment variable is not set, then every time you must run:
from nltk import data
data.path.append(r"E:\python\nltk_data")
If you copied the data packages manually,
choose the local package in step 5 above.
Chinese support:
pip install jieba
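Instead of downloading everything in step 5, a lighter option is to fetch only the resources these notes actually use (a sketch; resource names as in current NLTK, run once):

import nltk

nltk.download('punkt')    # sentence/word tokenizer models for nltk.tokenize
nltk.download('wordnet')  # lexical database used by WordNetLemmatizer
nltk.download('names')    # male/female name corpus used in the gender example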
3. Tokenization
Splits a paragraph into sentences, words, punctuation marks, and so on.
Method 1:
import nltk.tokenize as tk  # tokenizers
tk.sent_tokenize(paragraph) -> list of sentences. Splits by sentence,
    using cues such as a capitalized first letter and sentence-ending punctuation (. ! ? ...).
tk.word_tokenize(sentence) -> list of words. Splits by word,
    on runs of spaces, line breaks, and punctuation.
Method 2:
tokenizer = tk.WordPunctTokenizer()
tokenizer.tokenize(sentence) -> splits by word
Code example: tkn.py
import nltk.tokenize as tk

doc = "Are you curious about tokenization? " \
      "Let's see how it works! " \
      "We need to analyze a couple of sentences " \
      "with punctuations to see it in action."
tokens = tk.sent_tokenize(doc)  # split into sentences
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token, sep='\n')
tokens = tk.word_tokenize(doc)  # split into words
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
tokenizer = tk.WordPunctTokenizer()  # split into words
tokens = tokenizer.tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
Chinese jieba segmentation example: jiebacut.py
import jieba

# jieba.cut has three main modes
str_text = "我是中国人,我的名字叫杨洋,大家好!"  # test sentence: "I am Chinese, my name is Yang Yang, hello everyone!"
# str_text = open(u'../data/测试.txt', encoding='utf-8', errors='ignore').read()
# full mode: cut_all=True
str_quan = jieba.cut(str_text, cut_all=True)
print(" ".join(str_quan))
# precise mode: cut_all=False (the default)
str_jing = jieba.cut(str_text, cut_all=False)
print(" ".join(str_jing))
# search-engine mode: cut_for_search
str_soso = jieba.cut_for_search(str_text)
print(" ".join(str_soso))
4. Stemming
Porter stemmer (English only):
import nltk.stem.porter as pt
    lenient: keeps more letters
pt = pt.PorterStemmer()
Lancaster stemmer (English only):
import nltk.stem.lancaster as lc
    strict: keeps fewer letters
lc = lc.LancasterStemmer()
Snowball stemmer (also supports other languages):
import nltk.stem.snowball as sb
    middle ground: strictness between the two above
sb = sb.SnowballStemmer('english')  # the language must be given
Code example: stm.py
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

words = ['table', 'probably', 'wolves', 'playing', 'is', 'dog', 'the',
         'beaches', 'grounded', 'dreamt', 'envision']
pt = pt.PorterStemmer()
lc = lc.LancasterStemmer()
sb = sb.SnowballStemmer('english')
for word in words:
    pt_stem = pt.stem(word)
    lc_stem = lc.stem(word)
    sb_stem = sb.stem(word)
    print('%8s %8s %8s %8s' % (word, pt_stem, lc_stem, sb_stem))
5. Lemmatization
Nouns: reduced to the singular form.
Verbs: reduced to the base form.
import nltk.stem as ns
ns = ns.WordNetLemmatizer()  # lemmatizer
n_ns = ns.lemmatize(word, pos='n')  # treat the word as a noun
v_ns = ns.lemmatize(word, pos='v')  # treat the word as a verb
Code example: lmm.py
import nltk.stem as ns

words = ['table', 'probably', 'wolves', 'playing', 'is', 'dog', 'the',
         'beaches', 'grounded', 'dreamt', 'envision']
ns = ns.WordNetLemmatizer()
for word in words:
    n_ns = ns.lemmatize(word, pos='n')  # treat the word as a noun
    v_ns = ns.lemmatize(word, pos='v')  # treat the word as a verb
    print('%8s %8s %8s' % (word, n_ns, v_ns))
6. Bag-of-words model
Vocabulary: the distinct words that appear in the paragraph.
[1] The brown dog is running.
[2] The black dog is in the black room.
[3] Running in the room is forbidden.
Vocabulary (in order of first appearance): the brown dog is running black in room forbidden
Count how many times each word appears in each sentence
(CountVectorizer orders the columns alphabetically):
      black brown dog forbidden in is room running the
[1]     0     1    1      0      0  1    0      1    1
[2]     2     0    1      0      1  1    1      0    2
[3]     0     0    0      1      1  1    1      1    1
Code example: bow.py
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden. '
sc = tk.sent_tokenize(doc)
print(sc)
cv = ft.CountVectorizer()  # count-based vectorizer
bow = cv.fit_transform(sc).toarray()  # get the bag-of-words matrix
print(bow)
words = cv.get_feature_names()  # the vocabulary (get_feature_names_out() in newer sklearn)
print(words)
7. Term frequency
Normalize the bag-of-words matrix so that each entry is the word's frequency
within each sample; that frequency expresses how much the word contributes
to the meaning of the specific sentence.
Code example: tf.py
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
import sklearn.preprocessing as sp

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden. '
sc = tk.sent_tokenize(doc)
cv = ft.CountVectorizer()  # count-based vectorizer
bow = cv.fit_transform(sc).toarray()  # get the bag-of-words matrix
tf = sp.normalize(bow, norm='l1')  # normalize to get term frequencies
print(tf)
8. TF-IDF: document frequency and inverse document frequency
Document frequency = number of documents (sentences) containing a word / total number of documents (sentences)
Inverse document frequency = 1 / document frequency = total documents (sentences) / documents (sentences) containing the word
TF-IDF (term frequency-inverse document frequency): a mathematical model of natural language
    = term frequency * inverse document frequency
i.e. each element of the term-frequency matrix multiplied by the corresponding word's inverse document frequency.
The larger the value, the more the word contributes to the sample's meaning;
a learning model is built from each word's contribution, giving a mathematical
model of natural language. (Note that sklearn's TfidfTransformer uses a
smoothed, logarithmic variant of this plain ratio.)
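A minimal sketch of the simplified formula above, computed by hand on the bag-of-words matrix from section 6 (it will not match TfidfTransformer exactly, since sklearn applies smoothing, a logarithm, and L2 normalization):

import numpy as np

bow = np.array([[0, 1, 1, 0, 0, 1, 0, 1, 1],  # counts from the bag-of-words table
                [2, 0, 1, 0, 1, 1, 1, 0, 2],
                [0, 0, 0, 1, 1, 1, 1, 1, 1]])
tf = bow / bow.sum(axis=1, keepdims=True)  # term frequency
df = (bow > 0).sum(axis=0) / bow.shape[0]  # document frequency of each word
tfidf = tf * (1 / df)  # term frequency * inverse document frequency
print(tfidf)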
Code example: tfidf.py
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden. '
sc = tk.sent_tokenize(doc)
cv = ft.CountVectorizer()  # count-based vectorizer
bow = cv.fit_transform(sc).toarray()  # get the bag-of-words matrix
tt = ft.TfidfTransformer()  # TF-IDF transformer
tfidf = tt.fit_transform(bow).toarray()  # get the TF-IDF matrix
print(tfidf)
9. Text classification (topic identification)
Using a naive Bayes classifier.
Code example: doc.py
import sklearn.datasets as sd
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb  # naive Bayes classifiers

train = sd.load_files('../../day01/data/20news',
                      encoding='latin1', shuffle=True, random_state=7)
train_data = train.data
train_y = train.target  # class indices
categories = train.target_names  # class labels
cv = ft.CountVectorizer()  # bag-of-words builder
train_bow = cv.fit_transform(train_data)
tt = ft.TfidfTransformer()  # TF-IDF transformer
train_x = tt.fit_transform(train_bow)
model = nb.MultinomialNB()  # naive Bayes classifier for multinomial distributions
model.fit(train_x, train_y)  # train
test_data = [
    'The curveballs of right handed pitchers tend to curve to the left',
    'Caesar cipher is an ancient form of encryption',
    'This two-wheeler is really good on slippery roads'
]  # test samples
test_bow = cv.transform(test_data)  # bag-of-words for the test samples
test_x = tt.transform(test_bow)
pred_test_y = model.predict(test_x)  # predicted outputs
for sentence, index in zip(test_data, pred_test_y):
    print(sentence, '-->', categories[index])  # print the predictions
10. Gender classification
Code example: gndr.py
import numpy as np
import random
import nltk.corpus as nc  # corpora
import nltk.classify as cf  # natural-language classifiers

male_names = nc.names.words('male.txt')
female_names = nc.names.words('female.txt')
models, acs = [], []
for n_letters in range(1, 6):
    data = []
    for male_name in male_names:
        feature = {'feature': male_name[-n_letters:].lower()}  # last n letters of the name
        data.append((feature, 'male'))
    for female_name in female_names:
        feature = {'feature': female_name[-n_letters:].lower()}
        data.append((feature, 'female'))
    random.seed(7)
    random.shuffle(data)  # shuffle the order
    train_data = data[:int(len(data) / 2)]  # first half as training samples
    test_data = data[int(len(data) / 2):]  # second half as test samples
    model = cf.NaiveBayesClassifier.train(train_data)  # trains directly, no object construction needed
    ac = cf.accuracy(model, test_data)
    models.append(model)
    acs.append(ac)
best_index = np.array(acs).argmax()
best_letters = best_index + 1
best_model = models[best_index]
best_ac = acs[best_index]
print(best_letters, best_ac)
names, genders = ['Leonardo', 'Amy', 'Sam', 'Tom', 'Katherine', 'Taylor', 'Susanne'], \
                 []  # prediction test
for name in names:
    feature = {'feature': name[-best_letters:].lower()}
    gender = best_model.classify(feature)
    genders.append(gender)
for name, gender in zip(names, genders):
    print(name, '-->', gender)
Source: blog.csdn.net/pinecn/article/details/89877040