1. Overview
We use a pretrained word-embedding model (Word2Vec) to explore how embeddings capture similarities and relationships between words, for example finding a country's capital or a company's main products. Finally, we use t-SNE to plot the high-dimensional embedding space on a 2D map.
We first download the pretrained model, trained on Google News, and unpack it.
2. Loading the word vectors
import os
#from tensorflow.keras.utils import get_file
import gensim
import subprocess
import numpy as np
import matplotlib.pyplot as plt
from IPython.core.pylabtools import figsize
figsize(10, 10)
# use a raw string so the backslash in the Windows path is not treated as an escape
model = gensim.models.KeyedVectors.load_word2vec_format(r'D:\GoogleNews-vectors-negative300.bin', binary=True)
3. Finding similar words
Let's see what is most similar to espresso:
model.most_similar(positive=['espresso'])
The output:
[('cappuccino', 0.6888187527656555), ('mocha', 0.6686208248138428), ('coffee', 0.6616825461387634), ('latte', 0.653675377368927), ('caramel_macchiato', 0.6491268277168274), ('ristretto', 0.6485546231269836), ('espressos', 0.6438629031181335), ('macchiato', 0.6428250074386597), ('chai_latte', 0.6308028101921082), ('espresso_cappuccino', 0.6280542612075806)]
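Under the hood, most_similar ranks every word in the vocabulary by cosine similarity to the query vector. A self-contained sketch of that metric, using made-up 3-dimensional toy vectors rather than the real 300-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two vectors: dot product divided by the norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy 3-d "embeddings" -- purely illustrative, not real Word2Vec vectors
espresso = np.array([0.9, 0.1, 0.3])
cappuccino = np.array([0.8, 0.2, 0.4])
vodka = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(espresso, cappuccino))  # high: related drinks
print(cosine_similarity(espresso, vodka))       # noticeably lower
```

Scores range from -1 to 1; related words end up with vectors pointing in similar directions, hence a score close to 1.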
4. Defining a relationship function
Let's define a lookup function for analogies: man is to woman as king is to what?
def A_is_to_B_as_C_is_to(a, b, c, topn=1):
    a, b, c = map(lambda x: x if isinstance(x, list) else [x], (a, b, c))
    res = model.most_similar(positive=b + c, negative=a, topn=topn)
    if len(res):
        if topn == 1:
            return res[0][0]
        return [x[0] for x in res]
    return None
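The function works because analogies correspond to vector arithmetic: b - a + c lands near the answer. A minimal numpy sketch with hypothetical 2-d toy vectors (chosen by hand so the analogy works; the real model uses 300 dimensions) illustrates the idea:

```python
import numpy as np

# hypothetical 2-d toy embeddings -- not real Word2Vec data
vecs = {
    'man':   np.array([1.0, 0.0]),
    'woman': np.array([1.0, 1.0]),
    'king':  np.array([3.0, 0.0]),
    'queen': np.array([3.0, 1.0]),
    'apple': np.array([0.0, 3.0]),
}

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def closest(target, exclude):
    # rank every word except the inputs by cosine similarity to the target
    candidates = {w: cos(v, target) for w, v in vecs.items() if w not in exclude}
    return max(candidates, key=candidates.get)

# woman - man + king lands near queen
target = vecs['woman'] - vecs['man'] + vecs['king']
print(closest(target, exclude={'man', 'woman', 'king'}))  # -> queen
```

This is the same positive/negative arithmetic that most_similar performs internally, just spelled out on a five-word vocabulary.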
The following call returns queen:
A_is_to_B_as_C_is_to('man', 'woman', 'king')
If Berlin is the capital of Germany, what is the capital of Italy?
for country in 'Italy', 'France', 'India', 'China':
    print('%s is the capital of %s' %
          (A_is_to_B_as_C_is_to('Germany', 'Berlin', country), country))
The output:
Rome is the capital of Italy
Paris is the capital of France
Delhi is the capital of India
Beijing is the capital of China
Or we can do the same for the flagship products of particular companies. Here we seed the product equation with two examples: iPhone for Apple and Starbucks_coffee for Starbucks. Note that in this embedding model, digits are replaced with #:
for company in 'Google', 'IBM', 'Boeing', 'Microsoft', 'Samsung':
    products = A_is_to_B_as_C_is_to(
        ['Starbucks', 'Apple'],
        ['Starbucks_coffee', 'iPhone'],
        company, topn=3)
    print('%s -> %s' %
          (company, ', '.join(products)))
The output:
Google -> personalized_homepage, app, Gmail
IBM -> DB2, WebSphere_Portal, Tamino_XML_Server
Boeing -> Dreamliner, airframe, aircraft
Microsoft -> Windows_Mobile, SyncMate, Windows
Samsung -> MM_A###, handset, Samsung_SCH_B###
5. Clustering with t-SNE
Let's do some clustering by picking items from three categories: beverages, countries, and sports:
beverages = ['espresso', 'beer', 'vodka', 'wine', 'cola', 'tea']
countries = ['Italy', 'Germany', 'Russia', 'France', 'USA', 'India']
sports = ['soccer', 'handball', 'hockey', 'cycling', 'basketball', 'cricket']
items = beverages + countries + sports
len(items)
and look up their vectors:
item_vectors = [(item, model[item])
                for item in items
                if item in model]
len(item_vectors)
Now cluster with t-SNE:
from sklearn.manifold import TSNE

vectors = np.asarray([x[1] for x in item_vectors])
lengths = np.linalg.norm(vectors, axis=1)
norm_vectors = (vectors.T / lengths).T  # normalize each vector to unit length
tsne = TSNE(n_components=2, perplexity=10, verbose=2).fit_transform(norm_vectors)
The output:
[t-SNE] Computing 17 nearest neighbors...
[t-SNE] Indexed 18 samples in 0.000s...
[t-SNE] Computed neighbors for 18 samples in 0.052s...
[t-SNE] Computed conditional probabilities for sample 18 / 18
[t-SNE] Mean sigma: 0.581543
[t-SNE] Computed conditional probabilities in 0.003s
[t-SNE] Iteration 50: error = 70.7682343, gradient norm = 0.2115109 (50 iterations in 0.017s)
[t-SNE] Iteration 100: error = 50.6931763, gradient norm = 0.0493703 (50 iterations in 0.012s)
[t-SNE] Iteration 150: error = 72.4811478, gradient norm = 0.2386879 (50 iterations in 0.011s)
[t-SNE] Iteration 200: error = 61.9339905, gradient norm = 0.1581545 (50 iterations in 0.012s)
[t-SNE] Iteration 250: error = 64.9977417, gradient norm = 0.1168961 (50 iterations in 0.011s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 64.997742
[t-SNE] Iteration 300: error = 0.9344296, gradient norm = 0.0008748 (50 iterations in 0.013s)
[t-SNE] Iteration 350: error = 0.7699002, gradient norm = 0.0005786 (50 iterations in 0.011s)
[t-SNE] Iteration 400: error = 0.6187575, gradient norm = 0.0004745 (50 iterations in 0.012s)
[t-SNE] Iteration 450: error = 0.5282804, gradient norm = 0.0003208 (50 iterations in 0.011s)
[t-SNE] Iteration 500: error = 0.4986507, gradient norm = 0.0001888 (50 iterations in 0.011s)
[t-SNE] Iteration 550: error = 0.3673418, gradient norm = 0.0004975 (50 iterations in 0.012s)
[t-SNE] Iteration 600: error = 0.2507115, gradient norm = 0.0007413 (50 iterations in 0.011s)
[t-SNE] Iteration 650: error = 0.1724875, gradient norm = 0.0002562 (50 iterations in 0.011s)
[t-SNE] Iteration 700: error = 0.1552246, gradient norm = 0.0001649 (50 iterations in 0.012s)
[t-SNE] Iteration 750: error = 0.1389877, gradient norm = 0.0000916 (50 iterations in 0.012s)
[t-SNE] Iteration 800: error = 0.1303239, gradient norm = 0.0000812 (50 iterations in 0.011s)
[t-SNE] Iteration 850: error = 0.1220449, gradient norm = 0.0000533 (50 iterations in 0.010s)
[t-SNE] Iteration 900: error = 0.1205731, gradient norm = 0.0000198 (50 iterations in 0.011s)
[t-SNE] Iteration 950: error = 0.1201564, gradient norm = 0.0000153 (50 iterations in 0.011s)
[t-SNE] Iteration 1000: error = 0.1198082, gradient norm = 0.0000120 (50 iterations in 0.012s)
[t-SNE] KL divergence after 1000 iterations: 0.119808
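A note on the normalization trick above: dividing each row by its length via the double transpose yields unit vectors, so the distances t-SNE works with reflect direction (which is what cosine similarity measures) rather than magnitude. A quick check on random data of the same shape as our 18 items:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(18, 300))          # stand-in for the 18 item vectors
lengths = np.linalg.norm(vectors, axis=1)     # one length per row
norm_vectors = (vectors.T / lengths).T        # divide each row by its length

# every row is now unit length
print(np.allclose(np.linalg.norm(norm_vectors, axis=1), 1.0))  # -> True
```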
Now display the results with matplotlib.
As you can see, the countries, sports, and beverages each form their own little cluster. Cricket and India sit close together, and, somewhat less obviously, so do wine, France, Italy, and espresso.
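The plot itself can be produced with a scatter of the t-SNE coordinates plus one annotation per item; a minimal sketch follows. The helper name plot_items and the output file tsne_plot.png are my own choices, and the Agg backend is only there so the sketch runs headlessly; in a notebook you would drop it and call the helper with the tsne coordinates and item labels from above.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

def plot_items(coords, labels, path='tsne_plot.png'):
    # scatter the 2-D coordinates and annotate each point with its word
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.scatter(coords[:, 0], coords[:, 1])
    for label, (x1, y1) in zip(labels, coords):
        ax.annotate(label, (x1, y1), size=12)
    fig.savefig(path)
    plt.close(fig)
    return path

# in the tutorial: plot_items(tsne, [word for word, _ in item_vectors])
# standalone demo with random coordinates so the sketch runs on its own:
demo = plot_items(np.random.rand(6, 2), ['a', 'b', 'c', 'd', 'e', 'f'])
```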