Getting Started with AmpliGraph


AmpliGraph is an open-source, TensorFlow-based library developed by Accenture Labs for predicting links between concepts in a knowledge graph. It is a collection of neural machine learning models for statistical relational learning (SRL, also known as relational machine learning), a subdiscipline of AI/ML concerned with supervised learning on knowledge graphs. With AmpliGraph you can: (1) discover new knowledge from an existing knowledge graph; (2) complete large knowledge graphs by predicting missing statements; (3) generate standalone knowledge graph embeddings; (4) develop and evaluate new relational models.

1. Introduction and Setup

In this hands-on tutorial we will use the open-source library AmpliGraph.
Let's first install the library and its dependencies, and then import the packages used throughout the tutorial.

# Install CUDA
conda install -y cudatoolkit=10.0

# Install cudnn libraries
conda install cudnn=7.6

# Install tensorflow GPU 
pip install tensorflow-gpu==1.15.3

Check TensorFlow and the GPU:

import tensorflow as tf 
print('TensorFlow version: {}'.format(tf.__version__))
# verify that TensorFlow can see the GPU (TF 1.x API)
print('GPU available: {}'.format(tf.test.is_gpu_available()))

Install AmpliGraph and the remaining dependencies:

# Install AmpliGraph library
pip install ampligraph

# Required to visualize embeddings with tensorboard projector, comment out if not required!
pip install --user tensorboard

# Required to plot text on embedding clusters, comment out if not required!
pip install --user git+https://github.com/Phlya/adjustText

2. Loading a Knowledge Graph Dataset

First we need a knowledge graph, so let's load a standard one called Freebase-15k-237 (FB15k-237).

AmpliGraph provides a set of APIs for loading standard knowledge graph datasets.

It also provides APIs for loading graphs in CSV, N-Triples, and RDF formats; details can be found here.

from ampligraph.datasets import load_fb15k_237, load_wn18rr, load_yago3_10
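As a quick illustration (a minimal sketch, assuming the AmpliGraph 1.x API used in this tutorial), each of these loaders returns a dict of NumPy arrays holding the raw triples of the standard splits:

# Each loader returns a dict with 'train', 'valid' and 'test' arrays of triples
X = load_fb15k_237()
print(X['train'].shape, X['valid'].shape, X['test'].shape)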

For this tutorial, the Freebase-237 IDs have been remapped and merged into a CSV file that contains human-readable names instead of IDs.

import pandas as pd

URL = './data/freebase-237-merged-and-remapped.csv'
dataset = pd.read_csv(URL, header=None)
dataset.columns = ['subject', 'predicate', 'object']
dataset.head(5)


print('Total triples in the KG:', dataset.shape)

Total triples in the KG: (310079, 3)

Creating the training, validation, and test sets
We use the train_test_split_no_unseen function provided by AmpliGraph to create these splits.

from ampligraph.evaluation import train_test_split_no_unseen
# get the validation set of size 500
test_train, X_valid = train_test_split_no_unseen(dataset.values, 500, seed=0)

# get the test set of size 1000 from the remaining triples
X_train, X_test = train_test_split_no_unseen(test_train, 1000, seed=0)

print('Total triples:', dataset.shape)
print('Size of train:', X_train.shape)
print('Size of valid:', X_valid.shape)
print('Size of test:', X_test.shape)

Total triples: (310079, 3)
Size of train: (308579, 3)
Size of valid: (500, 3)
Size of test: (1000, 3)
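A key property of train_test_split_no_unseen is that the held-out splits contain no entities or relations that are absent from the training split. A small sanity check (a sketch, not part of the original listing) can confirm this:

import numpy as np

# All entities appearing in the test split should also appear in training
train_entities = set(X_train[:, 0]) | set(X_train[:, 2])
test_entities = set(X_test[:, 0]) | set(X_test[:, 2])
print('Unseen test entities:', test_entities - train_entities)  # expected: set()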

3. Model Training

Now that we have split the dataset, let's dive straight into model training.

Create a TransE model and train it on the training split using the fit function.

TransE is one of the earliest embedding models and laid the groundwork for KGE research. It scores a triple using simple vector algebra, and it has very few trainable parameters compared to most other models.

$f = -\left\lVert \mathbf{s} + \mathbf{p} - \mathbf{o} \right\rVert_{n}$
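To make the scoring function concrete, here is a toy NumPy illustration with random vectors (not AmpliGraph internals):

import numpy as np

# s, p, o are k-dimensional embeddings of subject, predicate, and object;
# the score is the negative Ln distance between (s + p) and o
k = 4
rng = np.random.default_rng(0)
s, p, o = rng.random((3, k))
score_l1 = -np.linalg.norm(s + p - o, ord=1)
score_l2 = -np.linalg.norm(s + p - o, ord=2)
print(score_l1, score_l2)  # higher (less negative) means more plausible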

from ampligraph.latent_features import TransE

model = TransE(k=150,                                                             # embedding size
               epochs=100,                                                        # Num of epochs
               batches_count= 10,                                                 # Number of batches 
               eta=1,                                                             # number of corruptions to generate during training
               loss='pairwise', loss_params={'margin': 1},                        # loss type and its hyperparameters         
               initializer='xavier', initializer_params={'uniform': False},       # initializer type and its hyperparameters
               regularizer='LP', regularizer_params= {'lambda': 0.001, 'p': 3},   # regularizer along with its hyperparameters
               optimizer= 'adam', optimizer_params= {'lr': 0.001},                # optimizer to use along with its hyperparameters
               seed= 0, verbose=True)

model.fit(X_train)

from ampligraph.utils import save_model, restore_model
save_model(model, 'TransE-small.pkl')
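The restore_model helper imported above is not used in the original listing; as a brief sketch, it reloads the saved model from disk:

# Reload the trained model saved above (assumes 'TransE-small.pkl' exists)
model = restore_model('TransE-small.pkl')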


Refer to this link for a detailed description of the parameters and their values.

Computing evaluation metrics
score: this is the value the model assigns to a triple by applying the scoring function.

test_triple = ['harrison ford', 
               '/film/actor/film./film/performance/film', 
               'star wars']

triple_score = model.predict(test_triple)

print('Triple of interest:\n', test_triple)
print('Triple Score:\n', triple_score)

Triple of interest:
['harrison ford', '/film/actor/film./film/performance/film', 'star wars']
Triple Score:
[-8.270267]

import numpy as np
list_of_actors = ['salma hayek', 'carrie fisher', 'natalie portman',  'kristen bell',
                  'mark hamill', 'neil patrick harris', 'harrison ford' ]

# stack it horizontally to create s, p, o
hypothesis = np.column_stack([list_of_actors, 
                              ['/film/actor/film./film/performance/film'] * len(list_of_actors),
                              ['star wars'] * len(list_of_actors),
                             ])

# score the hypothesis
triple_scores = model.predict(hypothesis)

# append the scores column
scored_hypothesis = np.column_stack([hypothesis, triple_scores])
# sort by score in descending order
scored_hypothesis = scored_hypothesis[np.argsort(scored_hypothesis[:, 3])]
scored_hypothesis

Result:

array([['harrison ford', '/film/actor/film./film/performance/film',
        'star wars', '-8.270266'],
       ['carrie fisher', '/film/actor/film./film/performance/film',
        'star wars', '-8.357192'],
       ['natalie portman', '/film/actor/film./film/performance/film',
        'star wars', '-8.739484'],
       ['neil patrick harris', '/film/actor/film./film/performance/film',
        'star wars', '-9.089647'],
       ['mark hamill', '/film/actor/film./film/performance/film',
        'star wars', '-9.17255'],
       ['salma hayek', '/film/actor/film./film/performance/film',
        'star wars', '-9.205964'],
       ['kristen bell', '/film/actor/film./film/performance/film',
        'star wars', '-9.764657']], dtype='<U39')


Given the sorted scores above, every triple now has a score, and its rank can be computed against a set of corruptions as $\text{COUNT}(corruption_{score} \geq triple_{score})$. Finding the position of the hypothesis score within sub_corr_score yields sub_rank, as sketched below.
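The following is an illustrative sketch of how such subject-side corruptions could be generated and scored (the entity-list construction here is an assumption, not part of the original listing):

# Corrupt the subject of the test triple with every known entity,
# then score all corruptions with the trained model
entities = np.unique(np.concatenate([X_train[:, 0], X_train[:, 2]]))
sub_corruptions = np.column_stack([entities,
                                   [test_triple[1]] * len(entities),
                                   [test_triple[2]] * len(entities)])
sub_corr_score = model.predict(sub_corruptions)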

# worst rank (ties broken pessimistically): corruptions scoring >= the true triple, plus one
sub_rank_worst = np.sum(np.greater_equal(sub_corr_score, triple_score[0])) + 1

Assigning the worst rank (to break ties): 1655

X_test_small = np.array(
                [['doctorate',
                    '/education/educational_degree/people_with_this_degree./education/education/major_field_of_study',
                    'computer engineering'],

                ['star wars',
                    '/film/film/estimated_budget./measurement_unit/dated_money_value/currency',
                    'united states dollar'],

                ['harry potter and the chamber of secrets',
                    '/film/film/estimated_budget./measurement_unit/dated_money_value/currency',
                    'united states dollar'],

                ['star wars', '/film/film/language', 'english language'],
                ['harrison ford', '/film/actor/film./film/performance/film', 'star wars']])


from ampligraph.evaluation import evaluate_performance

# filtered evaluation: all known true triples are excluded from the corruptions
X_filter = np.concatenate([X_train, X_valid, X_test], 0)

ranks = evaluate_performance(X_test_small, 
                             model=model, 
                             filter_triples=X_filter, 
                             corrupt_side='s,o')   # corrupt both subject and object
print(ranks)


Result (each row holds the subject-side and object-side rank for one test triple):

[[   9    5]
 [   1    1]
 [  77    1]
 [   2    2]
 [1644  833]]

Mean Rank (MR): as the name suggests, the average of the ranks of all triples. It ranges from 1 (the ideal case, where every rank equals 1) down to the worst case, where every triple is ranked last.
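Formally, for a set $Q$ of test triples, the standard definition is:

$MR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} rank_{(s,p,o)_i}$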


from ampligraph.evaluation import mr_score
print('MR :', mr_score(ranks))

MR : 257.5

Mean Reciprocal Rank (MRR): the average of the reciprocals of the ranks of all triples. It ranges from 0 to 1; the higher the value, the better the model.
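Its standard definition is:

$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_{(s,p,o)_i}}$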


from ampligraph.evaluation import mrr_score
print('MRR :', mrr_score(ranks))

MRR : 0.4325906876796283

hits@n: the percentage of computed ranks that are less than or equal to n. It ranges from 0 to 1; the higher the value, the better the model.
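Its standard definition is:

$Hits@N = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbb{1}\left[ rank_{(s,p,o)_i} \leq N \right]$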


from ampligraph.evaluation import hits_at_n_score
print('hits@1 :', hits_at_n_score(ranks, 1))
print('hits@10 :', hits_at_n_score(ranks, 10))

hits@1 : 0.3
hits@10 : 0.7

Let's wrap these metrics up in a single helper function:

def display_aggregate_metrics(ranks):
    print('Mean Rank:', mr_score(ranks)) 
    print('Mean Reciprocal Rank:', mrr_score(ranks)) 
    print('Hits@1:', hits_at_n_score(ranks, 1))
    print('Hits@10:', hits_at_n_score(ranks, 10))
    print('Hits@100:', hits_at_n_score(ranks, 100))


display_aggregate_metrics(ranks)

Mean Rank: 257.5
Mean Reciprocal Rank: 0.4325906876796283
Hits@1: 0.3
Hits@10: 0.7
Hits@100: 0.8

4. Training with Early Stopping

When training a model, we want to make sure it neither underfits nor overfits the data. If we train a model for a fixed number of epochs, we have no way of knowing whether it has underfit or overfit the training data. It is therefore necessary to test the model's performance at fixed intervals to decide when to stop training. This is called early stopping: rather than letting the model run for a long time, we stop before its performance starts to degrade.
However, we also don't want the model to overfit the held-out set, which would limit its ability to generalize. We should therefore create both a validation set and a test set to verify the model's ability to generalize and to make sure we neither underfit nor overfit the data.


early_stopping_params = { 'x_valid': X_valid,   # Validation set on which early stopping will be performed
                          'criteria': 'mrr',    # metric to watch during early stopping
                          'burn_in': 150,       # Burn in time, i.e. early stopping checks will not be performed till 150 epochs
                          'check_interval': 50, # After burn in time, early stopping checks will be performed at every 50th epochs (i.e. 150, 200, 250, ...)
                          'stop_interval': 2,   # If the monitored criterion degrades for this many consecutive checks, training stops. 
                          'corrupt_side':'s,o'  # Which sides to corrupt during early stopping evaluation (default: both subject and object, as described earlier)
                        }

# create a model as earlier
model = TransE(k=100, 
               epochs=10000, 
               eta=1, 
               loss='multiclass_nll', 
               initializer='xavier', initializer_params={'uniform': False},
               regularizer='LP', regularizer_params= {'lambda': 0.0001, 'p': 3},
               optimizer= 'adam', optimizer_params= {'lr': 0.001}, 
               seed= 0, batches_count= 1, verbose=True)

# call model.fit by passing early stopping params
model.fit(X_train,                                      # training set
          early_stopping=True,                          # set early stopping to true
          early_stopping_params=early_stopping_params)  # pass the early stopping params

# evaluate the model with filter
X_filter = np.concatenate([X_train, X_valid, X_test], 0)
ranks = evaluate_performance(X_test, 
                             model=model, 
                             filter_triples=X_filter)
# display the metrics
display_aggregate_metrics(ranks)



Reposted from juejin.im/post/7033386911968428040