Machine Learning Project (2): AI-Assisted Information Extraction (8)

The BiLSTM-CRF Model

BiLSTM-CRF

1. The sentence is converted into a sequence of character/word embeddings; the embeddings can be trained in advance or randomly initialized, and can be further fine-tuned during model training.
2. The BiLSTM layer extracts features and outputs, for each word, a score for every candidate tag.
3. The CRF layer applies constraints between tags and outputs the optimal tag sequence.
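
The three steps above map directly onto a small network: an embedding layer, a bidirectional LSTM that produces per-tag scores, and a CRF on top. Below is a minimal sketch, assuming PyTorch plus the third-party pytorch-crf package; the class name, layer sizes, and method names are illustrative and not from the original post.

import torch
import torch.nn as nn
from torchcrf import CRF  # third-party: pip install pytorch-crf


class BiLSTM_CRF(nn.Module):
    """Embedding -> BiLSTM -> linear projection to tag scores -> CRF."""

    def __init__(self, vocab_size, num_tags, embedding_dim=100, hidden_dim=200):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embedding_dim, hidden_dim // 2,
                              bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Linear(hidden_dim, num_tags)  # emission scores
        self.crf = CRF(num_tags, batch_first=True)         # holds the transition matrix

    def forward(self, word_ids, tags, mask):
        emissions = self.hidden2tag(self.bilstm(self.embedding(word_ids))[0])
        return -self.crf(emissions, tags, mask=mask)        # negative log-likelihood loss

    def predict(self, word_ids, mask):
        emissions = self.hidden2tag(self.bilstm(self.embedding(word_ids))[0])
        return self.crf.decode(emissions, mask=mask)        # best tag sequence per sentence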

Emission score

The emission scores come from the output of the BiLSTM layer.
$x_{i,y_j}$ denotes an emission score, where $i$ is the position index of the word and $y_j$ is the index of the tag.
For example, $x_{i=1,\,y_j=2} = x_{w_1,\,\text{B-Organization}} = 0.1$, i.e. the score of word $w_1$ being tagged B-Organization is 0.1.

Transition score

The transition scores come from a transition matrix that the CRF layer learns.
The transition matrix is a parameter of the BiLSTM-CRF model. Its entries can be randomly initialized and are then updated during training.
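
As a toy illustration (the tag set and values here are made up, not from the original post), the transition matrix is just a num_tags × num_tags table of learnable scores, where entry [i, j] is the score for moving from tag i to tag j:

import numpy as np

tags = ['O', 'B-Person', 'I-Person', 'B-Organization', 'I-Organization']
tag_2_id = {t: i for i, t in enumerate(tags)}

# Randomly initialized transition scores; training should push, e.g.,
# 'B-Person' -> 'I-Person' higher and 'O' -> 'I-Person' lower.
transitions = np.random.randn(len(tags), len(tags))
score = transitions[tag_2_id['B-Person'], tag_2_id['I-Person']]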

Path score

$S_i = \text{EmissionScore} + \text{TransitionScore}$
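
Written out for a sentence of length $n$ with tag sequence $y_1, \dots, y_n$ (this is the standard BiLSTM-CRF decomposition; $t$ is shorthand for the transition matrix):

$$S_i = \sum_{k=1}^{n} x_{k,\,y_k} + \sum_{k=0}^{n} t_{y_k,\,y_{k+1}}$$

where $x_{k,y_k}$ is the emission score of the $k$-th word taking tag $y_k$, and $t_{y_k,y_{k+1}}$ is the transition score from $y_k$ to $y_{k+1}$ (with $y_0 = \text{START}$ and $y_{n+1} = \text{END}$).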

CRF loss function

The CRF loss function consists of two parts: the score of the real (gold) path and the total score of all possible paths. The real path should end up with the highest score among all paths.
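
Concretely, if $S_{\text{RealPath}}$ is the score of the gold path and $S_1, \dots, S_N$ are the scores of all $N$ possible paths, the standard CRF loss is the negative log of the real path's share of the total:

$$\text{Loss} = -\log \frac{e^{S_{\text{RealPath}}}}{e^{S_1} + e^{S_2} + \cdots + e^{S_N}} = \log\Big(\sum_{j=1}^{N} e^{S_j}\Big) - S_{\text{RealPath}}$$

Minimizing this loss pushes the real path's score up relative to every other path, which is exactly the requirement stated above.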

Score at the current node

Similar to the Viterbi decoding algorithm, each node here accumulates the total over all paths from the preceding nodes to the current node; after the last step this gives the total score of all paths.
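
A minimal numpy sketch of this accumulation, done in log-space with log-sum-exp for numerical stability (the function and variable names are illustrative, not from the original post):

import numpy as np
from scipy.special import logsumexp


def total_score_of_all_paths(emissions, transitions):
    """emissions: (seq_len, num_tags); transitions: (num_tags, num_tags).
    Returns log(sum over all paths of exp(path score))."""
    previous = emissions[0]                        # log-space scores ending at each tag
    for obs in emissions[1:]:
        # previous[i] + transitions[i, j] + obs[j] for every (previous tag i, current tag j)
        scores = previous[:, None] + transitions + obs[None, :]
        previous = logsumexp(scores, axis=0)       # sum over the previous tag
    return logsumexp(previous)                     # sum over the final tag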

Prediction

Prediction uses Viterbi decoding: each node records the best path from the preceding nodes to the current node, and at the last step the single best path is recovered by backtracking.
$$\text{previous} = [\max(\text{scores}[00], \text{scores}[10]),\ \max(\text{scores}[01], \text{scores}[11])]$$
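
The same recursion in a minimal numpy sketch, with max in place of log-sum-exp and backpointers kept for the final backtracking step (names and shapes are illustrative, not from the original post):

import numpy as np


def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, num_tags); transitions: (num_tags, num_tags).
    Returns the highest-scoring sequence of tag indices."""
    previous = emissions[0]                        # best score ending at each tag so far
    backpointers = []
    for obs in emissions[1:]:
        scores = previous[:, None] + transitions + obs[None, :]
        backpointers.append(scores.argmax(axis=0)) # best previous tag for each current tag
        previous = scores.max(axis=0)
    # Backtrack from the best final tag
    best_tag = int(previous.argmax())
    best_path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        best_path.append(best_tag)
    return best_path[::-1]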

1. Load the data

import codecs
import pickle
import re

import numpy as np

ner_train_path = '../dataset/ner/ner.train'
ner_test_path = '../dataset/ner/ner.test'

dl_dicts_path = './warehouse/dicts.pkl'


def load_dataset(path):
    """Read a CoNLL-style file: one token per line, sentences separated by blank lines."""
    sentences = []
    labels = []
    sentence = []
    label = []
    with codecs.open(path, 'r', encoding='utf-8') as f:
        for line in f.readlines():
            if line == '\n':
                # A blank line marks the end of the current sentence
                sentences.append(sentence)
                sentence = []
                labels.append(label)
                label = []
            else:
                try:
                    if line[0] == ' ':
                        # The token itself is a space; take the last field as its tag
                        word, tag = ' ', line[:-1].split(' ')[-1]
                    else:
                        word, _, tag = line[:-1].split(' ')
                    # Lowercase and map every digit to '0' to shrink the vocabulary
                    word = re.sub('[0-9]', '0', word.lower())
                    sentence.append(word)
                    label.append(tag)
                except ValueError:
                    # Report the malformed line and stop parsing
                    print(line)
                    break
    # dtype=object keeps the ragged (variable-length) sentence lists intact
    return np.array(sentences, dtype=object), np.array(labels, dtype=object)
train_sentences, train_labels = load_dataset(ner_train_path)
test_sentences, test_labels = load_dataset(ner_test_path)
len(train_labels), len(test_labels)
# Build the vocabulary from the training set; reserve index 0 for 'pad' and index 1 for 'unk'
total_words = list(set([x for sentence in train_sentences for x in sentence]))
total_words.insert(0, 'unk')
total_words.insert(0, 'pad')

# Build the tag set, making sure 'O' ends up at index 0
total_labels = list(set([x for label in train_labels for x in label]) - {'O'})
total_labels.insert(0, 'O')
len(total_words), len(total_labels)

total_words[:10]

print(total_labels)
word_2_id = {w: index for index, w in enumerate(total_words)}
id_2_word = {index: w for w, index in word_2_id.items()}
assert word_2_id['pad'] == 0
assert word_2_id['unk'] == 1

label_2_id = {label: index for index, label in enumerate(total_labels)}
id_2_label = {index: label for label, index in label_2_id.items()}
assert label_2_id['O'] == 0


with codecs.open(dl_dicts_path, "wb") as f:
    pickle.dump([word_2_id, id_2_word, label_2_id, id_2_label], f)



2. Split off a validation set
Split off 1/10 of the training set as the validation set, and convert every dataset into id sequences.

dl_train_path = './data/dl_ner.train'
dl_val_path = './data/dl_ner.val'
dl_test_path = './data/dl_ner.test'

len(train_labels), len(test_labels)

count = len(train_labels)
split_index = count // 10

# Shuffle the sentence indices and hold out the first 10% as the validation split
indexs = np.arange(count)
np.random.shuffle(indexs)
train_indexs = indexs[split_index:]
val_indexs = indexs[:split_index]
val_indexs[:10]

train_sentences_splited = train_sentences[train_indexs]
train_labels_splited = train_labels[train_indexs]
val_sentences_splited = train_sentences[val_indexs]
val_labels_splited = train_labels[val_indexs]

len(train_labels_splited), len(val_labels_splited), len(test_labels)


def build_dl_data(sentences, labels, word_2_id, label_2_id):
    """Convert each sentence and its tag sequence into id sequences."""
    data = []
    for index in range(len(sentences)):
        sentence, label = sentences[index], labels[index]
        # Unknown words fall back to index 1 ('unk'); unknown tags fall back to index 0 ('O')
        sentence_id = [word_2_id.get(w, 1) for w in sentence]
        label_id = [label_2_id.get(l, 0) for l in label]
        data.append([sentence, sentence_id, label_id])
    return data

train_data = build_dl_data(train_sentences_splited, train_labels_splited, word_2_id, label_2_id)
val_data = build_dl_data(val_sentences_splited, val_labels_splited, word_2_id, label_2_id)
test_data = build_dl_data(test_sentences, test_labels, word_2_id, label_2_id)

len(train_data), len(val_data), len(test_data)

print(val_data[0])

with codecs.open(dl_train_path, "wb") as f:
    pickle.dump(train_data, f)
    
with codecs.open(dl_val_path, "wb") as f:
    pickle.dump(val_data, f)
    
with codecs.open(dl_test_path, "wb") as f:
    pickle.dump(test_data, f)

