1. Using a part-of-speech tagger
import nltk

text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
tagged_text = nltk.pos_tag(text)
print(tagged_text)

# To keep the tags simple, set tagset to 'universal'
tagged_text = nltk.pos_tag(text, tagset='universal')
print(tagged_text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
[('They', 'PRON'), ('refuse', 'VERB'), ('to', 'PRT'), ('permit', 'VERB'), ('us', 'PRON'), ('to', 'PRT'), ('obtain', 'VERB'), ('the', 'DET'), ('refuse', 'NOUN'), ('permit', 'NOUN')]
# Note: for what each tag abbreviation stands for, see https://mp.csdn.net/postedit/79006868
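The universal tagset is essentially a coarser bucketing of the Penn Treebank tags. A minimal pure-Python sketch of that idea, using a hand-made partial table (this table covers only the tags in the example above and is not NLTK's full mapping):

```python
# Partial Penn-to-universal mapping, for illustration only
PENN_TO_UNIVERSAL = {
    'PRP': 'PRON', 'VBP': 'VERB', 'VB': 'VERB',
    'TO': 'PRT', 'DT': 'DET', 'NN': 'NOUN',
}

tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB')]
# Fall back to 'X' (other) for tags outside the table
universal = [(w, PENN_TO_UNIVERSAL.get(t, 'X')) for (w, t) in tagged]
print(universal)
# [('They', 'PRON'), ('refuse', 'VERB'), ('to', 'PRT'), ('permit', 'VERB')]
```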
2. Creating tagged tuples with str2tuple()
tagged_token = nltk.tag.str2tuple('fly/NN')
print(tagged_token)
('fly', 'NN')
3. Reading tagged corpora
print(nltk.corpus.brown.tagged_words())
print(nltk.corpus.nps_chat.tagged_words())
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
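Tagged corpora are stored as text in roughly the 'word/TAG' form from section 2, so the reader's core job can be sketched in pure Python (split on the last separator, as `nltk.tag.str2tuple` does; NLTK's actual corpus readers are more involved):

```python
def str2tuple(s, sep='/'):
    """Split a 'word/TAG' string into a (word, tag) tuple.

    Splitting on the LAST separator keeps words that contain '/' intact.
    """
    word, _, tag = s.rpartition(sep)
    return (word, tag)

raw = 'The/AT Fulton/NP-TL County/NN-TL'
print([str2tuple(token) for token in raw.split()])
# [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]
```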
4. Exploring tagged corpora
brown_learned_text = nltk.corpus.brown.words(categories='learned')
sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often'))
[',', '.', 'accomplished', 'analytically', 'appear', 'apt', 'associated', 'assuming', ...]
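`nltk.bigrams` just yields consecutive pairs, so the same query can be sketched without NLTK (the word list below is a made-up stand-in for the Brown 'learned' category):

```python
def bigrams(seq):
    """Yield consecutive pairs from a sequence, a sketch of nltk.bigrams."""
    return zip(seq, seq[1:])

# Toy word list, for illustration only
words = ['is', 'often', 'accomplished', 'and', 'often', 'apt', 'often', 'appear']
print(sorted(set(b for (a, b) in bigrams(words) if a == 'often')))
# ['accomplished', 'appear', 'apt']
```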
5. Inspecting the tags of following words: the distribution of part-of-speech tags for words that follow 'often'
tags = nltk.pos_tag([b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often'], tagset='universal')
tags = [item[1] for item in tags]
fd = nltk.FreqDist(tags)
fd.tabulate()  # tabulate() prints the table itself and returns None, so don't wrap it in print()
VERB ADV ADJ ADP NOUN . PRT
  27  12   7   7    5 4   2
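A FreqDist is essentially a counter over outcomes; the tally above can be approximated with the standard library alone (the tag list here is a small made-up sample, not the real Brown counts):

```python
from collections import Counter

# Hypothetical tags of words following "often", for illustration only
tags = ['VERB', 'VERB', 'ADV', 'VERB', 'ADJ', 'ADV', '.']
fd = Counter(tags)

# most_common() sorts by frequency, like FreqDist's tabulate()
for tag, count in fd.most_common():
    print(tag, count)
```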
6. Using POS tags to find three-word phrases
def process(sentence):
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'):
            print(w1, w2, w3)

for tagged_sent in nltk.corpus.brown.tagged_sents():
    process(tagged_sent)
combined to achieve
continue to place
serve to protect
wanted to wait
allowed to place
expected to become
......
7. The default tagger: nltk.DefaultTagger()
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
print(nltk.FreqDist(tags).max())
NN
"NN" is the most frequent tag, so we take "NN" as the default tag, but the results are poor:
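A default tagger does nothing more than assign one fixed tag to every token; a minimal pure-Python sketch of the same behavior (function name is my own):

```python
def default_tag(tokens, tag='NN'):
    """Tag every token with the same fixed tag, like nltk.DefaultTagger."""
    return [(token, tag) for token in tokens]

print(default_tag(['I', 'do', 'not', 'like', 'ham']))
# [('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('ham', 'NN')]
```

Every word gets 'NN' regardless of context, which is why the accuracy below is only about 13%: that is simply the fraction of tokens in the news category that really are 'NN'.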
raw = 'I do not like green eggs and ham, I do not like them Sam I am'
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)
print(default_tagger.evaluate(brown_tagged_sents))
0.13089484257215028
8. The regular expression tagger
# Note: the patterns are tried in order, and the first one that matches wins.
The results are still not great, but better than the default tagger.
patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*ould$', 'MD'),
    (r'.*\'s$', 'NN$'),
    (r'.*s$', 'NNS'),
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # the dot must be escaped to match a literal decimal point
    (r'.*', 'NN'),
]
regexp_tagger = nltk.RegexpTagger(patterns)
print(regexp_tagger.tag(brown_sents[3]))
print(regexp_tagger.evaluate(brown_tagged_sents))
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'), ('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'), ("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'), ('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ('interest', 'NN'), ('in', 'NN'), ('the', 'NN'), ('election', 'NN'), (',', 'NN'), ('the', 'NN'), ('number', 'NN'), ('of', 'NN'), ('voters', 'NNS'), ('and', 'NN'), ('the', 'NN'), ('size', 'NN'), ('of', 'NN'), ('this', 'NNS'), ('city', 'NN'), ("''", 'NN'), ('.', 'NN')]
0.20326391789486245
9. The lookup tagger
fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# Note: in NLTK 3 a FreqDist's keys() are not sorted by frequency;
# [w for (w, _) in fd.most_common(100)] would give the actual top 100 words
most_freq_words = list(fd.keys())[:100]
likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
print(baseline_tagger.evaluate(brown_tagged_sents))
sent = brown.sents(categories='news')[5]
baseline_tagger.tag(sent)
0.3329355371243312
[('It', 'PPS'), ('recommended', 'VBD'), ('that', 'CS'), ('Fulton', 'NP-TL'), ('legislators', 'NNS'), ('act', 'NN'), ......]
10. N-gram tagging
(1) Unigram tagging: a simple statistical method that assigns each token its single most likely tag, ignoring context
# Train a unigram tagger by passing tagged sentences as the training data
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(brown_sents[2009])
[('The', 'AT'), ('structures', 'NNS'), ('housing', 'VBG'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('masonry', 'NN'), ('and', 'CC'), ('frame', 'NN'), ('construction', 'NN'), ('.', '.')]
(2) Separating training and test data: train the unigram tagger on 80% of the sentences; the reported accuracy is 93.60%
size = int(len(brown_tagged_sents) * 0.8)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[:size]  # bug: this is the training slice again; a true held-out set would be brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)
0.9359608998057523
(3) General N-gram tagging: the context is the current word together with the part-of-speech tags of the previous n-1 tokens
# Note: a bigram tagger can tag every word of sentences it saw during training,
# but it breaks down on an unseen sentence: as soon as it meets a new word it
# cannot assign a tag, and every tag after that becomes None as well.
bigram_tagger = nltk.BigramTagger(train_sents)
print(bigram_tagger.evaluate(test_sents))
print(bigram_tagger.tag(brown_sents[2007]))
0.7912525847484179
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
print(bigram_tagger.tag(brown_sents[4203]))
[('The', 'AT'), ('population', 'NN'), ('of', 'IN'), ('the', 'AT'), ('Congo', None), ('is', None), ('13.5', None), ('million', None), (',', None), ('divided', None), ('into', None), ('at', None), ('least', None), ('seven', None), ('major', None), ('``', None), ('culture', None), ('clusters', None), ("''", None), ('and', None), ('innumerable', None), ('tribes', None), ('speaking', None), ('400', None), ('separate', None), ('dialects', None), ('.', None)]
Note: N-grams should not use context that crosses a sentence boundary, so NLTK's taggers are designed to work on lists of sentences, where each sentence is a list of words.
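The bigram context can be sketched as a dictionary keyed on (previous tag, current word). Unseen contexts map to None, which is exactly why a plain bigram tagger collapses on new sentences. A toy sketch with made-up training data (all names and data are my own):

```python
from collections import Counter

# Toy training data; each sentence is a list of (word, tag) pairs
train = [[('the', 'AT'), ('jury', 'NN'), ('said', 'VBD')],
         [('the', 'AT'), ('act', 'NN')]]

# Count how often each tag occurs in each (previous tag, word) context
counts = Counter()
for sent in train:
    prev = '<S>'  # sentence-start marker: context never crosses a sentence boundary
    for word, tag in sent:
        counts[(prev, word, tag)] += 1
        prev = tag

# Keep only the most frequent tag for each context
context, best = {}, Counter()
for (prev, word, tag), n in counts.items():
    if n > best[(prev, word)]:
        best[(prev, word)] = n
        context[(prev, word)] = tag

def bigram_tag(sent):
    tagged, prev = [], '<S>'
    for word in sent:
        tag = context.get((prev, word))  # None for unseen contexts
        tagged.append((word, tag))
        prev = tag
    return tagged

print(bigram_tag(['the', 'jury', 'said']))  # seen in training: fully tagged
print(bigram_tag(['the', 'population']))    # unseen word: None, and it cascades
```

Once one tag is None, every following context key contains None, so the rest of the sentence gets None too, mirroring the Congo example above.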
11. Combining taggers: backoff
"Example: try to tag each token with the bigram tagger; if it cannot find a tag, try the unigram tagger; if that also fails, use the default tagger."
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
print(t0.evaluate(test_sents))
print(t1.evaluate(test_sents))
print(t2.evaluate(test_sents))
0.1334795413246444
0.9359608998057523
0.9740459928566952
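The backoff chain can be sketched as a list of taggers that return None when unsure, falling through to the next one in line (toy models and names of my own):

```python
def bigram_tagger(prev, word, model={('AT', 'jury'): 'NN'}):
    return model.get((prev, word))  # None for unseen contexts

def unigram_tagger(prev, word, model={'the': 'AT', 'jury': 'NN'}):
    return model.get(word)          # None for unseen words

def default_tagger(prev, word):
    return 'NN'                     # always answers, so the chain never fails

def tag_with_backoff(sent, taggers):
    tagged, prev = [], '<S>'
    for word in sent:
        for tagger in taggers:      # try the most specific tagger first
            tag = tagger(prev, word)
            if tag is not None:
                break
        tagged.append((word, tag))
        prev = tag
    return tagged

chain = [bigram_tagger, unigram_tagger, default_tagger]
print(tag_with_backoff(['the', 'jury', 'act'], chain))
# [('the', 'AT'), ('jury', 'NN'), ('act', 'NN')]
```

'the' falls through to the unigram model, 'jury' is caught by the bigram model, and 'act' (unknown to both) lands on the default tagger, just as in the t2 → t1 → t0 chain above.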
"Exercise: define a TrigramTagger named t3 that extends the previous example."
t3 = nltk.TrigramTagger(train_sents, backoff=t2)
print(t3.evaluate(test_sents))
0.9835954633748982
12. Storing taggers: use pickle (in Python 2, cPickle is a faster C implementation of pickle; Python 3 uses the C accelerator automatically)
# Store the tagger
from pickle import dump
output = open('t2.pkl', 'wb')
dump(t2, output, -1)
output.close()

# Load the tagger
from pickle import load
input = open('t2.pkl', 'rb')
tagger = load(input)
input.close()
tagger.tag(brown_sents[22])
[('Regarding', 'IN'), ("Atlanta's", 'NP$'), ('new', 'JJ'), ('multi-million-dollar', 'JJ'), ('airport', 'NN'), (',', ','), ('the', 'AT'), ('jury', 'NN'), ('recommended', 'VBD'), ......]
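Any picklable object round-trips the same way; a self-contained sketch with a plain dict standing in for the trained tagger (the temporary file path is arbitrary):

```python
import os
import pickle
import tempfile

# A dict model standing in for a trained tagger, for illustration only
model = {'the': 'AT', 'jury': 'NN', 'said': 'VBD'}

# Dump with the highest protocol (-1), then load it back
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
with open(path, 'wb') as output:
    pickle.dump(model, output, -1)
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded == model)   # True
```

Using `with` blocks instead of explicit `close()` calls guarantees the file handles are released even if dumping or loading raises an exception.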