1. Using a part-of-speech tagger
import nltk

text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
tagged_text = nltk.pos_tag(text)
print(tagged_text)

# To keep the tags simple, set tagset to 'universal'
tagged_text = nltk.pos_tag(text, tagset='universal')
print(tagged_text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
[('They', 'PRON'), ('refuse', 'VERB'), ('to', 'PRT'), ('permit', 'VERB'), ('us', 'PRON'), ('to', 'PRT'), ('obtain', 'VERB'), ('the', 'DET'), ('refuse', 'NOUN'), ('permit', 'NOUN')]
# Note: for what each tag abbreviation stands for, see https://mp.csdn.net/postedit/79006868
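The universal tagset is essentially a coarser bucketing of the Penn Treebank tags. A minimal pure-Python sketch of that idea, using a hand-made partial table (this table covers only the tags in the example above and is not NLTK's full mapping):

```python
# Partial Penn-to-universal mapping, for illustration only
PENN_TO_UNIVERSAL = {
    'PRP': 'PRON', 'VBP': 'VERB', 'VB': 'VERB',
    'TO': 'PRT', 'DT': 'DET', 'NN': 'NOUN',
}

tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB')]
# Fall back to 'X' (other) for tags outside the table
universal = [(w, PENN_TO_UNIVERSAL.get(t, 'X')) for (w, t) in tagged]
print(universal)
# [('They', 'PRON'), ('refuse', 'VERB'), ('to', 'PRT'), ('permit', 'VERB')]
```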
2. Creating tagged tuples with str2tuple()
tagged_token = nltk.tag.str2tuple('fly/NN')
print(tagged_token)
('fly', 'NN')
3. Reading tagged corpora
print(nltk.corpus.brown.tagged_words())
print(nltk.corpus.nps_chat.tagged_words())
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
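Tagged corpora are stored as text in roughly the 'word/TAG' form from section 2, so the reader's core job can be sketched in pure Python (split on the last separator, as `nltk.tag.str2tuple` does; NLTK's actual corpus readers are more involved):

```python
def str2tuple(s, sep='/'):
    """Split a 'word/TAG' string into a (word, tag) tuple.

    Splitting on the LAST separator keeps words that contain '/' intact.
    """
    word, _, tag = s.rpartition(sep)
    return (word, tag)

raw = 'The/AT Fulton/NP-TL County/NN-TL'
print([str2tuple(token) for token in raw.split()])
# [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]
```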
4. Exploring tagged corpora
brown_learned_text = nltk.corpus.brown.words(categories='learned')
sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often'))
[',', '.', 'accomplished', 'analytically', 'appear', 'apt', 'associated', 'assuming', ...]
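`nltk.bigrams` just yields consecutive pairs, so the same query can be sketched without NLTK (the word list below is a made-up stand-in for the Brown 'learned' category):

```python
def bigrams(seq):
    """Yield consecutive pairs from a sequence, a sketch of nltk.bigrams."""
    return zip(seq, seq[1:])

# Toy word list, for illustration only
words = ['is', 'often', 'accomplished', 'and', 'often', 'apt', 'often', 'appear']
print(sorted(set(b for (a, b) in bigrams(words) if a == 'often')))
# ['accomplished', 'appear', 'apt']
```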
5. Inspecting the tags of following words: the distribution of part-of-speech tags for words that follow 'often'
tags = nltk.pos_tag([b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often'], tagset='universal')
tags = [item[1] for item in tags]
fd = nltk.FreqDist(tags)
fd.tabulate()  # tabulate() prints the table itself and returns None, so don't wrap it in print()
VERB ADV ADJ ADP NOUN . PRT
  27  12   7   7    5 4   2
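A FreqDist is essentially a counter over outcomes; the tally above can be approximated with the standard library alone (the tag list here is a small made-up sample, not the real Brown counts):

```python
from collections import Counter

# Hypothetical tags of words following "often", for illustration only
tags = ['VERB', 'VERB', 'ADV', 'VERB', 'ADJ', 'ADV', '.']
fd = Counter(tags)

# most_common() sorts by frequency, like FreqDist's tabulate()
for tag, count in fd.most_common():
    print(tag, count)
```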
6. Using POS tags to find three-word phrases
def process(sentence):
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'):
            print(w1, w2, w3)

for tagged_sent in nltk.corpus.brown.tagged_sents():
    process(tagged_sent)
combined to achieve
continue to place
serve to protect
wanted to wait
allowed to place
expected to become
......
7. The default tagger: nltk.DefaultTagger()
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
print(nltk.FreqDist(tags).max())
NN
"NN" is the most frequent tag, so we take "NN" as the default tag, but the results are poor:
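A default tagger does nothing more than assign one fixed tag to every token; a minimal pure-Python sketch of the same behavior (function name is my own):

```python
def default_tag(tokens, tag='NN'):
    """Tag every token with the same fixed tag, like nltk.DefaultTagger."""
    return [(token, tag) for token in tokens]

print(default_tag(['I', 'do', 'not', 'like', 'ham']))
# [('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('ham', 'NN')]
```

Every word gets 'NN' regardless of context, which is why the accuracy below is only about 13%: that is simply the fraction of tokens in the news category that really are 'NN'.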
raw = 'I do not like green eggs and ham, I do not like them Sam I am'
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)
print(default_tagger.evaluate(brown_tagged_sents))
0.13089484257215028
8. The regular expression tagger
# Note: the patterns are tried in order, and the first one that matches wins.
The results are still not great, but better than the default tagger.
patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*ould$', 'MD'),
    (r'.*\'s$', 'NN$'),
    (r'.*s$', 'NNS'),
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # the dot must be escaped to match a literal decimal point
    (r'.*', 'NN'),
]
regexp_tagger = nltk.RegexpTagger(patterns)
print(regexp_tagger.tag(brown_sents[3]))
print(regexp_tagger.evaluate(brown_tagged_sents))
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'), ('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'), ("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'), ('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ('interest', 'NN'), ('in', 'NN'), ('the', 'NN'), ('election', 'NN'), (',', 'NN'), ('the', 'NN'), ('number', 'NN'), ('of', 'NN'), ('voters', 'NNS'), ('and', 'NN'), ('the', 'NN'), ('size', 'NN'), ('of', 'NN'), ('this', 'NNS'), ('city', 'NN'), ("''", 'NN'), ('.', 'NN')]
0.20326391789486245
9. The lookup tagger
fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# Note: in NLTK 3 a FreqDist's keys() are not sorted by frequency;
# [w for (w, _) in fd.most_common(100)] would give the actual top 100 words
most_freq_words = list(fd.keys())[:100]
likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
print(baseline_tagger.evaluate(brown_tagged_sents))
sent = brown.sents(categories='news')[5]
baseline_tagger.tag(sent)
0.3329355371243312
[('It', 'PPS'), ('recommended', 'VBD'), ('that', 'CS'), ('Fulton', 'NP-TL'), ('legislators', 'NNS'), ('act', 'NN'), ......]
10. N-gram tagging
(1) Unigram tagging: a simple statistical method that assigns each token its single most likely tag, ignoring context
# Train a unigram tagger by passing tagged sentences as the training data
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(brown_sents[2009])
[('The', 'AT'), ('structures', 'NNS'), ('housing', 'VBG'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('masonry', 'NN'), ('and', 'CC'), ('frame', 'NN'), ('construction', 'NN'), ('.', '.')]
(2) Separating training and test data: train the unigram tagger on 80% of the sentences; the reported accuracy is 93.60%
size = int(len(brown_tagged_sents) * 0.8)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[:size]  # bug: this is the training slice again; a true held-out set would be brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)
0.9359608998057523
(3) General N-gram tagging: the context is the current word together with the part-of-speech tags of the previous n-1 tokens
# Note: a bigram tagger can tag every word of sentences it saw during training,
# but it breaks down on an unseen sentence: as soon as it meets a new word it
# cannot assign a tag, and every tag after that becomes None as well.
bigram_tagger = nltk.BigramTagger(train_sents)
print(bigram_tagger.evaluate(test_sents))
print(bigram_tagger.tag(brown_sents[2007]))
0.7912525847484179
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
print(bigram_tagger.tag(brown_sents[4203]))
[('The', 'AT'), ('population', 'NN'), ('of', 'IN'), ('the', 'AT'), ('Congo', None), ('is', None), ('13.5', None), ('million', None), (',', None), ('divided', None), ('into', None), ('at', None), ('least', None), ('seven', None), ('major', None), ('``', None), ('culture', None), ('clusters', None), ("''", None), ('and', None), ('innumerable', None), ('tribes', None), ('speaking', None), ('400', None), ('separate', None), ('dialects', None), ('.', None)]
Note: N-grams should not use context that crosses a sentence boundary, so NLTK's taggers are designed to work on lists of sentences, where each sentence is a list of words.
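The bigram context can be sketched as a dictionary keyed on (previous tag, current word). Unseen contexts map to None, which is exactly why a plain bigram tagger collapses on new sentences. A toy sketch with made-up training data (all names and data are my own):

```python
from collections import Counter

# Toy training data; each sentence is a list of (word, tag) pairs
train = [[('the', 'AT'), ('jury', 'NN'), ('said', 'VBD')],
         [('the', 'AT'), ('act', 'NN')]]

# Count how often each tag occurs in each (previous tag, word) context
counts = Counter()
for sent in train:
    prev = '<S>'  # sentence-start marker: context never crosses a sentence boundary
    for word, tag in sent:
        counts[(prev, word, tag)] += 1
        prev = tag

# Keep only the most frequent tag for each context
context, best = {}, Counter()
for (prev, word, tag), n in counts.items():
    if n > best[(prev, word)]:
        best[(prev, word)] = n
        context[(prev, word)] = tag

def bigram_tag(sent):
    tagged, prev = [], '<S>'
    for word in sent:
        tag = context.get((prev, word))  # None for unseen contexts
        tagged.append((word, tag))
        prev = tag
    return tagged

print(bigram_tag(['the', 'jury', 'said']))  # seen in training: fully tagged
print(bigram_tag(['the', 'population']))    # unseen word: None, and it cascades
```

Once one tag is None, every following context key contains None, so the rest of the sentence gets None too, mirroring the Congo example above.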
11. Combining taggers: backoff
"Example: try to tag each token with the bigram tagger; if it cannot find a tag, try the unigram tagger; if that also fails, use the default tagger."
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
print(t0.evaluate(test_sents))
print(t1.evaluate(test_sents))
print(t2.evaluate(test_sents))
0.1334795413246444
0.9359608998057523
0.9740459928566952
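The backoff chain can be sketched as a list of taggers that return None when unsure, falling through to the next one in line (toy models and names of my own):

```python
def bigram_tagger(prev, word, model={('AT', 'jury'): 'NN'}):
    return model.get((prev, word))  # None for unseen contexts

def unigram_tagger(prev, word, model={'the': 'AT', 'jury': 'NN'}):
    return model.get(word)          # None for unseen words

def default_tagger(prev, word):
    return 'NN'                     # always answers, so the chain never fails

def tag_with_backoff(sent, taggers):
    tagged, prev = [], '<S>'
    for word in sent:
        for tagger in taggers:      # try the most specific tagger first
            tag = tagger(prev, word)
            if tag is not None:
                break
        tagged.append((word, tag))
        prev = tag
    return tagged

chain = [bigram_tagger, unigram_tagger, default_tagger]
print(tag_with_backoff(['the', 'jury', 'act'], chain))
# [('the', 'AT'), ('jury', 'NN'), ('act', 'NN')]
```

'the' falls through to the unigram model, 'jury' is caught by the bigram model, and 'act' (unknown to both) lands on the default tagger, just as in the t2 → t1 → t0 chain above.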
"Exercise: define a TrigramTagger named t3 that extends the previous example."
t3 = nltk.TrigramTagger(train_sents, backoff=t2)
print(t3.evaluate(test_sents))
0.9835954633748982
12. Storing taggers: use pickle (in Python 2, cPickle is a faster C implementation of pickle; Python 3 uses the C accelerator automatically)
# Store the tagger
from pickle import dump
output = open('t2.pkl', 'wb')
dump(t2, output, -1)
output.close()

# Load the tagger
from pickle import load
input = open('t2.pkl', 'rb')
tagger = load(input)
input.close()
tagger.tag(brown_sents[22])
[('Regarding', 'IN'), ("Atlanta's", 'NP$'), ('new', 'JJ'), ('multi-million-dollar', 'JJ'), ('airport', 'NN'), (',', ','), ('the', 'AT'), ('jury', 'NN'), ('recommended', 'VBD'), ......]
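Any picklable object round-trips the same way; a self-contained sketch with a plain dict standing in for the trained tagger (the temporary file path is arbitrary):

```python
import os
import pickle
import tempfile

# A dict model standing in for a trained tagger, for illustration only
model = {'the': 'AT', 'jury': 'NN', 'said': 'VBD'}

# Dump with the highest protocol (-1), then load it back
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
with open(path, 'wb') as output:
    pickle.dump(model, output, -1)
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded == model)   # True
```

Using `with` blocks instead of explicit `close()` calls guarantees the file handles are released even if dumping or loading raises an exception.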