Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/AgoniAngel/article/details/80146491
1. The DefaultTagger
DefaultTagger assigns the same tag to every token.
import nltk  # word_tokenize also needs NLTK's Punkt tokenizer data to be downloaded
sent = "Thanks for your reading!"
tokens = nltk.word_tokenize(sent)
default_tagger = nltk.DefaultTagger('NN')
tagged_words = default_tagger.tag(tokens)
print(tagged_words)
result:
[('Thanks', 'NN'), ('for', 'NN'), ('your', 'NN'), ('reading', 'NN'), ('!', 'NN')]
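The evaluate function measures the accuracy of this tagging approach (newer NLTK releases deprecate evaluate in favor of the equivalent accuracy method). Here the test uses tagged_sents, the sentences with part-of-speech tags provided by the Brown corpus: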
from nltk.corpus import brown  # requires the Brown corpus data to be downloaded
brown_tagged_sents = brown.tagged_sents(categories='news')
default_tagger = nltk.DefaultTagger('NN')
print(default_tagger.evaluate(brown_tagged_sents))
result:
0.13089484257215028
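The output shows that tagging every word as NN is only about 13% accurate. Equivalently, about 13% of the tokens in brown_tagged_sents carry the tag NN (singular common noun; other noun tags such as NNS are counted separately).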
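The link between the two numbers can be checked on toy data. The following is a minimal plain-Python sketch, not NLTK code: the sentences in gold_sents and the helper names default_tag_accuracy and tag_share are made up for illustration.

```python
from collections import Counter

# Hypothetical hand-tagged sentences (not taken from the Brown corpus).
gold_sents = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "AT"), ("cat", "NN"), ("sleeps", "VBZ"), ("quietly", "RB")],
    [("rain", "NN"), ("falls", "VBZ")],
]

def default_tag_accuracy(tagged_sents, tag):
    """Accuracy of a tagger that predicts `tag` for every token."""
    gold = [t for sent in tagged_sents for (_, t) in sent]
    predicted = [tag] * len(gold)  # what a default tagger would output
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

def tag_share(tagged_sents, tag):
    """Fraction of tokens whose gold tag is `tag`."""
    counts = Counter(t for sent in tagged_sents for (_, t) in sent)
    return counts[tag] / sum(counts.values())

print(default_tag_accuracy(gold_sents, "NN"))
print(tag_share(gold_sents, "NN"))
```

Both calls print the same value, because a constant prediction is correct exactly on the tokens that already carry that tag.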
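2. N-gram taggers
The first 90% of brown_tagged_sents is used as training data and the last 10% as test data.
Take UnigramTagger(train_data, backoff=default_tagger) as an example: any word the UnigramTagger cannot tag (for instance a word that never appeared in train_data) is tagged by the tagger passed as backoff, here default_tagger.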
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger
# first 90% of the sentences for training, last 10% for testing
split = int(len(brown_tagged_sents) * 0.9)
train_data = brown_tagged_sents[:split]
test_data = brown_tagged_sents[split:]
# each tagger backs off to the simpler one before it
unigram_tagger = UnigramTagger(train_data, backoff=default_tagger)
print(unigram_tagger.evaluate(test_data))
bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)
print(bigram_tagger.evaluate(test_data))
trigram_tagger = TrigramTagger(train_data, backoff=bigram_tagger)
print(trigram_tagger.evaluate(test_data))
result:
0.8361407355726104
0.8452108043456593
0.843317053722715
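To make the backoff idea concrete, here is a simplified plain-Python sketch of a default, unigram and bigram tagger chained together. It is not NLTK's implementation: the class names (ConstantTagger, ToyUnigramTagger, ToyBigramTagger), the tag_word method and the toy training sentences are all hypothetical, and details such as NLTK's model pruning are left out.

```python
from collections import Counter, defaultdict

class ConstantTagger:
    """Last resort: always answers with the same tag."""
    def __init__(self, tag):
        self.tag = tag

    def tag_word(self, word, prev_tag):
        return self.tag

class ToyUnigramTagger:
    """Picks the tag seen most often for a word; otherwise asks the backoff."""
    def __init__(self, train_sents, backoff):
        counts = defaultdict(Counter)
        for sent in train_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        self.best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word, prev_tag):
        if word in self.best:
            return self.best[word]
        return self.backoff.tag_word(word, prev_tag)

class ToyBigramTagger:
    """Context is (previous tag, current word); unseen contexts go to the backoff."""
    def __init__(self, train_sents, backoff):
        counts = defaultdict(Counter)
        for sent in train_sents:
            prev = None  # no previous tag at the start of a sentence
            for word, tag in sent:
                counts[(prev, word)][tag] += 1
                prev = tag
        self.best = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word, prev_tag):
        if (prev_tag, word) in self.best:
            return self.best[(prev_tag, word)]
        return self.backoff.tag_word(word, prev_tag)

def tag_sentence(tagger, words):
    """Tag left to right, feeding each predicted tag into the next step."""
    tagged, prev = [], None
    for word in words:
        tag = tagger.tag_word(word, prev)
        tagged.append((word, tag))
        prev = tag
    return tagged

# Hypothetical training data in which "run" is ambiguous.
train = [
    [("the", "AT"), ("run", "NN"), ("ended", "VBD")],
    [("they", "PPSS"), ("run", "VB"), ("daily", "RB")],
    [("the", "AT"), ("run", "NN"), ("began", "VBD")],
]

default = ConstantTagger("NN")
unigram = ToyUnigramTagger(train, backoff=default)
bigram = ToyBigramTagger(train, backoff=unigram)

print(tag_sentence(unigram, ["they", "run", "far"]))
print(tag_sentence(bigram, ["they", "run", "far"]))
```

In this toy setup the unigram tagger tags "run" as NN (its most frequent tag), while the bigram tagger uses the preceding PPSS tag to choose VB. The unseen word "far" falls through the whole chain to the default NN.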
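3. Regular-expression tagger
For example, words ending in -able are usually adjectives and words ending in -ly are usually adverbs. Based on such word-formation rules, a regular expression can describe the general form of a class of words, which are then all given the same tag. The patterns are tried in order, and the first one that matches determines the tag.
from nltk.tag.sequential import RegexpTagger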
regexp_tagger = RegexpTagger(
[(r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers (the unescaped . matches any character)
(r'(The|the|A|a|An|an)$', 'AT'), # articles
(r'.*able$', 'JJ'), # adjectives
(r'.*ness$', 'NN'), # nouns formed from adjectives
(r'.*ly$', 'RB'), # adverbs
(r'.*s$', 'NNS'), # plural nouns
(r'.*ing$', 'VBG'), # gerunds
(r'.*ed$', 'VBD'), # past tense verbs
(r'.*', 'NN') # nouns (default)
]) # the r prefix makes a raw string, so backslashes are not treated as escapes; common for regular expressions
print(regexp_tagger.evaluate(test_data))
result:
0.31306687929831556
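The low score is easier to understand once the matching rule is spelled out. Below is a small sketch using only Python's re module that mimics the "first matching pattern wins" behavior. The function name regexp_tag and the sample tokens are hypothetical, the pattern list is shortened, and unlike the listing above this sketch escapes the dot in the number pattern.

```python
import re

# Ordered (pattern, tag) rules; earlier rules take priority.
PATTERNS = [
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*able$', 'JJ'),                # adjectives
    (r'.*ly$', 'RB'),                  # adverbs
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'.*', 'NN'),                     # catch-all default
]

def regexp_tag(tokens, patterns=PATTERNS):
    """Give each token the tag of the first pattern that matches it."""
    compiled = [(re.compile(p), tag) for p, tag in patterns]
    tagged = []
    for token in tokens:
        chosen = None  # stays None if no pattern matches
        for regex, tag in compiled:
            if regex.match(token):  # match() anchors at the start of the token
                chosen = tag
                break
        tagged.append((token, chosen))
    return tagged

print(regexp_tag(["3.14", "quickly", "dogs", "table", "is", "cat"]))
```

Here "table" comes out as JJ and "is" as NNS, because suffix rules know nothing about the word itself. Mistakes of this kind are one reason a purely pattern-based tagger scores far below the n-gram taggers.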
Using a regular-expression tagger to tag dates and the $ character:
date_tagger = RegexpTagger([
(r'(\d{2})[/.-](\d{2})[/.-](\d{4})$','DATE'),
(r'\$','MONEY')
])
test = 'I will be coming on sat 10-02-2014 with around 10 $ '.split()
date_tagger.tag(test)
result:
[('I', None),
('will', None),
('be', None),
('coming', None),
('on', None),
('sat', None),
('10-02-2014', 'DATE'),
('with', None),
('around', None),
('10', None),
('$', 'MONEY')]
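Tokens that match neither pattern, such as '10', get None because there is no catch-all rule and no backoff. The two patterns can also be tried directly with Python's re module to see what they accept. This is only a sketch: the helper names is_date and is_money are hypothetical, and match() is used on the assumption that the tagger anchors each pattern at the start of the token.

```python
import re

DATE = re.compile(r'(\d{2})[/.-](\d{2})[/.-](\d{4})$')
MONEY = re.compile(r'\$')

def is_date(token):
    """True if the token looks like NN-NN-NNNN with /, . or - as separators."""
    return DATE.match(token) is not None

def is_money(token):
    """True if the token starts with a dollar sign."""
    return MONEY.match(token) is not None

for token in ["10-02-2014", "10/02/2014", "2014-02-10", "1-2-2014", "$", "$10", "10$"]:
    print(token, is_date(token), is_money(token))

# The three captured groups are available for further processing.
print(DATE.match("10-02-2014").groups())
```

Note that the date pattern only accepts two-digit fields followed by a four-digit year, so '2014-02-10' and '1-2-2014' are rejected, and that the money pattern also accepts tokens such as '$10' because it only checks the first character.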
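# rebuild the n-gram chain, this time with regexp_tagger instead of default_tagger as the final backoff
unigram_tagger = UnigramTagger(train_data, backoff=regexp_tagger)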
print(unigram_tagger.evaluate(test_data))
bigram_tagger= BigramTagger(train_data, backoff = unigram_tagger)
print(bigram_tagger.evaluate(test_data))
trigram_tagger=TrigramTagger(train_data,backoff = bigram_tagger)
print(trigram_tagger.evaluate(test_data))
result:
0.8657430479417921
0.8755108143127679
0.8730190371773149