Learning NLTK (Part 2)

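Part-of-speech (POS) tagging is a way of analyzing the structure of a sentence by identifying the part of speech of each word. The table below briefly lists what some common POS tags mean; for details, see nltk.help.brown_tagset().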

Tag	Part of speech	Examples
ADJ	adjective	new, good, high, special, big, local
ADV	adverb	really, already, still, early, now
CONJ	conjunction	and, or, but, if, while, although
DET	determiner	the, a, some, most, every, no
EX	existential "there"	there, there's
MOD	modal verb	will, can, would, may, must, should
NN	noun	year, home, costs, time
NNP	proper noun	April, China, Washington
NUM	numeral	fourth, 2016, 09:30
PRON	pronoun	he, they, us
P	preposition	on, over, with, of
TO	the word "to"	to
UH	interjection	ah, ha, oops
VB	verb	
VBD	past-tense verb	made, said, went
VBG	present participle	going, lying, playing
VBN	past participle	taken, given, gone
WH	wh-determiner	who, where, when, what

POS tagging with NLTK:

Example:

import nltk

sent = 'I am going to Wuhan University now'
tokens = nltk.word_tokenize(sent)  # split the sentence into word tokens

tagged_sent = nltk.pos_tag(tokens)  # tag each token with NLTK's pretrained tagger
print(tagged_sent)

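Output (nltk.pos_tag uses Penn Treebank tags, for example PRP for a personal pronoun and RB for an adverb):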

[('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('Wuhan', 'NNP'), ('University', 'NNP'), ('now', 'RB')]

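Tagged data in corpora

The corpus reader classes provide the following methods for returning pre-tagged data.

Method	Description
tagged_words(fileids,categories)	Returns the tagged data as a list of (word, tag) pairs
tagged_sents(fileids,categories)	Returns the tagged data as a list of sentences
tagged_paras(fileids,categories)	Returns the tagged data as a list of paragraphs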
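
To make the three return shapes concrete, here is a small sketch that uses hand-written data in place of a real corpus reader. The sample sentences and tags are invented for illustration; only the nesting (paragraphs contain sentences, sentences contain (word, tag) pairs) is meant to mirror what the methods above return.

```python
# Hand-written stand-in for the shape of tagged_paras():
# a list of paragraphs, each a list of sentences, each a list of (word, tag) pairs.
tagged_paras = [
    [
        [("The", "AT"), ("jury", "NN"), ("said", "VBD")],
        [("It", "PPS"), ("agreed", "VBD")],
    ],
    [
        [("Nobody", "PN"), ("objected", "VBD")],
    ],
]

# Flattening one level gives the shape of tagged_sents().
tagged_sents = [sent for para in tagged_paras for sent in para]

# Flattening once more gives the shape of tagged_words().
tagged_words = [pair for sent in tagged_sents for pair in sent]

print(len(tagged_paras), len(tagged_sents), len(tagged_words))
print(tagged_words[:3])
```

With a real corpus such as brown, the same nesting applies, just with far more data.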

Taggers:

1. Default tagger

The simplest POS tagger tags every word as a noun (NN). Its accuracy is low, so it has little value on its own.

Example:

import nltk
from nltk.corpus import brown

default_tagger = nltk.DefaultTagger('NN')
sents = 'I am going to Wuhan University now.'
# tag() expects a list of tokens; passing the raw string would tag each character
tokens = nltk.word_tokenize(sents)
print(default_tagger.tag(tokens))

tagged_sent = brown.tagged_sents(categories='news')
print(default_tagger.evaluate(tagged_sent))

Output:

[('I', 'NN'), ('am', 'NN'), ('going', 'NN'), ('to', 'NN'), ('Wuhan', 'NN'), ('University', 'NN'), ('now', 'NN'), ('.', 'NN')]
0.13089484257215028

As the result shows, tagging every word as NN gives low accuracy: only about 13.1%.
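
The score printed by evaluate() is token-level accuracy: the fraction of tokens whose predicted tag equals the gold tag. The sketch below reproduces that calculation in plain Python on a hypothetical five-word gold sentence. The helper names tag_all_as and accuracy are invented here and are not NLTK functions. (Newer NLTK releases appear to expose the same score as accuracy() and deprecate evaluate(); check the version you have installed.)

```python
def tag_all_as(tokens, tag="NN"):
    """Mimic a default tagger: give every token the same tag."""
    return [(token, tag) for token in tokens]


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(1 for (_, p), (_, g) in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Hypothetical gold-standard sentence (Brown-style tags, chosen for illustration).
gold = [("the", "AT"), ("dog", "NN"), ("chased", "VBD"), ("a", "AT"), ("cat", "NN")]
tokens = [word for word, _ in gold]

predicted = tag_all_as(tokens)
score = accuracy(predicted, gold)
print(score)  # only the two nouns out of five tokens are tagged correctly
```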

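2. Rule-based tagger

Rules can improve a tagger's accuracy: for example, tag words ending in "ing" as VBG and words ending in "ed" as VBD. This can be implemented with a regular-expression tagger.

Example: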

import nltk
from nltk.corpus import brown

pattern = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*\'s$', 'NN$'),
    (r'.*s$', 'NNS'),
    (r'.*', 'NN'),  # anything unmatched is still tagged NN
]
sents = 'I am going to Wuhan University.'

tagger = nltk.RegexpTagger(pattern)

print(tagger.tag(nltk.word_tokenize(sents)))

tagged_sents = brown.tagged_sents(categories='news')
print(tagger.evaluate(tagged_sents))

Output:

[('I', 'NN'), ('am', 'NN'), ('going', 'VBG'), ('to', 'NN'), ('Wuhan', 'NN'), ('University', 'NN'), ('.', 'NN')]
0.1875310778288283

Compared with the default tagger, the rule-based tagger is more accurate (about 18.8% versus 13.1%).
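
The rules are tried in order and the first match wins, which is why the catch-all pattern comes last. The following is a minimal sketch of that mechanism written with Python's re module rather than nltk.RegexpTagger; the function regex_tag and the test words are made up for illustration.

```python
import re

# Same rules as in the RegexpTagger example above; order matters.
PATTERNS = [
    (r".*ing$", "VBG"),
    (r".*ed$", "VBD"),
    (r".*es$", "VBZ"),
    (r".*'s$", "NN$"),
    (r".*s$", "NNS"),
    (r".*", "NN"),  # catch-all: must stay last
]
COMPILED = [(re.compile(pattern), tag) for pattern, tag in PATTERNS]


def regex_tag(tokens):
    """Tag each token with the tag of the first pattern that matches it."""
    tagged = []
    for token in tokens:
        for regex, tag in COMPILED:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged


result = regex_tag(["going", "walked", "goes", "John's", "cars", "need", "Wuhan"])
print(result)
```

Note that "need" is tagged VBD simply because it ends in "ed". Suffix rules like these are crude, which helps explain why the accuracy above is still low.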

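3. Lookup-table tagger

Count the tags of some high-frequency words, for example the 100 most frequent words, and record the most common tag for each. Tagging with this per-word statistical knowledge is the idea behind the Unigram model.

Example: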

import nltk
from nltk.corpus import brown

# utf-8-sig strips a leading byte-order mark (BOM) if the file has one
sents = open('nothing_gonna_change_my_love_for_you.txt', encoding='utf-8-sig').read()
fdist = nltk.FreqDist(brown.words(categories='news'))
common_word = fdist.most_common(100)  # the 100 most frequent words

cfdist = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# lookup table: each frequent word -> its most frequent tag
table = dict((word, cfdist[word].max()) for (word, _) in common_word)

uni_tagger = nltk.UnigramTagger(model=table, backoff=nltk.DefaultTagger('NN'))
print(uni_tagger.tag(nltk.word_tokenize(sents)))
tagged_sents = brown.tagged_sents(categories='news')
print(uni_tagger.evaluate(tagged_sents))

Output:

[('If', 'NN'), ('I', 'PPSS'), ('had', 'HVD'), ('to', 'TO'), ('live', 'NN'), ('my', 'NN'), ('life', 'NN'), ('without', 'NN'), ('you', 'NN'), ('near', 'NN'), ('me', 'NN'), ('The', 'AT'), ('days', 'NN'), ('would', 'MD'), ('all', 'ABN'), ('be', 'BE'), ('empty', 'NN'), ..., ('change', 'NN'), ('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN')]
0.5817769556656125

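Using statistics for only the 100 most frequent words already gives about 58% accuracy. Increasing the number of words raises accuracy further, reaching about 90% at 8,000 words. Here, any word outside the 100-word table falls back to the default tagger.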
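
The lookup-table idea can be sketched without NLTK: count the words, keep the most frequent ones, and map each to its most common tag. The training data and the helper names build_lookup_table and lookup_tag below are invented for illustration; this is a simplified stand-in, not how nltk.UnigramTagger is implemented.

```python
from collections import Counter, defaultdict

# Hypothetical tagged training words (Brown-style tags, invented for illustration).
tagged_words = [
    ("the", "AT"), ("race", "NN"), ("to", "TO"), ("race", "VB"),
    ("the", "AT"), ("race", "NN"), ("to", "TO"), ("win", "VB"),
    ("the", "AT"), ("race", "NN"),
]


def build_lookup_table(tagged_words, top_n):
    """Map each of the top_n most frequent words to its most common tag."""
    word_counts = Counter(word for word, _ in tagged_words)
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word][tag] += 1
    return {
        word: tag_counts[word].most_common(1)[0][0]
        for word, _ in word_counts.most_common(top_n)
    }


def lookup_tag(tokens, table, default="NN"):
    """Tag tokens from the table, falling back to a default tag."""
    return [(token, table.get(token, default)) for token in tokens]


table = build_lookup_table(tagged_words, top_n=3)
tagged = lookup_tag(["the", "race", "to", "win"], table)
print(table)
print(tagged)
```

In this toy data "win" is outside the top three words, so it falls back to NN even though it was a verb in training. That is the kind of error a larger table reduces.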

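Training N-gram taggers:

General N-gram taggers:

The Unigram tagger is a 1-gram tagger. Taking more context into account gives 2-gram and 3-gram taggers, collectively called N-gram taggers. Note that a longer context does not necessarily improve accuracy: longer contexts occur less often in the training data, so more of the contexts met at tagging time have never been seen.
Besides supplying an N-gram tagger with a lookup-table model, another way to build one is by training. The N-gram tagger constructor is __init__(train=None, model=None, backoff=None), so tagged corpus data can be passed as training data to build a tagger.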
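
A tiny sketch of why longer context can hurt when used alone: a BigramTagger trained on two hand-tagged sentences (invented here for illustration) tags a sentence it has seen, but a single unknown word makes it return None for that word and for every word after it, because each later context now contains a None tag.

```python
import nltk

# Tiny hand-tagged training set (invented for illustration).
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

unigram = nltk.UnigramTagger(train=train_sents)
bigram = nltk.BigramTagger(train=train_sents)

seen = bigram.tag(["the", "dog", "barks"])         # every context was seen in training
unseen = bigram.tag(["a", "dog", "barks"])         # "a" is unknown, so later contexts are unseen too
unigram_unseen = unigram.tag(["a", "dog", "barks"])  # a unigram tagger only loses the unknown word

print(seen)
print(unseen)
print(unigram_unseen)
```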

Example:

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[0: train_num]
x_test = brown_tagged_sents[train_num:]
tagger = nltk.UnigramTagger(train=x_train)
print(tagger.evaluate(x_test))

Output:

0.8121200039868434

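For UnigramTagger, training on 90% of the data and testing on the remaining 10% gives an accuracy of about 81%.

Combining taggers

With the backoff parameter, several taggers can be chained together to improve accuracy: when a tagger cannot tag a token, it passes that token on to its backoff tagger.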
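
Here is a minimal sketch of a backoff chain on a toy training set (the sentences and tags are invented for illustration). The bigram tagger handles the context it has learned, the unigram tagger handles known words, and the default tagger catches everything else.

```python
import nltk

# Toy training data: "race" is usually a noun, but a verb after "to".
train_sents = [
    [("the", "DT"), ("race", "NN"), ("ended", "VBD")],
    [("they", "PRP"), ("want", "VBP"), ("to", "TO"), ("race", "VB")],
    [("the", "DT"), ("race", "NN"), ("began", "VBD")],
]

t0 = nltk.DefaultTagger("NN")                     # last resort: everything is NN
t1 = nltk.UnigramTagger(train_sents, backoff=t0)  # most frequent tag per word
t2 = nltk.BigramTagger(train_sents, backoff=t1)   # also uses the previous tag

tokens = ["we", "want", "to", "race"]
unigram_result = t1.tag(tokens)
bigram_result = t2.tag(tokens)
print(unigram_result)
print(bigram_result)
```

The unknown word "we" falls through to the default tagger, while "race" after "to" is tagged VB by the bigram tagger instead of the unigram tagger's NN.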

Example:

import nltk
from nltk.corpus import brown

pattern = [
    (r'.*ing$','VBG'),
    (r'.*ed$','VBD'),
    (r'.*es$','VBZ'),
    (r'.*\'s$','NN$'),
    (r'.*s$','NNS'),
    (r'.*', 'NN')  # anything unmatched is still tagged NN
]

brown_tagged_sents = brown.tagged_sents(categories='news')
train_num = int(len(brown_tagged_sents) * 0.9)

x_train = brown_tagged_sents[0: train_num]
x_test = brown_tagged_sents[train_num:]
t0 = nltk.RegexpTagger(pattern)
t1 = nltk.UnigramTagger(x_train, backoff=t0)
t2 = nltk.BigramTagger(x_train, backoff=t1)
print(t2.evaluate(x_test))

Output:

0.8627529153792485

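As the result shows (about 86% on the held-out data), statistics alone, with no linguistic knowledge, are enough to make POS tagging work reasonably well.
For Chinese, as long as a tagged corpus is available, an N-gram tagger can be trained with the same procedure.

nltk.tag.BrillTagger implements transformation-based tagging: it takes the output of a base tagger and applies rule-based corrections to it, achieving higher accuracy.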
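
To illustrate what "rule-based correction" means here, the sketch below applies one hand-written transformation rule to the output of a hypothetical baseline tagger. It is not NLTK's implementation: the function apply_rule is invented for this example, and a real Brill tagger learns an ordered list of such rules from training data rather than having them written by hand.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding token has prev_tag."""
    corrected = list(tagged)
    for i in range(1, len(corrected)):
        word, tag = corrected[i]
        if tag == from_tag and corrected[i - 1][1] == prev_tag:
            corrected[i] = (word, to_tag)
    return corrected


# Output of a hypothetical baseline tagger that mis-tags the verb "race" as a noun.
baseline = [("they", "PRP"), ("want", "VBP"), ("to", "TO"), ("race", "NN")]

# One transformation rule: NN becomes VB when the previous tag is TO.
fixed = apply_rule(baseline, from_tag="NN", to_tag="VB", prev_tag="TO")
print(fixed)
```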

Training a Chinese tagger

Example:

import nltk
import json

# Each line holds space-separated word/tag items and is treated as one sentence
with open('199801.txt', encoding='utf-8') as f:
    lines = f.readlines()
all_tagged_sents = []

for line in lines:
    sent = line.split()
    tagged_sent = []
    for item in sent:
        pair = nltk.str2tuple(item)  # 'word/tag' -> (word, TAG); the tag is uppercased
        tagged_sent.append(pair)

    if len(tagged_sent) > 0:
        all_tagged_sents.append(tagged_sent)

train_size = int(len(all_tagged_sents) * 0.8)
x_train = all_tagged_sents[: train_size]
x_test = all_tagged_sents[train_size:]
tagger = nltk.UnigramTagger(train=x_train, backoff=nltk.DefaultTagger('n'))

tokens = nltk.word_tokenize(u'我 认为 不丹 的 被动 卷入 不 构成 此次 对峙 的 主要 因素。')
tagged = tagger.tag(tokens)
print(json.dumps(tagged, ensure_ascii=False))
print(tagger.evaluate(x_test))

Output:

[["我", "R"], ["认为", "V"], ["不丹", "n"], ["的", "U"], ["被动", "A"], ["卷入", "V"], ["不", "D"], ["构成", "V"], ["此次", "R"], ["对峙", "V"], ["的", "U"], ["主要", "B"], ["因素。", "n"]]
0.8714095491725319
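
The parsing step in the example relies on nltk.str2tuple, which splits a word/tag string at the last slash and uppercases the tag. The short sketch below shows this on a hypothetical line in that format (the line itself is made up, not taken from the corpus file).

```python
import nltk

# Hypothetical line in word/tag format (not taken from the actual corpus file).
line = "我/r 认为/v 不丹/ns 的/u"

pairs = [nltk.str2tuple(item) for item in line.split()]
print(pairs)

# An item without a slash gets None as its tag.
no_tag = nltk.str2tuple("因素")
print(no_tag)
```

This explains why the tags in the output above are uppercase (R, V, U) while the backoff tag is a lowercase n. Because DefaultTagger('n') emits a lowercase tag, it would not match an uppercase N in gold data parsed this way, so using DefaultTagger('N') may be worth trying.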
