NLTK (Part 2)

1. collocations module

Counts how often n words occur together within each window of window_size words in a sequence of words.

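from nltk.collocations import BigramCollocationFinder

sent = 'this this is is a a test test'.split()

# With window_size=2, each word is paired only with its immediate successor
b = BigramCollocationFinder.from_words(sent, window_size=2)

# ngram_fd is a FreqDist mapping each (w1, w2) pair to its count
print(list(b.ngram_fd.items()))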


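BigramCollocationFinder counts co-occurrences of two words (bigrams).

TrigramCollocationFinder counts co-occurrences of three words (trigrams).

QuadgramCollocationFinder counts co-occurrences of four words (quadgrams).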
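
To make the window_size idea concrete, here is a minimal pure-Python sketch of windowed pair counting. It illustrates the concept only and is not NLTK's own implementation; the helper name count_windowed_pairs is hypothetical.

```python
from collections import Counter


def count_windowed_pairs(words, window_size=2):
    """Count (w1, w2) pairs where w2 follows w1 within window_size words.

    Hypothetical helper for illustration; not part of NLTK.
    """
    if window_size < 2:
        raise ValueError("window_size must be at least 2")
    counts = Counter()
    for i, w1 in enumerate(words):
        # Pair w1 with each of the next (window_size - 1) words
        for w2 in words[i + 1:i + window_size]:
            counts[(w1, w2)] += 1
    return counts


sent = 'this this is is a a test test'.split()
adjacent = count_windowed_pairs(sent, window_size=2)
wider = count_windowed_pairs(sent, window_size=3)
print(adjacent[('this', 'is')])  # 1
print(wider[('this', 'is')])     # 3
```

Widening the window raises the count for ('this', 'is') because each "this" can now pair with an "is" that is up to two positions away.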

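2. data module

Manages the path information for NLTK data packages.

nltk.data.path is the list of directories that NLTK searches for data packages.

nltk.data.PathPointer is the abstract base class for path pointers.

Its main subclasses are FileSystemPathPointer, for ordinary files and directories, and ZipFilePathPointer, for entries inside zip archives. Gzip-compressed files are handled by GzipFileSystemPathPointer, which extends FileSystemPathPointer. BufferedGzipFile, found in older NLTK releases, is a wrapper around GzipFile rather than a PathPointer subclass.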
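
As a small sketch of how the search path and a path pointer might be used: the temporary directory below is only a stand-in for a real data directory (an assumption for illustration), and the example does not download or load any data package.

```python
import os
import tempfile

import nltk.data
from nltk.data import FileSystemPathPointer, PathPointer

# Hypothetical directory standing in for a custom NLTK data location
custom_dir = tempfile.mkdtemp()

# nltk.data.path is an ordinary list, so a search directory can be appended
if custom_dir not in nltk.data.path:
    nltk.data.path.append(custom_dir)

# FileSystemPathPointer wraps a path that already exists on disk
pointer = FileSystemPathPointer(custom_dir)
print(isinstance(pointer, PathPointer))
print(pointer.path)
```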

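`3.featstruct` module

Represents feature structures, which behave much like dict and list.

Feature represents a single feature identifier; it has a name and an optional default value.

It has two subclasses, SlashFeature and RangeFeature.

A FeatStruct holds a collection of features.

It has two subclasses, FeatDict and FeatList.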

from nltk.featstruct import FeatStruct
FeatStruct('[a=?x]').unify(FeatStruct('[b=?x]'))
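
Unification is easier to follow with concrete values. The sketch below merges two compatible feature structures and then tries two conflicting ones; the feature names number and person are chosen only for illustration.

```python
from nltk.featstruct import FeatStruct

fs1 = FeatStruct(number='singular')
fs2 = FeatStruct(person=3)

# Compatible structures merge into one that carries both features
merged = fs1.unify(fs2)
print(merged)

# Conflicting values for the same feature cannot unify; the result is None
conflict = fs1.unify(FeatStruct(number='plural'))
print(conflict)
```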

`4.grammar` module

Used to define and work with custom grammars.

from nltk import CFG, RecursiveDescentParser

# Define a small context-free grammar
grammar = CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)
sent = 'Mary saw Bob'.split()
rd_parser = RecursiveDescentParser(grammar)
# Print every parse tree the parser finds for the sentence
for tree in rd_parser.parse(sent):
    print(tree)
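
The same grammar also admits ambiguous sentences. As a sketch, the example below swaps in ChartParser (a different parser from the one used above) and a longer sentence in which the prepositional phrase can attach either to the verb phrase or to the noun phrase.

```python
from nltk import CFG, ChartParser

grammar = CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)

sent = 'the dog saw a man in the park'.split()
parser = ChartParser(grammar)

# Collect all parses; "in the park" can modify "saw" or "a man"
trees = list(parser.parse(sent))
print(len(trees))
for tree in trees:
    print(tree)
```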


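`5.probability` module

Mainly provides frequency distributions (FreqDist), conditional frequency distributions (ConditionalFreqDist), and probability distributions such as ELEProbDist.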

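from nltk.probability import ConditionalFreqDist
from nltk.tokenize import word_tokenize
# word_tokenize needs NLTK's Punkt tokenizer data to be installed
sent = "the the the dog dog some other words that we do not care about"
cfdist = ConditionalFreqDist()
for word in word_tokenize(sent):
    print(word)
    # Use the word length as the condition and count each word under it
    condition = len(word)
    cfdist[condition][word] += 1
# Equivalent: build the distribution directly from (condition, sample) pairs
cfdist2 = ConditionalFreqDist((len(word), word) for word in word_tokenize(sent))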
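
The example above covers only ConditionalFreqDist. As a minimal sketch of the other two pieces, the code below builds a plain FreqDist and an ELEProbDist on top of it. It splits on whitespace instead of calling word_tokenize so that no tokenizer data is required; the sample sentence is made up for illustration.

```python
from nltk.probability import ELEProbDist, FreqDist

words = 'the the the dog dog cat'.split()

fdist = FreqDist(words)
print(fdist['the'])       # raw count of "the"
print(fdist.N())          # total number of samples
print(fdist.freq('the'))  # relative frequency of "the"

# Expected likelihood estimation adds 0.5 to every count before normalizing,
# so frequent words get a slightly lower probability than their raw frequency
eledist = ELEProbDist(fdist)
print(eledist.prob('the'))
```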


`6.text` module

Used to process text; it mainly provides word search, word splitting, and text wrapper classes.

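import nltk.corpus
from nltk.text import TextCollection
# nltk.book loads its sample texts on import, so the corpora must be installed
from nltk.book import text1, text2, text3

# Build a collection from a corpus reader
gutenberg = TextCollection(nltk.corpus.gutenberg)
# Build a collection from a list of Text objects
mytexts = TextCollection([text1, text2, text3])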
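
For a version that runs without any downloaded corpus, here is a small sketch that wraps two made-up token lists in Text objects and queries them through a TextCollection. The sentences are invented for illustration.

```python
import math

from nltk.text import Text, TextCollection

doc1 = Text('the dog chased the cat'.split())
doc2 = Text('the bird sang'.split())
collection = TextCollection([doc1, doc2])

# Text.count returns how often a token occurs in one text
print(doc1.count('the'))

# Term frequency of "dog" within doc1: 1 occurrence out of 5 tokens
tf = collection.tf('dog', doc1)
# Inverse document frequency: "dog" appears in 1 of the 2 texts
idf = collection.idf('dog')
print(tf, idf, collection.tf_idf('dog', doc1))
```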


`7.tree` module

Used to build and print syntax trees.
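
A minimal sketch of building and printing a tree, assuming the bracketed string below as input (it mirrors the "Mary saw Bob" sentence used earlier):

```python
from nltk.tree import Tree

# Build a tree from bracketed notation
tree = Tree.fromstring('(S (NP Mary) (VP (V saw) (NP Bob)))')

print(tree.label())   # label of the root node
print(tree.leaves())  # the words at the leaves, left to right
print(tree.height())  # number of levels, counting the leaves

# Draw the tree as ASCII art on standard output
tree.pretty_print()
```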
