nltk (Part 2)

1. The collocations module

Counts how often n words occur together within each window of window_size words in a sequence of words.

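from nltk.collocations import BigramCollocationFinder

sent = 'this this is is a a test test'.split()

# Count every pair of words that falls inside a sliding window of 2 tokens
b = BigramCollocationFinder.from_words(sent, window_size=2)

# ngram_fd is a FreqDist mapping each (w1, w2) pair to its count
print(list(b.ngram_fd.items()))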
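
To see what window_size changes, here is a minimal sketch that compares a window of 2 with a window of 3. The three-word list is a toy input made up for illustration. With the wider window, words that are not adjacent are paired as well.

```python
from nltk.collocations import BigramCollocationFinder

words = 'a b c'.split()  # toy input, chosen only for illustration

# window_size=2: only adjacent words are paired
narrow = BigramCollocationFinder.from_words(words, window_size=2)
# window_size=3: each word is paired with the next two words
wide = BigramCollocationFinder.from_words(words, window_size=3)

narrow_pairs = sorted(narrow.ngram_fd)
wide_pairs = sorted(wide.ngram_fd)
print(narrow_pairs)
print(wide_pairs)
```

The pair ('a', 'c') shows up only in the second result, because 'a' and 'c' are two positions apart.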

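BigramCollocationFinder counts co-occurrences of two words (bigrams).

TrigramCollocationFinder counts co-occurrences of three words (trigrams).

QuadgramCollocationFinder counts co-occurrences of four words (quadgrams).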
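
The trigram finder is used the same way as the bigram finder. The sketch below uses a made-up six-word list and the default window size, and reads the counts from ngram_fd.

```python
from nltk.collocations import TrigramCollocationFinder

words = 'a b c a b c'.split()  # toy input, chosen only for illustration

# With the default window_size=3, each run of three consecutive words is counted
finder = TrigramCollocationFinder.from_words(words)

abc_count = finder.ngram_fd[('a', 'b', 'c')]
top = finder.ngram_fd.most_common(1)
print(abc_count)
print(top)
```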

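2. The data module

Manages the paths where NLTK looks for its data packages.

nltk.data.path returns the list of directories that are searched for data packages.

nltk.data.PathPointer is the abstract base class for path pointers.

Its subclasses include FileSystemPathPointer for regular files, GzipFileSystemPathPointer for gzip-compressed files, and ZipFilePathPointer for entries inside zip archives. BufferedGzipFile, in the NLTK versions that ship it, is a gzip file wrapper rather than a PathPointer subclass.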
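
A minimal sketch of both pieces follows. It assumes a temporary directory standing in for a real data directory; the directory and the sample.txt file are made up for this example.

```python
import os
import tempfile

import nltk.data
from nltk.data import FileSystemPathPointer

# nltk.data.path is a plain list, so extra search directories can be appended
extra_dir = tempfile.mkdtemp()  # hypothetical data directory for this sketch
nltk.data.path.append(extra_dir)

# Create a small file and wrap its location in a path pointer
file_path = os.path.join(extra_dir, 'sample.txt')
with open(file_path, 'w', encoding='utf-8') as f:
    f.write('hello nltk')

pointer = FileSystemPathPointer(file_path)
with pointer.open() as stream:  # without an encoding, the stream is binary
    content = stream.read()

print(extra_dir in nltk.data.path)
print(content)
```

Note that FileSystemPathPointer raises an error if the path does not exist, so the file is created before the pointer.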

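3. The featstruct module

Represents feature structures, which behave much like dict and list.

Feature identifies a single feature; it has a name and an optional default value.

It has two subclasses, SlashFeature and RangeFeature.

FeatStruct holds a collection of features.

It has two subclasses, FeatDict and FeatList.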

from nltk.featstruct import FeatStruct
FeatStruct('[a=?x]').unify(FeatStruct('[b=?x]'))
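
The dict-like behaviour and unification can be sketched as below. The feature names number and person are illustrative only. Unifying two compatible structures merges them, while a clash returns None.

```python
from nltk.featstruct import FeatDict, FeatStruct

# Keyword arguments build a dict-like feature structure
fs = FeatStruct(number='sing')
is_dict_like = isinstance(fs, FeatDict)

# Compatible structures are merged by unify()
merged = fs.unify(FeatStruct(person=3))

# Conflicting values cannot be unified, so the result is None
clash = fs.unify(FeatStruct(number='plur'))

print(is_dict_like)
print(merged)
print(clash)
```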

4. The grammar module

Used to define and work with custom grammars.

import nltk
from nltk import CFG

grammar = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)
# Split the sentence into tokens and print every parse the grammar allows
sent = 'Mary saw Bob'.split()
rd_parser = nltk.RecursiveDescentParser(grammar)
for tree in rd_parser.parse(sent):
    print(tree)
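
The same grammar can also be inspected, and it is ambiguous for some sentences. The sketch below reuses the grammar from above; the test sentence is my own choice. The prepositional phrase can attach either to the verb phrase or to the noun phrase, so two trees come back.

```python
from nltk.grammar import CFG, Nonterminal
from nltk.parse import RecursiveDescentParser

grammar = CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)

# The start symbol is the left-hand side of the first rule
start = grammar.start()
# Productions can be filtered by their left-hand side
pp_rules = grammar.productions(lhs=Nonterminal('PP'))

# "with a telescope" can modify "saw" or "the man"
tokens = 'Mary saw the man with a telescope'.split()
trees = list(RecursiveDescentParser(grammar).parse(tokens))

print(start, pp_rules)
for tree in trees:
    print(tree)
```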

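5. The probability module

Mainly provides frequency distributions (FreqDist), conditional frequency distributions (ConditionalFreqDist), and probability distributions (such as ELEProbDist).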

from nltk.probability import ConditionalFreqDist
from nltk.tokenize import word_tokenize  # needs the Punkt tokenizer data

sent = "the the the dog dog some other words that we do not care about"

# Build the distribution step by step, using the word length as the condition
cfdist = ConditionalFreqDist()
for word in word_tokenize(sent):
    condition = len(word)
    cfdist[condition][word] += 1

# Equivalent: build it in one step from (condition, sample) pairs
cfdist2 = ConditionalFreqDist((len(word), word) for word in word_tokenize(sent))
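
FreqDist and ELEProbDist can be sketched in the same way. The word list below is a toy example, split with str.split() so that no tokenizer data is needed. ELEProbDist is the expected likelihood estimate, which adds 0.5 to every count, so unseen words still get a small probability.

```python
from nltk.probability import ELEProbDist, FreqDist

words = 'the the the dog dog cat'.split()  # toy input for illustration

# FreqDist counts how often each sample occurs
fd = FreqDist(words)
the_count = fd['the']
the_freq = fd.freq('the')  # relative frequency: count / total

# ELEProbDist smooths the counts: (count + 0.5) / (N + 0.5 * bins)
# Here N = 6 tokens and bins defaults to the 3 distinct words
pd = ELEProbDist(fd)
p_the = pd.prob('the')
p_unseen = pd.prob('bird')

print(the_count, the_freq)
print(p_the, p_unseen)
```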

6. The text module

Used to process text; it mainly covers word lookup, word splitting, and text wrapper classes such as TextCollection.

import nltk.corpus
from nltk.text import TextCollection
from nltk.book import text1, text2, text3

gutenberg = TextCollection(nltk.corpus.gutenberg)
mytexts = TextCollection([text1, text2, text3])
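
The example above needs the corpus data to be downloaded. As a sketch that runs without any corpus, the wrappers can also be built from plain token lists. The two short documents below are made up for illustration.

```python
import math

from nltk.text import Text, TextCollection

# Two tiny documents, already split into tokens
doc1 = Text('the dog barks'.split())
doc2 = Text('the cat sleeps'.split())

# Text wraps a token list and offers simple lookups
dog_count = doc1.count('dog')
dog_position = doc1.index('dog')

# TextCollection adds term weighting across several texts
collection = TextCollection([doc1, doc2])
tf_dog = collection.tf('dog', doc1)  # share of doc1's tokens that are 'dog'
idf_dog = collection.idf('dog')      # 'dog' appears in 1 of 2 texts
idf_the = collection.idf('the')      # 'the' appears in every text

print(dog_count, dog_position)
print(tf_dog, idf_dog, idf_the)
```

A word that occurs in every text gets an idf of 0, so it contributes nothing to a tf-idf score.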

7. The tree module

Used to build and print syntax trees.
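
A minimal sketch: the bracketed string below is the parse of 'Mary saw Bob' under the grammar from section 4, written out by hand. Tree.fromstring builds the tree and pretty_print draws it as text.

```python
from nltk.tree import Tree

# Build a tree from its bracketed string form
tree = Tree.fromstring('(S (NP Mary) (VP (V saw) (NP Bob)))')

label = tree.label()     # label of the root node
leaves = tree.leaves()   # the words, from left to right
labels = [t.label() for t in tree.subtrees()]  # root first, then its subtrees

print(label, leaves, labels)
tree.pretty_print()      # draw the tree as ASCII art
```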


Reposted from www.cnblogs.com/yangyang12138/p/12466808.html