Python NLP Learning, Advanced: Exercise Practice (2016-2-12)

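Exercise: Working with the news and romance genres of the Brown Corpus, find the day of the week that is the most newsworthy and the most romantic. Define a variable days containing a list of the days of the week, e.g. ['Monday', ...]. Then tabulate the counts for these words using cfd.tabulate(samples=days). Next, try the same thing using plot in place of tabulate. You can control the output order of the days with the help of the extra parameter conditions=['Monday', ...].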

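Analysis:

  • Work with the news and romance genres of the Brown corpus
  • Find the day of the week that is the most newsworthy and the most romantic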

Faced with this problem, there is no obvious place to start unless you already know the Brown corpus and understand how it relates to days of the week.

(1) First, let's look at the news and romance genres in the Brown corpus, and at how the corpus text relates to days of the week.

Code block 1:

>>> from nltk.corpus import brown
>>> brown.categories()
[u'adventure', u'belles_lettres', u'editorial', u'fiction', 'genre', u'government', u'hobbies', u'humor', u'learned', u'lore', u'mystery', u'news', u'religion', u'reviews', u'romance', u'science_fiction']

Code block 2:

>>> b=brown.words(categories=['news','romance'])
>>> b
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]

Code block 3:

>>> import nltk
>>> a=nltk.Text(brown.words(categories=['news','romance']))
>>> a
<Text: The Fulton County Grand Jury said Friday an...>
>>> a.concordance('Tuesday')
Displaying 25 of 46 matches:
 GOP chairman , said a meeting held Tuesday night in Blue Ridge brought enthusi
ublic relations director , resigned Tuesday to work for Lt. Gov. Garland Byrd's
. It is expected to be reported out Tuesday , but this is a little uncertain . 
n the hospital but is expected back Tuesday . Leadership is hopeful The housing
ntroller Alexander Hemphill charged Tuesday that the bids on the Frankford Elev
hich also brought these disclosures Tuesday : The city has sued for the full am
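Code block 3 shows that the Brown corpus contains many sentences that mention a day of the week, and code block 2 shows that brown.words() exposes the corpus as a list of words.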

(2) How to define "the most newsworthy and the most romantic day of the week"

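  • Measure of newsworthiness: events that mention a day of the week appear in the news genre of the Brown corpus, so the total number of times a day occurs there can serve as its news value
  • Measure of romance: events that mention a day of the week appear in the romance genre of the Brown corpus, so the total number of occurrences there likewise serves as the score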
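
The frequency measure described in the two points above can be sketched in plain Python. This is only an illustration: the helper names `count_days` and `brown_day_counts` are made up for this sketch and are not part of NLTK, and matching is assumed to be exact and case-sensitive.

```python
from collections import Counter

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def count_days(words, days=DAYS):
    """Count how often each day name occurs in an iterable of tokens.

    Matching is exact and case-sensitive, so 'monday' or 'Mondays'
    are not counted (an assumption made to keep the sketch simple).
    """
    wanted = set(days)
    counts = Counter(w for w in words if w in wanted)
    # Return the days in calendar order, including days with zero hits.
    return {day: counts[day] for day in days}


def brown_day_counts(genres=("news", "romance")):
    """Apply count_days to Brown genres (needs nltk and the 'brown' corpus)."""
    # Imported lazily so that count_days itself works without NLTK.
    from nltk.corpus import brown
    return {g: count_days(brown.words(categories=g)) for g in genres}
```

Calling `brown_day_counts()` should return one dictionary of day counts per genre, provided NLTK and the Brown corpus are installed.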
This question has no standard answer, so I offer the following two approaches:

1. Count how often each day occurs in each of the two genres, then plot the counts for each genre:

>>> days =["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
>>> cfd = nltk.ConditionalFreqDist((genre, word) for genre in ['news', 'romance'] for word in brown.words(categories=genre))
>>> cfd.plot(samples = days)
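
The exercise also asks for a table of the counts. The sketch below is one way to get it. With (genre, word) pairs, the genres are the conditions and the days are the samples, so the days are selected through `samples`. The helpers `day_pairs` and `tabulate_brown_days` are illustrative names, not NLTK functions. Filtering out non-day tokens before building the distribution is my own choice to keep it small.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def day_pairs(genre_words, days=DAYS):
    """Yield (genre, word) pairs, keeping only tokens that are day names.

    genre_words maps a genre name to an iterable of tokens.
    """
    wanted = set(days)
    for genre, words in genre_words.items():
        for word in words:
            if word in wanted:
                yield genre, word


def tabulate_brown_days(genres=("news", "romance"), days=DAYS):
    """Print a genre-by-day count table (needs nltk and the 'brown' corpus)."""
    # Imported lazily so that day_pairs can be used without NLTK.
    import nltk
    from nltk.corpus import brown

    genre_words = {g: brown.words(categories=g) for g in genres}
    cfd = nltk.ConditionalFreqDist(day_pairs(genre_words, days))
    # conditions selects the rows (genres), samples selects the columns (days).
    cfd.tabulate(conditions=list(genres), samples=days)
    return cfd
```

The returned distribution can be reused for plotting, for example with `tabulate_brown_days().plot(samples=DAYS)`.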



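2. To find the day that is both the most newsworthy and the most romantic, convert the news counts and the romance counts to proportions, then plot a single line.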
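
The post gives no code for this second approach, so the following is only a sketch under stated assumptions. Each genre's day counts are turned into shares of that genre's total, and the two shares are averaged into one score per day. The averaging step and the names `to_shares`, `combined_score` and `plot_scores` are my own. The input is a pair of dictionaries mapping day names to counts.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def to_shares(counts, days=DAYS):
    """Convert raw day counts into each day's share of the genre total."""
    total = sum(counts.get(d, 0) for d in days)
    if total == 0:
        return {d: 0.0 for d in days}
    return {d: counts.get(d, 0) / total for d in days}


def combined_score(news_counts, romance_counts, days=DAYS):
    """Average the news share and the romance share of each day.

    Averaging is an assumption of this sketch; other ways of combining
    the two proportions are equally possible.
    """
    news = to_shares(news_counts, days)
    romance = to_shares(romance_counts, days)
    return {d: (news[d] + romance[d]) / 2 for d in days}


def plot_scores(scores):
    """Draw the combined score as a single line (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily, only needed for plotting

    days = list(scores)
    positions = range(len(days))
    plt.plot(positions, [scores[d] for d in days], marker="o")
    plt.xticks(positions, days, rotation=45)
    plt.ylabel("mean share of day mentions")
    plt.tight_layout()
    plt.show()
```

Given a `scores` dictionary returned by `combined_score`, `max(scores, key=scores.get)` picks the top-scoring day under this particular definition.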




Reposted from blog.csdn.net/txlCandy/article/details/50655572