NLP Notes --- 1. Word Counting

Copyright notice: original work, reposting is welcome. https://blog.csdn.net/xf8964/article/details/88919795

Word Counting


Let's start with a small experiment: word counting. First, we need a piece of text data:
As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests.
Save it as input.txt. We want to find the 10 most frequently used words in the text, as well as the least frequently used ones. This takes a few steps:

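  • 1. Convert the text to lowercase, because car and Car should count as the same word
  • 2. Use a regular expression to strip punctuation; it is best to replace each punctuation mark with a space so that adjacent words are not merged
  • 3. Split the text on whitespace, which returns a list of words
  • 4. Count the words with a dictionary; from collections import defaultdict can also be used for the counting

Let's first write the word-counting function: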
import re


def count_word(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # maps each word to its number of occurrences
    # Convert to lowercase
    text = text.lower()

    # Replace every non-alphanumeric character with a space
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    # Split the string on whitespace
    words = text.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

    return counts

def test_run():
    # Read the whole file, then close it before processing
    with open("input.txt", "r") as f:
        text = f.read()
    counts = count_word(text)
    # Sort (word, count) pairs by count, highest first
    sorted_counts = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

    print("10 most common words:\nWord\tCount")
    for word, count in sorted_counts[:10]:
        print("{}\t{}".format(word, count))

    print("\n10 least common words:\nWord\tCount")
    for word, count in sorted_counts[-10:]:
        print("{}\t{}".format(word, count))


if __name__ == "__main__":
    test_run()
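
Step 4 mentions that collections.defaultdict can be used in place of a plain dictionary. The following is a minimal sketch of that variant. The function name count_word_defaultdict and the sample sentence are made up for illustration and are not part of the original notes. With defaultdict(int), a missing word starts at 0, so the "if word not in counts" check is no longer needed.

```python
import re
from collections import defaultdict


def count_word_defaultdict(text):
    """Count how many times each unique word occurs in text (defaultdict variant)."""
    # Hypothetical helper name; int() returns 0, so unseen words start at 0
    counts = defaultdict(int)
    # Lowercase, replace non-alphanumeric characters with spaces, split on whitespace
    for word in re.sub(r'[^a-z0-9]', ' ', text.lower()).split():
        counts[word] += 1
    return counts


# Made-up sample sentence: "Car" and "car" are counted as the same word
sample = "The car passed a Car, and the car stopped."
counts = count_word_defaultdict(sample)
print(counts["car"])  # 3
print(counts["the"])  # 2
```

The result behaves like a normal dictionary, so it can be sorted with counts.items() in the same way as in test_run. Note that looking up a word that never appeared adds it to the defaultdict with a count of 0.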
