Advanced Python (1): Word Frequency Counting

Counting element occurrences in a random sequence

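Problem:
Given a random sequence such as [2,1,3,5,6,8,4,9,2,3,5,1,5,3,…], count how many times each element occurs, then find the three most frequent elements and their counts.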

from random import randint
data = [randint(0,10) for _ in range(30)]
print(data)  # 30 draws from only 11 possible values (0 to 10), so duplicates are guaranteed

Result:
[9, 9, 1, 5, 10, 0, 3, 7, 5, 8, 6, 3, 1, 6, 4, 2, 0, 8, 9, 9, 6, 6, 6, 1, 3, 9, 6, 7, 6, 7]

Method 1:

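# The usual approach
d = dict.fromkeys(data, 0)  # use the elements of data as keys, each starting at 0
# Iterate over data and count how often each number appears:
for x in data:
    d[x] += 1
print(d)
# Sort the dictionary items by value in descending order, then slice off the top three
result = sorted(d.items(), key=lambda k: k[1], reverse=True)[:3]
print(result)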

Sorting the dictionary by value

Result:

# Output of d:
{9: 5, 1: 3, 5: 2, 10: 1, 0: 2, 3: 3, 7: 3, 8: 2, 6: 7, 4: 1, 2: 1}
# Output of result:
[(6, 7), (9, 5), (1, 3)]
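
When only the top few entries are needed, sorting the whole dictionary is more work than necessary. As a rough sketch of an alternative (not part of the original post), the standard library's heapq.nlargest can pick the three largest counts directly. The data list below is copied from the sample output above; the name counts is chosen here purely for illustration.

```python
import heapq
from operator import itemgetter

# Sample sequence copied from the output shown earlier in this post.
data = [9, 9, 1, 5, 10, 0, 3, 7, 5, 8, 6, 3, 1, 6, 4, 2, 0, 8, 9, 9,
        6, 6, 6, 1, 3, 9, 6, 7, 6, 7]

# Build the counts with dict.get, which avoids pre-filling the keys.
counts = {}
for x in data:
    counts[x] = counts.get(x, 0) + 1

# Select the three (element, count) pairs with the largest counts.
top3 = heapq.nlargest(3, counts.items(), key=itemgetter(1))
print(top3)
```

For a dictionary this small the difference is negligible; the approach mainly pays off when there are many distinct keys and only a handful of top entries are wanted.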

Method 2 (recommended):

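# Use the Counter class that ships with Python's standard library:
from collections import Counter
# Count the occurrences of each element
c2 = Counter(data)
print(c2)
# Print the three most common elements
print(c2.most_common(3))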

Result:

# Output of c2:
Counter({9: 5, 1: 3, 5: 2, 10: 1, 0: 2, 3: 3, 7: 3, 8: 2, 6: 7, 4: 1, 2: 1})
# Output of c2.most_common(3):
[(6, 7), (9, 5), (1, 3)]
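
Part of why Counter is the more convenient choice is that it behaves like a dictionary with a few counting-specific extras. The short sketch below, which reuses the sample data from above, illustrates two of them: looking up a missing key returns 0 instead of raising KeyError, and update() adds new observations to the existing counts. The extra values passed to update() are made up for the example.

```python
from collections import Counter

# Sample sequence copied from the output shown earlier in this post.
data = [9, 9, 1, 5, 10, 0, 3, 7, 5, 8, 6, 3, 1, 6, 4, 2, 0, 8, 9, 9,
        6, 6, 6, 1, 3, 9, 6, 7, 6, 7]

c = Counter(data)
print(c[6])    # count of an element that is present
print(c[42])   # a missing element counts as 0, no KeyError is raised

# Feed in more observations (made-up values); counts are added, not replaced.
c.update([6, 6, 2])
print(c.most_common(2))
```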

Word frequency counting

Problem: count the word frequencies in a passage of English text, and find the 10 most frequent words along with how many times each occurs.

# An arbitrary passage of English text
txt = '''
My grandfather died when I was a small boy, and my grandmother started staying with us for about six months every year. She lived in a room that doubled as my father's office, which we referred to as "the back room." She carried with her a powerful aroma1. I don't know what kind of perfume she used, but it was the double-barreled, ninety-proof, knockdown, render-the-victim-unconscious, moose-killing variety. She kept it in a huge atomizer and applied2 it frequently and liberally. It was almost impossible to go into her room and remain breathing for any length of time. When she would leave the house to go spend six months with my Aunt Lillian, my mother and sisters would throw open all the windows, strip the bed, and take out the curtains and rugs. Then they would spend several days washing and airing things out, trying frantically3 to make the pungent4 odor go away.
This, then, was my grandmother at the time of the infamous5 pea incident.
'''

Next, collections.Counter is used to count the word frequencies:

import re
# Split txt on runs of non-word characters (anything other than letters, digits and underscore) and count the pieces
c3 = Counter(re.split(r'\W+', txt))
print(c3.most_common(10))

Output:

[('the', 10),
 ('and', 8),
 ('my', 5),
 ('was', 4),
 ('a', 4),
 ('to', 4),
 ('with', 3),
 ('She', 3),
 ('room', 3),
 ('of', 3)]
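
Two details of this approach are worth keeping in mind. The count is case-sensitive, so 'She' and 'she' are tallied as different words, and re.split yields empty strings when the text starts or ends with a separator. A possible variation, offered only as a sketch, lowercases the text and extracts words with re.findall. The helper name word_counts and the short sample string (adapted from the passage above) are assumptions made for this example.

```python
import re
from collections import Counter

def word_counts(text):
    """Count words case-insensitively (hypothetical helper for illustration)."""
    # Lowercase first so 'She' and 'she' land in the same bucket, then keep
    # runs of letters and apostrophes so "don't" stays a single word.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Short sample adapted from the passage above.
sample = "She carried with her a powerful aroma. I don't know what kind of perfume she used."
c = word_counts(sample)
print(c.most_common(1))
print(c["don't"])
```

Because findall only returns matches, no empty strings end up in the counter, unlike splitting on separators.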


Reposted from blog.csdn.net/weixin_39791387/article/details/81705895