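Finding the most frequent words in an array or a document, or ranking elements by how often they occur, is a common task in data analysis.
Let's start with the most commonly used approach: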
from random import randint
data = [randint(100, 110) for _ in range(30)]  # build a list of 30 random integers
d = dict.fromkeys(data, 0)  # build a dict with the values in data as keys and 0 as each value
for i in data:
    d[i] += 1
sorted([(v, k) for k, v in d.items()], reverse=True)  # after counting, sort by frequency in descending order
>>>
[(5, 109),
(5, 108),
(5, 103),
(4, 105),
(3, 107),
(2, 110),
(2, 106),
(2, 101),
(1, 104),
(1, 102)]
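The code above can be tightened slightly by passing a generator expression to sorted instead of building an intermediate list: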
res = sorted(((v, k) for k,v in d.items()), reverse=True)
res[:3]
>>> [(5, 109), (5, 108), (5, 103)]
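With a large amount of data, however, this still uses a lot of memory, because sorted builds the full ranking even though only the top few elements are needed. A further improvement is to use Python's built-in heap module, heapq: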
import heapq
heapq.nlargest(3, ((v, k) for k, v in d.items()))  # (count, element) pairs, so tuples compare by count first
>>> [(5, 109), (5, 108), (5, 103)]
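Another way to write the heapq version is to leave the (element, count) pairs as they are and pass a key function to heapq.nlargest. The following is a minimal sketch; the fixed list `sample` and the names `counts` and `top_two` are made up for illustration so that the result is reproducible, unlike the random data above.

```python
import heapq
from operator import itemgetter

# Fixed sample data (hypothetical) so the result is deterministic.
sample = [101, 103, 103, 105, 105, 105, 107, 107, 107, 107]

# Count occurrences with a plain dict.
counts = {}
for value in sample:
    counts[value] = counts.get(value, 0) + 1

# Pick the 2 (element, count) pairs with the largest counts,
# comparing on the count (index 1 of each pair).
top_two = heapq.nlargest(2, counts.items(), key=itemgetter(1))
print(top_two)  # [(107, 4), (105, 3)]
```

With a key function there is no need to swap each pair into (count, element) form first, and the result keeps the same (element, count) shape as the dict items.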
There is also the Counter class built into collections:
from collections import Counter
d = Counter(data)
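d.most_common(3)  # again the top three; pairs are (element, count), and tied elements keep the order in which they were first seen
>>> [(109, 5), (108, 5), (103, 5)]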
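The same Counter approach applies to words in a document. Here is a small sketch, assuming the text is split on whitespace only; the sentence in `text` is a made-up example, and real documents would likely need lowercasing and punctuation handling as well.

```python
from collections import Counter

# Hypothetical sample text; split() breaks it on whitespace.
text = "the cat saw the dog and the dog saw a bird"
words = text.split()

# Counter counts each distinct word in one pass.
word_counts = Counter(words)

# most_common(1) returns a list with the single most frequent (word, count) pair.
top_word = word_counts.most_common(1)
print(top_word)  # [('the', 3)]
```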
"""
A look at the source:
def most_common(self, n=None):
    '''List the n most common elements and their counts from the most
    common to the least. If n is None, then list all element counts.
    >>> Counter('abcdeabcdabcaba').most_common(3)
    [('a', 5), ('b', 4), ('c', 3)]
    '''
    # Emulate Bag.sortedByCount from Smalltalk
    if n is None:
        return sorted(self.items(), key=_itemgetter(1), reverse=True)
    return _heapq.nlargest(n, self.items(), key=_itemgetter(1))
It also uses the built-in heapq module, just wrapped up. Heaps are worth studying more closely.
"""
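To get a feel for why a heap helps here, the idea behind a top-n selection can be sketched with a min-heap that never holds more than n items. This is only an illustration of the technique, not the actual CPython implementation of heapq.nlargest; the function name `top_n_by_count` and the sample dict are hypothetical.

```python
import heapq

def top_n_by_count(counts, n):
    """Return the n largest (count, element) pairs, largest first (illustrative helper)."""
    if n <= 0:
        return []
    heap = []  # min-heap holding at most n (count, element) pairs
    for element, count in counts.items():
        if len(heap) < n:
            heapq.heappush(heap, (count, element))
        elif (count, element) > heap[0]:
            # The new pair beats the smallest pair kept so far: swap it in.
            heapq.heapreplace(heap, (count, element))
    # Only the n survivors are sorted, not the whole input.
    return sorted(heap, reverse=True)

counts = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}
print(top_n_by_count(counts, 3))  # [(5, 'a'), (4, 'b'), (3, 'c')]
```

Because the heap stays at size n, the extra memory is proportional to n rather than to the number of distinct elements, which is the saving over sorting everything.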