On the Efficiency of Python Arrays

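If we need a list that holds nothing but numbers, an array (array.array) is more efficient than a list. Arrays also support all the operations of mutable sequences, such as removing an element (.pop), inserting an element (.insert), and appending several values from another sequence at once (.extend). In addition, arrays provide faster methods for reading from a file (.fromfile) and writing to one (.tofile).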
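As a small illustration of the mutable-sequence operations mentioned above, the sketch below uses an array of type code 'i'; the variable names and values are made up for the example.

```python
from array import array

# An array of signed ints; the values are arbitrary sample data.
numbers = array('i', [10, 20, 30])

numbers.insert(1, 15)        # insert 15 before index 1
numbers.extend([40, 50])     # append several values from another sequence
last = numbers.pop()         # remove and return the last element

print(numbers, last)
```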

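Creating an array requires a type code, as in array('d'). The type code specifies the underlying C data type used to store each item. The Python most of us use is implemented in C, which is why it is called CPython.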

Python defines the following type codes:

Type code	C type	Python type	Minimum size in bytes	Notes
'b'	signed char	int	1
'B'	unsigned char	int	1
'u'	Py_UNICODE	Unicode character	2	(1)
'h'	signed short	int	2
'H'	unsigned short	int	2
'i'	signed int	int	2
'I'	unsigned int	int	2
'l'	signed long	int	4
'L'	unsigned long	int	4
'q'	signed long long	int	8
'Q'	unsigned long long	int	8
'f'	float	float	4
'd'	double	float	8

Note (1): the 'u' type code corresponds to Python's obsolete Unicode character type (Py_UNICODE, which is wchar_t). Depending on the platform, it may be 16 bits or 32 bits wide.
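The sizes in the table are minimums, and the actual size depends on the platform. A minimal sketch for checking the sizes on your own machine is shown below; it skips the obsolete 'u' code, and the `sizes` dictionary is just an illustrative name.

```python
from array import array

# Numeric type codes from the table above ('u' is left out on purpose).
numeric_codes = 'bBhHiIlLqQfd'

# itemsize is the number of bytes one element occupies on this platform.
sizes = {code: array(code).itemsize for code in numeric_codes}

for code, size in sizes.items():
    print(code, size)
```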

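For example, the type code b stands for signed char, so an array created with array('b') can only hold one-byte integers, ranging from -128 to 127. Thanks to this restriction, even a very long sequence containing many numbers takes up little space.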
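A rough way to see the space saving is to compare sys.getsizeof for a list and for a 'b' array holding the same values. This is only a sketch: getsizeof on a list counts the list's table of references but not the int objects it points to, so it understates the list's real footprint, and the exact numbers vary by Python version and platform.

```python
import sys
from array import array

count = 100_000
as_list = [0] * count            # each slot is a pointer to an int object
as_array = array('b', as_list)   # each slot is a single raw byte

list_bytes = sys.getsizeof(as_list)
array_bytes = sys.getsizeof(as_array)

print('list :', list_bytes, 'bytes')
print('array:', array_bytes, 'bytes')
```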

Once an array's type is fixed, it cannot store values that do not fit that type.
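The sketch below shows how this restriction surfaces in practice, assuming a 'b' array: a value outside -128 to 127 is rejected with OverflowError, and a non-integer value with TypeError. The sample values are arbitrary.

```python
from array import array

small_ints = array('b', [1, 2, 3])
errors = []

# 128 is out of range for a signed char; 1.5 and 'x' are not integers.
for bad_value in (128, 1.5, 'x'):
    try:
        small_ints.append(bad_value)
    except (OverflowError, TypeError) as exc:
        errors.append(type(exc).__name__)

print(errors)
print(small_ints)  # unchanged: none of the bad values were stored
```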

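Luciano Ramalho gives an example that illustrates the efficiency of arrays. It first creates an array of 10 million random floats, then writes the data to a file, and finally reads the data back.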

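import logging
from array import array
from random import random

# Match the "INFO - message" format of the output shown in this article.
logging.basicConfig(level=logging.INFO, format='%(levelname)s - %(message)s')

# Build an array of 10 million random double-precision floats.
floats = array('d', (random() for i in range(10 ** 7)))
logging.info('floats[-1] -> %s', floats[-1])

# Write the raw binary values to a file.
with open('floats.bin', 'wb') as fp:
    floats.tofile(fp)

# Read the 10 million values back into a new, empty array.
floats2 = array('d')
with open('floats.bin', 'rb') as fp:
    floats2.fromfile(fp, 10 ** 7)
logging.info('floats2[-1] -> %s', floats2[-1])
logging.info('floats2==floats -> %s', floats2 == floats)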

Output:

INFO - floats[-1] -> 0.9160358679542017
INFO - floats2[-1] -> 0.9160358679542017
INFO - floats2==floats -> True

Profiling the code with the cProfile module produces the following output:

INFO -          192 function calls (180 primitive calls) in 0.098 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.061    0.061    0.061    0.061 {method 'fromfile' of 'array.array' objects}
        1    0.030    0.030    0.030    0.030 {method 'tofile' of 'array.array' objects}
        2    0.007    0.003    0.007    0.003 {built-in method io.open}
...

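According to the profile above, writing an array of 10 million random floats to a file and reading it back takes only about 0.1 s in total (0.098 s in this run). The resulting file holds 10 ** 7 values of 8 bytes each, that is 80,000,000 bytes, or about 76 MiB.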
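The article does not show how the profile was collected. One possible way to get a similar report is sketched below; the helper `write_and_read`, the temporary file location, and the smaller array size are assumptions made for this example, not the original setup.

```python
import cProfile
import io
import os
import pstats
import tempfile
from array import array
from random import random

# A smaller array than in the article so the sketch runs quickly.
floats = array('d', (random() for _ in range(10 ** 5)))
path = os.path.join(tempfile.mkdtemp(), 'floats.bin')


def write_and_read():
    """Write the array to a binary file and load it into a new array."""
    with open(path, 'wb') as fp:
        floats.tofile(fp)
    loaded = array('d')
    with open(path, 'rb') as fp:
        loaded.fromfile(fp, len(floats))
    return loaded


# Profile only the file write and read, not the creation of the array.
profiler = cProfile.Profile()
profiler.enable()
floats2 = write_and_read()
profiler.disable()

# Sort by cumulative time, as in the report shown above.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats('cumulative').print_stats()
print(report.getvalue())
```

Timings will differ from the figures above depending on hardware, disk, and array size.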

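First, a generator expression creates an iterable (** is the exponentiation operator), from which an array of double-precision floats is built (type code 'd');
Index -1 returns the last element of the array;
'wb' opens the file for writing in binary mode: w stands for write and b for binary;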

binary /ˈbaɪnəri/
using only 0 and 1 as a system of numbers

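An array can be created with initial values, or created empty without any, as in array('d');
The second argument to fromfile() is the number of items to read from the file; if fewer items are available, EOFError is raised;
As the output shows, the array read back from the file is identical to the one that was written.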
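To make the role of fromfile()'s second argument concrete, here is a small sketch with a three-item file; the file name and values are made up. Asking for more items than the file contains raises EOFError, although the items that were available are still added to the array.

```python
import os
import tempfile
from array import array

values = array('d', [1.0, 2.0, 3.0])
path = os.path.join(tempfile.mkdtemp(), 'values.bin')
with open(path, 'wb') as fp:
    values.tofile(fp)

# Read only the first two items.
first_two = array('d')
with open(path, 'rb') as fp:
    first_two.fromfile(fp, 2)

# Ask for five items when the file holds only three.
too_many = array('d')
try:
    with open(path, 'rb') as fp:
        too_many.fromfile(fp, 5)
except EOFError:
    hit_eof = True
else:
    hit_eof = False

print(first_two, hit_eof, len(too_many))
```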

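Because array.tofile writes the data to a binary file, it is much faster than writing the numbers to a text file. The difference in write performance is reported to be nearly 7 times.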
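A rough way to compare the two approaches yourself is sketched below. It assumes the text format is one float per line and uses a smaller array and temporary files chosen for this example; the measured ratio depends on the machine and will not necessarily match the figure quoted above.

```python
import os
import tempfile
import time
from array import array
from random import random

floats = array('d', (random() for _ in range(10 ** 5)))
tmp_dir = tempfile.mkdtemp()
bin_path = os.path.join(tmp_dir, 'floats.bin')
txt_path = os.path.join(tmp_dir, 'floats.txt')

# Binary: dump the raw machine values in one call.
start = time.perf_counter()
with open(bin_path, 'wb') as fp:
    floats.tofile(fp)
bin_seconds = time.perf_counter() - start

# Text: convert every float to its repr and write one per line.
start = time.perf_counter()
with open(txt_path, 'w') as fp:
    for value in floats:
        fp.write(f'{value!r}\n')
txt_seconds = time.perf_counter() - start

bin_size = os.path.getsize(bin_path)
txt_size = os.path.getsize(txt_path)
print(f'binary: {bin_seconds:.4f} s, {bin_size} bytes')
print(f'text:   {txt_seconds:.4f} s, {txt_size} bytes')
```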
