Data Mining: Advanced NumPy, Indexing and Indexing Tricks

Broadcasting rules

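Broadcasting allows universal functions (ufuncs) to deal in a meaningful way with inputs that do not have exactly the same shape.

The first rule of broadcasting: if the input arrays do not all have the same number of dimensions, a "1" is repeatedly prepended to the shapes of the smaller arrays until all arrays have the same number of dimensions.

The second rule of broadcasting: an array with a size of 1 along a particular dimension acts as if it had the size of the array with the largest shape along that dimension. The value of the array element is assumed to be the same along that dimension.

After the broadcasting rules are applied, the sizes of all arrays must match.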
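A minimal sketch of the two rules in action. The arrays `grid`, `row` and `column` are made up for illustration and do not appear elsewhere in this article.

```python
import numpy

grid = numpy.ones((3, 4))                  # shape (3, 4)
row = numpy.arange(4)                      # shape (4,), treated as (1, 4) by the first rule
column = numpy.arange(3).reshape(3, 1)     # shape (3, 1)

# Second rule: axes of length 1 are stretched to match the other operand.
print((grid + row).shape)      # (3, 4)
print((row + column).shape)    # (3, 4)

# Shapes that still disagree after both rules raise an error.
try:
    grid + numpy.arange(3)     # (3, 4) against (1, 3): 4 != 3
except ValueError as exc:
    print("broadcast failed:", exc)
```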

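Indexing and indexing tricks

NumPy offers more indexing facilities than regular Python sequences. In addition to indexing by integers and slices, as we saw before, arrays can be indexed by arrays of integers and arrays of booleans.

Indexing with arrays of indices

Using the elements of array i as indices, read the corresponding values of array a. The result has the same shape as the index array i (here it is one-dimensional).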

import numpy
a = numpy.arange(12)**2
print("a: ", a)
i = numpy.array([1, 1, 3, 8, 5])
print("i: ", i)
# Use the elements of i as indices into a; the result has the same shape as i
print("a[i]: ", a[i])

j = numpy.array([[3, 4], [9, 7]])
print("j: ", j)
# The result has the same shape as the index array j
print("a[j]: ", a[j])

"E:\Python 3.6.2\python.exe" F:/PycharmProjects/test.py
a:  [  0   1   4   9  16  25  36  49  64  81 100 121]
i:  [1 1 3 8 5]
a[i]:  [ 1  1  9 64 25]
j:  [[3 4]
 [9 7]]
a[j]:  [[ 9 16]
 [81 49]]

Process finished with exit code 0

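When the indexed array a is multidimensional, a single array of indices refers to the first dimension of a. The following example shows this behavior by converting an image of labels into a color image using a palette.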

import numpy
palette = numpy.array([
    [0, 0, 0],  # black
    [255, 0, 0],  # red
    [0, 255, 0],  # green
    [0, 0, 255],  # blue
    [255, 255, 255]  # white
])

image = numpy.array([
    [0, 1, 2, 0],
    [0, 3, 4, 0]
])
print(palette[image])

"E:\Python 3.6.2\python.exe" F:/PycharmProjects/test.py
[[[  0   0   0]
  [255   0   0]
  [  0 255   0]
  [  0   0   0]]

 [[  0   0   0]
  [  0   0 255]
  [255 255 255]
  [  0   0   0]]]

Process finished with exit code 0

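We can also give indices for more than one dimension. The index arrays for each dimension must have the same shape (or be broadcastable to a common shape, as with the scalar 2 in a[i, 2] below).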

import numpy
a = numpy.arange(12).reshape(3, 4)
print("a: ", a)
i = numpy.array([
    [0, 1],
    [1, 2],
])
print("i: ", i)
j = numpy.array([
    [2, 1],
    [3, 3]
])
print("j: ", j)

print("a[i, j]: ", a[i, j])
print("a[i, 2]: ", a[i, 2])
print("a[:, j]: ", a[:, j])  # for every row of a, pick the columns listed in j; the result shape is (a.shape[0],) + j.shape

"E:\Python 3.6.2\python.exe" F:/PycharmProjects/test.py
a:  [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
i:  [[0 1]
 [1 2]]
j:  [[2 1]
 [3 3]]
a[i, j]:  [[ 2  5]
 [ 7 11]]
a[i, 2]:  [[ 2  6]
 [ 6 10]]
a[:, j]:  [[[ 2  1]
  [ 3  3]]

 [[ 6  5]
  [ 7  7]]

 [[10  9]
  [11 11]]]

Process finished with exit code 0
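The `a[i, 2]` case works because the scalar 2 is broadcast against i. The same mechanism lets index arrays of different shapes be combined. A small sketch, where the names `rows`, `cols` and `block` are illustrative only:

```python
import numpy

a = numpy.arange(12).reshape(3, 4)
rows = numpy.array([[0], [2]])    # shape (2, 1)
cols = numpy.array([[1, 3]])      # shape (1, 2)

# The two index arrays broadcast to shape (2, 2), so every (row, column) pair is visited.
block = a[rows, cols]
print(block)
# [[ 1  3]
#  [ 9 11]]
```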

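We can also put i and j in a sequence and index with that. Older NumPy releases accepted a list here, but that usage was deprecated and current releases expect a tuple, so a tuple is used below.

import numpy
a = numpy.arange(12).reshape(3, 4)
print("a: ", a)
i = numpy.array([
    [0, 1],
    [1, 2],
])
print("i: ", i)
j = numpy.array([
    [2, 1],
    [3, 3]
])
print("j: ", j)

# Put i and j in a tuple and index with the tuple.
l = (i, j)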
print("a[l]: ", a[l])

"E:\Python 3.6.2\python.exe" F:/PycharmProjects/test.py
a:  [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
i:  [[0 1]
 [1 2]]
j:  [[2 1]
 [3 3]]
a[l]:  [[ 2  5]
 [ 7 11]]

Process finished with exit code 0

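However, we cannot do this by putting i and j into a single array, because that array would be interpreted as indexing the first dimension of a.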

import numpy
a = numpy.arange(12).reshape(3, 4)
print("a: ", a)
i = numpy.array([
    [0, 1],
    [1, 2],
])
print("i: ", i)
j = numpy.array([
    [2, 1],
    [3, 3]
])
print("j: ", j)

# Pack i and j into a single array.
l = numpy.array([i, j])
print("L: ", l)
# print("a[l]: ", a[l])  # wrong: l would index only the first dimension of a
print("a[l]: ", a[l[0], l[1]])  # same as a[tuple(l)]

"E:\Python 3.6.2\python.exe" F:/PycharmProjects/test.py
a:  [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
i:  [[0 1]
 [1 2]]
j:  [[2 1]
 [3 3]]
L:  [[[0 1]
  [1 2]]

 [[2 1]
  [3 3]]]
a[l]:  [[ 2  5]
 [ 7 11]]

Process finished with exit code 0
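To see why the packed array cannot be used directly, the sketch below tries it and catches the resulting error. The name `packed` is illustrative, and the exact wording of the error message depends on the NumPy version.

```python
import numpy

a = numpy.arange(12).reshape(3, 4)
i = numpy.array([[0, 1], [1, 2]])
j = numpy.array([[2, 1], [3, 3]])
packed = numpy.array([i, j])      # shape (2, 2, 2)

try:
    a[packed]                     # every entry is used as a row index; 3 is out of range
except IndexError as exc:
    print("IndexError:", exc)

# Unpacking along the first axis restores one index array per dimension.
print(a[tuple(packed)])
```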

Another common use of indexing with arrays is finding the maximum values of time-dependent series.

import numpy
time = numpy.linspace(20, 145, 5)  # five evenly spaced points from 20 to 145
print("time: ", time)

data = numpy.sin(numpy.arange(20))
data.shape = 5, 4
print("data: ", data)

ind = data.argmax(axis=0)  # for each column, the row index of its maximum
print("index: ", ind)

time_max = time[ind]
print(time_max)

data_max = data[ind, range(data.shape[1])]
# In Python 3, range() replaces Python 2's xrange().
# data.shape[1] is the length of data's second dimension (the number of columns).
print("data_max: ", data_max)
print("data.max: ", data.max(axis=0))
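"""
ndarray.max(axis=None)
Returns the maximum along the given axis; by default, the maximum over all elements.
axis=0: maximum of each column
axis=1: maximum of each row
"""
print(all(data_max == data.max(axis=0)))
# all() returns True if every element of the iterable is truthy (not 0, '', False, ...) or if the iterable is empty; otherwise it returns False.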

"E:\Python 3.6.2\python.exe" F:/PycharmProjects/test.py
time:  [  20.     51.25   82.5   113.75  145.  ]
data:  [[ 0.          0.84147098  0.90929743  0.14112001]
 [-0.7568025  -0.95892427 -0.2794155   0.6569866 ]
 [ 0.98935825  0.41211849 -0.54402111 -0.99999021]
 [-0.53657292  0.42016704  0.99060736  0.65028784]
 [-0.28790332 -0.96139749 -0.75098725  0.14987721]]
index:  [2 0 3 1]
[  82.5    20.    113.75   51.25]
data_max:  [ 0.98935825  0.84147098  0.99060736  0.6569866 ]
data.max:  [ 0.98935825  0.84147098  0.99060736  0.6569866 ]
True

Process finished with exit code 0
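As an alternative to building the column index with range(), the same selection can be sketched with numpy.take_along_axis, which is available in NumPy 1.15 and later. The names `picked` and `data_max` below are illustrative.

```python
import numpy

data = numpy.sin(numpy.arange(20)).reshape(5, 4)
ind = data.argmax(axis=0)         # row index of the maximum in each column

# ind[numpy.newaxis, :] has shape (1, 4): one row index per column.
picked = numpy.take_along_axis(data, ind[numpy.newaxis, :], axis=0)
data_max = picked[0]

print(numpy.array_equal(data_max, data.max(axis=0)))
```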

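You can also use indexing with arrays as a target to assign to. When the list of indices contains repetitions, the assignment is done several times, leaving behind the last value: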

import numpy
a = numpy.arange(5)
print("a: ", a)
a[[1, 3, 4, 4]] = [0, 0, 0, 1]
print("a: ", a)

"E:\Python 3.6.2\python.exe" F:/PycharmProjects/test.py
a:  [0 1 2 3 4]
a:  [0 0 2 0 1]

Process finished with exit code 0

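With the += construct, even though 0 appears twice in the index list, the element at index 0 is incremented only once, because a[idx] += 1 is evaluated as a[idx] = a[idx] + 1.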

import numpy
a = numpy.arange(5)
print("a: ", a)
a[[0, 0, 2]] += 1
print("a: ", a)
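When repeated indices should accumulate, the unbuffered ufunc method numpy.add.at is one option. The sketch below contrasts it with +=; the second array `b` is introduced only for the comparison.

```python
import numpy

a = numpy.arange(5)
a[[0, 0, 2]] += 1
print(a)                          # [1 1 3 3 4]

b = numpy.arange(5)
numpy.add.at(b, [0, 0, 2], 1)     # unbuffered: index 0 is incremented twice
print(b)                          # [2 1 3 3 4]
```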


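Indexing with boolean arrays

When we index arrays with arrays of integer indices, we provide the list of indices to pick. With boolean indices the approach is different: we explicitly choose which elements of the array we want and which we do not.

The most natural way to use boolean indexing is to use a boolean array with the same shape as the original array.

Reading with a boolean index does not modify the original array; the result is always a copy of the selected elements, not a view.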

import numpy
# Boolean indexing
a = numpy.arange(12).reshape(3, 4)
b = a > 4
print("b: ", b)
print("a[b]: ", a[b])  # 1-D array of the elements of a where b is True

"E:\Python 3.6.2\python.exe" F:/PycharmProjects/test.py
b:  [[False False False False]
 [False  True  True  True]
 [ True  True  True  True]]
a[b]:  [ 5  6  7  8  9 10 11]

Process finished with exit code 0
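A short check that the selection is a copy rather than a view. The name `selected` is illustrative.

```python
import numpy

a = numpy.arange(12).reshape(3, 4)
selected = a[a > 4]               # new array holding the selected values
selected[:] = -1                  # changes only the copy

print(a[1, 1])                                # 5
print(numpy.shares_memory(a, selected))       # False
```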

Assigning through a boolean index:

import numpy
# Boolean indexing
a = numpy.arange(12).reshape(3, 4)
print("a: ", a)
b = a > 4
a[b] = 0  # every element where b is True is overwritten
print("a: ", a)

"E:\Python 3.6.2\python.exe" F:/PycharmProjects/test.py
a:  [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
a:  [[0 1 2 3]
 [4 0 0 0]
 [0 0 0 0]]

Process finished with exit code 0

See the Mandelbrot set example for how boolean indexing can be used to generate an image of the Mandelbrot set.
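For readers without that example at hand, here is a rough sketch of the idea, not necessarily the referenced code. The function name `mandelbrot_escape_times`, the grid bounds and the default parameters are all assumptions chosen for illustration.

```python
import numpy

def mandelbrot_escape_times(height, width, max_iter=20, radius=2.0):
    """Return, for each grid point c, the step at which |z| first exceeds radius."""
    re = numpy.linspace(-2.0, 1.0, width)
    im = numpy.linspace(-1.5, 1.5, height)
    c = re[numpy.newaxis, :] + 1j * im[:, numpy.newaxis]   # broadcast to (height, width)
    z = numpy.zeros_like(c)
    escape = numpy.full(c.shape, max_iter, dtype=int)

    for step in range(max_iter):
        z = z * z + c
        diverged = numpy.abs(z) > radius            # boolean mask: points outside the radius
        newly = diverged & (escape == max_iter)     # points diverging for the first time
        escape[newly] = step                        # boolean-index assignment
        z[diverged] = radius                        # clamp to avoid overflow
    return escape

escape = mandelbrot_escape_times(7, 7)
print(escape)
```

Points that never leave the radius keep the value max_iter; passing the returned array to an image plotting routine gives the familiar picture.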

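The second way of indexing with booleans is closer to integer indexing: for each dimension of the array we give a 1-D boolean array selecting the slices we want. The position of the boolean array in the index determines whether rows or columns are selected: a[b, :] selects rows and a[:, b] selects columns.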

import numpy
a = numpy.arange(12).reshape(3, 4)
print("a: ", a)
b1 = numpy.array([False, True, True])  # row selector: keeps rows where True; the number of columns is unchanged
b2 = numpy.array([True, False, True, False])  # column selector
print("a[b1, :] ", a[b1, :])  # same as a[b1]
print("a[:, b2] ", a[:, b2])  # keeps columns where True; the number of rows is unchanged
print("a[b1, b2] ", a[b1, b2])  # pairs the True positions: a[1, 0] and a[2, 2]; not the same as a[b1, :][:, b2]

"E:\Python 3.6.2\python.exe" F:/PycharmProjects/test.py
a:  [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
a[b1, :]  [[ 4  5  6  7]
 [ 8  9 10 11]]
a[:, b2]  [[ 0  2]
 [ 4  6]
 [ 8 10]]
a[b1, b2]  [ 4 10]

Process finished with exit code 0

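Note that the length of the 1-D boolean array must coincide with the length of the dimension (or axis) you want to slice. In the previous example, b1 is a 1-D array of length 3 (the number of rows in a), and b2 (of length 4) matches the second axis (columns) of a.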
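The a[b1, b2] result in the example often surprises people: the True positions of the two masks are paired element by element instead of being crossed. A sketch of the difference, using numpy.ix_ to get the crossed selection:

```python
import numpy

a = numpy.arange(12).reshape(3, 4)
b1 = numpy.array([False, True, True])
b2 = numpy.array([True, False, True, False])

print(a[b1, b2])                  # [ 4 10], the pairs (1, 0) and (2, 2)
print(a[numpy.ix_(b1, b2)])       # rows 1 and 2 crossed with columns 0 and 2
# [[ 4  6]
#  [ 8 10]]
print(numpy.array_equal(a[numpy.ix_(b1, b2)], a[b1, :][:, b2]))
```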

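The ix_() function

The ix_() function can be used to combine different vectors so as to obtain the result for each n-tuple. For example, to compute a+b*c for all the triplets taken from each of the vectors a, b and c: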

import numpy
a = numpy.array([2, 3, 4, 5])
b = numpy.array([8, 5, 4])
c = numpy.array([5, 4, 6, 8, 3])
ax, bx, cx = numpy.ix_(a, b, c)
print("ax: ", ax)
print("bx: ", bx)
print("cx: ", cx)

print("ax.shape: ", ax.shape)
print("bx.shape: ", bx.shape)
print("cx.shape: ", cx.shape)

result = ax + bx * cx
print("ax + bx * cx: ", result)
print("result[3, 2, 4] ", result[3, 2, 4])

"E:\Python 3.6.2\python.exe" F:/PycharmProjects/test.py
ax:  [[[2]]

 [[3]]

 [[4]]

 [[5]]]
bx:  [[[8]
  [5]
  [4]]]
cx:  [[[5 4 6 8 3]]]
ax.shape:  (4, 1, 1)
bx.shape:  (1, 3, 1)
cx.shape:  (1, 1, 5)
ax + bx * cx:  [[[42 34 50 66 26]
  [27 22 32 42 17]
  [22 18 26 34 14]]

 [[43 35 51 67 27]
  [28 23 33 43 18]
  [23 19 27 35 15]]

 [[44 36 52 68 28]
  [29 24 34 44 19]
  [24 20 28 36 16]]

 [[45 37 53 69 29]
  [30 25 35 45 20]
  [25 21 29 37 17]]]
result[3, 2, 4]  17

Process finished with exit code 0
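What ix_() does to each vector can also be written by hand with numpy.newaxis. The sketch below checks that both spellings agree; the name `manual` is illustrative.

```python
import numpy

a = numpy.array([2, 3, 4, 5])
b = numpy.array([8, 5, 4])
c = numpy.array([5, 4, 6, 8, 3])

ax, bx, cx = numpy.ix_(a, b, c)
# Shapes (4, 1, 1), (1, 3, 1) and (5,) broadcast to (4, 3, 5).
manual = a[:, numpy.newaxis, numpy.newaxis] + b[numpy.newaxis, :, numpy.newaxis] * c

print(numpy.array_equal(ax + bx * cx, manual))
print(manual[3, 2, 4] == a[3] + b[2] * c[4])
```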

You could also implement the reduce as follows:

import numpy
def ufunc_reduce(ufct, *vectors):
    vs = numpy.ix_(*vectors)
    r = ufct.identity
    for v in vs:
        r = ufct(r, v)
    return r
a = numpy.array([2, 3, 4, 5])
b = numpy.array([8, 5, 4])
c = numpy.array([5, 4, 6, 8, 3])
print(ufunc_reduce(numpy.add, a, b, c))

"E:\Python 3.6.2\python.exe" F:/PycharmProjects/test.py
[[[15 14 16 18 13]
  [12 11 13 15 10]
  [11 10 12 14  9]]

 [[16 15 17 19 14]
  [13 12 14 16 11]
  [12 11 13 15 10]]

 [[17 16 18 20 15]
  [14 13 15 17 12]
  [13 12 14 16 11]]

 [[18 17 19 21 16]
  [15 14 16 18 13]
  [14 13 15 17 12]]]

Process finished with exit code 0

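The advantage of this version of reduce compared to the normal ufunc.reduce (for example add.reduce) is that it makes use of the broadcasting rules in order to avoid creating an argument array whose size is the output size times the number of vectors.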
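To make the comparison concrete, the sketch below builds the large argument array that add.reduce would need and confirms that both routes give the same result. The name `stacked` is illustrative, and the ufunc_reduce definition is repeated so the block runs on its own.

```python
import numpy

def ufunc_reduce(ufct, *vectors):
    vs = numpy.ix_(*vectors)
    r = ufct.identity
    for v in vs:
        r = ufct(r, v)
    return r

a = numpy.array([2, 3, 4, 5])
b = numpy.array([8, 5, 4])
c = numpy.array([5, 4, 6, 8, 3])

# add.reduce needs one full-size copy per vector stacked along a new first axis.
stacked = numpy.stack(numpy.broadcast_arrays(*numpy.ix_(a, b, c)))
print(stacked.shape)              # (3, 4, 3, 5)

via_reduce = numpy.add.reduce(stacked, axis=0)
print(numpy.array_equal(via_reduce, ufunc_reduce(numpy.add, a, b, c)))
```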