Pandas Data Structures

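pandas is a core library for data analysis. It bundles tools for structuring, cleaning, and analyzing data. On top of numpy it adds three data types: Series, DataFrame, and Panel (Panel was deprecated and has been removed from later pandas releases).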

import numpy as np

1. Series

A Series is an object similar to a one-dimensional array. It consists of two parts:

values: a set of data (an ndarray)
index: the index labels associated with the data

# Import Series (and pandas itself, which is used as pd later on)
import pandas as pd
from pandas import Series

1) Creating a Series

There are two ways to create one:

(1) From a list or a numpy array

The default index is the integers 0 to N-1.

nd = np.array([1,2,3,4])

array([1, 2, 3, 4])

s = Series(nd) # no index given, so it defaults to 0~N-1

0    1
1    2
2    3
3    4
dtype: int32

s = Series([1,2,3,4,5],index=list("abcde"))

a    1
b    2
c    3
d    4
e    5
dtype: int64

s["a"]

1

s = Series([1,2,3,4,5,6],index=["A","A","B","B","A","C"])

A    1
A    2
B    3
B    4
A    5
C    6
dtype: int64

s["B"]

B    3
B    4
dtype: int64

(2) From a dict

s = Series({"a":1,"b":2,"c":3})

a    1
b    2
c    3
dtype: int64

s1=Series({"a":123,"b":431},index=list("ac"))

a    123.0
c      NaN
dtype: float64
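The two creation routes above can be compared side by side. The following is a minimal sketch; the variable names (from_list, from_array, from_dict, reindexed) are illustrative and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

# From a list with explicit labels
from_list = pd.Series([1, 2, 3], index=list("abc"))

# From a numpy array: the index defaults to 0..N-1
from_array = pd.Series(np.array([10, 20, 30]))

# From a dict: the keys become the index labels
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# Passing index= together with a dict selects by label;
# labels missing from the dict get NaN, which forces a float dtype
reindexed = pd.Series({"a": 123, "b": 431}, index=list("ac"))

print(from_array.index.tolist())
print(reindexed)
```

Note that "b" is dropped from reindexed because it is not in the requested index, while "c" appears with NaN because it is not in the dict.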

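============================================

Exercise 1:

Create the following Series in several different ways and name it s1 (语文 = Chinese, 数学 = math, 英语 = English, 理综 = combined science):
语文 150
数学 150
英语 150
理综 300

============================================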

# From an array
nd = np.array([150,150,150,300])
s1 = Series(nd,index=["语文","数学","英语","理综"])

语文    150
数学    150
英语    150
理综    300
dtype: int32

# From a dict
dic = {"语文":150,"数学":150,"英语":150,"理综":300}
s2 = Series(dic)

数学    150
理综    300
英语    150
语文    150
dtype: int64

nd[0] = 1000

array([1000,  150,  150,  300])

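s1 # a Series built from a numpy array refers to the array's memory here (no copy is made), so the change to nd shows up in s1; a list is converted to a new array, and copy=True forces a copy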

语文    1000
数学     150
英语     150
理综     300
dtype: int32

dic["数学"] = 120

{'数学': 120, '理综': 300, '英语': 150, '语文': 150}

s2 # creating a Series from a dict copies the data, so changing dic does not affect s2

数学    150
理综    300
英语    150
语文    150
dtype: int64
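Whether a Series shares memory with the array it was built from depends on the pandas version and its copy-on-write settings. The sketch below therefore relies only on behaviour that should hold either way: passing copy=True gives an independent Series, and dict values are copied. The names scores, s_copy, and s_dict are illustrative.

```python
import numpy as np
import pandas as pd

scores = np.array([150, 150, 150, 300])
labels = ["语文", "数学", "英语", "理综"]

# copy=True asks pandas to copy the array instead of referencing it
s_copy = pd.Series(scores, index=labels, copy=True)

dic = {"语文": 150, "数学": 150}
s_dict = pd.Series(dic)

# Modify the original containers after the Series were created
scores[0] = 1000
dic["数学"] = 120

print(s_copy["语文"])   # unaffected by the change to scores
print(s_dict["数学"])   # unaffected by the change to dic
```

When independence from the source array matters, spelling out copy=True avoids depending on version-specific defaults.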

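2) Indexing and slicing a Series

Use square brackets with a single key to get one element (the result has the element's own type), or square brackets with a list to get several elements (the result is still a Series). Indexing is either explicit or implicit:

(1) Explicit indexing:

- use the elements of index as the key
- use .loc[] (recommended)

Note: label slices are closed intervals (both endpoints are included).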

s

a    1
b    2
c    3
dtype: int64

s.values

array([1, 2, 3], dtype=int64)

s.index # the index holds the explicit (label) index

Index(['a', 'b', 'c'], dtype='object')

# Option 1
s["a"]

1

# Option 2 (recommended)
s.loc["a"]

1

s.loc["a","b"] # this form is not allowed

---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
<ipython-input-22-1c72e1f53463> in <module>()
----> 1 s.loc["a","b"] # this form is not allowed

d:\Anaconda\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1323             except (KeyError, IndexError):
   1324                 pass
-> 1325             return self._getitem_tuple(key)
   1326         else:
   1327             key = com._apply_if_callable(key, self.obj)

d:\Anaconda\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
    839 
    840         # no multi-index, so validate all of the indexers
--> 841         self._has_valid_tuple(tup)
    842 
    843         # ugly hack for GH #836

d:\Anaconda\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
    186         for i, k in enumerate(key):
    187             if i >= self.obj.ndim:
--> 188                 raise IndexingError('Too many indexers')
    189             if not self._has_valid_type(k, i):
    190                 raise ValueError("Location based indexing can only have [%s] "

IndexingError: Too many indexers

s.loc[["a","b","a"]] # a list of labels selects a sub-Series from s (labels may repeat)

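(2) Implicit indexing:

- use integer positions as the key
- use .iloc[] (recommended)

Note: position slices are half-open intervals (the end position is excluded).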

s.iloc[0]

s2.iloc[0]

150

s.iloc[0,1]

---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
<ipython-input-24-61ca01bee2d4> in <module>()
----> 1 s.iloc[0,1]

d:\Anaconda\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1323             except (KeyError, IndexError):
   1324                 pass
-> 1325             return self._getitem_tuple(key)
   1326         else:
   1327             key = com._apply_if_callable(key, self.obj)

d:\Anaconda\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
   1660     def _getitem_tuple(self, tup):
   1661 
-> 1662         self._has_valid_tuple(tup)
   1663         try:
   1664             return self._getitem_lowerdim(tup)

d:\Anaconda\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
    186         for i, k in enumerate(key):
    187             if i >= self.obj.ndim:
--> 188                 raise IndexingError('Too many indexers')
    189             if not self._has_valid_type(k, i):
    190                 raise ValueError("Location based indexing can only have [%s] "

IndexingError: Too many indexers

s.iloc[[0,1]]

a    1
b    2
dtype: int64

(3) Slicing

# Explicit
s.loc["a":"c"]

a    1
b    2
c    3
dtype: int64

# Implicit
s.iloc[0:2]

a    1
b    2
dtype: int64
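The difference between the closed label slice and the half-open position slice is easy to miss, so here is a small self-contained sketch of it (the variable names by_label and by_position are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=list("abc"))

by_label = s.loc["a":"b"]     # closed interval: the end label "b" is included
by_position = s.iloc[0:1]     # half-open interval: position 1 is excluded

print(by_label.tolist())      # [1, 2]
print(by_position.tolist())   # [1]
```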

s = Series([1,2,3,4,5,6],index=["A","A","B","C","B","C"])

A    1
A    2
B    3
C    4
B    5
C    6
dtype: int64

s.loc["A":"C"]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-29-dd4f4933695c> in <module>()
----> 1 s.loc["A":"C"]
      2 # label slicing is not recommended when the explicit index contains duplicates

d:\Anaconda\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1326         else:
   1327             key = com._apply_if_callable(key, self.obj)
-> 1328             return self._getitem_axis(key, axis=0)
   1329 
   1330     def _is_scalar_access(self, key):

d:\Anaconda\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1504         if isinstance(key, slice):
   1505             self._has_valid_type(key, axis)
-> 1506             return self._get_slice_axis(key, axis=axis)
   1507         elif is_bool_indexer(key):
   1508             return self._getbool_axis(key, axis=axis)

d:\Anaconda\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_obj, axis)
   1354         labels = obj._get_axis(axis)
   1355         indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop,
-> 1356                                        slice_obj.step, kind=self.name)
   1357 
   1358         if isinstance(indexer, slice):

d:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in slice_indexer(self, start, end, step, kind)
   3348         """
   3349         start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 3350                                                  kind=kind)
   3351 
   3352         # return a slice

d:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in slice_locs(self, start, end, step, kind)
   3542         end_slice = None
   3543         if end is not None:
-> 3544             end_slice = self.get_slice_bound(end, 'right', kind)
   3545         if end_slice is None:
   3546             end_slice = len(self)

d:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   3496             if isinstance(slc, np.ndarray):
   3497                 raise KeyError("Cannot get %s slice bound for non-unique "
-> 3498                                "label: %r" % (side, original_label))
   3499 
   3500         if isinstance(slc, slice):

KeyError: "Cannot get right slice bound for non-unique label: 'C'"
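One possible way around the error above is to sort the index first: once duplicate labels sit next to each other, pandas can determine the slice bounds. This is a sketch of that workaround, not something the original notebook does; s_sorted and sliced are illustrative names.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6], index=["A", "A", "B", "C", "B", "C"])

# After sorting, equal labels are adjacent and the index is monotonic
s_sorted = s.sort_index()

# Label slicing now works, and both end labels are included
sliced = s_sorted.loc["A":"B"]

print(sliced.index.tolist())
```

Sorting changes the row order, so this only fits cases where the original order carries no meaning; otherwise a position slice with .iloc is the safer choice.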

============================================

Exercise 2:

Index and slice the Series s1 from Exercise 1 in several different ways:

Index: 数学 150

Slice: 语文 150 数学 150 英语 150

============================================

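s1.iloc[[1]] # s1[[1]] gives the same result here, but integer positions inside plain [] are deprecated in recent pandas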

数学    150
dtype: int32

s1.loc["语文":"英语"]

语文    1000
数学     150
英语     150
dtype: int32

s1.iloc[0:3]

语文    1000
数学     150
英语     150
dtype: int32

3) Basic concepts of a Series

A Series can be thought of as a fixed-length, ordered dict.

Its attributes are available through shape, size, index, values, and so on.

s.shape

(6,)

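s.reshape((3,2)) # reshaping a Series is generally avoided because it changes the form of the data; Series.reshape is deprecated, use s.values.reshape((3,2)) instead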

d:\Anaconda\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
  """Entry point for launching an IPython kernel.

array([[1, 2],
       [3, 4],
       [5, 6]], dtype=int64)

s.size

6

s.index

Index(['A', 'A', 'B', 'C', 'B', 'C'], dtype='object')

s.index[[0,1]]

Index(['A', 'A'], dtype='object')

s.index[0:2]

Index(['A', 'A'], dtype='object')

s.index = list("abcdef")

a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64

s.index[0] = "A" # a single index label cannot be modified in place

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-40-7c5a7b9320c3> in <module>()
----> 1 s.index[0] = "A" # a single index label cannot be modified in place

d:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   1668 
   1669     def __setitem__(self, key, value):
-> 1670         raise TypeError("Index does not support mutable operations")
   1671 
   1672     def __getitem__(self, key):

TypeError: Index does not support mutable operations
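Because an Index is immutable, changing one label means producing a new index. Two common ways to do that are sketched below; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=list("abc"))

# rename returns a new Series in which only the mapped label is replaced
renamed = s.rename(index={"a": "A"})

# Alternatively, edit a plain list of labels and assign the whole list back
labels = s.index.tolist()
labels[0] = "A"
s.index = labels

print(renamed.index.tolist())
print(s.index.tolist())
```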

Use head() and tail() to take a quick look at the first or last rows of a Series (or DataFrame).

# Read in the data

data.head(3)

data.tail(10)

A DataFrame is made up of Series: selecting one column returns a Series.

data["age"]

0       29.0000
1        2.0000
2       30.0000
3       25.0000
4        0.9167
5       47.0000
6       63.0000
7       39.0000
8       58.0000
9       71.0000
10      47.0000
11      19.0000
12          NaN
13          NaN
14          NaN
15      50.0000
16      24.0000
17      36.0000
18      37.0000
19      47.0000
20      26.0000
21      25.0000
22      25.0000
23      19.0000
24      28.0000
25      45.0000
26      39.0000
27      30.0000
28      58.0000
29          NaN
         ...   
1283        NaN
1284        NaN
1285        NaN
1286        NaN
1287        NaN
1288        NaN
1289        NaN
1290        NaN
1291        NaN
1292        NaN
1293        NaN
1294        NaN
1295        NaN
1296        NaN
1297        NaN
1298        NaN
1299        NaN
1300        NaN
1301        NaN
1302        NaN
1303        NaN
1304        NaN
1305        NaN
1306        NaN
1307        NaN
1308        NaN
1309        NaN
1310        NaN
1311        NaN
1312        NaN
Name: age, Length: 1313, dtype: float64

When an index label has no corresponding value, the missing entry is shown as NaN (not a number).

s = Series({"a":123,"b":345,"c":685},index=list("abcdef"))

a    123.0
b    345.0
c    685.0
d      NaN
e      NaN
f      NaN
dtype: float64

s.index = list("abcdnm")

a    123.0
b    345.0
c    685.0
d      NaN
n      NaN
m      NaN
dtype: float64

Missing data can be detected with pd.isnull() and pd.notnull(), or with the Series methods isnull() and notnull().

pd.isnull(s)

a    False
b    False
c    False
d     True
n     True
m     True
dtype: bool

s.isnull()

a    False
b    False
c    False
d     True
n     True
m     True
dtype: bool

ind = s.isnull()

a    False
b    False
c    False
d     True
n     True
m     True
dtype: bool

s[ind] # elements whose mask value is True are returned

d   NaN
n   NaN
m   NaN
dtype: float64

s.notnull()

a     True
b     True
c     True
d    False
n    False
m    False
dtype: bool

s[s.notnull()]

a    123.0
b    345.0
c    685.0
dtype: float64

s1 = Series({"a":True,"b":False,"c":True,"d":False,"n":True,"m":True})

a     True
b    False
c     True
d    False
m     True
n     True
dtype: bool

s[s1]

a    123.0
c    685.0
n      NaN
m      NaN
dtype: float64

s

a    123.0
b    345.0
c    685.0
d      NaN
n      NaN
m      NaN
dtype: float64

s[s.isnull()] = 1000

a     123.0
b     345.0
c     685.0
d    1000.0
n    1000.0
m    1000.0
dtype: float64
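Assigning through a boolean mask, as above, modifies the Series in place. pandas also provides fillna() and dropna(), which return new objects and leave the original untouched. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series([123.0, 345.0, np.nan, np.nan], index=list("abcd"))

filled = s.fillna(1000)             # new Series with NaN replaced by 1000
dropped = s.dropna()                # new Series without the NaN entries
n_missing = int(s.isnull().sum())   # each True counts as 1

print(filled.tolist())
print(dropped.index.tolist())
print(n_missing)
```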

Every Series object has a name attribute.

s.name = "python"

a     123.0
b     345.0
c     685.0
d    1000.0
n    1000.0
m    1000.0
Name: python, dtype: float64

[Extension] Filtering with conditions

s = Series({"a":1,"b":2,"c":3})
s > 2 # a comparison returns a boolean Series

a    False
b    False
c     True
dtype: bool

nd = np.random.randint(0,20,size=10)

array([12, 15, 13,  6,  5,  3,  1,  3,  2,  6])

a = nd > 10

array([ True,  True,  True, False, False, False, False, False, False, False], dtype=bool)

nd[a]

array([12, 15, 13])

19. Convert a one-dimensional array into a matrix of its binary representations. For example,

[1,2,3]

becomes

[[0,0,1],
 [0,1,0],
 [0,1,1]]

1 and 0

0

1 or 0

1

1 & 2

0

# 60 = 1*2^5 + 1*2^4 + 1*2^3 + 1*2^2 + 0*2^1 + 0*2^0

0

I = np.array([0,1,2,3,15,17,32,64,128])

array([  0,   1,   2,   3,  15,  17,  32,  64, 128])

A = I.reshape((-1,1))

array([[  0],
       [  1],
       [  2],
       [  3],
       [ 15],
       [ 17],
       [ 32],
       [ 64],
       [128]])

B = 2**np.arange(8)

array([  1,   2,   4,   8,  16,  32,  64, 128], dtype=int32)

M = A & B

array([[  0,   0,   0,   0,   0,   0,   0,   0],
       [  1,   0,   0,   0,   0,   0,   0,   0],
       [  0,   2,   0,   0,   0,   0,   0,   0],
       [  1,   2,   0,   0,   0,   0,   0,   0],
       [  1,   2,   4,   8,   0,   0,   0,   0],
       [  1,   0,   0,   0,  16,   0,   0,   0],
       [  0,   0,   0,   0,   0,  32,   0,   0],
       [  0,   0,   0,   0,   0,   0,  64,   0],
       [  0,   0,   0,   0,   0,   0,   0, 128]], dtype=int32)

Broadcasting pairs every row of A (shape 9x1) with every element of B = [1, 2, 4, 8, 16, 32, 64, 128], so A & B has shape 9x8: entry (i, j) is A[i] & 2**j, which is nonzero exactly when bit j of A[i] is set.

M != 0

array([[False, False, False, False, False, False, False, False],
       [ True, False, False, False, False, False, False, False],
       [False,  True, False, False, False, False, False, False],
       [ True,  True, False, False, False, False, False, False],
       [ True,  True,  True,  True, False, False, False, False],
       [ True, False, False, False,  True, False, False, False],
       [False, False, False, False, False,  True, False, False],
       [False, False, False, False, False, False,  True, False],
       [False, False, False, False, False, False, False,  True]], dtype=bool)

M[M!=0] = 1

array([[0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 1]], dtype=int32)

M[:,::-1]

array([[0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 1, 1, 1, 1],
       [0, 0, 0, 1, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)
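The same result can be reached in one step by shifting instead of masking. The sketch below is an alternative to the steps above, not the notebook's own method; the helper name to_binary_matrix and its width parameter are hypothetical.

```python
import numpy as np

def to_binary_matrix(values, width):
    """Return a (len(values), width) array of 0/1 bits, most significant bit first."""
    column = np.asarray(values).reshape(-1, 1)   # column vector, shape (n, 1)
    shifts = np.arange(width - 1, -1, -1)        # width-1, ..., 1, 0
    # Broadcasting shifts every value by every amount; & 1 keeps the lowest bit
    return (column >> shifts) & 1

print(to_binary_matrix([1, 2, 3], width=3))
```

Shifting right by j and keeping the lowest bit yields 0 or 1 directly, so neither the M[M!=0] = 1 step nor the column reversal is needed.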

4) Series operations

(1) Array operations that work on numpy arrays also work on a Series

s

a    1
b    2
c    3
dtype: int64

s + 3

a    4
b    5
c    6
dtype: int64

s * 3

a    3
b    6
c    9
dtype: int64

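(2) Operations between Series

Data is aligned automatically by index label during the operation.
Where a label exists in only one Series, the result is NaN.

Note: to keep the values for every index label instead of getting NaN, use .add() with the fill_value argument.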

s1 = Series(np.random.randint(0,150,size=4),index=list("ABCD"),name="数学")

A    121
B     97
C     76
D     27
Name: 数学, dtype: int32

s2 = Series(np.random.randint(0,150,size=4),index=list("ABmn"),name="python")

A     64
B     19
m     17
n    137
Name: python, dtype: int32

np.nan + 3

nan

s1 + s2

A    185.0
B    116.0
C      NaN
D      NaN
m      NaN
n      NaN
dtype: float64

A  142        A  132
B    0    +   B  107
C    6        m   50
D   67        n    5
m  nan        C  nan
n  nan        D  nan

s1 * s2

A    7744.0
B    1843.0
C       NaN
D       NaN
m       NaN
n       NaN
dtype: float64

s1 / s2

A    1.890625
B    5.105263
C         NaN
D         NaN
m         NaN
n         NaN
dtype: float64

s1.add(s2)

A    185.0
B    116.0
C      NaN
D      NaN
m      NaN
n      NaN
dtype: float64

s1.subtract(s2)

A    57.0
B    78.0
C     NaN
D     NaN
m     NaN
n     NaN
dtype: float64

s1.pow(2)

A    14641
B     9409
C     5776
D      729
Name: 数学, dtype: int32
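Note that s1.add(s2) without further arguments gives the same NaN entries as s1 + s2. The values are kept only when fill_value is passed. The sketch below reuses the numbers printed above as fixed data so that the result is reproducible:

```python
import pandas as pd

s1 = pd.Series([121, 97, 76, 27], index=list("ABCD"), name="数学")
s2 = pd.Series([64, 19, 17, 137], index=list("ABmn"), name="python")

plain = s1 + s2                      # NaN wherever a label is on one side only
filled = s1.add(s2, fill_value=0)    # the missing side is treated as 0

print(plain)
print(filled)
```

With fill_value=0, the labels C and D keep their s1 values and m and n keep their s2 values.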

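============================================

Exercise 3:

How do the rules for Series operations differ from those for ndarray operations?
Create another Series s2 whose index contains "文综", and perform several arithmetic operations between it and s1. Think about how to keep all of the data.

============================================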

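2. DataFrame

A DataFrame is a [tabular] data structure that can be viewed as a [dict of Series] sharing one index. It is made up of several columns of data in a fixed order, and it was designed to extend the Series from one dimension to several. A DataFrame has both a row index and a column index.

Row index: index
Column index: columns
Values: values (a two-dimensional numpy array)

1) Creating a DataFrame

The most common way is to pass a dict. The DataFrame uses the dict keys as the [column] names and each dict value (an array) as the data of that column.

The DataFrame also adds an index for the rows automatically (as a Series does).

As with a Series, if a requested column does not match any dict key, its values are NaN.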

from pandas import DataFrame

dic = {'外语': [120, 130, 123, 125, 112, 143],
       '姓名': ['狗蛋', '刘德华', '鹿晗', 'TF', '张柏芝', '谢霆锋'],
       '数学': [130, 110, 100, 100, 100, 120],
       '综合': [230, 240, 254, 234, 100, 100],
       '语文': [120, 130, 140, 110, 110, 100]}

df = DataFrame(dic,index=list("abcdef"))
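The effect of the columns argument can be seen with a smaller, made-up dict. In this sketch "java" is deliberately not a key of the dict, and data and frame are illustrative names:

```python
import pandas as pd

data = {"姓名": ["狗蛋", "鹿晗"], "数学": [130, 100]}

# "姓名" and "数学" match dict keys; "java" does not, so that column is all NaN
frame = pd.DataFrame(data, columns=["姓名", "数学", "java"], index=list("ab"))

print(frame)
```

The columns argument also fixes the column order, independently of the order of the dict keys.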


DataFrame attributes: values, columns, index, shape

df.values # a two-dimensional array

array([[120, '狗蛋', 130, 230, 120],
       [130, '刘德华', 110, 240, 130],
       [123, '鹿晗', 100, 254, 140],
       [125, 'TF', 100, 234, 110],
       [112, '张柏芝', 100, 100, 110],
       [143, '谢霆锋', 120, 100, 100]], dtype=object)

df.index # an Index object

Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

df.index[0] = "q" # a single index label cannot be assigned

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-90-2b136e7ea00f> in <module>()
----> 1 df.index[0] = "q" # a single index label cannot be assigned

d:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   1668 
   1669     def __setitem__(self, key, value):
-> 1670         raise TypeError("Index does not support mutable operations")
   1671 
   1672     def __getitem__(self, key):

TypeError: Index does not support mutable operations

df.index = list("ABCDEF") # the index can only be replaced as a whole

df

df.columns  # an Index object

Index(['外语', '姓名', '数学', '综合', '语文'], dtype='object')

df.columns[0] = "python" # a single column label cannot be assigned

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-94-82c06599cb92> in <module>()
----> 1 df.columns[0] = "python" # a single column label cannot be assigned

d:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   1668 
   1669     def __setitem__(self, key, value):
-> 1670         raise TypeError("Index does not support mutable operations")
   1671 
   1672     def __getitem__(self, key):

TypeError: Index does not support mutable operations

df.columns = ["python","java","h5","ruby","javascript"]

df

df1 = DataFrame(dic,columns=["姓名","java","h5","ruby","javascript"])

df2 = DataFrame(dic,columns=["姓名","语文","数学","外语","综合"])

nd = np.random.randint(0,150,size=(4,4))

df3 = DataFrame(nd,index=list("abcd"),columns=list("ABCD"))

df2["数学"][0] = 1000

d:\Anaconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

df2

dic

{'外语': [120, 130, 123, 125, 112, 143],
 '姓名': ['狗蛋', '刘德华', '鹿晗', 'TF', '张柏芝', '谢霆锋'],
 '数学': [130, 110, 100, 100, 100, 120],
 '综合': [230, 240, 254, 234, 100, 100],
 '语文': [120, 130, 140, 110, 110, 100]}

df3["A"]["a"] = 2000

nd

array([[2000,   37,   48,  114],
       [ 122,   31,    4,   35],
       [ 100,   40,   73,    0],
       [ 138,  111,  101,  110]])

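[Note] A DataFrame created from a dict holds a copy of the data (dic is unchanged above); one created from an array refers to the original array in this pandas version, so the change to df3 also shows up in nd.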
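The SettingWithCopyWarning shown earlier comes from chained indexing (df2["数学"][0] = ...). A single .loc call that names both the row and the column avoids it. This is a minimal sketch with made-up data; frame is an illustrative name.

```python
import pandas as pd

dic = {"姓名": ["狗蛋", "鹿晗"], "数学": [130, 100]}
frame = pd.DataFrame(dic)

# One .loc call with (row label, column label) writes into frame itself
frame.loc[0, "数学"] = 1000

print(frame)
print(dic["数学"])   # the source dict is not modified
```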

============================================

Exercise 4:

Create a DataFrame named df from the following exam score table:

      张三  李四
语文  150   0
数学  150   0
英语  150   0
理综  300   0

============================================

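2) Indexing a DataFrame

(1) Indexing columns

- dict-style access
- attribute-style access

A column of a DataFrame can be retrieved as a Series. The returned Series has the same index as the original DataFrame, and its name attribute is already set to the column name.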

df

# Dict-style
df["java"]

A     狗蛋
B    刘德华
C     鹿晗
D     TF
E    张柏芝
F    谢霆锋
Name: java, dtype: object

# Attribute-style
df2.数学

0    1000
1     110
2     100
3     100
4     100
5     120
Name: 数学, dtype: int64

# Indexing several columns

df[["python"]]

df["python":"java"] # a slice inside [] selects rows, not columns, so this does not slice the columns

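(2) Indexing rows

- use .ix[] for row indexing (deprecated, and removed in pandas 1.0; prefer .loc[] or .iloc[])
- use .loc[] with index labels for row indexing
- use .iloc[] with integer positions for row indexing

This also returns a Series, whose index is the original columns.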

df

df.loc["A"]

python        120
java           狗蛋
h5            130
ruby          230
javascript    120
Name: A, dtype: object

df.iloc[1]

python        130
java          刘德华
h5            110
ruby          240
javascript    130
Name: B, dtype: object

# Slicing

df.loc[["A","B"]]

df.iloc[0:3]

df.loc["A","python"]

120

df.iloc[1,1] # implicit indexing works like indexing a 2-D array

'刘德华'

df["python","A"]  # column first, then row does not work

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
d:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2441             try:
-> 2442                 return self._engine.get_loc(key)
   2443             except KeyError:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ('python', 'A')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-120-d9617f00040f> in <module>()
----> 1 df["python","A"]  # column first, then row does not work

d:\Anaconda\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965 
   1966     def _getitem_column(self, key):

d:\Anaconda\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972 
   1973         # duplicate columns & possible reduce dimensionality

d:\Anaconda\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
   1643         res = cache.get(item)
   1644         if res is None:
-> 1645             values = self._data.get(item)
   1646             res = self._box_item_values(item, values)
   1647             cache[item] = res

d:\Anaconda\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
   3588 
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

d:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445 
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ('python', 'A')

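(3) Ways to index a single element

- index by column
- index by row (iloc[3,1] passes two arguments; in iloc[[3,3]] the inner [3,3] is a single argument)
- use the values attribute (a two-dimensional numpy array)

# Indexing twice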

df.loc["A"]["java"]

df.loc["A":"C"]["java"]

A     狗蛋
B    刘德华
C     鹿晗
Name: java, dtype: object

df["python":"java"]["A"] # the slice applies to rows and returns an empty frame, which has no column "A"

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
d:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2441             try:
-> 2442                 return self._engine.get_loc(key)
   2443             except KeyError:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'A'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-122-f6746d39a074> in <module>()
----> 1 df["python":"java"]["A"] # the slice applies to rows and returns an empty frame, which has no column "A"

d:\Anaconda\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965 
   1966     def _getitem_column(self, key):

d:\Anaconda\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972 
   1973         # duplicate columns & possible reduce dimensionality

d:\Anaconda\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
   1643         res = cache.get(item)
   1644         if res is None:
-> 1645             values = self._data.get(item)
   1646             res = self._box_item_values(item, values)
   1647             cache[item] = res

d:\Anaconda\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
   3588 
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

d:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445 
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'A'

df.iloc[0:3,4] # with iloc the second indexer must be positional as well

A    120
B    130
C    140
Name: javascript, dtype: int64

df.iloc[0:3]["h5"]

A    130
B    110
C    100
Name: h5, dtype: int64

[Note] When using plain square brackets:

a single key selects a column
a slice selects rows
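The bracket rule and its .loc/.iloc counterparts can be checked on a small frame. The data below is made up and the variable names are illustrative:

```python
import pandas as pd

frame = pd.DataFrame(
    {"python": [120, 130, 123], "java": [1, 2, 3], "h5": [130, 110, 100]},
    index=list("ABC"),
)

col = frame["python"]                  # single key -> one column, as a Series
rows = frame["A":"B"]                  # slice -> rows; a label slice is closed
cols = frame.loc[:, "python":"java"]   # slicing columns goes through .loc
cell = frame.loc["A", "h5"]            # row label first, then column label
same_cell = frame.iloc[0, 2]           # positions: row 0, column 2

print(rows.index.tolist())
print(cols.columns.tolist())
print(cell, same_cell)
```

Column slicing, which plain brackets cannot express, is written as .loc[:, start:end] with the row part left as a full slice.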


============================================

Exercise 5:

Index and slice ddd in several different ways and compare the differences.

============================================

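3) DataFrame operations

(1) Operations between DataFrames

As with Series:

Data is aligned automatically by index label during the operation.
Where labels do not match, the result is NaN.

Create DataFrame df1: the scores of several people in each subject, monthly exam 1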

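df1 = DataFrame(np.random.randint(0,150,size=(4,4)),
                index=list("abcd"),
                columns=["python","数理统计","微积分","线性代数"]) # the arguments after the first were cut off; these labels are inferred from the result tables, and the column order is an assumption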

Create DataFrame df2: the scores of several people in each subject, monthly exam 2
New students have transferred in

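df2 = DataFrame(np.random.randint(0,150,size=(6,4)),
                index=list("abcdef"),
                columns=["python","java","ruby","数理统计"]) # index and columns taken from the df2 outputs shown in this section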

df1 + df2

df1.add(df2,fill_value=0)

df2.add(df1,fill_value=0) # fill_value stands in for a value that is missing on one side before the operation

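Alignment is by label, not by position: a frame with columns java, 数学, python and a frame with columns java, python, 数学 are matched column by column by name, and rows a to d by index label. A cell that exists in only one of the two frames gives NaN unless fill_value is set.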

The table below maps Python operators to the corresponding pandas methods:

Python Operator	Pandas Method(s)
`+`	`add()`
`-`	`sub()`, `subtract()`
`*`	`mul()`, `multiply()`
`/`	`truediv()`, `div()`, `divide()`
`//`	`floordiv()`
`%`	`mod()`
`**`	`pow()`
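As a quick check of the table, the sketch below compares the + operator with add() on two small made-up frames, and shows what fill_value changes. The variable names are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"python": [1, 2], "java": [3, 4]}, index=list("ab"))
right = pd.DataFrame({"python": [10, 20, 30], "ruby": [5, 6, 7]}, index=list("abc"))

by_operator = left + right
by_method = left.add(right)              # same result as the operator
filled = left.add(right, fill_value=0)   # a cell missing on one side counts as 0

print(by_operator)
print(filled)
```

A cell that is missing in both frames, such as row "c" of column "java" here, stays NaN even with fill_value, because there is no value on either side to start from.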


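(2) Operations between a Series and a DataFrame

[Important]

With Python operators: the Series index is matched against the DataFrame columns, so the operation works row by row (the Series must be a row) and applies to every row. (This is like combining a 2-D numpy array with a 1-D array, except that NaN can appear where labels do not match.)

With pandas methods:

  axis=0: the Series index is matched against the row index (the Series must be a column), and the operation applies to every column.
  axis=1: the Series index is matched against the columns (the Series must be a row), and the operation applies to every row.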

df2

s_row = df2.loc["c"]

python    123
java       83
ruby        0
数理统计       54
Name: c, dtype: int32

s_col = df2["python"]

a     47
b     91
c    123
d     50
e     84
f    103
Name: python, dtype: int32

df2

df2.add(s_row,axis=1) # align on the column labels (adds s_row to every row)

df2.add(s_col,axis=0) # align on the row labels (adds s_col to every column)
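Because df2 is random, the effect of axis is easier to verify on fixed numbers. The following sketch uses sub() on a made-up frame; the names by_row, by_col, and via_operator are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"python": [1, 2, 3], "java": [10, 20, 30]}, index=list("abc"))

s_row = frame.loc["a"]      # index: python, java
s_col = frame["python"]     # index: a, b, c

by_row = frame.sub(s_row, axis=1)   # subtract row "a" from every row
by_col = frame.sub(s_col, axis=0)   # subtract column "python" from every column
via_operator = frame - s_row        # the operator behaves like axis=1

print(by_row)
print(by_col)
```

Passing a column Series to the plain operator would try to match a, b, c against the column names and produce NaN, which is why axis=0 is needed for column-wise operations.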

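============================================

Exercise 6:

Suppose ddd holds the midterm scores and ddd2 the final scores. Create ddd2 freely, add it to ddd, and compute the average of the midterm and the final.
Suppose 张三 was caught cheating on the midterm math exam and the score must be recorded as 0. How would you do that?
李四 is rewarded for reporting 张三 and gets 100 extra points in every midterm subject. How would you do that?
Later the teacher finds that one question was wrong and, to placate the students, adds 10 points to every subject for every student. How would you do that?

============================================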

dic = {

dic = {

d = ddd[["语文","数学","外语","综合"]] + ddd2[["语文","数学","外语","综合"]]

d = d/2

d["姓名"] = ddd["姓名"] # add the 姓名 (name) column

d

ddd.loc[4,"数学"] = 0 # chained (two-step) indexing must not be used for assignment

ddd

ddd.loc[5,["语文","数学","外语","综合"]] += 100

ddd

ddd[["语文","数学","外语","综合"]] += 10

ddd

ddd.iloc[:,1:5] +=10 # the array-style (positional) way

ddd
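The solution above addresses students by row position (4 and 5). A variant that selects them by name is sketched below. The two frames here are small hypothetical stand-ins with made-up scores, since the full ddd and ddd2 definitions are not shown.

```python
import pandas as pd

subjects = ["语文", "数学"]

# Hypothetical stand-ins for ddd and ddd2
ddd = pd.DataFrame({"姓名": ["张三", "李四"], "语文": [110, 100], "数学": [100, 120]})
ddd2 = pd.DataFrame({"姓名": ["张三", "李四"], "语文": [90, 80], "数学": [60, 100]})

# Average of the two exams, computed on the numeric columns only
avg = (ddd[subjects] + ddd2[subjects]) / 2

# Boolean row masks select students by name instead of by position
ddd.loc[ddd["姓名"] == "张三", "数学"] = 0
ddd.loc[ddd["姓名"] == "李四", subjects] += 100

# Add 10 points to every subject for every student
ddd[subjects] += 10

print(avg)
print(ddd)
```

Selecting by name keeps working if the rows are reordered, which positional labels such as 4 and 5 do not guarantee.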

Output of data.head(3):

	row.names	pclass	survived	name	age	embarked	home.dest	room	ticket	boat	sex
0	1	1st	1	Allen, Miss Elisabeth Walton	29.0	Southampton	St Louis, MO	B-5	24160 L221	2	female
1	2	1st	0	Allison, Miss Helen Loraine	2.0	Southampton	Montreal, PQ / Chesterville, ON	C26	NaN	NaN	female
2	3	1st	0	Allison, Mr Hudson Joshua Creighton	30.0	Southampton	Montreal, PQ / Chesterville, ON	C26	NaN	(135)	male

Output of df1.add(df2,fill_value=0); df2.add(df1,fill_value=0) gives the same table:

	java	python	ruby	微积分	数理统计	线性代数
a	51.0	148.0	134.0	31.0	203.0	33.0
b	27.0	189.0	133.0	139.0	183.0	69.0
c	83.0	135.0	0.0	40.0	190.0	59.0
d	10.0	58.0	31.0	92.0	234.0	120.0
e	117.0	84.0	17.0	NaN	132.0	NaN
f	87.0	103.0	62.0	NaN	23.0	NaN

Output of df2.add(s_col,axis=0):

	python	java	ruby	数理统计
a	94	98	181	163
b	182	118	224	170
c	246	206	123	177
d	100	60	81	193
e	168	201	101	216
f	206	190	165	126

Output of ddd[["语文","数学","外语","综合"]] + ddd2[["语文","数学","外语","综合"]] (the sum before dividing by 2):

	语文	数学	外语	综合
0	241	260	240	460
1	261	220	160	280
2	280	200	246	508
3	220	220	250	268
4	220	200	244	200
5	200	240	286	160

df3 before the assignment df3["A"]["a"] = 2000:

	A	B	C	D
a	84	37	48	114
b	122	31	4	35
c	100	40	73	0
d	138	111	101	110

df3 after the assignment:

	A	B	C	D
a	2000	37	48	114
b	122	31	4	35
c	100	40	73	0
d	138	111	101	110

Output of df2.add(s_row,axis=1):

	python	java	ruby	数理统计
a	170	134	134	170
b	214	110	133	133
c	246	166	0	108
d	173	93	31	197
e	207	200	17	186
f	226	170	62	77

Output of d (the averages, with the 姓名 column added):

	语文	数学	外语	综合	姓名
0	120.5	130.0	120.0	230.0	狗蛋
1	130.5	110.0	80.0	140.0	刘德华
2	140.0	100.0	123.0	254.0	鹿晗
3	110.0	110.0	125.0	134.0	TF
4	110.0	100.0	122.0	100.0	张三
5	100.0	120.0	143.0	80.0	李四