Pandas数据结构之DataFrame（数据帧）

DataFrame是pandas的二维数据结构，可以允许不同类型的列。可以把它想象成Excel的sheet或者SQL的表。DataFrame可以接受不同的数据类型：

一维ndarray的字典，列表，字典等
二维ndarray
结构化数组
序列
别的数据帧
在创建数据帧的时候，你还可以将数据帧的索引（index）和列标签（column labels）作为参数传入。

DataFrame的创建

从序列或者字典的字典创建

索引将是各个序列的索引的并集。如果有嵌套的dicts，他们将首先被转换为序列。如果没有传递列，则列将是dict键的有序列表。

In [34]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
   ....:      'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
   ....: 

In [35]: df = pd.DataFrame(d)

In [36]: df
Out[36]: 
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

In [37]: pd.DataFrame(d, index=['d', 'b', 'a'])
Out[37]: 
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

In [38]: pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[38]: 
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

可以通过访问index和columns属性分别访问行标签和列标签：

注意：当一组特定的列标签与数据的dict一起传递时，传递的列标签将覆盖dict中的键。

In [39]: df.index
Out[39]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [40]: df.columns
Out[40]: Index(['one', 'two'], dtype='object')

从ndarray/列表的字典创建

字典中的各个数组必须是等长的。如果行标签没有传入，那么会自定义行标签为range(n)，n是数组的长度。

In [41]: d = {'one' : [1., 2., 3., 4.],
   ....:      'two' : [4., 3., 2., 1.]}
   ....: 

In [42]: pd.DataFrame(d)
Out[42]: 
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

In [43]: pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
Out[43]: 
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0

从结构化数组创建

这种情况的处理方式与数组的字典相同。

In [44]: data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])

In [45]: data[:] = [(1,2.,'Hello'), (2,3.,"World")]

In [46]: pd.DataFrame(data)
Out[46]: 
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'

In [47]: pd.DataFrame(data, index=['first', 'second'])
Out[47]: 
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'

In [48]: pd.DataFrame(data, columns=['C', 'A', 'B'])
Out[48]: 
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0

从字典的列表创建

列表中的每个字典的每个键代表一列。

In [49]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [50]: pd.DataFrame(data2)
Out[50]: 
   a   b     c
0  1   2   NaN
1  5  10  20.0

In [51]: pd.DataFrame(data2, index=['first', 'second'])
Out[51]: 
        a   b     c
first   1   2   NaN
second  5  10  20.0

In [52]: pd.DataFrame(data2, columns=['a', 'b'])
Out[52]: 
   a   b
0  1   2
1  5  10

从元组的字典创建

In [53]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
   ....:               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
   ....:               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
   ....:               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
   ....:               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
   ....: 
Out[53]: 
       a              b      
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0

等效构造函数

DataFrame.from_dict
DataFrame.from_records

列的选择、添加和删除

数据帧列的选择和字典是类似的。

In [58]: df['one']
Out[58]: 
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [59]: df['three'] = df['one'] * df['two']

In [60]: df['flag'] = df['one'] > 2

In [61]: df
Out[61]: 
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False

我们也可以像删除字典的键一样删除或弹出数据帧的列。

In [62]: del df['two']

In [63]: three = df.pop('three')

In [64]: df
Out[64]: 
   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False

当我们插入单一值的时候，它会用这个值自动填满这一列的所有行。

In [65]: df['foo'] = 'bar'

In [66]: df
Out[66]: 
   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar

插入与DataFrame具有不同索引的Series时，它将符合DataFrame的索引：

In [67]: df['one_trunc'] = df['one'][:2]

In [68]: df
Out[68]: 
   one   flag  foo  one_trunc
a  1.0  False  bar        1.0
b  2.0  False  bar        2.0
c  3.0   True  bar        NaN
d  NaN  False  bar        NaN

也可以使用insert函数插入。三个参数分别表示列的位置，列的名称，数据。

In [69]: df.insert(1, 'bar', df['one'])

In [70]: df
Out[70]: 
   one  bar   flag  foo  one_trunc
a  1.0  1.0  False  bar        1.0
b  2.0  2.0  False  bar        2.0
c  3.0  3.0   True  bar        NaN
d  NaN  NaN  False  bar        NaN

在方法链（Method Chains）中分配新列

In [71]: iris = pd.read_csv('data/iris.data')

In [72]: iris.head()
Out[72]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

In [73]: (iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength'])
   ....:      .head())
   ....: 
Out[73]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa       0.6863
1          4.9         3.0          1.4         0.2  Iris-setosa       0.6122
2          4.7         3.2          1.3         0.2  Iris-setosa       0.6809
3          4.6         3.1          1.5         0.2  Iris-setosa       0.6739
4          5.0         3.6          1.4         0.2  Iris-setosa       0.7200

In [74]: iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] /
   ....:                                      x['SepalLength'])).head()
   ....: 
Out[74]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa       0.6863
1          4.9         3.0          1.4         0.2  Iris-setosa       0.6122
2          4.7         3.2          1.3         0.2  Iris-setosa       0.6809
3          4.6         3.1          1.5         0.2  Iris-setosa       0.6739
4          5.0         3.6          1.4         0.2  Iris-setosa       0.7200

注意：assign总是返回数据的副本，原始数据保持不变。
assign用于方法链中是非常方便的。

In [75]: (iris.query('SepalLength > 5')
   ....:      .assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength,
   ....:              PetalRatio = lambda x: x.PetalWidth / x.PetalLength)
   ....:      .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
   ....: 
Out[75]: <matplotlib.axes._subplots.AxesSubplot at 0x7f210fb001d0>

新特性：如果有多个assign，后一个assign会基于前一个assign计算的结果再继续计算。

In [76]: dfa = pd.DataFrame({"A": [1, 2, 3],
   ....:                     "B": [4, 5, 6]})
   ....: 

In [77]: dfa.assign(C=lambda x: x['A'] + x['B'],
   ....:            D=lambda x: x['A'] + x['C'])
   ....: 
Out[77]: 
   A  B  C   D
0  1  4  5   6
1  2  5  7   9
2  3  6  9  12

行选择

基础行选择：

操作	语法	返回结果
选择列	df[col]	Series
用行标签选择行	df.loc[index]	Series
用行的位置选择行	df.iloc[loc]	Series
用行切片选择行	df[5:10]	DataFrame
用布尔数组选择行	df[bool_vec]	DataFrame

In [80]: df.loc['b']
Out[80]: 
one              2
bar              2
flag         False
foo            bar
one_trunc        2
Name: b, dtype: object

In [81]: df.iloc[2]
Out[81]: 
one             3
bar             3
flag         True
foo           bar
one_trunc     NaN
Name: c, dtype: object

DataFrame和Numpy函数

Numpy的基本函数都是可以应用在DataFrame上面的。

In [101]: np.exp(df)
Out[101]: 
                 A       B       C
2000-01-01  0.2932  2.1593  0.2777
2000-01-02  0.4830  0.8858  0.9068
2000-01-03  2.0053  1.4074  2.6110
2000-01-04  0.3294  0.5380  1.1615
2000-01-05  0.4808  1.9892  1.1930
2000-01-06  1.4968  0.8565  1.3521
2000-01-07  0.1131  0.2541  0.3851
2000-01-08  4.3176  0.1750  0.4375

In [102]: np.asarray(df)
Out[102]: 
array([[-1.2268,  0.7698, -1.2812],
       [-0.7277, -0.1213, -0.0979],
       [ 0.6958,  0.3417,  0.9597],
       [-1.1103, -0.62  ,  0.1497],
       [-0.7323,  0.6877,  0.1764],
       [ 0.4033, -0.155 ,  0.3016],
       [-2.1799, -1.3698, -0.9542],
       [ 1.4627, -1.7432, -0.8266]])
In [103]: df.T.dot(df)
Out[103]: 
         A       B       C
A  11.3419 -0.0598  3.0080
B  -0.0598  6.5206  2.0833
C   3.0080  2.0833  4.3105

控制器显示控制

 info()
 to_string()
 pd.set_option('display.width', 40) # default is 80
 pd.set_option('display.max_colwidth',30)

以上内容参考https://www.pypandas.cn/document/dsintro/dataframe.html