This post gives a basic introduction to the pandas DataFrame, covering its definition, creation, indexes, and getting/setting values.
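Definition
A DataFrame is a tabular data structure with both a row index and a column index. It resembles a two-dimensional NumPy array, although each column may hold a different type, and it can be viewed as a collection of n Series that share the same row index.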
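The "collection of Series" view can be sketched as follows. This is a minimal illustration that assumes pandas is installed; the names `s_year`, `s_pop`, `df`, and `col` and the sample values are made up for the example.

```python
import pandas as pd

# Two Series that share the same row labels (illustrative data).
s_year = pd.Series([2000, 2001, 2002], index=['a', 'b', 'c'])
s_pop = pd.Series([1.5, 1.7, 1.2], index=['a', 'b', 'c'])

# Combining them yields a DataFrame with one column per Series.
df = pd.DataFrame({'year': s_year, 'pop': s_pop})

# Pulling a column back out gives a Series again, carrying the row index.
col = df['year']
print(type(col).__name__)
print(df.shape)
```

Each column keeps its own dtype here (`year` is integer, `pop` is float), which is the main practical difference from a plain 2D NumPy array.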
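Creation
A DataFrame can be created in many ways, for example from a 2D array, a dict of equal-length lists, a dict of dicts, or a dict of Series. At creation time you can pass index to set the row index and columns to set the column index. If neither is given, pandas creates a default integer index of the form 0, 1, 2, 3.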
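Two of the forms listed above, the dict of dicts and the dict of Series, could look roughly like this. The sketch assumes pandas is installed; the variable names and the numbers are illustrative only.

```python
import pandas as pd

# Dict of dicts: outer keys become columns, inner keys become row labels.
nested = {
    'Nevada': {2001: 2.4, 2002: 2.9},
    'ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6},
}
frame_nested = pd.DataFrame(nested)

# Dict of Series: the Series are aligned on their indexes.
series_dict = {
    'x': pd.Series([1, 2], index=['a', 'b']),
    'y': pd.Series([10, 20, 30], index=['a', 'b', 'c']),
}
frame_series = pd.DataFrame(series_dict)

print(frame_nested)
print(frame_series)
```

In both cases a row label that is missing from one of the inputs ends up as NaN in that column rather than causing an error.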
Creating from a 2D array (no row or column index specified)
import numpy as np
from pandas import DataFrame
DataFrame(np.zeros([3,3]))
Out[124]:
     0    1    2
0  0.0  0.0  0.0
1  0.0  0.0  0.0
2  0.0  0.0  0.0
Creating from a 2D array (row and column indexes specified)
DataFrame(np.zeros([3,3]),columns=[1,2,3],index=['a','b','c'])
Out[128]:
     1    2    3
a  0.0  0.0  0.0
b  0.0  0.0  0.0
c  0.0  0.0  0.0
Creating from a dict of equal-length lists
data={'state':['ohio','Nevada','ohio','Nevada','Nevada'],'year':[2000,2001,2002,2001,2002],'pop':[1.5,1.7,1.2,1.8,1.4]}
frame=DataFrame(data)
frame
Out[127]:
    state  year  pop
0    ohio  2000  1.5
1  Nevada  2001  1.7
2    ohio  2002  1.2
3  Nevada  2001  1.8
4  Nevada  2002  1.4
Creating from a dict of equal-length lists (row and column indexes specified)
frame1=DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five'])
frame1
Out[157]:
       year   state  pop debt
one    2000    ohio  1.5  NaN
two    2001  Nevada  1.7  NaN
three  2002    ohio  1.2  NaN
four   2001  Nevada  1.8  NaN
five   2002  Nevada  1.4  NaN
frame1=DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five','six'])
Traceback (most recent call last):
File "<ipython-input-158-a5d924c7dc7b>", line 1, in <module>
frame1=DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five','six'])
File "f:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 348, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "f:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 459, in _init_dict
return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "f:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 7323, in _arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "f:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py", line 4877, in create_block_manager_from_arrays
construction_error(len(arrays), arrays[0].shape, axes, e)
File "f:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py", line 4843, in construction_error
passed, implied))
ValueError: Shape of passed values is (4, 5), indices imply (4, 6)
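As shown, columns contains an extra name, debt, that has no matching key in data, so that column is filled with NaN. An extra label in the row index is handled differently: it is not filled with NaN. Because the index length (6) no longer matches the length of the lists in data (5), pandas raises a ValueError reporting the shape mismatch.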
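The two behaviours can be reproduced on a smaller frame. This is a hedged sketch assuming pandas is installed; the two-row `data` dict and the names `ok` and `raised` are illustrative, and the exact wording of the error message depends on the pandas version.

```python
import pandas as pd

data = {'state': ['ohio', 'Nevada'], 'year': [2000, 2001]}

# An extra column name with no matching key in data is filled with NaN.
ok = pd.DataFrame(data, columns=['year', 'state', 'debt'], index=['one', 'two'])
print(ok['debt'].isna().all())

# An index with more labels than there are rows raises ValueError.
try:
    pd.DataFrame(data, index=['one', 'two', 'three'])
    raised = False
except ValueError as exc:
    raised = True
    print('ValueError:', exc)
```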
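Indexes
A DataFrame exposes three related attributes: the row index (index), the column index (columns), and the underlying values (values). The row and column indexes are pandas Index objects, which behave much like lists but are immutable, and values returns a NumPy array, so the usual Index and array operations apply to the results.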
frame1
Out[159]:
       year   state  pop debt
one    2000    ohio  1.5  NaN
two    2001  Nevada  1.7  NaN
three  2002    ohio  1.2  NaN
four   2001  Nevada  1.8  NaN
five   2002  Nevada  1.4  NaN
frame1.columns
Out[160]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
frame1.index
Out[161]: Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
frame1.values
Out[162]:
array([[2000, 'ohio', 1.5, nan],
       [2001, 'Nevada', 1.7, nan],
       [2002, 'ohio', 1.2, nan],
       [2001, 'Nevada', 1.8, nan],
       [2002, 'Nevada', 1.4, nan]], dtype=object)
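To make the types of these three attributes concrete, here is a small sketch. It assumes pandas is installed; the frame `df` and the array `arr` are illustrative stand-ins for frame1.

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001], 'state': ['ohio', 'Nevada']},
                  index=['one', 'two'])

# columns and index are pandas Index objects, not plain lists.
print(type(df.columns).__name__)
# They support positional access, membership tests, and list conversion.
print(df.columns[0], 'state' in df.columns, df.index.tolist())

# Index objects are immutable: item assignment raises TypeError.
try:
    df.index[0] = 'zero'
except TypeError as exc:
    print('TypeError:', exc)

# to_numpy() (or the older .values attribute) returns a NumPy array;
# columns of mixed types produce dtype object.
arr = df.to_numpy()
print(arr.shape, arr.dtype)
```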
Getting and setting values
Values in a DataFrame can be read and assigned by specifying a column label or a row label.
frame1.year
Out[180]:
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
frame1['year']
Out[181]:
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
frame1.loc['one']  # .ix was removed in pandas 1.0; .loc selects a row by label
Out[182]:
year     2000
state    ohio
pop       1.5
debt      NaN
Name: one, dtype: object
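In current pandas, row selection by label is written with `.loc` and selection by integer position with `.iloc` (the older `.ix` accessor was removed in pandas 1.0). A minimal sketch, assuming pandas is installed and using an illustrative frame `df`:

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2002], 'pop': [1.5, 1.7, 1.2]},
                  index=['one', 'two', 'three'])

row_by_label = df.loc['one']        # row selected by its label
row_by_position = df.iloc[0]        # same row selected by integer position
cell = df.loc['two', 'year']        # single value: row label, column label
subset = df.loc[['one', 'three'], ['pop']]  # several rows, one column

print(row_by_label.equals(row_by_position))
print(cell)
print(subset)
```

Passing a list of labels, as in the last selection, returns a DataFrame even when only one column is requested.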
Assignment
frame1['debt']=np.arange(5)
frame1
Out[185]:
       year   state  pop  debt
one    2000    ohio  1.5     0
two    2001  Nevada  1.7     1
three  2002    ohio  1.2     2
four   2001  Nevada  1.8     3
five   2002  Nevada  1.4     4
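Besides an array of matching length, a column can be assigned a scalar or a Series. The sketch below assumes pandas is installed and uses an illustrative three-row frame `df`; it shows how each kind of value is handled.

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2002]}, index=['one', 'two', 'three'])

# A scalar is broadcast to every row.
df['debt'] = 0.0

# A Series is aligned on the row index; labels it lacks become NaN.
df['debt'] = pd.Series([-1.2, -1.5], index=['two', 'three'])

# A list or array must match the number of rows exactly.
try:
    df['debt'] = [1, 2]
except ValueError as exc:
    print('ValueError:', exc)

print(df)
```

Because the last assignment fails, the debt column still holds the values from the aligned Series.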
Naming the row and column indexes
frame1.index.name='number'
frame1.columns.name='state'
frame1
Out[188]:
state   year   state  pop  debt
number
one     2000    ohio  1.5     0
two     2001  Nevada  1.7     1
three   2002    ohio  1.2     2
four    2001  Nevada  1.8     3
five    2002  Nevada  1.4     4
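Setting `.name` on the index objects modifies the frame in place. As an alternative, `rename_axis` returns a new frame with the axis names set. A brief sketch, assuming pandas is installed; the frame `df` and the axis name `field` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]}, index=['one', 'two'])

# rename_axis returns a new DataFrame with the axis names set,
# leaving the original untouched.
named = df.rename_axis(index='number', columns='field')

print(named.index.name, named.columns.name)
print(df.index.name)
```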