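All posts in this Python series are notes from the book "Python for Data Analysis".
1 Introduction
Pandas is a data analysis package for Python.
Data structures in Pandas
Series: a one-dimensional array, similar to a one-dimensional NumPy array. Both are also close to Python's basic list type. The difference is that the elements of a list can have different data types,
while an array or a Series normally stores elements of a single data type, which makes memory use more efficient and speeds up computation.
Time Series: a Series indexed by time.
DataFrame: a two-dimensional tabular data structure. Many of its features resemble R's data.frame. A DataFrame can be thought of as a container of Series.
Panel: a three-dimensional array, which can be thought of as a container of DataFrames. (Panel has been removed from recent pandas releases.)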
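The point about data types can be checked directly. The following is a minimal sketch, assuming pandas and NumPy are installed; the variable names (mixed, arr, ser, mixed_ser, df) are illustrative and do not come from the book.

```python
import numpy as np
import pandas as pd

# A Python list may mix element types freely.
mixed = [1, "two", 3.0]

# A NumPy array and a Series built from integers each have one integer dtype.
arr = np.array([4, 7, -5, 3])
ser = pd.Series([4, 7, -5, 3])
print(arr.dtype, ser.dtype)

# Mixing types makes the Series fall back to the generic object dtype,
# which gives up the memory and speed benefits of a single numeric type.
mixed_ser = pd.Series(mixed)
print(mixed_ser.dtype)  # object

# Each column of a DataFrame is itself a Series.
df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
print(type(df["a"]))
```

So "a single data type" is best read as the normal, efficient case rather than a hard restriction.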
1.1 Series
(1) Create a simple Series directly from an array-like object
In [4]: obj = Series([4, 7, -5, 3])
In [5]: obj
Out[5]:
0 4
1 7
2 -5
3 3
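As the output shows, the difference from a plain array is that a Series carries an index column alongside its values.
(2) Get the values and the index of a Series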
In [6]: obj.values
Out[6]: array([ 4, 7, -5, 3])
In [7]: obj.index
Out[7]: Int64Index([0, 1, 2, 3])
(3) Specify the index when creating a Series
In [8]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [9]: obj2
Out[9]:
d 4
b 7
a -5
c 3
(4) Operate on the elements of a Series. Note the difference from a NumPy array: elements are selected by index label
In [11]: obj2['a']
Out[11]: -5
In [12]: obj2['d'] = 6
In [13]: obj2[['c', 'a', 'd']]
Out[13]:
c 3
a -5
d 6
In [18]: 'b' in obj2
Out[18]: True
In [19]: 'e' in obj2
Out[19]: False
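A short sketch of how the labels follow the values through typical operations, using the same obj2 as above but with explicit imports so that it runs on its own. The names positive and doubled are illustrative.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Boolean filtering keeps the labels attached to the surviving values.
positive = obj2[obj2 > 0]
print(positive)

# Scalar arithmetic is applied element-wise and preserves the index.
doubled = obj2 * 2
print(doubled["a"])  # -10

# `in` tests the index labels, not the values, much like a dict.
print("b" in obj2, 7 in obj2)  # True False
```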
(5) Convert a Python dict to a Series
In [20]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [21]: obj3 = Series(sdata)
In [22]: obj3
Out[22]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
(6) An important feature of Series arithmetic is that values are matched automatically by index label before the operation is applied
In [29]: obj3
Out[29]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
In [30]: obj4
Out[30]:
California NaN
Ohio 35000
Oregon 16000
Texas 71000
In [31]: obj3 + obj4
Out[31]:
California NaN
Ohio 70000
Oregon 32000
Texas 142000
Utah NaN
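The notes do not show how obj4 was built. The sketch below assumes it comes from the same dict with an explicit index, which reproduces the printed values. It also shows Series.add with fill_value as one way to avoid NaN for labels that exist in only one operand.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Assumed construction of obj4: the same dict, but with an explicit index.
# "California" has no entry in sdata, so it becomes NaN; "Utah" is left out.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

# Labels present in only one operand produce NaN in the result.
total = obj3 + obj4
print(total)

# With fill_value, a label missing from one operand is treated as 0.
filled = obj3.add(obj4, fill_value=0)
print(filled["Utah"])  # 5000.0
```

California stays NaN even with fill_value, because after alignment it has no usable value on either side.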
2 DataFrame
(1) Create a DataFrame. The most common way is from a dict of equal-length lists or arrays:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
The resulting DataFrame is given an index automatically:
In [38]: frame
Out[38]: pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
(2) Specify the columns of a DataFrame and their order:
In [39]: DataFrame(data, columns=['year', 'state', 'pop'])
Out[39]: year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
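(3) Add a column to a DataFrame by plain assignment, which is very blunt. Here frame2 holds the same data with the index labels 'one' through 'five':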
In [46]: frame2['debt'] = 16.5
In [47]: frame2
Out[47]: year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
In [48]: frame2['debt'] = np.arange(5.)
In [49]: frame2
Out[49]: year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4
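If the new column is a Series whose length differs from that of the DataFrame, it is aligned on the index and the unmatched rows are filled with NaN: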
In [50]: val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
In [51]: frame2['debt'] = val
In [52]: frame2
Out[52]: year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
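The alignment above applies only to a Series, which carries labels. The sketch below contrasts it with a plain list of the wrong length. The construction of frame2 is an assumption inferred from the printed output, and the column name "bad" is purely illustrative.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
# Assumed construction of frame2, inferred from the printed output.
frame2 = pd.DataFrame(data, columns=["year", "state", "pop"],
                      index=["one", "two", "three", "four", "five"])

# A Series is matched to rows by label; unmatched rows become NaN.
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
frame2["debt"] = val
print(frame2["debt"].isna().sum())  # 2

# A plain list carries no labels, so its length must match exactly.
try:
    frame2["bad"] = [1, 2, 3]
    list_error = None
except ValueError as exc:
    list_error = exc
    print("ValueError:", exc)
```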
(4) Delete a column with the del keyword
In [53]: frame2['eastern'] = frame2.state == 'Ohio'
In [54]: frame2
Out[54]: year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
In [55]: del frame2['eastern']
In [56]: frame2.columns
Out[56]: Index([year, state, pop, debt], dtype=object)
(5) Index objects
Any array or other sequence of labels used when constructing a Series or DataFrame is converted internally to an Index object.
In [68]: obj = Series(range(3), index=['a', 'b', 'c'])
In [69]: index = obj.index
In [70]: index
Out[70]: Index([a, b, c], dtype=object)
An Index object is immutable, so its elements cannot be modified in place. The following raises an error:
In [72]: index[1] = 'd'
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-72-676fdeb26a68> in <module>()
----> 1 index[1] = 'd'
However, a whole new index can be supplied (method 1):
In [73]: index = pd.Index(np.arange(3))
In [74]: obj2 = Series([1.5, -2.5, 0], index=index)
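A small sketch of both behaviours in one place. It assumes a recent pandas release, where item assignment on an Index raises TypeError; the traceback above comes from an older version that raised a generic Exception.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

# Changing a single label in place is rejected.
try:
    index[1] = "d"
    error = None
except TypeError as exc:  # the exception type used by recent pandas versions
    error = exc
    print("TypeError:", exc)

# Replacing the whole index on the Series is allowed.
obj.index = ["a", "d", "c"]
print(list(obj.index))  # ['a', 'd', 'c']
```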
Use reindex to conform the data to a new index (method 2). Labels that did not exist before get NaN:
<pre class="python" name="code">In [79]: obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
In [80]: obj
Out[80]:
d 4.5
b 7.2
a -5.3
c 3.6
In [81]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
In [82]: obj2
Out[82]: a -5.3
b 7.2
c 3.6
d 4.5
e NaN
Or reindex the columns of a DataFrame:
In [90]: states = ['Texas', 'Utah', 'California']
In [91]: frame.reindex(columns=states)
Out[91]:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
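The frame used above is not defined in these notes. The sketch below assumes a 3x3 integer frame whose columns include "Ohio", which is a guess consistent with the printed output. It shows reindexing rows, reindexing columns, and the fill_value option.

```python
import numpy as np
import pandas as pd

# Assumed construction of frame; the "Ohio" column is a guess that is
# consistent with the output shown above.
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

# Reindex rows: the new label "b" has no data, so its row is NaN.
rows = frame.reindex(["a", "b", "c", "d"])
print(rows)

# Reindex columns: "Utah" is new (NaN) and "Ohio" is dropped.
cols = frame.reindex(columns=["Texas", "Utah", "California"])
print(cols)

# fill_value supplies a default instead of NaN.
filled = frame.reindex(["a", "b", "c", "d"], fill_value=0)
print(filled.loc["b"].tolist())  # [0, 0, 0]
```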
(6) An Index has set-like properties, such as membership tests
In [76]: frame3
Out[76]: state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [77]: 'Ohio' in frame3.columns
Out[77]: True
In [78]: 2003 in frame3.index
Out[78]: False
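Beyond membership tests, an Index also offers set-style methods. A minimal sketch with two illustrative indexes (left and right are not from the book):

```python
import pandas as pd

left = pd.Index(["Nevada", "Ohio"])
right = pd.Index(["Ohio", "Texas"])

print("Ohio" in left)            # True
print(left.intersection(right))  # labels present in both
print(left.union(right))         # labels present in either
print(left.difference(right))    # labels only in left

# Unlike a Python set, an Index is ordered and may hold duplicate labels.
dup = pd.Index(["a", "a", "b"])
print(dup.is_unique)  # False
```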
(7) Use the drop method to delete entries along one axis. Note that drop returns a new object and leaves the original data unchanged.
Using drop on a Series:
In [94]: obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
In [95]: new_obj = obj.drop('c')
In [96]: new_obj
Out[96]: a 0
b 1
d 3
e 4
In [97]: obj.drop(['d', 'c'])
Out[97]: a 0
b 1
e 4
Using drop on a DataFrame:
In [98]: data = DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
In [99]: data.drop(['Colorado', 'Ohio'])
Out[99]: one two three four
Utah 8 9 10 11
New York 12 13 14 15
In [100]: data.drop('two', axis=1)
Out[100]: one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In [101]: data.drop(['two', 'four'], axis=1)
Out[101]: one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
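That drop leaves the original untouched can be confirmed with a short sketch. The index= and columns= keywords used at the end are an alternative spelling available in recent pandas versions, not something these notes use.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# drop returns a new DataFrame; data itself still has all four columns.
trimmed = data.drop("two", axis=1)
print(list(trimmed.columns))  # ['one', 'three', 'four']
print(list(data.columns))     # ['one', 'two', 'three', 'four']

# The index= and columns= keywords say the same thing without axis numbers.
smaller = data.drop(index=["Colorado", "Ohio"], columns=["two", "four"])
print(smaller)
```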
(8) Indexing, selection, and filtering
This works much as it does for arrays. Examples:
(Series omitted.) For a DataFrame:
In [112]: data = DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
In [113]: data
Out[113]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [115]: data[['three', 'one']]
Out[115]:
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
In [118]: data < 5
Out[118]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
In [119]: data[data < 5] = 0
In [120]: data
Out[120]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
Using the powerful ix indexer (removed from later pandas releases in favor of loc and iloc):
In [122]: data.ix[['Colorado', 'Utah'], [3, 0, 1]]
Out[122]:
four one two
Colorado 7 0 5
Utah 11 8 9
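On pandas versions without ix, the same selection can be written with loc (labels) or iloc (positions). A minimal sketch, with by_label and by_position as illustrative names:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data[data < 5] = 0

# ix mixed row labels with integer column positions; with loc, translate
# the positions to column labels first.
cols = data.columns[[3, 0, 1]]
by_label = data.loc[["Colorado", "Utah"], cols]
print(by_label)

# iloc is purely positional: rows 1 and 2, columns 3, 0 and 1.
by_position = data.iloc[[1, 2], [3, 0, 1]]
print(by_position.equals(by_label))  # True
```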
Summary of the indexing options:
Type                            Notes
obj[val]                        Select single column or sequence of columns from the DataFrame. Special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion).
obj.ix[val]                     Selects single row or subset of rows from the DataFrame.
obj.ix[:, val]                  Selects single column or subset of columns.
obj.ix[val1, val2]              Select both rows and columns.
reindex method                  Conform one or more axes to new indexes.
xs method                       Select single row or column as a Series by label.
icol, irow methods              Select single column or row, respectively, as a Series by integer location.
get_value, set_value methods    Select single value by row and column label.
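Several entries in this table (ix, icol, irow, get_value, set_value) are no longer available in recent pandas releases. The sketch below shows one possible mapping onto loc, iloc, and at. It is a suggested set of equivalents, not something taken from the book.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

row = data.loc["Utah"]            # one row by label, as a Series
col = data.loc[:, "two"]          # one column by label
row_by_pos = data.iloc[2]         # in place of irow(2)
col_by_pos = data.iloc[:, 1]      # in place of icol(1)
value = data.at["Utah", "two"]    # in place of get_value
data.at["Utah", "two"] = 90       # in place of set_value
cross = data.xs("Utah")           # xs is still available
print(value, data.at["Utah", "two"])  # 9 90
```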