pandas Usage Notes (Part 1)

1. How to select rows from a DataFrame based on values in some column in pandas?
In SQL I would use:

select * from table where column_name = some_value
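The usual pandas counterpart of this SQL filter is boolean indexing. A minimal sketch, assuming a hypothetical DataFrame df; the column names and values are illustrative only:

```python
import pandas as pd

# Hypothetical sample data; the names below are illustrative only.
df = pd.DataFrame({
    "column_name": ["x", "y", "x", "z"],
    "value": [1, 2, 3, 4],
})
some_value = "x"

# Equivalent of: SELECT * FROM table WHERE column_name = some_value
selected = df.loc[df["column_name"] == some_value]

# Equivalent of: SELECT * FROM table WHERE column_name != some_value
others = df.loc[df["column_name"] != some_value]

print(selected)
```

The comparison builds a boolean Series, and df.loc keeps only the rows where it is True.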

 

2. I already have a DataFrame in the following format:

YEARMONTH  CLIENTCODE  SIZE
201201     500         1
203103     123         2

Now I want to do the equivalent of this SQL:

Select count(distinct CLIENTCODE) from table group by YEARMONTH

Solution 1:

 
 
In [2]: table
Out[2]:
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]:
YEARMONTH
201301    2
201302    3
Solution 2:

len(unique()) also works and is reported to be 3 to 15 times faster than nunique(), though I have not fully worked out why [1].
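A minimal sketch comparing the two approaches on the sample table from Solution 1. The lambda wrapper is one assumed way to apply len(unique()) per group, not necessarily what the cited source used, and no timing claim is made here:

```python
import pandas as pd

# Same data as the sample table in Solution 1.
table = pd.DataFrame({
    "CLIENTCODE": [1, 1, 2, 1, 2, 2, 3],
    "YEARMONTH": [201301, 201301, 201301, 201302, 201302, 201302, 201302],
})

grouped = table.groupby("YEARMONTH")["CLIENTCODE"]

# Solution 1: built-in distinct count per group.
by_nunique = grouped.nunique()

# Solution 2: length of the array of unique values per group.
by_len_unique = grouped.apply(lambda s: len(s.unique()))

print(by_nunique)
print(by_len_unique)
```

One difference worth keeping in mind: nunique() skips NaN by default, while unique() keeps NaN as a value, so the two counts can differ on data with missing values.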

3. Given a table rpt with the following contents:

rpt
MultiIndex: 47518 entries, ('000002', '20120331') to ('603366', '20091231')
Data columns:
STK_ID      47518 non-null values
STK_Name    47518 non-null values
RPT_Date    47518 non-null values
sales       47518 non-null values

To filter out all records for one STK_ID, use:
rpt[rpt['STK_ID']=='600809']
To filter out all records whose STK_ID appears in a list such as
stk_list = ['600809','600141','600329']
use:
rpt[rpt['STK_ID'].isin(stk_list)]
or, quoting the values because STK_ID holds strings:
rpt.query('STK_ID in ["600809", "600141", "600329"]')
or, only if STK_ID is stored as integers rather than strings, a range query:
rpt.query('600000 < STK_ID < 700000')
For pattern matching, use:
rpt[rpt['STK_ID'].str.contains(r'^600[0-9]{3}$')]
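The filters above can be tried on a small stand-in for rpt. This is a sketch under assumptions: the five rows and the sales figures are made up, and the real rpt has a MultiIndex that is omitted here:

```python
import pandas as pd

# Made-up stand-in for rpt; STK_ID is stored as strings.
rpt = pd.DataFrame({
    "STK_ID": ["600809", "600141", "600329", "000002", "603366"],
    "sales": [10.0, 20.0, 30.0, 40.0, 50.0],
})
stk_list = ["600809", "600141", "600329"]

# Exact match on one code.
one = rpt[rpt["STK_ID"] == "600809"]

# Membership in a list, via isin and via query (@ refers to a Python variable).
in_list = rpt[rpt["STK_ID"].isin(stk_list)]
via_query = rpt.query("STK_ID in @stk_list")

# Regular-expression match: codes of the form 600xxx.
pattern = rpt[rpt["STK_ID"].str.contains(r"^600[0-9]{3}$")]

print(pattern)
```

Referencing the list with @stk_list avoids retyping the codes inside the query string and keeps them as strings.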
4. To get the number of rows and columns of a DataFrame, use dataframe.shape[0] and dataframe.shape[1].
The row count is also available as len(DataFrame.index).
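A short sketch of both ways to read the dimensions, using a hypothetical three-row, two-column DataFrame:

```python
import pandas as pd

# Hypothetical data: 3 rows, 2 columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape

# Alternative row count.
n_rows_alt = len(df.index)

print(n_rows, n_cols, n_rows_alt)
```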
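5. Removing duplicates from a DataFrame:
The usual commands are drop (remove rows or columns by label) and drop_duplicates (remove repeated rows). Both return a new DataFrame unless inplace=True is passed, in which case the original DataFrame is modified.
Given df = pd.read_csv(data_path, header=0, names=['a', 'b', 'c', 'd', 'e'])
to drop the rows whose value in column b repeats an earlier row, use df = df.drop_duplicates('b')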
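A minimal sketch of the difference between the returned copy and the inplace form, using a hypothetical four-row DataFrame in place of the CSV file:

```python
import pandas as pd

# Hypothetical data standing in for the CSV; column b contains a repeated "x".
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "y", "x", "z"]})

# Default: returns a new DataFrame and keeps the first occurrence.
deduped = df.drop_duplicates(subset="b")

# keep="last" keeps the last occurrence instead.
last_kept = df.drop_duplicates(subset="b", keep="last")

# inplace=True modifies the DataFrame itself and returns None.
df_inplace = df.copy()
result = df_inplace.drop_duplicates(subset="b", inplace=True)

print(deduped)
```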
6. Cleaning a column of a DataFrame:
Remove newline characters: misc['product_desc'] = misc['product_desc'].str.replace('\n', '')
Strip leading and trailing whitespace from strings: df["Make"] = df["Make"].map(str.strip)
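Both cleaning steps can be sketched on made-up data. Here .str.strip() is swapped in for map(str.strip): it does the same job on strings but passes missing values through, whereas map(str.strip) raises a TypeError on a non-string value such as NaN:

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "product_desc": ["line one\nline two", "no newline"],
    "Make": ["  Ford ", "Honda  "],
})

# Remove newline characters; regex=False treats the pattern as a literal string.
df["product_desc"] = df["product_desc"].str.replace("\n", "", regex=False)

# Strip leading and trailing whitespace.
df["Make"] = df["Make"].str.strip()

print(df)
```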
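7. When importing external data into a DataFrame, convert data types as needed (see the read_table call with converters at the end of these notes).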
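One way to convert types during import is the converters argument of the pandas readers, which maps a column name to a function applied to each raw string value. A sketch under assumptions: the CSV text is made up, and make_int is a hypothetical helper written for this example:

```python
import io

import pandas as pd

# Made-up CSV content with stray whitespace around the fields.
csv_text = "1999, Ford ,Focus , compact car\n2005,Honda, Civic,sedan \n"


def make_int(text):
    """Hypothetical helper: turn a raw field into an int."""
    return int(text.strip())


table = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["Year", "Make", "Model", "Description"],
    converters={
        "Year": make_int,
        "Make": str.strip,
        "Model": str.strip,
        "Description": str.strip,
    },
)

print(table)
```

When only the type matters and no cleanup is needed, the dtype argument of read_csv is a simpler alternative.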
8. Using apply, applymap and map. In short: apply works on a DataFrame and computes over each row or column; applymap works on a DataFrame element by element; map works on a Series element by element.

In [116]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [117]: frame
Out[117]:
               b         d         e
Utah   -0.029638  1.081563  1.280300
Ohio    0.647747  0.831136 -1.549481
Texas   0.513416 -0.884417  0.195343
Oregon -0.485454 -0.477388 -0.309548

In [118]: f = lambda x: x.max() - x.min()

In [119]: frame.apply(f)
Out[119]:
b    1.133201
d    1.965980
e    2.829781
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.

Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:

In [120]: format = lambda x: '%.2f' % x

In [121]: frame.applymap(format)
Out[121]:
            b      d      e
Utah    -0.03   1.08   1.28
Ohio     0.65   0.83  -1.55
Texas    0.51  -0.88   0.20
Oregon  -0.49  -0.48  -0.31

The reason for the name applymap is that Series has a map method for applying an element-wise function:

In [122]: frame['e'].map(format)
Out[122]:
Utah       1.28
Ohio      -1.55
Texas      0.20
Oregon    -0.31
Name: e, dtype: object
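The three methods can be compared side by side on a frame with fixed values, so the results are reproducible. This is a sketch with assumed data, not the random frame from the session above. Since DataFrame.applymap is deprecated as of pandas 2.1 in favor of DataFrame.map, the element-wise step is written as apply plus Series.map, which behaves the same on older and newer versions:

```python
import numpy as np
import pandas as pd

# Fixed values 0..11 instead of random numbers.
frame = pd.DataFrame(
    np.arange(12, dtype=float).reshape(4, 3),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)


def spread(x):
    """Range (max minus min) of a Series."""
    return x.max() - x.min()


def fmt(x):
    """Format one number with two decimals."""
    return "%.2f" % x


# apply: one result per column (default) or per row (axis=1).
col_range = frame.apply(spread)
row_range = frame.apply(spread, axis=1)

# map: element-wise on a Series.
formatted_col = frame["e"].map(fmt)

# Element-wise on the whole DataFrame: map each column in turn.
formatted = frame.apply(lambda col: col.map(fmt))

print(col_range)
print(formatted)
```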
 
9. Merging DataFrames on a column:
# Merge multiple dataframes
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])
pd.merge(pd.merge(df1,df2,on='name'),df3,on='name')
Alternatively, as mentioned by cwharland:
df1.merge(df2,on='name').merge(df3,on='name')
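For more than a few frames, the chained merge can be folded with functools.reduce. This is a sketch of one possible generalization, not something the original answer prescribes. The frames are built from dicts here so the numeric columns keep an integer dtype; np.array with mixed values, as used above, stores everything as strings:

```python
from functools import reduce

import pandas as pd

# Same values as above, built from dicts so numbers stay numeric.
df1 = pd.DataFrame({"name": ["a", "b", "c"], "attr11": [5, 4, 24], "attr12": [9, 61, 9]})
df2 = pd.DataFrame({"name": ["a", "b", "c"], "attr21": [5, 14, 4], "attr22": [19, 16, 9]})
df3 = pd.DataFrame({"name": ["a", "b", "c"], "attr31": [15, 4, 14], "attr32": [49, 36, 9]})

# Merge the list pairwise from left to right on the shared key.
frames = [df1, df2, df3]
merged = reduce(lambda left, right: left.merge(right, on="name"), frames)

print(merged)
```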
10. Adding to a set:
To add a single element:
keep.add(onemorevalue)
To add every element of a list or other sequence:
keep.update(yoursequenceofvalues)
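A small sketch of the difference between add and update, with made-up values:

```python
# Start with a small set; the values are illustrative only.
keep = {1, 2}

# add inserts exactly one element.
keep.add(3)

# update inserts every element of an iterable; duplicates are ignored.
keep.update([3, 4, 5])

# Passing a list to add fails, because lists are unhashable.
try:
    keep.add([6, 7])
    add_list_failed = False
except TypeError:
    add_list_failed = True

print(keep, add_list_failed)
```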
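The following read_table call belongs with item 7 (type conversion on import). It passes converters so that each column is cleaned as it is read; strip and make_int are user-defined functions not shown here:
table = pd.read_table("data.csv", sep=r',', names=["Year", "Make", "Model", "Description"],
                      converters={'Description': strip, 'Model': strip, 'Make': strip, 'Year': make_int})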


Reposted from blog.csdn.net/doubleicefire/article/details/80175107