pandas Usage Notes (Part 1)

1. How do I select rows from a DataFrame based on the values in some column in pandas?
In SQL I would use:

SELECT * FROM table WHERE column_name = some_value
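The note above leaves the pandas answer out. A minimal sketch of the usual approach is boolean indexing; the DataFrame and its column names here are made up for illustration and do not come from the original post.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "score": [10, 20, 10, 30],
})

# SELECT * FROM table WHERE score = 10
equal_rows = df.loc[df["score"] == 10]

# SELECT * FROM table WHERE score IN (10, 30)
in_rows = df.loc[df["score"].isin([10, 30])]

# SELECT * FROM table WHERE score != 10
not_equal_rows = df.loc[df["score"] != 10]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
combined = df.loc[(df["score"] >= 10) & (df["name"] != "a")]

print(equal_rows)
```

Each comparison produces a boolean Series, and df.loc keeps the rows where it is True.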

2. I already have a DataFrame in the following format:

YearMonth  ClientCode  Size

201201     500         1

203103     123         2

Now I want to perform the following SQL operation:

SELECT COUNT(DISTINCT CLIENTCODE) FROM table GROUP BY YEARMONTH

Solution 1:

In [2]: table
Out[2]:
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]:
YEARMONTH
201301    2
201302    3

Solution 2:

len(unique()) also works and is reportedly 3 to 15 times faster than nunique(), although I have not fully understood why [1].
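Both solutions can be tried side by side on the sample table from Solution 1. This is a sketch only: the speed claim is not measured here, and applying len(unique()) per group through apply is one possible reading of Solution 2, not necessarily what the cited source does.

```python
import pandas as pd

# The sample table shown in Solution 1.
table = pd.DataFrame({
    "CLIENTCODE": [1, 1, 2, 1, 2, 2, 3],
    "YEARMONTH": [201301, 201301, 201301, 201302, 201302, 201302, 201302],
})

# Solution 1: count distinct CLIENTCODE values per YEARMONTH.
by_nunique = table.groupby("YEARMONTH")["CLIENTCODE"].nunique()

# Solution 2 (one interpretation): take the length of unique() per group.
by_len_unique = table.groupby("YEARMONTH")["CLIENTCODE"].apply(
    lambda s: len(s.unique())
)

print(by_nunique)
print(by_len_unique)
```

One difference to keep in mind: nunique() ignores missing values by default, while unique() keeps NaN as a value, so the two counts can differ on columns with missing data.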

3. There is a table rpt with the following contents:

rpt
MultiIndex: 47518 entries, ('000002', '20120331') to ('603366', '20091231')
Data columns:
STK_ID      47518 non-null values
STK_Name    47518 non-null values
RPT_Date    47518 non-null values
sales       47518 non-null values

To select all records for a single STK_ID, use: rpt[rpt['STK_ID']=='600809']

To select all records whose STK_ID appears in a list such as

stk_list = ['600809','600141','600329']

use:

rpt[rpt['STK_ID'].isin(stk_list)]

or

rpt.query('STK_ID in ["600809", "600141", "600329"]')

or, for a range of codes when STK_ID is stored as integers,

rpt.query('600000 < STK_ID < 700000')

For pattern matching, use:

rpt[rpt['STK_ID'].str.contains(r'^600[0-9]{3}$')]
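The filters above can be checked on a small stand-in for rpt. The five rows below are invented for illustration (only the column names and a few stock codes come from the text), and STK_ID is assumed to be stored as strings.

```python
import pandas as pd

# Small stand-in for rpt; the values are illustrative only.
rpt = pd.DataFrame({
    "STK_ID": ["600809", "600141", "000002", "600329", "603366"],
    "sales": [1.0, 2.0, 3.0, 4.0, 5.0],
})
stk_list = ["600809", "600141", "600329"]

# All records for one code.
one_stock = rpt[rpt["STK_ID"] == "600809"]

# All records whose code is in the list.
in_list = rpt[rpt["STK_ID"].isin(stk_list)]

# The same filter with query; @ refers to a local Python variable.
in_list_query = rpt.query("STK_ID in @stk_list")

# Regex match: six-digit codes that start with 600.
pattern_match = rpt[rpt["STK_ID"].str.contains(r"^600[0-9]{3}$")]

# Numeric range: convert the string codes to integers first.
codes = rpt["STK_ID"].astype(int)
numeric_range = rpt[(codes > 600000) & (codes < 700000)]

print(pattern_match)
```

Comparing string codes against bare numbers inside query would not match anything, which is why the range filter converts to integers explicitly.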

4. To get the number of rows and columns of a DataFrame, use dataframe.shape[0] and dataframe.shape[1].

The number of rows can also be obtained with len(DataFrame.index).
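A quick check of these calls on a made-up three-row, two-column DataFrame:

```python
import pandas as pd

# Illustrative data: 3 rows, 2 columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows = df.shape[0]          # number of rows
n_cols = df.shape[1]          # number of columns
n_rows_alt = len(df.index)    # alternative way to count rows

print(n_rows, n_cols, n_rows_alt)
```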

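5. Removing duplicates from a DataFrame:

Use drop_duplicates to remove duplicate rows (drop, by contrast, removes rows or columns by label). Both return a new DataFrame unless you pass inplace=True, in which case the original DataFrame is modified.

Given df = pd.read_csv(data_path, header=0, names=['a', 'b', 'c', 'd', 'e'])

To delete the rows with duplicate values in column b, use df = df.drop_duplicates('b')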
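The behaviour described above can be seen on a small example. The data is invented; only the column name b follows the text.

```python
import pandas as pd

# Illustrative data; column "b" contains one repeated value ("x").
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": ["x", "y", "x", "z"],
})

# Returns a new DataFrame and keeps the first occurrence of each value in b.
deduped = df.drop_duplicates("b")

# keep="last" keeps the last occurrence instead.
keep_last = df.drop_duplicates(subset="b", keep="last")

# With inplace=True the DataFrame is modified and the call returns None.
df_inplace = df.copy()
result = df_inplace.drop_duplicates("b", inplace=True)

print(deduped)
print(len(df), len(df_inplace))
```

Note that the original df still has all four rows after the first two calls; only the copy modified with inplace=True shrinks.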

6. Cleaning a column of a DataFrame:

Remove newline characters: misc['product_desc'] = misc['product_desc'].str.replace('\n', '')

Strip leading and trailing whitespace: df["Make"] = df["Make"].map(str.strip)
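A sketch of both cleaning steps on invented data. It swaps map(str.strip) for the .str.strip() accessor, an assumption made here because map(str.strip) can raise a TypeError when the column contains missing values, while the .str accessor passes them through.

```python
import pandas as pd

# Illustrative data, including a missing value in each column.
misc = pd.DataFrame({"product_desc": ["line one\nline two", "no newline", None]})
df = pd.DataFrame({"Make": ["  Ford ", "Toyota  ", None]})

# Remove newline characters; regex=False treats "\n" as a literal string.
misc["product_desc"] = misc["product_desc"].str.replace("\n", "", regex=False)

# Strip leading and trailing whitespace; missing values stay missing.
df["Make"] = df["Make"].str.strip()

print(misc)
print(df)
```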

7. When importing external data into a DataFrame, convert data types as needed:
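One way to do this, sketched under assumptions: pass converters (a function per column) or dtype to the reader. The CSV text and the helpers strip and make_int below are hypothetical stand-ins written for this example.

```python
import io
import pandas as pd

# Hypothetical CSV content with stray whitespace around the text fields.
csv_text = "2012, Ford , Focus ,compact\n2013,Toyota, Corolla , sedan \n"

def strip(text):
    """Hypothetical helper: remove surrounding whitespace."""
    return text.strip()

def make_int(text):
    """Hypothetical helper: parse a field as an integer."""
    return int(text.strip())

# converters applies one function per column while the file is parsed.
table = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["Year", "Make", "Model", "Description"],
    converters={"Description": strip, "Model": strip,
                "Make": strip, "Year": make_int},
)

# dtype forces a column type; reading codes as str keeps leading zeros.
codes = pd.read_csv(
    io.StringIO("STK_ID,sales\n000002,1.5\n600809,2.5\n"),
    dtype={"STK_ID": str},
)

print(table)
print(codes)
```

Without dtype={"STK_ID": str}, a code such as 000002 would be parsed as the integer 2.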

8. Using apply, applymap and map. In summary: apply is used on a DataFrame to compute over rows or columns; applymap is used on a DataFrame and operates element-wise; map is used on a Series and operates element-wise.

In [116]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [117]: frame
Out[117]:
               b         d         e
Utah   -0.029638  1.081563  1.280300
Ohio    0.647747  0.831136 -1.549481
Texas   0.513416 -0.884417  0.195343
Oregon -0.485454 -0.477388 -0.309548

In [118]: f = lambda x: x.max() - x.min()

In [119]: frame.apply(f)
Out[119]:
b    1.133201
d    1.965980
e    2.829781
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.

Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:

In [120]: format = lambda x: '%.2f' % x

In [121]: frame.applymap(format)
Out[121]:
            b      d      e
Utah    -0.03   1.08   1.28
Ohio     0.65   0.83  -1.55
Texas    0.51  -0.88   0.20
Oregon  -0.49  -0.48  -0.31

The reason for the name applymap is that Series has a map method for applying an element-wise function:

In [122]: frame['e'].map(format)
Out[122]:
Utah       1.28
Ohio      -1.55
Texas      0.20
Oregon    -0.31
Name: e, dtype: object
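The three methods can be compared in one runnable script. This sketch replaces the random data with fixed values so the results are reproducible, and names the formatter fmt to avoid shadowing the built-in format. Since pandas 2.1, DataFrame.applymap is deprecated in favor of DataFrame.map, so the script uses whichever is available.

```python
import numpy as np
import pandas as pd

# Fixed values 0.0 .. 11.0 instead of random numbers, for reproducibility.
frame = pd.DataFrame(
    np.arange(12.0).reshape(4, 3),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

# apply: the function receives one whole column (default) or row (axis=1).
col_range = frame.apply(lambda x: x.max() - x.min())
row_range = frame.apply(lambda x: x.max() - x.min(), axis=1)

fmt = lambda x: "%.2f" % x

# Element-wise on a DataFrame: DataFrame.map in pandas >= 2.1,
# applymap in older versions.
if hasattr(pd.DataFrame, "map"):
    formatted = frame.map(fmt)
else:
    formatted = frame.applymap(fmt)

# Element-wise on a Series.
formatted_e = frame["e"].map(fmt)

print(col_range)
print(formatted)
```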

 
9. Merging DataFrames on a shared column:

# Merge multiple dataframes
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

pd.merge(pd.merge(df1,df2,on='name'),df3,on='name')

Alternatively, as mentioned by cwharland:

df1.merge(df2,on='name').merge(df3,on='name')
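A self-contained version of the merge, as a sketch. It builds the frames from dictionaries rather than np.array, because a NumPy array with mixed strings and numbers stores every value as a string. The functools.reduce variant is an addition for merging an arbitrary list of frames and is not part of the original answer.

```python
from functools import reduce

import pandas as pd

# Same values as above, built from dicts so the numeric columns stay numeric.
df1 = pd.DataFrame({"name": ["a", "b", "c"], "attr11": [5, 4, 24], "attr12": [9, 61, 9]})
df2 = pd.DataFrame({"name": ["a", "b", "c"], "attr21": [5, 14, 4], "attr22": [19, 16, 9]})
df3 = pd.DataFrame({"name": ["a", "b", "c"], "attr31": [15, 4, 14], "attr32": [49, 36, 9]})

# Chained merge on the shared "name" column.
chained = df1.merge(df2, on="name").merge(df3, on="name")

# The same result for any number of frames, folding merge over a list.
frames = [df1, df2, df3]
reduced = reduce(lambda left, right: pd.merge(left, right, on="name"), frames)

print(chained)
```

The default join is an inner join, so only names present in every frame survive.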

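10. Adding to a set:

Add a single element to a set:

keep.add(onemorevalue)

Add all the elements of a list to a set:

keep.update(yoursequenceofvalues)

Converting data types while importing (see item 7), where strip and make_int are user-defined functions:

table = pd.read_table("data.csv", sep=r',',
                      names=["Year", "Make", "Model", "Description"],
                      converters = {'Description' : strip,
                                    'Model' : strip,
                                    'Make' : strip,
                                    'Year' : make_int})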
