How to filter a matrix or array with NumPy set-membership functions

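When working with a pandas DataFrame, you often need to filter rows by the values in one column. A for loop over the target values works, but it is slow. A faster approach is to use a NumPy set-membership function to build a boolean mask first and then filter with boolean indexing. This is far faster and can filter a matrix with millions of rows in about a second.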
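As a rough illustration of the idea, here is a minimal sketch on toy data. The DataFrame `df`, its columns, and the `unwanted` list are invented for this example and are not part of the plasmid data used later.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "value": [1, 2, 3, 4, 5],
    "group": ["a", "b", "a", "c", "b"],
})
unwanted = ["b", "c"]

# Boolean array: True where df["group"] is one of the unwanted labels.
mask = np.isin(df["group"], unwanted)

# Boolean indexing: keep only the rows whose group is NOT unwanted.
kept = df[~mask]
print(kept)
```

The mask is computed in one vectorized call instead of one comparison per label, which is where the speed-up comes from.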

Step two: for each sample, take 3 sequences to use as the test set.

For example, consider the following problem.

I have a matrix of plasmid data, shown below:

>>> pdf6.head()
	0	1	2	3	4	5	6	7	8	9	...	2072	2073	2074	2075	2076	2077	2078	2079	id6	plasmid
0	32	12	21	20	11	6	4	4	16	13	...	4	4	12	5	1	13	1	8	AP012172_1	CP002635
1	20	14	8	14	13	4	7	7	12	8	...	8	1	9	2	1	8	2	12	AP012172_2	AP012172
2	16	14	11	24	7	4	6	11	10	5	...	3	5	10	5	2	11	1	6	AP012173_3	AP012173
3	29	12	23	16	14	11	10	6	24	8	...	4	4	10	2	2	11	1	11	AP012174_4	AP012174
4	37	13	17	24	9	12	2	6	21	12	...	8	6	13	7	2	9	2	7	AP012175_5	AP012175

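Each plasmid should in theory have 10 records, but some have fewer. The goal is to drop the plasmids that do not have 10 records and then take three samples from each remaining plasmid.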

Method 1: NumPy set functions

1.1 Find the plasmids that do not have 10 records

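>>> import numpy as np
>>> import pandas as pd
# First sort by id6 so that samples from the same plasmid end up next to each other
>>> s_pdf6=pdf6.sort_values("id6")
>>> p6id = pdf6["plasmid"]
# Count the records per plasmid once, then keep the plasmids whose count is not 10
>>> p6counts=p6id.value_counts()
>>> badplasid=p6counts[p6counts!=10].index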
>>> badplasid
Index(['HE577332', 'CP014550', 'CM002269', 'AP018195', 'HE577331', 'CP006583',
       'CM001479', 'CP017127', 'CP008949', 'EF495211',
       ...
       'HE983996', 'CP006924', 'AF312688', 'AF128883', 'CP003436', 'CM002137',
       'GU569091', 'CP003435', 'CP003442', 'CP012459'],
      dtype='object', length=152)
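To make the counting step easier to follow, here is a small sketch on invented labels. The names `labels`, `expected`, and `incomplete` are made up for the example, and the expected group size is 3 instead of 10 to keep the data short.

```python
import pandas as pd

# Invented labels: p1 and p3 have 3 records each, p2 has only 2.
labels = pd.Series(["p1", "p1", "p1", "p2", "p2", "p3", "p3", "p3"])
expected = 3  # the article expects 10 records per plasmid

# value_counts returns a Series: label -> number of occurrences.
counts = labels.value_counts()

# Boolean indexing on the counts keeps only the labels with the wrong size.
incomplete = counts[counts != expected].index
print(list(incomplete))
```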

1.2 Use NumPy's membership test to flag the rows of the full data whose plasmid is in badplasid

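# Membership test (np.isin replaces the older np.in1d, which is deprecated as of NumPy 2.0)
>>> s_pdf6_bad_index=np.isin(s_pdf6.plasmid,badplasid)
# s_pdf6_bad_index is a boolean array with the same length as s_pdf6.plasmid: True where the plasmid is in badplasid, False elsewhere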

# We want to drop the matching rows, so invert the mask (swap True and False)
>>> s_pdf6_10_index=np.logical_not(s_pdf6_bad_index)

1.3 Filter the DataFrame with the boolean index

>>> pdf6_10s=s_pdf6[s_pdf6_10_index]
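Steps 1.2 and 1.3 can also be collapsed into one call. The sketch below uses toy data (the `df` and `bad` objects are invented) and shows two variants that should give the same mask: `np.isin` with `invert=True`, and the pandas method `Series.isin` combined with `~`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "plasmid": ["p1", "p1", "p2", "p3", "p3"],
    "x": [10, 11, 20, 30, 31],
})
bad = pd.Index(["p2"])  # stands in for badplasid

# Variant 1: invert=True returns the negated membership mask directly.
keep_np = np.isin(df["plasmid"], bad, invert=True)

# Variant 2: the pandas equivalent, negated with ~.
keep_pd = ~df["plasmid"].isin(bad)

filtered = df[keep_np]
print(filtered)
```

Which variant to use is mostly a matter of taste; the pandas version returns a Series that keeps the row index, while the NumPy version returns a plain array.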

1.4 For the 10 samples of each plasmid, take 3 as test data

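# Reset the index so that rows are numbered 0, 1, 2, ... (drop=True discards the old index instead of keeping it as a column)
>>> pdf6_10s=pdf6_10s.reset_index(drop=True)
# This relies on each plasmid's 10 rows being contiguous, so that index%10 is a row's position within its plasmid
# Take the first 3 samples of each plasmid as test data and the last 7 as train data
>>> pdf_test=pdf6_10s[(pdf6_10s.index%10)<3]
>>> pdf_train=pdf6_10s[(pdf6_10s.index%10)>=3]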
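The modulo trick depends on the rows of each plasmid sitting next to each other. If that cannot be guaranteed, one alternative (a swapped-in technique, not what the steps above do) is `groupby(...).cumcount()`, which numbers the rows within each group regardless of their order. The data and the `n_test` value below are invented for the sketch.

```python
import pandas as pd

# Toy data with the two plasmids deliberately interleaved.
df = pd.DataFrame({
    "plasmid": ["p1", "p2", "p1", "p2", "p1", "p2", "p1", "p2"],
    "x": range(8),
})
n_test = 2  # the article takes 3 of 10; 2 of 4 is used for this toy data

# Position of each row within its own plasmid group: 0, 1, 2, ...
pos = df.groupby("plasmid").cumcount()

test = df[pos < n_test]    # first n_test rows of every plasmid
train = df[pos >= n_test]  # the remaining rows of every plasmid
print(test)
print(train)
```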

Method 2: for loop

Create an empty DataFrame first, then use a for loop to filter the plasmids one at a time and append the results.

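# 1. Check which plasmids have fewer than 10 samples
>>> pd.Series(p6id).value_counts()

# 2. value_counts sorts by count in descending order, so the plasmids with fewer than 10 samples are at the end. Find where they start and keep only the index entries before that point. Here the last 170 plasmids have fewer than 10 samples
>>> p_goodid=list(pd.Series(p6id).value_counts().index)[:-170]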

# Loop over the elements of p_goodid, extracting each plasmid's rows and appending them to the DataFrames created here
>>> pdf_test = pd.DataFrame(columns = pdf6.columns)
>>> pdf_train = pd.DataFrame(columns = pdf6.columns)

>>> for i in p_goodid:
        single=pdf6[pdf6.plasmid==i]
        # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
        pdf_test=pd.concat([pdf_test,single.iloc[:3]],ignore_index=True)
        pdf_train=pd.concat([pdf_train,single.iloc[3:]],ignore_index=True)

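The second method is suitable only for small data sets; with a large amount of data it takes too long. In my own test, the first method took 30 s and the second took 12 h.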
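If a loop is still preferred, much of the cost can probably be avoided by not rescanning the whole DataFrame for every plasmid and by not growing a DataFrame inside the loop. The sketch below is one way to do that, under assumptions: it iterates over `groupby` groups, collects the slices in plain lists, and concatenates once at the end. The toy data, `expected`, and `n_test` are invented, and this variant was not timed on the plasmid data.

```python
import pandas as pd

# Toy data: p1 and p3 have 4 rows each, p2 has only 2.
df = pd.DataFrame({
    "plasmid": ["p1"] * 4 + ["p2"] * 2 + ["p3"] * 4,
    "x": range(10),
})
expected, n_test = 4, 1  # the article uses 10 and 3

test_parts, train_parts = [], []
# groupby splits the rows once, instead of filtering df for every label.
for name, group in df.groupby("plasmid", sort=False):
    if len(group) != expected:
        continue  # skip plasmids without the expected number of rows
    test_parts.append(group.iloc[:n_test])
    train_parts.append(group.iloc[n_test:])

# A single concat at the end avoids copying the growing result each time.
test = pd.concat(test_parts, ignore_index=True)
train = pd.concat(train_parts, ignore_index=True)
print(test)
print(train)
```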