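.loc[]: inside the brackets the row selector comes first and the column selector second, separated by a comma. Both are labels (row labels and column labels).
.iloc[]: same layout as .loc, rows first, then columns, separated by a comma. The difference is that .iloc indexes by integer position (row number and column number) rather than by label.
.ix accepted both styles, but it was deprecated in pandas 0.20 and removed in pandas 1.0, so use .loc or .iloc instead.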
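The difference between label-based and position-based indexing is easiest to see when the index is not the default 0..n-1. The following is a minimal sketch on a hypothetical toy frame (the column names and index values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame; the index labels deliberately differ from row positions.
df = pd.DataFrame(
    {"Amount": [10.0, 20.0, 30.0], "Class": [0, 1, 0]},
    index=[100, 200, 300],
)

by_label = df.loc[200, "Class"]   # row label 200, column label "Class"
by_position = df.iloc[1, 1]       # second row, second column (same cell)

# Slices differ too: .loc includes the end label, .iloc excludes the end position.
loc_slice = df.loc[100:200, "Amount"]   # labels 100 and 200
iloc_slice = df.iloc[0:2, 0]            # positions 0 and 1

print(by_label, by_position)
print(loc_slice.tolist(), iloc_slice.tolist())
```

Here df.iloc[200, 1] would raise an IndexError, because there is no row at position 200, only a row labeled 200.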
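X=data.loc[:,data.columns != 'Class'] # .loc with a boolean mask over the column labels: every column except 'Class' (the features)
y=data.loc[:,data.columns == 'Class'] # the 'Class' column only (the target), kept as a one-column DataFrame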
#number of data points in the minority class
number_records_fraud=len(data[data.Class==1]) # number of rows with Class == 1 (fraud)
fraud_indices=np.array(data[data.Class==1].index) # index labels of the fraud rows
normal_indices=np.array(data[data.Class==0].index) # index labels of the normal rows (Class == 0)
random_normal_indices=np.random.choice(normal_indices,number_records_fraud,replace=False) # random sample without replacement, so no normal row is picked twice
random_normal_indices=np.array(random_normal_indices) # np.random.choice already returns an ndarray, so this conversion is a no-op
under_sample_indices=np.concatenate([fraud_indices,random_normal_indices]) # combine the fraud rows with the randomly chosen normal rows
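under_sample_data = data.loc[under_sample_indices,:] # these values came from data.index, so they are labels: select with .loc (.iloc gives the same rows only when the index is the default 0..n-1)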
X_undersample=under_sample_data.loc[:,under_sample_data.columns!='Class']
y_undersample=under_sample_data.loc[:,under_sample_data.columns=='Class']
print(X_undersample)
print(y_undersample)
print("Fraction of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Fraction of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))
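Idea: from the majority class A, randomly draw a subset a with as many rows as the minority class B (A -> a).
Then combine a and B and split the result into train and test sets.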
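The two steps of the idea can be sketched end to end as follows. This is only an illustration under stated assumptions: the data is synthetic, the 30% test share is an arbitrary choice, and the split is done by hand with NumPy (scikit-learn's train_test_split is a common alternative for the last step).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 95 normal rows (Class 0) and 5 fraud rows (Class 1).
data = pd.DataFrame({
    "Amount": rng.random(100),
    "Class": [0] * 95 + [1] * 5,
})

fraud_idx = data.index[data["Class"] == 1].to_numpy()    # B: minority class
normal_idx = data.index[data["Class"] == 0].to_numpy()   # A: majority class

# A -> a: draw as many normal rows as there are fraud rows, without replacement.
sampled_normal_idx = rng.choice(normal_idx, size=len(fraud_idx), replace=False)

# Combine a and B, then shuffle so the two classes are interleaved.
balanced_idx = rng.permutation(np.concatenate([fraud_idx, sampled_normal_idx]))
balanced = data.loc[balanced_idx]

# Hold out 30% of the balanced rows as the test set (the ratio is an assumption).
n_test = int(round(0.3 * len(balanced)))
test, train = balanced.iloc[:n_test], balanced.iloc[n_test:]

print(len(train), len(test))
```

Note that .loc is used to pick rows by the sampled index labels, while .iloc is used for the positional cut into train and test.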