Handling Larger Datasets with Pandas

In some competitions the raw training data alone can run to well over ten gigabytes, which is more than a typical personal computer's memory can hold. Pandas can read the original dataset in smaller chunks instead of loading it all at once. The following is reproduced from Zhihu.

import gc
import pandas as pd

# Only the user-related columns are needed, so every other column is dropped per chunk.
user_feat = ['user_id', 'user_gender_id', 'user_age_level', 'user_occupation_id', 'user_star_level']

# iterator=True returns a TextFileReader instead of loading the whole file into memory.
reader = pd.read_csv("./data/round2_train.txt", sep=r"\s+", iterator=True)
chunks = []
loop = True
while loop:
    try:
        # Read 500,000 rows at a time and keep only the user columns.
        chunk = reader.get_chunk(500000)[user_feat]
        chunks.append(chunk)
    except StopIteration:
        loop = False
        print("Iteration is stopped")

df_user = pd.concat(chunks, axis=0, ignore_index=True)
# `test` is the test-set DataFrame, assumed to be loaded elsewhere in the original script.
df_user = pd.concat([df_user, test[user_feat]], axis=0)
df_user.drop_duplicates(subset='user_id', keep='first', inplace=True)
df_user.to_csv('./data/user_file.csv', index=False)
print('user_file', df_user.shape)

# Free the memory before moving on to the next feature group.
del df_user
gc.collect()
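
A more compact variant of the same idea is to pass chunksize to read_csv, which makes it yield DataFrames directly and removes the explicit while/try loop around get_chunk. This is only a sketch under the same assumptions as above (the file path, separator, and column list are taken from the snippet; usecols is added here to avoid even parsing the unused columns):

import pandas as pd

user_feat = ['user_id', 'user_gender_id', 'user_age_level', 'user_occupation_id', 'user_star_level']

chunks = []
# chunksize=500000 makes read_csv return an iterator of DataFrames,
# each holding at most 500,000 rows; usecols keeps memory per chunk small.
for chunk in pd.read_csv("./data/round2_train.txt", sep=r"\s+",
                         usecols=user_feat, chunksize=500000):
    chunks.append(chunk)

df_user = pd.concat(chunks, ignore_index=True)
print(df_user.shape)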

Reposted from blog.csdn.net/Lyteins/article/details/82355572