def reduce_mem_usage(df):
    """Downcast numeric columns of ``df`` in place to the smallest dtype that fits.

    For each non-object column, the minimum/maximum values are compared against
    the representable range of progressively larger dtypes (int8..int64 for
    integer columns, float16..float64 for everything else) and the column is
    cast to the first dtype that can hold it.  Memory usage before and after
    is printed in GB.

    Parameters
    ----------
    df : pandas.DataFrame
        Frame to shrink.  Modified in place.

    Returns
    -------
    pandas.DataFrame
        The same ``df`` object, with downcast columns.

    Notes
    -----
    - BUG FIX: the original used ``np.iinfo`` on float dtypes, which raises
      ``ValueError``; float ranges must be queried with ``np.finfo``.
    - float16 has ~3 significant decimal digits — precision may be lost for
      columns downcast that far (same behavior as the original intent).
    - NOTE(review): columns that are all-NaN yield NaN min/max and are left
      untouched (every range comparison is False) — confirm this is desired.
    """
    start_mem = df.memory_usage().sum() / (1024 ** 3)
    print('Memory usage of dataframe is {:.2f} GB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            min_val = df[col].min()
            max_val = df[col].max()
            if str(col_type).startswith('int'):
                # Smallest-first: stop at the first integer dtype that fits.
                for candidate in (np.int8, np.int16, np.int32, np.int64):
                    info = np.iinfo(candidate)
                    if min_val >= info.min and max_val <= info.max:
                        df[col] = df[col].astype(candidate)
                        break
            else:
                # Float branch: np.finfo (not np.iinfo) gives float ranges.
                for candidate in (np.float16, np.float32, np.float64):
                    info = np.finfo(candidate)
                    if min_val >= info.min and max_val <= info.max:
                        df[col] = df[col].astype(candidate)
                        break

    end_mem = df.memory_usage().sum() / (1024 ** 3)
    print('Memory usage of dataframe is {:.2f} GB'.format(end_mem))
    return df
使用 pandas 的 read_csv 或 read_excel 读取大文件时,读取过程中可能出现 OOM(Out of Memory,内存溢出)。结合 `watch -n 0.1 free -hm` 与已读取行数的占比来观察,会发现所需内存大约超出实际可用内存 10% 左右;此时可通过设置 chunksize 参数进行分块读取(例如每块取总行数的 1/10)。