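Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/leadingsci/article/details/89278102
Quick tip: speeding up Pandas with Modin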
https://python.freelycode.com/contribution/detail/1454
GitHub
https://github.com/modin-project/modin
Usage notes
1. Installation
pip install modin
During installation, a prompt said that cpython needs to be installed.
2. Usage: change one line of code
# import pandas as pd
import modin.pandas as pd
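If the same script has to run on machines where Modin may be missing, the one-line swap can be wrapped in a small check. The following is a minimal sketch, not part of Modin itself: the helper name `pick_dataframe_backend` is made up for illustration, and it only checks whether the packages can be found, not whether an execution engine such as Ray is usable.

```python
import importlib
import importlib.util


def pick_dataframe_backend():
    """Return the module name to import as `pd`, or None if neither is installed.

    Hypothetical helper: prefers Modin and falls back to plain pandas.
    """
    if importlib.util.find_spec("modin") is not None:
        return "modin.pandas"
    if importlib.util.find_spec("pandas") is not None:
        return "pandas"
    return None


backend = pick_dataframe_backend()
if backend is None:
    print("neither modin nor pandas is installed")
else:
    # Equivalent to `import modin.pandas as pd` or `import pandas as pd`.
    pd = importlib.import_module(backend)
    print("using", backend)
```

The rest of the script keeps using `pd` unchanged, which is the same idea as the one-line import swap above.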
Example 1:
import modin.pandas as pd
import numpy as np
frame_data = np.random.randint(0, 100, size=(2**10, 2**8))
df = pd.DataFrame(frame_data)
4. Speed-up
import modin.pandas as pd
df = pd.read_csv("my_dataset.csv")
5. File test
1. File size
-rw-r--r-- 1 toucan toucan 289K Dec 20 17:17 IthaGenes_variations_export_all.csv
2. Reading with pandas
# Run: python read_pandas.py
$cat read_pandas.py
from timeit import default_timer as timer
import pandas as pd
from functools import reduce
# run 2 iterations of read_csv to get an average
time = []
for i in range(0,2):
    start = timer()
    df = pd.read_csv("IthaGenes_variations_export_all.csv")
    end = timer()
    time.append(end - start)
time_read = reduce(lambda x,y : x + y,time)/len(time)
print(time_read)
Output:
$python read_pandas.py
0.009299777299929701
3. Reading with modin
# Run: python read_modin.py
$cat read_modin.py
from timeit import default_timer as timer
import modin.pandas as pd
from functools import reduce
# run 10 iterations of read_csv to get an average
time = []
for i in range(0,10):
    start = timer()
    df = pd.read_csv("IthaGenes_variations_export_all.csv")
    end = timer()
    time.append(end - start)
time_read = reduce(lambda x,y : x + y,time)/len(time)
print(time_read)
Output:
$python read_modin.py
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-13_10-37-43_6323/logs.
Waiting for redis server at 127.0.0.1:35024 to respond...
Waiting for redis server at 127.0.0.1:62923 to respond...
Starting Redis shard with 10.0 GB max memory.
WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 6283886592 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
Starting the Plasma object store with 7.0 GB memory using /tmp.
0.20180192090001584
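Question
Is it because the input file is too small, or because the laptop is short on memory, that Modin shows no advantage here? The virtual machine is configured with 12 GB of memory, but it has only 2 cores and 4 threads.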
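One way to narrow this question down is to look at the first timed call separately from the later ones, since one-time costs (for example file-system caching, or engine start-up if the Modin version in use initializes lazily) may land in the first iteration and distort a plain average. The sketch below is only an illustration under that assumption: `time_runs` and `summarize` are hypothetical helper names, and the workload is a stand-in to be replaced by the actual `pd.read_csv(...)` call.

```python
from timeit import default_timer as timer


def time_runs(func, repeats=5):
    """Call func() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = timer()
        func()
        durations.append(timer() - start)
    return durations


def summarize(durations):
    """Report the first (cold) run apart from the mean of the remaining (warm) runs."""
    first, rest = durations[0], durations[1:]
    warm_mean = sum(rest) / len(rest) if rest else None
    return {"first_run": first, "warm_mean": warm_mean}


if __name__ == "__main__":
    # Stand-in workload; swap in something like
    #   lambda: pd.read_csv("my_dataset.csv")
    # to compare pandas and Modin on a real file.
    print(summarize(time_runs(lambda: sum(range(100_000)), repeats=3)))
```

If the warm mean is still higher with Modin than with pandas, start-up cost alone does not explain the gap, and the small core count becomes the more likely cause.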
Experiment 2
1. File size
Switched to a larger file, 791M.
-rw-rw-r-- 1 toucan toucan 791M Apr 13 10:43 hapmap_3.3_hg19_pop_stratified_af.vcf
2. Reading with pandas
$cat read_pandas.py
from timeit import default_timer as timer
import pandas as pd
from functools import reduce
# run 22 iterations of read_csv to get an average
time = []
for i in range(0,22):
    start = timer()
    df = pd.read_csv("hapmap_3.3_hg19_pop_stratified_af.vcf",sep="\t")
    end = timer()
    time.append(end - start)
time_read = reduce(lambda x,y : x + y,time)/len(time)
print(time_read)
Output:
$python read_pandas.py
12.275232009363746
3. Reading with modin
$cat read_pandas.py
from timeit import default_timer as timer
import modin.pandas as pd
from functools import reduce
# run 22 iterations of read_csv to get an average
time = []
for i in range(0,22):
    start = timer()
    df = pd.read_csv("hapmap_3.3_hg19_pop_stratified_af.vcf",sep="\t")
    end = timer()
    time.append(end - start)
time_read = reduce(lambda x,y : x + y,time)/len(time)
print(time_read)
In top, the job does not run as python, but as
Output:
$python read_pandas.py
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-13_10-52-14_6531/logs.
Waiting for redis server at 127.0.0.1:48416 to respond...
Waiting for redis server at 127.0.0.1:29343 to respond...
Starting Redis shard with 10.0 GB max memory.
WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 6283886592 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
Starting the Plasma object store with 7.0 GB memory using /tmp.
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0413 10:52:51.888617 6546 node_manager.cc:245] Last heartbeat was sent 524 ms ago
W0413 10:52:56.548642 6546 node_manager.cc:245] Last heartbeat was sent 539 ms ago
W0413 10:53:03.599262 6546 node_manager.cc:245] Last heartbeat was sent 532 ms ago
W0413 10:53:07.556165 6546 node_manager.cc:245] Last heartbeat was sent 782 ms ago
W0413 10:53:08.947691 6546 node_manager.cc:245] Last heartbeat was sent 636 ms ago
W0413 10:53:17.075079 6546 node_manager.cc:245] Last heartbeat was sent 643 ms ago
W0413 10:53:19.810811 6546 node_manager.cc:245] Last heartbeat was sent 804 ms ago
W0413 10:53:20.800647 6546 node_manager.cc:245] Last heartbeat was sent 513 ms ago
W0413 10:53:22.806788 6546 node_manager.cc:245] Last heartbeat was sent 699 ms ago
W0413 10:54:00.502030 6546 node_manager.cc:245] Last heartbeat was sent 585 ms ago
W0413 10:54:10.019619 6546 node_manager.cc:245] Last heartbeat was sent 513 ms ago
W0413 10:54:24.286998 6546 node_manager.cc:245] Last heartbeat was sent 732 ms ago
W0413 10:54:28.974217 6546 node_manager.cc:245] Last heartbeat was sent 865 ms ago
W0413 10:54:44.903314 6546 node_manager.cc:245] Last heartbeat was sent 537 ms ago
W0413 10:54:45.480008 6546 node_manager.cc:245] Last heartbeat was sent 576 ms ago
W0413 10:54:50.389829 6546 node_manager.cc:245] Last heartbeat was sent 556 ms ago
W0413 10:54:52.274536 6546 node_manager.cc:245] Last heartbeat was sent 522 ms ago
W0413 10:54:52.873443 6546 node_manager.cc:245] Last heartbeat was sent 599 ms ago
W0413 10:55:15.301537 6546 node_manager.cc:245] Last heartbeat was sent 1008 ms ago
W0413 10:55:16.863193 6546 node_manager.cc:245] Last heartbeat was sent 1129 ms ago
W0413 10:55:18.049829 6546 node_manager.cc:245] Last heartbeat was sent 603 ms ago
W0413 10:55:24.432444 6546 node_manager.cc:245] Last heartbeat was sent 959 ms ago
W0413 10:56:02.659128 6546 node_manager.cc:245] Last heartbeat was sent 643 ms ago
W0413 10:56:09.559237 6546 node_manager.cc:245] Last heartbeat was sent 607 ms ago
W0413 10:56:12.926802 6546 node_manager.cc:245] Last heartbeat was sent 595 ms ago
W0413 10:56:14.754364 6546 node_manager.cc:245] Last heartbeat was sent 830 ms ago
W0413 10:56:17.414083 6546 node_manager.cc:245] Last heartbeat was sent 526 ms ago
W0413 10:56:21.293486 6546 node_manager.cc:245] Last heartbeat was sent 539 ms ago
W0413 10:56:23.624935 6546 node_manager.cc:245] Last heartbeat was sent 576 ms ago
W0413 10:56:25.183625 6546 node_manager.cc:245] Last heartbeat was sent 703 ms ago
W0413 10:57:06.594352 6546 node_manager.cc:245] Last heartbeat was sent 544 ms ago
W0413 10:57:09.569542 6546 node_manager.cc:245] Last heartbeat was sent 693 ms ago
W0413 10:57:12.113721 6546 node_manager.cc:245] Last heartbeat was sent 506 ms ago
W0413 10:57:13.748317 6546 node_manager.cc:245] Last heartbeat was sent 690 ms ago
W0413 10:57:18.617753 6546 node_manager.cc:245] Last heartbeat was sent 1032 ms ago
W0413 10:57:25.745839 6546 node_manager.cc:245] Last heartbeat was sent 580 ms ago
W0413 10:57:38.555815 6546 node_manager.cc:245] Last heartbeat was sent 1772 ms ago
W0413 10:58:38.673028 6546 node_manager.cc:245] Last heartbeat was sent 1484 ms ago
W0413 10:58:59.426427 6546 node_manager.cc:245] Last heartbeat was sent 2125 ms ago
21.333674854863624
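After waiting several minutes, the computed average came out at about 21 seconds.
Conclusion
In tests on this virtual machine, which has few CPU cores, Modin showed no clear advantage and was in fact slower.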
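To put numbers on "slower", the averages printed in the two experiments can be turned into ratios. This is plain arithmetic on the figures reported above; the dictionary layout and the `slowdown` helper are illustrative only. Note that the small-file runs used different iteration counts (2 for pandas, 10 for Modin), so that ratio is a rough indication at best.

```python
# Average read_csv times in seconds, copied from the outputs above.
timings = {
    "289K CSV": {"pandas": 0.009299777299929701, "modin": 0.20180192090001584},
    "791M VCF": {"pandas": 12.275232009363746, "modin": 21.333674854863624},
}


def slowdown(pandas_seconds, modin_seconds):
    """Return how many times longer the Modin run took than the pandas run."""
    return modin_seconds / pandas_seconds


for label, t in timings.items():
    ratio = slowdown(t["pandas"], t["modin"])
    print(f"{label}: modin took {ratio:.1f}x as long as pandas")
```

On these figures Modin took roughly 21.7 times as long on the small file and roughly 1.7 times as long on the large one, so the gap narrows as the file grows even on this 2-core machine.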