Concurrent Programming (3): The culprit behind slow Python — the Global Interpreter Lock (GIL)
Concurrent Programming (4): How to use multithreading; modifying the crawler with threads and comparing results
Concurrent Programming (7): The handy thread pool ThreadPoolExecutor
Concurrent Programming (9): Speeding up programs with multiprocessing
Concurrent Programming (12): Launching any program on your machine with subprocess (playing music, unzipping, auto-downloading, and more)
How asynchronous IO works
Execution flow (execution path) of a single-threaded crawler
As the diagram below shows, when the first task starts waiting on IO, it does not sit idle until that IO finishes (as in the diagram above); instead, execution switches to the second task. This continues until every task is waiting on IO, at which point execution cycles back to the first task, and so on until all tasks are complete.
This is where "the one loop" deserves a mention:

The One Loop

One loop to rule them all,
one loop to find them,
one loop to bring them all,
and in the brightness make them thrive.
In other words, within a single thread the super loop (the event loop) keeps spinning during IO waits, picking up new tasks so that the thread's time is fully utilized. It is this "one loop" that gives us IO multiplexing and genuine concurrency inside a single thread.
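The effect can be seen in miniature with a small asyncio sketch (the delays below are hypothetical stand-ins for real IO waits):

```python
import asyncio
import time

# a coroutine whose sleep stands in for an IO wait
async def task(name: str, delay: float) -> str:
    await asyncio.sleep(delay)   # while awaiting, the loop runs other tasks
    return name

async def main():
    start = time.time()
    # three "IO waits" overlap instead of running back to back
    results = await asyncio.gather(task("a", 0.2), task("b", 0.2), task("c", 0.2))
    print(results)                         # → ['a', 'b', 'c']
    print(round(time.time() - start, 1))   # ~0.2 s, not 0.6 s: the waits overlapped

asyncio.run(main())
```

Total time is roughly one delay, not the sum of all three, because the loop switches to another coroutine whenever one of them awaits.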
Python's async IO library: asyncio
import asyncio

# get the event loop
loop = asyncio.get_event_loop()

# define a coroutine (get_url here stands for any awaitable fetch function)
async def myfunc(url):
    await get_url(url)

# build the task list (urls is assumed to be defined elsewhere)
tasks = [loop.create_task(myfunc(url)) for url in urls]

# run the event loop until every task completes
loop.run_until_complete(asyncio.wait(tasks))
Note:
- Any library you depend on inside async code must itself support async IO.
- For example, the requests library used in our crawler does not support async, so here it has to be swapped for aiohttp.

Code:
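When no async replacement exists for a blocking library, one common workaround is to hand the blocking call to a thread pool with `loop.run_in_executor`, so the event loop itself never stalls. A minimal sketch, using the stdlib `urllib` as a stand-in for requests:

```python
import asyncio
import urllib.request  # blocking stdlib HTTP client, standing in for requests

def blocking_get(url: str) -> bytes:
    # an ordinary synchronous call; calling it directly would stall the loop
    with urllib.request.urlopen(url) as resp:
        return resp.read()

async def fetch(url: str) -> bytes:
    loop = asyncio.get_running_loop()
    # run the blocking call in the default thread pool; the loop stays free
    return await loop.run_in_executor(None, blocking_get, url)
```

This is a fallback, not a true async client: each in-flight call still occupies a pool thread, so aiohttp remains the better fit for high-concurrency crawling.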
# -*- coding: utf-8 -*-
# @Time    : 2021-03-22 17:20:27
# @Author  : wlq
# @FileName: async_test.py
# @Email   : [email protected]

import asyncio
import aiohttp
import time

# URLs to crawl
urls = [
    f"https://w.cnblogs.com/#p{page}"
    for page in range(1, 51)
]

# coroutine (the function executed inside the super loop)
async def async_craw(url):
    print("craw url:", url)
    # create a client session
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            # await the response body
            rst = await resp.text()
            print(url, len(rst))

# get the super loop
loop = asyncio.get_event_loop()
# build the task list
tasks = [loop.create_task(async_craw(url)) for url in urls]

start = time.time()
# run until every task completes
loop.run_until_complete(asyncio.wait(tasks))
end = time.time()
print("use time:", end - start)
'''
output:
craw url: https://w.cnblogs.com/#p1
craw url: https://w.cnblogs.com/#p2
craw url: https://w.cnblogs.com/#p3
craw url: https://w.cnblogs.com/#p4
craw url: https://w.cnblogs.com/#p5
craw url: https://w.cnblogs.com/#p6
craw url: https://w.cnblogs.com/#p7
craw url: https://w.cnblogs.com/#p8
craw url: https://w.cnblogs.com/#p9
craw url: https://w.cnblogs.com/#p10
craw url: https://w.cnblogs.com/#p11
craw url: https://w.cnblogs.com/#p12
craw url: https://w.cnblogs.com/#p13
craw url: https://w.cnblogs.com/#p14
craw url: https://w.cnblogs.com/#p15
craw url: https://w.cnblogs.com/#p16
craw url: https://w.cnblogs.com/#p17
craw url: https://w.cnblogs.com/#p18
craw url: https://w.cnblogs.com/#p19
craw url: https://w.cnblogs.com/#p20
craw url: https://w.cnblogs.com/#p21
craw url: https://w.cnblogs.com/#p22
craw url: https://w.cnblogs.com/#p23
craw url: https://w.cnblogs.com/#p24
craw url: https://w.cnblogs.com/#p25
craw url: https://w.cnblogs.com/#p26
craw url: https://w.cnblogs.com/#p27
craw url: https://w.cnblogs.com/#p28
craw url: https://w.cnblogs.com/#p29
craw url: https://w.cnblogs.com/#p30
craw url: https://w.cnblogs.com/#p31
craw url: https://w.cnblogs.com/#p32
craw url: https://w.cnblogs.com/#p33
craw url: https://w.cnblogs.com/#p34
craw url: https://w.cnblogs.com/#p35
craw url: https://w.cnblogs.com/#p36
craw url: https://w.cnblogs.com/#p37
craw url: https://w.cnblogs.com/#p38
craw url: https://w.cnblogs.com/#p39
craw url: https://w.cnblogs.com/#p40
craw url: https://w.cnblogs.com/#p41
craw url: https://w.cnblogs.com/#p42
craw url: https://w.cnblogs.com/#p43
craw url: https://w.cnblogs.com/#p44
craw url: https://w.cnblogs.com/#p45
craw url: https://w.cnblogs.com/#p46
craw url: https://w.cnblogs.com/#p47
craw url: https://w.cnblogs.com/#p48
craw url: https://w.cnblogs.com/#p49
craw url: https://w.cnblogs.com/#p50
https://w.cnblogs.com/#p9 70107
https://w.cnblogs.com/#p16 70107
https://w.cnblogs.com/#p10 70107
https://w.cnblogs.com/#p2 70107
https://w.cnblogs.com/#p4 70107
https://w.cnblogs.com/#p18 70107
https://w.cnblogs.com/#p5 70107
https://w.cnblogs.com/#p3 70107
https://w.cnblogs.com/#p13 70107
https://w.cnblogs.com/#p32 70107
https://w.cnblogs.com/#p7 70107
https://w.cnblogs.com/#p12 70107
https://w.cnblogs.com/#p43 70107
https://w.cnblogs.com/#p21 70107
https://w.cnblogs.com/#p15 70107
https://w.cnblogs.com/#p8 70107
https://w.cnblogs.com/#p22 70107
https://w.cnblogs.com/#p26 70107
https://w.cnblogs.com/#p37 70107
https://w.cnblogs.com/#p17 70107
https://w.cnblogs.com/#p19 70107
https://w.cnblogs.com/#p28 70107
https://w.cnblogs.com/#p1 70107
https://w.cnblogs.com/#p14 70107
https://w.cnblogs.com/#p45 70107
https://w.cnblogs.com/#p42 70107
https://w.cnblogs.com/#p34 70107
https://w.cnblogs.com/#p6 70107
https://w.cnblogs.com/#p36 70107
https://w.cnblogs.com/#p31 70107
https://w.cnblogs.com/#p39 70107
https://w.cnblogs.com/#p24 70107
https://w.cnblogs.com/#p50 70107
https://w.cnblogs.com/#p40 70107
https://w.cnblogs.com/#p29 70107
https://w.cnblogs.com/#p30 70107
https://w.cnblogs.com/#p48 70107
https://w.cnblogs.com/#p46 70107
https://w.cnblogs.com/#p27 70107
https://w.cnblogs.com/#p38 70107
https://w.cnblogs.com/#p25 70107
https://w.cnblogs.com/#p44 70107
https://w.cnblogs.com/#p23 70107
https://w.cnblogs.com/#p49 70107
https://w.cnblogs.com/#p47 70107
https://w.cnblogs.com/#p11 70107
https://w.cnblogs.com/#p35 70107
https://w.cnblogs.com/#p20 70107
https://w.cnblogs.com/#p41 70107
https://w.cnblogs.com/#p33 70107
use time: 0.3919532299041748
'''
From part four of this series we already have the crawl times for the single-threaded and multithreaded versions of this crawler. Comparing them with this single-threaded async IO run, async IO is the fastest of the three, because there is no thread-switching overhead.
Using semaphores (Semaphore)
- A semaphore (also known as a flag signal) is a synchronization object that maintains a counter between 0 and some maximum value.
- How it works:
  - when a thread completes a wait (acquire) on the semaphore, the counter is decremented by one;
  - when a thread completes a release on the semaphore, the counter is incremented by one;
  - when the counter reaches 0, further waits on the semaphore no longer succeed until it becomes signaled again;
  - the semaphore is in the signaled state while its counter is greater than 0, and in the nonsignaled state when the counter is 0.
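The counter semantics above can be observed directly with `asyncio.Semaphore` (this small demo is mine, not from the crawler):

```python
import asyncio

async def main():
    sem = asyncio.Semaphore(2)   # counter starts at 2 (signaled)
    await sem.acquire()          # counter: 2 -> 1
    await sem.acquire()          # counter: 1 -> 0 (nonsignaled)
    print(sem.locked())          # → True: another acquire would now block
    sem.release()                # counter: 0 -> 1 (signaled again)
    print(sem.locked())          # → False

asyncio.run(main())
```

`locked()` reports whether an acquire would block, i.e. whether the counter has hit 0.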
Semaphore usage:

# style 1: async context manager
sem = asyncio.Semaphore(10)

# ... later
async with sem:
    # work with shared resource
    ...

# style 2: explicit acquire/release
sem = asyncio.Semaphore(10)

# ... later
await sem.acquire()
try:
    # work with shared resource
    ...
finally:
    sem.release()
Example (the crawler code above, modified to use a semaphore):

# -*- coding: utf-8 -*-
# @Time    : 2021-03-22 17:20:27
# @Author  : wlq
# @FileName: async_test.py
# @Email   : [email protected]

import asyncio
import aiohttp
import time

# initialize the semaphore: at most 10 requests in flight at once
semaphore = asyncio.Semaphore(10)

# URLs to crawl
urls = [
    f"https://w.cnblogs.com/#p{page}"
    for page in range(1, 51)
]

# coroutine (the function executed inside the super loop)
async def async_craw(url):
    async with semaphore:
        print("craw url:", url)
        # create a client session
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                # await the response body
                rst = await resp.text()
                await asyncio.sleep(5)
                print(url, len(rst))

# get the super loop
loop = asyncio.get_event_loop()
# build the task list
tasks = [loop.create_task(async_craw(url)) for url in urls]

start = time.time()
# run until every task completes
loop.run_until_complete(asyncio.wait(tasks))
end = time.time()
print("use time:", end - start)

'''
output:
craw url: https://w.cnblogs.com/#p1
craw url: https://w.cnblogs.com/#p2
craw url: https://w.cnblogs.com/#p3
craw url: https://w.cnblogs.com/#p4
craw url: https://w.cnblogs.com/#p5
craw url: https://w.cnblogs.com/#p6
craw url: https://w.cnblogs.com/#p7
craw url: https://w.cnblogs.com/#p8
craw url: https://w.cnblogs.com/#p9
craw url: https://w.cnblogs.com/#p10
https://w.cnblogs.com/#p4 70111
https://w.cnblogs.com/#p8 70111
https://w.cnblogs.com/#p6 70111
https://w.cnblogs.com/#p10 70111
https://w.cnblogs.com/#p3 70111
https://w.cnblogs.com/#p2 70111
https://w.cnblogs.com/#p5 70111
https://w.cnblogs.com/#p1 70111
https://w.cnblogs.com/#p7 70111
craw url: https://w.cnblogs.com/#p11
craw url: https://w.cnblogs.com/#p12
craw url: https://w.cnblogs.com/#p13
craw url: https://w.cnblogs.com/#p14
craw url: https://w.cnblogs.com/#p15
craw url: https://w.cnblogs.com/#p16
craw url: https://w.cnblogs.com/#p17
craw url: https://w.cnblogs.com/#p18
craw url: https://w.cnblogs.com/#p19
https://w.cnblogs.com/#p9 70111
craw url: https://w.cnblogs.com/#p20
https://w.cnblogs.com/#p13 70111
https://w.cnblogs.com/#p15 70111
craw url: https://w.cnblogs.com/#p21
craw url: https://w.cnblogs.com/#p22
https://w.cnblogs.com/#p18 70111
https://w.cnblogs.com/#p16 70111
https://w.cnblogs.com/#p19 70111
https://w.cnblogs.com/#p17 70111
https://w.cnblogs.com/#p11 70111
https://w.cnblogs.com/#p14 70111
craw url: https://w.cnblogs.com/#p23
craw url: https://w.cnblogs.com/#p24
craw url: https://w.cnblogs.com/#p25
craw url: https://w.cnblogs.com/#p26
craw url: https://w.cnblogs.com/#p27
craw url: https://w.cnblogs.com/#p28
https://w.cnblogs.com/#p12 70111
craw url: https://w.cnblogs.com/#p29
https://w.cnblogs.com/#p20 70111
craw url: https://w.cnblogs.com/#p30
https://w.cnblogs.com/#p22 70111
craw url: https://w.cnblogs.com/#p31
https://w.cnblogs.com/#p27 70111
craw url: https://w.cnblogs.com/#p32
https://w.cnblogs.com/#p21 70111
https://w.cnblogs.com/#p29 70111
https://w.cnblogs.com/#p25 70111
https://w.cnblogs.com/#p24 70111
https://w.cnblogs.com/#p26 70111
https://w.cnblogs.com/#p28 70111
https://w.cnblogs.com/#p23 70111
craw url: https://w.cnblogs.com/#p33
craw url: https://w.cnblogs.com/#p34
craw url: https://w.cnblogs.com/#p35
craw url: https://w.cnblogs.com/#p36
craw url: https://w.cnblogs.com/#p37
craw url: https://w.cnblogs.com/#p38
craw url: https://w.cnblogs.com/#p39
https://w.cnblogs.com/#p30 70111
craw url: https://w.cnblogs.com/#p40
https://w.cnblogs.com/#p31 70111
craw url: https://w.cnblogs.com/#p41
https://w.cnblogs.com/#p32 70111
https://w.cnblogs.com/#p38 70111
craw url: https://w.cnblogs.com/#p42
craw url: https://w.cnblogs.com/#p43
https://w.cnblogs.com/#p37 70111
https://w.cnblogs.com/#p34 70111
https://w.cnblogs.com/#p35 70111
https://w.cnblogs.com/#p36 70111
https://w.cnblogs.com/#p39 70111
https://w.cnblogs.com/#p33 70111
craw url: https://w.cnblogs.com/#p44
craw url: https://w.cnblogs.com/#p45
craw url: https://w.cnblogs.com/#p46
craw url: https://w.cnblogs.com/#p47
craw url: https://w.cnblogs.com/#p48
craw url: https://w.cnblogs.com/#p49
https://w.cnblogs.com/#p40 70111
craw url: https://w.cnblogs.com/#p50
https://w.cnblogs.com/#p41 70111
https://w.cnblogs.com/#p42 70111
https://w.cnblogs.com/#p43 70111
https://w.cnblogs.com/#p44 70111
https://w.cnblogs.com/#p45 70111
https://w.cnblogs.com/#p48 70111
https://w.cnblogs.com/#p49 70111
https://w.cnblogs.com/#p50 70111
https://w.cnblogs.com/#p46 70111
https://w.cnblogs.com/#p47 70111
use time: 26.094701528549194
'''
Because the code adds a 5-second wait and the semaphore caps concurrency at 10, only the first ten URLs are fetched at the start; the rest proceed in waves as semaphore slots free up.
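A quick back-of-the-envelope check confirms the measured time is dominated by those waits:

```python
import math

tasks, limit, sleep_s = 50, 10, 5
# 50 tasks at 10 concurrent -> ceil(50 / 10) = 5 waves, each ~5 s of sleep
waves = math.ceil(tasks / limit)
print(waves * sleep_s)   # → 25, close to the measured 26.09 s
```

The extra ~1 second over the ideal 25 s is the actual network IO, which overlaps almost entirely with the sleeps.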