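I wanted to read a manga, but for some reason the images would not display on the web page.
So I had no choice but to download the manga and read it at my own pace.
The site's structure is simple: index -> chapter -> page
Index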
https://www.dagumanhua.com/manhua/3883/
Chapters
The link to every chapter is in the HTML of the index page above.
<div class="cy_plist" id="play_0">
<ul>
<li><a href="/manhua/3883/623532.html" title="第813话 八品炼药师(上)" target="_blank"><p>第813话 八品炼药师(上)</p><i></i></a></li>
<li><a href="/manhua/3883/623530.html" title="第814话 八品炼药师(下)" target="_blank"><p>第814话 八品炼药师(下)</p><i></i></a></li>
<li><a href="/manhua/3883/622052.html" title="第812话 熊的宝藏(下)" target="_blank"><p>第812话 熊的宝藏(下)</p><i></i></a></li>
<li><a href="/manhua/3883/622051.html" title="第811话 熊的宝藏(上)" target="_blank"><p>第811话 熊的宝藏(上)</p><i></i></a></li>
<li><a href="/manhua/3883/619107.html" title="第810话 山脉之主(下)" target="_blank"><p>第810话 山脉之主(下)</p><i></i></a></li>
<li><a href="/manhua/3883/619105.html" title="第809话 山脉之主(上)" target="_blank"><p>第809话 山脉之主(上)</p><i></i></a></li>
<li><a href="/manhua/3883/617623.html" title="第808话 觅宝(下)" target="_blank"><p>第808话 觅宝(下)</p><i></i></a></li>
Here I parse the page with lxml to get all the chapter links.
html.xpath('//*[@id="play_0"]/ul/li/a/@href')
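Because the links returned are relative paths, each one has to be joined with the site root to get an absolute URL, for example: https://www.dagumanhua.com/manhua/3883/623532.html
Pages
The link to every page of a chapter is in the HTML of the chapter page above.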
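The joining step can be sketched with the standard library instead of string concatenation. This is only a minimal illustration, assuming Python 3; the name `INDEX_URL` and the two-item list are mine, with the hrefs copied from the HTML shown earlier.

```python
from urllib.parse import urljoin

# Index page URL from the post; the constant name is an assumption.
INDEX_URL = 'https://www.dagumanhua.com/manhua/3883/'

# Relative hrefs of the kind the XPath query returns.
relative_links = ['/manhua/3883/623532.html', '/manhua/3883/623530.html']

# urljoin resolves each root-relative href against the index URL.
chapter_urls = [urljoin(INDEX_URL, href) for href in relative_links]
print(chapter_urls[0])
```

Using `urljoin` should also keep working if the site ever switches to relative hrefs without a leading slash, which plain concatenation would not.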
<div class="pages"> <b>1</b> <a href="/manhua/3883/623532_2.html">2</a> <a href="/manhua/3883/623532_3.html">3</a> <a href="/manhua/3883/623532_2.html">下一页</a></div>
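Only 3 pages are listed here, but that is enough to see the pattern in the page URLs.
From the second page on, the URL is the first page's URL with an underscore and the page number inserted before .html.
Since only 3 pages are listed, we need another way to get the total number of pages in each chapter. That can also be found in the same HTML.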
function dPlayPre(){
var prepage = 1 -1;
var totalpage = 33;
It sits inside a JavaScript function; the totalpage variable is the value we need.
A regular expression can extract it:
re.compile(r'function.*?prepage.*?totalpage = (.*?);', re.S)
What remains is just a loop over the pages, so there is not much more to say.
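To make the two steps above concrete, here is a rough sketch that pulls totalpage out with the regular expression from the text and then builds every page URL by the underscore rule. The function names `total_pages` and `page_urls` and the `sample` string are my own; only the pattern and the JavaScript fragment come from the post.

```python
import re

# Pattern from the text: capture the value assigned to totalpage.
TOTAL_PAGE_RE = re.compile(r'function.*?prepage.*?totalpage = (.*?);', re.S)

def total_pages(html_text):
    """Return the page count found in the chapter page's JavaScript."""
    match = TOTAL_PAGE_RE.search(html_text)
    if match is None:
        raise ValueError('totalpage not found')
    return int(match.group(1))

def page_urls(first_page_url, total):
    """Page 1 is the chapter URL; page n inserts _n before the extension."""
    base, ext = first_page_url.rsplit('.', 1)
    urls = [first_page_url]
    for n in range(2, total + 1):
        urls.append('%s_%d.%s' % (base, n, ext))
    return urls

# Fragment of the chapter page shown above.
sample = """function dPlayPre(){
var prepage = 1 -1;
var totalpage = 33;
"""

total = total_pages(sample)
urls = page_urls('https://www.dagumanhua.com/manhua/3883/623532.html', total)
print(total, urls[1])
```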
fake_useragent
This manga site is simple, so the only precaution I took was a randomly changing User-Agent.
This time I tried the fake_useragent module.
from fake_useragent import UserAgent

def get_headers():
    ua = UserAgent()  # instantiating needs network access, and the source site is not very stable
    return {'User-Agent': ua.random}
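If you run into problems using it, see https://blog.csdn.net/qq_38251616/article/details/86751142
Multiprocess download
The manga I want to read has a lot of images, and downloading them one at a time would take forever, so I decided to download with multiple processes.
For this I chose the multiprocessing module.
First, check the number of CPU cores: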
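Because instantiating UserAgent can fail, one possible safeguard is to fall back to a small fixed list. This is only a sketch under my own assumptions: the function name `get_headers_with_fallback` and the two User-Agent strings are illustrative and do not come from the original code.

```python
import random

# Illustrative fallback strings (assumption, not from the original post).
FALLBACK_USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

def get_headers_with_fallback():
    """Use fake_useragent when it works, otherwise pick a fixed string."""
    try:
        from fake_useragent import UserAgent
        return {'User-Agent': UserAgent().random}
    except Exception:
        # Covers a missing package as well as a failed data fetch.
        return {'User-Agent': random.choice(FALLBACK_USER_AGENTS)}

headers = get_headers_with_fallback()
```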
>>> from multiprocessing import cpu_count
>>> print(cpu_count())
4
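4 cores, so I'll start 4 processes.
Below is the image download function. Each downloaded image is saved, in order, into the directory for its chapter.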
import os
import requests

def download_img(img_url, sn, chapter_direction):
    print('image= ' + img_url)
    response = requests.get(img_url, headers=get_headers(), timeout=5)
    response.encoding = 'UTF-8'
    # Save the file as 1.jpg, 2.jpg, ... numbered in page order
    image_file = os.path.join(chapter_direction, str(sn) + '.jpg')
    with open(image_file, 'wb') as f:
        f.write(response.content)
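The function above expects the chapter directory to exist already. A small helper along the following lines could create it on demand and build the numbered file name. The name `image_path` is hypothetical, the sketch assumes Python 3, and the temporary directory is there only to make the demo self-contained.

```python
import os
import tempfile

def image_path(chapter_dir, sn):
    """Return <chapter_dir>/<sn>.jpg, creating the directory if needed."""
    # exist_ok=True avoids an error when several workers race to create it.
    os.makedirs(chapter_dir, exist_ok=True)
    return os.path.join(chapter_dir, '%d.jpg' % sn)

# Demo in a throwaway directory.
demo_root = tempfile.mkdtemp()
demo_chapter = os.path.join(demo_root, 'chapter_1')
demo_file = image_path(demo_chapter, 3)
print(demo_file)
```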
The multiprocess setup for the download looks like this:
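from multiprocessing import Pool

i = 0  # page counter, also used as the image file number
try:
    p = Pool(4)  # number of worker processes in the pool
    for page_list_url_i in page_list_url:  # relative link of each page
        # full URL of the page
        page_list_url_full = 'https://www.dagumanhua.com' + str(page_list_url_i)
        print('page %d of %d = %s' % (i + 1, len(page_list_url), page_list_url_full))
        page_text = get_html(page_list_url_full)  # HTML of the page
        img_url1 = image_url(page_text).strip()  # image URL on the page
        i = i + 1
        # download_img(img_url1, i, chapter_name)  # sequential download of each image
        # Download the image for this page.
        # apply_async is non-blocking: it does not wait for the worker to finish,
        # so the main process keeps going and the OS schedules the processes.
        p.apply_async(download_img, args=(img_url1, i, chapter_name))
        # time.sleep(0.5)
    print('Waiting for all subprocesses done...')
    p.close()  # stop accepting new tasks
    p.join()  # main process waits for all workers in the pool to finish
    print('All subprocesses done.')
except KeyboardInterrupt:  # Ctrl-C
    print('parent received control-c')
    p.terminate()
    p.join()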
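One thing worth knowing about apply_async: an exception raised inside a worker (a timeout, for example) is not reported unless you keep the returned result object and call get() on it. The sketch below illustrates that pattern. To stay runnable anywhere it swaps in two things that are not in the original: `multiprocessing.pool.ThreadPool`, which offers the same apply_async/close/join interface but uses threads, and a hypothetical `fake_download` stand-in that does no network access.

```python
from multiprocessing.pool import ThreadPool

def fake_download(img_url, sn):
    # Hypothetical stand-in for download_img: fails on purpose for one URL.
    if img_url.endswith('bad.jpg'):
        raise IOError('download failed: ' + img_url)
    return sn

def run_all(img_urls, workers=4):
    pool = ThreadPool(workers)
    pending = []
    for sn, img_url in enumerate(img_urls, 1):
        result = pool.apply_async(fake_download, args=(img_url, sn))
        pending.append((sn, result))
    pool.close()
    pool.join()
    done, failed = [], []
    for sn, result in pending:
        try:
            done.append(result.get())  # re-raises the worker's exception
        except IOError:
            failed.append(sn)
    return done, failed

done, failed = run_all(['a.jpg', 'bad.jpg', 'c.jpg'])
print(done, failed)
```

With a real Pool of processes the same pattern should apply, but on Windows the pool has to be created under an `if __name__ == '__main__':` guard.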
Here is part of the output from a multiprocess run:
chapter_list_url_full= https://www.dagumanhua.com/manhua/3883/155457.html
chapter_direction= C:\todd\python_files\website\斗破苍穹\斗破苍穹 第2话 陨落的天才(中)
total pages= 12
page 1 of 12 = https://www.dagumanhua.com/manhua/3883/155457.html
page 2 of 12 = https://www.dagumanhua.com/manhua/3883/155457_2.html
page 3 of 12 = https://www.dagumanhua.com/manhua/3883/155457_3.html
image= http://img.baidu.com.manhuapi.com/c/20170823/rbuys5alm0v.jpg
image= http://img.baidu.com.manhuapi.com/c/20170823/dybfvnaha2o.jpg
pagei mage=4 hottp://img.baidu.com.manhuapi.com/c/20170823/mondjh1qz1y.jpgf
12 = https://www.dagumanhua.com/manhua/3883/155457_4.html
pagei mage=5 hottp://img.baidu.com.manhuapi.com/c/20170823/ry1z3mv0yvr.jpgf
12 = https://www.dagumanhua.com/manhua/3883/155457_5.html
page 6 of 12 = https://www.dagumanhua.com/manhua/3883/155457_6.html
page 7 of 12 = https://www.dagumanhua.com/manhua/3883/155457_7.html
page 8 of 12 = https://www.dagumanhua.com/manhua/3883/155457_8.html
page 9 of 12 = https://www.dagumanhua.com/manhua/3883/155457_9.html
page 10 of 12 = https://www.dagumanhua.com/manhua/3883/155457_10.html
image= http://img.baidu.com.manhuapi.com/c/20170823/bsgtncyxbyj.jpg
page 11 of 12 = https://www.dagumanhua.com/manhua/3883/155457_11.html
image= http://img.baidu.com.manhuapi.com/c/20170823/zjsxf1byjri.jpg
image= http://img.baidu.com.manhuapi.com/c/20170823/bmsnlb21rbe.jpg
page 12 of 12 = https://www.dagumanhua.com/manhua/3883/155457_12.html
Waiting for all subprocesses done...
image= http://img.baidu.com.manhuapi.com/c/20170823/rg4mmujk1ev.jpg
image= http://img.baidu.com.manhuapi.com/c/20170823/5nm144gvxad.jpg
image= http://img.baidu.com.manhuapi.com/c/20170823/j4mdidgwk1c.jpg
image= http://img.baidu.com.manhuapi.com/c/20170823/kbqkitmkqnx.jpg
image= http://img.baidu.com.manhuapi.com/c/20170823/fkhtyne05fb.jpg
All subprocesses done.
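A few of those lines look odd, such as the ones below. That happens because different processes were printing at the same time, so their output got interleaved.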
pagei mage=4 hottp://img.baidu.com.manhuapi.com/c/20170823/mondjh1qz1y.jpgf
12 = https://www.dagumanhua.com/manhua/3883/155457_4.html
They are really the following 2 lines mixed together. Compare them character by character and it is easy to see.
page 4 of 12 = https://www.dagumanhua.com/manhua/3883/155457_4.html
image= http://img.baidu.com.manhuapi.com/c/20170823/mondjh1qz1y.jpg
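One way to make such mixing less likely is to build the whole line first and hand it to stdout in a single write, rather than letting print emit the pieces separately. This is a sketch only: the helper name `log_line` is mine, and a single write reduces interleaving between processes but does not strictly guarantee it is gone.

```python
import sys

def log_line(*parts):
    """Join the parts into one line and emit it with a single write."""
    line = ' '.join(str(p) for p in parts) + '\n'
    sys.stdout.write(line)
    sys.stdout.flush()
    return line

written = log_line('page', 4, 'of', 12, '=',
                   'https://www.dagumanhua.com/manhua/3883/155457_4.html')
```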
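Without multiprocessing, the output looks like the following. Compare it with the multiprocess output above and it becomes clear what multiprocessing is doing.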
chapter_list_url_full= https://www.dagumanhua.com/manhua/3883/155456.html
chapter_direction= C:\todd\python_files\website\斗破苍穹\斗破苍穹 第1话 陨落的天才(上)
total pages= 13
page 1 of 13 = https://www.dagumanhua.com/manhua/3883/155456.html
image= http://img.baidu.com.manhuapi.com/c/20170823/umyw23z5sr0.jpg
page 2 of 13 = https://www.dagumanhua.com/manhua/3883/155456_2.html
image= http://img.baidu.com.manhuapi.com/c/20170823/hs2qj4l4p4t.jpg
page 3 of 13 = https://www.dagumanhua.com/manhua/3883/155456_3.html
image= http://img.baidu.com.manhuapi.com/c/20170823/ty4slqkzpez.jpg
page 4 of 13 = https://www.dagumanhua.com/manhua/3883/155456_4.html
image= http://img.baidu.com.manhuapi.com/c/20170823/xq0gckj3yr2.jpg
page 5 of 13 = https://www.dagumanhua.com/manhua/3883/155456_5.html
image= http://img.baidu.com.manhuapi.com/c/20170823/i2a1ux2e5l3.jpg
page 6 of 13 = https://www.dagumanhua.com/manhua/3883/155456_6.html
image= http://img.baidu.com.manhuapi.com/c/20170823/wy1lqagtgoy.jpg
page 7 of 13 = https://www.dagumanhua.com/manhua/3883/155456_7.html
image= http://img.baidu.com.manhuapi.com/c/20170823/fcczrky2fdu.jpg
page 8 of 13 = https://www.dagumanhua.com/manhua/3883/155456_8.html
image= http://img.baidu.com.manhuapi.com/c/20170823/ej4grtbpyzb.jpg
page 9 of 13 = https://www.dagumanhua.com/manhua/3883/155456_9.html
image= http://img.baidu.com.manhuapi.com/c/20170823/gagpydhkstj.jpg
page 10 of 13 = https://www.dagumanhua.com/manhua/3883/155456_10.html
image= http://img.baidu.com.manhuapi.com/c/20170823/xzt3rktlecu.jpg
page 11 of 13 = https://www.dagumanhua.com/manhua/3883/155456_11.html
image= http://img.baidu.com.manhuapi.com/c/20170823/kgwbini0x4k.jpg
page 12 of 13 = https://www.dagumanhua.com/manhua/3883/155456_12.html
image= http://img.baidu.com.manhuapi.com/c/20170823/e5sn5wwvslv.jpg
page 13 of 13 = https://www.dagumanhua.com/manhua/3883/155456_13.html
image= http://img.baidu.com.manhuapi.com/c/20170823/fmgzeu12n1x.jpg
For the full code, see https://download.csdn.net/download/weixin_42555985/12033409