There's been a lot going on at work lately and my head has been a bit scattered; it feels like time to calm down and focus.
I had been meaning to write a crawler for the vendor list on Butian (补天), and I happened to come across a post on a forum, so I reworked it; call it a derivative of that post. My own additions are multiprocessing and automatic detection of the last page number.
#coding=utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import multiprocessing
import re
import requests as req
from bs4 import BeautifulSoup  # needs lxml installed for the "lxml" parser

def Spide(url):
    """Fetch one list page and append every vendor name on it to 360.txt."""
    try:
        html = req.get(url, timeout=60).text
        print url
        html = html.encode('utf-8')
        # Vendor names sit in cells like
        # <td align="left" style="padding-left:20px;">Name</td>.
        # The non-greedy capture keeps each match inside a single cell.
        pat = '<td align="left" style="padding-left:20px;">(.*?)</td>'
        for name in re.findall(pat, html):
            with open('360.txt', 'a+') as f:
                f.write(name.strip() + '\n')
    except Exception, e:
        # don't let one bad page kill the worker, but don't hide the error either
        print e

def get_page(url):
    """Read the pager on the first list page and return the last page number."""
    a = req.get(url)
    if a.status_code == 200:
        soup = BeautifulSoup(a.text, "lxml")
        # the last <a> inside div.pages links to the final page;
        # the trailing path segment of its href is the page number
        pages = soup.select("div.pages > a")[-1].get('href').split('/')[-1]
        return pages

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4)
    url_list = []
    url = 'https://butian.360.cn/company/lists/page/1'
    page = int(get_page(url))
    # range(1, page) would skip the last page, hence page + 1
    for i in range(1, page + 1):
        url_list.append('https://butian.360.cn/company/lists/page/' + str(i))
    pool.map(Spide, url_list)
    pool.close()
    pool.join()
    print("Done!")
The output looks like this; not bad at all: