Crawling all Python-related articles on Jobbole (伯乐在线) with pyspider

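A somewhat useful Chinese translation of the pyspider documentation (about the same quality as Google Translate; worth considering if you don't have the Google Translate extension)
The official English documentation (perfectly readable after Google Translate; unlike the official Python docs, third-party library docs tend to be friendlier)
Python-related articles on Jobbole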

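If you want to become proficient with pyspider, read the official documentation from start to finish before using it. (It helps a great deal and avoids most problems with details.)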

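Having learned requests and aiohttp, it was time to pick up a framework, and the two most hyped are scrapy and pyspider. I tried both and found that pyspider, written by a Chinese developer, is deeply hostile to Windows (the docs say it has not been tested on Windows, though it may work). Articles about it on Baidu are also scarce, so I hit many baffling errors (described later) with no way to resolve them, and going through the official docs several times did not help much.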

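After a few days of filling in holes and thinking it over: (I made the classic beginner's mistake: searching Baidu the moment something went wrong, without checking the official docs or thinking for myself. When Baidu came up empty, I only skimmed the official docs and missed many details. The skimming was the worse of the two. For occasional use, looking up just what you need is fine, but I wanted to learn pyspider systematically, so I should have read the docs in full.) (I should also think first and not lean on Baidu so much.)
~~(For anyone about to learn a crawler framework, I think it is better to start with scrapy: when something breaks, at least plenty of people on Baidu can help you. Once you know scrapy, pyspider is a breeze.)~~ ~~(Nonsense I wrote before reading the docs carefully; just skip it.)~~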

Code first:

from pyspider.libs.base_handler import *
import pymongo


class Handler(BaseHandler):
    crawl_config = {
        'itag':'v1.4'
    }
    client = pymongo.MongoClient()
    db = client['jobbole']
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://python.jobbole.com/category/news/', callback=self.index_page)
        self.crawl('http://python.jobbole.com/category/basic/', callback=self.index_page)
        self.crawl('http://python.jobbole.com/category/guide/', callback=self.index_page)
        self.crawl('http://python.jobbole.com/category/project/', callback=self.index_page)
        self.crawl('http://python.jobbole.com/category/tools/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for i in response.doc('.archive-title').items():
            self.crawl(i.attr.href, callback=self.detail_page)
        next_url = response.doc('.next').attr.href
        self.crawl(next_url, callback=self.index_page)

    @config(priority=2)
    def detail_page(self, response):
        title = response.doc('.entry-header').text()
        temp = response.doc('.entry-meta-hide-on-mobile').text()
        date = temp.split(' ')[0]
        tag = temp.split(' ', 1)[1].replace('· ', '')
        if '评论' in tag:
            tag = tag.split(' ', 1)[0] + ' ' + tag.split(' ', 3)[-1]
        return {
            'title':title,
            'url': response.url,
            'date': date,
            'tag':tag
        }
    
    def on_result(self, result):
        if result:
            dbset = result['tag'].split(' ', 1)[0]
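            # Upsert keyed on the URL; update_one replaces Collection.update, which PyMongo 4 removed
            self.db[dbset].update_one({'url': result['url']}, {'$set': result}, upsert=True)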

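I hit a pitfall here: the first run worked, but a second run did nothing. The fix is to add crawl_config = {'itag': 'v1.8'} and change the itag value before each rerun. itag is unset by default. pyspider compares a task's itag with the value stored from the previous crawl: if it is unchanged and the task has not yet outlived its age, the new task is discarded; if it has changed, the URL is crawled again.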
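
To make the rule concrete, here is a rough model of the decision for a URL that has been crawled before. It is a simplified sketch based on the documented meaning of `age`, `itag` and `force_update`, not pyspider's actual scheduler code; `should_recrawl` and the dictionary layout are names made up for this illustration.

```python
import time


def should_recrawl(new_schedule, old_task, now=None):
    """Toy model: decide whether an already-seen URL gets crawled again."""
    now = time.time() if now is None else now
    old_schedule = old_task.get('schedule', {})

    # A changed itag forces a recrawl regardless of age.
    itag = new_schedule.get('itag')
    if itag and itag != old_schedule.get('itag'):
        return True

    # age is the validity period in seconds; -1 (the documented default) never expires.
    age = new_schedule.get('age', -1)
    if age >= 0 and old_task.get('lastcrawltime', 0) + age < now:
        return True

    # Otherwise only an explicit force_update restarts the task.
    return bool(new_schedule.get('force_update'))


TEN_DAYS = 10 * 24 * 60 * 60
previous = {'schedule': {'itag': 'v1.4'}, 'lastcrawltime': 1000.0}

# Same itag, one minute later: still within age, so the task is ignored.
print(should_recrawl({'itag': 'v1.4', 'age': TEN_DAYS}, previous, now=1060.0))
# Bumped itag: crawled again even though age has not expired.
print(should_recrawl({'itag': 'v1.8', 'age': TEN_DAYS}, previous, now=1060.0))
```

In this model the second run is silently dropped because index_page sets a ten-day age, which matches the "no reaction" symptom above.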

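This runs perfectly, but the code can be simplified, because http://python.jobbole.com/all-posts already contains everything on the five category URLs. Crawling that one URL is simpler, and the listing page already carries all the information needed, so there is no reason to visit the second-level detail pages.
That led to the second version: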

from pyspider.libs.base_handler import *
import pymongo

class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v1.8'
    }
    client = pymongo.MongoClient()
    db = client['jobbole1']
    
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://python.jobbole.com/all-posts/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        self.crawl(response.url, callback=self.detail_page)
        next_url = response.doc('.next').attr.href
        self.crawl(next_url, callback=self.index_page)

    @config(priority=2)
    def detail_page(self, response):
        results = []
        items = response.doc('.post.floated-thumb').items()
        for item in items:
            title = item('.archive-title').text()
            excerpt = item('.excerpt').text()
            url = item('.archive-title').attr.href
            
            temp = item('.post-meta > p')
            temp('.archive-title').remove()
            temp_list = temp.text().split(' ')
            date = temp_list[0]
            tag = temp_list[2]
            
            results.append({
                'title':title,
                'date':date,
                'tag':tag,
                'excerpt':excerpt,
                'url':url
            })
        return results
    
    def on_result(self, result):
        if result:
            for i in result:
                print(i['url'])
                self.db[i['tag']].update_one({'url': i['url']}, {'$set': i}, upsert=True)

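Another problem shows up here: detail_page can only produce one result per task. Even with yield, each result overwrites the previous one, so only the last survives. I therefore returned a list containing all the values. A Baidu search turned up another approach, the one the official documentation recommends: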

from pyspider.libs.base_handler import *
import pymongo

class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v1.5'
    }
    client = pymongo.MongoClient()
    db = client['jobbole2']
    
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://python.jobbole.com/all-posts/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        self.crawl(response.url, callback=self.detail_page)
        next_url = response.doc('.next').attr.href
        self.crawl(next_url, callback=self.index_page)

    @config(priority=2)
    def detail_page(self, response):
        items = response.doc('.post.floated-thumb').items()
        for i,item in enumerate(items):
            title = item('.archive-title').text()
            excerpt = item('.excerpt').text()
            url = item('.archive-title').attr.href
            
            temp = item('.post-meta > p')
            temp('.archive-title').remove()
            temp_list = temp.text().split(' ')
            date = temp_list[0]
            tag = temp_list[2]
            data = {
                'title':title,
                'date':date,
                'tag':tag,
                'excerpt':excerpt,
                'url':url
            }
            self.send_message(self.project_name, data, url="%s#%s" % (response.url, i))
	
    def on_message(self, project, msg):
        print(msg)
        return msg

    def on_result(self, result):
        print(result)
        if result:
            self.db[result['tag']].update_one({'url': result['url']}, {'$set': result}, upsert=True)
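
Why both workarounds exist can be shown with a toy store. pyspider's FAQ says results are deduplicated by task (one task per URL), so the sketch below simply keys results by URL. `ToyResultStore` and the sample items are made up for illustration and are not part of pyspider.

```python
class ToyResultStore:
    """Toy stand-in for a result store that keeps one result per task URL."""

    def __init__(self):
        self.rows = {}

    def save(self, url, result):
        # A later result for the same task replaces the earlier one.
        self.rows[url] = result


page = 'http://python.jobbole.com/all-posts/'
items = [{'title': 'post %d' % i} for i in range(3)]  # hypothetical data

# Yielding item by item: every save hits the same key, so only the last survives.
yielded = ToyResultStore()
for item in items:
    yielded.save(page, item)

# Returning one list: a single result that carries all items.
listed = ToyResultStore()
listed.save(page, items)

# send_message style: "%s#%s" gives each item its own URL, hence its own slot.
messaged = ToyResultStore()
for i, item in enumerate(items):
    messaged.save('%s#%s' % (page, i), item)

print(len(yielded.rows), len(listed.rows), len(messaged.rows))
```

The list version keeps everything in one record that on_result has to unpack, while the message version produces one record per article.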

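First, the approach that returns a list directly:
In testing I get results, and running the steps one at a time puts the matching records in the database. Running the whole project, however, leaves only 20 records in the database, i.e. a single page's worth. I also noticed that if the database holds more than 20 records, it is cut back down to 20. What is going on? Does update delete records too? Switching to insert made no difference.
Not solved yet; if you know the answer, please leave a comment. If I find it someday, I will update this post.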

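Update 2018-11-13: in index_page, change this line:
self.crawl(response.url, callback=self.detail_page)
to:
self.crawl(response.url + '#0', callback=self.detail_page)
The cause was URL deduplication again: the second crawl of the same URL was skipped outright, so I give it a different link.
The data in the database is still incomplete, though: several hundred records are missing.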
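
A small sketch of why the suffix helps. The pyspider docs state that the default taskid is the MD5 of the URL; the `seen` set and `schedule` function below are a toy model of deduplication that ignores age and itag, not pyspider's scheduler.

```python
import hashlib


def taskid(url):
    # Documented pyspider default: the task id is the MD5 of the task URL.
    return hashlib.md5(url.encode('utf-8')).hexdigest()


seen = set()


def schedule(url):
    """Toy dedup: accept a URL only if its taskid has not been seen yet."""
    tid = taskid(url)
    if tid in seen:
        return False
    seen.add(tid)
    return True


page = 'http://python.jobbole.com/all-posts/'

first = schedule(page)            # crawled once for index_page
again = schedule(page)            # same URL for detail_page: same taskid, dropped
suffixed = schedule(page + '#0')  # different string, different taskid, accepted

print(first, again, suffixed)
```

Because the id is computed from the whole URL string, the fragment yields a new task even though it points at the same page.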

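The second, officially recommended approach (the official solution):
The message panel on the left shows 20 messages, but on_message never receives them: printing msg gives nothing, while printing inside detail_page shows correct output. Naturally, result is empty as well.

Update 2018-11-14: same error as above. (on_message is not called during testing, which is why my print of msg was empty. In test mode the messages are sent to the message panel on the left for debugging; on_message is only called once the project is set to running.)

~~I suspected Windows was the cause and planned to set up a Linux virtual machine to try it. (Tested on Ubuntu since; same result.)~~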
