Learning the Scrapy Framework (5. Implementing a paginated crawler with Scrapy, and an introduction to the related parameters of scrapy.Request)

1. Create the crawler project: scrapy startproject tencent

Then enter the project directory: cd tencent

Create the spider: scrapy genspider tencent_spider tencent.com

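2. A few things to know before writing the code

Find the URL of the next page in the current page, then hand it to a parse function. That parse function can be the current one, which works much like recursion.

The callback parameter of Request determines which parse function handles the response.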

3. The spider code, tencent_spider.py, which crawls the Tencent recruitment site.
# -*- coding: utf-8 -*-
import scrapy


class TencentSpiderSpider(scrapy.Spider):
    name = 'tencent_spider'
    allowed_domains = ['tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    def parse(self, response):
        tr_list = response.xpath("//table[@class='tablelist']//tr")[1:-1]
        for tr in tr_list:
            item = {}
            item["position"] = tr.xpath("./td/a/text()").extract_first()
            item["category"] = tr.xpath(".//td[2]/text()").extract_first()
            item["date"] = tr.xpath(".//td[5]/text()").extract_first()
            yield item
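        # Find the URL of the next page
        next_url = response.xpath("//a[@id='next']/@href").extract_first()
        # Skip a missing link (None) and the "javascript:;" placeholder href
        if next_url and next_url != "javascript:;":
            next_url = "https://hr.tencent.com/"+next_url
            yield scrapy.Request(
                next_url,
                # callback specifies which parse function handles the response for this URL
                callback=self.parse
            )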
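The string concatenation above works as long as the href is a path relative to the site root. As a hedged alternative, the standard library's urljoin resolves the href against the page URL and also copes with absolute hrefs; Scrapy responses offer response.urljoin for the same purpose. The href values below are made up for illustration and are not taken from the real site.

```python
from urllib.parse import urljoin

# URL of the page being parsed (taken from start_urls in the spider)
page_url = "https://hr.tencent.com/position.php"

# Hypothetical href values, as a next-page link might contain them
relative_href = "position.php?start=10"
absolute_href = "https://hr.tencent.com/position.php?start=20"

# A relative href is resolved against the page URL
next_from_relative = urljoin(page_url, relative_href)

# An absolute href is returned unchanged, where plain concatenation
# would produce a broken URL
next_from_absolute = urljoin(page_url, absolute_href)

print(next_from_relative)
print(next_from_absolute)
```

Inside the spider, the equivalent would be next_url = response.urljoin(next_url) before building the Request.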
Enable the pipeline in settings.py (the ITEM_PIPELINES setting):
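A pipeline only runs once it is registered in the project's settings.py. A minimal sketch, assuming the project was created as tencent with the commands in step 1, so that the class path is tencent.pipelines.TencentPipeline (adjust it if your project or class is named differently):

```python
# settings.py (fragment)
# Keys are import paths of pipeline classes; values set the order in which
# pipelines run (lower runs first, conventionally in the 0-1000 range).
ITEM_PIPELINES = {
    "tencent.pipelines.TencentPipeline": 300,  # assumed path: <project>.pipelines.<class>
}
```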

In pipelines.py, just print each item for now instead of saving it, and check the printed output.

class TencentPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item
Run the crawler: scrapy crawl tencent_spider

The results are printed continuously as the crawl runs.

The code that saves the items can then be written in pipelines.py.
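As one possible way to do the saving, the sketch below writes every item to a file as one JSON object per line. The class name JsonLinesPipeline and the file name tencent_items.jl are assumptions chosen for illustration; open_spider, close_spider and process_item are the hook methods Scrapy calls on a pipeline.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item to a file as one JSON line."""

    # Assumed output file name; not part of the original article
    path = "tencent_items.jl"

    def open_spider(self, spider):
        # Called once when the spider starts; an existing file is overwritten
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        # Return the item so that later pipelines still receive it
        return item
```

Like TencentPipeline, this class only runs after it has been registered in ITEM_PIPELINES in settings.py.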

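4. The callback parameter: callback specifies the parse function that handles the response.

You can define your own parse functions in the spider.

meta: passes data between two parse functions

For example: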
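    def parse(self, response):
        # The rest is omitted; focus on how the parameters are passed
        yield scrapy.Request(
            next_url,
            # callback specifies which parse function handles the response for this URL
            callback=self.parse1,
            # meta carries data over to that parse function
            meta={"item": item}
        )

    # A second parse function receives the data through meta
    def parse1(self, response):
        # The value can be read back directly by its key
        item = response.meta["item"]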
    
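dont_filter: by default Scrapy deduplicates requests, so a URL that has already been requested is not requested again.

Set it to True to skip deduplication for a request.

This is needed when the data on the requested page changes over time and the same URL has to be fetched again.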
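The effect of dont_filter can be pictured with a toy model. The ToyScheduler class below is a simplified stand-in written for this illustration, not Scrapy's actual implementation: it remembers a fingerprint of every accepted URL and drops repeats unless dont_filter is set.

```python
import hashlib


class ToyScheduler:
    """Simplified stand-in for request deduplication, not Scrapy's actual code."""

    def __init__(self):
        self.seen = set()   # fingerprints of URLs already accepted
        self.queue = []     # URLs waiting to be downloaded

    def enqueue(self, url, dont_filter=False):
        fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
        # A repeated URL is dropped unless the caller opts out of filtering
        if not dont_filter and fingerprint in self.seen:
            return False
        self.seen.add(fingerprint)
        self.queue.append(url)
        return True


scheduler = ToyScheduler()
url = "https://hr.tencent.com/position.php"
first = scheduler.enqueue(url)                     # accepted: new URL
second = scheduler.enqueue(url)                    # dropped: already seen
forced = scheduler.enqueue(url, dont_filter=True)  # accepted: filter bypassed
print(first, second, forced)
```

In a spider the same opt-out is written as scrapy.Request(url, callback=self.parse, dont_filter=True).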
