Scrapy: [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'XXXX'

Problem description:

While crawling with the Scrapy framework, the follow-up requests issued after parsing the response to the start_urls request never went through: Scrapy filtered my requests out.

# -*- coding: utf-8 -*-
import scrapy


class LuboavSpider(scrapy.Spider):
    name = 'photo'
    allowed_domains = ['https://www.XXXX.com']  # BUG: a full URL with scheme, not a bare domain -- see the analysis below
    start_urls = ['https://www.XXXX.com/art/type/id/11.html']
    base_urls = 'https://www.XXXX.com/art/type/id/11/page/{}.html'
    detail_base_urls = 'https://www.XXXX.com'

    def parse(self, response):
        print("response.status", response.status)
        detail_url = response.xpath('//table//td[1]/a/@href')
        for i in detail_url:
            print(i.extract())
            yield scrapy.Request(self.detail_base_urls + i.extract(), callback=self.detail_parse)

    def detail_parse(self, response):
        print("dfasgasgas")
        print("detail_parse", response.status)

Error message:

2018-12-27 00:01:06 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.XXXX.com': <GET https://www.XXXX.com/art/detail/id/114522.html>

Cause analysis:
allowed_domains is supposed to hold bare domain names, but here it holds a full URL ('https://www.XXXX.com', scheme included). The host of the follow-up request ('www.XXXX.com') therefore never matches any allowed domain, so OffsiteMiddleware drops the request before it is ever sent.

        yield scrapy.Request(self.detail_base_urls + i.extract(), callback=self.detail_parse)
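
To see the matching rule in play, scrapy.utils.url provides url_is_from_any_domain(url, domains), which applies essentially the same host-vs-domain comparison as the offsite filter. A minimal sketch (XXXX.com stands in for the real site):

from scrapy.utils.url import url_is_from_any_domain

url = 'https://www.XXXX.com/art/detail/id/114522.html'

# a full URL in allowed_domains never matches the request host
print(url_is_from_any_domain(url, ['https://www.XXXX.com']))  # False -> filtered
# a bare host matches exactly
print(url_is_from_any_domain(url, ['www.XXXX.com']))          # True
# the registered domain matches www.XXXX.com as a subdomain
print(url_is_from_any_domain(url, ['XXXX.com']))              # True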
        
Solution:
Method 1: pass dont_filter=True to the Request; the requested URL then bypasses the allowed_domains filter.

        yield scrapy.Request(self.detail_base_urls + i.extract(), callback=self.detail_parse, dont_filter=True)
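Note that dont_filter=True also bypasses the scheduler's duplicate-request filter, so a detail page linked from several listing pages may be fetched more than once; methods 2 and 3 avoid this side effect.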

Method 2: fix allowed_domains so it contains a bare domain: change it to allowed_domains = ['XXXX.com'], i.e. the registered domain with no scheme (a host entry such as 'www.XXXX.com' would also match this spider's requests). See the sketch below.
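
Applied to the spider above, the class attributes would read (a sketch; XXXX.com is still a placeholder):

class LuboavSpider(scrapy.Spider):
    name = 'photo'
    allowed_domains = ['XXXX.com']  # bare registered domain: no scheme, covers www.XXXX.com as a subdomain
    start_urls = ['https://www.XXXX.com/art/type/id/11.html']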

Method 3: comment out the allowed_domains line altogether (not recommended, since the spider is then free to follow links to any site).

References:
https://blog.csdn.net/weixin_41607151/article/details/80515030
https://blog.csdn.net/weixin_42523052/article/details/80778037 

Reposted from blog.csdn.net/jss19940414/article/details/85270755