Scrapy: [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'XXXX'

Problem description:

While crawling with the Scrapy framework, the follow-up requests issued after parsing the response to the start_urls request never went through: Scrapy filtered my requests out.

# -*- coding: utf-8 -*-
import scrapy


class LuboavSpider(scrapy.Spider):
    name = 'photo'
    allowed_domains = ['https://www.XXXX.com']  # BUG: a full URL with scheme, not a bare domain -- see the analysis below
    start_urls = ['https://www.XXXX.com/art/type/id/11.html']
    base_urls = 'https://www.XXXX.com/art/type/id/11/page/{}.html'
    detail_base_urls = 'https://www.XXXX.com'

    def parse(self, response):
        print("response.status", response.status)
        detail_url = response.xpath('//table//td[1]/a/@href')
        for i in detail_url:
            print(i.extract())
            yield scrapy.Request(self.detail_base_urls + i.extract(), callback=self.detail_parse)

    def detail_parse(self, response):
        print("dfasgasgas")
        print("detail_parse", response.status)

Error message:

2018-12-27 00:01:06 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.XXXX.com': <GET https://www.XXXX.com/art/detail/id/114522.html>

Cause analysis:
allowed_domains is supposed to hold bare domain names, but here it holds a full URL ('https://www.XXXX.com', scheme included). The host of the follow-up request ('www.XXXX.com') therefore never matches any allowed domain, so OffsiteMiddleware drops the request before it is ever sent.

        yield scrapy.Request(self.detail_base_urls + i.extract(), callback=self.detail_parse)
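
To see the matching rule in play, scrapy.utils.url provides url_is_from_any_domain(url, domains), which applies essentially the same host-vs-domain comparison as the offsite filter. A minimal sketch (XXXX.com stands in for the real site):

from scrapy.utils.url import url_is_from_any_domain

url = 'https://www.XXXX.com/art/detail/id/114522.html'

# a full URL in allowed_domains never matches the request host
print(url_is_from_any_domain(url, ['https://www.XXXX.com']))  # False -> filtered
# a bare host matches exactly
print(url_is_from_any_domain(url, ['www.XXXX.com']))          # True
# the registered domain matches www.XXXX.com as a subdomain
print(url_is_from_any_domain(url, ['XXXX.com']))              # True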
        
Solution:
Method 1: pass dont_filter=True to the Request; the requested URL then bypasses the allowed_domains filter.

        yield scrapy.Request(self.detail_base_urls + i.extract(), callback=self.detail_parse, dont_filter=True)
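Note that dont_filter=True also bypasses the scheduler's duplicate-request filter, so a detail page linked from several listing pages may be fetched more than once; methods 2 and 3 avoid this side effect.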

Method 2: fix allowed_domains so it contains a bare domain: change it to allowed_domains = ['XXXX.com'], i.e. the registered domain with no scheme (a host entry such as 'www.XXXX.com' would also match this spider's requests). See the sketch below.
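
Applied to the spider above, the class attributes would read (a sketch; XXXX.com is still a placeholder):

class LuboavSpider(scrapy.Spider):
    name = 'photo'
    allowed_domains = ['XXXX.com']  # bare registered domain: no scheme, covers www.XXXX.com as a subdomain
    start_urls = ['https://www.XXXX.com/art/type/id/11.html']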

Method 3: comment out the allowed_domains line altogether (not recommended, since the spider is then free to follow links to any site).

References:
https://blog.csdn.net/weixin_41607151/article/details/80515030
https://blog.csdn.net/weixin_42523052/article/details/80778037 

Reposted from blog.csdn.net/jss19940414/article/details/85270755