Using the CrawlSpider Module

Category: Other, 2018-10-25 00:41:41

Create the spider from a template

scrapy genspider -t crawl tencent tencent.com

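CrawlSpider was built for crawling entire sites. Previously we had to locate the next-page link ourselves and then yield a new request for it, which is tedious. With CrawlSpider you only write the rules: it extracts the URLs that match them from each page's response, requests those pages, and then processes their responses in the same way.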
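The idea behind a rule can be illustrated without Scrapy. The following is a simplified sketch using only the standard library; it is not how Scrapy implements rules, and the names _HrefCollector and links_to_follow are hypothetical. It collects the links on a page, resolves them against the page URL, keeps those matching an allow pattern, and skips URLs that were already seen.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to_follow(page_url, html, allow_pattern, seen):
    """Return the not-yet-seen absolute URLs on the page that match allow_pattern."""
    collector = _HrefCollector()
    collector.feed(html)
    new_urls = []
    for href in collector.hrefs:
        url = urljoin(page_url, href)  # resolve relative links
        if re.search(allow_pattern, url) and url not in seen:
            seen.add(url)  # remember it so it is followed only once
            new_urls.append(url)
    return new_urls


if __name__ == "__main__":
    html = '<a href="/page/2">next</a> <a href="/about">about</a> <a href="/page/2">2</a>'
    seen = {"http://lab.scrapyd.cn/page/1"}
    print(links_to_follow("http://lab.scrapyd.cn/page/1", html, r"/page/\d+", seen))
    # prints ['http://lab.scrapyd.cn/page/2']
```

A crawler would fetch each returned URL and run the same function on its response, which is the loop CrawlSpider takes over once the rules are declared.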

Adding just two or three lines of code is enough to replace the pagination logic we used to write by hand.

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TencentSpider(CrawlSpider):
    name = 'tencent'
    # Start page
    start_urls = ['http://lab.scrapyd.cn/page/1']
    rules = (
        # Follow every link whose URL matches the pagination pattern
        # and pass each fetched page to parse_item
        Rule(LinkExtractor(allow=r'http://lab\.scrapyd\.cn/page/'), callback='parse_item', follow=True),
    )

    # Extract the data from each crawled page
    def parse_item(self, response):
        # .get() returns the first matching text node, or None if nothing matches
        print(response.xpath('//*[@id="main"]/div[1]/span[1]/text()').get())

Reposted from www.cnblogs.com/coder-lzh/p/9847109.html
