Creating a project
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:
scrapy startproject URLCrawler
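The command creates a standard project skeleton (shown here for reference; the exact set of files may vary slightly between Scrapy versions):

```text
URLCrawler/
    scrapy.cfg            # deploy configuration file
    URLCrawler/           # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # directory where your spiders live
            __init__.py
```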
Our first Spider
This is the code for our first Spider. Save it in a file named my_spider.py
under the URLCrawler/spiders
directory in your project:
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"

    def start_requests(self):
        urls = [
            'http://www.4g.haval.com.cn/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        # When calling open() to write a file, pass 'w' to write a text file
        # or 'wb' to write a binary file
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
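The filename derivation in parse() can be sanity-checked in plain Python without running the spider (the URL here is the one from the spider above):

```python
# Mirror the filename logic from parse(): take the second-to-last
# piece of the URL split on "/", which for a URL ending in "/" is the host.
url = 'http://www.4g.haval.com.cn/'
domain = url.split("/")[-2]
filename = '%s.html' % domain
print(filename)  # www.4g.haval.com.cn.html
```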
How to run our spider
To put our spider to work, go to the project’s top-level directory and run:
scrapy crawl my_spider
A shortcut to the start_requests method
Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can simply define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:
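As a rough sketch of what that default implementation does, the logic can be mimicked in plain Python (simplified; the real Scrapy version yields scrapy.Request objects and sets dont_filter=True — make_request below is a hypothetical stand-in):

```python
# A simplified pure-Python sketch of Scrapy's default start_requests():
# it iterates start_urls and yields one request per URL.
# make_request is a hypothetical stand-in for scrapy.Request.
def make_request(url):
    return {'url': url, 'callback': 'parse'}

class SpiderSketch:
    start_urls = ['http://www.4g.haval.com.cn/']

    def start_requests(self):
        for url in self.start_urls:
            yield make_request(url)

print(list(SpiderSketch().start_requests()))
```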
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        'http://www.4g.haval.com.cn/',
    ]

    def parse(self, response):
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)
Extracting data
The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell. Run:
scrapy shell "http://www.4g.haval.com.cn/"
XPath: a brief intro
In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data='<title>哈弗SUV官网移动版</title>'>]
In [2]: response.xpath('//title/text()').extract_first()
Out[2]: '哈弗SUV官网移动版'
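The same kind of query can be tried offline with the standard library's (much more limited) XPath support, as a rough analogue of the shell session above on a toy document:

```python
import xml.etree.ElementTree as ET

# A minimal offline sketch: ElementTree supports a small XPath subset,
# enough to mimic //title/text() on a tiny hand-written page.
html = '<html><head><title>哈弗SUV官网移动版</title></head><body></body></html>'
root = ET.fromstring(html)
title = root.find('.//title').text  # roughly: //title/text()
print(title)  # 哈弗SUV官网移动版
```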
Extracting data and saving it to a file
textlist = response.selector.xpath('//text()').extract()
texts = []
# The with statement closes the file automatically, so no explicit close() is needed
with open('filename', 'w', encoding='utf-8') as f:
    for item in textlist:
        text = item.strip()
        if text != '':
            texts.append(text)
            f.write(text + '\n')
Notice that the JavaScript code is also extracted. If you don't want that part:
textlist_no_scripts = response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
with open('filename_no_scripts', 'w', encoding='utf-8') as f:
    for item in textlist_no_scripts:
        text = item.strip()
        f.write(text + '\n')
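The effect of the normalize-space(.) predicate can be approximated in plain Python: it keeps only text nodes containing something other than whitespace. A minimal sketch of that filtering, applied to a hypothetical list of extracted strings:

```python
# Keep only strings that are non-empty after stripping whitespace,
# mirroring what text()[normalize-space(.)] does on the XPath side.
def drop_blank(texts):
    return [t.strip() for t in texts if t.strip()]

sample = ['  哈弗SUV  ', '\n\t', '官网移动版', '']
print(drop_blank(sample))  # ['哈弗SUV', '官网移动版']
```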
At this point the page's text content has been downloaded to a file, and the next step could be word segmentation, clustering, and so on. However, this also reveals a problem: the method does not work for dynamically loaded pages. For example, the pages downloaded by the following code differ from what a browser shows when visiting the same URLs:
# write a script to download the html
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        'http://www.4g.haval.com.cn/',
        'http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2',
    ]

    def parse(self, response):
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)
# execute in prompt
scrapy crawl my_spider
The downloaded text content is also much sparser:
# explore the data in prompt
scrapy shell "http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2"
response.selector.xpath('//text()').extract()
response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
# download text to file in prompt
scrapy shell "http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2"
textlist_no_scripts = response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
with open('filename_no_scripts', 'w', encoding='utf-8') as f:
    for item in textlist_no_scripts:
        text = item.strip()
        f.write(text + '\n')
For the reason behind this, see: https://chenqx.github.io/2014/12/23/Spider-Advanced-for-Dynamic-Website-Crawling/
For a way to solve this problem, see: https://blog.csdn.net/sinat_40431164/article/details/81200207