Creating a project
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:
scrapy startproject URLCrawler
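The command creates a standard project skeleton (shown here for reference; the exact set of files may vary slightly between Scrapy versions):

```text
URLCrawler/
    scrapy.cfg            # deploy configuration file
    URLCrawler/           # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # directory where your spiders live
            __init__.py
```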
Our first Spider
This is the code for our first Spider. Save it in a file named my_spider.py
under the URLCrawler/spiders
directory in your project:
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"

    def start_requests(self):
        urls = [
            'http://www.4g.haval.com.cn/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        # When calling open() to write a file, pass 'w' to write a text file
        # or 'wb' to write a binary file
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
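The filename derivation in parse() can be sanity-checked in plain Python without running the spider (the URL here is the one from the spider above):

```python
# Mirror the filename logic from parse(): take the second-to-last
# piece of the URL split on "/", which for a URL ending in "/" is the host.
url = 'http://www.4g.haval.com.cn/'
domain = url.split("/")[-2]
filename = '%s.html' % domain
print(filename)  # www.4g.haval.com.cn.html
```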
How to run our spider
To put our spider to work, go to the project’s top-level directory and run:
scrapy crawl my_spider
A shortcut to the start_requests method
Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can simply define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:
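As a rough sketch of what that default implementation does, the logic can be mimicked in plain Python (simplified; the real Scrapy version yields scrapy.Request objects and sets dont_filter=True — make_request below is a hypothetical stand-in):

```python
# A simplified pure-Python sketch of Scrapy's default start_requests():
# it iterates start_urls and yields one request per URL.
# make_request is a hypothetical stand-in for scrapy.Request.
def make_request(url):
    return {'url': url, 'callback': 'parse'}

class SpiderSketch:
    start_urls = ['http://www.4g.haval.com.cn/']

    def start_requests(self):
        for url in self.start_urls:
            yield make_request(url)

print(list(SpiderSketch().start_requests()))
```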
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        'http://www.4g.haval.com.cn/',
    ]

    def parse(self, response):
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)
Extracting data
The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell. Run:
scrapy shell "http://www.4g.haval.com.cn/"
XPath: a brief intro
In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data='<title>哈弗SUV官网移动版</title>'>]
In [2]: response.xpath('//title/text()').extract_first()
Out[2]: '哈弗SUV官网移动版'
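The same kind of query can be tried offline with the standard library's (much more limited) XPath support, as a rough analogue of the shell session above on a toy document:

```python
import xml.etree.ElementTree as ET

# A minimal offline sketch: ElementTree supports a small XPath subset,
# enough to mimic //title/text() on a tiny hand-written page.
html = '<html><head><title>哈弗SUV官网移动版</title></head><body></body></html>'
root = ET.fromstring(html)
title = root.find('.//title').text  # roughly: //title/text()
print(title)  # 哈弗SUV官网移动版
```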
Extracting data and saving it to a file
textlist = response.selector.xpath('//text()').extract()
texts = []
# The with statement closes the file automatically, so no explicit close() is needed
with open('filename', 'w', encoding='utf-8') as f:
    for item in textlist:
        text = item.strip()
        if text != '':
            texts.append(text)
            f.write(text + '\n')
Notice that the JavaScript code is also extracted. If you don't want that part:
textlist_no_scripts = response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
with open('filename_no_scripts', 'w', encoding='utf-8') as f:
    for item in textlist_no_scripts:
        text = item.strip()
        f.write(text + '\n')
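The effect of the normalize-space(.) predicate can be approximated in plain Python: it keeps only text nodes containing something other than whitespace. A minimal sketch of that filtering, applied to a hypothetical list of extracted strings:

```python
# Keep only strings that are non-empty after stripping whitespace,
# mirroring what text()[normalize-space(.)] does on the XPath side.
def drop_blank(texts):
    return [t.strip() for t in texts if t.strip()]

sample = ['  哈弗SUV  ', '\n\t', '官网移动版', '']
print(drop_blank(sample))  # ['哈弗SUV', '官网移动版']
```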
At this point the page's text content has been downloaded to a file, and the next step could be word segmentation, clustering, and so on. However, this also reveals a problem: the method does not work for dynamically loaded pages. For example, the pages downloaded by the following code differ from what a browser shows when visiting the same URLs:
# write a script to download the html
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        'http://www.4g.haval.com.cn/',
        'http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2',
    ]

    def parse(self, response):
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)
# execute in prompt
scrapy crawl my_spider
The downloaded text content is also much sparser:
# explore the data in prompt
scrapy shell "http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2"
response.selector.xpath('//text()').extract()
response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
# download text to file in prompt
scrapy shell "http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2"
textlist_no_scripts = response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
with open('filename_no_scripts', 'w', encoding='utf-8') as f:
    for item in textlist_no_scripts:
        text = item.strip()
        f.write(text + '\n')
For the reason behind this, see: https://chenqx.github.io/2014/12/23/Spider-Advanced-for-Dynamic-Website-Crawling/
For a way to solve this problem, see: https://blog.csdn.net/sinat_40431164/article/details/81200207