Knowledge points:
Installing the Scrapy module
Two ways to install modules; between them they cover the vast majority of packages:
Network install: run pip install XX directly in the console.
Download install: the network install is simple, but it fails from time to time. When it does, go to https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml, download the matching package file, cd into the download folder in the console, and run pip install xx to install it.
Item 6 (pywin32) configuration steps:
1. Copy the two .dll files under F:\编程\python\Lib\site-packages\pywin32_system32
2. Paste them into C:\Windows\System32
Common Scrapy commands
Syntax
1. Create a project: scrapy startproject XX (note: startproject creates a project, not an individual spider)
2. List available templates: scrapy genspider -l
basic:
    # -*- coding: utf-8 -*-
    import scrapy

    class FstSpider(scrapy.Spider):
        name = 'fst'
        allowed_domains = ['aliwx.com.cn']
        start_urls = ['http://aliwx.com.cn/']

        def parse(self, response):
            pass
crawl:
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class SecondSpider(CrawlSpider):
        name = 'second'
        allowed_domains = ['aliwx.com.cn']
        start_urls = ['http://aliwx.com.cn/']

        rules = (
            Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            i = {}
            #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
            #i['name'] = response.xpath('//div[@id="name"]').extract()
            #i['description'] = response.xpath('//div[@id="description"]').extract()
            return i
csvfeed:
    # -*- coding: utf-8 -*-
    from scrapy.spiders import CSVFeedSpider

    class ThirdSpider(CSVFeedSpider):
        name = 'third'
        allowed_domains = ['aliwx.com.cn']
        start_urls = ['http://aliwx.com.cn/feed.csv']
        # headers = ['id', 'name', 'description', 'image_link']
        # delimiter = '\t'

        # Do any adaptations you need here
        #def adapt_response(self, response):
        #    return response

        def parse_row(self, response, row):
            i = {}
            #i['url'] = row['url']
            #i['name'] = row['name']
            #i['description'] = row['description']
            return i
xmlfeed:
    # -*- coding: utf-8 -*-
    from scrapy.spiders import XMLFeedSpider

    class FourthSpider(XMLFeedSpider):
        name = 'fourth'
        allowed_domains = ['aliwx.com.cn']
        start_urls = ['http://aliwx.com.cn/feed.xml']
        iterator = 'iternodes'  # you can change this; see the docs
        itertag = 'item'  # change it accordingly

        def parse_node(self, response, selector):
            i = {}
            #i['url'] = selector.select('url').extract()
            #i['name'] = selector.select('name').extract()
            #i['description'] = selector.select('description').extract()
            return i
3. Create a spider inside spiders/: scrapy genspider -t basic/crawl/csvfeed/xmlfeed <spider_name> <target_domain>
4. Run a spider: scrapy crawl <spider_name> (this is the spider's name attribute, not its file name)
Project structure:
items: declares the target fields you want to scrape
spiders: holds the individual spider files
middlewares: downloader/spider middleware hooks for customizing how requests and responses are processed
pipelines: post-crawl processing of scraped items (cleaning, validating, storing)
Basics of writing a Scrapy project
Storing data in a database
Using pymysql
Rather than editing connections.py inside the pymysql package to change the default charset, pass charset='utf8mb4' (or 'utf8') directly to pymysql.connect(); this prevents garbled text without modifying library source that a package upgrade would overwrite.
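A minimal pipeline sketch for storing items via pymysql, passing the charset at connect time. The host, credentials, database name, and the `novel` table with `title`/`url` columns are all placeholder assumptions; adjust them to your setup and enable the class in settings.py under ITEM_PIPELINES:

```python
class MysqlPipeline:
    """Store scraped items in MySQL (table/column names are placeholders)."""

    def open_spider(self, spider):
        # Imported lazily so the class can be unit-tested without the driver
        import pymysql
        # Passing charset here avoids garbled text without patching pymysql
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='', db='spider',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        self.cursor.execute(
            'INSERT INTO novel (title, url) VALUES (%s, %s)',
            (item.get('title'), item.get('url')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```

Scrapy calls open_spider once at startup, process_item for every yielded item, and close_spider at shutdown, so the connection is opened and closed exactly once per crawl.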
Project URL: