Generate the project folder: run scrapy startproject doubantest in cmd (the word after the command becomes the folder name).
Problem encountered: unlike in the video, Douban Movie Top 250 has added anti-crawler measures. The fix is to add a User-Agent in settings.py; the method is described in the Baidu Jingyan article "How to write a Scrapy project in PyCharm: [8] user-agent":
https://jingyan.baidu.com/article/e52e36151bdf2640c60c513f.html
xxx\doubantest\main.py (new file)
#encoding=utf-8
from scrapy import cmdline

# Use the helper in scrapy that executes command-line commands to run
# "scrapy crawl doubanTest". Starting a crawler differs from an ordinary
# Python program: previously we ran programs with "python" followed by
# the script name, but a Scrapy spider is launched with "scrapy crawl".
cmdline.execute("scrapy crawl doubanTest".split())
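A small stdlib-only illustration (no Scrapy needed) of what the .split() call above hands to cmdline.execute(): the command string becomes an argv-style list of tokens.

```python
# The command string is split on whitespace into the argument list
# that cmdline.execute() expects.
cmd = "scrapy crawl doubanTest".split()
print(cmd)  # ['scrapy', 'crawl', 'doubanTest']
```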
xxx\doubantest\doubantest\spiders\spider.py (new file)
#encoding=utf-8
# Scrapy generated the project; this spider crawls the page.
# from scrapy.contrib.spiders import CrawlSpider  # old, deprecated import path
from scrapy.spiders import CrawlSpider

###### The User-Agent must be added in the settings.py file, not here.
# Passing headers the requests way, e.g.
#   hea = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
#   html = requests.get('http://jp.tingroom.com/yuedu/yd300p/', headers=hea)
# makes the site think a browser is visiting, but does not apply to Scrapy.

class Douban(CrawlSpider):
    name = "doubanTest"
    start_urls = ['https://movie.douban.com/top250']
    # start_urls = ['http://www.jikexueyuan.com/course/?pageNum=1']

    def parse(self, response):
        print(response.body)
        # print(response.url)
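As a next step beyond printing the raw body, here is a stdlib-only sketch of pulling movie titles out of the HTML. The snippet below is a hypothetical fragment mimicking the Top 250 markup (each title wrapped in a span with class "title"); inside parse() you would apply the same pattern to response.body, or better, use Scrapy's built-in selectors.

```python
import re

# Hypothetical HTML fragment shaped like the Douban Top 250 item list.
html = '''
<div class="item">
  <span class="title">肖申克的救赎</span>
</div>
<div class="item">
  <span class="title">霸王别姬</span>
</div>
'''

# Non-greedy capture of the text between the title spans.
titles = re.findall(r'<span class="title">(.*?)</span>', html)
print(titles)  # ['肖申克的救赎', '霸王别姬']
```

Regexes are brittle against real-world HTML; in a real spider, response.xpath('//span[@class="title"]/text()') from Scrapy's selector API is the more robust choice.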
Add to the (generated) settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
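To see what this setting does, a stdlib-only sketch (not Scrapy itself) that attaches the same User-Agent header to a plain urllib request; this is the header that makes the site treat the client as a browser rather than a default Python crawler. The request is only built and inspected here, not actually sent.

```python
from urllib.request import Request

UA = ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36')

# Build the request with the custom header; urllib normalizes the
# header name to "User-agent" internally.
req = Request('https://movie.douban.com/top250', headers={'User-Agent': UA})
print(req.get_header('User-agent'))
```

Scrapy applies the USER_AGENT setting to every outgoing request automatically, so nothing like this is needed in the spider code itself.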