Scrapy crawler basics

To generate the project folder, run scrapy startproject doubantest in cmd (the argument after startproject is the folder name).

Problem encountered: unlike in the video, Douban Movie Top 250 has added an anti-crawler check. The fix is to add a User-Agent in settings.py; see the Baidu Jingyan article "How to write a Scrapy project in PyCharm: [8] user-agent":

https://jingyan.baidu.com/article/e52e36151bdf2640c60c513f.html

xxx\doubantest\main.py (new file)

#encoding=utf-8
from scrapy import cmdline
cmdline.execute("scrapy crawl doubanTest".split())
# cmdline.execute runs the Scrapy command "scrapy crawl doubanTest" from code.
# A spider is launched with "scrapy crawl <spider name>", unlike an ordinary
# Python program, which you would run as "python <script>.py".

xxx\doubantest\doubantest\spiders\spider.py (new file)

#encoding=utf-8

# Create a spider in the generated project and crawl the page.
# from scrapy.contrib.spiders import CrawlSpider  # old import path, removed in newer Scrapy versions
from scrapy.spiders import CrawlSpider

###### The User-Agent must be set in the settings.py file (see below).
# Passing a headers dict per request, e.g.
#   hea = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
#   html = requests.get('http://jp.tingroom.com/yuedu/yd300p/', headers=hea)
# is the requests-library way of making the site think a browser is visiting;
# it does not apply to Scrapy.

class Douban(CrawlSpider):
    name = "doubanTest"
    start_urls = ['https://movie.douban.com/top250']
    # start_urls = ['http://www.jikexueyuan.com/course/?pageNum=1']

    def parse(self, response):
        print(response.body)
        # print(response.url)
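
Printing response.body only confirms that the page downloads; the usual next step is extracting fields with selectors, e.g. response.xpath('//span[@class="title"]/text()').extract() inside parse(). The extraction logic can be sketched without Scrapy using the standard-library HTML parser (the span class="title" selector is an assumption about Douban's markup at the time of writing, not something this post verifies):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text of every <span class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "span" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that sits inside a matching <span>
        if self.in_title:
            self.titles.append(data)

sample = '<ol><li><span class="title">肖申克的救赎</span></li></ol>'
p = TitleParser()
p.feed(sample)
print(p.titles)  # → ['肖申克的救赎']
```

In the real spider the same idea is one line of XPath, but the sketch shows what the selector is doing under the hood.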

Add the following to the (generated) settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
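
A slightly fuller settings.py fragment is sketched below. The extra values are assumptions for a small tutorial crawl, not requirements: newer Scrapy projects generate ROBOTSTXT_OBEY = True, which can also stop the request if the site's robots.txt disallows crawlers, and DOWNLOAD_DELAY throttles the request rate:

```python
# settings.py fragment (sketch; adjust to your own project)

# Pretend to be a desktop browser so the site serves the page
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'

# New projects default this to True; disable it if robots.txt is
# filtering out your requests
ROBOTSTXT_OBEY = False

# Wait between requests to reduce the chance of being banned
DOWNLOAD_DELAY = 1
```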


Reposted from blog.csdn.net/hhyiyuanyu/article/details/80183685