1. When scraped data comes back empty even though the elements are visible in the browser's F12 inspector, check the page source: if the content is missing there, it is rendered by JavaScript, and the only option is to drive a real browser with Selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(20)  # wait up to 20 s for elements to appear
driver.get('......略')      # target URL omitted in the original notes
include_title = []

# The rendered DOM is searchable even though the raw page source is empty.
# (find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...))
author = driver.find_element(By.XPATH, '//*[@id="main"]/div/div/div[1]/div[1]/div[1]/div[1]/div[1]/h4/a[1]').text
date = driver.find_element(By.XPATH, '//*[@id="main"]/div/div/div[1]/div[1]/div[1]/div[1]/div[1]/h4/a[2]').text
driver.find_element(By.XPATH, '//*[@id="main"]/div/div/div[1]/div[2]/div[1]/button[2]').click()
print(author, date)
2. Without browser simulation, scrapy can only scrape static pages: anything filled in by JavaScript after the initial response will be missing.
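A quick way to confirm a page is JS-rendered (and thus out of plain scrapy's reach) is to check whether the text you see in DevTools actually appears in the raw HTML that scrapy receives. A minimal sketch; the helper name and the sample HTML are made up for illustration:

```python
def is_js_rendered(raw_html: str, visible_text: str) -> bool:
    """Return True if text visible in the browser is absent from the
    raw HTML, i.e. it was most likely injected by JavaScript."""
    return visible_text not in raw_html

# The raw source of a JS-heavy page often contains only an empty mount point:
raw = '<html><body><div id="main"></div><script src="app.js"></script></body></html>'
print(is_js_rendered(raw, "some article title"))  # → True: Selenium needed
print(is_js_rendered(raw, 'id="main"'))           # → False: present in raw HTML
```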
3. If a page shows content in the browser but its response URL returns 404 when opened directly, the site does not want you accessing it that way; for now there is no way to scrape it.
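By default scrapy's HttpError middleware silently drops non-2xx responses, so those 404s never reach your callback. If you at least want to see and log them, scrapy provides the HTTPERROR_ALLOWED_CODES setting; a settings sketch:

```python
# settings.py -- let 404 responses through to the spider callback
# (normally scrapy filters out non-2xx responses before parse() is called)
HTTPERROR_ALLOWED_CODES = [404]
```

In the callback you can then branch on `response.status` and log the blocked URL instead of failing silently.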
4. For scrapy, default request headers can be set in settings.py:

DEFAULT_REQUEST_HEADERS = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Host": "bbs.tju.edu.cn",
    "Referer": "https://bbs.tju.edu.cn/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36",
}
5. Plain pymongo insertion:
import pymongo

client = pymongo.MongoClient('localhost', 27017)
mydb = client['mydb']
taobao = mydb['taobao']

# scraping code omitted; goods, price, sells, shop, address come from it
commodity = {
    'goods': goods,
    'price': price,
    'sell': sells,
    'shop': shop,
    'address': address,
}
taobao.insert_one(commodity)
6. Writing scrapy items to MongoDB:
# in pipelines.py
import pymongo

class JianshuPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient('localhost', 27017)
        test = client['test']
        self.post = test['jianshu']

    def process_item(self, item, spider):
        info = dict(item)
        self.post.insert_one(info)  # insert() is deprecated in pymongo 3+
        return item
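The pipeline only runs if it is registered in settings.py. A sketch, assuming the project module is named jianshu (adjust the dotted path to your own project):

```python
# settings.py -- enable the pipeline; the integer (0-1000) sets the order
# in which pipelines are applied when several are registered
ITEM_PIPELINES = {
    'jianshu.pipelines.JianshuPipeline': 300,
}
```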