A simple single-threaded crawler for Ctrip store products (product ID / sales / review count)

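As the link above shows, the Ctrip store lists more than 50,000 activity products in total.

Paging alone does not reach them all, though: the listing stops at 300 pages, so at most 3,000 products are visible.

If we add a filter, for example products priced between 1 and 5, only a little over 2,000 products match, so every one of them shows up in the listing.

So the crawler filters first and then pages through each filtered listing, which lets it collect all 50,000-plus Ctrip products.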
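The filter-then-page idea can be sketched as a few small helpers. This is only an illustration: `build_url`, `band_is_complete` and `iter_band_urls` are hypothetical names, the URL pattern is the one used in the crawler code in this post, and the 300-page cap is the one observed above.

```python
# Hypothetical helpers illustrating "filter first, then page".
URL_TEMPLATE = ('http://huodong.ctrip.com/activity/search/'
                '?keyword=&filters=pmin%spmax%ss13p%s')
MAX_VISIBLE_PAGES = 300  # the listing stops paging here


def build_url(price_min, price_max, page):
    """Return the search URL for one price band and one result page."""
    return URL_TEMPLATE % (price_min, price_max, page)


def band_is_complete(max_page):
    """A band is fully visible only if it ends before the page cap."""
    return max_page < MAX_VISIBLE_PAGES


def iter_band_urls(price_min, price_max, max_page):
    """Yield the URL of every result page in one price band."""
    for page in range(1, max_page + 1):
        yield build_url(price_min, price_max, page)
```

A band that reports exactly 300 pages has probably been truncated, so it should be narrowed before `iter_band_urls` is used on it.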

Because of technical limitations, it is not easy to scrape the type of each product. I hope to improve this when I get the chance.

# Ctrip activity-product crawler
# Last maintained: June 15, 2018
import requests
from bs4 import BeautifulSoup
import re
import mysql.connector
# Placeholder credentials: replace them with your own
database = mysql.connector.connect(user='root', password='啊啊啊啊啊', database='啊啊啊啊')
# Store the data however you prefer; MySQL is simply what is used here
cursor = database.cursor()
headers = {'content-type': 'application/json',
           'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time
import numpy as np

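# Scrape the required fields from one listing page, then upload them to the database
def get(url):
    r = requests.get(url, headers=headers)
    # Decode the raw bytes as UTF-8 directly; this also works when r.encoding is None
    html = r.content.decode('UTF-8')
    soup = BeautifulSoup(html, 'html.parser').find("div", {"id": "xy_list"})
    query_product_list = []
    n = 0

    for i in soup.find_all("div"):
        addtime = int(time.time())
        n += 1
        # Only the odd-numbered <div> elements are processed; the others are skipped
        if n % 2 == 0:
            continue
        productid = re.search(r'href="/activity/([0-9]+)\.html', str(i)).group(1)
        comment_text = re.search(r'<span class="product_comment">(.*?)</span>', str(i)).group(1)
        # Work out the review count, num_comment
        if comment_text == '新产品，暂无点评':
            num_comment = 0
        elif re.search(r'新产品，已有([0-9]+)条点评', str(i)):
            num_comment = int(re.search(r'新产品，已有([0-9]+)条点评', str(i)).group(1))
        elif re.search(r'[0-9]+条点评', str(i)):
            num_comment = int(re.search(r'([0-9]+)条点评', str(i)).group(1))
        else:
            print('Failed to find the review count')
            num_comment = 0  # fall back to 0 so the variable is always defined
        # Work out the monthly sales
        sales_match = re.search(r'<span class="product_num">月销：<em class="black">([0-9]+)份</em></span>', str(i))
        sales = int(sales_match.group(1)) if sales_match else 0
        # Row key: product ID followed by the crawl timestamp
        queryid = int(productid + str(addtime))

        productname = i.h2.string
        img_url = re.search(r'data-original="(https?://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|])', str(i)).group(1)
        palece_path = i.find("p", {"class": "product_destination"}).find("span")
        productType = "特色体验"
        # The city name ends at '等' if present, otherwise at the closing </span>
        if str(palece_path).find('等') > 0:
            number_end = str(palece_path).find('等')
        else:
            number_end = str(palece_path).find(r'</span>')
        # The city name starts after '·' if present, otherwise after </em>
        if str(palece_path).find('·') > 0:
            number_start = str(palece_path).find('·') + 2
        else:
            number_start = str(palece_path).find(r'</em>') + 5

        city = str(palece_path)[number_start:number_end]

        # Assumption: the five columns are (queryid, productid, num_comment, sales, addtime).
        # The original post does not show how the rows were built.
        query_product_list.append((queryid, productid, num_comment, sales, addtime))

    # Upload to the database. This part was changed at the last minute,
    # so it does not line up exactly with the parsing code above.
    # A multi-row INSERT INTO table VALUES (a1,b1,c1),(a2,b2,c2)...(an,bn,cn)
    # is recommended because it is much faster than inserting row by row.
    if not query_product_list:
        return
    sql = 'INSERT INTO `query_product` values '
    for row in query_product_list:
        sql += '(%s,%s,%s,%s,%s) ,' % (row[0], row[1], row[2], row[3], row[4])
    sql = sql[:-2]  # drop the trailing " ,"
    cursor.execute(sql)
    database.commit()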

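from selenium.webdriver.common.by import By

def get_max_page(url):
    # Read the last page number from the pager and store it in the global max_page
    global max_page
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        xpath = r'//*[@id="xy_pagebar"]/div/a[last()-1]'
        try:
            max_page = driver.find_element(By.XPATH, xpath).text
        except NoSuchElementException:
            # The pager may not have rendered yet: wait a second and retry once
            time.sleep(1)
            max_page = driver.find_element(By.XPATH, xpath).text
    finally:
        # quit() also shuts down the chromedriver process, not just the window
        driver.quit()
    return max_page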

    
number_page = 1
filter_min = 1
filter_max = 5
url_template = 'http://huodong.ctrip.com/activity/search/?keyword=&filters=pmin%spmax%ss13p%s'

while 1:
    url = url_template % (filter_min, filter_max, number_page)

    # Check that the price range is suitable. A listing shows at most 300 pages (3,000 products),
    # so if the last page is 300 the range must be narrowed, or some products will be missed.
    while 1:
        get_max_page(url)
        if int(max_page) == 300 and filter_max > filter_min:
            # Narrow the range, but never below filter_min
            filter_max = max(filter_min, filter_max - 3)
            url = url_template % (filter_min, filter_max, number_page)
        elif int(max_page) <= 100:
            filter_max = filter_max + 3
            url = url_template % (filter_min, filter_max, number_page)
        else:
            print(url, 'URL looks fine, starting to crawl')
            break
    while 1:
        # Rebuild the URL so that each iteration fetches the current page
        url = url_template % (filter_min, filter_max, number_page)
        get(url)

        print('%s-%s: %s' % (filter_min, filter_max, number_page), end='|')
        number_page += 1
        time.sleep(1)
        if number_page == int(max_page) + 1:
            print('Last page reached')
            filter_min = filter_max + 1
            filter_max = filter_min + 5
            number_page = 1
            break
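
The multi-row INSERT recommended in the comments can also be built with placeholders instead of formatting the values into the SQL string, which leaves quoting to the database driver. A minimal sketch, assuming every row has the same number of columns; `build_bulk_insert` is a hypothetical helper, not part of the original crawler:

```python
def build_bulk_insert(table, rows):
    """Build one multi-row INSERT statement plus its flat parameter list.

    Returns (sql, params). The statement uses %s placeholders, the
    parameter style that mysql.connector expects.
    """
    if not rows:
        raise ValueError('rows must not be empty')
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError('all rows must have the same number of columns')
    # One "(%s,%s,...)" group per row, joined by commas
    one_row = '(' + ','.join(['%s'] * width) + ')'
    sql = 'INSERT INTO `%s` VALUES %s' % (table, ','.join([one_row] * len(rows)))
    # Flatten the rows so the values line up with the placeholders
    params = [value for row in rows for value in row]
    return sql, params
```

It would be used as `sql, params = build_bulk_insert('query_product', query_product_list)` followed by `cursor.execute(sql, params)`. The table name is formatted into the statement directly, so it must come from trusted code rather than from scraped data.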

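The big weakness of this crawler is that it does not use multithreading or multiprocessing.

The main reason is the fear of getting banned for crawling too fast.

So far, crawling one page at a time with a one-second sleep between requests has been fairly stable and does not seem to trigger a ban.

(My computer has plenty of memory and I have plenty of time anyway, hahaha.)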
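
The one-request-per-second pacing could be wrapped in a small helper so that every fetch goes through the same delay. A sketch only: `crawl_politely`, its `fetch` callback and the `sleep` parameter are hypothetical names introduced for illustration.

```python
import time


def crawl_politely(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results
```

Passing `sleep` as a parameter makes it possible to test the pacing without actually waiting.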

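For multithreading, see my other post, 看书海全站爬虫+代理+多线程 (a full-site crawler for 看书海 with proxies and multithreading), which combines multithreading with proxies.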

The catch is that stable proxies are almost all paid. One yuan an hour is not expensive, but it still does not feel great, haha.
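
For reference, a bounded thread pool from the standard library is one way to add concurrency while still limiting how hard the site is hit. This is a generic sketch, not the code from the other post: `crawl_concurrently` and `fetch` are hypothetical names, and `max_workers=4` is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_concurrently(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL on a small thread pool.

    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

A proxy can be supplied per request through the `proxies` argument of `requests.get`, for example `requests.get(url, headers=headers, proxies={'http': proxy_url})`, where `proxy_url` is a placeholder for whatever proxy address is available.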
