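Idle chatter

These past few days have been exhausting. Online classes every day, five or six platforms, phone, computer and TV all running at once, eyes on a screen all day long; I feel like I'm going to crack sooner or later. On top of that, we barely had any homework when classes were at school, but the moment they moved online it's "Your assignment is about to expire, please complete it in time"... OK, enough ranting. I'm stealing a little free time to write up something I built a couple of days ago. Plenty of experts online have already written about this, but I like tinkering, so here are the pits I fell into myself.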

Result

Muttering under my breath: the shape comes from a picture of Boss Lei. ๑乛◡乛๑

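Filling the pits

At first I just tried fetching the page directly and, unsurprisingly, got nothing. That makes sense: a big site isn't going to expose its comments to something as simple as right-click "View page source". What requests fetches directly is in fact the same thing you see with right-click "View page source". So before doing this kind of thing, look at how the site is put together first, i.e. view the page source; otherwise you can end up doing a lot of work for nothing.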
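A quick way to run that check is to search the raw page source for something you can see in the browser. A rough sketch follows; `find_missing_snippets` is a made-up helper, and the HTML string is only a stand-in for whatever `requests.get(url).text` would return.

```python
def find_missing_snippets(page_source, snippets):
    """Return the snippets that do NOT occur in the raw page source."""
    return [s for s in snippets if s not in page_source]

# Made-up page source: the comment container is there, but it is empty
# because the comments themselves are loaded later by JavaScript.
page_source = (
    "<html><body><div id='comment'></div>"
    "<script src='app.js'></script></body></html>"
)

missing = find_missing_snippets(page_source, ["battery life", "id='comment'"])
print(missing)  # snippets visible in the browser but absent from the source
```

If text you can see on the rendered page shows up in `missing`, the data is probably coming from a separate request, and that request is the thing to look for in the F12 network panel.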

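Next I tried fetching the page dynamically with selenium. It did work, except that I only got one comment per page. Maybe my approach was wrong; I won't paste the code, it's too clumsy. I spent nearly a full hour here and my head was spinning. I did discover one handy thing about Firefox's F12 developer tools, though: you can get an element's XPath directly. Click the tag, right-click, and under Copy there is a Copy XPath option along with a few others. Maybe I'm just slow for only noticing now, but better late than never, hahaha.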

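In the end I gave up and searched around, and it turned out I really had been slow: there are already plenty of write-ups online. Inspired by one such author, I deleted all my earlier code (painful), then tinkered for another afternoon and came up with the version below. The source is already well commented, so I won't walk through the approach again. Finally done!! *★,°*:.☆\(￣▽￣)/$:*.°★*. Confetti!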

URL breakdown

https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100011199522&score=1&sortType=6&page=0&pageSize=10&isShadowSku=0&fold=1

Response mode: callback (here fetchJSON_comment98)
Product ID: productId=100011199522
Comment type: score=0 all, 1 negative, 2 neutral, 3 positive, 5 follow-up
Sort order: sortType=5 recommended, 6 by time
Page: page=0
Comments per page: pageSize=10
Unknown: isShadowSku=0
Unknown: fold=1
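The query string can also be assembled from a dictionary, which makes it easier to change one parameter at a time. This is only a minimal sketch: `build_comment_url` and its default values are names made up for illustration, and the parameter meanings are just what was observed in the list above.

```python
from urllib.parse import urlencode

BASE_URL = "https://sclub.jd.com/comment/productPageComments.action"

def build_comment_url(product_id, page, score=1, sort_type=6, page_size=10):
    """Assemble the comment URL from the observed parameters (hypothetical helper)."""
    params = {
        "callback": "fetchJSON_comment98",  # name of the JSONP wrapper function
        "productId": product_id,
        "score": score,          # 1 = negative reviews
        "sortType": sort_type,   # 6 = sort by time
        "page": page,
        "pageSize": page_size,
        "isShadowSku": 0,        # meaning unknown, copied as-is
        "fold": 1,               # meaning unknown, copied as-is
    }
    return BASE_URL + "?" + urlencode(params)

url = build_comment_url(100011199522, page=0)
print(url)
```

Calling `build_comment_url(100011199522, page=3, score=3)` would, under the same assumptions, ask for the fourth page of positive reviews.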

Code

Database: data_sql.py

# -*- coding:utf-8 -*-
import sqlite3

# Open the database, creating the table if it does not exist yet
def thing_opendb():
    conn = sqlite3.connect("JD_date.db")
    cur = conn.execute("""create table if not exists comments_info(productName varchar(126),commentTime char(30),content varchar(512))""")
    return cur,conn

# Insert one row into the database
def thing_insertData(productName,commentTime,content):
    hel = thing_opendb()  # hel is (cursor, connection)
    hel[1].execute("insert into comments_info(productName,commentTime,content) values (?,?,?)",(productName,commentTime,content))
    hel[1].commit()
    hel[1].close()
        
# Delete every row in the table
def thing_delalldb():
    hel = thing_opendb()              # returns (cursor, connection)
    hel[1].execute("delete from comments_info")
    print("Wiped the table, time to run")
    hel[1].commit()
    hel[1].close()
        
# Query all rows
def thing_slectTable():
    hel = thing_opendb()
    cur = hel[1].cursor()
    cur.execute("select * from comments_info")
    res = cur.fetchall()
    # for line in res: print(line)
    # Close the cursor and the connection before returning
    cur.close()
    hel[1].close()
    return res

# Query the text of every comment
def thing_slectComment():
    hel = thing_opendb()
    cur = hel[1].cursor()
    cur.execute("select content from comments_info")
    res = cur.fetchall()
    # for line in res: print(line)
    # Close the cursor and the connection before returning
    cur.close()
    hel[1].close()
    return res
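If opening a connection for every single row gets slow, one alternative is to pass in one connection and insert a whole page of comments at once. The following is a sketch under assumptions, not part of the module above: `save_comments` is a hypothetical helper, the rows are made up, and an in-memory database stands in for JD_date.db.

```python
import sqlite3
from contextlib import closing

def save_comments(conn, rows):
    """Insert many (productName, commentTime, content) rows in one transaction."""
    conn.execute(
        "create table if not exists comments_info("
        "productName varchar(126), commentTime char(30), content varchar(512))"
    )
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.executemany(
            "insert into comments_info(productName, commentTime, content) "
            "values (?, ?, ?)",
            rows,
        )

# In-memory database and made-up rows, for demonstration only.
with closing(sqlite3.connect(":memory:")) as conn:
    save_comments(conn, [
        ("Demo phone", "2020-02-20 10:00:00", "Battery drains fast"),
        ("Demo phone", "2020-02-21 11:30:00", "Screen scratches easily"),
    ])
    count = conn.execute("select count(*) from comments_info").fetchone()[0]

print(count)
```

Note that `with conn:` only manages the transaction; it is `closing(...)` that actually closes the connection at the end.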

Source

import requests
import time
import json
# Database helpers
from data_sql import *
# Word cloud library
from wordcloud import WordCloud
import PIL.Image as image
import numpy as np
def get_comment(url):
    r = requests.get(url, timeout=10).content
    text=r.decode('gbk')
    # The response is JSONP, not bare JSON: fetchJSON_comment98({...});
    # print(text)
    # Keep only what sits between the first "(" and the last ")"
    data2=text[text.find("(") + 1:text.rfind(")")]
    # print(data2)
    data = json.loads(data2)  # data2 holds a JSON object, i.e. the part wrapped in {}
    # Clean the dict: pull the fields we need out of comments
    for i in data['comments']:
        # Product name
        productName = i['referenceName']
        # Comment time
        commentTime = i['creationTime']
        # Comment text
        content = i['content']
        # Save the row to the database
        thing_insertData(productName,commentTime,content)
    print('ok')
    
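# When I crawled there were pages 0-35 with 10 comments each; page 36 only had 2, so I skipped it
# Fetch in batches to avoid getting blocked; I ran range(0,11), then range(11,21), then range(21,35)
# Note that range(21,35) stops at page 34
for i in range(21,35):
    number=str(i)
    url = "https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100011199522&score=1&sortType=6&page=%s&pageSize=10&isShadowSku=0&fold=1"%number
    get_comment(url)
    # Avoid getting blocked: wait 1 second between pages
    time.sleep(1)

print('Crawl finished')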
# Count the rows stored so far
a=thing_slectTable()
print(len(a))
# print(a)
# Fetch every comment from the database
b=thing_slectComment()
print(len(b))
# print(b)
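# Write the data to a txt file (utf-8, so emoji in reviews cannot break the write)
file = open('jingdongComments.txt','w',encoding='utf-8')
for i in b:
    # Each row is a 1-tuple (content,); take the text itself and drop newlines
    strs=i[0].replace("\n", "")
    # Brute-force removal of leftover HTML entities
    strs=strs.replace("&hellip;", "")
    strs=strs.replace("&rdquo;", "")
    strs=strs.replace("&ldquo;", "")
    # Review template labels: running speed, photo quality, standby time,
    # appearance, other features, screen and sound
    strs=strs.replace("运行速度", "")
    strs=strs.replace("拍照效果", "")
    strs=strs.replace("待机时间", "")
    strs=strs.replace("外形外观", "")
    strs=strs.replace("其他特色", "")
    strs=strs.replace("屏幕音效", "")
    file.write(strs)
file.close()
print('Write ok')
# Read the txt file back
with open("jingdongComments.txt",encoding='utf-8') as fp:
    text=fp.read()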
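    # Feed the text to a WordCloud object for analysis
    # Image that gives the word cloud its shape
    mask = np.array(image.open("2.jpg"))

    # Fonts: C:\Windows\Fonts\FZSTK.TTF  C:\Windows\Fonts\FZLTCXHJW.TTF  (these were in my system font folder; check yours)
    font=r"C:\Windows\Fonts\FZLTCXHJW.TTF"
    
    wc = WordCloud(
        # Set a font; without a Chinese-capable font the text comes out garbled
        font_path=font,  # path to a font file on this PC
        
        # Background color
        background_color='white',
        
        # Shape of the word cloud
        mask=mask,
        
        # Maximum number of words
        max_words=100,
        
        # Largest font size
        max_font_size=100,
        
        # Seed for the random generator, so layout and colors are reproducible
        random_state=30,
        
        # Scale factor for the rendered image; larger means sharper
        scale=3
    ).generate(text)
    
    image_produce = wc.to_image()
    image_produce.show()
    
print('Word cloud done')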
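One way to unwrap the JSONP response is a regular expression anchored at both ends, so that text such as `);` inside a comment is left alone. This is a sketch under assumptions: `parse_jsonp` is a hypothetical helper, and the sample payload is made up, imitating only the three fields this post uses.

```python
import json
import re

def parse_jsonp(text):
    """Strip a JSONP wrapper like fetchJSON_comment98({...}); and parse the JSON inside."""
    # Everything before the first "(" is the callback name; the greedy group
    # then runs up to the last ")" before the optional trailing ";".
    match = re.search(r"^[^(]*\((.*)\)\s*;?\s*$", text, re.S)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))

# Made-up payload for demonstration only.
sample = (
    'fetchJSON_comment98({"comments": [{"referenceName": "Demo phone", '
    '"creationTime": "2020-02-20 10:00:00", '
    '"content": "Stopped working (twice);"}]});'
)

data = parse_jsonp(sample)
rows = [(c["referenceName"], c["creationTime"], c["content"])
        for c in data["comments"]]
print(rows)
```

The resulting tuples have the same shape as the arguments of `thing_insertData`, so they could be passed on row by row.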

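Ramblings

Since this was just for learning, I only scraped the negative reviews, so the data set is small. The site shows 500+, and I ended up with 350+. Some customer service replies may be counted as comments, so that's still fairly honest. There seem to be 40,000+ reviews in total. I originally planned to grab them all, but it didn't seem very useful, so I didn't bother. I don't know whether swapping in a different product ID would still return data; I haven't tried, but I expect it would.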

半盏清茶℡


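Scraping a certain JD product's reviews in a few lines of code, writing them to a database, and turning them into a word cloud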
