update the database content

2万多条数据已经爬去完毕，发现格式不正确，该怎么办？

爬取的结果如下：

[{“title”: “工艺：”, “content”: [“油爆”]}, {“title”: “口味：”, “content”: [“咸鲜味”]}, {“title”: “菜系：”, “content”: [“福建菜”]}, {“title”: “功效：”, “content”: [“福建菜”, “通乳调理”, “气血双补调理”, “营养不良调理”]}, {“title”: “主料：”, “content”: [“河虾250克”]}, {“title”: “辅料：”, “content”: [“竹笋35克”, “香菇（鲜）10克”, “青椒20克”, “红萝卜25克”]}, {“title”: “调料：”, “content”: [“大葱10克鸡蛋清10克大蒜5克淀粉(豌豆)12克白砂糖5克盐3克味精1克料酒3克胡椒1克植物油75克各适量”]}]

JSon在线解析后结果如下：
这里写图片描述

通过分析发现：

{“title”: “功效：”, “content”: [“福建菜”, “通乳调理”, “气血双补调理”, “营养不良调理”]}

content中内容还在列表中，我们需要取出来，解决办法有两种：

（1）编写爬虫代码时就应该整理好数据。当数据较少可以修改代码重新跑一次，但是数据太多，重跑不可能。

（2）使用pipelines进行数据整理，这种办法也是数据清理时经常会用到的。方法如下：

#以下代码可以在任意文件夹下运行，只要环境配置正确
import pymysql.cursors
import json
#make a connection with the databases
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='123456',
                             db='baikemy.com',
                             cursorclass=pymysql.cursors.DictCursor,autocommit=True)
try:
    with connection.cursor() as cursor:

        sql = "SELECT `id`, `gongyi` FROM `total_copy1`" 
        cursor.execute(sql) #execute the search
        result = cursor.fetchall() #get all the row from search
        for r in result: 
            g = json.loads(r['gongyi']) # change json string to json object
            #if when make to json objects,we can get the value by  g2['title']
            d = []
            for g2 in g:
                a = g2['title'] #get value 
                b = g2['content'] #b is a list of strings 
                strx = ''
                for s in b:#change content list of strings (b)  into one string
                    strx += s + ' '
                strx = strx[:-1] #remove the last space
                d.append( # new gongyi object
                    {
                        "title": a,
                        "content": strx
                    }
                )
            with connection.cursor() as cursor2:
                d2=str(d).replace("'","\"") # change object to string
                print(d2)
                sql = "UPDATE `total_copy1` SET `gongyi`=%s WHERE `id`=%s" #update gongyi object
                cursor2.execute(sql, (d2, r['id'])) # update by knowing id 
                print(r['id'])
finally:
    connection.close()

update the database content

猜你喜欢