Data Collection: JSON-Formatted Data

Target page [智联招聘 (Zhilian Zhaopin)]:

URL: https://sou.zhaopin.com/?jl=801&kw={0}&p={1} ({0} is the search keyword, {1} is the page number)
Example: https://sou.zhaopin.com/?jl=801&kw=python&p=1

Right-click -> View page source [the job data is embedded as JSON in a <script> tag; split the string to extract it]
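The splitting step can be tried offline before wiring it into a spider. The sketch below is only an illustration: `SAMPLE_HTML` and its contents are made up rather than real Zhilian output, and `extract_initial_state` is a hypothetical helper name. It also swaps in `json.JSONDecoder.raw_decode` instead of plain `json.loads`, because the raw page source still has a closing `</script>` tag after the JSON object.

```python
import json

# Hypothetical page source; the real page is much larger.
SAMPLE_HTML = (
    '<html><body><script>__INITIAL_STATE__='
    '{"positionList": [{"name": "Python Engineer", "companyName": "Example Co"}]}'
    '</script></body></html>'
)

MARKER = "__INITIAL_STATE__="


def extract_initial_state(page_source: str) -> dict:
    """Return the object assigned to __INITIAL_STATE__ in the page source."""
    # Everything after the marker begins with the JSON object.
    tail = page_source.split(MARKER, 1)[1].lstrip()
    # raw_decode parses one JSON value and ignores whatever follows it
    # (here, the closing </script> tag).
    state, _end = json.JSONDecoder().raw_decode(tail)
    return state


state = extract_initial_state(SAMPLE_HTML)
print(state["positionList"][0]["name"])
```

Inside a Scrapy callback the XPath query already returns only the text of the `<script>` tag, so a plain `json.loads` on the part after the marker is enough there, provided nothing follows the JSON object.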

Pick the fields you need out of the JSON data [each item field maps one-to-one to a JSON key]

Parse it in the list-page callback:

import json
import scrapy
# ZhilianItem is the Item class defined in this Scrapy project.

    def parse(self, response):
        # Use XPath to get the text of the <script> tag that holds __INITIAL_STATE__
        js = response.xpath('//script[contains(.,"__INITIAL_STATE__=")]/text()').extract_first()
        # print(js)
        r = js.split("__INITIAL_STATE__=")[1]  # split the string to isolate the JSON text
        # open("a.json", "w", encoding="utf-8").write(r)  # save locally for inspection
        d = json.loads(r)  # convert the JSON string into a Python object
        companies = d.get("positionList")
        for company in companies:
            itemdata = ZhilianItem()
            itemdata["title"] = company.get("name")
            itemdata["company"] = company.get("companyName")
            itemdata["salary"] = company.get("salary60")
            itemdata["address"] = company.get("workCity") + company.get("cityDistrict")
            zcxx = company.get("welfareLabel")  # list of welfare labels
            arr1 = []
            for i in zcxx:
                # append the whole label; += on a string would add it one character at a time
                arr1.append(i.get("value"))
            itemdata["post"] = str(arr1)
            # ",工作经验" means ", work experience"
            itemdata["experience"] = company.get("workType") + ",工作经验" + company.get("workingExp")
            article_url = company.get("positionURL")
            # print(article_url)

            print(itemdata['title'])
            # Pass the partly filled item to the detail-page callback through meta
            yield scrapy.Request(url=article_url, meta={"item": itemdata}, callback=self.parsedetail,
                                 dont_filter=True)
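The field mapping in the loop can also be checked without running a crawl. The following is a rough sketch under a few assumptions: `build_fields` and `join_text` are hypothetical helper names, the sample record is made up, and treating missing values as empty strings is a choice made here, not something the original spider does. Note that `some_list += "text"` extends a list one character at a time, which is why the labels are collected as whole strings.

```python
def join_text(*parts):
    """Concatenate parts, treating missing (None) values as empty strings."""
    return "".join(p or "" for p in parts)


def build_fields(position: dict) -> dict:
    """Map one entry of positionList to the fields used by the item."""
    # Collect each welfare label as a whole string.
    labels = [label.get("value") for label in position.get("welfareLabel") or []]
    return {
        "title": position.get("name"),
        "company": position.get("companyName"),
        "salary": position.get("salary60"),
        "address": join_text(position.get("workCity"), position.get("cityDistrict")),
        "post": str(labels),
        "experience": join_text(
            position.get("workType"), ",工作经验", position.get("workingExp")
        ),
    }


# Made-up record with the same keys the spider reads.
sample = {
    "name": "Python Engineer",
    "companyName": "Example Co",
    "salary60": "10K-15K",
    "workCity": "北京",
    "cityDistrict": None,  # a missing district no longer raises TypeError
    "welfareLabel": [{"value": "五险一金"}, {"value": "带薪年假"}],
    "workType": "全职",
    "workingExp": "1-3年",
}

fields = build_fields(sample)
print(fields["address"])
print(fields["post"])
```

Keeping the mapping in a plain function like this makes it easy to test against a saved copy of the JSON before sending any requests.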

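Difference between extract() and extract_first():

extract() returns all matches, stored in a list.
extract_first() returns a single string, the first value in the result of extract(), or None when nothing matches.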

JSON functions

json.dumps: encodes a Python object as a JSON string
json.loads: decodes a JSON string into a Python object
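A short round trip shows the two functions side by side. The record below is made-up sample data; `ensure_ascii=False` is an optional argument of `json.dumps` that keeps Chinese text readable instead of escaping it as `\uXXXX` sequences.

```python
import json

# Made-up sample record for illustration.
record = {"title": "Python 工程师", "salary": "10K-15K", "tags": ["五险一金", "带薪年假"]}

# dumps: Python object -> JSON string
encoded = json.dumps(record, ensure_ascii=False)

# loads: JSON string -> Python object
decoded = json.loads(encoded)

print(encoded)
print(decoded == record)
```

With the default `ensure_ascii=True`, the same call produces only ASCII characters, which is safe for any output encoding but harder to read when inspecting scraped data by eye.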
