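Scraping Lianjia second-hand home listings - matching Baidu Map API coordinates - drawing maps with Python [1]: fetching coordinates for keyword-matched places from the Baidu Map API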

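The second step is to match each property to a coordinate point. Has the Baidu Map API changed recently? Previously I could fetch several thousand records in one run, but now the error below appears every 25-30 requests: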

ConnectionError: HTTPConnectionPool(host='api.map.baidu.com', port=80): 
Max retries exceeded with url: /place/v2/search?
query=%E4%B8%AD%E7%B2%AE%E5%87%A4%E5%87%B0%E9%87%8C&tag=%E6%88%BF%E5%9C%B0%E4%BA%A7&
region=%E6%B7%B1%E5%9C%B3&output=json&ak=##########
(Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000233F42F8B38>:
 Failed to establish a new connection: 
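 [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))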

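So I added a longer cooldown after each error in the code below; I am not sure whether it helps. I will try it once I am back at work tomorrow.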
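One common way to implement such a cooldown is to retry the failed call with a delay that grows after every failure. The following is only a sketch of that idea: `call_with_backoff` and its parameters are hypothetical names, the 10-second base wait is an arbitrary choice, and I do not know whether this is enough to satisfy Baidu's limit.

```python
import random
import time

def call_with_backoff(func, max_tries=4, base_wait=10.0, sleep=time.sleep):
    """Call func(); after an exception, wait and retry with a growing, jittered delay.

    func, max_tries, base_wait and sleep are illustrative names, not part of any API.
    """
    for attempt in range(max_tries):
        try:
            return func()
        except Exception:
            if attempt == max_tries - 1:
                raise                                   # out of attempts: surface the error
            wait = base_wait * (2 ** attempt)           # 10 s, 20 s, 40 s, ...
            sleep(wait + random.uniform(0, wait / 2))   # jitter avoids a fixed request rhythm

# Possible usage (not executed here because it needs network access):
# resp = call_with_backoff(lambda: requests.get(url, timeout=30))
```

Passing `sleep` as a parameter keeps the helper easy to try out without actually waiting.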

The property names to search for go into the list lst_input, and the lookup uses the keyword search method that Baidu provides.
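Because every name in lst_input costs one API request, it may be worth removing duplicates before the crawl. A minimal sketch, assuming the names arrive as plain strings; `dedupe_keep_order` is a hypothetical helper and the sample names are made up for illustration:

```python
def dedupe_keep_order(names):
    """Strip whitespace, drop empty strings, and remove duplicates in first-seen order."""
    cleaned = (name.strip() for name in names)
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(name for name in cleaned if name))

# Illustrative input only, not real scraped data
raw_names = ['中粮凤凰里', '万科城 ', '中粮凤凰里', '']
lst_input = dedupe_keep_order(raw_names)
print(lst_input)
```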

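Update 180910: adjusted the code so that it actually runs. It looks as if Baidu recently tightened its concurrency limit, so I lengthened the wait after each request accordingly.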

import os
import pandas as pd
import numpy as np
import pymongo
import json
import random
import requests
import time

###### Data import: lst -> Baidu Map API -> collection in the database
def input_lst_baiduAPI_to_cll(lst_input,                                    # the list for searching in baidumap
                              tg_cll,                                       # the collection for storage
                              tg_dbs = 'db_WebCrwr', 
                              tg_hst = 'mongodb://localhost:27017/',
                              tg_rgn = '深圳',                              # region: Shenzhen
                              tg_tag = '房地产',                            # POI tag: real estate
                              tg_kwd = 're_name',
                              tg_ak  = '##########'): # Baidu Map API key (ak)
    myclient = pymongo.MongoClient(host=tg_hst)             # connect to the local MongoDB instance
    mydbs    = myclient[tg_dbs]                             # target database
    mycol    = mydbs[tg_cll]                                # target collection
    url0     = 'http://api.map.baidu.com/place/v2/search'   # Baidu Map place search endpoint
    lst_error= []                                           # names whose request failed
    for i_step, i_qry in enumerate(lst_input, start=1):     # enumerate also counts duplicate names correctly
        print("step: %d in %d, re %s"%(i_step, len(lst_input), i_qry))
        if mycol.find_one({tg_kwd : i_qry}):                # skip names that are already stored
            print("\tre %s: already in the collection"%(i_qry))
        else:
            try:
                params = {'query': i_qry, 'tag': tg_tag, 'region': tg_rgn, 'output': 'json', 'ak': tg_ak}
                data = requests.get(url0, params=params, timeout=500)  # requests URL-encodes params; generous timeout
                data.encoding="utf-8"                       # decode the body as UTF-8
                hjson = json.loads(data.text)               # parse the JSON text into a dict
                if hjson.get('message') == 'ok':
                    cum_insert = 0
                    for result in hjson['results']:         # store each returned place in mycol
                        result[tg_kwd] = i_qry              # attach the search keyword to the result
                        mycol.insert_one(result)
                        cum_insert += 1
                    print("\tre %s: %d documents inserted"%(i_qry, cum_insert))
                else:
                    lst_error.append(i_qry)                 # record the failed name
                    print("\tre %s: requests.get error"%(i_qry))  # the API answered, but not with 'ok'
                    print(hjson)                
                    time.sleep(random.uniform(10, 20))      # cool down to avoid the access limit
            except Exception as err:                        # unlike a bare except, Ctrl+C still stops the run
                lst_error.append(i_qry)                     # record the failed name
                time.sleep(random.uniform(10, 20))          # cool down to avoid the access limit
                print("\twarn %s: stopped for other error: %r"%(i_qry, err))
            time.sleep(random.uniform(1.5, 3))              # regular pause between requests
    return lst_error
###### Note: the keywords here are fairly specific, so the city (region) search method of the API is used
# city mthd: http://api.map.baidu.com/place/v2/search?query=ATM机&tag=房地产&region=深圳&output=json&ak=###
# crcl mthd: http://api.map.baidu.com/place/v2/search?query=银行&tag=美食&location=39.92,116.40&radius=200&output=xml&ak=###
# mthd from: http://lbsyun.baidu.com/index.php?title=webapi/guide/webservice-placeapi
# tag list from: http://lbsyun.baidu.com/index.php?title=lbscloud/poitags
# nodup input from: https://blog.csdn.net/qq_23926575/article/details/79184055
######

# a = requests.get("http://www.baidu.com",timeout = 500) # https://segmentfault.com/q/1010000008125714
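The two request styles listed in the comments above (city method and circle method) differ only in their location parameters. The sketch below builds both URLs with `urllib.parse.urlencode`, which percent-encodes the Chinese text the same way it appears in the error message earlier. The function names are hypothetical; the parameter names are taken from the two example URLs, and `'YOUR_AK'` is a placeholder for a real key.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

PLACE_SEARCH_URL = 'http://api.map.baidu.com/place/v2/search'

def city_search_url(query, region, ak, tag=None):
    """City method: search for a keyword inside a named region."""
    params = {'query': query, 'region': region, 'output': 'json', 'ak': ak}
    if tag:
        params['tag'] = tag
    return PLACE_SEARCH_URL + '?' + urlencode(params)

def circle_search_url(query, lat, lng, radius, ak, tag=None):
    """Circle method: search for a keyword around a centre point (lat,lng) within radius."""
    params = {'query': query, 'location': '%s,%s' % (lat, lng),
              'radius': radius, 'output': 'json', 'ak': ak}
    if tag:
        params['tag'] = tag
    return PLACE_SEARCH_URL + '?' + urlencode(params)

city_url = city_search_url('中粮凤凰里', '深圳', 'YOUR_AK', tag='房地产')
circle_url = circle_search_url('银行', 39.92, 116.40, 200, 'YOUR_AK')
print(city_url)
print(circle_url)
```

Nothing is sent over the network here; the URLs can be inspected first and then passed to `requests.get`.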

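Running it shows that many of the places were already in the collection, so they were skipped rather than fetched again. (The screenshot of the console output is not available here.)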
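Afterwards you will find that the API returns many odd addresses, so the information scraped earlier is needed to filter out the clearly abnormal ones. The median point of the remaining results is then taken as the actual coordinates of the target property.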
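The filtering code itself is not shown in this post. As a rough stand-in, the sketch below drops points that lie far from the overall median and then takes the median of the rest. This is a simple distance rule, not the filter based on the scraped Lianjia information described above. It assumes each stored document carries a `location` field with `lat` and `lng`; `median_point` and the 0.02-degree threshold (roughly 2 km) are hypothetical choices, and the sample documents are invented.

```python
from statistics import median

def median_point(docs, max_offset_deg=0.02):
    """Return (lat, lng): the median of the points that lie near the overall median."""
    pts = [(d['location']['lat'], d['location']['lng']) for d in docs if 'location' in d]
    if not pts:
        return None                                  # nothing usable for this name
    lat0 = median(p[0] for p in pts)
    lng0 = median(p[1] for p in pts)
    # Keep points close to the first-pass median; fall back to all points if none qualify
    kept = [p for p in pts
            if abs(p[0] - lat0) <= max_offset_deg and abs(p[1] - lng0) <= max_offset_deg] or pts
    return median(p[0] for p in kept), median(p[1] for p in kept)

# Invented sample: three nearby hits and one obvious outlier
sample_docs = [
    {'location': {'lat': 22.540, 'lng': 114.050}},
    {'location': {'lat': 22.541, 'lng': 114.051}},
    {'location': {'lat': 22.542, 'lng': 114.052}},
    {'location': {'lat': 23.100, 'lng': 113.300}},
]
print(median_point(sample_docs))
```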
