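Preface

I saw a video on Bilibili about scraping a dating site, so I followed along and wrote my own version. It is nothing more than a simple use of the requests library, plus writing the scraped information to files.

Before scraping a site, first analyze how it is structured; doing so makes the later steps much easier.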

www.7799520.com/api/user/pc/list/search?startage=21&endage=30&gender=2&cityid=221&startheight=161&endheight=170&marry=1&salary=2&page=1
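The URL shows that the site filters the data by the conditions we select and loads the matching records, and that the page parameter selects which page of results to return (the site displays all of these pages of results within a single web page).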
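To make the structure of that URL easier to see, here is a minimal sketch that splits the query string into its parameters using only the standard library. The `http://` scheme is added so the URL parses with a host, and the helper `page_url` is a hypothetical name introduced here for illustration; it is not part of the original script.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# The search URL from the text (scheme added so that urlsplit sees a host)
url = ("http://www.7799520.com/api/user/pc/list/search?startage=21&endage=30"
       "&gender=2&cityid=221&startheight=161&endheight=170&marry=1&salary=2&page=1")

parts = urlsplit(url)
params = dict(parse_qsl(parts.query))  # query string -> dict of string values
print(params)

def page_url(page):
    """Return the same search URL with only the page parameter changed."""
    query = dict(params, page=str(page))
    return parts._replace(query=urlencode(query)).geturl()

print(page_url(2))
```

Every filter (age, gender, height, salary) is an ordinary query parameter, so fetching another page only requires changing `page`.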

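Importing packages

With the analysis done, import the packages we need. The scraping is done with the requests library, so import requests first; os is needed for the file operations later, and json for serializing the saved records. Any other modules can be imported as they come up.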

import requests
import os
import json

Setting the search conditions

def set_age():
    # Ask for the desired age
    age = int(input("Enter the desired age (e.g. 25): "))
    # Map the age to the bracket it falls in
    if 21 <= age <= 30:
        startage = 21
        endage = 30
    elif 31 <= age <= 40:
        startage = 31
        endage = 40
    elif 41 <= age <= 50:
        startage = 41
        endage = 50
    elif 51 <= age <= 60:
        startage = 51
        endage = 60
    else:
        startage = 0
        endage = 0
    return startage, endage
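def set_sex():
    # Ask for the gender of the person being searched for
    sex = input("Enter the desired gender (e.g. female): ")
    # 1 = male, 2 = female (any other input is treated as female)
    if sex in ('male', '男'):
        gender = 1
    else:
        gender = 2
    return gender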

def set_heigth():
    # Ask for the desired height
    height = int(input("Enter the desired height (e.g. 162): "))
    # <= 150 so that a height of exactly 150 lands in the first bracket
    if 0 <= height <= 150:
        startheight = 0
        endheight = 150
    elif 151 <= height < 161:
        startheight = 151
        endheight = 160
    elif 161 <= height < 171:
        startheight = 161
        endheight = 170
    elif 171 <= height < 181:
        startheight = 171
        endheight = 180
    elif 181 <= height < 191:
        startheight = 181
        endheight = 190
    else:
        startheight = 0
        endheight = 0
    return startheight, endheight

def set_salary():
    # Ask for the desired salary
    money = int(input("Enter the desired salary: "))
    # Map the amount to the salary code used in the URL
    if 2000 <= money < 5000:
        salary = 2
    elif 5000 <= money < 10000:
        salary = 3
    elif 10000 <= money < 20000:
        salary = 4
    elif 20000 <= money < 50000:
        salary = 5
    elif 50000 <= money < 100000:
        salary = 6
    elif 100000 <= money:
        salary = 7
    else:
        salary = 0
    return salary

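Only some of the conditions are configurable here; feel free to add more if you are interested. The remaining conditions are simply hard-coded.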
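The if/elif chains above all do the same thing: find the bracket that contains a value. As a sketch of one alternative, the brackets can be kept in a list and searched by a single helper. The names `HEIGHT_BRACKETS` and `find_bracket` are made up for this illustration, and the bracket values are simply the ones used in the height function above.

```python
# (low, high) brackets taken from the height conditions above
HEIGHT_BRACKETS = [(0, 150), (151, 160), (161, 170), (171, 180), (181, 190)]

def find_bracket(value, brackets, default=(0, 0)):
    """Return the (low, high) bracket containing value, or default if none does."""
    for low, high in brackets:
        if low <= value <= high:
            return low, high
    return default

startheight, endheight = find_bracket(162, HEIGHT_BRACKETS)
print(startheight, endheight)
```

Adding another condition then means adding another list rather than another chain of comparisons.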

Parsing the page

The parameters needed to request a page are passed in from the search conditions.

def get_one_page(page,startage,endage,gender,startheight,endheight,salary):
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}

    base_url = 'http://www.7799520.com/api/user/pc/list/search?startage={}&endage={}&gender={}&cityid=221&startheight={}' \
               '&endheight={}&marry=1&salary={}&page={}'.format(startage,endage,gender,startheight,endheight,salary,page)
    try:
        # The timeout keeps a stalled connection from hanging the script
        response = requests.get(base_url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.json()
    except (requests.RequestException, ValueError):
        # Network error, or a response body that is not valid JSON
        pass
    # Any failure, including a non-200 status, is reported as None
    return None
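If a request should be retried after a failure, it is safer to bound the number of attempts and pause between them than to loop without limit. The following is only a sketch: `fetch_with_retry` is a hypothetical helper, and it assumes the function it wraps returns None on failure.

```python
import time

def fetch_with_retry(fetch, retries=3, delay=1.0):
    """Call fetch() until it returns something other than None, at most `retries` times."""
    for attempt in range(retries):
        result = fetch()
        if result is not None:
            return result
        if attempt < retries - 1:
            time.sleep(delay)  # pause before the next attempt
    return None
```

It could be used as `fetch_with_retry(lambda: get_one_page(1, 21, 30, 2, 161, 170, 2))`, which gives up after three failed attempts instead of retrying forever.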

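Once the search conditions have been collected, they are passed into the page URL. A for loop then walks through several pages of results and saves each person's picture and information.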

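def query_data():
    print("Enter your search conditions to start the search")
    # Age
    startage,endage = set_age()
    # Gender
    gender = set_sex()
    # Height
    startheight,endheight = set_heigth()
    # Salary
    salary = set_salary()

    # Pages 1 to 4
    for i in range(1,5):
        # Named result rather than json so the json module is not shadowed
        result = get_one_page(i,startage,endage,gender,startheight,endheight,salary)
        # Skip pages that could not be fetched
        if result is None:
            continue
        #print(result['data']['list'])
        for item in result['data']['list']:
            # Save the avatar
            #save_image(item)
            # Save the personal information
            save_info(item)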
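The loop above always requests a fixed number of pages. A possible variation is to keep going until a page comes back missing or empty. This sketch assumes, without having verified it against the site, that the API returns an empty `list` past the last page; `iter_items` and `fetch_page` are hypothetical names, and `max_pages` is a safety cap chosen arbitrarily.

```python
def iter_items(fetch_page, max_pages=50):
    """Yield items page by page until a page is missing or empty."""
    for page in range(1, max_pages + 1):
        result = fetch_page(page)
        if not result:
            break  # request failed or returned nothing
        items = (result.get('data') or {}).get('list') or []
        if not items:
            break  # assumed to mean there are no more results
        yield from items
```

With the functions in this post it might be called as `iter_items(lambda p: get_one_page(p, 21, 30, 2, 161, 170, 2))`, passing each yielded item to `save_info`.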

Saving avatars

The avatar URL is stored in the avatar field of each item in the JSON response, which is how we get the address of the picture.

def save_image(item):
    if not os.path.exists('images'):
        os.mkdir('images')
    response = requests.get(item['avatar'])
    if response.status_code == 200:
        file_path = 'images/{}.jpg'.format(item['username'])
        if not os.path.exists(file_path):
            print("Fetching the picture for %s" % (item['username']))
            # Images are saved in binary mode
            with open(file_path, 'wb') as f:
                # response.content holds the raw bytes of the image
                f.write(response.content)
        else:
            print("This image has already been saved")
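Saving basic information

Each person's information is saved to its own txt file; the fields can be read directly from item and written to the file.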

def save_info(item):
    if not os.path.exists('message'):
        os.mkdir('message')
    file_path = 'message/{}.txt'.format(item['username'])
    with open(file_path,'w',encoding='utf-8') as f:
        # format() also works if birthdayyear is a number rather than a string
        f.write("username: {}\nbirth: {}\n".format(item['username'], item['birthdayyear']))

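I also tried writing all of the scraped information into a single file, but when I opened the file the formatting turned out to be quite messy.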

def save_info(item):
    if not os.path.exists('message'):
        os.mkdir('message')
    data = {
        'username' : item['username'],
        'birth' : item['birthdayyear'],
        'gender':item['gender'],
        'height':item['height'],
        'education': item['education'],
        'monolog' : item['monolog'],
        'city': item['city']
    }
    #print(type(data)) dict
    #print(data)
    # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
    jsondata = json.dumps(data, ensure_ascii=False)
    print(jsondata)
    #print(type(jsondata))
    # Put everyone's information into a single text file
    file_path = 'message/{}.txt'.format('meizi')
    with open(file_path,'a',encoding='utf-8') as f:
         f.write(jsondata)
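One likely reason the combined file looks messy is that the records are appended back to back with nothing separating them. A minimal sketch of one remedy is to write one JSON object per line, which also makes the file easy to read back. The helper names `append_record` and `load_records` are introduced here for illustration only.

```python
import json
import os

def append_record(path, record):
    """Append one record to the file as a single JSON line."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

def load_records(path):
    """Read the file back into a list of dicts, one per non-empty line."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]
```

Inside `save_info`, the final write could then become `append_record('message/meizi.txt', data)`, and `load_records('message/meizi.txt')` would return everything saved so far.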

Calling the function to run the scraper

query_data()

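I have only just started learning web scraping and there is still a lot I don't know; I simply want to use CSDN to keep a record of my learning process.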
