Crawler Basics: Automatically Simulating HTTP Requests

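1. Concepts:

A client talks to a server by sending HTTP requests. HTTP defines several request methods; this post covers two of them, GET and POST. Typical uses are searching for information (GET) and logging in (POST).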
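To make the difference concrete, here is a minimal sketch that builds one GET and one POST request with `urllib` without sending anything. The endpoint `http://example.com/search` and the field values are placeholders chosen for illustration, not something from this post.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint used only for illustration; no request is sent.
BASE_URL = "http://example.com/search"

# GET: the parameters travel in the URL's query string.
query = urllib.parse.urlencode({"q": "python"})
get_req = urllib.request.Request(BASE_URL + "?" + query)

# POST: the parameters travel in the request body, as bytes.
body = urllib.parse.urlencode({"name": "alice", "passwd": "secret"}).encode("utf-8")
post_req = urllib.request.Request(BASE_URL, data=body)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

`urllib` picks the method from the presence of `data`: a `Request` without a body is sent as GET, one with a body as POST.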

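2. GET in practice: automating a Baidu search:

(What the code does: search Baidu for the keyword "范冰冰" and print the title of every result on the result pages. You choose how many pages to fetch; the paging is explained in the code comments below.)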
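The URL construction can be looked at on its own before fetching anything. The sketch below assumes that Baidu's `pn` query parameter is the result offset and that each page holds 10 results; `build_search_url` is a helper name introduced here for illustration.

```python
import urllib.parse


def build_search_url(keyword, page):
    """Return the Baidu search URL for a 1-based result page."""
    # quote() percent-encodes the UTF-8 bytes of non-ASCII characters.
    encoded = urllib.parse.quote(keyword)
    # Assumption: pn is the result offset, with 10 results per page.
    offset = (page - 1) * 10
    return "http://www.baidu.com/s?wd=" + encoded + "&pn=" + str(offset)


for page in range(1, 5):
    print(build_search_url("范冰冰", page))
```

Keeping the URL logic in a small function makes it easy to check the encoding and the offsets without touching the network.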

import urllib.parse
import urllib.request
import re
# Note: an ASCII keyword works as-is; a Chinese keyword raises an error unless it is percent-encoded first (try it yourself)
keyword = "范冰冰"
# Percent-encode the keyword
keyword = urllib.parse.quote(keyword)
# Fetch several pages; the pn parameter is the result offset: pn = (page-1)*10
for i in range(1,5):
    url = "http://www.baidu.com/s?wd="+keyword+"&pn="+str((i-1)*10)
    data = urllib.request.urlopen(url).read().decode("utf-8")
    # Regular expression that matches each result title on the page:
    # pat = "title:'(.*?)',"
    pat1 = '{"title":"(.*?)",'
    # result = re.compile(pat).findall(data)
    result1 = re.compile(pat1).findall(data)
    # for j in range(0,len(result)):
    #     print(result[j])
    # Print every title found on this page
    for title in result1:
        print(title)
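The title extraction can be tried offline as well. The sketch below runs the same kind of non-greedy pattern against a made-up fragment; the `sample` string is an assumption that only imitates the `{"title":"...",` snippets the pattern targets, and a real Baidu page may look different.

```python
import re

# Made-up fragment imitating the JSON-like snippets embedded in a result page.
sample = (
    '{"title":"First result","url":"http://a.example"},'
    '{"title":"Second result","url":"http://b.example"}'
)

# The non-greedy (.*?) stops at the first '",' after each title.
pattern = re.compile(r'\{"title":"(.*?)",')
titles = pattern.findall(sample)

for title in titles:
    print(title)
```

With a greedy `(.*)` the match would run to the last `",` in the text and swallow everything in between, which is why the pattern uses `(.*?)`.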

Output of the GET example: (screenshot not preserved)

3. POST in practice (to be continued!!!). Code example:

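# POST request in practice
import urllib.request
# urllib.parse is needed to encode the form fields
import urllib.parse
# Note: this is a local file:// URL, which cannot process posted data;
# replace it with the address the form actually submits to
posturl = "file:///E:/EditPlus/java-web%E5%9F%BA%E7%A1%80%E6%95%99%E7%A8%8B/%E7%AC%AC%E4%B8%80%E7%AB%A0/%E6%A1%88%E4%BE%8B/htmlDemo5.html"
# The keys must match the name attributes of the form's input fields
postdata = urllib.parse.urlencode({
    "name": "[email protected]",
    "passwd": "123456"
    }).encode("utf-8")
# A POST is built with Request from urllib.request
# (real POST address, POST data)
req = urllib.request.Request(posturl,postdata)
# Fetch the response page
rst = urllib.request.urlopen(req).read().decode("utf-8")
# Save the response to a local file
with open("E:\\Pythondemo\\Python-test\\PythonLX\\post.html","w",encoding="utf-8") as fh:
    fh.write(rst)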
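Since a POST needs a server that reads the submitted form, here is a self-contained sketch that starts a throwaway local HTTP server and posts to it. Everything in it is an assumption made for illustration: the `EchoFormHandler` class, the `/login` path, and the field values are hypothetical stand-ins, not the form used in this post.

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoFormHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a login endpoint: echoes the posted fields."""

    def do_POST(self):
        # Read exactly as many bytes as the client announced.
        length = int(self.headers.get("Content-Length", 0))
        fields = urllib.parse.parse_qs(self.rfile.read(length).decode("utf-8"))
        reply = "name={} passwd={}".format(fields["name"][0], fields["passwd"][0])
        payload = reply.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def post_form(url, fields):
    """URL-encode the fields, POST them, and return the decoded response."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    req = urllib.request.Request(url, data)
    # An empty ProxyHandler bypasses any system proxy for this local demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req) as resp:
        return resp.read().decode("utf-8")


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), EchoFormHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:{}/login".format(server.server_port)
result = post_form(url, {"name": "alice", "passwd": "123456"})

server.shutdown()
server.server_close()
print(result)
```

Against a real site, `url` would be the address in the form's `action` attribute and the dictionary keys would be the `name` attributes of its input fields.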

菲神blog
