A Beginner's Python Scraper for a Baidu Baike Entry - 代码天地

Programming Languages 2018-05-18 22:48:54

This post scrapes the value attributes from the revision-history page of the Angelababy entry on Baidu Baike.

Fetching the page

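# -*- coding: utf-8 -*-
# Requires Python 3 (urllib.request replaces Python 2's urllib2).
import urllib.error
import urllib.request

page = 1
# Everything after '#' is a URL fragment, which is not sent to the server.
url = 'https://baike.baidu.com/historylist/Angelababy/1509275#page' + str(page)
try:
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    # read() returns the raw response body as bytes.
    print(response.read())
except urllib.error.URLError as e:
    # HTTPError (a URLError subclass) carries a status code.
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)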

Running the script prints the full HTML of the page, so the fetch works. The next step is to extract only the value attributes we want.

Extracting the target content

Every value we want sits in markup of the same shape. Each table row looks like this:


<tr>
    <td class="checkBox">
        <input type="checkbox" value="128140635">
    </td>
      .
      .
      .
</tr>

So we match it with the following regular expression:


pattern = re.compile('<tr>.*?<td class="checkBox">.*?<input.*?value="(.*?)">.*?</td>.*?</tr>',re.S)
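
To see what the pattern captures, here is a minimal sketch that runs it against a hand-written sample rather than the live page. The sample mirrors the row structure shown above. The second row and its value 128000001 are invented for illustration, and SAMPLE_HTML is a hypothetical name that does not appear in the original post.

```python
import re

# Hand-written sample; the second row and its value are made up for illustration.
SAMPLE_HTML = '''
<tr>
    <td class="checkBox">
        <input type="checkbox" value="128140635">
    </td>
    <td>first revision</td>
</tr>
<tr>
    <td class="checkBox">
        <input type="checkbox" value="128000001">
    </td>
    <td>second revision</td>
</tr>
'''

# re.S makes '.' match newlines, so one match can span a multi-line <tr> block.
pattern = re.compile(
    '<tr>.*?<td class="checkBox">.*?<input.*?value="(.*?)">.*?</td>.*?</tr>',
    re.S,
)

# With a single capture group, findall returns a list of the captured strings.
values = pattern.findall(SAMPLE_HTML)
print(values)
```

Each `.*?` is non-greedy, so a match stops at the first `</tr>` it can reach and every row yields its own value instead of the rows collapsing into one long match.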

The complete code:


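# -*- coding: utf-8 -*-
# Requires Python 3 (urllib.request replaces Python 2's urllib2).
import re
import urllib.error
import urllib.request

page = 1
# Everything after '#' is a URL fragment, which is not sent to the server.
url = 'https://baike.baidu.com/historylist/Angelababy/1509275#page' + str(page)
try:
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    # re.S lets '.' match newlines, since each <tr> block spans several lines.
    pattern = re.compile('<tr>.*?<td class="checkBox">.*?<input.*?value="(.*?)">.*?</td>.*?</tr>', re.S)
    items = re.findall(pattern, content)
    for item in items:
        print(item)
except urllib.error.URLError as e:
    # HTTPError (a URLError subclass) carries a status code.
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)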

The script prints each captured value on its own line.
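
Regular expressions can be brittle when attribute order or whitespace changes. As an alternative, and not the method used in this post, the same values could be collected with the standard library's html.parser. This is only a sketch under assumptions: the class CheckboxValueParser, the helper extract_values, and the sample markup are all hypothetical names and data introduced here for illustration.

```python
from html.parser import HTMLParser


class CheckboxValueParser(HTMLParser):
    """Collect value attributes of <input> tags inside <td class="checkBox">."""

    def __init__(self):
        super().__init__()
        self.in_checkbox_cell = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # Track whether the current cell is a checkbox cell.
            self.in_checkbox_cell = attrs.get("class") == "checkBox"
        elif tag == "input" and self.in_checkbox_cell and "value" in attrs:
            self.values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_checkbox_cell = False


def extract_values(html):
    """Return the checkbox values found in an HTML string, in document order."""
    parser = CheckboxValueParser()
    parser.feed(html)
    return parser.values


# Made-up sample: the second value and the text input are illustrative only.
sample = '''
<tr><td class="checkBox"><input type="checkbox" value="128140635"></td></tr>
<tr><td><input type="text" value="ignored"></td></tr>
<tr><td class="checkBox"><input value="128000001" type="checkbox"></td></tr>
'''
print(extract_values(sample))
```

The parser reads attributes by name, so it does not depend on value appearing last in the tag, and inputs outside a checkBox cell are skipped.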

Further reading

Cui Qingcai's personal blog (崔庆才的个人博客)

Reposted from blog.csdn.net/qq_40265501/article/details/80345634
