Notes on the NetEase Cloud Classroom (网易云课堂) course "Python Web Scraping in Practice" (Python网络爬虫实战)

1. Send the request with requests, parse the page with BeautifulSoup

import requests
from bs4 import BeautifulSoup
url='http://news.sina.com.cn/c/nd/2018-04-17/doc-ifzfkmth5545198.shtml'
res=requests.get(url)
res.encoding='utf-8'    # set the encoding explicitly to avoid garbled text
soup=BeautifulSoup(res.text,'html.parser')    # parse the response body with the built-in HTML parser

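2. Scrape the details of a news article

Use the browser's developer tools to inspect the elements and structure of the page.

Use BeautifulSoup's select method to pick elements out of the page; it returns a list.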
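
To illustrate the list that select returns, here is a minimal sketch that parses a made-up HTML snippet rather than the live page. The class and id names mirror the ones used in these notes, but the markup itself is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page is far larger.
html = '''
<h1 class="main-title">Sample title</h1>
<div id="article"><p>First paragraph.</p><p>Second paragraph.</p></div>
'''
soup = BeautifulSoup(html, 'html.parser')

titles = soup.select('.main-title')      # class selector: list of matching tags
paragraphs = soup.select('#article p')   # p tags inside the element with id "article"
missing = soup.select('.no-such-class')  # no match: empty list, so [0] would raise IndexError

title_text = titles[0].text
paragraph_count = len(paragraphs)
```

Because the result is always a list, indexing with [0] is only safe when the selector is known to match at least one element.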

① Get the title

soup.select('.main-title')[0].text    # select returns a list of tags; index [0] takes the first match

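② Get the time

PS: On the page as it was in the video, the time had no class selector of its own. It sat inside a span tag under the element matched by .date-source, and is retrieved through contents.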

from datetime import datetime
time=soup.select('.date-source')[0].contents[1].text    # contents is the list of the tag's child nodes
dt=datetime.strptime(time,'%Y年%m月%d日%H:%M:%S')    # strptime parses the date string into a datetime
timesource=dt.strftime('%Y%m%d %H:%M:%S')    # strftime formats the datetime back into a string
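
The parse-then-format round trip can be tried without the network. This is a minimal sketch: parse_news_time is a hypothetical helper and the sample timestamp is invented to match the format string above, not taken from the page.

```python
from datetime import datetime

def parse_news_time(raw):
    """Parse a timestamp such as '2018年04月17日09:05:00' (hypothetical helper)."""
    # strip() guards against stray whitespace around the text
    return datetime.strptime(raw.strip(), '%Y年%m月%d日%H:%M:%S')

dt = parse_news_time(' 2018年04月17日09:05:00 ')   # sample value, not taken from the page
formatted = dt.strftime('%Y%m%d %H:%M:%S')          # '20180417 09:05:00'
```

If the string does not match the format exactly, strptime raises ValueError, so a change in the page's date layout shows up immediately.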

③ Get the editor

soup.select('.show_author')[0].text.replace('责任编辑：','',1)    # remove the label as a literal string; lstrip() would strip a set of characters instead
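
One pitfall when removing a label such as 责任编辑： is that str.lstrip() treats its argument as a set of characters, not as a prefix. The sketch below shows the difference on an invented editor name; remove_prefix is a hypothetical helper.

```python
label = '责任编辑：'

def remove_prefix(text, prefix):
    """Drop prefix only when text really starts with it (hypothetical helper)."""
    return text[len(prefix):] if text.startswith(prefix) else text

sample = '责任编辑：任某'               # made-up name that begins with a label character
stripped = sample.lstrip(label)        # '某': the leading 任 of the name is stripped too
exact = remove_prefix(sample, label)   # '任某': only the literal prefix is removed
```

On Python 3.9 and later, the built-in str.removeprefix() does the same job as the helper.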

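④ Get the article body

article=[]
for p in soup.select('#article p')[:-2]:    # p tags under the element with id "article"; the slice drops the last two
    article.append(p.text.strip())    # strip surrounding whitespace from each paragraph and add it to the list
''.join(article)    # join the list items to get the complete article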

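⑤ Get the comment count

The comment count comes from an API called by the page's JavaScript. Inspection shows that the API URL is related to the news URL, so the logic is wrapped in a function.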

import re
import json
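def getComment(url):    # url is the URL of the news article
    # the comment API URL as observed in the browser; the trailing {} is a placeholder for the news id
    commentforurl='http://comment5.news.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-{}'
    newsid=re.search(r'doc-i(.+)\.shtml',url).group(1)    # regular expressions need re; search() captures the newsid that links the two URLs
    #newsid=url.split('/')[-1].lstrip('doc-i').rstrip('.shtml')    # alternative with split() and strip; note that lstrip/rstrip remove character sets, not fixed prefixes or suffixes
    commenturl=requests.get(commentforurl.format(newsid))    # format() fills in the complete API URL
    jd=json.loads(commenturl.text.lstrip('jsonp_1523978275766(').rstrip(')'))    # strip the JSONP wrapper, then json.loads() turns the string into a dict
    comment=jd['result']['count']['total']
    return comment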
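
The two string-handling steps in getComment can be checked offline. This is a minimal sketch, not the course's code: extract_newsid and unwrap_jsonp are hypothetical helpers, and the JSONP payload is made up to mirror the keys read above.

```python
import json
import re

def extract_newsid(url):
    """Return the id between 'doc-i' and '.shtml', or None if the URL has another shape."""
    m = re.search(r'doc-i(.+)\.shtml', url)
    return m.group(1) if m else None

def unwrap_jsonp(text):
    """Parse 'callback({...})' or plain JSON into a Python object (hypothetical helper)."""
    # Match an optional callback name, then capture everything inside the outer parentheses.
    m = re.search(r'^[^({\[]*\((.*)\)\s*;?\s*$', text.strip(), re.S)
    return json.loads(m.group(1) if m else text)

newsid = extract_newsid('http://news.sina.com.cn/c/nd/2018-04-17/doc-ifzfkmth5545198.shtml')
data = unwrap_jsonp('jsonp_1523978275766({"result": {"count": {"total": 42}}})')  # made-up payload
total = data['result']['count']['total']
```

Matching the wrapper with a regular expression avoids depending on the exact callback name, which may differ between requests.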

3. Combine the steps into one function that stores the title, time, editor, article body, and comment count in a dict

def getNewsDetail(url):
    res=requests.get(url)
    res.encoding='utf-8'
    result={}
    soup=BeautifulSoup(res.text,'html.parser')
    result['title']=soup.select('.main-title')[0].text
    time=soup.select('.date-source')[0].contents[1].text
    dt=datetime.strptime(time,'%Y年%m月%d日%H:%M:%S')
    result['time']=dt.strftime('%Y%m%d %H:%M:%S')
    result['editor']=soup.select('.show_author')[0].text.replace('责任编辑：','',1)    # remove the label as a literal string
    article=[]
    for p in soup.select('#article p')[:-2]:
        article.append(p.text.strip())
    result['article']= ''.join(article)
    result['comment']=getComment(url)
    return result

4. Pass each link in the news list to the detail function and save the returned details in a list

def getLinklist(url):    # url is the URL of one page of the news list
    res=requests.get(url)
    jd=json.loads(res.text.lstrip('  newsloadercallback(').rstrip(');'))
    linkdetaillist=[]
    for links in jd['result']['data']:    # iterate over the news entries in the list
        linkdetaillist.append(getNewsDetail(links['url']))    # fetch the details for each link and save the result in the list
    return linkdetaillist
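
If some linked pages use a different layout (an assumption, not something these notes cover), a single failing page would abort the whole loop. The sketch below shows one way to skip such pages; collect_details and fake_detail are hypothetical names, and fake_detail stands in for getNewsDetail so the example runs without network access.

```python
def collect_details(links, fetch_detail):
    """Call fetch_detail on every link, skipping pages that fail to parse."""
    details, skipped = [], []
    for link in links:
        try:
            details.append(fetch_detail(link))
        except (IndexError, AttributeError, ValueError) as exc:
            # IndexError: select(...)[0] on an empty list
            # AttributeError: re.search() returned None
            # ValueError: strptime mismatch or invalid JSON
            skipped.append((link, repr(exc)))
    return details, skipped

def fake_detail(link):   # stand-in for getNewsDetail, no network access
    if 'video' in link:
        raise IndexError('no .main-title on this page')
    return {'url': link}

details, skipped = collect_details(['a.shtml', 'video/b.shtml', 'c.shtml'], fake_detail)
```

Keeping the skipped links alongside the error text makes it easy to inspect those pages afterwards.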

5. Collect the news lists scraped page by page into one list, then use pandas to save it as an Excel file

import pandas
url='http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page={}'
news_total=[]
for i in range(1,3):
    newsurl=url.format(i)
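    newsary=getLinklist(newsurl)    # scrape every article linked from this list page
    news_total.extend(newsary)    # extend() adds each item of newsary to the list
df=pandas.DataFrame(news_total)
df.to_excel('news.xlsx')    # saved in the current directory as news.xlsx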
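
The choice of extend() over append() matters for the shape of the final list. Here is a minimal sketch with invented stand-in data for two pages of results:

```python
page1 = [{'title': 'a'}, {'title': 'b'}]   # stand-ins for the list returned for one page
page2 = [{'title': 'c'}]

flat = []
nested = []
for page in (page1, page2):
    flat.extend(page)     # adds each dict individually
    nested.append(page)   # adds the whole page list as a single item

flat_len = len(flat)      # 3
nested_len = len(nested)  # 2
```

A flat list of dicts is the shape pandas.DataFrame turns into one row per article, with the dict keys as columns.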
