Recently I've been learning about community detection in graphs and how to implement it in Python. The goal: crawl the 5,000-plus articles from our company intranet's "technical elite competition" (技术精英大赛), segment the text into words, compute pairwise similarity, and output the result as a graph. Until that output exists we have no idea how similar the articles actually are, so everything below works toward that goal.
This post covers the first step: scraping the data.
The target site's data source is shown below.
- There are roughly 5,100 articles, but a single page shows only about 30; scrolling down reveals a "浏览更多" (load more) button.
Open Firefox, launch the Web Developer console, and watch the raw requests. The plan is to reuse Firefox's cookies to simulate a logged-in session.
Clicking "全部帖子" (all posts) fires a plain GET request. It carries four cookies, but testing shows only the last one, JSESSIONID, actually matters (if you're lazy, just send all four). That handles the default first page; next, click "浏览更多" (load more).
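The cookie-reuse idea can be sketched on its own. The session id below is a placeholder; the real value is copied out of Firefox's storage inspector. Preparing (without sending) a request shows how the cookie dict becomes an ordinary Cookie header:

```python
import requests

# JSESSIONID copied out of Firefox's devtools; this value is a placeholder
COOKIES = {'JSESSIONID': 'PLACEHOLDER_SESSION_ID'}

# Prepare the request without sending it, to inspect what goes on the wire
req = requests.Request('GET', 'http://eip.teamshub.com/', cookies=COOKIES).prepare()
print(req.headers['Cookie'])  # JSESSIONID=PLACEHOLDER_SESSION_ID
```

With a valid id, `requests.get(url, cookies=COOKIES)` piggybacks on the already-authenticated browser session, so no login code is needed.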
That request carries many more parameters. Their rough meaning: the maximum and minimum post IDs on the current page, the current page number, the group (圈子) ID, and the total page count. At first I tried tracking the page number by hand and computing the max/min IDs myself, but then realized the POSTed data must come from a form. Going back to the page source, I found all of these values as hidden fields in the response.
With every value located, one question remained: how to detect the last page? Pages that aren't the last have a "浏览更多" button; the last page doesn't. Inspecting the markup of an actual last page showed that there PAGINATION_TARGET_PAGE_NO is set to -1.
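As a self-contained sketch (the field ids are the ones found above; the sample HTML is fabricated), the hidden inputs can be pulled out with BeautifulSoup, and a target page of -1 signals the last page:

```python
from bs4 import BeautifulSoup

PAGINATION_FIELDS = ['PAGINATION_CURRENT_PAGE_MIN_ID', 'PAGINATION_CURRENT_PAGE_MAX_ID',
                     'PAGINATION_CURRENT_PAGE_NO', 'PAGINATION_TARGET_PAGE_NO', 'pagecount']

def pagination_state(html):
    # Read each hidden <input>'s value attribute by its id
    soup = BeautifulSoup(html, 'html.parser')
    return {f: soup.find(id=f)['value'] for f in PAGINATION_FIELDS}

# Fabricated last-page response: the server sets the target page number to -1
sample = ''.join('<input type="hidden" id="%s" value="%s"/>' % (f, v)
                 for f, v in zip(PAGINATION_FIELDS, ['90001', '90030', '169', '-1', '169']))
state = pagination_state(sample)
print(int(state['PAGINATION_TARGET_PAGE_NO']) == -1)  # True: this was the last page
```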
That settles the logic, from paging through to detecting the last page. Here's the code:
#python 3
# encoding: utf-8
import sys
import io
import requests
from bs4 import BeautifulSoup

# Force stdout to UTF-8 so Chinese titles print correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

# Shared request headers, copied from the Firefox capture
HEADERS = {
    'Accept': 'text/plain, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'eip.teamshub.com',
    'Referer': 'http://eip.teamshub.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:61.0) Gecko/20100101 Firefox/61.0'
}
# Session cookie lifted from Firefox; JSESSIONID is the only one that matters
COOKIES = {'JSESSIONID': 'BE30CEC4A6A627660E088017B92AE337'}

def getrequest(url):
    rs = requests.get(url, headers=HEADERS, cookies=COOKIES, verify=False)
    return rs.text

def getpost(url, postdata):
    rs = requests.post(url, data=postdata, headers=HEADERS, cookies=COOKIES, verify=False)
    return rs.text

def htmlparse(rs):
    soup = BeautifulSoup(rs, 'html.parser')
    # Article links and their title rows are only identifiable by inline styles
    links = soup.find_all('a', style="color:#454545;")
    titles = soup.find_all(attrs={'style': 'width:100%;line-height:20px'})
    return links, titles

def getinfo(rstext):
    # The pagination state is kept in hidden form fields of each response
    soup = BeautifulSoup(rstext, 'html.parser')
    max_id = soup.find(id='PAGINATION_CURRENT_PAGE_MAX_ID')["value"]
    min_id = soup.find(id='PAGINATION_CURRENT_PAGE_MIN_ID')["value"]
    page_no = soup.find(id='PAGINATION_CURRENT_PAGE_NO')["value"]
    target_no = soup.find(id='PAGINATION_TARGET_PAGE_NO')["value"]
    pagecount = soup.find(id='pagecount')["value"]
    return min_id, max_id, page_no, target_no, pagecount

def savelist(links, titles, mode):
    # One tab-separated line per article: title, href, link text
    with open('urllist.txt', mode, encoding='utf-8') as out:
        for link, title in zip(links, titles):
            out.write('\t%s\t%s\t%s\n' % (title.get_text(), link['href'], link.get_text()))

######################################## main ########################################
rstext = getrequest("http://eip.teamshub.com/sctbt/70171506/0/0?t=1531901903918")
links, titles = htmlparse(rstext)
savelist(links, titles, 'w')

# Walk the "浏览更多" (load more) pages until the server reports the last one
min_id, max_id, page_no, target_no, pagecount = getinfo(rstext)
print(min_id, max_id, page_no, target_no, pagecount)
while int(target_no) > 0:  # the last page comes back with target page -1
    url = ("http://eip.teamshub.com/qtlt"
           "?PAGINATION_CURRENT_PAGE_NO=" + page_no +
           "&PAGINATION_TARGET_PAGE_NO=" + target_no +
           "&PAGINATION_CURRENT_PAGE_MAX_ID=" + max_id +
           "&PAGINATION_CURRENT_PAGE_MIN_ID=" + min_id)
    data = {'groupId': '70171506', 'pageNo': target_no, 'typeId': '0'}
    print(url)
    rstext = getpost(url, data)
    min_id, max_id, page_no, target_no, pagecount = getinfo(rstext)
    links, titles = htmlparse(rstext)
    savelist(links, titles, 'a')
BeautifulSoup does the response parsing here. The generated result looks like this:
Everything came out: roughly 5,070 articles. The next step is fetching each article from its URL.
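As a preview of that next step, the url list can be read back into (title, href, link text) tuples; the column layout matches what the crawler writes, and the demo row below is fabricated:

```python
# Each crawled line is tab-separated: (leading tab) title, href, link text
def read_urllist(path):
    articles = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            if len(parts) >= 4:  # the leading tab yields an empty first field
                articles.append((parts[1], parts[2], parts[3]))
    return articles

# Self-contained demo with one fabricated row; the real file comes from the crawl
with open('urllist_demo.txt', 'w', encoding='utf-8') as f:
    f.write('\tSample title\t/tie/123456\tSample title\n')
print(read_urllist('urllist_demo.txt'))  # [('Sample title', '/tie/123456', 'Sample title')]
```

Each recovered href can then be fetched with the same getrequest helper from the crawler above.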