010. Scraping The Economist's latest article list with Python and archiving it as local files
First, a review of fetching the latest article list from the homepage, returned as [[a, title], …]:
# imports and request headers come from earlier posts in this series;
# headers is assumed to be a dict carrying at least a browser User-Agent
import urllib.request
from lxml import etree

def getPaperList():
    url = 'https://economist.com'
    req = urllib.request.Request(url=url, headers=headers, method='GET')
    response = urllib.request.urlopen(req)
    html = response.read()
    selector = etree.HTML(html.decode('utf-8'))
    goodpath = '/html/body/div[1]/div[1]/div[1]/div[2]/div[1]/main[1]/div[1]/div[1]/div[1]/div[3]/ul[1]/li'
    art = selector.xpath(goodpath)
    awithtext = []
    try:
        for li in art:
            ap = li.xpath('article[1]/a[1]/div[1]/h3[1]/text()')   # article title text
            a = li.xpath('article[1]/a[1]/@href')                  # relative link
            awithtext.append([a[0], ap[0]])
    except Exception as err:
        print(err, 'getPaperList')
    finally:
        return awithtext
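For reference, using the sample data from the records file in step 6, the return value holds one [href, title] pair per list item:

# e.g.
# [['/blogs/graphicdetail/2018/04/daily-chart-18', 'Success is on the cards for Nintendo'],
#  ['/blogs/buttonwood/2018/04/affording-retirement', '…'],
#  …]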
1. Next, analyze the HTML structure of the articles to crawl
The numbered labels in the screenshot above are:
1. flytitle-and-title__flytitle
2. the real title
3. the description
4. the <p> elements at the same DOM level, which hold the article's body paragraphs
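Since the screenshot cannot show it, here is a tiny, hypothetical fragment that mirrors labels 1-4 and lets you test-drive the XPath expressions used in steps 3 and 4 (the real page has far more wrapper divs):

from lxml import etree

sample = etree.HTML('''
<article>
  <h1><span>Flytitle</span><span>Real title</span></h1>
  <p>Description</p>
  <div>
    <div></div><div></div>
    <div><p>Body paragraph.</p></div>
  </div>
</article>
''')
art = sample.xpath('//article')
print([s.text for s in art[0].xpath('h1/span')])   # ['Flytitle', 'Real title']  -> labels 1, 2
print(art[0].xpath('p[1]/text()'))                 # ['Description']             -> label 3
print(art[0].xpath('div[1]/div[3]/p/text()'))      # ['Body paragraph.']         -> label 4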
2. Crawl the article content:
def getPaper(url):
    req = urllib.request.Request(url=url, headers=headers, method='GET')
    response = urllib.request.urlopen(req)
    html = response.read()
    selector = etree.HTML(html.decode('utf-8'))
    goodpath = '/html/body/div[1]/div[1]/div[1]/div[2]/div[1]/main[1]/div[1]//div[2]/div[1]/article'
    article = selector.xpath(goodpath)
    return article
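A quick sanity check, using one of the article paths from the records in step 6:

article = getPaper('https://economist.com/blogs/graphicdetail/2018/04/daily-chart-18')
print(len(article))   # 1 if the XPath located the <article> node, 0 if the layout changed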
3. Extract the information for labels 1, 2 and 3 as [1, 2, 3]:
def getHeadline(article):
    headline = []
    try:
        h1 = article[0].xpath('h1/span')       # flytitle and real title spans
        for item in h1:
            headline.append(item.text)
        p1 = article[0].xpath('p[1]/text()')   # description
        headline.append(p1[0])
    except Exception as err:
        print(err, 'getHeadline')
    finally:
        return headline
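The indices of the returned list line up with the labels from step 1; for the sample article the result would look roughly like this (the flytitle and description here are placeholders):

headLine = getHeadline(article)
# e.g. ['<flytitle>', 'Success is on the cards for Nintendo', '<description>']
#        label 1       label 2 (real title)                    label 3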
4. Extract the body paragraphs as p = [p, p, p, …]:
def getContent(article):
    parr = []
    try:
        p = article[0].xpath('div[1]/div[3]/p/text()')
        for i in p:
            print(i)
            parr.append(i + '\n')
    except Exception as err:
        print(err, 'getContent')
    finally:
        return parr
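One caveat with p/text(): it only returns the direct text nodes, so words wrapped in inline tags such as <a> or <i> inside a paragraph get dropped. A variant using lxml's itertext() keeps them; a sketch under the same DOM assumptions:

def getContentFull(article):
    # same XPath as getContent, but joins all text inside each <p>,
    # including text nested in inline tags like <a> and <i>
    parr = []
    for p in article[0].xpath('div[1]/div[3]/p'):
        parr.append(''.join(p.itertext()) + '\n')
    return parr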
5. Now let the spider take the stage:
if __name__ == '__main__':
    linkArr = getPaperList()
    time.sleep(10)
    tmpLast = []
    toDayDir = './mds/' + todayDate + '/papers/'
    if not os.path.exists(toDayDir):
        os.makedirs(toDayDir)
    for item in linkArr:
        if item[0] not in lastLst:          # skip links crawled last time
            tmpLast.append(item[0])
            url = 'https://economist.com' + item[0]
            article = getPaper(url)
            headLine = getHeadline(article)
            try:
                paperRecords[strY][strM][strD].append([item[0], headLine[1]])
                content = getContent(article)
                paperName = '_'.join(item[1].split(' '))
                saveMd = toDayDir + paperName + '.md'
                result = headLine[1:]       # keep the real title and description
                result.extend(content)
                output = '\n'.join(result)
                with open(saveMd, 'w') as fw:
                    fw.write(output)
                time.sleep(10)              # be polite between requests
            except Exception as err:
                print(err)
    paperRecords['lastLst'] = tmpLast
    with open('spiRecords.json', 'w') as fwp:
        json.dump(paperRecords, fwp)
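The main block leans on several names defined in earlier posts of this series. A minimal sketch of that setup, assuming the key naming from the JSON example below (years/months/days prefixed with 'a') and the mds/2018_04_29 date format:

import json, os, time, urllib.request
from datetime import date
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0'}    # assumed request headers

today = date.today()
strY = 'a' + str(today.year)               # e.g. 'a2018'
strM = 'a' + str(today.month)              # e.g. 'a4'
strD = 'a' + str(today.day)                # e.g. 'a29'
todayDate = today.strftime('%Y_%m_%d')     # e.g. '2018_04_29'

try:                                       # load previous records, or start fresh
    with open('spiRecords.json') as fr:
        paperRecords = json.load(fr)
except FileNotFoundError:
    paperRecords = {}
lastLst = paperRecords.get('lastLst', [])
paperRecords.setdefault(strY, {}).setdefault(strM, {}).setdefault(strD, [])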
6. Notes on some of the data structures used in step 5:
First, the archive directory layout:
mds/2018_04_29/papers    # the dated papers directories are created when results are generated
Next, the structure of the JSON file holding the crawl records; an example is the quickest explanation:
{"a2018":
{"a4":
{"a29":
[
["/blogs/graphicdetail/2018/04/daily-chart-18", "Success is on the cards for Nintendo"]
]
}
},
"lastLst":
["/blogs/graphicdetail/2018/04/daily-chart-18","/blogs/buttonwood/2018/04/affording-retirement"]
}
lastLst is saved so the same articles are not crawled twice. You could instead walk all of the stored records to weed out duplicates, but that costs more, takes a long stretch of code, and brings no obvious benefit.
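For completeness, that traversal alternative would look something like this sketch, rebuilding the full set of seen links from every dated record:

def collectSeenLinks(paperRecords):
    seen = set()
    for yKey, months in paperRecords.items():
        if yKey == 'lastLst':                 # skip the non-date key
            continue
        for days in months.values():
            for records in days.values():
                for link, _title in records:
                    seen.add(link)
    return seen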
That wraps up crawling the articles. The next post will cover how to deduplicate the words collected from them.
Finally, on to today's look-up-while-reading article segment.