Using a Python crawler plus front-end skills to build a read-and-look-up app for The Economist (010)

010. Crawl the latest articles listed on The Economist's front page with Python and archive them as local files

First, a quick review of the function that fetches the latest article list [[a, title], ...] from the home page:

import urllib.request
from lxml import etree

# Request headers so the crawl looks like a normal browser visit
# (the exact headers dict is an assumption; any common desktop UA works)
headers = {'User-Agent': 'Mozilla/5.0'}

def getPaperList():
    # Fetch the home page and parse it with lxml
    url = 'https://economist.com'
    req = urllib.request.Request(url=url, headers=headers, method='GET')
    response = urllib.request.urlopen(req)
    html = response.read()
    selector = etree.HTML(html.decode('utf-8'))
    # XPath down to the <li> items of the "latest" list on the front page
    goodpath = '/html/body/div[1]/div[1]/div[1]/div[2]/div[1]/main[1]/div[1]/div[1]/div[1]/div[3]/ul[1]/li'
    art = selector.xpath(goodpath)
    awithtext = []
    try:
        for li in art:
            # Each <li> wraps an <article> holding a link and a headline
            ap = li.xpath('article[1]/a[1]/div[1]/h3[1]/text()')
            a = li.xpath('article[1]/a[1]/@href')
            awithtext.append([a[0], ap[0]])
    except Exception as err:
        print(err, 'getPaperList')
    finally:
        return awithtext
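As a quick sanity check, the function can be run on its own; the href/title pair in the comment below is taken from the spiRecords.json example later in this post, and assumes the front page still serves server-rendered HTML matching the XPath above:

papers = getPaperList()
print(len(papers))   # number of items in the front page's "latest" list
print(papers[0])
# e.g. ['/blogs/graphicdetail/2018/04/daily-chart-18', 'Success is on the cards for Nintendo']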

1. Next, analyze the HTML structure of the article pages to be crawled

[Figure: screenshot of an Economist article page with annotations 1-4 marked on its HTML structure]

The annotations in the figure above are:
1. flytitle-and-title__flytitle
2. the real title
3. the description
4. the <p> elements at the same DOM level, which together form the body paragraphs of the article (collected into an XPath map right after this list)
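To make that mapping explicit, here is a small sketch collecting in one place the relative XPath expressions that the functions below use for each annotation (ANNOTATION_XPATHS is an illustrative name, not part of the original script):

# all paths are relative to the <article> element returned by getPaper()
ANNOTATION_XPATHS = {
    'flytitle_and_title': 'h1/span',                # annotations 1 and 2: two <span>s inside <h1>
    'description': 'p[1]/text()',                   # annotation 3: the first <p> under <article>
    'body_paragraphs': 'div[1]/div[3]/p/text()',    # annotation 4: the body paragraphs
}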

2. Crawl the article content:

def getPaper(url):
    # Fetch one article page and return the <article> element(s)
    req = urllib.request.Request(url=url, headers=headers, method='GET')
    response = urllib.request.urlopen(req)
    html = response.read()
    selector = etree.HTML(html.decode('utf-8'))
    # XPath down to the <article> node that wraps the whole story
    goodpath = '/html/body/div[1]/div[1]/div[1]/div[2]/div[1]/main[1]/div[1]//div[2]/div[1]/article'
    article = selector.xpath(goodpath)
    return article

3. Extract the information for annotations 1, 2 and 3, giving [1, 2, 3]:

def getHeadline(article):
    # Returns [flytitle, real title, description] for one article element
    headline = []
    try:
        # Annotations 1 and 2 live in the <span> children of the <h1>
        h1 = article[0].xpath('h1/span')
        for item in h1:
            headline.append(item.text)
        # Annotation 3: the first <p> holds the description
        p1 = article[0].xpath('p[1]/text()')
        headline.append(p1[0])
    except Exception as err:
        print(err, 'getHeadline')
    finally:
        return headline
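For illustration, with the Nintendo article that shows up in the crawl record later in this post, the call goes roughly like this (the flytitle and description values are placeholders rather than the real page text):

article = getPaper('https://economist.com/blogs/graphicdetail/2018/04/daily-chart-18')
headline = getHeadline(article)
# headline[0] -> the flytitle                            (annotation 1)
# headline[1] -> 'Success is on the cards for Nintendo'  (annotation 2, the real title)
# headline[2] -> the description                         (annotation 3)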

4. Extract the article body paragraphs p = [p, p, p, ...]:

def getContent(article):
    # Collects the body paragraphs (annotation 4) as a list of strings
    parr = []
    try:
        p = article[0].xpath('div[1]/div[3]/p/text()')
        for i in p:
            print(i)
            parr.append(i + '\n')
    except Exception as err:
        print(err, 'getContent')
    finally:
        return parr

5. Crawler, the stage is yours

if __name__ == '__main__':
    # os, time and json are imported at the top of the script; todayDate,
    # strY/strM/strD, lastLst and paperRecords come from the archive setup
    # explained in section 6 below.
    linkArr = getPaperList()
    time.sleep(10)
    tmpLast = []
    toDayDir = './mds/' + todayDate + '/papers/'

    # Create today's archive directory on the first run of the day
    if not os.path.exists(toDayDir):
        os.makedirs(toDayDir)
    for item in linkArr:
        # Skip anything that was already crawled last time
        if item[0] not in lastLst:
            tmpLast.append(item[0])
            url = 'https://economist.com' + item[0]
            article = getPaper(url)
            headLine = getHeadline(article)
            try:
                # Record [href, real title] under today's date
                paperRecords[strY][strM][strD].append([item[0], headLine[1]])
                content = getContent(article)
                # File name: the title with spaces replaced by underscores
                paperName = '_'.join(item[1].split(' '))
                saveMd = toDayDir + paperName + '.md'
                # The saved markdown is title + description followed by the paragraphs
                result = headLine[1:]
                result.extend(content)
                output = '\n'.join(result)
                with open(saveMd, 'w') as fw:
                    fw.write(output)
                # Be polite: wait a while between article requests
                time.sleep(10)
            except Exception as err:
                print(err)

    # Remember this run's hrefs so the next run can skip them
    paperRecords['lastLst'] = tmpLast
    with open('spiRecords.json', 'w') as fwp:
        json.dump(paperRecords, fwp)

6. Notes on some of the data structures used in step 5:

First, the archive directory structure:

mds/2018_04_29/papers   # the dated papers directory is created when the results are generated
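With the Nintendo article from the record below as an example, the tree would look roughly like this (file names come from the title with spaces replaced by underscores, as in step 5):

mds/
  2018_04_29/
    papers/
      Success_is_on_the_cards_for_Nintendo.md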

Next, the structure of the JSON file that stores the crawl record; an example says it best:

{"a2018":
 {"a4": 
     {"a29":
      [
      ["/blogs/graphicdetail/2018/04/daily-chart-18", "Success is on the cards for Nintendo"]
      ]
      }
      }, 
      "lastLst": 
      ["/blogs/graphicdetail/2018/04/daily-chart-18","/blogs/buttonwood/2018/04/affording-retirement"]
}

lastLst is saved so that articles are not crawled twice. You could instead walk the whole record and weed out duplicates, but that costs more, takes a long stretch of extra code, and brings no obvious benefit.
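Step 5 also references todayDate, strY/strM/strD, lastLst and paperRecords without showing how they are set up. A minimal sketch of that setup, assuming the JSON layout above (the 'a' prefix keeps the year/month/day keys as plain strings):

import json, os
from datetime import date

today = date.today()
todayDate = today.strftime('%Y_%m_%d')   # e.g. '2018_04_29', used for the mds/ directory
strY = 'a' + str(today.year)             # 'a2018'
strM = 'a' + str(today.month)            # 'a4'
strD = 'a' + str(today.day)              # 'a29'

# load the previous crawl record, or start a fresh one
if os.path.exists('spiRecords.json'):
    with open('spiRecords.json') as fr:
        paperRecords = json.load(fr)
else:
    paperRecords = {'lastLst': []}

lastLst = paperRecords.get('lastLst', [])
# make sure today's nesting exists before step 5 appends to it
paperRecords.setdefault(strY, {}).setdefault(strM, {}).setdefault(strD, [])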

That wraps up crawling the articles; the next post will cover how the words in the articles are de-duplicated.

Finally, on to today's read-and-look-up article segment:

[Screenshots of the app showing today's articles]

Reposted from blog.csdn.net/lockey23/article/details/80143348