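Python Exercise Book (8): Finding the Main Text

problem0008: finding the main text in HTML

Problem 0008: Given an HTML file, find the main text inside it.

  • Download the page content from a URL
  • Parse the text with BeautifulSoup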

demo:

#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup


def get_html():
    url = 'https://www.toutiao.com/a6485236648832401933/'
    # Send a browser-like User-Agent; adjacent string literals are joined,
    # so no stray whitespace ends up inside the header value.
    headers = {'user-agent':
               'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(response.text, 'lxml')
    # Remove <script> and <style> elements so only visible text remains.
    for tag in soup.find_all(['script', 'style']):
        tag.extract()
    # return soup.prettify()
    return soup.get_text()


if __name__ == '__main__':
    print(get_html())
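
The problem statement talks about an HTML file, while the demo fetches a live page. As a rough sketch of the same idea (drop script and style, keep the text) on an HTML string, the version below uses only the standard library's html.parser instead of BeautifulSoup, so it runs offline without third-party packages. This is a swapped-in technique, not what the demo uses; the names TextExtractor, extract_text and SAMPLE are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []     # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script/style and not pure whitespace.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample input standing in for the contents of an HTML file.
SAMPLE = """<html><head><title>Demo</title>
<style>p { color: red; }</style>
<script>var x = 1;</script></head>
<body><h1>Heading</h1><p>First &amp; second.</p></body></html>"""

if __name__ == "__main__":
    print(extract_text(SAMPLE))
```

To run it on a real file, read the file into a string first (for example with open(path, encoding='utf-8').read(), where path is whatever file you have) and pass that string to extract_text. Like the demo, this returns all text on the page, including the title and any navigation text, so it is only a first step toward isolating the actual article body.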

References: "Web page content scraping: how to extract the main text"; "BeautifulSoup output"

Result:

[Screenshot of the script's output]

Reposted from blog.csdn.net/qq_30650153/article/details/80866960