problem0008: Find the main text in HTML
Problem 0008: Given an HTML file, extract its main text.
- Download the page content from a URL
- Parse the text with BeautifulSoup
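Since the problem statement talks about an HTML file, the parsing step can also be tried without any network access. The following is a minimal sketch under a few assumptions: `extract_text_from_file` is a hypothetical helper name, the file is assumed to be UTF-8 encoded, and the built-in `html.parser` backend is used in place of lxml so that no extra parser has to be installed.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def extract_text_from_file(path: str, encoding: str = "utf-8") -> str:
    """Return the visible text of a local HTML file (hypothetical helper)."""
    html = Path(path).read_text(encoding=encoding)
    # 'html.parser' ships with Python, so lxml is not required here.
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose content is never shown to the reader.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # One text fragment per line, with surrounding whitespace stripped.
    return soup.get_text(separator="\n", strip=True)
```

Calling `extract_text_from_file("page.html")` on a saved page should return its text one fragment per line. Note that this keeps navigation, footer, and other visible text as well; it does not try to guess which part of the page is the article body.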
demo:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import requests


def get_html():
    url = 'https://www.toutiao.com/a6485236648832401933/'
    # Send a browser-like User-Agent with the request.
    headers = {'user-agent':
               'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    # Remove <script> and <style> elements so only visible text remains.
    for tag in soup.find_all(['script', 'style']):
        tag.extract()
    # return soup.prettify()
    return soup.get_text()


if __name__ == '__main__':
    print(get_html())
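Called without arguments, `get_text()` keeps the whitespace of the original markup, so the printed result often contains indentation and long runs of blank lines. A small post-processing step can tidy that up. This is only a sketch: `clean_text` is a hypothetical helper, not part of BeautifulSoup, and the sample string is made up for illustration.

```python
def clean_text(raw: str) -> str:
    """Strip each line and drop the empty ones (hypothetical helper)."""
    lines = (line.strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample resembling raw get_text() output.
sample = "\n\n  Title \n\n\n\tFirst paragraph.\n   \nSecond paragraph.\n"
print(clean_text(sample))
```

With the demo above, the same helper could be applied as `print(clean_text(get_html()))`.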
Reference: "Web page scraping: how to extract the main text content" (BeautifulSoup output)