Python Daily Problem 010 - 代码天地

Category: Other, 2020-07-26 16:05:04

Given an HTML file, extract its main text and the links it contains.

Code

# coding: utf-8
import re

import requests
from goose3 import Goose
from goose3.text import StopWordsChinese

# URL of the page to analyze
url = 'https://www.freebuf.com/articles/network/244577.html'

# Extract the main text, using the Chinese stop-word list
def extract(url):
    g = Goose({'stopwords_class': StopWordsChinese})
    article = g.extract(url=url)
    return article.cleaned_text

# Extract URLs from the raw page source.
# Note: the pattern captures the scheme and host only, not the path.
def get_url(url):
    response = requests.get(url, timeout=10)
    # 'https?' matches both http:// and https://
    urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', response.text)
    return urls

if __name__ == '__main__':
    print(extract(url))
    print(get_url(url))
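
The problem statement speaks of an HTML file, while the code above fetches a live URL and scans the raw source with a regular expression. As a rough complement, here is a minimal sketch that works on an HTML string already in memory and reads links from the `href` attribute of `<a>` tags, using only the standard library. The names `LinkAndTextParser`, `parse_html` and `SAMPLE_HTML`, and the `example.org` base URL, are made up for illustration. The text extraction here simply gathers all visible text; it is a crude stand-in and does not reproduce what goose3 does to isolate the article body.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collect href values from <a> tags and visible text outside script/style."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url   # used to resolve relative links
        self.links = []
        self.text_parts = []
        self._skip_depth = 0       # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not script or style content
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_html(html, base_url=""):
    """Return (visible_text, links) for an HTML string."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.text_parts), parser.links


# Hypothetical sample document, standing in for the contents of an HTML file
SAMPLE_HTML = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body>
  <p>First paragraph with a <a href="/about">relative link</a>.</p>
  <p>Second paragraph with an <a href="http://example.com/page?id=1">absolute link</a>.</p>
  <script>var ignored = "https://ignored.example";</script>
</body></html>
"""

if __name__ == "__main__":
    text, links = parse_html(SAMPLE_HTML, base_url="https://example.org/")
    print(text)
    print(links)
```

To run this against a file on disk, read the file into a string first and pass it to `parse_html`. Unlike the regular expression, this approach keeps the full path and query string of each link and ignores URLs that only appear inside scripts.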

Extraction result

Reposted from www.cnblogs.com/CH42e/p/13380588.html
