Goal:
Crawl all the PDFs linked from this page: https://github.com/THUNLP-MT/MT-Reading-List#syntax_based_models
Download all PDFs from a website:
#file-name: pdf_download.py
__author__ = 'rxread'
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def download_file(url, index):
    local_filename = index + "-" + url.split('/')[-1]
    # NOTE the stream=True parameter: the response body is read in
    # chunks instead of being loaded into memory all at once.
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    return local_filename

root_link = "https://github.com/THUNLP-MT/MT-Reading-List#syntax_based_models"
r = requests.get(root_link)
if r.status_code == 200:
    soup = BeautifulSoup(r.text, "html.parser")
    index = 1
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is None:  # some <a> tags carry no href at all
            continue
        # urljoin resolves relative hrefs against the page URL and leaves
        # absolute URLs (arXiv, ACL Anthology, ...) untouched; plain string
        # concatenation with root_link would mangle both kinds of link.
        new_link = urljoin(root_link, href)
        if new_link.endswith(".pdf"):
            file_path = download_file(new_link, str(index))
            print("downloading: " + new_link + " -> " + file_path)
            index += 1
    print("all downloads finished")
else:
    print("errors occurred.")
Download all the links on a website:
from urllib.request import urlopen  # fetch the page
from bs4 import BeautifulSoup       # parse the HTML

html = urlopen('https://github.com/THUNLP-MT/MT-Reading-List#syntax_based_models')
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))  # may print None for <a> tags without an href
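The loop above prints every href as-is, including page-internal anchors such as #syntax_based_models, relative paths, and the occasional None. Here is a short sketch of how one might narrow that output to absolute PDF URLs; the filtering step is my own addition, not something the referenced blogs do.

from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

page = 'https://github.com/THUNLP-MT/MT-Reading-List#syntax_based_models'
soup = BeautifulSoup(urlopen(page), 'html.parser')
# href=True makes find_all skip <a> tags without an href attribute.
pdf_links = [urljoin(page, a['href'])
             for a in soup.find_all('a', href=True)
             if a['href'].endswith('.pdf')]
for url in pdf_links:
    print(url)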
Reference blogs:
https://blog.csdn.net/bull521/article/details/83448781
https://blog.csdn.net/qq_35193302/article/details/83510213
http://blog.zanlabs.com/2014/11/11/python-webpage-crawling/
https://blog.csdn.net/baidu_28479651/article/details/76158051