I. What Is a Python Crawler Framework
Simply put, a Python crawler framework is a semi-finished crawler project. The implementations of common crawler features are written in advance and exposed through a set of interfaces; for each new crawler project, you only need to hand-write the small portion of code that actually varies, and call those interfaces as needed, to get a working crawler.
II. Common Python Crawler Frameworks
1. Scrapy
Scrapy is a relatively mature Python crawler framework: a fast, high-level information-crawling framework developed in Python that can efficiently crawl web pages and extract structured data from them.
Scrapy has a wide range of applications: crawler development, data mining, data monitoring, automated testing, and more.
Official site: Scrapy
For the official documentation, see the Scrapy Tutorial (Scrapy 0.24.6 docs).
Basic usage:
scrapy startproject tutorial  # create a new project
This command creates a tutorial directory with the following contents:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
Define an Item:
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
Write the spider:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # save the raw page body to a file named after the last URL path segment
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
Run the spider:
scrapy crawl dmoz
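As a side note, the filename used in parse() comes from splitting the response URL on "/" and taking the second-to-last segment (the last segment is empty because these URLs end with a slash). The logic can be checked with plain Python, no Scrapy required:

```python
# The spider names each saved file after the second-to-last URL segment
url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
filename = url.split("/")[-2]
print(filename)  # Books
```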
2. Crawley
Crawley is another crawler framework developed in Python; it aims to change the way people extract data from the internet. The official site is the Crawley Project.
Basic usage (see Crawley's Documentation):
To start a new project run
~$ crawley startproject [project_name]
~$ cd [project_name]
Write your Models
""" models.py """
from crawley.persistance import Entity, UrlEntity, Field, Unicode

class Package(Entity):
    # add your table fields here
    updated = Field(Unicode(255))
    package = Field(Unicode(255))
    description = Field(Unicode(255))
Write your Scrapers
""" crawlers.py """
from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor
from models import *

class pypiScraper(BaseScraper):
    # specify the urls that can be scraped by this class
    matching_urls = ["%"]

    def scrape(self, response):
        # getting the current document's url
        current_url = response.url
        # getting the html table
        table = response.html.xpath("/html/body/div[5]/div/div/div[3]/table")[0]
        # for rows 1 to n-1 (skip the header and footer rows)
        for tr in table[1:-1]:
            # obtaining the searched html inside the rows
            td_updated = tr[0]
            td_package = tr[1]
            package_link = td_package[0]
            td_description = tr[2]
            # storing data in the Packages table
            Package(updated=td_updated.text, package=package_link.text, description=td_description.text)

class pypiCrawler(BaseCrawler):
    # add your starting urls here
    start_urls = ["http://pypi.python.org/pypi"]
    # add your scraper classes here
    scrapers = [pypiScraper]
    # specify your maximum crawling depth level
    max_depth = 0
    # select your favourite HTML parsing tool
    extractor = XPathExtractor
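The scrape() method above skips the table's first and last rows and reads three cells from each remaining row. The same iteration pattern can be sketched with plain Python lists standing in for the lxml table rows (the data below is hypothetical, and no Crawley install is needed):

```python
# Nested lists mimic the <tr>/<td> rows the XPath query would return
rows = [
    ["Updated", "Package", "Description"],           # header row, skipped by [1:-1]
    ["2017-05-20", "requests", "HTTP for Humans"],
    ["2017-05-21", "lxml", "XML and HTML toolkit"],
    ["", "", ""],                                    # footer row, skipped
]

packages = []
for tr in rows[1:-1]:
    td_updated, td_package, td_description = tr[0], tr[1], tr[2]
    packages.append({"updated": td_updated,
                     "package": td_package,
                     "description": td_description})

print(len(packages))           # 2
print(packages[0]["package"])  # requests
```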
Configure your settings
""" settings.py """
import os

PATH = os.path.dirname(os.path.abspath(__file__))

# Don't change this unless you have renamed the project
PROJECT_NAME = "pypi"
PROJECT_ROOT = os.path.join(PATH, PROJECT_NAME)

DATABASE_ENGINE = 'sqlite'
DATABASE_NAME = 'pypi'
DATABASE_USER = ''
DATABASE_PASSWORD = ''
DATABASE_HOST = ''
DATABASE_PORT = ''

SHOW_DEBUG_INFO = True
Finally, just run the crawler
~$ crawley run
3. Portia
Portia is a crawler framework that lets users with no programming background scrape web pages visually. GitHub: scrapinghub/portia
You can also use the hosted web version of Portia directly at Login • Scrapinghub.
After filling in the required information, click "Create Project" and you can start crawling the site.
The visual interface makes it very convenient to configure a crawler.
4. newspaper
newspaper is a Python crawler framework for extracting news and articles and for content analysis. GitHub: codelucas/newspaper
Basic usage:
from newspaper import Article

url = 'http://news.163.com/17/0525/08/CL95029O0001875P.html'
a = Article(url, language='zh')
a.download()
a.parse()
print(a.title)
print(a.text)

Because language='zh' is set, the first run also builds jieba's prefix dictionary, so the console shows log lines such as:
Building prefix dict from C:\Python35\lib\site-packages\jieba\dict.txt ...
Dumping model to file cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
Loading model cost 1.6619999408721924 seconds.
Prefix dict has been built succesfully.
The two print calls then output the article's title and body text, in Chinese (a news report about a livestreamer who discovered a body while streaming):
南都讯 记者彭彬 5月23日下午四点半左右,直播平台主播吴权彼在直播上山采蘑菇时,意外发现一具腐烂多日、发黑发紫的尸体,吴权彼当即中断直播报警,1000多在线网友观看了这惊魂一幕。随后,警方到达现场,目前案件仍在进一步调查之中。
5. Python-goose
Python-goose can extract the following information:
• the main body of an article
• the article's main image
• any YouTube/Vimeo videos embedded in the article
• the meta description
• the meta tags
GitHub: grangier/python-goose