Scraping a Popular Novel with Python

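I have recently been learning to write web crawlers in Python, and this script is my beginner-level graduation piece. Without further ado: key points first, then the code.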

Using classes

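When defining a class, put object (lowercase) in the parentheses after the class name, e.g. class Foo(object). In Python 3 this is optional, since every class inherits from object anyway.
The first parameter of every method in a class must be self.
Variables shared by all methods of a class (instance attributes) can be initialized in the __init__ method as self.xxname.
To use a class defined in another file of the same project, import it with from xxx import xxClass.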
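
The points above can be sketched in a few lines. The class name ChapterCounter and its methods are made up for illustration and are not part of the crawler below.

```python
class ChapterCounter(object):  # hypothetical class, for illustration only
    def __init__(self):
        # Instance attribute, visible to every method through self
        self.count = 0

    def add(self, n=1):
        # Every method takes self as its first parameter
        self.count += n
        return self.count


counter = ChapterCounter()
counter.add()
counter.add(2)
print(counter.count)
```

If this class lived in a file named chapter_counter.py (again a hypothetical name), another file in the same project would import it with from chapter_counter import ChapterCounter.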

URL requests

urlopen(url).read() returns the page content as bytes; call .decode() with the page's encoding to get it as text.
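
A minimal sketch of the download-and-decode step. The helper name fetch_text is hypothetical, and the default of UTF-8 is an assumption; check which encoding the target page actually uses.

```python
from urllib import request


def fetch_text(url, encoding='utf-8'):
    """Download url and decode the raw bytes into a str."""
    # The with block closes the connection even if read() raises
    with request.urlopen(url) as response:
        raw = response.read()       # bytes, not str
    # The encoding is an assumption; a wrong one raises UnicodeDecodeError
    return raw.decode(encoding)
```

It would be called as fetch_text('http://www.seputu.com'), the same URL the crawler below uses.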

soup and lxml

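The argument to soup.select() can be a CSS selector.
The result of soup.select() can be converted to a string and used to construct a new soup object.
soup.select() and soup.find_all() return list-like results, so soup methods cannot be called on the result as a whole; take a single element out first (e.g. result[0]). Each element, like the result of find(), is a Tag that supports find(), find_all() and select() itself.
soup.tag returns the first tag with that name in the soup (for example, soup.p is the first <p>), not all of them.
soup.find_all() returns all matches, whereas soup.find() returns only the first one.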
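
A small sketch of these calls on a made-up HTML fragment. It uses Python's built-in 'html.parser' so that only bs4 needs to be installed; the crawler below passes 'lxml' instead.

```python
from bs4 import BeautifulSoup

# Made-up HTML, for illustration only
html = """
<div class="content-body">
  <p>first</p>
  <p>second</p>
</div>
<p>outside</p>
"""

soup = BeautifulSoup(html, 'html.parser')

first_p = soup.p                        # only the first <p> in the document
all_ps = soup.find_all('p')             # every <p>, as a list-like result
bodies = soup.select('.content-body')   # CSS selector, also list-like

body = bodies[0]                        # take one Tag out of the list
body_ps = body.find_all('p')            # a Tag can be searched directly

print(first_p.string, len(all_ps), [p.string for p in body_ps])
```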

File I/O

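To end a line, append the newline yourself: write(string + '\n').
Always pass three arguments when opening a file: the file name, the open mode, and the encoding (as encoding=...).
Close the file as soon as you are done with it.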
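
One way to follow all three points is a with block, which closes the file automatically even if an error occurs. This is a sketch with made-up content and a temporary file, not the output code used below.

```python
import os
import tempfile

lines = ['chapter one', 'chapter two']   # made-up content

# Write into a temporary directory so the sketch leaves no stray files
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# File name, mode and encoding; the with block closes the file on exit
with open(path, 'a', encoding='utf-8') as outfile:
    for line in lines:
        outfile.write(line + '\n')       # write() does not add a newline

with open(path, encoding='utf-8') as infile:
    content = infile.read()

print(content)
```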

xpath

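In most cases, copy the XPath directly (for example, from the browser's developer tools) and then adjust the copied expression as needed.

XPath can select nodes, node text, and attributes:

(1) An XPath that ends with a node name returns the matching nodes.

(2) An XPath that ends with /text() returns a list of the nodes' text contents.

(3) An XPath that ends with /@attr returns a list of attribute values.
To run an XPath query, call etree.HTML(html) and then call xpath(xpathString) on the result.
An XPath must not end with /.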
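
The three kinds of result can be compared side by side. The HTML below is a made-up fragment that only loosely imitates a table of contents; it is not the real page structure.

```python
from lxml import etree

# Made-up HTML, for illustration only
html = """
<div class="mulu"><ul>
  <li><a href="/ch1.html">Chapter 1</a></li>
  <li><a href="/ch2.html">Chapter 2</a></li>
</ul></div>
"""

tree = etree.HTML(html)

# (1) ends with a node name: a list of element objects
nodes = tree.xpath("//div[@class='mulu']/ul/li/a")
# (2) ends with /text(): a list of text strings
titles = tree.xpath("//div[@class='mulu']/ul/li/a/text()")
# (3) ends with /@href: a list of attribute values
hrefs = tree.xpath("//div[@class='mulu']/ul/li/a/@href")

print(len(nodes), titles, hrefs)
```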

Python data

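Length of a list: len(listname)
Create a dictionary: d = {} (naming the variable dict works, but it shadows the built-in type)
Add entries to a dictionary: d.update(other), or d[key] = value for a single entry
Iterate over a dictionary's keys: for k in d:
The str function converts a soup result to a string
Check that a tag's string is not empty (None): if p.string is not None: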
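
These operations fit together as follows. The titles and links are made-up sample data, and the variable names are only for illustration.

```python
titles = ['Chapter 1', 'Chapter 2']           # made-up data
links = ['/ch1.html', '/ch2.html']

chapters = {}                                 # an empty dictionary
for title, link in zip(titles, links):
    chapters[title] = link                    # add a single entry
chapters.update({'Chapter 3': '/ch3.html'})   # merge in another dictionary

for title in chapters:                        # iterates over the keys
    print(title, chapters[title])

maybe_text = None
if maybe_text is not None:                    # compare with None using "is"
    print(maybe_text)

print(len(titles), len(chapters))
```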

Code

Controller.py

from DataOutput import DataOutput
from HTMLParser import HTMLParser
from HTMLDownLoader import HTMLDownLoader

class Controller(object):
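    def crawl(self,url):
        # Wire the three stages together: download, parse, save
        downloader=HTMLDownLoader()
        htmlParser=HTMLParser()
        dataOutput=DataOutput()
        htmlContext=downloader.get(url)
        # chapters maps each chapter title to its URL
        chapters=htmlParser.parse(htmlContext)
        dataOutput.save(chapters)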

if __name__=='__main__':
    controller=Controller()
    controller.crawl('http://www.seputu.com')

HTMLDownLoader.py

from urllib import request
class HTMLDownLoader(object):
    def get(self,url):
        # read() returns the raw page as bytes, which etree.HTML accepts
        html=request.urlopen(url).read()
        return html

HTMLParser.py

from lxml import etree
class HTMLParser(object):
    def parse(self,html):
        h=etree.HTML(html)
        # Chapter titles and chapter links from the table of contents
        mulus=h.xpath(".//div[@class='mulu']/div[@class='box']/ul/li[*]/a[@href]/text()")
        hrefs=h.xpath(".//div[@class='mulu']/div[@class='box']/ul/li[*]/a[@href]/@href")
        # Pair each title with its link: {title: url}
        chapters=dict(zip(mulus,hrefs))
        for k in chapters:
            print(k+" "+chapters[k])
        return chapters

DataOutput.py

from urllib import request
from bs4 import BeautifulSoup
class DataOutput(object):
    def __init__(self):
        self.outfile=open("盗墓笔记.txt",'a',encoding='utf-8')
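    def save(self,chapters):
        # chapters maps each chapter title to its URL
        for k in chapters:
            text=request.urlopen(chapters[k]).read()
            soup=BeautifulSoup(text,'lxml')
            # select() returns a list; the first match is the chapter body
            body=soup.select('.content-body')
            bodyhtml=body[0]
            # from_encoding is dropped: it is ignored (with a warning) for str input
            bodysoup=BeautifulSoup(str(bodyhtml),'lxml')
            ps=bodysoup.find_all('p')
            self.outfile.write(k+'\n')
            print("writing %s" % k)
            for p in ps:
                # p.string is None when <p> has no single text child
                if p.string is not None:
                    self.outfile.write(p.string)
                    self.outfile.write('\n')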

        self.outfile.close()
