Basic Usage of the BeautifulSoup Library (Demo: Douban Top 250)

Installation and Documentation

pip install bs4 lxml requests

Basic Usage

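# Basic usage
from bs4 import BeautifulSoup

# A small HTML string to parse (placeholder for illustration)
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
# Create a BeautifulSoup object, using lxml as the parser
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())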

The Four Basic Object Types in BeautifulSoup
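The four object types are Tag, NavigableString, BeautifulSoup, and Comment. The sketch below shows one way to see each of them on a tiny document. The HTML string is made up for illustration, and html.parser is used here only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up HTML for illustration only
html = "<p class='title'><b>Hello</b><!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                  # Tag: an HTML element
text = soup.b.string          # NavigableString: the text inside a tag
comment = soup.p.contents[1]  # Comment: a special kind of NavigableString

# Print the class name of each object
print(type(soup).__name__)
print(type(tag).__name__, tag.name, tag["class"])
print(type(text).__name__, text)
print(type(comment).__name__, comment)
```

A Comment behaves like a string, so checking its type is useful when comments must be kept out of extracted text.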

Traversing the Document Tree

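1. contents and children: .contents returns a list of all direct child nodes; .children returns an iterator over the same direct child nodes.
2. strings and stripped_strings: if a tag contains more than one string, loop over .strings to get them all. The strings it yields may include a lot of spaces and blank lines; .stripped_strings strips that extra whitespace and skips strings that are whitespace only.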
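The attributes above can be compared side by side in a minimal sketch. The HTML fragment and variable names are made up for illustration, and html.parser is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up HTML for illustration only
html = """
<div id="box">
  <p>first</p>
  <p>second <b>bold</b></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", id="box")

# .contents is a list, so it can be indexed and measured with len()
child_list = box.contents

# .children is an iterator over the same nodes; whitespace between
# tags is also a child, so keep only the Tag objects here
tag_children = [c.name for c in box.children if isinstance(c, Tag)]

# .strings yields every string, including whitespace-only ones
raw = list(box.strings)

# .stripped_strings trims each string and skips the empty ones
clean = list(box.stripped_strings)

print(tag_children)
print(raw)
print(clean)
```

Both .contents and .children cover direct children only; the nested b tag shows up in .strings because that attribute walks all descendants.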

Searching the Document Tree
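1. The find and find_all methods: these are the two methods used most often to search the document tree. find returns as soon as it reaches the first tag that matches the conditions, so it returns a single element (or None if nothing matches). find_all collects every tag that matches the conditions and returns them all as a list.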
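A minimal sketch of the difference between the two methods follows. The HTML fragment and the example.com links are placeholders, and html.parser is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Placeholder HTML for illustration only
html = """
<ol class="grid_view">
  <li><a href="https://example.com/1">One</a></li>
  <li><a href="https://example.com/2">Two</a></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# find: only the first matching tag
first_link = soup.find("a")

# find_all: every matching tag, as a list-like result
all_links = soup.find_all("a")
hrefs = [a["href"] for a in all_links]

# When nothing matches, find gives None and find_all gives an empty list
missing = soup.find("table")
none_found = soup.find_all("table")

# limit stops the search after the given number of matches
limited = soup.find_all("a", limit=1)

print(first_link["href"])
print(hrefs)
```

Because find can return None, chaining calls such as soup.find(...).find_all(...) raises AttributeError when the first lookup fails, so it is worth checking the result or catching the exception.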

2. The select method: CSS selectors are sometimes more convenient. To search with CSS selector syntax, use the select method. Some commonly used selector forms:
(1) By tag name: print(soup.select('a'))
(2) By class name: print(soup.select('.sister'))
(3) By id: print(soup.select('#link1'))
(4) Combined lookup: print(soup.select('p #link1'))
(5) By attribute: print(soup.select('a[href="http://example.com/elsie"]'))
(6) Getting the text content: print(soup.select('title')[0].get_text())
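The selectors listed above can be tried together on a small document. This is a sketch only: the HTML is a shortened sample chosen to match the selectors in the list, and html.parser is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Shortened sample HTML, chosen to match the selectors used above
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("a")            # all <a> tags
by_class = soup.select(".sister")    # tags with class "sister"
by_id = soup.select("#link1")        # the tag with id "link1"
combined = soup.select("p #link1")   # id "link1" somewhere inside a <p>
by_attr = soup.select('a[href="http://example.com/elsie"]')
title_text = soup.select("title")[0].get_text()

print(len(by_tag), len(by_class))
print(by_id[0].get_text())
print(title_text)
```

select always returns a list, even for an id selector, which is why the last lookup indexes with [0] before calling get_text().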

Demo: scraping the Douban Top 250:

# @Time : 2019/11/18 10:00
# @Author : 大数据小J
# @File : Python_豆瓣250.py
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
}


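# Collect the detail-page URL of every movie listed on one page
def get_every_urls(url):
    # Send the browser-like headers defined above with the request
    response = requests.get(url, headers=headers)
    html = response.text
    # Parse the page with BeautifulSoup to extract the links
    soup = BeautifulSoup(html, 'lxml')
    # The page has only one <ol class="grid_view">, so find is enough
    items = soup.find('ol', class_="grid_view").find_all('li')
    # Avoid naming the result "list", which would shadow the built-in
    hrefs = []
    for item in items:
        # The first <a> in each <li> links to the movie's detail page
        href = item.find('a')['href']
        hrefs.append(href)
    return hrefs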


# Extract the details of a single movie page and write them as one line
def get_noter_list(href_url, f):
    try:
        # Send the browser-like headers defined above with the request
        response = requests.get(href_url, headers=headers)
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        # Movie title
        name = ''.join(soup.find('div', id='content').find('h1').stripped_strings)
        # Director
        info = soup.find('div', id='info')
        director = ''.join(info.find('span').find('span', class_='attrs').stripped_strings)
        # Screenwriter
        screenwrite = ''.join(info.find_all('span')[3].find('span', class_='attrs').stripped_strings)
        # Cast
        actor = ''.join(soup.find('span', class_='actor').find('span', class_='attrs').stripped_strings)
        # Rating
        score = soup.find('div', id='interest_sectl').find('strong', class_='ll rating_num').string
        f.write('{},{},{},{},{}\n'.format(name, director, screenwrite, actor, score))
    except (AttributeError, IndexError) as e:
        # A missing element shows up as None (AttributeError) or as a
        # too-short list (IndexError); report it and skip this movie
        print(e)

# Main function: walk through the 10 list pages, 25 movies per page
def main():
    # Open the output file once, in append mode
    with open('top250.csv', 'a', encoding='utf-8') as f:
        for i in range(1, 11):
            page = (i - 1) * 25
            url = 'https://movie.douban.com/top250?start={}&filter='.format(page)
            for href_url in get_every_urls(url):
                get_noter_list(href_url, f)


# Run the scraper when the file is executed as a script
if __name__ == '__main__':
    main()
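The script above joins fields with commas by hand, so a title or name that itself contains a comma would shift the columns. One possible alternative, shown here only as a sketch, is the standard csv module, which quotes such fields automatically. The write_rows helper and the sample row are hypothetical and not part of the original script.

```python
import csv
import io


# Hypothetical helper: write each row through csv.writer so that
# fields containing commas or quotes are escaped correctly
def write_rows(f, rows):
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)


# Placeholder data, not real scraped values
sample = [("Title, with a comma", "Director A", "Writer B", "Actor C / Actor D", "9.0")]

# An in-memory buffer stands in for the output file
buffer = io.StringIO()
write_rows(buffer, sample)
output = buffer.getvalue()

# Reading the text back recovers the five original fields
parsed = list(csv.reader(io.StringIO(output)))
print(output)
print(parsed)
```

When writing to a real file with csv.writer, the csv documentation recommends opening it with newline='' so that line endings are not translated twice.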
