Web Scraping Basics II: Saving Scraped Data to an HTML File

Copyright notice: This is an original post by the author, released under the CC 4.0 BY-SA license. Please include a link to the original and this notice when reposting.
Original link: https://blog.csdn.net/qq_18505209/article/details/99768069

Web Scraping Basics II (Continuation 1)

A link to the tutorial blog this post follows is included at the end, if you want to read more.

Saving scraped data to an HTML file

Python code:
import requests
from bs4 import BeautifulSoup
#1-1. Fetch the page and save it to a local file
#url = "https://movie.douban.com/cinema/later/chengdu/"
#response = requests.get(url)
#file_obj = open('douban.html','w',encoding="utf-8")
#file_obj.write(response.content.decode('utf-8'))
#file_obj.close()
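# (Side note: the open()/write()/close() pattern above can also be written as
#  "with open('douban.html', 'w', encoding='utf-8') as file_obj:", which closes
#  the file automatically; this post keeps the explicit close() for clarity.)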
#1-2. Read the page back from the local file
#file_obj = open('douban.html','r', encoding="utf-8")
#html = file_obj.read()
#file_obj.close()
#1-3. Initialize BeautifulSoup and parse the page
#soup = BeautifulSoup(html, 'lxml')
#print(soup.find)
#2. Fetch and parse the page directly
url = "https://movie.douban.com/cinema/later/chengdu/"
response = requests.get(url)
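# Note: if this request comes back empty or with an error status, Douban may be
# rejecting clients without a browser-like User-Agent; passing
# headers={'User-Agent': '...'} to requests.get() is a common workaround
# (see the sketch after the result screenshot below).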
soup = BeautifulSoup(response.content.decode('utf-8'), 'lxml')
#3. Locate the element that holds the data
all_movies = soup.find('div', id="showing-soon")
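# find() returns the first matching tag, or None if nothing matches; if the page
# layout ever changes, the find_all() call below would then raise AttributeError.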
#4. Print the useful fields
for each_movie in all_movies.find_all('div', class_="item"):
    #print(each_movie)
    all_a_tag = each_movie.find_all('a')
    all_li_tag = each_movie.find_all('li')
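    # Layout assumption (true at the time of writing): the second <a> tag wraps the
    # movie title and links to its detail page, and the four <li> tags are, in order,
    # release date, genre, region, and the number of people who want to watch it.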
    movie_name = all_a_tag[1].text
    movie_href = all_a_tag[1]['href']
    movie_date = all_li_tag[0].text
    movie_type = all_li_tag[1].text
    movie_area = all_li_tag[2].text
    movie_lovers = all_li_tag[3].text
    print('Movie: {}, Link: {}, Release date: {}, Genre: {}, Region: {}, Want to watch: {}'.format(
        movie_name, movie_href, movie_date, movie_type, movie_area, movie_lovers))
#5. Save the extracted fields to an HTML file
# In Python, text wrapped in three double quotes is treated as one single string, which saves you from dealing with newline characters.
# The .format() method replaces each {} placeholder in the string, in order, with the arguments passed to format().
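# A quick illustration of both points (the values here are made up):
#   "{} opens on {}".format("Dune", "2021-10-22")  ->  'Dune opens on 2021-10-22'
#   and a triple-quoted string such as """<td>{}</td>
#   <td>{}</td>""" spans two lines without any explicit \n.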
file_obj = open('data.html', 'w', encoding="utf-8")
file_obj.write("""
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Douban: Movies Coming Soon</title>
    <link href="https://cdn.bootcss.com/bootstrap/4.0.0/css/bootstrap.min.css" rel="stylesheet">
</head>
<body>
<h2 class="text-center">Douban: Movies Coming Soon</h2>
<table class="table table-striped table-hover mx-auto text-center">
    <thead>
        <tr>
            <th>Movie</th>
            <th>Release date</th>
            <th>Genre</th>
            <th>Region</th>
            <th>Followers</th>
        </tr>
    </thead>
    <tbody>
""")
for each_movie in all_movies.find_all('div', class_="item"):
    #print(each_movie)
    all_a_tag = each_movie.find_all('a')
    all_li_tag = each_movie.find_all('li')
    movie_name = all_a_tag[1].text
    movie_href = all_a_tag[1]['href']
    movie_date = all_li_tag[0].text
    movie_type = all_li_tag[1].text
    movie_area = all_li_tag[2].text
    movie_lovers = all_li_tag[3].text
    #print('Movie: {}, Link: {}, Release date: {}, Genre: {}, Region: {}, Want to watch: {}'.format(
    #    movie_name, movie_href, movie_date, movie_type, movie_area, movie_lovers))
    file_obj.write("""
        <tr>
            <td><a href="{}">{}</a></td>
            <td>{}</td>
            <td>{}</td>
            <td>{}</td>
            <td>{}</td>
        </tr>
    """.format(movie_href, movie_name, movie_date, movie_type, movie_area, movie_lovers))
file_obj.write("""
    </tbody>
</table>
</body>
</html>
    """)
file_obj.close()
print("finshed")
Result:

[Figure: the generated data.html opened in a browser]
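As an optional variation (not part of the original tutorial), the same fetch, parse, and write flow can be sketched with with-statements so the files close themselves, plus a browser-style User-Agent header in case bare requests get rejected. The header value and the data_v2.html filename below are placeholder assumptions, not something the tutorial specifies:

import requests
from bs4 import BeautifulSoup

url = "https://movie.douban.com/cinema/later/chengdu/"
# Placeholder User-Agent; any common browser string should do.
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
response.raise_for_status()  # fail fast if the request was rejected

soup = BeautifulSoup(response.content.decode('utf-8'), 'lxml')
all_movies = soup.find('div', id="showing-soon")

rows = []
for each_movie in all_movies.find_all('div', class_="item"):
    a_tags = each_movie.find_all('a')
    li_tags = each_movie.find_all('li')
    rows.append("""
        <tr>
            <td><a href="{}">{}</a></td>
            <td>{}</td><td>{}</td><td>{}</td><td>{}</td>
        </tr>""".format(a_tags[1]['href'], a_tags[1].text, li_tags[0].text,
                        li_tags[1].text, li_tags[2].text, li_tags[3].text))

# The with-statement closes the file automatically, even if an exception occurs.
with open('data_v2.html', 'w', encoding="utf-8") as file_obj:
    file_obj.write("<!DOCTYPE html><html><head><meta charset='UTF-8'>"
                   "<title>Douban: Movies Coming Soon</title></head><body><table>")
    file_obj.write("".join(rows))
    file_obj.write("</table></body></html>")
print("finished")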

Learning link:
Web Scraping Tutorial ⑨ — Saving scraped data with html and csv files.
