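Python web scraping: BeautifulSoup vs. re (a hands-on comparison)

This post scrapes the entries on a love-song chart twice, once with BeautifulSoup and once with regular expressions.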

Environment:

Anaconda 4.2.0 (64-bit)
Python 3.5.2

from bs4 import BeautifulSoup
import requests
import re
import time

1. The BeautifulSoup approach. Note how find() and find_all() are used.

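url = 'http://www.yy8844.cn/paihangbang/songlist_25.shtml'
html = requests.get(url)
html.encoding = 'gb2312'  # decode the response body as GB2312
bs = BeautifulSoup(html.text, 'html.parser')
# find() returns only the first matching tag: the <dl> that holds the chart
sound_div = bs.find('dl', attrs={"id": "playList"})
# find_all() returns every <span> whose class is song_name or singer, in page order
sound_a = sound_div.find_all('span', attrs={"class": ["song_name", "singer"]})
# Open the output file once instead of reopening it for every entry
with open('D:\\***\\***.txt', 'a', encoding='utf-8') as f:
    for n, sound in enumerate(sound_a, start=1):
        text = sound.get_text(strip=True)  # works even where .string would be None
        f.write(str(n) + '、' + text + '\n\n')
        print(text)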
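
The difference between find() and find_all() is easier to see on a small inline document than on the live page. The sketch below uses a hypothetical HTML sample (SAMPLE_HTML, with made-up song and singer names) modeled on the tags and classes the code above selects; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the selectors used above; not copied from the real page.
SAMPLE_HTML = """
<dl id="playList">
  <dd>
    <span class="song_name"><a href="#" title="Song A" target="_blank">Song A</a></span>
    <span class="singer"><a href="#" class="sing" title="Singer A" target="_blank">Singer A</a></span>
  </dd>
  <dd>
    <span class="song_name"><a href="#" title="Song B" target="_blank">Song B</a></span>
    <span class="singer"><a href="#" class="sing" title="Singer B" target="_blank">Singer B</a></span>
  </dd>
</dl>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching Tag, or None when nothing matches.
play_list = soup.find("dl", attrs={"id": "playList"})
missing = soup.find("dl", attrs={"id": "no_such_id"})

# find_all() returns a list-like ResultSet with every match (empty if there are none).
song_spans = play_list.find_all("span", attrs={"class": "song_name"})
singer_spans = play_list.find_all("span", attrs={"class": "singer"})

songs = [span.get_text(strip=True) for span in song_spans]
singers = [span.get_text(strip=True) for span in singer_spans]

# Pair each song with its singer by position.
pairs = list(zip(songs, singers))
for number, (song, singer) in enumerate(pairs, start=1):
    print(f"{number}. {song} - {singer}")
```

Because find() can return None, it is worth checking its result before calling find_all() on it when the page layout is not guaranteed.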

2. The regular-expression approach. Note the pattern itself and how re.findall is used.

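url = 'http://www.yy8844.cn/paihangbang/songlist_25.shtml'
response = requests.get(url)
response.encoding = 'gb2312'  # decode the response body as GB2312
html = response.text
# Two alternatives with one capture group each: group 1 is the song title, group 2 the singer
pattern = (r'<span class="song_name"><a href=".*?" title=".*?" target=".*?">(.*?)</a></span>'
           r'|<span class="singer"><a href=".*?" class="sing" title=".*?"  target=".*?">(.*?)</a></span>')
# re.findall returns one (song, singer) tuple per match; the group that did not match is ''
url_lists = re.findall(pattern, html)
# Open the output file once instead of reopening it for every entry
with open('D:\\***\\***.txt', 'a', encoding='utf-8') as f:
    for n, (song, singer) in enumerate(url_lists, start=1):
        # Only one of the two groups is filled per match, so concatenation yields the matched text
        f.write(str(n) + '、' + song + singer + '\n\n')
        print(song + singer)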
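
The shape of what re.findall returns for a pattern with two alternatives is worth seeing in isolation. The sketch below runs the same pattern against a hypothetical sample string (SAMPLE_HTML, with made-up names, written to match the attribute order and spacing the pattern expects), then pairs songs with singers. It uses only the standard library.

```python
import re

# Hypothetical markup written to fit the pattern, including the double space before target=.
SAMPLE_HTML = "\n".join([
    '<span class="song_name"><a href="#" title="Song A" target="_blank">Song A</a></span>',
    '<span class="singer"><a href="#" class="sing" title="Singer A"  target="_blank">Singer A</a></span>',
    '<span class="song_name"><a href="#" title="Song B" target="_blank">Song B</a></span>',
    '<span class="singer"><a href="#" class="sing" title="Singer B"  target="_blank">Singer B</a></span>',
])

PATTERN = re.compile(
    r'<span class="song_name"><a href=".*?" title=".*?" target=".*?">(.*?)</a></span>'
    r'|<span class="singer"><a href=".*?" class="sing" title=".*?"  target=".*?">(.*?)</a></span>'
)

# With two capture groups, findall returns a list of 2-tuples.
# The group belonging to the alternative that did not match is an empty string.
matches = PATTERN.findall(SAMPLE_HTML)

# Split the tuples back into two columns and pair them by position.
songs = [song for song, _ in matches if song]
singers = [singer for _, singer in matches if singer]
pairs = list(zip(songs, singers))

for number, (song, singer) in enumerate(pairs, start=1):
    print(f"{number}. {song} - {singer}")
```

The pattern depends on exact attribute order and whitespace, so a small change in the page's markup makes it match nothing, whereas the tag-and-class lookup in the BeautifulSoup version tolerates such changes.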


 


Reposted from blog.csdn.net/shangxiaqiusuo1/article/details/80992412