This crawler follows an online tutorial, but it turned out a bit different. I have never studied non-relational databases, so unlike the instructor I don't use MongoDB. Also, when the tutorial was recorded, Toutiao had no anti-scraping measures, but by the time I followed along it did. (Maybe too many learners overloaded Toutiao, haha.) So I modified the request headers; without them the response contains only the html, head, and body tags and nothing else.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome"
}
response = requests.get(url, headers=headers)
Toutiao's gallery page now uses the JSON.parse method, so the regular expression has to change as well. On top of that, the extracted content gains extra backslashes that we have to strip out, which is annoying because \ also acts as an escape character inside strings. Fortunately I found a fairly good solution online:
temp = result.group(1)
newStr = eval(repr(temp).replace('\\', ''))
eval() executes a string as a Python expression and returns its value.
repr() converts an object into a form readable by the interpreter.
Honestly, I'm still not entirely sure how repr(...).replace() differs in principle from str.replace(), but in practice the repr route seems to work a bit better.
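To see what that repr()/eval() round trip actually does, here is a small standalone sketch. The sample string is made up for illustration, not real Toutiao data:

```python
import json

# A made-up sample in the doubly-escaped form the gallery string arrives in:
temp = '{\\"sub_images\\": [{\\"url\\": \\"http:\\/\\/example.com\\/a.jpg\\"}]}'

# repr() renders every backslash in temp as the two characters \\,
# so replace('\\', '') deletes all backslashes from the escaped literal,
# and eval() turns the cleaned literal back into an ordinary string.
# (eval() on scraped data is risky in general -- only use it on trusted input.)
newStr = eval(repr(temp).replace('\\', ''))
data = json.loads(newStr)
print(data['sub_images'][0]['url'])  # http://example.com/a.jpg
```

For a plain backslash strip like this one, temp.replace('\\', '') produces the same cleaned string directly; the repr()/eval() route simply performs the stripping on the escaped literal form instead of on the raw characters.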
To avoid errors, my code has quite a few ifs that check for empty values, and it only crawls the first 7 pages of galleries (already 900+ images). After all that studying, plus plenty of Baidu searching, Toutiao is finally conquered! Here is my code (tested working on 2018-08-08):
import requests
import json
import re
import os
from hashlib import md5
from bs4 import BeautifulSoup
from urllib.parse import urlencode  # urlencode turns a dict into URL query parameters
from requests.exceptions import RequestException
from multiprocessing.dummy import Pool as ThreadPool
def get_page_index(offset, keyword):
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': 1,
        'from': 'search_tab'
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting index page')
        return None
def parse_page_index(html):
    try:
        data = json.loads(html)
        if data and 'data' in data.keys():  # keys() lists the JSON keys; `in` binds tighter than `and`
            for item in data.get('data'):
                yield item.get('article_url')
    except (TypeError, json.JSONDecodeError):  # narrower than a bare except: html may be None or invalid JSON
        pass
def get_page_detail(url):
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome"
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting detail page', url)
        return None
def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    title = None
    if soup.select('title'):
        title = soup.select('title')[0].get_text()
    else:
        return
    print(title)
    images_pattern = re.compile(r'gallery: JSON.parse\("(.*?)"\),', re.S)
    result = re.search(images_pattern, html)
    if result:
        temp = result.group(1)
        newStr = eval(repr(temp).replace('\\', ''))
        data = json.loads(newStr)
        if data and 'sub_images' in data.keys():  # `if data:` also guards against an empty dict
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_image(image)
            return {
                'title': title,
                'url': url,
                'images': images
            }
def download_image(url):
    print('Downloading', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content)
        return None
    except RequestException:
        print('Error requesting image')
        return None
def save_image(content):
    os.makedirs('D:\\TmpPicts', exist_ok=True)  # make sure the target folder exists
    # The md5 of the image bytes doubles as a deduplicating file name
    file_path = '{0}/{1}.{2}'.format('D:\\TmpPicts', md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)
def main(offset):
    html = get_page_index(offset, '街拍')
    for url in parse_page_index(html):
        if url:
            html = get_page_detail(url)
            if html:
                result = parse_page_detail(html, url)
if __name__ == '__main__':
    group = [x * 20 for x in range(7)]  # offsets 0, 20, ..., 120 -- the first 7 pages
    pool = ThreadPool(4)
    pool.map(main, group)
    pool.close()
    pool.join()
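As an aside on the urlencode step in get_page_index, here is a tiny self-contained illustration (the parameter values are just a subset of the ones used above):

```python
from urllib.parse import urlencode

# urlencode turns a dict into a percent-encoded query string;
# non-ASCII values such as '街拍' are UTF-8 encoded automatically.
data = {'offset': 20, 'format': 'json', 'keyword': '街拍', 'count': '20'}
url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
print(url)
# https://www.toutiao.com/search_content/?offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&count=20
```

This is why the code can keep the query parameters in an ordinary dict and build the index-page URL with plain string concatenation.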