Version-2.0
Updated a new version in my spare time; the changes:
1. Fetch a user's articles across multiple pages
2. Randomized click order
3. Randomized timing between clicks
4. Randomized selection of which articles get clicked
5. Multiple rounds of clicks
Shortcomings
1. User-Agent is not set yet
2. No pool of proxy IPs
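Both gaps map onto options that `requests.get` already accepts (`headers` and `proxies`). A minimal sketch of how a later version might plug them in — the UA strings and the proxy address below are illustrative placeholders, not tested values:

```python
import random
import requests

# Small pool of desktop User-Agent strings (illustrative examples)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

def build_headers():
    # Rotate the UA on every request so the traffic looks less uniform
    return {'User-Agent': random.choice(USER_AGENTS)}

def build_proxies(proxy=None):
    # requests expects a scheme -> proxy mapping; None disables proxying
    return {'http': proxy, 'https': proxy} if proxy else None

def disguised_get(url, proxy=None):
    # e.g. proxy='http://127.0.0.1:8080' (hypothetical address)
    return requests.get(url, headers=build_headers(),
                        proxies=build_proxies(proxy), timeout=10)
```

With this in place, each bare `requests.get(url)` call in `click_article_url` would become `disguised_get(url, proxy)`.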
import requests
from bs4 import BeautifulSoup
import time
import random
# Fetch a user's articles across multiple pages
def get_writer_article_list(base_url, page_num):
    all_article_list = []
    for i in range(page_num):
        index = i + 1
        print('cur index is ' + str(index))
        cur_page_url = base_url + str(index)
        all_article_list = get_article_list(cur_page_url) + all_article_list
    return all_article_list
# Get the URL of every article on a single page
def get_article_list(base_url):
    web_data = requests.get(base_url)
    soup = BeautifulSoup(web_data.text, 'lxml')
    divs = soup.find_all('div', class_='article-item-box csdn-tracking-statistics')
    url_list = []
    for div in divs:
        label = div.find_all('a')
        url = label[0].get('href')
        url_list.append(url)
    return url_list
# Build a randomized URL list for each round of clicks
def click_random(url_list, min_random_rate):
    new_url_list = []
    max_url_count = len(url_list)
    min_url_count = int(max_url_count * min_random_rate)
    term_url_count = random.randint(min_url_count, max_url_count)
    for i in range(term_url_count):
        random_index = random.randint(0, max_url_count - 1)
        new_url_list.append(url_list[random_index])
    return new_url_list
# Run multiple rounds of clicks
def click_article_url(term_num, click_random_start, click_random_end, term_random_start, term_random_end, all_list):
    for i in range(term_num):
        term_url_list = click_random(all_list, 0.7)
        for url in term_url_list:
            requests.get(url)
            print('click for ' + url)
            click_sleep_time = random.randint(click_random_start, click_random_end)
            time.sleep(click_sleep_time)
            print('sleep for ' + str(click_sleep_time))
        cur_term = i + 1  # avoid shadowing the term_num parameter
        print('finish the term of ' + str(cur_term))
        term_sleep_time = random.randint(term_random_start, term_random_end)
        time.sleep(term_sleep_time)
        print('sleep for the term ' + str(term_sleep_time))
base_url1 = "https://blog.csdn.net/xxx1/article/list/"
base_url2 = "https://blog.csdn.net/xxx2/article/list/"
url_list_1 = get_writer_article_list(base_url1,2)
url_list_2 = get_writer_article_list(base_url2,2)
all_list = url_list_1 + url_list_2
click_article_url(200,8,50,30,60,all_list)
Version-1.0
Over a lunch break I wrote a little Python toy that inflates the view counts of blog articles. It is still immature, and I will iterate on it soon.
The overall approach so far: fetch the user's article list page, parse the HTML, extract the URL of every article on that page into a list, then loop over the URLs in the list and request each one to register a click.
import requests
from bs4 import BeautifulSoup
import time
# Crawl the blog of my follower 周英俊
base_url = "https://blog.csdn.net/qq_38835878/article/list/1"
web_data = requests.get(base_url)
soup = BeautifulSoup(web_data.text,'lxml')
divs = soup.find_all('div', class_='article-item-box csdn-tracking-statistics')
url_list = []
for div in divs:
    label = div.find_all('a')
    url = label[0].get('href')
    url_list.append(url)
for url in url_list:
    requests.get(url)
    print('request for ' + url)
    time.sleep(61)
    print('sleep for 61s')
Shortcomings
- No pagination support
- No proxy IPs
- No User-Agent set
- Behavior is highly regular: session lengths have no variance
- The click pattern is obvious: all articles are visited in order.