2019-12-10 Web Scraping 9: Bypassing Selenium Detection

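I tried to scrape the Boss Zhipin site at https://www.zhipin.com/job_detail/?query=js&city=101020100&industry=&position= but got no data. The title of the returned HTML is "请稍后" ("Please wait"), and the page only shows "正在加载中…" ("Loading…").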

On inspection, it turns out the site validates a cookie, and the key field is __zp_stoken__.
The whole process can be traced in the browser's developer tools.

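The screenshot below was added later and was captured in Chrome.
[Image]
The process described here matches the screenshot; only the parameter values differ (because the screenshot was added later). The idea is what matters.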

After entering https://www.zhipin.com/job_detail/?query=js&city=101020100&industry=&position= in the address bar, the response turns out to be https://www.zhipin.com/web/common/security-check.html?seed=/VA4nRzK5Ai5Witwwefvzvj4WXlCwIK7isBb/pi+oPQ=&name=af570efa&ts=1575981514326&callbackUrl=/c101210100/?query=js&city=101020100&industry=&position=. In other words, the request is redirected to security-check.html, with several parameters attached: seed, name, and ts.
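To make the redirect parameters concrete, here is a minimal Python sketch that pulls seed, name, and ts out of the redirect URL quoted above. The helper name get_query_string echoes the getQueryString function used in the page's own script, but its implementation here is an assumption, not the site's code. It uses unquote rather than parse_qs so that the + inside the seed stays a literal +, which is how decodeURIComponent treats it.

```python
from urllib.parse import urlsplit, unquote

# Redirect URL copied from the text above.
REDIRECT_URL = (
    "https://www.zhipin.com/web/common/security-check.html"
    "?seed=/VA4nRzK5Ai5Witwwefvzvj4WXlCwIK7isBb/pi+oPQ="
    "&name=af570efa&ts=1575981514326"
    "&callbackUrl=/c101210100/?query=js&city=101020100&industry=&position="
)


def get_query_string(url, name):
    """Return the first value of query parameter `name`, percent-decoded, or None."""
    query = urlsplit(url).query
    for pair in query.split("&"):
        key, _, value = pair.partition("=")
        if key == name:
            return unquote(value)
    return None


params = {key: get_query_string(REDIRECT_URL, key) for key in ("seed", "name", "ts")}
print(params)
```

The name value is what selects the dynamically named script, so a scraper that only follows redirects without running JavaScript stops at this page.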

Now look at the source of security-check.html. It is exactly the HTML page that my scraper got back.

<title>请稍后</title>
......
<p class="gray">正在加载中...</p>

Further down:

                    var url = window.location.href;
                    var seed = decodeURIComponent(getQueryString("seed")) || "";
                    var ts = getQueryString("ts");
                    var fileName = getQueryString("name");
                    var callbackUrl = decodeURIComponent(getQueryString("callbackUrl"));
                    var srcReferer = decodeURIComponent(getQueryString("srcReferer")||'');
                    
                    if (seed && ts && fileName) {
                        seriesLoadScripts("security-js/" + fileName + ".js", function() {
                            var expiredate = new Date().getTime() + 32 * 60 * 60 * 1000 * 2;
                            var code = "";
                            var nativeParams = {};
                            var ABC = window.ABC || frame.contentWindow.ABC;
                            try {
                                code = new ABC().z(seed, parseInt(ts)+(480+new Date().getTimezoneOffset())*60*1000);
                            } catch (e) {}
                            if (code && callbackUrl) {
                                Cookie.set("__zp_stoken__", code, expiredate, COOKIE_DOMAIN, "/");
                                // Reportedly the iOS client sometimes fails to write the cookie, so call the method the client provides and let the client write the cookie a second time
                                if (typeof window.wst != "undefined" && typeof wst.postMessage == "function") {
                                    nativeParams = {
                                        name: "setWKCookie",
                                        params: {
                                            url: COOKIE_DOMAIN,
                                            name: "__zp_stoken__",
                                            value: encodeURIComponent(code),
                                            expiredate: expiredate,
                                            path: "/"
                                        }
                                    };
                                    window.wst.postMessage(JSON.stringify(nativeParams));
                                }

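The seed, name, and ts parameters we just saw all appear here.
A JS file also shows up, "security-js/" + fileName + ".js". Its name is dynamic; the redirect URL above already carries name=af570efa.
Last and most important, the code sets a cookie, __zp_stoken__, that expires after 64 hours (32 * 60 * 60 * 1000 * 2 ms). Its value is produced by the dynamically named JS file above.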
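The arithmetic in that script is easy to check by hand. The sketch below redoes it in Python; the names adjusted_ts, EXPIRY_MS, and EXPIRY_HOURS are assumptions made for illustration. In JavaScript, getTimezoneOffset() returns UTC minus local time in minutes, so it is -480 for UTC+8, which makes the correction term zero for a browser running in Beijing time.

```python
# Cookie lifetime from the page script: 32 * 60 * 60 * 1000 * 2 milliseconds.
EXPIRY_MS = 32 * 60 * 60 * 1000 * 2
EXPIRY_HOURS = EXPIRY_MS // (60 * 60 * 1000)


def adjusted_ts(ts, timezone_offset_minutes):
    """Mirror parseInt(ts) + (480 + getTimezoneOffset()) * 60 * 1000."""
    return int(ts) + (480 + timezone_offset_minutes) * 60 * 1000


print(EXPIRY_HOURS)                        # 64
print(adjusted_ts("1575981514326", -480))  # 1575981514326 (UTC+8, no shift)
print(adjusted_ts("1575981514326", 0))     # 1576010314326 (UTC, shifted by 480 minutes)
```

This only covers the inputs to ABC().z(...). How that function turns them into the cookie value is hidden inside the dynamically named script.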

Next, look at af570efa.js. It appears to be obfuscated. I could not work out what it does, nor how __zp_stoken__ is generated.

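Opening the link directly in a browser does display the search results properly.
So the plan now is to drive a real browser session with Selenium, obtain the cookie, and then scrape.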

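However, driving the browser with Selenium from my program still failed: the page would not open.
A search online suggested that the site's anti-scraping mechanism was probably detecting Selenium. I could not find the relevant JS and guess that it is obfuscated (inside that af570efa.js file).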

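For Firefox and Chrome, the main property that anti-scraping checks look at is window.navigator.webdriver.
When the browser is opened directly, this property is not set. When it is opened through Selenium, the property is true. The reason it is true: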

The navigator.webdriver property is true when in:
Chrome
The --enable-automation or the --headless flag is used.
Firefox
The marionette.enabled preference or --marionette flag is passed.

Quoted from https://developer.mozilla.org/zh-CN/docs/Web/API/Navigator/webdriver
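The same flag can be read from inside a Selenium session, which is a quick way to see what the site's script would see. A minimal sketch, assuming a driver object that has already been created elsewhere; the helper name is_flagged_as_automated is hypothetical.

```python
def is_flagged_as_automated(driver):
    """Return True if the current page sees navigator.webdriver === true."""
    # execute_script runs the snippet in the page and returns its result;
    # a JavaScript undefined comes back to Python as None.
    return driver.execute_script("return navigator.webdriver") is True
```

Calling it right after driver.get(...) shows whether a given set of browser options actually removed the flag before any scraping is attempted.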

Searching further, the only workarounds I found were for Selenium-driven Chrome. I was driving Firefox and found no equivalent for it.

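So, reluctantly, I downloaded Chrome.
Then download the matching version of chromedriver from http://npm.taobao.org/mirrors/chromedriver/. Note that the directory containing chromedriver must be on the PATH environment variable. For the related setup, see https://blog.csdn.net/weixin_42555985/article/details/103047764.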
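Whether chromedriver is actually reachable through PATH can be checked from Python before starting Selenium. This is a small sketch using the standard library's shutil.which; the helper name find_on_path is an assumption for illustration.

```python
import shutil


def find_on_path(executable="chromedriver"):
    """Return the full path of `executable` if it is on PATH, else None."""
    return shutil.which(executable)


location = find_on_path()
print(location or "chromedriver is not on PATH")
```

On Windows, shutil.which also tries the extensions listed in PATHEXT, so looking up "chromedriver" finds chromedriver.exe.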

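The rest is simple: add the code below.
I took a shortcut and put chromedriver.exe straight into the Firefox directory, because that directory is already on PATH.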

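import time

from selenium import webdriver

# Raw string so the backslashes in the Windows path are not treated as escapes
chromedriver_path = r'C:\Program Files\Mozilla Firefox\chromedriver.exe'
options = webdriver.ChromeOptions()
# Drop the --enable-automation switch, which is what sets navigator.webdriver to true
options.add_experimental_option('excludeSwitches', ['enable-automation'])
# executable_path is the Selenium 3 way of pointing at the driver binary
driver = webdriver.Chrome(executable_path=chromedriver_path, options=options)
driver.maximize_window()
url = "https://www.zhipin.com/job_detail/?query=js&city=101020100&industry=&position="

driver.get(url)
time.sleep(5)  # give security-check.html time to run and set the cookie

# Get the cookie information
cookie_list = driver.get_cookies()
print(cookie_list[-1])  # __zp_stoken__ (the last cookie in the list)
driver.quit()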

With this, the link finally opened and I got the cookies, in particular the crucial __zp_stoken__.
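Once Selenium has produced the cookies, they can be handed to an ordinary HTTP client for the actual scraping. get_cookies() returns a list of dicts with name and value keys; the sketch below turns such a list into a Cookie header using only the standard library. The helper names and the sample values are placeholders, and whether the site accepts a cookie replayed from a different client is an open assumption.

```python
import urllib.request


def cookies_to_header(cookie_list):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookie_list)


def build_request(url, cookie_list, user_agent="Mozilla/5.0"):
    """Build (but do not send) a request carrying the browser's cookies."""
    return urllib.request.Request(
        url,
        headers={"Cookie": cookies_to_header(cookie_list), "User-Agent": user_agent},
    )


# Placeholder data in the shape that driver.get_cookies() returns.
sample = [
    {"name": "example", "value": "1"},
    {"name": "__zp_stoken__", "value": "PLACEHOLDER"},
]
request = build_request("https://www.zhipin.com/job_detail/?query=js", sample)
print(request.get_header("Cookie"))
```

Building the header from the cookie names, rather than relying on the position of __zp_stoken__ in the list, keeps the code working if the site sets its cookies in a different order.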

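[Addendum 1]
One day I suddenly found that Chrome was being blocked as well. Checking window.navigator.webdriver in the console returned true, which explained it.
My Chrome version turned out to be the latest, 79.
I downloaded an older version, 78.0.3904.70, reinstalled it, and things worked again.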
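A commonly cited alternative to downgrading is to override the property through the Chrome DevTools Protocol before any page script runs. This is not the approach taken in this post and has not been verified against this site; it is only a sketch under the assumption that the Chrome driver in use exposes execute_cdp_cmd. The names HIDE_WEBDRIVER_JS and hide_webdriver_flag are hypothetical.

```python
# JavaScript that redefines navigator.webdriver so that reading it yields undefined.
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)


def hide_webdriver_flag(driver):
    """Ask Chrome to run HIDE_WEBDRIVER_JS at the start of every new document."""
    # execute_cdp_cmd sends a Chrome DevTools Protocol command (Chrome driver only).
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": HIDE_WEBDRIVER_JS},
    )
```

It would be called once after creating the driver and before driver.get(url), since the injected script only applies to documents loaded afterwards.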

[Addendum 2: needs further study]
According to https://developer.mozilla.org/zh-CN/docs/Web/API/Navigator/webdriver,
to stop window.navigator.webdriver from being set in Firefox, marionette.enabled has to be set to false.
I found a partial implementation:

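from selenium import webdriver

# Copy the default Firefox capabilities so the shared dict is not modified
caps = webdriver.DesiredCapabilities.FIREFOX.copy()
caps["marionette"] = False  # turn Marionette off
driver = webdriver.Firefox(capabilities=caps)
driver.get('https://www.baidu.com')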

The code above does stop window.navigator.webdriver from being set, but it raises the error below, after which Selenium cannot control Firefox.

in _wait_until_connectable    "The browser appears to have exited "
selenium.common.exceptions.WebDriverException: Message: The browser appears to have exited before we could connect. If you specified a log_file in the FirefoxBinary constructor, check it for details.

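[Addendum 3]
To check whether the webdriver property is in effect, in other words whether the browser is being controlled by a program through Selenium, test with the following page: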
https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html

Author: 没人不认识我
