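ECUST Academic Affairs Office CAPTCHA hack

Preface

The site used to have no CAPTCHA, and writing crawlers for it was a pleasure. This year a CAPTCHA was added, which killed my earlier crawlers at the root.
I had tried a few times before and figured that, if all else failed, I would crack it with machine learning (not that I know how).
Then one day it suddenly clicked.
No machine learning! Fewer than 20 lines of code! A success rate above 50%!
(You can also save the cookies and, until they expire, send them along to access the resources you need directly.)
The code is as follows: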

import os

import pytesseract
import requests
from PIL import Image


def solve(filename):
    # Download the CAPTCHA image and save it as <filename>.gif
    gif_url = 'http://inquiry.ecust.edu.cn/ecustedu/Base/VerifyCode.aspx'
    gif = requests.get(gif_url)
    with open('./{}.gif'.format(filename), 'wb') as file:
        file.write(gif.content)
    s = filename
    # ImageMagick: GIF -> JPG, then grayscale, then enlarge 5x
    os.system("magick.exe convert ./{}.gif ./{}.jpg".format(s, s))
    os.system("magick.exe convert -colorspace Gray ./{}.jpg ./{}.jpg".format(s, s))
    os.system("magick.exe convert -resize 500% ./{}.jpg ./{}-gray.jpg".format(s, s))
    img = Image.open("./{}-gray.jpg".format(s))
    # Restrict the character set and use page segmentation mode 9.
    # Note: newer Tesseract releases spell this flag --psm instead of -psm.
    ans = pytesseract.image_to_string(img, config='-c tessedit_char_whitelist=BDFHJLNPRTVXZ02468 -psm 9').strip()
    return ans


if __name__ == '__main__':
    print(solve('test'))

Line by line

import

tesseract

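The pytesseract library is a wrapper around tesseract.
So the key piece is tesseract.
What is tesseract?
It is an open-source OCR engine, originally developed at HP and later sponsored by Google.
Some might ask: can OCR really read a CAPTCHA?
Of course, because the school's CAPTCHA looks like this:
[Image: a sample CAPTCHA from the site]
"Neat and regular" is the only way to describe it.
So the first step to running this code is to install tesseract and add it to the PATH environment variable.
For installation instructions, see the wiki on GitHub.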
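
Before running the script, it can help to confirm that the external tools it relies on are reachable from PATH. The following is a minimal sketch using only the standard library. The helper name missing_tools is made up for illustration, and the executable names tesseract and magick are assumptions that may differ on your machine.

```python
import shutil


def missing_tools(names):
    """Return the executable names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == '__main__':
    # Assumed executable names; adjust them if your installation differs.
    missing = missing_tools(['tesseract', 'magick'])
    if missing:
        print('Not found on PATH:', ', '.join(missing))
    else:
        print('All tools found.')
```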

PIL

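PIL, or rather Pillow, is the well-known Python image processing library.
None of its advanced features are used here, though. Its only job is to create an img object that is passed to pytesseract as an argument.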

OS

os is only used here to run shell commands.

requests

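The well-known HTTP library. Here it is used to download the CAPTCHA image, a .gif file.
On Linux you could also use wget, but Windows does not ship with it, so requests is a good choice. requests can also keep a session and its cookies, which matters a lot, although this demo does not use that.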
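
Since the demo leaves out the session and cookie part, here is a rough sketch of what it could look like. It assumes the CAPTCHA is tied to a cookie set when the image is fetched, which the post implies but does not spell out. The helper names (save_cookies, load_cookies, fetch_captcha) and the JSON file format are my own choices for illustration, not part of the original script.

```python
import json

import requests

CAPTCHA_URL = 'http://inquiry.ecust.edu.cn/ecustedu/Base/VerifyCode.aspx'


def save_cookies(session, path):
    """Write the session's cookies to disk as a name -> value mapping."""
    # Note: a plain dict drops domain/path information, which is fine for one site.
    with open(path, 'w') as f:
        json.dump(requests.utils.dict_from_cookiejar(session.cookies), f)


def load_cookies(session, path):
    """Load cookies written by save_cookies() into the session."""
    with open(path) as f:
        session.cookies.update(json.load(f))


def fetch_captcha(session, filename):
    """Download the CAPTCHA through the session so any cookie it sets is kept."""
    resp = session.get(CAPTCHA_URL, timeout=10)
    resp.raise_for_status()
    with open('./{}.gif'.format(filename), 'wb') as f:
        f.write(resp.content)
```

The idea is to create one requests.Session(), fetch the CAPTCHA and log in through it, then call save_cookies() so that later runs can call load_cookies() and skip the CAPTCHA until the cookies expire.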

Saving the image

That covers the imported libraries. Now on to the code itself.

def solve(filename):
    gif_url = 'http://inquiry.ecust.edu.cn/ecustedu/Base/VerifyCode.aspx'
    gif = requests.get(gif_url)
    with open('./{}.gif'.format(filename), 'wb') as file:
        file.write(gif.content)

These lines download the CAPTCHA image and save it as filename.gif.

Core step 1

    os.system("magick.exe convert ./{}.gif ./{}.jpg".format(s, s))
    os.system("magick.exe convert -colorspace Gray ./{}.jpg ./{}.jpg".format(s, s))
    os.system("magick.exe convert -resize 500% ./{}.jpg ./{}-gray.jpg".format(s, s))
    img = Image.open("./{}-gray.jpg".format(s))

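magick.exe is the Windows executable of ImageMagick (which also has to be on PATH, of course).
ImageMagick is an extremely powerful command-line image processing tool. Three of its features are used here:
1. Convert the .gif to .jpg. Why?
Because tesseract did not accept GIF input.
2. Convert the .jpg to grayscale. Why?
Because grayscale is easier to recognize; the text has nothing to do with color.
- Why not convert to pure black and white?
- The results were worse than with grayscale.
3. Enlarge it 5x. Why?
The original image is too small for tesseract to recognize.
After all these operations, the image becomes:
[Image: the CAPTCHA after grayscale conversion and 5x enlargement]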
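
As an aside, the same grayscale and enlarge steps can probably be done with Pillow alone, since it is already imported. This is a swapped-in alternative, not what the script above does, and the output may differ slightly from ImageMagick's, so the recognition rate would need to be measured again. The function name preprocess and the PNG output file are assumptions for illustration.

```python
from PIL import Image


def preprocess(img, scale=5):
    """Convert the image to grayscale and enlarge it by `scale`."""
    gray = img.convert('L')  # 'L' is Pillow's 8-bit grayscale mode
    # Uses Pillow's default resampling filter for resize().
    return gray.resize((gray.width * scale, gray.height * scale))


if __name__ == '__main__':
    # Assumes ./test.gif was downloaded beforehand.
    result = preprocess(Image.open('./test.gif'))
    result.save('./test-gray.png')
```

The returned object can be handed to pytesseract.image_to_string() directly, which would avoid the intermediate files.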

Core step 2

    img = Image.open("./{}-gray.jpg".format(s))
    ans = pytesseract.image_to_string(img, config='-c tessedit_char_whitelist=BDFHJLNPRTVXZ02468 -psm 9').strip()

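This simply calls tesseract to recognize the image.
What are the key points?
1. config
The first part is tessedit_char_whitelist, a character whitelist.
After collecting statistics, making a bold guess, and verifying it carefully, I found that only "even" characters appear (i.e. BDFHJ…). That cuts the candidate set in half, and restricting recognition to it should raise the recognition rate considerably.
2. psm (page segmentation mode)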

Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
Source: https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage

In my experiments, a value of 9 gave the highest recognition rate.
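
The two settings above can be sketched in code. The whitelist is simply every second letter of the alphabet plus the even digits, and the psm comparison amounts to scoring each mode against hand-labeled samples. The names build_config and best_psm, the range of modes tried, and the recognize callback are all hypothetical; the post does not say how the experiment was actually run.

```python
import string

# Letters at even positions of the alphabet (B, D, F, ...) plus even digits.
WHITELIST = string.ascii_uppercase[1::2] + string.digits[0::2]


def build_config(psm, whitelist=WHITELIST):
    """Build the config string passed to pytesseract.image_to_string()."""
    # Newer Tesseract releases expect --psm instead of -psm.
    return '-c tessedit_char_whitelist={} -psm {}'.format(whitelist, psm)


def best_psm(samples, recognize, modes=range(6, 14)):
    """Return (psm, accuracy) for the mode that scores best on labeled samples.

    samples:   list of (image, expected_text) pairs labeled by hand.
    recognize: callable (image, config) -> text, for example a thin
               wrapper around pytesseract.image_to_string.
    """
    scores = {}
    for psm in modes:
        config = build_config(psm)
        hits = sum(recognize(img, config).strip() == truth for img, truth in samples)
        scores[psm] = hits / len(samples)
    top = max(scores, key=scores.get)
    return top, scores[top]
```

Passing the recognizer in as a callback keeps the comparison independent of how tesseract is invoked, so the same loop works with either flag spelling.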

In summary

And that is how the school's Academic Affairs Office CAPTCHA got hacked.
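
With a success rate only somewhat above 50%, a single attempt fails often, so in practice it may make sense to retry. Below is a hedged sketch that rejects obviously malformed results and asks for a fresh CAPTCHA. The post does not state the CAPTCHA length, so expected_len is a parameter you have to supply from your own observation, and the function names are invented for illustration. A result that passes this check can still be wrong; only the server's response settles that.

```python
ALLOWED = 'BDFHJLNPRTVXZ02468'


def looks_valid(text, expected_len, allowed=ALLOWED):
    """Cheap sanity check: right length and only whitelisted characters."""
    return len(text) == expected_len and all(ch in allowed for ch in text)


def solve_with_retry(solve_once, expected_len, max_tries=5):
    """Call solve_once() until it returns a plausible answer, or give up.

    solve_once: zero-argument callable that downloads a new CAPTCHA and
                returns the OCR result, e.g. lambda: solve('test').
    """
    for _ in range(max_tries):
        answer = solve_once()
        if looks_valid(answer, expected_len):
            return answer
    return None
```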