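Cracking Numeric CAPTCHAs

When writing a crawler, you sometimes need to get past a CAPTCHA, so this post explains how to crack one.

The simplest case: digit and letter CAPTCHAs

For a CAPTCHA like this one, we can crack it directly with OCR.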

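Introduction to OCR

See the Baidu Baike entry.

Using OCR in Python

Basic OCR usage

We first need to install the OCR software itself (tesserocr is a Python wrapper around the Tesseract engine); only then can Python perform OCR.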

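Installing on Windows:
- Download the installer: download link
- Add the OCR program's directory to the system PATH (see here)
Installing the Python dependencies:
- Run: pip install tesserocr pillow
Usage
Let's turn the CAPTCHA above into digits.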

import tesserocr
from PIL import Image

image = Image.open("captcha1.jpg")  # Load the image file as a PIL Image object.
ans = tesserocr.image_to_text(image)  # Run OCR on the loaded image.
# Besides image_to_text, tesserocr also provides file_to_text,
# which takes the path of the image file directly.
print(ans)

Running the code above prints 2127.
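
The string that OCR returns usually ends with a newline and may contain stray spaces or punctuation. Below is a minimal sketch of cleaning it up, assuming the CAPTCHA contains only ASCII letters and digits. `clean_ocr_text` is a hypothetical helper, not part of tesserocr, and the raw strings are made up for illustration.

```python
import re


def clean_ocr_text(raw):
    """Keep only ASCII letters and digits from raw OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


# Made-up raw strings of the kind an OCR engine may return.
print(clean_ocr_text("2127\n"))      # 2127
print(clean_ocr_text(" 2 1-27 \n"))  # 2127
```

If the cleaned string does not have the expected length, that is a cheap signal that recognition failed and the image needs preprocessing.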

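Harder cases for OCR

The approach above handles simple CAPTCHAs, but a more complex one, like this, calls for grayscale conversion, binarization, and noise removal. First, why are these three steps possible at all?
Because an image on a computer is just a grid of pixels. In an RGB image, each pixel is a tuple of three elements, which hold the R, G, and B values.
So how do we read those three values? The code below walks over an image and prints the RGB value of every pixel.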

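from PIL import Image

img = Image.open("captcha1.jpg")
width, height = img.size
pixels = img.load()  # Get the pixel access object once, outside the loops.
for i in range(width):
    for j in range(height):
        pixel = pixels[i, j]  # An (R, G, B) tuple for an RGB image.
        print(pixel)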

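This prints every pixel of the image, one by one.
Binarization works by choosing a threshold: every pixel on one side of the threshold becomes white, and every pixel on the other side becomes black. Combined with grayscale conversion and a denoising pass, this leaves black characters on a clean white background.
Before denoising, the image has to be converted to grayscale and then to pure black and white. The code is as follows.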

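# Mode "L" is a grayscale image: each pixel is 8 bits, 0 is black, 255 is white,
# and the values in between are shades of gray.
Img = img.convert('L')

# Black-and-white conversion with a custom gray threshold:
# values below it become black (0), values at or above it become white (1).
threshold = 200

table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)

# Binarize the image
photo = Img.point(table, '1')
photo.save("BuildImageCode2.jpg")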
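
To see what these two steps do to individual pixels, here is a pure-Python sketch that needs no image file. It assumes the luma weights that Pillow documents for convert('L'), L = R*299/1000 + G*587/1000 + B*114/1000, and reuses the threshold of 200 from above. `to_gray`, `binarize`, and the sample pixels are hypothetical names and values chosen for illustration, and Pillow's own rounding may differ slightly.

```python
def to_gray(rgb):
    """Approximate the gray level of an (R, G, B) tuple with ITU-R 601-2 luma weights."""
    r, g, b = rgb
    return (r * 299 + g * 587 + b * 114) // 1000


def binarize(gray, threshold=200):
    """Return 0 (black) below the threshold, 1 (white) otherwise."""
    return 0 if gray < threshold else 1


# Hypothetical pixels: a dark digit stroke, mid-gray noise, near-white background.
samples = [(20, 20, 20), (150, 160, 170), (250, 250, 250)]
for rgb in samples:
    gray = to_gray(rgb)
    print(rgb, gray, binarize(gray))
```

With a threshold as high as 200, mid-gray noise also turns black, which is why a denoising pass is still needed afterwards.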

After these two steps (grayscale and black-and-white conversion), the resulting photo can be denoised.

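def judge(pixl):
    """Return 1 if the pixel is light (white), otherwise 0."""
    THRESHOLD = 80

    if pixl > THRESHOLD:
        return 1
    else:
        return 0


def depoint(img):
    """Denoise a binarized image in place and return it."""
    pixdata = img.load()
    w, h = img.size
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            count = 0  # Number of white pixels among the 8 neighbours.
            if judge(pixdata[x, y - 1]):  # up
                count = count + 1
            if judge(pixdata[x, y + 1]):  # down
                count = count + 1
            if judge(pixdata[x - 1, y]):  # left
                count = count + 1
            if judge(pixdata[x + 1, y]):  # right
                count = count + 1
            if judge(pixdata[x - 1, y - 1]):  # upper left
                count = count + 1
            if judge(pixdata[x - 1, y + 1]):  # lower left
                count = count + 1
            if judge(pixdata[x + 1, y - 1]):  # upper right
                count = count + 1
            if judge(pixdata[x + 1, y + 1]):  # lower right
                count = count + 1
            if count > 4:
                # Mostly white neighbours: treat the pixel as noise and whiten it.
                pixdata[x, y] = 255

    return img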
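
The same neighbour-counting rule can be tried on a tiny hand-made grid without Pillow. This is only a sketch, assuming 0 is black and 255 is white as in the binarized image. `depoint_grid` and the grid itself are hypothetical, written to mirror `depoint` above.

```python
def depoint_grid(grid):
    """Whiten every inner pixel that has more than 4 white neighbours (in place)."""
    h, w = len(grid), len(grid[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Count white pixels among the 8 surrounding positions.
            white = sum(
                1
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dx, dy) != (0, 0) and grid[y + dy][x + dx] > 80
            )
            if white > 4:
                grid[y][x] = 255
    return grid


W, B = 255, 0
grid = [
    [W, W, W, W, W],
    [W, B, W, W, W],  # isolated black speck at row 1, column 1
    [W, W, W, W, W],
    [W, W, W, W, W],
]
depoint_grid(grid)
print(grid[1][1])  # 255
```

A pixel inside a thick stroke keeps enough black neighbours to survive, while an isolated speck is surrounded by white and gets erased.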

That's all there is to it. The complete code is as follows.

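import tesserocr
from PIL import Image


def judge(pixl):
    """Return 1 if the pixel is light (white), otherwise 0."""
    THRESHOLD = 80
    if pixl > THRESHOLD:
        return 1
    else:
        return 0


def depoint(img):
    """Denoise a binarized image in place and return it."""
    pixdata = img.load()
    w, h = img.size
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            count = 0  # Number of white pixels among the 8 neighbours.
            if judge(pixdata[x, y - 1]):  # up
                count = count + 1
            if judge(pixdata[x, y + 1]):  # down
                count = count + 1
            if judge(pixdata[x - 1, y]):  # left
                count = count + 1
            if judge(pixdata[x + 1, y]):  # right
                count = count + 1
            if judge(pixdata[x - 1, y - 1]):  # upper left
                count = count + 1
            if judge(pixdata[x - 1, y + 1]):  # lower left
                count = count + 1
            if judge(pixdata[x + 1, y - 1]):  # upper right
                count = count + 1
            if judge(pixdata[x + 1, y + 1]):  # lower right
                count = count + 1
            if count > 4:
                # Mostly white neighbours: treat the pixel as noise and whiten it.
                pixdata[x, y] = 255
    return img


def heibaihua(img):
    """Convert an image to grayscale and then to pure black and white."""
    # Mode "L" is a grayscale image: each pixel is 8 bits, 0 is black,
    # 255 is white, and the values in between are shades of gray.
    Img = img.convert('L')
    # Custom gray threshold: values below it become black (0),
    # values at or above it become white (1).
    threshold = 200
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)
        else:
            table.append(1)
    # Binarize the image
    photo = Img.point(table, '1')
    photo.save("BuildImageCode2.jpg")
    return photo


img = Image.open("C://Users/asus/Desktop/BuildImageCode.jpg")  # Load the image file as a PIL Image object.
img = heibaihua(img)
img = depoint(img)
ans = tesserocr.image_to_text(img)
print(ans)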
