Python Web Scraping in Practice 3 | Recognizing Text in Images (Using CAPTCHA Recognition as an Example)

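1. Project Background

During my internship, while scraping data from environmental-protection platforms, I often found that the information I needed was embedded in images, such as the one below. Extracting the information from such images is what motivated me to work on image text recognition: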

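2. Project Approach

Because a single website can contain a large number of the images I want to extract, my approach is:

1. First collect the names and URLs of these images;

2. Then download the image data from those URLs;

3. Finally recognize the text in each image.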
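
The three steps above can be sketched as a small pipeline. This is only an outline under assumptions: the name run_pipeline and the injected fetch and recognize callables are hypothetical and do not appear in the scripts below, and the stand-in lambdas merely take the place of real HTTP and OCR calls.

```python
from typing import Callable, Dict, List, Tuple


def run_pipeline(
    images: List[Tuple[str, str]],
    fetch: Callable[[str], bytes],
    recognize: Callable[[bytes], str],
) -> Dict[str, str]:
    """Map each image name to the text recognized in it.

    images    -- (name, url) pairs collected in step 1
    fetch     -- step 2: turns a URL into raw image bytes
    recognize -- step 3: turns image bytes into text
    """
    results = {}
    for name, url in images:
        data = fetch(url)
        results[name] = recognize(data)
    return results


# Stand-in callables so the outline runs without network access;
# in practice fetch would wrap requests.get and recognize an OCR client.
demo = run_pipeline(
    [("sample", "https://example.com/sample.jpg")],
    fetch=lambda url: url.encode("utf-8"),
    recognize=lambda data: "{} bytes".format(len(data)),
)
print(demo)
```

Passing the download and recognition steps in as functions keeps each stage replaceable and easy to test on its own.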

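3. CAPTCHA Image Recognition Example

【1】First, find a website that hosts many CAPTCHA images, for example a CAPTCHA-handling website. In the page source (right-click on the page), locate the image URLs and their names, then download the images. The code is as follows: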

spicerman.py

from urllib import parse

import chardet
import requests
from bs4 import BeautifulSoup

url = 'https://captcha.com/captcha-examples.html?cst=corg'
user_agent = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0'
headers = {
    'User-Agent': user_agent
}
response = requests.get(url, headers=headers, timeout=10)
response.encoding = chardet.detect(response.content)['encoding']
html = response.text
soup = BeautifulSoup(html, 'lxml')

# Extract the information we want: the CAPTCHA images and their headings
image_list = soup.find_all(name='img', class_='captcha_sample')
image_names = soup.find_all(name='h3')

image_urls = []
for seq, image in enumerate(image_list):
    image_url = parse.urljoin(url, image['src'])
    image_urls.append(image_url)
    print(image_url)
    # Record every URL, one per line
    with open('urls.txt', 'a') as fout:
        fout.write(image_url + '\n')
    # Download the image from its URL
    try:
        url_res = requests.get(image_url, headers=headers, timeout=10)
        if url_res.status_code == 200:
            # Name the file after the <h3> heading at the same position
            name = image_names[seq].text.strip() + '.jpg'
            with open(name, 'wb') as fout:
                fout.write(url_res.content)
            print('Image {} downloaded successfully!'.format(seq))
    except Exception as e:
        print(e)
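
The script above uses heading text directly as a file name, which can fail when a heading contains characters such as / or : that file systems reject. Below is a minimal sketch of one way to guard against that; safe_filename is a hypothetical helper that is not part of the original script, and the relative image path is made up for illustration.

```python
import re
from urllib.parse import urljoin


def safe_filename(title, ext=".jpg"):
    """Turn a heading into a file name that is safe on common file systems."""
    # Collapse runs of whitespace and trim the ends
    cleaned = re.sub(r"\s+", " ", title).strip()
    # Replace characters that Windows or POSIX file systems reject
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", cleaned)
    # Fall back to a fixed name if nothing usable is left
    return (cleaned or "image") + ext


# urljoin resolves a relative src against the page URL
page = "https://captcha.com/captcha-examples.html?cst=corg"
print(urljoin(page, "images/sample.jpg"))  # hypothetical relative path
print(safe_filename("  Wave Captcha\nImage "))
```

In the download loop, the helper would take the place of the plain string concatenation that builds name.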

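【2】Next, we can use Baidu's image recognition service. The Baidu AI Open Platform is at http://ai.baidu.com. Log in through the console link in the top-right corner of the page; you will then see the following:

【3】On the left there is a "Text Recognition" (文字识别) menu; click it:

【4】Then create an application (any name will do). Once it is created, the page shows the application's AppID, API Key, and Secret Key:

【5】With this information we can use Baidu's image recognition. The module is imported as aip and is installed from the terminal with pip3 install baidu-aip. The module's documentation is available on the website, so I will not repeat it here.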
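
Since these three values are secrets, one option is to read them from environment variables instead of writing them into the script. The sketch below assumes the variable names BAIDU_APP_ID, BAIDU_API_KEY and BAIDU_SECRET_KEY, which are hypothetical and not defined by Baidu or by the original script.

```python
import os


def load_credentials(env=os.environ):
    """Return (app_id, api_key, secret_key) read from environment variables."""
    # Hypothetical variable names chosen for this sketch
    names = ("BAIDU_APP_ID", "BAIDU_API_KEY", "BAIDU_SECRET_KEY")
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[n] for n in names)
```

The returned tuple could then be unpacked when building the client, for example AipOcr(*load_credentials()).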

The resulting recognizer.py is as follows:

from aip import AipOcr

# Fill in your own credentials
APP_ID = '1××××××8'
API_KEY = 'k×××××××××××××××××h81'
SECRET_KEY = 'a×××××××××××××××××××××KV'

client = AipOcr(APP_ID, API_KEY, SECRET_KEY)


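# Read the image file as bytes
def get_file_content(filepath):
    with open(filepath, 'rb') as fp:
        return fp.read()

# Call general text recognition on a local image
image = get_file_content('Wave Captcha Image.jpg')
# Optional request parameters (the API expects underscores in the keys)
options = {
    # Detect the orientation of the image
    'detect_direction': 'true',
    'language_type': 'CHN_ENG'
}
result = client.general(image, options)
print(result)
# An error response has no 'words_result' key, so fall back to an empty list
for word in result.get('words_result', []):
    print(word['words'])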
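
The response is a plain dict, and a failed request carries error_code and error_msg instead of words_result. A minimal sketch of pulling the text out defensively follows; extract_words is a hypothetical helper, and the sample dict is hand-written for illustration rather than real API output.

```python
def extract_words(result):
    """Return the recognized lines of text from an OCR response dict."""
    # Failed requests carry error_code / error_msg instead of words_result
    if "error_code" in result:
        raise RuntimeError(
            "OCR request failed: {} {}".format(
                result["error_code"], result.get("error_msg", "")
            )
        )
    return [item["words"] for item in result.get("words_result", [])]


# Hand-written sample shaped like a successful response (not real API output)
sample = {"words_result_num": 1, "words_result": [{"words": "AB12"}]}
print(extract_words(sample))
```

Raising on an error response makes problems such as invalid keys or an exhausted quota visible immediately instead of silently printing nothing.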

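【6】The console then displays the text contained in the image.

Below is the text I recognized from the image shown at the very beginning: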