Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/weixin_39198406/article/details/85808544
Further reading: [tesseract configuration notes 1](http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version)
Further reading: [tesseract configuration notes 2](https://stackoverflow.com/questions/13007245/how-to-find-parameters-supported-in-tesseract-ocr-config-file)
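This article covers two problems:
- How to run OCR on an image scraped from a web page directly in memory, without saving it to disk
  Use image = BytesIO(response.content) to wrap the downloaded bytes in an in-memory stream.
- How to fix tesseract failing to recognize the leftmost character
  Pass config="--psm 6 --oem 3 -c tessedit_char_whitelist=0123456789" as an argument.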
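To make the first point concrete, here is a minimal sketch of wrapping downloaded bytes in an in-memory stream using only the standard library. The helper name `as_png_stream` and the PNG signature check are assumptions added for illustration and are not part of the original script. The check simply guards against passing an error page to the OCR step by mistake.

```python
from io import BytesIO

# The fixed 8-byte signature that every PNG file starts with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def as_png_stream(payload: bytes) -> BytesIO:
    """Wrap raw bytes in a file-like object (hypothetical helper).

    Raises ValueError if the payload does not start with the PNG signature.
    """
    stream = BytesIO(payload)  # no file is written to disk
    if stream.read(len(PNG_SIGNATURE)) != PNG_SIGNATURE:
        raise ValueError("payload does not start with the PNG signature")
    stream.seek(0)  # rewind so the next reader starts at the first byte
    return stream
```

The returned object can be passed wherever a file opened in binary mode is expected, which is why `Image.open` accepts it in the script.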
Here is the code:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import requests
import pytesseract
from PIL import Image
from io import BytesIO

# Send a browser-like User-Agent with the request.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36"
}
url = "http://static8.ziroom.com/phoenix/pc/images/price/e72ac241b410eac63a652dc1349521fd.png"
response = requests.get(url=url, headers=headers)

# Optional: keep a copy on disk for inspection. OCR below does not need it.
with open("test.png", "wb") as f:
    f.write(response.content)

# Wrap the downloaded bytes in an in-memory stream and open it as an image.
image = BytesIO(response.content)
im = Image.open(image)

# --psm 6 treats the image as a single uniform block of text;
# the whitelist restricts recognition to digits.
text = pytesseract.image_to_string(im, lang="eng", config="--psm 6 --oem 3 -c tessedit_char_whitelist=0123456789")
print(text)
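The config string combines three Tesseract options: `--psm 6` sets the page segmentation mode to "a single uniform block of text", `--oem 3` selects the default OCR engine mode, and `-c tessedit_char_whitelist=...` limits the characters that may be recognized. As a rough sketch, the string can be assembled by a small helper so each option is easy to vary while experimenting. The function name `build_tesseract_config` and its defaults are assumptions for illustration, not part of pytesseract.

```python
def build_tesseract_config(psm: int = 6, oem: int = 3,
                           whitelist: str = "0123456789") -> str:
    """Assemble a Tesseract config string (hypothetical helper).

    psm: page segmentation mode passed as --psm
    oem: OCR engine mode passed as --oem
    whitelist: allowed characters; an empty string omits the option
    """
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# With the defaults this reproduces the string used in the script.
config = build_tesseract_config()
```

The result is passed as the `config` argument of `pytesseract.image_to_string`. If the leftmost character is still dropped, trying another mode such as `build_tesseract_config(psm=7)` (a single text line) may be worth a test, though which mode works best depends on the image.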