Background
While decoding the raw bytes of an HTML file (that is, converting bytes to str), I ran into the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1543: character maps to <undefined>
Here is the code that was executed:
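import chardet

with open('1.html', 'rb') as r:
    content = r.read()

# Guess the encoding; chardet returns None when it cannot decide
encoding = chardet.detect(content)['encoding']
if encoding is None:
    encoding = 'utf-8'
content = content.decode(encoding)  # raises UnicodeDecodeError if the guess is wrong
print(content)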
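
The error can be reproduced without the original file. The sketch below is only an illustration: the sample string and the helper name `try_decode` are assumptions, not part of the original code. It encodes a Cyrillic "А" (U+0410) as UTF-8, which yields the bytes 0xD0 0x90, and then decodes them as Windows-1254, where byte 0x90 has no character assigned.

```python
# UTF-8 bytes of the Cyrillic capital letter "А" (U+0410) are 0xD0 0x90.
data = "\u0410".encode("utf-8")


def try_decode(raw, encoding):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# Decoding with the wrong single-byte code page fails on byte 0x90.
text, err = try_decode(data, "Windows-1254")
print(err)

# Decoding with the real encoding succeeds.
text_ok, err_ok = try_decode(data, "utf-8")
print(text_ok)
```

The 'charmap' wording in the message comes from Python's table-based codecs such as cp1254 (the codec behind the name Windows-1254), which is why the error mentions 'charmap' rather than the encoding name.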
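Solution
Decoding with the detected encoding can still fail, because the detected encoding is only a guess and may be wrong. When that happens, fall back to decoding with utf-8: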
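import chardet

with open('1.html', 'rb') as r:
    content = r.read()

encoding = chardet.detect(content)['encoding']
if encoding is None:
    encoding = 'utf-8'
try:
    content = content.decode(encoding)
except (UnicodeDecodeError, LookupError):
    # The detected encoding was wrong or unknown to Python; fall back to utf-8
    content = content.decode('utf-8')
print(content)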
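
The fallback above still raises if the bytes are not valid utf-8 either. One possible extension, sketched below under assumptions, tries a list of candidate encodings in order and, as a last resort, decodes with errors="replace" so that undecodable bytes become U+FFFD instead of raising. The function name `decode_with_fallbacks` and the "utf-8 (lossy)" label are hypothetical and not part of the original solution.

```python
def decode_with_fallbacks(raw, candidates=("utf-8",)):
    """Try each candidate encoding in order; return (text, label)."""
    for encoding in candidates:
        if not encoding:
            # Skip None, e.g. when detection produced no guess.
            continue
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: move on to the next one.
            continue
    # Last resort: never raises, but bad bytes are replaced with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


sample = "\u0410".encode("utf-8")
print(decode_with_fallbacks(sample, ("Windows-1254", "utf-8")))
```

Note that many single-byte encodings accept almost any byte sequence, so a wrong guess can succeed and silently produce garbled text. Putting utf-8 early in the candidate list may help, since utf-8 decoding is strict.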
In fact, in this example the output of chardet.detect(content) is
{'encoding': 'Windows-1254', 'confidence': 0.417065260641214, 'language': 'Turkish'}
The confidence is very low (about 0.42), so the guess is unreliable.
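
Since the detection result carries a confidence value, one option is to ignore low-confidence guesses instead of waiting for the decode to fail. The sketch below takes the detection dict as a plain argument so it runs without chardet installed. The function name `pick_encoding` and the 0.5 threshold are assumptions chosen for illustration, not values recommended by chardet.

```python
def pick_encoding(detection, min_confidence=0.5, fallback="utf-8"):
    """Return the detected encoding only if it is confident enough."""
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# The detection result reported for this example.
detection = {
    "encoding": "Windows-1254",
    "confidence": 0.417065260641214,
    "language": "Turkish",
}
print(pick_encoding(detection))
```

With the result from this example, the guess falls below the assumed threshold, so utf-8 is used directly.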
⚠️ The rb mode and the encoding parameter of open() are incompatible. Specifying both raises: ValueError: binary mode doesn't take an encoding argument
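
A small sketch of this restriction and the two ways around it follows. The temporary file and its sample content are assumptions made so the example is self-contained. Either open the file in text mode and pass encoding, or open it in binary mode and decode the bytes yourself.

```python
import os
import tempfile

# Create a small UTF-8 file to work with (hypothetical sample content).
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write("<p>h\u00e9llo</p>".encode("utf-8"))
    path = tmp.name

try:
    # Binary mode plus encoding is rejected by open() itself.
    try:
        open(path, "rb", encoding="utf-8")
        message = None
    except ValueError as exc:
        message = str(exc)
    print(message)

    # Option 1: text mode, open() does the decoding.
    with open(path, "r", encoding="utf-8") as f:
        text_mode = f.read()

    # Option 2: binary mode, decode the bytes explicitly.
    with open(path, "rb") as f:
        binary_mode = f.read().decode("utf-8")
finally:
    os.remove(path)
```

Binary mode is the better fit when the encoding is not known in advance, as in this article, because the raw bytes are needed for detection before any decoding happens.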