UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1543: character maps to <undefined>

Background

While decoding the raw bytes of an HTML file (that is, converting bytes to str), I ran into the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1543: character maps to <undefined>

This is the code that was executed:

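import chardet

with open('1.html', 'rb') as r:
    content = r.read()
    # Guess the encoding from the raw bytes; the guess may be None.
    encoding = chardet.detect(content)['encoding']
    if encoding is None:
        encoding = 'utf-8'
    # This is the line that raises UnicodeDecodeError.
    content = content.decode(encoding)
    print(content)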

Solution

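Decoding with the detected encoding can still fail, because the detected encoding is only a guess and is not necessarily correct. When that happens, fall back to decoding with utf-8: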

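import chardet

with open('1.html', 'rb') as r:
    content = r.read()
    encoding = chardet.detect(content)['encoding']
    if encoding is None:
        encoding = 'utf-8'
    try:
        content = content.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # The guess was wrong (or names an unknown codec): fall back to utf-8.
        content = content.decode('utf-8')
    print(content)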

In fact, in this example chardet.detect(content) returns:

{'encoding': 'Windows-1254', 'confidence': 0.417065260641214, 'language': 'Turkish'}

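The confidence is very low (about 0.42), so the guess should not be trusted. Byte 0x90 has no mapping in Windows-1254, which is why the 'charmap' codec reports "character maps to <undefined>".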
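Since the detection result carries a confidence value, one option is to ignore low-confidence guesses entirely. The following is a minimal sketch of that idea, not part of the original fix: the helper name decode_with_fallback and the 0.5 threshold are assumptions chosen for illustration, and the helper takes a dict shaped like the output of chardet.detect() so that it runs without chardet installed. It also reproduces the original error on a two-byte input.

```python
def decode_with_fallback(raw, detected, min_confidence=0.5):
    """Decode raw bytes, trusting the detected encoding only when confident.

    `detected` is a dict shaped like the output of chardet.detect().
    `min_confidence` is an arbitrary threshold chosen for illustration.
    """
    encoding = detected.get('encoding')
    confidence = detected.get('confidence') or 0.0

    # Try the detected encoding first only if the detector was confident.
    candidates = []
    if encoding and confidence >= min_confidence:
        candidates.append(encoding)
    candidates.append('utf-8')

    for candidate in candidates:
        try:
            return raw.decode(candidate)
        except (UnicodeDecodeError, LookupError):
            continue

    # Last resort: replace undecodable bytes with U+FFFD instead of failing.
    return raw.decode('utf-8', errors='replace')


# The detection result reported above.
detected = {'encoding': 'Windows-1254',
            'confidence': 0.417065260641214,
            'language': 'Turkish'}

# 'Ð' (U+00D0) is b'\xc3\x90' in UTF-8, so it contains the byte 0x90.
raw = 'Ð'.encode('utf-8')

# Decoding with the wrong guess reproduces the 'charmap' error.
try:
    raw.decode(detected['encoding'])
except UnicodeDecodeError as e:
    error_message = str(e)
    print(error_message)

# The helper skips the low-confidence guess and decodes as utf-8.
print(decode_with_fallback(raw, detected))
```

With real input, the second argument would be chardet.detect(content). The final errors='replace' step trades correctness for robustness, so it may be worth logging when it is reached.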

⚠️ The rb mode and the encoding parameter are incompatible. Specifying both raises ValueError: binary mode doesn't take an encoding argument.
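The incompatibility is easy to confirm, and text mode offers an alternative when the encoding is already known. A small sketch, assuming a throwaway file written to a temporary directory (the file name sample.html and its contents are made up for the demonstration):

```python
import os
import tempfile

# Write a small UTF-8 file to a temporary path (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), 'sample.html')
with open(path, 'wb') as w:
    w.write('<p>héllo</p>'.encode('utf-8'))

# Binary mode combined with encoding is rejected by open().
try:
    open(path, 'rb', encoding='utf-8')
except ValueError as e:
    message = str(e)
    print(message)

# Text mode accepts encoding (and errors) and returns str directly.
with open(path, 'r', encoding='utf-8', errors='replace') as r:
    text = r.read()
print(text)
```

Text mode only helps when the encoding is known in advance. When it has to be detected from the bytes, reading in rb mode and decoding manually, as in the solution above, is still needed.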
