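Someone in our project used the iconv functions to convert UTF-8 to UCS-2 but did not handle the case where the conversion fails, and this caused a bug in production.
After looking into it, I found that iconv_open has a built-in feature that might solve the problem: appending //IGNORE to the target encoding makes the conversion skip whatever cannot be converted. The man page explains it as follows: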
iconv_t iconv_open(const char *tocode, const char *fromcode);

DESCRIPTION
The iconv_open() function allocates a conversion descriptor suitable for converting byte sequences from character encoding fromcode to character encoding tocode.

The values permitted for fromcode and tocode and the supported combinations are system-dependent. For the GNU C library, the permitted values are listed by the iconv --list command, and all combinations of the listed values are supported. Furthermore the GNU C library and the GNU libiconv library support the following two suffixes:

//TRANSLIT
    When the string "//TRANSLIT" is appended to tocode, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.

//IGNORE
    When the string "//IGNORE" is appended to tocode, characters that cannot be represented in the target character set will be silently discarded.

The resulting conversion descriptor can be used with iconv(3) any number of times. It remains valid until deallocated using iconv_close(3).

A conversion descriptor contains a conversion state. After creation using iconv_open(), the state is in the initial state. Using iconv(3) modifies the descriptor's conversion state. (This implies that a conversion descriptor can not be used in multiple threads simultaneously.) To bring the state back to the initial state, use iconv(3) with NULL as inbuf argument.
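To make the excerpt concrete, here is a minimal sketch in C of opening a descriptor with or without the //IGNORE suffix and checking every return value. The helper name utf8_to_ucs2le, its parameters, and the choice of "UCS-2LE" as the target name are assumptions for illustration, not the project's actual code.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts a UTF-8 buffer to UCS-2LE.
 * If ignore_unrepresentable is non-zero, "//IGNORE" is appended to the
 * target encoding name, as described in the man page excerpt.
 * Returns 0 on success, -1 on failure (errno is the one set by iconv). */
int utf8_to_ucs2le(const char *in, size_t in_len,
                   char *out, size_t out_cap, size_t *out_len,
                   int ignore_unrepresentable)
{
    const char *to = ignore_unrepresentable ? "UCS-2LE//IGNORE" : "UCS-2LE";
    iconv_t cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;                      /* unsupported encoding pair */

    char *src = (char *)in;             /* iconv() takes a non-const pointer */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    int saved_errno = errno;            /* iconv_close() may overwrite errno */
    iconv_close(cd);

    *out_len = out_cap - dst_left;      /* bytes actually written */
    if (rc == (size_t)-1) {
        errno = saved_errno;            /* e.g. EILSEQ, EINVAL or E2BIG */
        return -1;
    }
    return 0;
}
```

The return value is checked even when //IGNORE is requested, because how the suffix is handled may differ between iconv implementations.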
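The result was disappointing: //IGNORE did not filter out the offending icons (emoji), such as the fire icon. (This site cannot even display that icon, so I cannot show it here.)
When such an icon is encoded as UTF-8 it takes 4 bytes, and each of those bytes falls within the byte range that is valid for Chinese characters, so a regular expression based on byte ranges cannot tell them apart and that approach was ruled out.
In the end I solved it by relying on how UTF-8 encodes Chinese characters: each one takes 3 bytes of the form 1110xxxx, 10xxxxxx, 10xxxxxx.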
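The bit pattern above can be checked with simple masks. The following is a sketch only, assuming that the goal is to keep ASCII bytes and 3-byte sequences and to drop everything else before calling iconv; both function names are hypothetical, and the original post does not show its actual implementation.

```c
#include <stddef.h>

/* Non-zero if p[0..2] match 1110xxxx 10xxxxxx 10xxxxxx. */
int is_utf8_three_byte(const unsigned char *p, size_t remaining)
{
    return remaining >= 3
        && (p[0] & 0xF0) == 0xE0
        && (p[1] & 0xC0) == 0x80
        && (p[2] & 0xC0) == 0x80;
}

/* Hypothetical filter: copies ASCII bytes and 3-byte sequences to out and
 * drops every other byte. out must have room for at least len bytes.
 * Returns the number of bytes written. */
size_t keep_ascii_and_three_byte(const unsigned char *in, size_t len,
                                 unsigned char *out)
{
    size_t i = 0, n = 0;
    while (i < len) {
        if (in[i] < 0x80) {
            out[n++] = in[i++];                 /* plain ASCII */
        } else if (is_utf8_three_byte(in + i, len - i)) {
            out[n++] = in[i];
            out[n++] = in[i + 1];
            out[n++] = in[i + 2];
            i += 3;
        } else {
            i++;    /* anything else, e.g. each byte of a 4-byte emoji */
        }
    }
    return n;
}
```

A 4-byte sequence starts with 11110xxx and continues with 10xxxxxx bytes, none of which match the 1110xxxx lead byte, so all four bytes are skipped. Note that the 3-byte pattern covers every code point from U+0800 to U+FFFF, not only Chinese characters, and this sketch also drops 2-byte sequences and does not reject overlong or surrogate encodings.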