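Preface:
I have long been responsible for the core data-processing stage, so I am very sensitive to changes in the characteristics of crawled data. To make their listings stand out and feel more personal, merchants often put emoji into required attributes.
Reproducing the problem: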
### Cause: java.sql.SQLException: Incorrect string value: '\xF0\x9F\x87\xB7\xF0\x9F...' for column 'goods_title' at row 1
### Error updating database. Cause: java.sql.SQLException: Incorrect string value: '\xF0\x9F\x8F\x86Ch...' for column 'goods_title' at row 1
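Cause of the exception:
The errors above occur because the table, database, or column character set is utf8, which cannot encode emoji. The official documentation says: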
MySQL supports multiple Unicode character sets:
- utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.
- utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.
- utf8: An alias for utf8mb3.
- ucs2: The UCS-2 encoding of the Unicode character set using two bytes per character.
- utf16: The UTF-16 encoding for the Unicode character set using two or four bytes per character. Like ucs2 but with an extension for supplementary characters.
- utf16le: The UTF-16LE encoding for the Unicode character set. Like utf16 but little-endian rather than big-endian.
- utf32: The UTF-32 encoding for the Unicode character set using four bytes per character.
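Emoji take four bytes in UTF-8, whereas MySQL's utf8 (utf8mb3) stores at most three bytes per character, which is what causes the errors above. The MySQL documentation also carries a special note: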
Note:
The utf8mb3 character set is deprecated and will be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at that point utf8 will become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.
In short, utf8mb3 will eventually be dropped, after which utf8 will refer to utf8mb4.
For details, see the official MySQL documentation.
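To make the byte-length issue concrete, here is a minimal Python sketch that decodes the bytes quoted in the second error message and counts how many bytes each character needs in UTF-8. The helper names `utf8_len` and `needs_utf8mb4` are made up for illustration; they are not part of MySQL or of any driver.

```python
# Bytes copied from the second error message: '\xF0\x9F\x8F\x86Ch...'
raw = b"\xF0\x9F\x8F\x86Ch"
text = raw.decode("utf-8")


def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(s: str) -> bool:
    """True if s contains a character above U+FFFF, i.e. a 4-byte UTF-8 sequence."""
    return any(ord(ch) > 0xFFFF for ch in s)


first = text[0]
print(hex(ord(first)), utf8_len(first))   # 0x1f3c6 4
print(utf8_len("商"))                      # 3
print(needs_utf8mb4("商品标题"), needs_utf8mb4(text))  # False True
```

A check like `needs_utf8mb4` could be run over crawled titles before insertion to see how many rows a utf8 (utf8mb3) column would reject.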
Comparing character sets, connection character sets, and collations
- Common collations for utf8mb4 include utf8mb4_unicode_ci and utf8mb4_general_ci.
utf8mb4_unicode_ci versus utf8mb4_general_ci:
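- Accuracy:
  - utf8mb4_unicode_ci sorts and compares according to the Unicode standard, so it orders text accurately across languages.
  - utf8mb4_general_ci does not implement the Unicode collation rules, so for some languages or characters the sort order may differ from what is expected.
  In the vast majority of cases, though, the order of these special characters does not need to be that precise.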
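- Performance:
  - utf8mb4_general_ci is faster when comparing and sorting.
  - utf8mb4_unicode_ci implements a somewhat more complex algorithm so that the Unicode collation rules can handle special characters.
  In the vast majority of cases, however, such complex comparisons never occur. Rather than worrying about which collation to pick, make sure the character set and collation are consistent across the database.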
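- Case sensitivity:
  - utf8mb4_general_ci is case-insensitive (the _ci suffix stands for "case-insensitive").
  - utf8mb4_bin is case-sensitive, because it compares characters by their binary code values.
  - utf8mb4_unicode_ci is case-insensitive.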
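The difference between a binary and a case-insensitive comparison can be illustrated outside MySQL. The Python sketch below is only an analogy: plain string comparison orders by code point, much like a _bin collation, while `str.casefold` is used as a rough stand-in for a _ci collation. It is not MySQL's actual collation algorithm, and the sample titles are invented.

```python
titles = ["banana", "Apple", "apple", "Banana"]

# Code-point comparison, similar in spirit to a _bin collation:
# every uppercase ASCII letter sorts before every lowercase one.
binary_order = sorted(titles)

# Case-folded comparison as a stand-in for a _ci collation.
# Python's sort is stable, so equal keys keep their input order.
ci_order = sorted(titles, key=str.casefold)

print(binary_order)  # ['Apple', 'Banana', 'apple', 'banana']
print(ci_order)      # ['Apple', 'apple', 'banana', 'Banana']

# Equality follows the same split: distinct under binary rules,
# equal once case is folded.
print("apple" == "Apple")                        # False
print("apple".casefold() == "Apple".casefold())  # True
```

This matters for columns such as `goods_title`: with a _bin collation, a unique index or a `WHERE` match treats "Apple" and "apple" as different values, while a _ci collation treats them as the same.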
Next, how to change the database settings so that emoji can be stored.
- Change the table and column definitions
ALTER TABLE `goods_record`
MODIFY COLUMN `goods_title` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL DEFAULT '' COMMENT '商品标题' AFTER `goods_id`;
ALTER TABLE `goods_record`
DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8mb4_bin;
- Edit the database configuration file
Open /etc/my.cnf and add the following:
[mysqld]
character-set-server=utf8mb4
[mysql]
default-character-set=utf8mb4
To check whether the settings are in effect (run this after the restart):
show variables like '%char%';
- Restart MySQL so the changes take effect
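Reading the output of `show variables like '%char%'` by eye gets tedious across many servers. As a rough sketch, the rows can be loaded into a dict and filtered for anything that is still not utf8mb4. The function name `find_non_utf8mb4` and the sample values below are hypothetical; the real values come from your own server.

```python
def find_non_utf8mb4(variables: dict) -> dict:
    """Return the character_set_* variables whose value is not utf8mb4."""
    # These two are not expected to become utf8mb4:
    # character_set_system is always a 3-byte utf8 variant, and
    # character_set_filesystem normally stays at "binary".
    skip = {"character_set_system", "character_set_filesystem"}
    return {
        name: value
        for name, value in variables.items()
        if name.startswith("character_set_")
        and name not in skip
        and value != "utf8mb4"
    }


# Hypothetical rows, shaped like the result of: show variables like '%char%';
sample = {
    "character_set_client": "utf8mb4",
    "character_set_connection": "utf8mb4",
    "character_set_database": "utf8",
    "character_set_filesystem": "binary",
    "character_set_results": "utf8mb4",
    "character_set_server": "utf8mb4",
    "character_set_system": "utf8",
    "character_sets_dir": "/usr/share/mysql/charsets/",
}

print(find_non_utf8mb4(sample))  # {'character_set_database': 'utf8'}
```

In this made-up sample only `character_set_database` still needs attention, which would point at the current database's default rather than at the server configuration.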
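Notes on the character set variables:
- character_set_client, character_set_connection, and character_set_results describe the client connection.
- character_set_system, character_set_server, and character_set_database are server-side settings.
For the character set used to store data, the precedence is:
column character set > table character set > character_set_database > character_set_server
character_set_system is not part of this chain: it is the character set the server uses for identifiers (metadata) and is always utf8 (utf8mb3).
The column-level character set has the highest priority on the server side.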
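The precedence for stored data can be sketched as a simple fallback chain. This is an illustrative model, not MySQL code: the function name `effective_charset` is made up, and `None` stands for "no character set declared at this level". It covers only the levels that decide how column data is stored; character_set_system is left out because MySQL uses it for identifiers rather than for column data.

```python
from typing import Optional


def effective_charset(column: Optional[str],
                      table: Optional[str],
                      database: Optional[str],
                      server: str) -> str:
    """Pick the first explicitly declared character set, most specific first."""
    for candidate in (column, table, database):
        if candidate is not None:
            return candidate
    # Nothing more specific was declared, so the server default applies.
    return server


# A column declared as utf8mb4 wins even if everything around it is utf8.
print(effective_charset("utf8mb4", "utf8", "utf8", "utf8"))  # utf8mb4

# No column or table charset: the database default applies before the server's.
print(effective_charset(None, None, "utf8", "utf8mb4"))      # utf8
```

The second call shows why changing only `character-set-server` in my.cnf may not be enough: an existing database or table that already carries a utf8 default keeps it until it is altered explicitly.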