Learning MySQL: Character Encoding

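Preface:

I have long been responsible for the core data-processing stage, so I am very sensitive to changes in the characteristics of crawled data. To make their listings stand out and feel personal, merchants often put emoji into essential attributes.

Reproducing the problem: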

### Cause: java.sql.SQLException: Incorrect string value: '\xF0\x9F\x87\xB7\xF0\x9F...' for column 'goods_title' at row 1
### Error updating database.  Cause: java.sql.SQLException: Incorrect string value: '\xF0\x9F\x8F\x86Ch...' for column 'goods_title' at row 1

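Cause of the exception:

The error above occurs because the character set of the database, table, or column is utf8, which cannot encode emoji. The official documentation says: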

MySQL supports multiple Unicode character sets:

utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.

utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.

utf8: An alias for utf8mb3.

ucs2: The UCS-2 encoding of the Unicode character set using two bytes per character.

utf16: The UTF-16 encoding for the Unicode character set using two or four bytes per character. Like ucs2 but with an extension for supplementary characters.

utf16le: The UTF-16LE encoding for the Unicode character set. Like utf16 but little-endian rather than big-endian.

utf32: The UTF-32 encoding for the Unicode character set using four bytes per character.

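Emoji take four bytes in UTF-8, whereas MySQL's utf8 can encode at most three bytes per character, which causes the error above. The MySQL documentation also carries a special note: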

Note:

The utf8mb3 character set is deprecated and will be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at that point utf8 will become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.

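In short, utf8mb3 will be dropped in a future release, after which utf8 will refer to utf8mb4.
See the official MySQL documentation for the announcement.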
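
To make the byte-length argument concrete, here is a minimal Python sketch. It decodes the byte prefixes shown in the two error messages and reports how many UTF-8 bytes each character needs. The helper names `utf8_width` and `needs_utf8mb4` are made up for this illustration and are not part of MySQL or any driver.

```python
# Byte prefixes copied from the two error messages above (the logs truncate the rest).
SAMPLES = [b"\xF0\x9F\x87\xB7", b"\xF0\x9F\x8F\x86"]


def utf8_width(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF, i.e. takes four UTF-8 bytes."""
    return any(ord(ch) > 0xFFFF for ch in text)


for raw in SAMPLES:
    ch = raw.decode("utf-8")
    print(f"U+{ord(ch):04X} width={utf8_width(ch)} needs_utf8mb4={needs_utf8mb4(ch)}")

# Ordinary CJK text fits in three bytes per character, so utf8 (utf8mb3) can store it.
print(needs_utf8mb4("商品标题"))
```

A check like `needs_utf8mb4` could also be run over crawled titles before insertion, to find out which rows would be rejected by a utf8 column.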

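Character sets, connection character sets, and collations compared

The collations available for utf8mb4 include utf8mb4_unicode_ci and utf8mb4_general_ci.

utf8mb4_unicode_ci versus utf8mb4_general_ci:

Accuracy:
- utf8mb4_unicode_ci sorts and compares according to the Unicode standard, so it orders text accurately across languages.
- utf8mb4_general_ci does not implement the Unicode collation rules, so for some languages or characters the sort order may not be what is expected.
  In the vast majority of cases, however, the order of these special characters does not need to be that precise.
Performance:
- utf8mb4_general_ci is faster for comparison and sorting.
- utf8mb4_unicode_ci implements a somewhat more complex algorithm so that the Unicode collation rules can handle special characters.
  In the vast majority of cases, however, such complex comparisons never occur. More important than which collation you choose is that the character set and collation are consistent across the database.
Case sensitivity:
- utf8mb4_general_ci: case-insensitive (the _ci suffix stands for "case-insensitive")
- utf8mb4_bin: case-sensitive (compares the binary code values of the characters)
- utf8_unicode_ci: case-insensitive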
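
The difference between a `_bin` and a `_ci` collation can be pictured with a rough Python analogy. This is only a sketch and not MySQL's algorithm: real `_ci` collations apply further rules (for example for accented letters) that it ignores, and the function names `equal_bin` and `equal_ci` are hypothetical.

```python
def equal_bin(a: str, b: str) -> bool:
    """Rough stand-in for a _bin collation: the encoded bytes must match exactly."""
    return a.encode("utf-8") == b.encode("utf-8")


def equal_ci(a: str, b: str) -> bool:
    """Rough stand-in for a _ci collation: letter case is ignored."""
    return a.casefold() == b.casefold()


for left, right in [("iPhone", "IPHONE"), ("iPhone", "iPhone")]:
    print(left, right, "bin:", equal_bin(left, right), "ci:", equal_ci(left, right))
```

In practice this means a lookup such as `WHERE goods_title = 'iphone'` would presumably match `iPhone` on a `_ci` column but not on a `_bin` column.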

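The following describes how to change the database settings so that emoji can be stored.

Modify the table and column definitions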

ALTER TABLE `goods_record`
MODIFY COLUMN `goods_title`  varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL DEFAULT '' COMMENT '商品标题' AFTER `goods_id`;
 
 
ALTER TABLE `goods_record`
DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8mb4_bin;

Modify the database configuration file
Locate /etc/my.cnf and add the following:

[mysqld]

character-set-server=utf8mb4

[mysql]

default-character-set=utf8mb4

Restart MySQL so that the settings take effect, then verify them with:

SHOW VARIABLES LIKE '%char%';

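Notes on the character set variables:

character_set_client, character_set_connection, and character_set_results describe the client connection.
character_set_system, character_set_server, and character_set_database are server-side settings.

On the server side, the character set that applies to stored data is resolved in this order of precedence:

column character set > table character set > character_set_database > character_set_server

character_set_system is not part of this fallback chain: it is the character set the server uses for metadata such as identifiers.

The column-level character set therefore has the highest priority on the server side.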
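
The precedence described above can be sketched as a simple "first explicit setting wins" lookup. This is an illustrative model only, not how the server is implemented. The function `effective_charset` and its parameters are hypothetical, and character_set_system is left out because it applies to metadata such as identifiers rather than to stored column data.

```python
from typing import Optional


def effective_charset(column: Optional[str], table: Optional[str],
                      database: Optional[str], server: str) -> str:
    """Return the first explicitly set character set, most specific level first."""
    for candidate in (column, table, database):
        if candidate is not None:
            return candidate
    # Nothing more specific was declared, so the server default applies.
    return server


# A column declared as utf8mb4 inside a utf8 table: the column setting wins.
print(effective_charset("utf8mb4", "utf8", "utf8", "utf8"))
# Nothing declared below the server level: the server default applies.
print(effective_charset(None, None, None, "utf8mb4"))
```

This is also why changing only character-set-server in my.cnf does not fix an existing utf8 column: the column's own declaration keeps taking precedence until it is altered.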