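Full-Text Indexes
A full-text index stores, for each word, how often it occurs and which records contain it. The more frequently a word occurs across records, the lower its weight, and an algorithm combines these weights into a relevance score. In MySQL, both MyISAM and InnoDB support full-text search, but note the following:
- InnoDB has provided full-text indexes only since version 5.6.4.
- Although the syntax is identical, MyISAM and InnoDB differ in implementation and algorithm, so their relevance scores are not comparable: do not compare a relevance value from an InnoDB table with one from a MyISAM table.
- MyISAM has 543 stopwords (words too common to be indexed), while InnoDB has 36. The MySQL documentation describes how to add or remove stopwords for each engine and how to tune the full-text search configuration parameters.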
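As a rough illustration of the stopword handling mentioned above, the following sketch shows how the InnoDB default list can be inspected and replaced. The database name `mydb`, the table name `my_stopwords`, and the word `ticket` are hypothetical. The index and table names are the ones used in the examples later in this post. Setting a global variable requires the appropriate privileges, and the full-text index has to be rebuilt before a new list takes effect.

```sql
-- Inspect the built-in InnoDB stopword list.
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;

-- A custom stopword table needs a single VARCHAR column named `value`.
-- `mydb` and `my_stopwords` are hypothetical names.
CREATE TABLE mydb.my_stopwords (value VARCHAR(30)) ENGINE = InnoDB;
INSERT INTO mydb.my_stopwords (value) VALUES ('ticket');

-- Point InnoDB at the custom list (format: db_name/table_name).
SET GLOBAL innodb_ft_server_stopword_table = 'mydb/my_stopwords';

-- Rebuild the full-text index so the new list is applied.
ALTER TABLE TicketComment DROP INDEX TicketComment_Search;
ALTER TABLE TicketComment ADD FULLTEXT INDEX TicketComment_Search (Body);
```

For MyISAM the list is controlled through the `ft_stopword_file` server option instead, which is set at startup rather than per session.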
SQL
See the official MySQL documentation for details.
Adding a FULLTEXT KEY to a table
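-- Full-text index on a single column
ALTER TABLE TicketComment ADD FULLTEXT INDEX TicketComment_Search (Body);
-- Multi-column full-text index: both columns are searched together, and a match in either column contributes to the relevance score
ALTER TABLE Ticket ADD FULLTEXT INDEX Ticket_Search (Subject, Body);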
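If the table does not exist yet, the index can also be declared when the table is created. The sketch below is only a guess at the `TicketComment` schema: the column names come from the query output in this post, but the column types, the primary key, and the default value are assumptions.

```sql
-- Hypothetical schema: column types are assumed, not taken from the original table.
CREATE TABLE TicketComment (
    CommentId   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    TicketId    INT UNSIGNED NOT NULL,
    UserId      INT UNSIGNED NOT NULL,
    Body        TEXT         NOT NULL,  -- FULLTEXT needs CHAR, VARCHAR, or TEXT
    DateCreated DATETIME(6)  NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
    PRIMARY KEY (CommentId),
    FULLTEXT INDEX TicketComment_Search (Body)
) ENGINE = InnoDB;
```

For large bulk loads it is usually faster to load the data first and add the index afterwards with `ALTER TABLE`.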
Search queries
Single-word search
mysql> SELECT * FROM `TicketComment` WHERE MATCH(`Body`) AGAINST('test');
+-----------+----------+--------+-------------------+----------------------------+
| CommentId | TicketId | UserId | Body              | DateCreated                |
+-----------+----------+--------+-------------------+----------------------------+
|         2 |        1 |      4 | Comment Two: test | 2018-03-07 15:40:22.631000 |
|        12 |        1 |      4 | Test              | 2018-03-07 15:50:35.068000 |
+-----------+----------+--------+-------------------+----------------------------+
2 rows in set (0.04 sec)
Viewing relevance scores
mysql> SELECT *, MATCH(`Body`) AGAINST('test') AS score FROM TicketComment;
+-----------+----------+--------+-------------------+----------------------------+--------------------+
| CommentId | TicketId | UserId | Body              | DateCreated                | score              |
+-----------+----------+--------+-------------------+----------------------------+--------------------+
|         1 |        1 |      4 | my comment: Hello | 2018-03-07 15:40:00.719000 |                  0 |
|         2 |        1 |      4 | Comment Two: test | 2018-03-07 15:40:22.631000 | 0.6055193543434143 |
|         3 |        1 |      4 | Comment Three : 3 | 2018-03-07 15:40:51.588000 |                  0 |
|         4 |        1 |      4 | Comment Four: 4   | 2018-03-07 15:41:00.622000 |                  0 |
|         5 |        1 |      4 | Comment Five: 5   | 2018-03-07 15:41:09.777000 |                  0 |
|         6 |        1 |      4 | Comment Six: 6    | 2018-03-07 15:41:16.899000 |                  0 |
|         7 |        1 |      4 | Comment Serven: 7 | 2018-03-07 15:41:28.665000 |                  0 |
|         8 |        1 |      4 | Comment 8         | 2018-03-07 15:41:37.733000 |                  0 |
|         9 |        1 |      4 | Comment 9         | 2018-03-07 15:41:43.515000 |                  0 |
|        10 |        1 |      4 | Comment 10        | 2018-03-07 15:41:51.349000 |                  0 |
|        11 |        1 |      4 | Comment 11        | 2018-03-07 15:42:01.263000 |                  0 |
|        12 |        1 |      4 | Test              | 2018-03-07 15:50:35.068000 | 0.6055193543434143 |
+-----------+----------+--------+-------------------+----------------------------+--------------------+
12 rows in set (0.01 sec)

SELECT *, MATCH(`Body`) AGAINST('test Hello') AS score FROM TicketComment;
-- The query below is roughly equivalent to the one above, but written this way we can add up the
-- relevance of different columns, or of different tables (using a join).
SELECT *, (MATCH(`Body`) AGAINST('test') + MATCH(`Body`) AGAINST('Hello')) AS score FROM TicketComment;
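Two common follow-ups to the scoring queries above are sorting by relevance and summing scores across tables. The sketch below assumes the `Ticket_Search` and `TicketComment_Search` indexes defined earlier and assumes that `TicketComment.TicketId` references `Ticket.TicketId`. The aliases `t` and `c` are illustrative.

```sql
-- Return only matching rows, best match first.
-- The MATCH expression appears twice (once to filter, once to score).
SELECT CommentId, Body,
       MATCH(`Body`) AGAINST('test Hello') AS score
FROM TicketComment
WHERE MATCH(`Body`) AGAINST('test Hello')
ORDER BY score DESC;

-- Add the relevance of a ticket to the relevance of each of its comments.
-- Assumes TicketComment.TicketId references Ticket.TicketId.
SELECT t.TicketId, c.CommentId,
       MATCH(t.`Subject`, t.`Body`) AGAINST('hello')
       + MATCH(c.`Body`) AGAINST('hello') AS score
FROM Ticket AS t
JOIN TicketComment AS c ON c.TicketId = t.TicketId
ORDER BY score DESC;
```

The column list passed to `MATCH()` has to match the column list of a FULLTEXT index exactly, which is why the ticket side uses both `Subject` and `Body`.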
Multi-word search
mysql> select * from TicketComment where match(`Body`) against('Five Six');
+-----------+----------+--------+-----------------+----------------------------+
| CommentId | TicketId | UserId | Body            | DateCreated                |
+-----------+----------+--------+-----------------+----------------------------+
|         5 |        1 |      4 | Comment Five: 5 | 2018-03-07 15:41:09.777000 |
|         6 |        1 |      4 | Comment Six: 6  | 2018-03-07 15:41:16.899000 |
+-----------+----------+--------+-----------------+----------------------------+
2 rows in set (0.00 sec)
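In testing, digits and the word "my" could not be found by a search, so they behave like stopwords. Note that tokens shorter than the minimum word length (innodb_ft_min_token_size, default 3, for InnoDB; ft_min_word_len, default 4, for MyISAM) are not indexed either, which has the same visible effect as being on the stopword list.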
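To tell whether a word is ignored because it is a stopword or because it is too short, the stopword list and the length limits can be checked directly. This is a minimal sketch for a server using the default InnoDB stopword list; the word `my` is simply the example from the test above.

```sql
-- Is the word on the default InnoDB stopword list?
-- An empty result means it is not.
SELECT value
FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD
WHERE value = 'my';

-- Minimum indexed token length for InnoDB full-text indexes.
SHOW VARIABLES LIKE 'innodb_ft_min_token_size';

-- Minimum indexed word length for MyISAM full-text indexes.
SHOW VARIABLES LIKE 'ft_min_word_len';
```

Both length settings are read at server startup, and existing full-text indexes must be rebuilt after changing them.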
Multi-column index
mysql> select * from Ticket where Match(`subject`,`Body`) against('hello');
+----------+--------+---------+-----------------------------------+----------------------------+
| TicketId | UserId | Subject | Body                              | DateCreated                |
+----------+--------+---------+-----------------------------------+----------------------------+
|        1 |      3 | hello   | This is the frist ticket created! | 2018-01-15 16:09:13.016000 |
+----------+--------+---------+-----------------------------------+----------------------------+
1 row in set (0.01 sec)
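Using boolean mode
Inside AGAINST we can specify boolean mode, which allows logical combinations such as "must contain", "must not contain", prefix matching, and AND/OR relationships between words. The available operators can be looked up in ft_boolean_syntax, where ft is short for fulltext.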
mysql> SHOW VARIABLES LIKE 'ft%';
+--------------------------+----------------+
| Variable_name            | Value          |
+--------------------------+----------------+
| ft_boolean_syntax        | + -><()~*:""&| |
| ft_max_word_len          | 84             |
| ft_min_word_len          | 4              |
| ft_query_expansion_limit | 20             |
| ft_stopword_file         | (built-in)     |
+--------------------------+----------------+
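- + The word must be present. For example, +apple matches only rows that contain apple. To also match words that merely start with apple, such as apple123, combine it with the wildcard: +apple*.
- (no operator) The word is optional, i.e. an OR. For example, apple banana matches rows that contain apple or banana; rows that contain the words rank higher.
- - The word must not be present. For example, +apple -banana matches rows that contain apple but not banana.
- > Increases the word's contribution to relevance, so rows containing it are preferred.
- < Decreases the word's contribution to relevance.
- ( ) Parentheses group words into sub-expressions. For example, +aaa +(>bbb <ccc).
- ~ Turns the word's contribution to relevance from positive to negative: rows containing the word rank lower, but unlike with "-" they are not excluded, only sorted further down.
- * Wildcard. It can only be appended to the end of a word.
- " " Phrase match. A phrase enclosed in double quotes must match as a whole and is not split into separate words.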
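The operators listed above can be tried against the `TicketComment` table from the earlier examples. The queries below are a sketch only: the search words are picked from the sample rows shown in this post, and the results depend on the storage engine, its minimum word length, and its stopword list, so no output is shown.

```sql
-- "+" and "-": must contain "comment", must not contain "five".
SELECT * FROM TicketComment
WHERE MATCH(`Body`) AGAINST('+comment -five' IN BOOLEAN MODE);

-- "*": prefix match, any word starting with "comm".
SELECT * FROM TicketComment
WHERE MATCH(`Body`) AGAINST('comm*' IN BOOLEAN MODE);

-- Double quotes: the words must appear together as a phrase.
SELECT * FROM TicketComment
WHERE MATCH(`Body`) AGAINST('"Comment Two"' IN BOOLEAN MODE);

-- ">" , "<" and "( )": require "comment" plus either "five" or "six",
-- ranking rows with "five" above rows with "six".
SELECT *, MATCH(`Body`) AGAINST('+comment +(>five <six)' IN BOOLEAN MODE) AS score
FROM TicketComment
WHERE MATCH(`Body`) AGAINST('+comment +(>five <six)' IN BOOLEAN MODE)
ORDER BY score DESC;

-- "~": rows that also contain "test" are kept but ranked lower.
SELECT *, MATCH(`Body`) AGAINST('comment ~test' IN BOOLEAN MODE) AS score
FROM TicketComment
WHERE MATCH(`Body`) AGAINST('comment ~test' IN BOOLEAN MODE)
ORDER BY score DESC;
```

Boolean mode does not sort results by relevance automatically, which is why the last two queries order by the score explicitly.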
Usage example:
select * from TicketComment where match(`Body`) against('Test -two' in boolean mode);
[Reference] https://blog.csdn.net/u011734144/article/details/52817766