MySQL中的全文索引（InnoDB存储引擎）

原文地址：https://www.jianshu.com/p/645402711dac

全文索引介绍

5.6版本之后InnoDB存储引擎开始支持全文索引，从MySQL 5.7.6开始，MySQL内置了ngram全文解析器，用来支持中文、日文、韩文分词。之前仅支持英文，因为是通过空格作为分词的分隔符，对于中文来说是不合适的。
MySQL允许在char、varchar、text类型上建立全文索引。
本文使用的MySQL 版本是5.7.24，InnoDB数据库引擎。

全文索引使用

MySQL支持三种模式的全文检索模式：

1.自然语言模式（IN NATURAL LANGUAGE MODE）：通过MATCH AGAINST 传递某个特定的字符串进行检索

2.布尔模式（IN BOOLEAN MODE）：可以为检索的字符串增加操作符：
+ 表示必须包含
- 表示必须排除
> 表示出现该单词时增加相关性
< 表示出现该单词时降低相关性
* 表示通配符
~ 允许出现该单词，但是出现时相关性为负
"" 表示短语

布尔操作符可以通过sql语句查看：

mysql> show variables like '%ft_boolean_syntax%';
+-------------------+----------------+
| Variable_name     | Value          |
+-------------------+----------------+
| ft_boolean_syntax | + -><()~*:""&| |
+-------------------+----------------+

3.查询扩展模式（WITH QUERY EXPANSION）：
应用场景：查询的关键字太短，用户需要implied knowledge(隐含知识)时进行。比如，对于单词database的查询，用户可能希望查询的不仅仅是包含database的文档，还指那些包含mysql、oracle、db2的单词。
该查询会执行两次检索，第一次使用给定的短语进行检索，第二次结合第一次相关性比较高的进行检索

建立数据库（关键词“with parser ngram”指示了全文索引使用ngram插件）：

mysql> CREATE TABLE articles (
           id int AUTO_INCREMENT PRIMARY KEY,
           title varchar(200),
           body text,
           FULLTEXT KEY title (title,body) WITH PARSER ngram
       ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

查看数据库中的数据：

mysql> select * from articles;
+----+----------------+--------+
| id | title          | body   |
+----+----------------+--------+
|  1 | 数据库管理     | 专业课 |
|  2 | 数据库         | 专业课 |
|  3 | 计算机操作系统 | 专业课 |
|  4 | MySQL          | 专业课 |
|  5 | MySQL数据库    | 专业课 |
+----+----------------+--------+

使用1、2、3三种模式查询数据库
语法：

select * from table 
where match(fulltext_field) 
against('find_key' in natural language mode
  |in boolean mode
  |with query expansion
  |in natural language mode with query expansion)

match：指定需要被查询的列
against：指定使用何种方法进行查询
模式一：自然语言模式

mysql> select * from articles where match(title, body) against ('MySQL数据库' in natural language mode);
+----+-------------+--------+
| id | title       | body   |
+----+-------------+--------+
|  5 | MySQL数据库 | 专业课 |
|  4 | MySQL       | 专业课 |
|  1 | 数据库管理  | 专业课 |
|  2 | 数据库      | 专业课 |
+----+-------------+--------+

只要存在find_key中的字，记录就会被查询到。
默认采用natural language模式，因此上述sql语句等同于：
select * from articles where match(title, body) against ('MySQL数据库');

模式二：布尔模式
示例-1：查询既有“数据”，又有“管理”的记录

mysql> select * from articles where match(title, body) against ('+数据 +管理' in boolean mode);
+----+------------+--------+
| id | title      | body   |
+----+------------+--------+
|  1 | 数据库管理 | 专业课 |
+----+------------+--------+
1 row in set (0.00 sec)

示例-2：查询有“数据”，没有“管理”的记录

mysql> select * from articles where match(title, body) against ('+数据 -管理' in boolean mode);
+----+-------------+--------+
| id | title       | body   |
+----+-------------+--------+
|  2 | 数据库      | 专业课 |
|  5 | MySQL数据库 | 专业课 |
+----+-------------+--------+
2 rows in set (0.00 sec)

示例-3：查询有“MySQL”，可以有“数据库”或没有“数据库”的记录

mysql> select * from articles where match(title, body) against ('>数据库 +MySQL' in boolean mode);
+----+-------------+--------+
| id | title       | body   |
+----+-------------+--------+
|  5 | MySQL数据库 | 专业课 |
|  4 | MySQL       | 专业课 |
+----+-------------+--------+
2 rows in set (0.00 sec)

模式三：查询扩展模式
示例-4
为了验证模式三，首先在数据库中插入以下数据：

mysql> insert into articles(title,body) values
       ('MySQL Tutorial','DBMS stands for DataBase...'),
       ('How To Use MySQL Well','After you went through a ...'),
       ('Optimizing MySQL','In this tutorial we will show...'),
       ('1001 MySQL Tricks','1.Never run mysqld as root. 2 . ...'),
       ('MySQL vs. YourSQL','In the following database comparision...'),
       ('Tuning DB2','For IBM database ...'),
       ('IBM History','DB2 history for IBM ...');

假设使用自然模式

mysql> select * from articles where match(title,body) against('database' in natural language mode);
+----+-----------------------+------------------------------------------+
| id | title                 | body                                     |
+----+-----------------------+------------------------------------------+
| 11 | MySQL Tutorial        | DBMS stands for DataBase...              |
| 12 | How To Use MySQL Well | After you went through a ...             |
| 15 | MySQL vs. YourSQL     | In the following database comparision... |
| 16 | Tuning DB2            | For IBM database ...                     |
+----+-----------------------+------------------------------------------+

只有四条查询结果，使用查询扩展模式

mysql> select * from articles where match(title,body) against('database' with query expansion);
+----+-----------------------+------------------------------------------+
| id | title                 | body                                     |
+----+-----------------------+------------------------------------------+
| 12 | How To Use MySQL Well | After you went through a ...             |
| 15 | MySQL vs. YourSQL     | In the following database comparision... |
| 11 | MySQL Tutorial        | DBMS stands for DataBase...              |
|  8 | hello world           | This is my first java project            |
| 13 | Optimizing MySQL      | In this tutorial we will show...         |
| 16 | Tuning DB2            | For IBM database ...                     |
| 17 | IBM History           | DB2 history for IBM ...                  |
| 14 | 1001 MySQL Tricks     | 1.Never run mysqld as root. 2 . ...      |
|  9 | article_1             | Some like it hot                         |
|  4 | MySQL                 | 专业课                                   |
|  5 | MySQL数据库           | 专业课                                   |
|  6 | MySQL数据库           | 认真学习                                 |
| 10 | article_2             | I like hot                               |
+----+-----------------------+------------------------------------------+

查询出了13条语句。原先查询出的语句中即自然查询中包含MySQL、DB2这些字，所以进行扩展查询第二步时会将包含这些关键字的记录也查询出来。
慎用查询扩展模式！因为可能会带来很多非相关性的查询

查询返回结果是根据相关性进行降序排序的，相关性最高的结果放在第一位。相关性的计算依据四个条件
1.find key在文档中是否存在
2.find key在文档中出现的次数
3.find key在索引列的数量
4.多少个文档包含该find key

以示例-3为例查询相关性，因为后面又插了几条数据，结果和上述查询结果不一致

mysql> select id,title,body, match(title, body) against ('MySQL数据库' in natural language mode) as relevance from articles;
+----+----------------+-------------------------------+--------------------+
| id | title          | body                          | relevance          |
+----+----------------+-------------------------------+--------------------+
|  1 | 数据库管理     | 专业课                        | 0.1812381148338318 |
|  2 | 数据库         | 专业课                        | 0.1812381148338318 |
|  3 | 计算机操作系统 | 专业课                        |                  0 |
|  4 | MySQL          | 专业课                        | 0.6349670886993408 |
|  5 | MySQL数据库    | 专业课                        |  1.178681492805481 |
|  6 | MySQL数据库    | 认真学习                      |  1.178681492805481 |
|  7 | 作文           | 好好写                        |                  0 |
|  8 | hello world    | This is my first java project | 0.0906190574169159 |
+----+----------------+-------------------------------+--------------------+

id=8的记录相关性为0.09因为有个m吧

底层实现原理

只有了解了底层原理才可以更好的分析结果。全文索引的底层实现为倒排索引。

全文索引建立过程

为什么叫倒排索引

倒排索引被称为反向索引更为合适
当表上存在全文索引时，就会隐式的建立一个名为FTS_DOC_ID的列，并在其上创建一个唯一索引，用于标识分词出现的记录行。你也可以显式的创建一个名为FTS_DOC_ID的列，但需要和隐式创建的列类型保持一致，否则创建的时候将会报错，并且不能通过FTS_DOC_ID来查找列：

mysql> select * from articles where FTS_DOC_ID = 1;
ERROR 1054 (42S22): Unknown column 'FTS_DOC_ID' in 'where clause'

执行报错
所以建立的articles表中列为
FTS_DOC_ID、id、title、body
常规的索引是文档到关键词的映射：文档——>关键词
倒排索引是关键词到文档的映射：关键词——>文档
全文索引通过关键字找到关键字所在文档，可以提高查询效率

倒排索引结构

Number	Text	Documents
1	code	(1:6),(4:8)
2	days	(3:2),(6:2)
3	hot	(1:3),(4:4)

是word + ilist的存储结构
Text对应于word，是一个分词。Document存储的是键值对，键为FTS_DOC_ID，值为在文档中的位置，对应于ilist。其中word保存在
Auxiliary Table中，总共有六张，每张表根据word的Latin编码进行分区，下面有介绍

FTS Index Cache(全文检索索引缓存)

在事务提交的时候将分词写入到FTS Index Cache中
批量更新到Auxiliary Table，为了提高性能不会插入一条数据立刻更新到Auxiliary Table。进行批量更新的几种情况：
1. 全文检索索引缓存已满，默认大小为32M，可以通过修改innodb_ft_cache_size来改变FTS Index Cache的大小
2. 关闭数据库的时候，将FTS Index Cache中的数据库会同步到磁盘上的Auxiliary Table中
3. 当对全文检索进行查询时，首先会将在FTS Index Cache中对应的字段合并到Auxiliary Table中，然后在进行查询
4. 当数据库突然宕机时，可能会导致一些FTS Index Cache中的数据未同步到Auxiliary Table上。数据库重启时，当用户对表进行全文检索时，InnoDB存储引擎会自动读取未完成的文档，然后进行分词操作，在将分词的结果放入到FTS Index Cache中。innodb_ft_cache_size的大小会影响恢复的时间
FTS Index Cache为红黑树结构，会根据(word,ilist)进行排序插入

Auxiliary Table(辅助表)

Auxiliary Table存储在磁盘中，进入保存mysql数据的目录下

tianthe@L-SHC-15008567 MINGW64 /c/programdata/mysql/MySQL Server 5.7/data/study
$ ls -lh
total 1.6M
-rw-r--r-- 1 tianthe 1049089 8.5K Apr 17 16:37 articles.frm
-rw-r--r-- 1 tianthe 1049089 112K Apr 17 17:41 articles.ibd
-rw-r--r-- 1 tianthe 1049089   65 Apr 17 15:24 db.opt
-rw-r--r-- 1 tianthe 1049089  96K Apr 17 16:37 FTS_000000000000005e_0000000000000087_INDEX_1.ibd
-rw-r--r-- 1 tianthe 1049089  96K Apr 17 16:37 FTS_000000000000005e_0000000000000087_INDEX_2.ibd
-rw-r--r-- 1 tianthe 1049089  96K Apr 17 16:37 FTS_000000000000005e_0000000000000087_INDEX_3.ibd
-rw-r--r-- 1 tianthe 1049089  96K Apr 17 16:37 FTS_000000000000005e_0000000000000087_INDEX_4.ibd
-rw-r--r-- 1 tianthe 1049089  96K Apr 17 16:37 FTS_000000000000005e_0000000000000087_INDEX_5.ibd
-rw-r--r-- 1 tianthe 1049089  96K Apr 17 16:37 FTS_000000000000005e_0000000000000087_INDEX_6.ibd
-rw-r--r-- 1 tianthe 1049089  96K Apr 17 16:37 FTS_000000000000005e_BEING_DELETED.ibd
-rw-r--r-- 1 tianthe 1049089  96K Apr 17 16:37 FTS_000000000000005e_BEING_DELETED_CACHE.ibd
-rw-r--r-- 1 tianthe 1049089  96K Apr 17 16:37 FTS_000000000000005e_CONFIG.ibd
-rw-r--r-- 1 tianthe 1049089  96K Apr 17 16:37 FTS_000000000000005e_DELETED.ibd
-rw-r--r-- 1 tianthe 1049089  96K Apr 17 16:37 FTS_000000000000005e_DELETED_CACHE.ibd

看到有FTS_000000000000005e_0000000000000087_INDEX_0~6.ibd，其对应的就是六张Auxiliary Table
其余文件介绍：

FTS_000000000000005e_DELETED.ibd
FTS_000000000000005e_DELETED_CACHE.ibd

记录的是从Auxiliary Table中删除的FTS_DOC_ID，后者是前者的内存缓存

FTS_000000000000005e_BEING_DELETED.ibd 
FTS_000000000000005e_BEING_DELETED_CACHE.ibd

记录的是已经被删除索引记录并真正从FTS Index Cache删除的FTS_DOC_ID(即删除FTS Index Cache并做了OPTIMIZE TABLE)，后者是前者的内存缓存。这两个表主要用于辅助进行OPTIMIZE TABLE时将DELETED/DELETED_CACHED表中的记录转储到其中

FTS_000000000000005e_CONFIG.ibd

包含全文索引的内部信息，最重要的存储是FTS_SYNCED_DOC_ID，表示已经解析并刷到磁盘的FTS_DOC_ID，在系统宕机时，可以根据这个值判断哪些该重新分词并加入到FTS Index Cache中

DML操作

插入操作
插入操作较为简单，当往表中插入记录时，提交事务时会对全文索引上的列进行分词存储到FTS Index Cache，最后在批量更新到Auxiliary Table中
删除操作
当提交删除数据的事务以后，不会删除Auxiliary Table中的数据，而只会删除FTS Index Cache中的数据。对于Auxiliary Table中被删除的记录，InnoDB存储引擎会记录其FTS Document Id，并将其保存在DELETED Auxiliary Table中。可以通过OPTIMIZE TABLE手动删除索引中的记录。
更新操作
查找操作
分为两步。第一步：根据检索词搜集符合条件的FTS_DOC_ID，在搜集满足条件的FTS_DOC_ID首先读取delete表中记录的FTS_DOC_ID，这些FTS_DOC_ID随后被用做过滤
第二步：根据FTS_DOC_ID找到对应的记录，找到的记录是根据相关性大小降序返回的

查看插入记录的分词

此处引用官方手册例子：

mysql> USE test;

mysql> CREATE TABLE articles (
         id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
         title VARCHAR(200),
         body TEXT,
         FULLTEXT (title,body)
       ) ENGINE=InnoDB;

mysql> INSERT INTO articles (title,body) VALUES
       ('MySQL Tutorial','DBMS stands for DataBase ...'),
       ('How To Use MySQL Well','After you went through a ...'),
       ('Optimizing MySQL','In this tutorial we will show ...'),
       ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
       ('MySQL vs. YourSQL','In the following database comparison ...'),
       ('MySQL Security','When configured properly, MySQL ...');

mysql> SET GLOBAL innodb_optimize_fulltext_only=ON;

mysql> OPTIMIZE TABLE articles;
+---------------+----------+----------+----------+
| Table         | Op       | Msg_type | Msg_text |
+---------------+----------+----------+----------+
| test.articles | optimize | status   | OK       |
+---------------+----------+----------+----------+

mysql> SET GLOBAL innodb_ft_aux_table = 'test/articles';

mysql> SELECT WORD, DOC_COUNT, DOC_ID, POSITION
       FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE LIMIT 5;
+------------+-----------+--------+----------+
| WORD       | DOC_COUNT | DOC_ID | POSITION |
+------------+-----------+--------+----------+
| 1001       |         1 |      4 |        0 |
| after      |         1 |      2 |       22 |
| comparison |         1 |      5 |       44 |
| configured |         1 |      6 |       20 |
| database   |         2 |      1 |       31 |
+------------+-----------+--------+----------+

mysql> SET GLOBAL innodb_optimize_fulltext_only=OFF;