QUERY EXPANSION algorithm when phrases are too large. For the sake of simplicity, in this article we won't go through the FTS parsers. That is possible material for a future post.
SELECT clause you'll see both FTS expressions using NATURAL LANGUAGE with query expansion and BOOLEAN modes respectively. https://gist.github.com/3manuek/9ffc83e2db4fcef85e57dd3dadbe7b6e The noise generated by the query expansion is expected and described in the official documentation here. The interesting case is the following row, which has 2 exact occurrences (you can see the positions 1 and 63) and it is not the highest rank using query extension. Remember, this is expected. Text: "country districts. As Lucca had five gates, he divided his own country" bookid: 1232 pos: 1,63 QERank: 80 BoolRank: 14 This is even worse when using large sentences. In the example bellow you will see the same query, against the table storing by paragraph. The boolean rank shows some of the entries way above others, however the query extension locates at the top records that not necessarily has a lot of exact matches. https://gist.github.com/3manuek/d5e5f3257696fb9a9d2a6b6a12e261b5 The query expansion is useful when you intend to search which entries contain more words that appear frequently within the search term. Having large text fields increase the probability to have more words that appear among the search term. In the case of bookContent table (by paragraph table), the average field size is 443.1163 characters.
set global innodb_ft_aux_table = 'ftslab/bookContent'; statement, which loads the index content to memory for an easy querying. If you use RDS, the option innodb_ft_aux_table is not available as it is GLOBAL and require SUPER privileges. i.e. You can easily get the most frequent tokens: https://gist.github.com/3manuek/692a628fc70228194396f33d6d76d4d9 We can query the index contents with a simple SQL statement like the following: https://gist.github.com/3manuek/cb815326a56bd2d29a0db23e0423455a
In the example shown before the is no intention to compare ranks score as they are based in different algorithms. The idea there is to show that QUERY EXPANSION can have non desire results in some cases due to its mechanism.
stopwords in your own stopwords table. Specially if you are more interested in boolean searches, loosing some part of the language expressions. We can build a custom stopwords table based on our current data: https://gist.github.com/3manuek/30b1f8483857018cc82796dd7dda44ae Let's build our stopwords table using both default and new entries and keeping the alphabetical order: https://gist.github.com/3manuek/2146012a17c76acfa0fcbfcc229fa1c7 The idea behind choosing our own stopwords is to measure how much index do we safe filtering those words that are extremely frequent and don't add a necessary meaning to the search. This topic could be covered in a separate blog post.
match(content,title) against ("first (third)") > 1 . Generally, scores lower than 1 can be ignored when using boolean or natural mode searches. - `OPTIMIZE TABLE` does a rebuild of the table. To avoid this, set innodb_optimize_fulltext_only=1 in order to do an incremental maintance on the table. - Recall that NATURAL LANGUAGE MODE does not take the operands as the BOOLEAN MODE. This affects the ranking score (try +bad (thing) i.e.) - If you plan to order by rank, it is not necessary to specify the clause `ORDER BY` as InnoDB does the order after retrieve the doc ids . Also,the behavior is different from the default as it returns the heaviest at the top (like an ORDER BY rank DESC). - If you come from MyISAM's FTS implementation, recall that the ranking scoring is different. - Create the FULLTEXT index after the data is loaded InnoDB Bulk Load. When restoring FTS backups, you will probably hit the "ERROR 182 (HY000) at line nn: Invalid InnoDB FTS Doc ID". - Try to avoid using use more than one FTS expression in the where clause. Keep in mind that this affects the order in the results and it consumes a considerably amount of CPU. InnoDB orders by the latest expression in the WHERE clause. WL#7123. - Also, if avoiding the rank information in the projection (SELECT clause) and using other aggregations like count(*), will use the "no ranking" FT_hints. The LIMIT hint won't be used if invoked explicitly an ORDER BY and the MATCH clause in the projection. https://gist.github.com/3manuek/2fd59b9a3a187784e02f82da520d775a - If you plan to use FTS_DOC_ID column with AUTO_INCREMENT option, have in mind that there is a limitation regarding this. You must declare a single column PRIMARY KEY constraint or as an UNIQUE index. Also, the data type is stricted as `bigint unsigned`. i.e: https://gist.github.com/3manuek/df441753ea7666c3df7d13babc566c52
emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < full.dump ERROR 182 (HY000) at line 73: Invalid InnoDB FTS Doc ID emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < ddl.dump emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < onlyData.dump emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < full.dump ERROR 182 (HY000) at line 73: Invalid InnoDB FTS Doc ID mysqldump is not very clever if you use `FTS_DOC_ID`: 2016-02-13T22:11:53.125300Z 19 [ERROR] InnoDB: Doc ID 10002 is too big. Its difference with largest used Doc ID 1 cannot exceed or equal to 10000 It takes dumps without considering the restriction coded in `innobase/row/row0mysql.cc`: Difference between Doc IDs are restricted within 4 bytes integer. See fts_get_encoded_len() The fix to this is backuping the table by chunks of 10000 documents.
Fine tuning Maintenance: innodb_optimize_fulltext_only Writing FTS parser plugins
Ready to optimize your MySQL Database for the future?