Full text search

Since Totara 12 it has been possible to add full text search indexes into database schema, and to execute full text searches against them.

Working with full text search solutions

When is it appropriate to use full text searches

Full text searches are possible, but they should not be abused.

Building full text searches is expensive, and maintaining them extra so.
Any use of a full text search index should be carefully considered.
Just because it is possible does not mean that it should be done.

If you are searching content that is updated infrequently then full text searching may be the best path for you.
If the content you are searching frequently changes then you should reconsider what you are doing and look at what the best possible solution for you would be.

If you do decide to work with full text searches then this document should provide you with all the information you need to get started.

Understanding language support

Unlike other existing DDL features full text search is language aware.
The language is specified during the creation of the index and is used by your database in the production of the index.
It is used for the likes of breaking words, ignoring noise words (a, it, and etc) and for other linguistic features provided by the full text search index of your database.

Totara full text search supports one language only.
It should be specified in your config.php file and we recommend setting it to a language compatible with your $CFG→lang.
The actual value is database independent.

To set your language add the following to your config.php

$CFG->dboptions['ftslanguage'] = 'your language';

The actual value is database dependent.

PostgreSQL

PostgreSQL is using 'english' configuration for full text search by default, for list of available options see result of "SELECT cfgname FROM pg_ts_config;".
For example:

$CFG->dboptions['ftslanguage'] = 'english';
$CFG->dboptions['ftslanguage'] = 'simple';
$CFG->dboptions['ftslanguage'] = 'german';

NOTE: PostgreSQL does not support Japanese and other languages with very short words without spaces in between, enable the following setting to get a basic experimental support of these languages.
If the value changes then you need to run: admin/cli/fts_repopulate_tables.php

$CFG->dboptions['fts3bworkaround'] = true;

MySQL

MySQL is using case and accent insensitive collation for full text search by default, you can specify a different collation here, for example:

$CFG->dboptions['ftslanguage'] = 'utf8_unicode_ci';
$CFG->dboptions['ftslanguage'] = 'utf8mb4_0900_as_ci';
$CFG->dboptions['ftslanguage'] = 'utf8mb4_de_pb_0900_ai_ci';

MySQL does not support Japanese and other languages with very short words without spaces in between, enable the following setting to get a basic experimental support of these languages.

If the value changes then you need to run: admin/cli/fts_repopulate_tables.php

$CFG->dboptions['fts3bworkaround'] = true;

MSSQL

MS SQL Server is using 'English' language by default, list of options in the MSSQL full text languages documentation.

$CFG->dboptions['ftslanguage'] = 'English';
$CFG->dboptions['ftslanguage'] = 'German';
$CFG->dboptions['ftslanguage'] = 'Japanese';
$CFG->dboptions['ftslanguage'] = 1028; // Traditional Chinese
$CFG->dboptions['ftslanguage'] = 2052; // Simplified Chinese

Developing full text search solutions

Defining a full text search solution

Full text search indexes must be defined within the INSTALL.xml file for your plugin/component.
We recommend adding them through the XMLDB editor that can be found within the product to ensure that the definitions are perfect.
The XMLDB editor can also give you the PHP code required to add the search index during upgrade.

The following is the checklist for adding a full text search index:

  1. Design a new table that has links to data in another table. 
    1. Should have an id field that links to the field in another table
    2. Should have one or more nullable text fields to which full text search indexes can be added.
    3. Should have a time modified which can be used to compare against a time modified field on the other table.
  2. Add a time modified field to the other table if it doesn't have one, and update code to populate it.
  3. Define your new table in db/INSTALL.xml
  4. Upgrade step added to db/upgrade.php to create your new table
  5. A cron job that:
    1. Creates records for new entires in the other table that doesn't have a record in your table yet.
    2. Deletes any records from your table where the reference record in the other table has been deleted.
    3. Compares time modified on all records against the other tables time modified field and updates the records in your table if required.
  6. Event observers that catch changes to the other table and can update your table as/when required.
    1. When creating add a new record.
    2. When editing update the record.
    3. When deleting remove the record.

The following example shows what the end outcome in INSTALL.xml will look like:

Example 1
<TABLE NAME="searchables" COMMENT="Searchable learning items">
  <FIELDS>
    <FIELD NAME="id" TYPE="int" LENGTH="10" NOTNULL="true" SEQUENCE="true"/>
    <FIELD NAME="component" TYPE="char" LENGTH="100" NOTNULL="true" SEQUENCE="false" COMMENT="The component this record belongs to"/>
    <FIELD NAME="type" TYPE="char" LENGTH="100" NOTNULL="true" SEQUENCE="false" COMMENT="The type of this record within the component"/>
    <FIELD NAME="instanceid" TYPE="int" LENGTH="10" NOTNULL="true" SEQUENCE="false" COMMENT="The identifier for this record in the component table"/>
    <FIELD NAME="timemodified" TYPE="int" LENGTH="10" NOTNULL="true" SEQUENCE="false" COMMENT="When this record was last modified"/>
    <FIELD NAME="high" TYPE="text" NOTNULL="false" SEQUENCE="false" COMMENT="Searchable text that is high value"/>
    <FIELD NAME="medium" TYPE="text" NOTNULL="false" SEQUENCE="false" COMMENT="Searchable text that is of medium value"/>
    <FIELD NAME="low" TYPE="text" NOTNULL="false" SEQUENCE="false" COMMENT="Searchable text that is low value"/>
  </FIELDS>
  <KEYS>
    <KEY NAME="primary-id" TYPE="primary" FIELDS="id"/>
  </KEYS>
  <INDEXES>
    <INDEX NAME="component-type-id" UNIQUE="true" FIELDS="component, type, instanceid"/>
    <INDEX NAME="high" UNIQUE="false" FIELDS="high" HINTS="full_text_search"/>
    <INDEX NAME="medium" UNIQUE="false" FIELDS="medium" HINTS="full_text_search"/>
    <INDEX NAME="low" UNIQUE="false" FIELDS="low" HINTS="full_text_search"/>
  </INDEXES>
</TABLE>

Populating the search table

During installation or upgrade your new table will be created, as will any full text search indexes on it.

Don't worry about updating it during upgrade.
Instead write a cron job as described above that will populate the table and keep it up to date.
The first time cron runs after upgrade it will populate your index table by creating records for all records in the other table.

Additionally we recommend you write event observers that listen for events that lead to changes in the other table and respond by updating the records in your new table when required.
This way you will have a system that is guaranteed to be up to date by cron, but should also be up-to-date with the live environment thanks to events.

In both cases, the cron routine, and the event observers it is important that all data is formatted specifically for searching prior to inserting it into your new table.
This can be done by calling the $DB→unformat_fts_content() method on each piece of textual content that will be inserted into a field with a full text search index on it.
By always using this method you can ensure that your content is consistently and correctly formatted in the best possible way.

Running a full text search

Once an index exists running a full text search can be easily run by using preparing a query that includes the full text search as a subquery.
This is facilitated by our DML layer in the following fashion:

Example 2
list($searchsql, $params) = $DB->get_fts_subquery('searchables', ['high' => 3, 'medium' => 2, 'low' => 1], $searchterm);
$sql = "SELECT id, component, type, instanceid
          FROM {searchables} s
          JOIN {$searchsql} fts ON fts.id = s.id
	  ORDER BY fts.score DESC";
$results = $DB->get_records($sql, $params);

Read on for more information on the get_fts_subquery.

The DML unformat_fts_content() method

This method should be used to ensure that all content written into a field  on which a full text search index exists is correctly and consistently formatted for searching.

It takes two arguments:

  • Content: raw text that will be inserted into the index
  • Format: is one of FORMAT_PLAIN, FORMAT_MOODLE, FORMAT_HTML, FORMAT_MARKDOWN.

It returns the now prepared content.

The DML get_fts_subquery() method

This method is the cornerstone of the of the full text search DML functionality.
It produces SQL for a subquery that executes the search and can be easily included in the join in order to filter results using a full text search.

It expected three arguments:

  1. The table to search on.
    This table must contain one or more full text indexes being targeted by this search.
  2. An array of search fields and a weight for each field.
    A key => value array, where the key is the name of a field on which a full text search index exists, and the value is the weight to give this field within the results.
  3. The search text.

It returns an array containing two elements.

  1. The SQL subquery snippet as a string.
  2. An array of named parameters for the subquery that will need to be provided when executing it.

It has the following nuances:

  • It only executes the search in the one set language for the site.
  • The search table MUST contain a field called id.
  • At least one search field must be provided.
  • The weight must be a positive number.
  • If an empty search term is provided a debugging notice will be triggered and an empty result set will be returned.
  • The SQL snippet provided exposes two fields, id and score.
    • id: this is the id of the record in the table being searched.
    • score: the score is a float used to rank the results.
      It's actual value will depend upon how well the record matches the search term.
      The range will be database dependent and should not be relied upon.
  • The SQL snippet provided does not order the results. You need to do this yourself when making the search.

Recommendations, notes and FAQs

The following are things that we think will help you when designing and implementing your full text search solution.

Don't add full text search indexes to existing fields

This may seem like a good idea. It is not.

The data in the field on which the index is applied needs to be properly formatted for the purpose of searching.
It should be passed through the unformat_fts_content() method to ensure it is appropriately formatted.
This makes the purpose of the field specifically searching.
Programming best practices regularly push separation of concerns, and this is the same.
If data exists in a raw format in one field, and in a searchable format in another than you are unlikely to use the searchable content in other contexts.

It also ensures that you never get into a situation where when adding a full text search index that you have to update every single record in a table by running it through unformat_fts_content() during upgrade.

Don't add full text search indexes to existing tables

We recommend creating a new table when designing full text solutions.
On occasion you may be tempted to add a new field to an existing table, and to add a full text search to that new field.
This avoids the trouble noted above in why you shouldn't add full text searches to existing fields.
However we recommend you add a new table still for a number of reasons:

  1. If the new table contains only searchable records and references the original table then you can essentially truncate it and repopulate it when building the index. The system can essentially treat it like a temp table.
  2. Having fields in an existing table adds complications to observers, particularly if the event is triggered within a transaction, and may lead you to having to abandon live updates via events. These cause lead to hard to replicate regressions.
  3. Requires additional consideration in how the site upgrades to ensure that everything ends up consistent in all upgrade paths.
  4. Promotes good separation of concerns.