Trigram Wildcard String Search in SQL Server

Searching string data for an arbitrary substring match can be an expensive operation in SQL Server. Queries of the form Column LIKE '%match%' cannot use the seeking abilities of a b-tree index, so the query processor has to apply the predicate to each row individually. In addition, each test must correctly apply the full set of complicated collation rules. Combining all these factors together, it is no surprise that these types of searches can be resource-intensive and slow.

Full-Text Search is a powerful tool for linguistic matching, and the newer Statistical Semantic Search is great for finding documents with similar meanings. But sometimes, you really just need to find strings that contain a particular substring – a substring that might not even be a word, in any language.

If the searched data is not large, or response time requirements are not critical, using LIKE '%match%' could well be a suitable solution. But, on the odd occasion where the need for a super-fast search beats all other considerations (including storage space), you might consider a custom solution using n-grams. The specific variation explored in this article is a three character trigram.

Wildcard searching using trigrams

The basic idea of a trigram search is quite simple:

Persist three-character substrings (trigrams) of the target data.
Split the search term(s) into trigrams.
Match search trigrams against the stored trigrams (equality search)
Intersect the qualified rows to find strings that match all trigrams
Apply the original search filter to the much-reduced intersection

We will work through an example to see exactly how this all works, and what the trade-offs are.

Sample table and data

The script below creates an example table and populates it with a million rows of string data. Each string is 20 characters long, with the first 10 characters being numeric. The remaining 10 characters are a mixture of numbers and letters from A to F, generated using NEWID(). There is nothing terribly special about this sample data; the trigram technique is quite general.


-- The test table
CREATE TABLE dbo.Example 
(
    id integer IDENTITY NOT NULL,
    string char(20) NOT NULL,

    CONSTRAINT [PK dbo.Example (id)]
        PRIMARY KEY CLUSTERED (id)
);
GO
-- 1 million rows
INSERT dbo.Example WITH (TABLOCKX)
    (string)
SELECT TOP (1 * 1000 * 1000)
    -- 10 numeric characters
    REPLACE(STR(RAND(CHECKSUM(NEWID())) * 1e10, 10), SPACE(1), '0') +
    -- plus 10 mixed numeric + [A-F] characters
    RIGHT(NEWID(), 10)
FROM master.dbo.spt_values AS SV1
CROSS JOIN master.dbo.spt_values AS SV2
OPTION (MAXDOP 1);

It takes around 3 seconds to create and populate the data on my modest laptop. The data is pseudorandom, but as an indication it will look something like this:

Data sample

Generating trigrams

The following inline function generates distinct alphanumeric trigrams from a given input string:


--- Generate trigrams from a string
CREATE FUNCTION dbo.GenerateTrigrams (@string varchar(255))
RETURNS table
WITH SCHEMABINDING
AS RETURN
    WITH
        N16 AS 
        (
            SELECT V.v 
            FROM 
            (
                VALUES 
                    (0),(0),(0),(0),(0),(0),(0),(0),
                    (0),(0),(0),(0),(0),(0),(0),(0)
            ) AS V (v)),
        -- Numbers table (256)
        Nums AS 
        (
            SELECT n = ROW_NUMBER() OVER (ORDER BY A.v)
            FROM N16 AS A 
            CROSS JOIN N16 AS B
        ),
        Trigrams AS
        (
            -- Every 3-character substring
            SELECT TOP (CASE WHEN LEN(@string) > 2 THEN LEN(@string) - 2 ELSE 0 END)
                trigram = SUBSTRING(@string, N.n, 3)
            FROM Nums AS N
            ORDER BY N.n
        )
    -- Remove duplicates and ensure all three characters are alphanumeric
    SELECT DISTINCT 
        T.trigram
    FROM Trigrams AS T
    WHERE
        -- Binary collation comparison so ranges work as expected
        T.trigram COLLATE Latin1_General_BIN2 NOT LIKE '%[^A-Z0-9a-z]%';

As an example of its use, the following call:


SELECT
    GT.trigram
FROM dbo.GenerateTrigrams('SQLperformance.com') AS GT;

Produces the following trigrams:

SQLperformance.com trigrams

The execution plan is a fairly direct translation of the T-SQL in this case:

Generating rows (cross join of Constant Scans)
Row numbering (Segment and Sequence Project)
Limiting the numbers needed based on the length of the string (Top)
Remove trigrams with non-alphanumeric characters (Filter)
Remove duplicates (Distinct Sort)

Plan for generating trigrams

Loading the trigrams

The next step is to persist trigrams for the example data. The trigrams will be held in a new table, populated using the inline function we just created:


-- Trigrams for Example table
CREATE TABLE dbo.ExampleTrigrams
(
    id integer NOT NULL,
    trigram char(3) NOT NULL
);
GO
-- Generate trigrams
INSERT dbo.ExampleTrigrams WITH (TABLOCKX)
    (id, trigram)
SELECT
    E.id,
    GT.trigram
FROM dbo.Example AS E
CROSS APPLY dbo.GenerateTrigrams(E.string) AS GT;

That takes around 20 seconds to execute on my SQL Server 2016 laptop instance. This particular run produced 17,937,972 rows of trigrams for the 1 million rows of 20-character test data. The execution plan essentially shows the function plan being evaluated for each row of the Example table:

Populating the trigram table

Since this test was performed on SQL Server 2016 (loading a heap table, under database compatibility level 130, and with a TABLOCK hint), the plan benefits from parallel insert. Rows are distributed between threads by the parallel scan of the Example table, and remain on the same thread thereafter (no repartitioning exchanges).

The Sort operator might look a bit imposing, but the numbers show the total number of rows sorted, over all iterations of the nested loop join. In fact, there are a million separate sorts, of 18 rows each. At a degree of parallelism of four (two cores hyperthreaded in my case), there are a maximum of four tiny sorts going on at any one time, and each sort instance can reuse memory. This explains why the maximum memory use of this execution plan is a mere 136KB (though 2,152 KB was granted).

The trigrams table contains one row for each distinct trigram in each source table row (identified by id):

Trigram table sample

We now create a clustered b-tree index to support searching for trigram matches:


-- Trigram search index
CREATE UNIQUE CLUSTERED INDEX
    [CUQ dbo.ExampleTrigrams (trigram, id)]
ON dbo.ExampleTrigrams (trigram, id)
WITH (DATA_COMPRESSION = ROW);

This takes around 45 seconds, though some of that is due to the sort spilling (my instance is limited to 4GB memory). An instance with more memory available could probably complete that minimally-logged parallel index build quite a bit faster.

Index building plan

Notice that the index is specified as unique (using both columns in the key). We could have created a non-unique clustered index on the trigram alone, but SQL Server would have added 4-byte uniquifiers to almost all rows anyway. Once we factor in that uniquifiers are stored in the variable-length part of the row (with the associated overhead), it just makes more sense to include id in the key and be done with it.

Row compression is specified because it usefully reduces the size of the trigram table from 277MB to 190MB (for comparison, the Example table is 32MB). If you are not at least using SQL Server 2016 SP1 (where data compression became available for all editions) you can omit the compression clause if necessary.

As a final optimization, we will also create an indexed view over the trigrams table to make it fast and easy to find which trigrams are the most and least common in the data. This step can be omitted, but is recommended for performance.


-- Selectivity of each trigram (performance optimization)
CREATE VIEW dbo.ExampleTrigramCounts
WITH SCHEMABINDING
AS
SELECT ET.trigram, cnt = COUNT_BIG(*)
FROM dbo.ExampleTrigrams AS ET
GROUP BY ET.trigram;
GO
-- Materialize the view
CREATE UNIQUE CLUSTERED INDEX
    [CUQ dbo.ExampleTrigramCounts (trigram)]
ON dbo.ExampleTrigramCounts (trigram);

Indexed view building plan

This only takes a couple of seconds to complete. The size of the materialized view is tiny, just 104KB.

Trigram search

Given a search string (e.g. '%find%this%'), our approach will be to:

Generate the complete set of trigrams for the search string
Use the indexed view to find the three most selective trigrams
Find the ids matching all available trigrams
Retrieve the strings by id
Apply the full filter to the trigram-qualified rows

Finding selective trigrams

The first two steps are pretty straightforward. We already have a function to generate trigrams for an arbitrary string. Finding the most selective of those trigrams can be achieved by joining to the indexed view. The following code wraps the implementation for our example table in another inline function. It pivots the three most selective trigrams into a single row for ease of use later on:


-- Most selective trigrams for a search string
-- Always returns a row (NULLs if no trigrams found)
CREATE FUNCTION dbo.Example_GetBestTrigrams (@string varchar(255))
RETURNS table
WITH SCHEMABINDING AS
RETURN
    SELECT
        -- Pivot
        trigram1 = MAX(CASE WHEN BT.rn = 1 THEN BT.trigram END),
        trigram2 = MAX(CASE WHEN BT.rn = 2 THEN BT.trigram END),
        trigram3 = MAX(CASE WHEN BT.rn = 3 THEN BT.trigram END)
    FROM 
    (
        -- Generate trigrams for the search string
        -- and choose the most selective three
        SELECT TOP (3)
            rn = ROW_NUMBER() OVER (
                ORDER BY ETC.cnt ASC),
            GT.trigram
        FROM dbo.GenerateTrigrams(@string) AS GT
        JOIN dbo.ExampleTrigramCounts AS ETC
            WITH (NOEXPAND)
            ON ETC.trigram = GT.trigram
        ORDER BY
            ETC.cnt ASC
    ) AS BT;

As an example:


SELECT
    EGBT.trigram1,
    EGBT.trigram2,
    EGBT.trigram3 
FROM dbo.Example_GetBestTrigrams('%1234%5678%') AS EGBT;

returns (for my sample data):

Selected trigrams

The execution plan is:

GetBestTrigrams execution plan

This is the familiar trigram-generating plan from earlier, followed by a lookup into the indexed view for each trigram, sorting by the number of matches, numbering the rows (Sequence Project), limiting the set to three rows (Top), then pivoting the result (Stream Aggregate).

Finding IDs matching all trigrams

The next step is to find Example table row ids that match all of the non-null trigrams retrieved by the previous stage. The wrinkle here is that we may have zero, one, two, or three trigrams available. The following implementation wraps the necessary logic in a multi-statement function, returning the qualifying ids in a table variable:


-- Returns Example ids matching all provided (non-null) trigrams
CREATE FUNCTION dbo.Example_GetTrigramMatchIDs
(
    @Trigram1 char(3),
    @Trigram2 char(3),
    @Trigram3 char(3)
)
RETURNS @IDs table (id integer PRIMARY KEY)
WITH SCHEMABINDING AS
BEGIN
    IF  @Trigram1 IS NOT NULL
    BEGIN
        IF @Trigram2 IS NOT NULL
        BEGIN
            IF @Trigram3 IS NOT NULL
            BEGIN
                -- 3 trigrams available
                INSERT @IDs (id)
                SELECT ET1.id
                FROM dbo.ExampleTrigrams AS ET1 
                WHERE ET1.trigram = @Trigram1
                INTERSECT
                SELECT ET2.id
                FROM dbo.ExampleTrigrams AS ET2
                WHERE ET2.trigram = @Trigram2
                INTERSECT
                SELECT ET3.id
                FROM dbo.ExampleTrigrams AS ET3
                WHERE ET3.trigram = @Trigram3
                OPTION (MERGE JOIN);
            END;
            ELSE
            BEGIN
                -- 2 trigrams available
                INSERT @IDs (id)
                SELECT ET1.id
                FROM dbo.ExampleTrigrams AS ET1 
                WHERE ET1.trigram = @Trigram1
                INTERSECT
                SELECT ET2.id
                FROM dbo.ExampleTrigrams AS ET2
                WHERE ET2.trigram = @Trigram2
                OPTION (MERGE JOIN);
            END;
        END;
        ELSE
        BEGIN
            -- 1 trigram available
            INSERT @IDs (id)
            SELECT ET1.id
            FROM dbo.ExampleTrigrams AS ET1 
            WHERE ET1.trigram = @Trigram1;
        END;
    END;

    RETURN;
END;

The estimated execution plan for this function shows the strategy:

Trigram match ids plan

If there is one trigram available, a single seek into the trigram table is performed. Otherwise, two or three seeks are performed and the intersection of ids found using an efficient one-to-many merge(s). There are no memory-consuming operators in this plan, so no chance of a hash or sort spill.

Continuing the example search, we can find ids matching the available trigrams by applying the new function:


SELECT EGTMID.id 
FROM dbo.Example_GetBestTrigrams('%1234%5678%') AS EGBT
CROSS APPLY dbo.Example_GetTrigramMatchIDs
    (EGBT.trigram1, EGBT.trigram2, EGBT.trigram3) AS EGTMID;

This returns a set like the following:

Matching IDs

The actual (post-execution) plan for the new function shows the plan shape with three trigram inputs being used:

Actual ID matching plan

This shows the power of trigram matching quite well. Although all three trigrams each identify around 11,000 rows in the Example table, the first intersection reduces this set to 1,004 rows, and the second intersection reduces it to just 7.

Complete trigram search implementation

Now that we have the ids matching the trigrams, we can look up the matching rows in the Example table. We still need to apply the original search condition as a final check, because trigrams may generate false positives (but not false negatives). The final issue to address is that the previous stages may not have found any trigrams. This might be, for example, because the search string contains too little information. A search string of '%FF%' cannot use trigram search because two characters are not enough to generate even a single trigram. To handle this scenario gracefully, our search will detect this condition and fall back to a non-trigram search.

The following final inline function implements the required logic:


-- Search implementation
CREATE FUNCTION dbo.Example_TrigramSearch
(
    @Search varchar(255)
)
RETURNS table
WITH SCHEMABINDING
AS
RETURN
    SELECT
        Result.string
    FROM dbo.Example_GetBestTrigrams(@Search) AS GBT
    CROSS APPLY
    (
        -- Trigram search
        SELECT
            E.id,
            E.string
        FROM dbo.Example_GetTrigramMatchIDs
            (GBT.trigram1, GBT.trigram2, GBT.trigram3) AS MID
        JOIN dbo.Example AS E
            ON E.id = MID.id
        WHERE
            -- At least one trigram found 
            GBT.trigram1 IS NOT NULL
            AND E.string LIKE @Search

        UNION ALL

        -- Non-trigram search
        SELECT
            E.id,
            E.string
        FROM dbo.Example AS E
        WHERE
            -- No trigram found 
            GBT.trigram1 IS NULL
            AND E.string LIKE @Search
    ) AS Result;

The key feature is the outer reference to GBT.trigram1 on both sides of the UNION ALL. These translate to Filters with start-up expressions in the execution plan. A start-up filter only executes its subtree if its condition evaluates to true. The net effect is that only one part of the union all will be executed, depending on whether we found a trigram or not. The relevant part of the execution plan is:

Start-up filter effect

Either the Example_GetTrigramMatchIDs function will be executed (and results looked up in Example using a seek on id), or the Clustered Index Scan of Example with a residual LIKE predicate will run, but not both.

Performance

The following code tests the performance of trigram search against the equivalent LIKE:


SET STATISTICS XML OFF
DECLARE @S datetime2 = SYSUTCDATETIME();

SELECT F2.string
FROM dbo.Example AS F2
WHERE
    F2.string LIKE '%1234%5678%'
OPTION (MAXDOP 1);

SELECT ElapsedMS = DATEDIFF(MILLISECOND, @S, SYSUTCDATETIME());
GO
SET STATISTICS XML OFF
DECLARE @S datetime2 = SYSUTCDATETIME();

SELECT ETS.string
FROM dbo.Example_TrigramSearch('%1234%5678%') AS ETS;

SELECT ElapsedMS = DATEDIFF(MILLISECOND, @S, SYSUTCDATETIME());

These both produce the same result row(s) but the LIKE query runs for 2100ms, whereas the trigram search takes 15ms.

Even better performance is possible. Generally, performance improves as the trigrams become more selective, and fewer in number (below the maximum of three in this implementation). For example:


SET STATISTICS XML OFF
DECLARE @S datetime2 = SYSUTCDATETIME();

SELECT ETS.string
FROM dbo.Example_TrigramSearch('%BEEF%') AS ETS;

SELECT ElapsedMS = DATEDIFF(MILLISECOND, @S, SYSUTCDATETIME());

That search returned 111 rows to the SSMS grid in 4ms. The LIKE equivalent ran for 1950ms.

Maintaining the trigrams

If the target table is static, there is obviously no problem keeping the base table and related trigram table synchronized. Likewise, if the search results are not required to be completely up to date at all times, a scheduled refresh of the trigram table(s) may work well.

Otherwise, we can use some fairly simple triggers to keep the trigram search data synchronized to the underlying strings. The general idea is to generate trigrams for deleted and inserted rows, then add or delete them in the trigram table as appropriate. The insert, update, and delete triggers below show this idea in practice:


-- Maintain trigrams after Example inserts
CREATE TRIGGER MaintainTrigrams_AI
ON dbo.Example
AFTER INSERT
AS
BEGIN
    IF @@ROWCOUNT = 0 RETURN;
    IF TRIGGER_NESTLEVEL(@@PROCID, 'AFTER', 'DML') > 1 RETURN;
    SET NOCOUNT ON;
    SET ROWCOUNT 0;

    -- Insert related trigrams
    INSERT dbo.ExampleTrigrams
        (id, trigram)
    SELECT
        INS.id, GT.trigram
    FROM Inserted AS INS
    CROSS APPLY dbo.GenerateTrigrams(INS.string) AS GT;
END;


-- Maintain trigrams after Example deletes
CREATE TRIGGER MaintainTrigrams_AD
ON dbo.Example
AFTER DELETE
AS
BEGIN
    IF @@ROWCOUNT = 0 RETURN;
    IF TRIGGER_NESTLEVEL(@@PROCID, 'AFTER', 'DML') > 1 RETURN;
    SET NOCOUNT ON;
    SET ROWCOUNT 0;

    -- Deleted related trigrams
    DELETE ET
        WITH (SERIALIZABLE)
    FROM Deleted AS DEL
    CROSS APPLY dbo.GenerateTrigrams(DEL.string) AS GT
    JOIN dbo.ExampleTrigrams AS ET
        ON ET.trigram = GT.trigram
        AND ET.id = DEL.id;
END;


-- Maintain trigrams after Example updates
CREATE TRIGGER MaintainTrigrams_AU
ON dbo.Example
AFTER UPDATE
AS
BEGIN
    IF @@ROWCOUNT = 0 RETURN;
    IF TRIGGER_NESTLEVEL(@@PROCID, 'AFTER', 'DML') > 1 RETURN;
    SET NOCOUNT ON;
    SET ROWCOUNT 0;

    -- Deleted related trigrams
    DELETE ET
        WITH (SERIALIZABLE)
    FROM Deleted AS DEL
    CROSS APPLY dbo.GenerateTrigrams(DEL.string) AS GT
    JOIN dbo.ExampleTrigrams AS ET
        ON ET.trigram = GT.trigram
        AND ET.id = DEL.id;

    -- Insert related trigrams
    INSERT dbo.ExampleTrigrams
        (id, trigram)
    SELECT
        INS.id, GT.trigram
    FROM Inserted AS INS
    CROSS APPLY dbo.GenerateTrigrams(INS.string) AS GT;
END;

The triggers are pretty efficient, and will handle both single- and multi-row changes (including the multiple actions available when using a MERGE statement). The indexed view over the trigrams table will be automatically maintained by SQL Server without us needing to write any trigger code.

Trigger operation

As an example, run a statement to delete an arbitrary row from the Example table:


-- Single row delete
DELETE TOP (1) dbo.Example;

The post-execution (actual) execution plan includes an entry for the after delete trigger:

Delete trigger execution plan

The yellow section of the plan reads rows from the deleted pesudo-table, generates trigrams for each deleted Example string (using the familiar plan highlighted in green), then locates and deletes the associated trigram table entries. The final section of the plan, shown in red, is automatically added by SQL Server to keep the indexed view up to date.

The plan for the insert trigger is extremely similar. Updates are handled by performing a delete followed by an insert. Run the following script to see these plans and confirm that new and updated rows can be located using the trigram search function:


-- Single row insert
INSERT dbo.Example (string) 
VALUES ('SQLPerformance.com');

-- Find the new row
SELECT ETS.string
FROM dbo.Example_TrigramSearch('%perf%') AS ETS;

-- Single row update
UPDATE TOP (1) dbo.Example 
SET string = '12345678901234567890';

-- Multi-row insert
INSERT dbo.Example WITH (TABLOCKX)
    (string)
SELECT TOP (1000)
    REPLACE(STR(RAND(CHECKSUM(NEWID())) * 1e10, 10), SPACE(1), '0') +
    RIGHT(NEWID(), 10)
FROM master.dbo.spt_values AS SV1;

-- Multi-row update
UPDATE TOP (1000) dbo.Example 
SET string = '12345678901234567890';

-- Search for the updated rows
SELECT ETS.string 
FROM dbo.Example_TrigramSearch('12345678901234567890') AS ETS;

Merge example

The next script shows a MERGE statement being used to insert, delete, and update the Example table all at once:


-- MERGE demo
DECLARE @MergeData table 
(
    id integer UNIQUE CLUSTERED NULL,
    operation char(3) NOT NULL,
    string char(20) NULL
);

INSERT @MergeData 
    (id, operation, string)
VALUES 
    (NULL, 'INS', '11223344556677889900'),  -- Insert
    (1001, 'DEL', NULL),                    -- Delete
    (2002, 'UPD', '00000000000000000000');  -- Update

DECLARE @Actions table 
(
    action$ nvarchar(10) NOT NULL, 
    old_id integer NULL, 
    old_string char(20) NULL, 
    new_id integer NULL, 
    new_string char(20) NULL
);

MERGE dbo.Example AS E
USING @MergeData AS MD
    ON MD.id = E.id
WHEN MATCHED AND MD.operation = 'DEL' 
    THEN DELETE
WHEN MATCHED AND MD.operation = 'UPD' 
    THEN UPDATE SET E.string = MD.string
WHEN NOT MATCHED AND MD.operation = 'INS'
    THEN INSERT (string) VALUES (MD.string)
OUTPUT $action, Deleted.id, Deleted.string, Inserted.id, Inserted.string
INTO @Actions (action$, old_id, old_string, new_id, new_string);

SELECT * FROM @Actions AS A;

The output will show something like:

Action output

Final thoughts

There is perhaps some scope to speed up large delete and update operations by referencing ids directly instead of generating trigrams. This is not implemented here because it would require a new nonclustered index on the trigrams table, doubling the (already significant) storage space used. The trigrams table contains a single integer and a char(3) per row; a nonclustered index on the integer column would gain the char(3) column at all levels (courtesy of the clustered index, and the need for index keys to be unique at every level). There is also memory space to consider, since trigram searches work best when all reads are from cache.

The extra index would make cascading referential integrity an option, but that is often more trouble than it is worth.

Trigram search is no panacea. The extra storage requirements, implementation complexity, and impact on update performance all weigh heavily against it. The technique is also useless for searches that do not generate any trigrams (3 character minimum). Although the basic implementation shown here can handle many types of search (starts with, contains, ends with, multiple wildcards) it does not cover every possible search expression that can be made to work with LIKE. It works well for search strings that generate AND-type trigrams; more work is needed to handle search strings that require OR-type handling, or more advanced options like regular expressions.

All that said, if your application really must have fast wildcard string searches, n-grams are something to seriously consider.

Related content: One way to get an index seek for a leading %wildcard by Aaron Bertrand.

18 thoughts on “Trigram Wildcard String Search in SQL Server”

Mikael says:

September 8, 2017 at 6:48 PM

Amazing
Sergey Berezniker says:

September 9, 2017 at 7:00 PM

What is the reason for including 'SET ROWCOUNT 0;' in the triggers?
Robert A. Magrogan, Jr. says:

September 10, 2017 at 12:47 AM

Well done, Sir.
Paul White says:

September 10, 2017 at 6:37 AM

It's to ensure the trigger functions correctly (processing all rows) even if the calling code has set ROWCOUNT. See http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/03/21/defensive-database-programming-fun-with-rowcount.aspx for example.
Pratheek says:

September 12, 2017 at 10:40 AM

Great Article. Amazing work.
David Wiseman says:

September 12, 2017 at 7:53 PM

Great article. I’m using this technique to speed the performance of like %something% searches and I can attest that it works very well to improve performance. I’m using this technique on 2 columns. The larger of the two fragment (trigram) tables has 1.1 billion rows (46GB) and the smaller has 900 million rows (35GB). The indexed table has about 90 million rows. The technique works well at this scale.
My implementation is very similar to what is described in this article. I need to deal with NVARCHAR data though and there are a few other minor differences. I experimented with different fragment sizes – I found 3 characters to be the sweet spot where there was significant performance gains and it was able to be used on the majority of searches where users enter more than 3 chars.
My function to generate trigrams/ngrams is quite simple as it uses a separate function to generate a table of numbers (credit to Itzik Ben-Gan for the table of numbers). They are here if anyone wants to use them to experiment with different fragment sizes:
```
CREATE FUNCTION [dbo].[fnFragments](@Value NVARCHAR(1024),@FragmentSize TINYINT)
RETURNS TABLE
AS
/* Created 17/11/2011
Function to split input into fragments of specified size.
*/
RETURN
SELECT DISTINCT SUBSTRING(@Value,n,@FragmentSize) as Fragment
FROM dbo.fn_nums(LEN(@Value)-(@FragmentSize-1))

CREATE FUNCTION [dbo].[fn_nums](@n AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
  L0   AS(SELECT 1 AS c UNION ALL SELECT 1),
  L1   AS(SELECT 1 AS c FROM L0 AS A, L0 AS B),
  L2   AS(SELECT 1 AS c FROM L1 AS A, L1 AS B),
  L3   AS(SELECT 1 AS c FROM L2 AS A, L2 AS B),
  L4   AS(SELECT 1 AS c FROM L3 AS A, L3 AS B),
  L5   AS(SELECT 1 AS c FROM L4 AS A, L4 AS B),
  Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY c) AS n FROM L5)
  SELECT n FROM Nums WHERE n <= @n;
```
My implementation is a multi-tenant database so a key design alteration that made sense for me was to include the tenant id in the trigram/fragment table.
I’m also using an indexed view to help find the most selective fragment in the search criteria. I’m not using the best 3 though– that looks like an interesting technique – I might have to test it at some point.
I think I did experiment with matching all fragments but found it was better to use the most selective fragment.
I’d encourage other people to use this technique if you need to optimize like %something% searches. FullText is the most obvious solution but this isn’t always a suitable replacement – it behaves in a very different way. It’s worth experimenting with this alternative solution as it can yield significant performance benefits while maintaining the existing behaviour of your queries.

Great stuff!
Paul White says:

September 13, 2017 at 10:14 AM

Thanks for the feedback, David. Glad it is working so well for you.
Jens Rügge says:

December 19, 2017 at 7:39 AM

Thank you for this great example!

The only thing is that the speed difference is much lower at us. But this is due to the number of data records. There are only about 50,000 on our test server. But even with this records search is about twice faster.
Brian Williams says:

January 12, 2018 at 3:04 PM

Thanks Paul for this excellent Blog.

I am looking at implementing this along with SQL Server 2016 SP1 Always Encrypted. Where the PII data is encrypted but the Trigrams may or may not be encrypted. I have generated a table with the 17,576 possible Trigrams for the 26 character alphabet and am trying to see if I can get it working Encrypted.

As I understand it the European Union has mandated that PII data be encrypted at rest by May 2018. I know TDE will accomplish that but am trying to push further to Always Encrypted.

Regards
Brian Williams says:

January 18, 2018 at 4:16 PM

I was successful in getting Trigrams working with Always Encrypted Deterministic Encryption. I use a seed string concatenated to the string to help with the guessability issue with Deterministic Encryption.
I had to remove a few things like Schema Binding to get it to work.
One negative is the Always Encrypted Trigam is so much larger than the non-Encrypted Trigram. But It does allow me to wild card query Always Encrypted Fields like last Name, Company Name etc. I am willing to post the modifications.
Paul White says:

January 18, 2018 at 8:34 PM

That's really cool, Brian. Thanks for the information.
Adam Wilbur says:

August 31, 2018 at 3:41 PM

Is the "AND E.string LIKE @Search" supposed to be on the first half of the UNION ALL in the Example_TrigramSearch? Seems like the answer is "no" but I wanted to make sure I wasn't misunderstanding something
Paul White says:

September 1, 2018 at 7:28 PM

Yes because trigram-search may generate false positives. The LIKE test is needed to ensure the matches are true matches.
Evan J Carroll says:

September 7, 2018 at 7:59 PM

So does this make '%LIKE%' SARGABLE in Microsoft SQL?
Paul White says:

September 8, 2018 at 2:57 PM

I guess it depends on the definition of SARGable in use.

I would expect most people to say "no", since the %string% search is not natively SARGable.

Replacing the non-SARGable LIKE with this function call is a workaround; one that allows most of the real work done to be performed using the seeking ability of b-tree indexes.
Henning Krause says:

December 21, 2018 at 8:21 AM

I have tried this solution on a table with about 1 million rows.

If the search string results in relatively few items, the query runs extremely fast (8ms compared to 1600ms with like).

But when the search string returns a large portion of the table, the execution times may rise to 4200ms compared to the 1600ms of the like query.

Any ideas what to do about this?
Ray Whitmer says:

January 8, 2019 at 7:39 PM

I am not yet a SQL Server user but am interested in this technology. So I have to ask when you say "modifications", what are you talking about modifying? Is this posted anywhere?

Are the trigrams computed pre-encryption and then encrypted, both when populating the index and when searching the index?

Have you tested this with large data sets to make sure the index is as fast as an index on non-encrypted fields?
Brian Williams says:

January 9, 2019 at 2:31 PM

Yes My current employer is in the Communications environment and our databases are in the low Terrabytes. Many large tables of 100 million records or more and 1 table that is often beyond 1 Billion records (we do love our cell phones).
I found the bigger the table the better the Trigram method worked compared to fulltext or simple indexes.
Towards modifications I ment what I had to change to get Paul's code to work inside Always Encrypted.

Comments are closed.