Aaron Bertrand

One way to get an index seek for a leading %wildcard

SentryOne Newsletters

The SQLPerformance.com bi-weekly newsletter keeps you up to speed on the most recent blog posts and forum discussions in the SQL Server community.

eNews is a bi-monthly newsletter with fun information about SentryOne, tips to help improve your productivity, and much more.

Subscribe

Featured Author

Paul Randal, CEO of SQLskills, writes about knee-jerk performance tuning, DBCC, and SQL Server internals.

Paul’s Posts

A performance issue I see often is when users need to match part of a string with a query like the following:

... WHERE SomeColumn LIKE N'%SomePortion%';

With a leading wildcard, this predicate is "non-SARGable" – just a fancy way of saying we can't find the relevant rows by using a seek against an index on SomeColumn.

One solution we get kind of hand-wavy about is full-text search; however, this is a complex solution, and it requires that the search pattern consists of full words, doesn't use stop words, and so on. This can help if we're looking for rows where a description contains the word "soft" (or other derivatives like "softer" or "softly"), but it doesn't help when we're looking for strings that could be contained within words (or that aren't words at all, like all product SKUs that contain "X45-B" or all license plates that contain the sequence "7RA").

What if SQL Server somehow knew about all of the possible portions of a string? My thought process is along the lines of trigram / trigraph in PostgreSQL. The basic concept is that the engine has the ability to do point-style lookups on substrings, meaning you don't have to scan the entire table and parse every full value.

The specific example they use there is the word cat. In addition to the full word, it can be broken down into portions: c, ca, and at (they leave out a and t by definition – no trailing substring can be shorter than two characters). In SQL Server, we don't need it to be that complex; we only really need half the trigram – if we think about building a data structure that contains all of the substrings of cat, we only need these values:

  1. cat
  2. at
  3. t

We don't need c or ca, because anyone searching for LIKE '%ca%' could easily find value 1 by using LIKE 'ca%' instead. Similarly, anyone searching for LIKE '%at%' or LIKE '%a%' could use a trailing wildcard only against these three values and find the one that matches (e.g. LIKE 'at%' will find value 2, and LIKE 'a%' will also find value 2, without having to find those substrings inside value 1, which is where the real pain would come from).

So, given that SQL Server does not have anything like this built in, how do we generate such a trigram?

A Separate Fragments Table

I'm going to stop referencing "trigram" here because this isn't truly analogous to that feature in PostgreSQL. Essentially, what we are going to do is build a separate table with all of the "fragments" of the original string. (So in the cat example, there'd be a new table with those three rows, and LIKE '%pattern%' searches would be directed against that new table as LIKE 'pattern%' predicates.)

Before I start to show how my proposed solution would work, let me be absolutely clear that this solution should not be used in every single case where LIKE '%wildcard%' searches are slow. Because of the way we're going to "explode" the source data into fragments, it is likely limited in practicality to smaller strings, such as addresses or names, as opposed to larger strings, like product descriptions or session abstracts.

A more practical example than cat is to look at the Sales.Customer table in the WideWorldImporters sample database. It has address lines (both nvarchar(60)), with the valuable address info in the column DeliveryAddressLine2 (for reasons unknown). Someone might be looking for anyone who lives on a street named Hudecova, so they will be running a search like this:

SELECT CustomerID, DeliveryAddressLine2
  FROM Sales.Customers
  WHERE DeliveryAddressLine2 LIKE N'%Hudecova%';

/* This returns two rows:

    CustomerID  DeliveryAddressLine2
    ----------  ----------------------
    61          1695 Hudecova Avenue
    181         1846 Hudecova Crescent
*/

As you would expect, SQL Server needs to perform a scan to locate those two rows. Which should be simple; however, because of the complexity of the table, this trivial query yields a very messy execution plan (the scan is highlighted, and has a warning for residual I/O):

Plan for LIKE '%wildcard%' query

Blecch. To keep our lives a simple, and to make sure we don't chase a bunch of red herrings, let's create a new table with a subset of the columns, and to start we'll just insert those same two rows from above:

CREATE TABLE Sales.CustomersCopy
(
  CustomerID int IDENTITY(1,1) CONSTRAINT PK_CustomersCopy PRIMARY KEY,
  CustomerName         nvarchar(100) NOT NULL,
  DeliveryAddressLine1 nvarchar(60) NOT NULL,
  DeliveryAddressLine2 nvarchar(60)
);
GO

INSERT Sales.CustomersCopy
(
  CustomerName,
  DeliveryAddressLine1,
  DeliveryAddressLine2
)
SELECT 
  CustomerName,
  DeliveryAddressLine1,
  DeliveryAddressLine2
FROM Sales.Customers
WHERE DeliveryAddressLine2 LIKE N'%Hudecova%';

Now, if we run the same query we ran against the main table, we get something a lot more reasonable to look at as a starting point. This is still going to be a scan no matter what we do – if we add an index with DeliveryAddressLine2 as the leading key column, we'll most likely get an index scan, with a key lookup depending on whether the index covers the columns in the query. As is, we get a clustered index scan:

Scan against the Customers copy

Next, let's create a function that will "explode" these address values into fragments. We would expect 1846 Hudecova Crescent, for example, to have the following set of fragments:

  • 1846 Hudecova Crescent
  • 846 Hudecova Crescent
  • 46 Hudecova Crescent
  • 6 Hudecova Crescent
  •  Hudecova Crescent
  • Hudecova Crescent
  • udecova Crescent
  • decova Crescent
  • ecova Crescent
  • cova Crescent
  • ova Crescent
  • va Crescent
  • a Crescent
  •  Crescent
  • Crescent
  • rescent
  • escent
  • scent
  • cent
  • ent
  • nt
  • t

It is fairly trivial to write a function that will produce this output – we just need a recursive CTE that can be used to step through each character throughout the length of the input:

CREATE FUNCTION dbo.CreateStringFragments( @input nvarchar(60) )
RETURNS TABLE WITH SCHEMABINDING
AS
  RETURN 
  (
    WITH x(x) AS 
    (
      SELECT 1 UNION ALL SELECT x+1 FROM x WHERE x < (LEN(@input))
    )
    SELECT Fragment = SUBSTRING(@input, x, LEN(@input)) FROM x
  );
GO

SELECT Fragment FROM dbo.CreateStringFragments(N'1846 Hudecova Crescent');
-- same output as above bulleted list

Now, let's create a table that will store all of the address fragments, and which Customer they belong to:

CREATE TABLE Sales.CustomerAddressFragments
(
  CustomerID  int          NOT NULL,
  Fragment    nvarchar(60) NOT NULL,
  CONSTRAINT FK_Customers FOREIGN KEY(CustomerID) REFERENCES Sales.CustomersCopy(CustomerID)
);

CREATE CLUSTERED INDEX s_cat ON Sales.CustomerAddressFragments(Fragment, CustomerID);

Then we can populate it like this:

INSERT Sales.CustomerAddressFragments(CustomerID, Fragment)
SELECT c.CustomerID, f.Fragment
  FROM Sales.CustomersCopy AS c
  CROSS APPLY dbo.CreateStringFragments(c.DeliveryAddressLine2) AS f;

For our two values, this inserts 42 rows (one address has 20 characters, and the other has 22). Now our query becomes:

SELECT c.CustomerID, c.DeliveryAddressLine2
  FROM Sales.CustomersCopy AS c
  INNER JOIN Sales.CustomerAddressFragments AS f
    ON  f.CustomerID = c.CustomerID
    AND f.Fragment LIKE N'Hudecova%';

Here is a much nicer plan - two clustered index *seeks* and a nested loops join:

Seeks against customers and address fragments

We can also do this another way, using EXISTS:

SELECT c.CustomerID, c.DeliveryAddressLine2
  FROM Sales.CustomersCopy AS c
  WHERE EXISTS
  (
    SELECT 1 FROM Sales.CustomerAddressFragments
    WHERE CustomerID = c.CustomerID
      AND Fragment LIKE N'Hudecova%'
  );

Here is that plan, which looks on the surface to be more expensive - it chooses to scan the CustomersCopy table. We'll see shortly why this is the better query approach:

Scan and seek using EXISTS

Now, on this massive data set of 42 rows, the differences seen in duration and I/O are so minuscule they're irrelevant (and in fact, at this small size, the scan against the base table looks cheaper in terms of I/O than the other two plans that use the fragment table):

Metrics against small table

What if we were to populate these tables with a lot more data? My copy of Sales.Customers has 663 rows, so if we cross join that against itself, we'd get somewhere near 440,000 rows. So let's just mash up 400K and generate a much larger table:

TRUNCATE TABLE Sales.CustomerAddressFragments;
DELETE Sales.CustomersCopy;
DBCC FREEPROCCACHE WITH NO_INFOMSGS;

INSERT Sales.CustomersCopy WITH (TABLOCKX) (CustomerName, DeliveryAddressLine1, DeliveryAddressLine2)
SELECT TOP (400000) c1.CustomerName, c1.DeliveryAddressLine1, c2.DeliveryAddressLine2
  FROM Sales.Customers c1 CROSS JOIN Sales.Customers c2
  ORDER BY NEWID(); -- fun for distribution

-- this will take a bit longer - on my system, ~4 minutes
-- probably because I didn't bother to pre-expand files
INSERT Sales.CustomerAddressFragments WITH (TABLOCKX) (CustomerID, Fragment)
SELECT c.CustomerID, f.Fragment
  FROM Sales.CustomersCopy AS c
  CROSS APPLY dbo.CreateStringFragments(c.DeliveryAddressLine2) AS f;

-- repeated for compressed version of the table
-- 7.25 million rows, 278MB (79MB compressed; saving those tests for another day)

Now to compare performance and execution plans given a variety of possible search parameters, I tested the above three queries with these predicates:

Query Predicates
WHERE DeliveryLineAddress2 LIKE ... N'%Hudecova%' N'%cova%' N'%ova%' N'%va%'
JOIN ... WHERE Fragment LIKE ... N'Hudecova%' N'cova%' N'ova%' N'va%'
WHERE EXISTS (... WHERE Fragment LIKE ...)

 
As we remove letters from the search pattern, I would expect to see more rows output, and perhaps different strategies chosen by the optimizer. Let's see how it went for each pattern:

    Hudecova%

    For this pattern, the scan was still the same for the LIKE condition; however, with more data, the cost is much higher. The seeks into the fragments table really pay off at this row count (1,206), even with really low estimates. The EXISTS plan adds a distinct sort, which you would expect to add to the cost, though in this case it ends up doing fewer reads:

    Hudecova-metrics

    Hudecova-joinPlan for the JOIN to the fragments table
    Hudecova-existsPlan for the EXISTS against the fragments table

    cova%

    As we strip some letters off our predicate, we see the reads actually a bit higher than with the original clustered index scan, and now we over-estimate the rows. Even still, our duration remains significantly lower with both fragment queries; the difference this time is more subtle - both have sorts (only EXISTS is distinct):

    cova-metrics

    cova-joinPlan for the JOIN to the fragments table
    cova-existsPlan for the EXISTS against the fragments table

    ova%

    Stripping an additional letter didn't change much; though it is worth noting how much the estimated and actual rows jump here - meaning that this may be a common substring pattern. The original LIKE query is still quite a bit slower than the fragment queries.

    ova-metrics

    ova-joinPlan for the JOIN to the fragments table
    ova-existsPlan for the EXISTS against the fragments table

    va%

    Down to two letters, this does introduce our first discrepancy. Notice that the JOIN produces more rows than the original query or the EXISTS. Why would that be?

    va-metrics

    va-joinPlan for the JOIN to the fragments table
    va-existsPlan for the EXISTS against the fragments table
    We don't have to look far. Remember that there is a fragment starting from each successive character in the original address, which means something like 899 Valentova Road will have two rows in the fragments table that start with va (case sensitivity aside). So you'll match on both Fragment = N'Valentova Road' and Fragment = N'va Road'. I'll save you the searching and provide a single example of many:

    SELECT TOP (2) c.CustomerID, c.DeliveryAddressLine2, f.Fragment
    FROM Sales.CustomersCopy AS c
    INNER JOIN Sales.CustomerAddressFragments AS f
    ON c.CustomerID = f.CustomerID
    WHERE f.Fragment LIKE N'va%'
    AND c.DeliveryAddressLine2 = N'899 Valentova Road'
    AND f.CustomerID = 351;
    
    /*
    CustomerID  DeliveryAddressLine2  Fragment
    ----------  --------------------  --------------
    351         899 Valentova Road    va Road
    351         899 Valentova Road    Valentova Road
    */

    This readily explains why the JOIN produces more rows; if you want to match the output of the original LIKE query, you should use EXISTS. The fact that the correct results can usually also be obtained in a less resource-intensive way is just a bonus. (I'd be nervous to see people choose the JOIN if that were the repeatedly more efficient option - you should always favor correct results over worrying about performance.)

Summary

It's clear that in this specific case - with an address column of nvarchar(60) and a max length of 26 characters - breaking up each address into fragments can bring some relief to otherwise expensive "leading wildcard" searches. The better payoff seems to happen when the search pattern is larger and, as a result, more unique. I've also demonstrated why EXISTS is better in scenarios where multiple matches are possible - with a JOIN, you will get redundant output unless you add some "greatest n per group" logic.

Caveats

I will be the first to admit that this solution is imperfect, incomplete, and not thoroughly fleshed out:

  • You'll need to keep the fragments table synchronized with the base table using triggers - simplest would be for inserts and updates, delete all rows for those customers and re-insert them, and for deletes obviously remove the rows from the fragments table.
  • As mentioned, this solution worked better for this specific data size, and may not do so well for other string lengths. It would warrant further testing to ensure it is appropriate for your column size, average value length, and typical search parameter length.
  • Since we will have a lot of copies of fragments like "Crescent" and "Street" (never mind all the same or similar street names and house numbers), could further normalize it by storing each unique fragment in a fragments table, and then yet another table that acts as a many-to-many junction table. That would be a lot more cumbersome to set up, and I'm not quite sure there would be much to gain.
  • I didn't fully investigate page compression yet, it seems that - at least in theory - this could provide some benefit. I also have a feeling a columnstore implementation may be beneficial in some cases, too.

Anyway, all that said, I just wanted to share it as a developing idea, and solicit any feedback on its practicality for specific use cases. I can dig into more specifics for a future post if you can let me know what aspects of this solution interest you.