A performance issue I see often is when users need to match part of a string with a query like the following:
... WHERE SomeColumn LIKE N'%SomePortion%';
With a leading wildcard, this predicate is "non-SARGable" – just a fancy way of saying we can't find the relevant rows by using a seek against an index on SomeColumn.
One solution we get kind of hand-wavy about is full-text search; however, this is a complex solution, and it requires that the search pattern consists of full words, doesn't use stop words, and so on. This can help if we're looking for rows where a description contains the word "soft" (or other derivatives like "softer" or "softly"), but it doesn't help when we're looking for strings that could be contained within words (or that aren't words at all, like all product SKUs that contain "X45-B" or all license plates that contain the sequence "7RA").
What if SQL Server somehow knew about all of the possible portions of a string? My thought process is along the lines of the trigram feature in PostgreSQL. The basic concept is that the engine has the ability to do point-style lookups on substrings, meaning you don't have to scan the entire table and parse every full value.
The specific example they use there is the word cat. In addition to the full word, it can be broken down into portions: c, ca, and at (they leave out t by definition – no trailing substring can be shorter than two characters). In SQL Server, we don't need it to be that complex; we only really need half the trigram – if we think about building a data structure that contains all of the substrings of cat, we only need these values:

1. cat
2. at
3. t
We don't need
ca, because anyone searching for
LIKE '%ca%' could easily find value 1 by using
LIKE 'ca%' instead. Similarly, anyone searching for
LIKE '%at%' or
LIKE '%a%' could use a trailing wildcard only against these three values and find the one that matches (e.g.
LIKE 'at%' will find value 2, and
LIKE 'a%' will also find value 2, without having to find those substrings inside value 1, which is where the real pain would come from).
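To convince ourselves the half-trigram idea is sound, here is a quick sketch (in Python, purely illustrative – the eventual solution is all T-SQL) showing that a prefix search over a value's trailing substrings finds exactly the same values as a leading-wildcard search over the value itself:

```python
# Illustrative sketch: searching the trailing substrings of a value with a
# prefix match ('pattern%') finds exactly the values that a leading-wildcard
# search ('%pattern%') would find against the original value.

def trailing_substrings(s):
    """All suffixes of s: 'cat' -> ['cat', 'at', 't']."""
    return [s[i:] for i in range(len(s))]

def like_contains(value, pattern):
    # LIKE '%pattern%' against the original value
    return pattern in value

def like_prefix_any(value, pattern):
    # LIKE 'pattern%' against each stored suffix
    return any(suffix.startswith(pattern) for suffix in trailing_substrings(value))

# The two approaches agree for every pattern, present or not:
for p in ("cat", "ca", "at", "a", "t", "xy"):
    assert like_contains("cat", p) == like_prefix_any("cat", p)

print(trailing_substrings("cat"))  # ['cat', 'at', 't']
```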
So, given that SQL Server does not have anything like this built in, how do we generate such a trigram?
A Separate Fragments Table
I'm going to stop referencing "trigram" here because this isn't truly analogous to that feature in PostgreSQL. Essentially, what we are going to do is build a separate table with all of the "fragments" of the original string. (So in the
cat example, there'd be a new table with those three rows, and
LIKE '%pattern%' searches would be directed against that new table as
LIKE 'pattern%' predicates.)
Before I start to show how my proposed solution would work, let me be absolutely clear that this solution should not be used in every single case where
LIKE '%wildcard%' searches are slow. Because of the way we're going to "explode" the source data into fragments, it is likely limited in practicality to smaller strings, such as addresses or names, as opposed to larger strings, like product descriptions or session abstracts.
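To quantify why this is limited to smaller strings: an n-character value explodes into n fragment rows holding n(n+1)/2 characters in total, so storage grows quadratically with string length. A quick back-of-the-envelope sketch (Python, illustrative only):

```python
# Rough sketch of the storage blow-up from fragmenting strings: an
# n-character value yields n fragment rows totaling n*(n+1)/2 characters.

def fragment_cost(n):
    """Rows and total characters stored for one n-character value."""
    rows = n
    chars = n * (n + 1) // 2
    return rows, chars

# A 22-character address is cheap; a 1,000-character description is not:
print(fragment_cost(22))    # (22, 253)
print(fragment_cost(1000))  # (1000, 500500)
```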
A more practical example than
cat is to look at the
Sales.Customers table in the WideWorldImporters sample database. It has address lines (both
nvarchar(60)), with the valuable address info in the column
DeliveryAddressLine2 (for reasons unknown). Someone might be looking for anyone who lives on a street named
Hudecova, so they will be running a search like this:
SELECT CustomerID, DeliveryAddressLine2
  FROM Sales.Customers
  WHERE DeliveryAddressLine2 LIKE N'%Hudecova%';

/* This returns two rows:

CustomerID  DeliveryAddressLine2
----------  ----------------------
61          1695 Hudecova Avenue
181         1846 Hudecova Crescent
*/
As you would expect, SQL Server needs to perform a scan to locate those two rows. That should be simple; however, because of the complexity of the table, this trivial query yields a very messy execution plan (the scan is highlighted, and has a warning for residual I/O):
Blecch. To keep our lives simple, and to make sure we don't chase a bunch of red herrings, let's create a new table with a subset of the columns. To start, we'll just insert those same two rows from above:
CREATE TABLE Sales.CustomersCopy
(
  CustomerID int IDENTITY(1,1) CONSTRAINT PK_CustomersCopy PRIMARY KEY,
  CustomerName         nvarchar(100) NOT NULL,
  DeliveryAddressLine1 nvarchar(60)  NOT NULL,
  DeliveryAddressLine2 nvarchar(60)
);
GO

INSERT Sales.CustomersCopy
(
  CustomerName,
  DeliveryAddressLine1,
  DeliveryAddressLine2
)
SELECT CustomerName, DeliveryAddressLine1, DeliveryAddressLine2
  FROM Sales.Customers
  WHERE DeliveryAddressLine2 LIKE N'%Hudecova%';
Now, if we run the same query we ran against the main table, we get something a lot more reasonable to look at as a starting point. This is still going to be a scan no matter what we do – if we add an index with
DeliveryAddressLine2 as the leading key column, we'll most likely get an index scan, possibly followed by a key lookup, depending on whether the index covers the columns in the query. As is, we get a clustered index scan:
Next, let's create a function that will "explode" these address values into fragments. We would expect
1846 Hudecova Crescent, for example, to have the following set of fragments:
- 1846 Hudecova Crescent
- 846 Hudecova Crescent
- 46 Hudecova Crescent
- 6 Hudecova Crescent
-  Hudecova Crescent (this fragment starts with the space before "Hudecova")
- Hudecova Crescent
- udecova Crescent
- decova Crescent
- ecova Crescent
- cova Crescent
- ova Crescent
- va Crescent
- a Crescent
It is fairly trivial to write a function that will produce this output – we just need a recursive CTE that can be used to step through each character throughout the length of the input:
CREATE FUNCTION dbo.CreateStringFragments( @input nvarchar(60) )
RETURNS TABLE WITH SCHEMABINDING
AS
  RETURN
  (
    WITH x(x) AS
    (
      SELECT 1 UNION ALL SELECT x+1 FROM x WHERE x < (LEN(@input))
    )
    SELECT Fragment = SUBSTRING(@input, x, LEN(@input)) FROM x
  );
GO

SELECT Fragment FROM dbo.CreateStringFragments(N'1846 Hudecova Crescent');
-- same output as above bulleted list
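As a sanity check on what the TVF should return, here is a quick Python re-implementation of the same logic (illustrative only; value[i:] mirrors SUBSTRING(@input, x, LEN(@input)) for x = 1 through LEN(@input)):

```python
# Python equivalent of dbo.CreateStringFragments, used only to verify the
# expected fragment list for '1846 Hudecova Crescent'.

def create_string_fragments(value):
    # One trailing substring per character position, like the recursive CTE.
    return [value[i:] for i in range(len(value))]

fragments = create_string_fragments("1846 Hudecova Crescent")
print(len(fragments))   # 22 - one fragment per character
print(fragments[4])     # ' Hudecova Crescent' (note the leading space)
print(fragments[5])     # 'Hudecova Crescent'
```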
Now, let's create a table that will store all of the address fragments, and which Customer they belong to:
CREATE TABLE Sales.CustomerAddressFragments
(
  CustomerID int NOT NULL,
  Fragment   nvarchar(60) NOT NULL,
  CONSTRAINT FK_Customers FOREIGN KEY(CustomerID)
    REFERENCES Sales.CustomersCopy(CustomerID)
);

CREATE CLUSTERED INDEX s_cat
  ON Sales.CustomerAddressFragments(Fragment, CustomerID);
Then we can populate it like this:
INSERT Sales.CustomerAddressFragments(CustomerID, Fragment)
SELECT c.CustomerID, f.Fragment
  FROM Sales.CustomersCopy AS c
  CROSS APPLY dbo.CreateStringFragments(c.DeliveryAddressLine2) AS f;
For our two values, this inserts 42 rows (one address has 20 characters, and the other has 22). Now our query becomes:
SELECT c.CustomerID, c.DeliveryAddressLine2
  FROM Sales.CustomersCopy AS c
  INNER JOIN Sales.CustomerAddressFragments AS f
    ON f.CustomerID = c.CustomerID
    AND f.Fragment LIKE N'Hudecova%';
Here is a much nicer plan – two clustered index *seeks* and a nested loops join:
We can also do this another way, using EXISTS:
SELECT c.CustomerID, c.DeliveryAddressLine2
  FROM Sales.CustomersCopy AS c
  WHERE EXISTS
  (
    SELECT 1 FROM Sales.CustomerAddressFragments
      WHERE CustomerID = c.CustomerID
      AND Fragment LIKE N'Hudecova%'
  );
Here is that plan, which looks on the surface to be more expensive – it chooses to scan the CustomersCopy table. We'll see shortly why this is the better query approach:
Now, on this massive data set of 42 rows, the differences seen in duration and I/O are so minuscule they're irrelevant (and in fact, at this small size, the scan against the base table looks cheaper in terms of I/O than the other two plans that use the fragment table):
What if we were to populate these tables with a lot more data? My copy of
Sales.Customers has 663 rows, so if we cross join that against itself, we'd get somewhere near 440,000 rows. So let's just mash up 400K and generate a much larger table:
TRUNCATE TABLE Sales.CustomerAddressFragments;
DELETE Sales.CustomersCopy;
DBCC FREEPROCCACHE WITH NO_INFOMSGS;

INSERT Sales.CustomersCopy WITH (TABLOCKX)
  (CustomerName, DeliveryAddressLine1, DeliveryAddressLine2)
SELECT TOP (400000) c1.CustomerName, c1.DeliveryAddressLine1, c2.DeliveryAddressLine2
  FROM Sales.Customers AS c1
  CROSS JOIN Sales.Customers AS c2
  ORDER BY NEWID(); -- fun for distribution

-- this will take a bit longer - on my system, ~4 minutes
-- probably because I didn't bother to pre-expand files
INSERT Sales.CustomerAddressFragments WITH (TABLOCKX) (CustomerID, Fragment)
SELECT c.CustomerID, f.Fragment
  FROM Sales.CustomersCopy AS c
  CROSS APPLY dbo.CreateStringFragments(c.DeliveryAddressLine2) AS f;
-- repeated for compressed version of the table
-- 7.25 million rows, 278MB (79MB compressed; saving those tests for another day)
Now to compare performance and execution plans given a variety of possible search parameters, I tested the above three queries with these predicates:
| Query form                               | Patterns tested                             |
|------------------------------------------|---------------------------------------------|
| WHERE DeliveryAddressLine2 LIKE …        | N'%Hudecova%', N'%cova%', N'%ova%', N'%va%' |
| JOIN … WHERE Fragment LIKE …             | N'Hudecova%', N'cova%', N'ova%', N'va%'     |
| WHERE EXISTS (… WHERE Fragment LIKE …)   | N'Hudecova%', N'cova%', N'ova%', N'va%'     |
As we remove letters from the search pattern, I would expect to see more rows output, and perhaps different strategies chosen by the optimizer. Let's see how it went for each pattern:
For this pattern, the scan was still the same for the LIKE condition; however, with more data, the cost is much higher. The seeks into the fragments table really pay off at this row count (1,206), even with really low estimates. The EXISTS plan adds a distinct sort, which you would expect to add to the cost, though in this case it ends up doing fewer reads:
As we strip some letters off our predicate, we see that the reads are actually a bit higher than with the original clustered index scan, and now we over-estimate the rows. Even so, our duration remains significantly lower with both fragment queries; the difference this time is more subtle – both have sorts (only the EXISTS plan's sort is distinct):
Stripping an additional letter didn't change much; though it is worth noting how much the estimated and actual rows jump here – meaning that this may be a common substring pattern. The original LIKE query is still quite a bit slower than the fragment queries.
Down to two letters, this does introduce our first discrepancy. Notice that the JOIN produces more rows than the original query or the EXISTS. Why would that be?
An address like 899 Valentova Road will have two rows in the fragments table that start with va (case sensitivity aside), so you'll match on both Fragment = N'Valentova Road' and Fragment = N'va Road'. I'll save you the searching and provide a single example of many:
SELECT TOP (2) c.CustomerID, c.DeliveryAddressLine2, f.Fragment
  FROM Sales.CustomersCopy AS c
  INNER JOIN Sales.CustomerAddressFragments AS f
    ON c.CustomerID = f.CustomerID
  WHERE f.Fragment LIKE N'va%'
    AND c.DeliveryAddressLine2 = N'899 Valentova Road'
    AND f.CustomerID = 351;

/*
CustomerID  DeliveryAddressLine2  Fragment
----------  --------------------  --------------
351         899 Valentova Road    va Road
351         899 Valentova Road    Valentova Road
*/
This readily explains why the JOIN produces more rows; if you want to match the output of the original LIKE query, you should use EXISTS. The fact that the correct results can usually also be obtained in a less resource-intensive way is just a bonus. (I'd be nervous to see people choose the JOIN even if it were repeatedly the more efficient option – you should always favor correct results over worrying about performance.)
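The difference between the two query forms boils down to a few lines. Here is a Python stand-in for the two T-SQL queries (illustrative only), using the single example row above:

```python
# Why the JOIN over-counts while EXISTS does not, simulated on the example
# row for CustomerID 351.

customers = {351: "899 Valentova Road"}

# One fragment row per trailing substring, as dbo.CreateStringFragments produces.
fragments = [(351, s[i:]) for s in ["899 Valentova Road"] for i in range(len(s))]

pattern = "va"  # the search term, as in LIKE N'va%' (case-insensitive, like the SQL)

# JOIN semantics: one output row per matching fragment row.
join_rows = [(cid, customers[cid])
             for cid, frag in fragments
             if frag.lower().startswith(pattern)]

# EXISTS semantics: one output row per customer with at least one matching fragment.
exists_rows = [(cid, addr)
               for cid, addr in customers.items()
               if any(f.lower().startswith(pattern) for c, f in fragments if c == cid)]

print(len(join_rows))    # 2 - 'Valentova Road' and 'va Road' both match
print(len(exists_rows))  # 1 - the customer appears once
```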
It's clear that in this specific case – with an address column of
nvarchar(60) and a max length of 26 characters – breaking up each address into fragments can bring some relief to otherwise expensive "leading wildcard" searches. The better payoff seems to happen when the search pattern is larger and, as a result, more unique. I've also demonstrated why EXISTS is better in scenarios where multiple matches are possible – with a JOIN, you will get redundant output unless you add some "greatest n per group" logic.
I will be the first to admit that this solution is imperfect, incomplete, and not thoroughly fleshed out:
- You'll need to keep the fragments table synchronized with the base table using triggers – the simplest approach would be, for inserts and updates, to delete all rows for the affected customers and re-insert them; for deletes, obviously, just remove the rows from the fragments table.
- As mentioned, this solution worked better for this specific data size, and may not do so well for other string lengths. It would warrant further testing to ensure it is appropriate for your column size, average value length, and typical search parameter length.
- Since we will have a lot of copies of fragments like "Crescent" and "Street" (never mind all of the same or similar street names and house numbers), we could further normalize the design by storing each unique fragment in a separate table, plus yet another table that acts as a many-to-many junction. That would be a lot more cumbersome to set up, and I'm not quite sure there would be much to gain.
- I haven't fully investigated page compression yet; it seems that – at least in theory – this could provide some benefit. I also have a feeling a columnstore implementation may be beneficial in some cases, too.
Anyway, all that said, I just wanted to share it as a developing idea, and solicit any feedback on its practicality for specific use cases. I can dig into more specifics for a future post if you can let me know what aspects of this solution interest you.