Hit-highlighting is a feature that many people wish SQL Server's Full-Text Search would support natively. This is where you can return the entire document (or an excerpt) and point out the words or phrases that helped match that document to the search. Doing so in an efficient and accurate manner is no easy task, as I found out first hand.
As an example of hit-highlighting: when you perform a search in Google or Bing, you get the key words bolded in both the title and the excerpt (click either image to enlarge):
[As an aside, I find two things amusing here: (1) that Bing favors Microsoft properties a lot more than Google does, and (2) that Bing bothers returning 2.2 million results, many of which are likely irrelevant.]
These excerpts are commonly called "snippets" or "query-biased summarizations." We've been asking for this functionality in SQL Server for some time, but have yet to hear any good news from Microsoft:
- Connect #295100 : Full-text search summaries (hit-highlighting)
- Connect #722324 : Would be nice if SQL Full Text Search provided snippet / highlighting support
The question pops up on Stack Overflow from time to time as well:
- How to do hit-highlighting of results from a SQL Server full-text query
- Will Sql Server 2012 FTS have native support for hit highlighting?
There are some partial solutions. This script from Mike Kramar, for example, will produce a hit-highlighted extract, but does not apply the same logic (such as language-specific word breakers) to the document itself. It also uses an absolute character count, so the excerpt can begin and end with partial words (as I will demonstrate shortly). The latter is pretty easy to fix, but another issue is that it loads the entire document into memory, rather than performing any kind of streaming. I suspect that in full-text indexes with large document sizes, this will be a noticeable performance hit. For now I'll focus on a relatively small average document size (35 KB).
A simple example
So let's say we have a very simple table, with a full-text index defined:
CREATE FULLTEXT CATALOG [FTSDemo];
GO
CREATE TABLE [dbo].[Document]
(
[ID] INT IDENTITY(1001,1) NOT NULL,
[Url] NVARCHAR(200) NOT NULL,
[Date] DATE NOT NULL,
[Title] NVARCHAR(200) NOT NULL,
[Content] NVARCHAR(MAX) NOT NULL,
CONSTRAINT PK_DOCUMENT PRIMARY KEY(ID)
);
GO
CREATE FULLTEXT INDEX ON [dbo].[Document]
(
[Content] LANGUAGE [English],
[Title] LANGUAGE [English]
)
KEY INDEX [PK_Document] ON ([FTSDemo]);
This table is populated with a few documents (specifically, 7), such as the Declaration of Independence, and Nelson Mandela's "I am prepared to die" speech. A typical full-text search against this table might be:
SELECT d.Title, d.[Content]
FROM dbo.[Document] AS d
INNER JOIN CONTAINSTABLE(dbo.[Document], *, N'states') AS t
ON d.ID = t.[KEY]
ORDER BY [RANK] DESC;
The result returns 4 rows out of 7:
Now using a UDF function like Mike Kramar's:
SELECT d.Title,
Excerpt = dbo.HighLightSearch(d.[Content], N'states', 'font-weight:bold', 80)
FROM dbo.[Document] AS d
INNER JOIN CONTAINSTABLE(dbo.[Document], *, N'states') AS t
ON d.ID = t.[KEY]
ORDER BY [RANK] DESC;
The results show how the excerpt works: a <SPAN>
tag is injected at the first keyword, and the excerpt is carved out based on an offset from that position (with no consideration for using complete words):
(Again, this is something that can be fixed, but I want to be sure I properly represent what is out there now.)
ThinkHighlight
Eran Meyuchas of Interactive Thoughts has developed a component that solves many of these issues. ThinkHighlight is implemented as a CLR Assembly with two CLR scalar-valued functions:
(You'll also see Mike Kramar's UDF in the list of functions.)
Now, without getting into all of the details about installing and activating the assembly on your system, here is how the above query would be represented with ThinkHighlight:
SELECT d.Title,
Excerpt = dbo.HitHighlight(dbo.HitHighlightContext('Document', 'Content', N'states', -1),
'top-fragment', 100, d.ID)
FROM dbo.[Document] AS d
INNER JOIN CONTAINSTABLE(dbo.[Document], *, N'states') AS t
ON d.ID = t.[KEY]
ORDER BY t.[RANK] DESC;
The results show how the most relevant keywords are highlighted, and an excerpt is derived from that based on full words and an offset from the term being highlighted:
Some additional advantages that I haven't demonstrated here include the ability to choose different summarization strategies, controlling the presentation of each keyword (rather than all) using unique CSS, as well as support for multiple languages and even documents in binary format (most IFilters are supported).
Performance results
Initially I tested the runtime metrics for the three queries using SQL Sentry Plan Explorer, against the 7-row table. The results were:
Next I wanted to see how they would compare on a much larger data size. I inserted the table into itself until I was at 4,000 rows, then ran the following query:
SET STATISTICS TIME ON;
GO
SELECT /* FTS */ d.Title, d.[Content]
FROM dbo.[Document] AS d
INNER JOIN CONTAINSTABLE(dbo.[Document], *, N'states') AS t
ON d.ID = t.[KEY]
ORDER BY [RANK] DESC;
GO
SELECT /* UDF */ d.Title,
Excerpt = dbo.HighLightSearch(d.[Content], N'states', 'font-weight:bold', 100)
FROM dbo.[Document] AS d
INNER JOIN CONTAINSTABLE(dbo.[Document], *, N'states') AS t
ON d.ID = t.[KEY]
ORDER BY [RANK] DESC;
GO
SELECT /* ThinkHighlight */ d.Title,
Excerpt = dbo.HitHighlight(dbo.HitHighlightContext('Document', 'Content', N'states', -1),
'top-fragment', 100, d.ID)
FROM dbo.[Document] AS d
INNER JOIN CONTAINSTABLE(dbo.[Document], *, N'states') AS t
ON d.ID = t.[KEY]
ORDER BY t.[RANK] DESC;
GO
SET STATISTICS TIME OFF;
GO
I also monitored sys.dm_exec_memory_grants while the queries were running, to pick up any discrepancies in memory grants. Results averaging over 10 runs:
While both hit-highlighting options incur a significant penalty over not highlighting at all, the ThinkHighlight solution – with more flexible options – represents a very marginal incremental cost in terms of duration (~1%), while using significantly less memory (36%) than the UDF variant.
Conclusion
It should not come as a surprise that hit-highlighting is an expensive operation, and based on the complexity of what has to be supported (think multiple languages), that very few solutions exist out there. I think Mike Kramar has done an excellent job producing a baseline UDF that gets you a good way toward solving the problem, but I was pleasantly surprised to find a more robust commercial offering – and found it to be very stable, even in beta form. I do plan to perform more thorough tests using a wider range of document sizes and types. In the meantime, if hit-highlighting is a part of your application requirements, you should try out Mike Kramar's UDF and consider taking ThinkHighlight for a test drive.
Hmm, UDF is not working for accented words like 'café' because display_term column is not showing them… if 'café' and 'cafe' are different words (with different meaning) hit highlighting is not working properly.
Robert, I didn't write the UDF or the CLR component, so I can't speak to whether that situation was intended. There is probably an easy fix, though, if you want to treat e and é the same. For example:
Results:
café cafx
No, its the reverse: in polish "mila" and "miła" are two completely different words (first means "mile" and second means "nice"). But display_term for both is "mila"… same could be for languages like german or norwegian.
Ok Robert, I'm not sure I'm going to have a solution for you, I'm approaching this from a narrow-minded English view. :-) The author of the CLR component insists that ThinkHighlight already supports accent insensitivity. The UDF should be able to do so also (try adding a case-insensitive COLLATE clause to the CHARINDEX). But for further help I'll have to direct you to its author.