Filtered indexes are amazingly powerful, but I still see some confusion out there about them – particularly about the columns that are used in the filters, and what happens when you want to tighten the filters.
A recent question on dba.stackexchange asked for help about why columns used in the filter of a filtered index should be included in the 'included' columns of the index. Excellent question – except that I felt like it started on a poor premise, because those columns shouldn't have to be included in the index. Yes they help, but not in the way that the question seemed to suggest.
To save you looking at the question itself, here's a quick summary:
To satisfy this query…
SELECT Id, DisplayName FROM Users WHERE Reputation > 400000;
…the following filtered index is pretty good:
CREATE UNIQUE NONCLUSTERED INDEX Users_400k_Club ON dbo.Users ( DisplayName, Id ) INCLUDE ( Reputation ) WHERE Reputation > 400000;
But despite having this index in place, the Query Optimizer recommends the following index if the filtered value is tightened to, say, 450000.
CREATE NONCLUSTERED INDEX IndexThatWasMissing ON dbo.Users ( Reputation ) INCLUDE ( DisplayName, Id );
I'm paraphrasing the question a little here, which starts by referring to this situation and then builds a different example, but the idea is the same. I just didn't want to make things more complicated by involving a separate table.
Point is – the index suggested by the QO is the original index but turned on its head. The original index had Reputation in the INCLUDE list, and DisplayName and Id as key columns, while the new recommended index is the opposite way around with Reputation as the key column and DisplayName & ID in the INCLUDE. Let's look into why.
The question refers to a post by Erik Darling, where he explains that he tuned the '450,000' query above by putting Reputation into the INCLUDE column. Erik shows that without Reputation in the INCLUDE list, a query which filters to a higher value of Reputation needs to do Lookups (bad!), or maybe even give up entirely on the filtered index (potentially even worse). He concludes that having the Reputation column in the INCLUDE list lets SQL have statistics, so that it can make better choices, and shows that with Reputation in the INCLUDE a variety of queries that all filter on higher Reputation values all scan his filtered index.
In an answer to the dba.stackexchange question, Brent Ozar points out that Erik's improvements aren't particularly great because they cause Scans. I'll come back to that one, because it's an interesting point in itself.
First let's think a little about indexes in general.
An index provides an ordered structure to a set of data. (I could be pedantic and point out that reading through the data in an index from start to end might jump you from page to page in a seeming haphazard way, but still as you're reading through pages, following the pointers from one page to the next you can be confident the data is ordered. Within each page you might even jump around to read the data in order, but there is a list showing you which parts (slots) of the page should be read in which order. There really is no point in my pedantry except to answer those equally pedantic who will comment if I don't.)
And this order is according to the key columns – that's the easy bit that everyone gets. It's useful not only for being able to avoid re-ordering the data later, but also for being able to quickly locate any particular row or range of rows by those columns.
The leaf levels of the index contain the values in any columns in the INCLUDE list, or in the case of a Clustered Index, the values across all the columns in the table (except non-persisted computed columns). The other levels in the index contain just the key columns and (if the index is not unique) the unique address of the row – which is either the keys of the clustered index (with the row's uniquifier if the clustered index isn't unique either) or the RowID value for a heap, enough to allow easy access to all the other column values for the row. The leaf levels also include all the ‘address’ information.
But that's not the interesting bit for this post. The interesting bit for this post is what I mean by "to a set of data". Remember I said "An index provides an ordered structure to a set of data".
In a clustered index, that set of data is the whole table, but it could be something else. You can probably already picture how most non-clustered indexes don't involve all the columns of the table. This is one of the things that make non-clustered indexes so useful, because they're typically a lot smaller than the underlying table.
In the case of an indexed view, our set of data it could be the results of a whole query, including joins across many tables! That's for another post.
But in a filtered index, it's not just a copy of a subset of columns, but a subset of rows too. So in the example here, the index is only across the users with more than 400k reputation.
CREATE UNIQUE NONCLUSTERED INDEX Users_400k_Club_NoInclude ON dbo.Users ( DisplayName, Id ) WHERE Reputation > 400000;
This index takes the users who have more than 400k reputation, and orders them by DisplayName and Id. It can be unique because (assumedly) the Id column is already unique. If you try something similar on your own table, you might need to be careful of that.
But at this point, the index doesn't care what the Reputation is for each user – it just cares whether the Reputation is high enough to be in the index or not. If a user's reputation gets updated and it tips over the threshold that user's DisplayName and Id will be inserted into the index. If it drops below, it'll get deleted from the index. It's just like having a separate table for the high rollers, except that we get people into that table by increasing their Reputation value over the 400k threshold in the underlying table. It can do this without having to actually store the Reputation value itself.
So now if we want to find people who have a threshold over 450k, that index is missing some information.
Sure, we could confidently say that everyone we'll find is in that index – but the index doesn't contain enough information in itself to filter any further on Reputation. If I told you I had an alphabetical list of Best Picture Oscar winning films from the 1990s (American Beauty, Braveheart, Dances With Wolves, English Patient, Forrest Gump, Schindler's List, Shakespeare in Love, Silence of the Lambs, Titanic, Unforgiven), then I can assure you that the winners for 1994-1996 would be a subset of those, but I can't answer the question without first getting some more information.
Obviously my filtered index would be more useful if I had included the year, and potentially even more so if the year were a key column, since my new query wants to find the ones for 1994-1996. But I probably designed this index around a query to list all the movies from the 1990s in alphabetical order. That query doesn't care about what the actual year is, only whether it's in the 1990s or not, and I don't even need to return the year – just the title – so I can scan my filtered index to get the results. For that query I don't even need to reorder the results or find the starting point – my index really is perfect.
A more practical example of not caring about the value of the column in the filter is on status, such as:
WHERE IsActive = 1
I frequently see code that moves data from one table to another when rows stop being 'active'. People don't want old rows cluttering up their table, and they recognise that their 'hot' data is only a small subset of all their data. So they move their cooling data off into an Archive table, keeping their Active table small.
A filtered index can do this for you. Behind the scenes. As soon as you update the row and change that IsActive column to something other than 1. If you only care about having active data in most of your indexes, then filtered indexes are ideal. It'll even bring rows back into the indexes if the IsActive value changes back to 1.
But you don't need to put IsActive in the INCLUDE list to achieve this. Why would you want to store the value – you already know what the value is – it's 1! Unless you're asking to return the value you shouldn't need it. And why would you return the value when you already know that the answer is 1, right?! Except that frustratingly, the statistics that Erik refers to in his post will take advantage of being in the INCLUDE list. You don't need it for the query, but you should include it for the stats.
Let's think about what the Query Optimizer needs to do to figure out the usefulness of an index.
Before it can do much at all, it needs to consider whether the index is a candidate. No point in using an index if it doesn't have all the rows that might be needed – not unless we have an effective way of getting the rest. If I want movies from 1985-1995, then my index of 1990s movies is pretty pointless. But for 1994-1996, maybe it's not bad.
At this point, just like any index consideration, I need to think about whether it will help enough for finding the data and getting it into an order that will help executing the rest of the query (possibly for a Merge Join, Stream Aggregate, satisfying an ORDER BY, or various other reasons). If my query filter matches the index filter exactly, then I don't need to filter any further – just using the index is enough. This sounds great, but if it doesn't match exactly, if my query filter is tighter that the index filter (like my 1994-1996 example, or Erik's 450,000), I'm going to need to have those Year values or Reputation values to check – hopefully getting them either from the INCLUDEd at the leaf level or somewhere in my key columns. If they're not in the index, I'm going to have to do a Lookup for every row in my filtered index (and ideally, have an idea about how many times my Lookup will be called, which are the stats that Erik wants the column included for).
Ideally, any index I plan to use is ordered correctly (via the keys), INCLUDEs all the columns I need to return, and is pre-filtered to just the rows I need. That would be the perfect index, and my execution plan will be a Scan.
That's right, a SCAN. Not a Seek, but a Scan. It'll start on the first page of my index and keep giving me rows until I've got as many as I need, or until there are no more rows to return. Not skipping any, not sorting them – just giving me the rows in order.
A Seek would suggest that I don't need the whole index, which means I'm wasting resources in maintaining that part of the index, and to query it I have to find the starting point and keep checking rows to see if I've hit the end or not. If my Scan has a Predicate, then sure, I'm having to look through (and test) more data than I need to, but if my index filters are perfect, then the Query Optimizer should recognise that and not have to perform those checks.
INCLUDEs aren't critical for filtered indexes. They're useful for providing easy access to columns which might be useful for your query, and if you happen to be tightening what's in your filtered index by any column, whether it's mentioned in the filter or not, you should consider having that column in the mix. But at that point you ought to be asking whether your index's filter is the right one, what else you should have in your INCLUDE list, and even what the key column(s) should be. Erik's queries weren't playing nicely because he needed information that wasn't in the index, even though he had mentioned the column in the filter. He found a good use for the statistics too, and I would still encourage you to include the filter columns for that reason. But putting them in an INCLUDE doesn't allow them to suddenly start doing a Seek, because that's not how any index works, whether filtered or not.
I want you, reader, to understand filtered indexes really well. They're incredibly useful and, when you start to picture them like tables in their own rights, can become part of your overall database design. They're also a reason for always using the ANSI_NULLs and QUOTED_IDENTIFIER settings, because you'll get errors from filtered index unless those settings are ON, but hopefully you already make sure they're always on anyway.
Oh, and those movies were Forrest Gump, Braveheart, and The English Patient.