Performance Myths : Clustered vs. Non-Clustered Indexes - SQLPerformance.com
SentryOne - SQL Sentry
Mar 172017
 

Performance Myths : Clustered vs. Non-Clustered Indexes

I was recently scolded for suggesting that, in some cases, a non-clustered index will perform better for a particular query than the clustered index. This person stated that the clustered index is always best because it is always covering by definition, and that any non-clustered index with some or all of the same key columns was always redundant.

I will happily agree that the clustered index is always covering (and to avoid any ambiguity here, we're going to stick to disk-based tables with traditional B-tree indexes).

Warehouse.StockItemTransactions column listI disagree, though, that a clustered index is always faster than a non-clustered index. I also disagree that it is always redundant to create a non-clustered index or unique constraint consisting of the same (or some of the same) columns in the clustering key.

Let's take this example, Warehouse.StockItemTransactions, from WideWorldImporters. The clustered index is implemented through a primary key on just the StockItemTransactionID column (pretty typical when you have some kind of surrogate ID generated by an IDENTITY or a SEQUENCE).

It's a pretty common thing to require a count of the whole table (though in many cases there are better ways). This can be for casual inspection or as part of a pagination procedure. Most people will do it this way:

SELECT COUNT(*) 
  FROM Warehouse.StockItemTransactions;

With the current schema, this will use a non-clustered index:

Plan for COUNT query

We know that the non-clustered index does not contain all of the columns in the clustered index. The count operation only needs to be sure that all rows are included, without caring about which columns are present, so SQL Server will usually choose the index with the smallest number of pages (in this case, the index chosen has ~414 pages).

Now let's run the query again, this time comparing it to a hinted query that forces the use of the clustered index.

SELECT COUNT(*) 
  FROM Warehouse.StockItemTransactions;
 
SELECT COUNT(*) /* with hint */
  FROM Warehouse.StockItemTransactions WITH (INDEX(1));

We get an almost identical plan shape, but we can see a huge difference in reads (414 for the chosen index vs. 1,982 for the clustered index):

Plan and runtime metrics for COUNT queries (with and without index hints)

Duration is slightly higher for the clustered index, but the difference is negligible when we're dealing with a small amount of cached data on a fast disk. That discrepancy would be much more pronounced with more data, on a slow disk, or on a system with memory pressure.

If we look at the tooltips for the scan operations, we can see that while the number of rows and estimated CPU costs are identical, the big difference comes from the estimated I/O cost (because SQL Server knows that there are more pages in the clustered index than the non-clustered index):

Tooltips for COUNT queries (with and without index hints)

We can see this difference even more clearly if we create a new, unique index on just the ID column (making it "redundant" with the clustered index, right?):

CREATE UNIQUE INDEX NewUniqueNCIIndex 
  ON Warehouse.StockItemTransactions(StockItemTransactionID);

Tooltip for scan of unique non-clustered indexRunning a similar query with an explicit index hint produces the same plan shape, but an even lower estimated I/O cost (and even lower durations) – see image at right. And if you run the original query without the hint, you'll see that SQL Server now chooses this index too.

It might seem obvious, but a lot of people would believe that the clustered index is the best choice here. SQL Server is almost always going to heavily favor whatever method will provide the cheapest way to perform all of the I/O, and in the case of a full scan, that's going to be the "skinniest" index. This can also happen with both types of seeks (singleton and range scans), at least when the index is covering.

Now, as always, that does not in any way mean that you should go and create additional indexes on all of your tables to satisfy count queries. Not only is that an inefficient way to check table size (again, see this article), but an index to support that would have to mean that you're running that query more often than you're updating the data. Remember that every index requires space on disk, space on memory, and all of the writes against the table must also touch every index (filtered indexes aside).

Summary

I could come up with many other examples that show when a non-clustered can be useful and worth the cost of maintenance, even when duplicating the key column(s) of the clustered index. Non-clustered indexes can be created with the same key columns but in a different key order, or with different ASC/DESC on the columns themselves to better support an alternate presentation order. You can also have non-clustered indexes that only carry a small subset of the rows through the use of a filter. Finally, if you can satisfy your most common queries with skinnier, non-clustered indexes, that is better for memory consumption as well.

But really, my point of this series is merely to show a counter-example that illustrates the folly of making blanket statements like this one. I'll leave you with an explanation from Paul White who, in a DBA.SE answer, explains why such a non-clustered index can in fact perform much better than a clustered index. This is true even when both use either type of seek:

  11 Responses to “Performance Myths : Clustered vs. Non-Clustered Indexes”

  1. I recently did something like this in a series of tables that include large (but not quite large enough to be off page) text fields. To satisfy analysis queries I created a non-clustered index with the same key as the clustered index and included all columns except the varchar(xxxx) columns.

    Maybe the same effect could have been achieved by forcing the text to be stored off row, but we decided for the sake of the small percent of queries which want to search through the text that would not be desirable.

  2. Excellent article, Aaron, especially this part:-

    "I disagree, though, that a clustered index is always faster than a non-clustered index. I also disagree that it is always redundant to create a non-clustered index or unique constraint consisting of the same (or some of the same) columns in the clustering key."

    I wish more people stated both of these more often. There are SO MANY times when they're both not true, and, as usual, the "it depends" answer applies to how to go about indexing.

    I find the biggest problem is that many people seem to just want a silver bullet, and be told what to do rather than thinking about their problem and working out what the best solution is; which results in "rules" that are then abused.

    • " people seem to just want a silver bullet, and be told what to do rather than thinking about their problem and working out what the best solution is"

      This has been my observation as well and seems to have only got worse over the years. School is one big multiple choice test and Google will tell you which green thingy to push. Critical thinking and problem solving are esoteric art forms now.

      I really look forward to this series. Thanks Aaron.

  3. Exactly for COUNT(*) we have much more powerfull way – use system tables.
    You don't need to think up a query self – simly copy query (use profiler) from standard DB-Report in SSMS: "Disk Usage by Table".
    This Report (query) calculates row counts for all tables in DB in millyseconds!!!
    In all our projects we everytime create table_stat VIEW with this query and than:
    SELECT row_count FROM table_stat WHERE table_name='MyTable'

    • Of course, and I talk about this here:

      This was just an example. The concept remains true for all kinds of other queries, including COUNT with a where clause (where you can't create a filtered index for every possible combination of where clauses), and queries that aren't looking for COUNT at all…

  4. On one of my tables (used for scheduling ServerJobs) I have a nonclustered index that contains all columns of the table (as the clustered).

    This makes sense, since this index is filtered for jobs that are not started yet and the application does a SELECT * FROM ServerJobs WHERE … Without the filtered index it would have to scan the whole table every x seconds, and when the filtered index would contain only the ID column it would need to do Key Lookups. Since there are typically only very few not started jobs, this index would be usually very small (only one data page) -> even such strange constructs can be the best solution in some situations.

  5. We had a case where we had a temp table with just 3 columns and we had a covering nonclustered index. First we changed it to a clustered index; then restored the nonclustered index so we had two identical indexes and let SQL Server choose which one to use. Then we got rid of the clustered index so we are back to where we started and not sure if that is best or not. But never mind, here come memory-optimized tables.

    • When you have an clustered index and identical covering nonclustered index, SQL Server will (always) use the NCI, since it is narrower (contains e.g. 5 columns while the clustered index contains all 50 columns). HEAPs (that is what you get when you have no clustered index) are not bad – as long you do not update columns. If you do this, you could have forwarding records which will kill performance.

      For insert-only tables a heap could be ok, but when you have to create nonclustered indexes on it – why do you not just create the most used index as clustered (or at least the index starting with the ID / orderdate / insertdate column)?

  6. Any examples of differences if the table has a clustered index already vs a heap for the speed of non-clustered indexes?

  7. I can't imagine anyone saying that a clustered index will "always" beat a nonclustered for any given query. Obviously creating additional tables that exactly match the query(ies) — which is effectively what covering nonclus indexes are — can save some overhead for that specific query.

    However, far too many people believe the egregious myth that an identity column is (almost) "always" the best clustering index, and all other needs should be handled by covering nonclustered indexes. Just automatically slap an identity column on the table and all's good, since that column is "narrow, ever-ascending, etc.". That's just false and most often creates vastly more overhead *overall*. If you tune one query at a time, you'll inevitably end up with vast amounts of covering indexes, and that's often not the best choice.

    • > I can't imagine anyone saying that a clustered index will "always" beat a nonclustered for any given query.

      I couldn't either. Until it happened.

 Leave a Reply

(required)

(required)