Every product has bugs, and SQL Server is no exception. Using product features in a slightly unusual way (or combining relatively new features together) is a great way to find them. Bugs can be interesting, and even educational, but perhaps some of the joys are lost when the discovery results in your pager going off at 4am, perhaps after a particularly social night out with friends…
The bug that is the subject of this post is probably reasonably rare in the wild, but it is not a classic edge case. I know of at least one consultant that has encountered it in a production system. On a completely unrelated subject, I should take this opportunity to say "hello" to the Grumpy Old DBA (blog).
I will start with some relevant background on merge joins. If you are certain you already know everything there is to know about merge join, or just want to cut to the chase, feel free to scroll down to the section titled, "The Bug."
Merge join is not a terribly complicated thing, and can be very efficient in the right circumstances. It requires that its inputs are sorted on the join keys, and performs best in one-to-many mode (where at least of its inputs is unique on the join keys). For moderately-sized one-to-many joins, serial merge join is not a bad choice at all, provided the input sorting requirements can be met without performing an explicit sort.
Avoiding a sort is most commonly achieved by exploiting the ordering provided by an index. Merge join can also take advantage of preserved sort order from an earlier, unavoidable sort. A cool thing about merge join is that it can stop processing input rows as soon as either input runs out of rows. One last thing: merge join does not care whether the input sort order is ascending or descending (though both inputs must be the same). The following example uses a standard Numbers table to illustrate most of the points above:
CREATE TABLE #T1 (col1 integer CONSTRAINT PK1 PRIMARY KEY (col1 DESC)); CREATE TABLE #T2 (col1 integer CONSTRAINT PK2 PRIMARY KEY (col1 DESC)); INSERT #T1 SELECT n FROM dbo.Numbers WHERE n BETWEEN 10000 AND 19999; INSERT #T2 SELECT n FROM dbo.Numbers WHERE n BETWEEN 18000 AND 21999;
Notice that the indexes enforcing the primary keys on those two tables are defined as descending. The query plan for the
INSERT has a number of interesting features:
Reading left to right (as is only sensible!) the Clustered Index Insert has the "DML Request Sort" property set. This means the operator requires rows in Clustered Index key order. The clustered index (enforcing the primary key in this case) is defined as
DESC, so rows with higher values are required to arrive first. The clustered index on my Numbers table is
ASC, so the query optimizer avoids an explicit sort by seeking to the highest match in the Numbers table (21,999) first, then scanning toward the lowest match (18,000) in reverse index order. The "Plan Tree" view in SQL Sentry Plan Explorer shows the reverse (backward) scan clearly:
Backward scanning reverses the natural order of the index. A backward scan of an
ASC index key returns rows in descending key order; a backward scan of a
DESC index key returns rows in ascending key order. The "scan direction" does not indicate returned key order by itself – you have to know whether the index is
DESC to make that determination.
Using these test tables and data (
T1 has 10,000 rows numbered from 10,000 to 19,999 inclusive;
T2 has 4,000 rows numbered from 18,000 to 21,999) the following query joins the two tables together and returns results in descending order of both keys:
SELECT T1.col1, T2.col1 FROM #T1 AS T1 JOIN #T2 AS T2 ON T2.col1 = T1.col1 ORDER BY T1.col1 DESC, T2.col1 DESC;
The query returns the correct matching 2,000 rows as you would expect. The post-execution plan is as follows:
The Merge Join is not running in many-to-many mode (the top input is unique on the join keys) and the 2,000 row cardinality estimate is exactly correct. The Clustered Index Scan of table
T2 is ordered (though we have to wait a moment to discover whether that order is forward or backward) and the cardinality estimate of 4,000 rows is also exactly right. The Clustered Index Scan of table
T1 is also ordered, but only 2,001 rows were read whereas 10,000 were estimated. The plan tree view shows both Clustered Index Scans are ordered forward:
Recall that reading a
FORWARD will produce rows in reverse key order. This is exactly what is required by the
ORDER BY T1.col DESC, T2.col1 DESC clause, so no explicit sort is necessary. Pseudo-code for one-to-many Merge Join (reproduced from Craig Freedman's Merge Join blog) is:
The descending order scan of
T1 returns rows starting at 19,999 and working down towards 10,000. The descending order scan of
T2 returns rows starting at 21,999 and working down towards 18,000. All 4,000 rows in
T2 are eventually read, but the iterative merge process stops when key value 17,999 is read from
T2 runs out of rows. Merge processing therefore completes without fully reading
T1. It reads rows from 19,999 down to 17,999 inclusive; a total of 2,001 rows as shown in the execution plan above.
Feel free to re-run the test with
ASC indexes instead, also changing the
ORDER BY clause from
ASC. The execution plan produced will be very similar, and no sorts will be needed.
To summarize the points that will be important in a moment, Merge Join requires join-key sorted inputs, but it does not matter whether the keys are sorted ascending or descending.
To reproduce the bug, at least one of our tables needs to be partitioned. To keep the results manageable, this example will use just a small number of rows, so the partitioning function needs small boundaries as well:
CREATE PARTITION FUNCTION PF (integer) AS RANGE RIGHT FOR VALUES (5, 10, 15); CREATE PARTITION SCHEME PS AS PARTITION PF ALL TO ([PRIMARY]);
The first table contains two columns, and is partitioned on the PRIMARY KEY:
CREATE TABLE dbo.T1 ( T1ID integer IDENTITY (1,1) NOT NULL, SomeID integer NOT NULL, CONSTRAINT [PK dbo.T1 T1ID] PRIMARY KEY CLUSTERED (T1ID) ON PS (T1ID) );
The second table is not partitioned. It contains a primary key and a column that will join to the first table:
CREATE TABLE dbo.T2 ( T2ID integer IDENTITY (1,1) NOT NULL, T1ID integer NOT NULL, CONSTRAINT [PK dbo.T2 T2ID] PRIMARY KEY CLUSTERED (T2ID) ON [PRIMARY] );
The Sample Data
The first table has 14 rows, all with the same value in the
SomeID column. SQL Server assigns the
IDENTITY column values, numbered 1 to 14.
INSERT dbo.T1 (SomeID) VALUES (123), (123), (123), (123), (123), (123), (123), (123), (123), (123), (123), (123), (123), (123);
The second table is simply populated with the
IDENTITY values from table one:
INSERT dbo.T2 (T1ID) SELECT T1ID FROM dbo.T1;
The data in the two tables looks like this:
The Test Query
The first query simply joins both tables, applying a single WHERE clause predicate (which happens to match all rows in this greatly simplified example):
SELECT T2.T2ID FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.T1ID = T1.T1ID WHERE T1.SomeID = 123;
The result contains all 14 rows, as expected:
Due to the small number of rows, the optimizer chooses a nested loops join plan for this query:
The results are the same (and still correct) if we force a hash or merge join:
SELECT T2.T2ID FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.T1ID = T1.T1ID WHERE T1.SomeID = 123 OPTION (HASH JOIN); SELECT T2.T2ID FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.T1ID = T1.T1ID WHERE T1.SomeID = 123 OPTION (MERGE JOIN);
The Merge Join there is one-to-many, with an explicit sort on
T1ID required for table
The Descending Index Problem
All is well until one day (for good reasons that need not concern us here) another administrator adds a descending index on the
SomeID column of table 1:
CREATE NONCLUSTERED INDEX [dbo.T1 SomeID] ON dbo.T1 (SomeID DESC);
Our query continues to produce correct results when the optimizer chooses a Nested Loops or Hash Join, but it is a different story when a Merge Join is used. The following still uses a query hint to force the Merge Join, but this is just a consequence of the low row counts in the example. The optimizer would naturally choose the same Merge Join plan with different table data.
SELECT T2.T2ID FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.T1ID = T1.T1ID WHERE T1.SomeID = 123 OPTION (MERGE JOIN);
The execution plan is:
The optimizer has chosen to use the new index, but the query now produces only five rows of output:
What happened to the other 9 rows? To be clear, this result is incorrect. The data has not changed, so all 14 rows should be returned (as they still are with a Nested Loops or Hash Join plan).
Cause and Explanation
The new nonclustered index on
SomeID is not declared as unique, so the clustered index key is silently added to all nonclustered index levels. SQL Server adds the
T1ID column (the clustered key) to the nonclustered index just as if we had created the index like this:
CREATE NONCLUSTERED INDEX [dbo.T1 SomeID] ON dbo.T1 (SomeID DESC, T1ID);
Notice the lack of a
DESC qualifier on the silently-added
T1ID key. Index keys are
ASC by default. This is not a problem in itself (though it contributes). The second thing that happens to our index automatically is that it is partitioned in the same way the base table is. So, the full index specification, if we were to write it out explicitly, would be:
CREATE NONCLUSTERED INDEX [dbo.T1 SomeID] ON dbo.T1 (SomeID DESC, T1ID ASC) ON PS (T1ID);
This is now quite a complex structure, with keys in all sorts of different orders. It is complex enough for the query optimizer to get it wrong when reasoning about the sort order provided by the index. To illustrate, consider the following simple query:
SELECT T1ID, PartitionID = $PARTITION.PF(T1ID) FROM dbo.T1 WHERE SomeID = 123 ORDER BY T1ID ASC;
The extra column will just show us which partition the current row belongs in. Otherwise, it is just a simple query that returns
T1ID values in ascending order,
WHERE SomeID = 123. Unfortunately, the results are not what is specified by the query:
The query requires that
T1ID values should be returned in ascending order, but that is not what we get. We do get values in ascending order per partition, but the partitions themselves are returned in reverse order! If the partitions were returned in ascending order (and the
T1ID values remained sorted within each partition as shown) the result would be correct.
The query plan shows that the optimizer was confused by the leading
DESC key of the index, and thought it needed to read the partitions in reverse order for correct results:
The partition seek starts at the right-most partition (4) and proceeds backwards to partition 1. You might think we could fix the issue by explicitly sorting on partition number
ASC in the
ORDER BY clause:
SELECT T1ID, PartitionID = $PARTITION.PF(T1ID) FROM dbo.T1 WHERE SomeID = 123 ORDER BY PartitionID ASC, -- New! T1ID ASC;
This query returns the same results (this is not a misprint or a copy/paste error):
The partition id is still in descending order (not ascending, as specified) and
T1ID is only sorted ascending within each partition. Such is the optimizer's confusion, it really does think (take a deep breath now) that scanning the partitioned leading-descending-key index in a forward direction, but with partitions reversed, will result in the order specified by the query.
I don't blame it to be frank, the various sort order considerations make my head hurt too.
As a final example, consider:
SELECT T1ID FROM dbo.T1 WHERE SomeID = 123 ORDER BY T1ID DESC;
The results are:
T1ID sort order within each partition is correctly descending, but the partitions themselves are listed backward (they go from 1 to 3 down the rows). If the partitions were returned in reverse order, the results would correctly be
14, 13, 12, 11, 10, 9, … 5, 4, 3, 2, 1.
Back to the Merge Join
The cause of the incorrect results with the Merge Join query is now apparent:
SELECT T2.T2ID FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.T1ID = T1.T1ID WHERE T1.SomeID = 123 OPTION (MERGE JOIN);
The Merge Join requires sorted inputs. The input from
T2 is explicitly sorted by
T1TD so that is ok. The optimizer incorrectly reasons that the index on
T1 can provide rows in
T1ID order. As we have seen, this is not the case. The Index Seek produces the same output as a query we have already seen:
SELECT T1ID FROM dbo.T1 WHERE SomeID = 123 ORDER BY T1ID ASC;
Only the first 5 rows are in
T1ID order. The next value (5) is certainly not in ascending order, and the Merge Join interprets this as end-of-stream rather than producing an error (personally I expected a retail assertion here). Anyway, the effect is that the Merge Join incorrectly finishes processing early. As a reminder, the (incomplete) results are:
This is a very serious bug in my view. A simple index seek can return results that do not respect the
ORDER BY clause. More to the point, the optimizer's internal reasoning is completely broken for partitioned non-unique nonclustered indexes with a descending leading key.
Yes, this is a slightly unusual arrangement. But, as we have seen, correct results can be suddenly replaced by incorrect results just because someone added a descending index. Remember the added index looked innocent enough: no explicit
ASC/DESC key mismatch, and no explicit partitioning.
The bug is not limited to Merge Joins. Potentially any query that involves a partitioned table and which relies on index sort order (explicit or implicit) could fall victim. This bug exists in all versions of SQL Server from 2008 to 2014 CTP 1 inclusive. Windows SQL Azure Database does not support partitioning, so the issue does not arise. SQL Server 2005 used a different implementation model for partitioning (based on
APPLY) and does not suffer from this issue either.
If you have a moment, please consider voting on my Connect item for this bug.
The fix for this issue is now available and documented in a Knowledge Base article. Please note the fix requires a code update and trace flag 4199, which enables a range of other query processor changes. It is unusual for an incorrect-results bug to be fixed under 4199. I asked for clarification on that and the response was:
Even though this problem involves incorrect results like other hotfixes involving the Query Processor we have only enabled this fix under trace flag 4199 for SQL Server 2008, 2008 R2, and 2012. However, this fix is "on" by default without the trace flag in SQL Server 2014 RTM.