Every product has bugs, and SQL Server is no exception. Using product features in a slightly unusual way (or combining relatively new features together) is a great way to find them. Bugs can be interesting, and even educational, but perhaps some of the joys are lost when the discovery results in your pager going off at 4am, perhaps after a particularly social night out with friends…
The bug that is the subject of this post is probably reasonably rare in the wild, but it is not a classic edge case. I know of at least one consultant that has encountered it in a production system. On a completely unrelated subject, I should take this opportunity to say "hello" to the Grumpy Old DBA (blog).
I will start with some relevant background on merge joins. If you are certain you already know everything there is to know about merge join, or just want to cut to the chase, feel free to scroll down to the section titled, "The Bug."
Merge Join
Merge join is not a terribly complicated thing, and can be very efficient in the right circumstances. It requires that its inputs are sorted on the join keys, and performs best in one-to-many mode (where at least of its inputs is unique on the join keys). For moderately-sized one-to-many joins, serial merge join is not a bad choice at all, provided the input sorting requirements can be met without performing an explicit sort.
Avoiding a sort is most commonly achieved by exploiting the ordering provided by an index. Merge join can also take advantage of preserved sort order from an earlier, unavoidable sort. A cool thing about merge join is that it can stop processing input rows as soon as either input runs out of rows. One last thing: merge join does not care whether the input sort order is ascending or descending (though both inputs must be the same). The following example uses a standard Numbers table to illustrate most of the points above:
CREATE TABLE #T1 (col1 integer CONSTRAINT PK1 PRIMARY KEY (col1 DESC));
CREATE TABLE #T2 (col1 integer CONSTRAINT PK2 PRIMARY KEY (col1 DESC));
INSERT #T1 SELECT n FROM dbo.Numbers WHERE n BETWEEN 10000 AND 19999;
INSERT #T2 SELECT n FROM dbo.Numbers WHERE n BETWEEN 18000 AND 21999;
Notice that the indexes enforcing the primary keys on those two tables are defined as descending. The query plan for the INSERT
has a number of interesting features:
Reading left to right (as is only sensible!) the Clustered Index Insert has the "DML Request Sort" property set. This means the operator requires rows in Clustered Index key order. The clustered index (enforcing the primary key in this case) is defined as DESC
, so rows with higher values are required to arrive first. The clustered index on my Numbers table is ASC
, so the query optimizer avoids an explicit sort by seeking to the highest match in the Numbers table (21,999) first, then scanning toward the lowest match (18,000) in reverse index order. The "Plan Tree" view in SQL Sentry Plan Explorer shows the reverse (backward) scan clearly:
Backward scanning reverses the natural order of the index. A backward scan of an ASC
index key returns rows in descending key order; a backward scan of a DESC
index key returns rows in ascending key order. The "scan direction" does not indicate returned key order by itself – you have to know whether the index is ASC
or DESC
to make that determination.
Using these test tables and data (T1
has 10,000 rows numbered from 10,000 to 19,999 inclusive; T2
has 4,000 rows numbered from 18,000 to 21,999) the following query joins the two tables together and returns results in descending order of both keys:
SELECT
T1.col1,
T2.col1
FROM #T1 AS T1
JOIN #T2 AS T2
ON T2.col1 = T1.col1
ORDER BY
T1.col1 DESC,
T2.col1 DESC;
The query returns the correct matching 2,000 rows as you would expect. The post-execution plan is as follows:
The Merge Join is not running in many-to-many mode (the top input is unique on the join keys) and the 2,000 row cardinality estimate is exactly correct. The Clustered Index Scan of table T2
is ordered (though we have to wait a moment to discover whether that order is forward or backward) and the cardinality estimate of 4,000 rows is also exactly right. The Clustered Index Scan of table T1
is also ordered, but only 2,001 rows were read whereas 10,000 were estimated. The plan tree view shows both Clustered Index Scans are ordered forward:
Recall that reading a DESC
index FORWARD
will produce rows in reverse key order. This is exactly what is required by the ORDER BY T1.col DESC, T2.col1 DESC
clause, so no explicit sort is necessary. Pseudo-code for one-to-many Merge Join (reproduced from Craig Freedman's Merge Join blog) is:
The descending order scan of T1
returns rows starting at 19,999 and working down towards 10,000. The descending order scan of T2
returns rows starting at 21,999 and working down towards 18,000. All 4,000 rows in T2
are eventually read, but the iterative merge process stops when key value 17,999 is read from T1
, because T2
runs out of rows. Merge processing therefore completes without fully reading T1
. It reads rows from 19,999 down to 17,999 inclusive; a total of 2,001 rows as shown in the execution plan above.
Feel free to re-run the test with ASC
indexes instead, also changing the ORDER BY
clause from DESC
to ASC
. The execution plan produced will be very similar, and no sorts will be needed.
To summarize the points that will be important in a moment, Merge Join requires join-key sorted inputs, but it does not matter whether the keys are sorted ascending or descending.
The Bug
To reproduce the bug, at least one of our tables needs to be partitioned. To keep the results manageable, this example will use just a small number of rows, so the partitioning function needs small boundaries as well:
CREATE PARTITION FUNCTION PF (integer)
AS RANGE RIGHT
FOR VALUES (5, 10, 15);
CREATE PARTITION SCHEME PS
AS PARTITION PF
ALL TO ([PRIMARY]);
The first table contains two columns, and is partitioned on the PRIMARY KEY:
CREATE TABLE dbo.T1
(
T1ID integer IDENTITY (1,1) NOT NULL,
SomeID integer NOT NULL,
CONSTRAINT [PK dbo.T1 T1ID]
PRIMARY KEY CLUSTERED (T1ID)
ON PS (T1ID)
);
The second table is not partitioned. It contains a primary key and a column that will join to the first table:
CREATE TABLE dbo.T2
(
T2ID integer IDENTITY (1,1) NOT NULL,
T1ID integer NOT NULL,
CONSTRAINT [PK dbo.T2 T2ID]
PRIMARY KEY CLUSTERED (T2ID)
ON [PRIMARY]
);
The Sample Data
The first table has 14 rows, all with the same value in the SomeID
column. SQL Server assigns the IDENTITY
column values, numbered 1 to 14.
INSERT dbo.T1
(SomeID)
VALUES
(123), (123), (123),
(123), (123), (123),
(123), (123), (123),
(123), (123), (123),
(123), (123);
The second table is simply populated with the IDENTITY
values from table one:
INSERT dbo.T2 (T1ID)
SELECT T1ID
FROM dbo.T1;
The data in the two tables looks like this:
The Test Query
The first query simply joins both tables, applying a single WHERE clause predicate (which happens to match all rows in this greatly simplified example):
SELECT
T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
ON T2.T1ID = T1.T1ID
WHERE
T1.SomeID = 123;
The result contains all 14 rows, as expected:
Due to the small number of rows, the optimizer chooses a nested loops join plan for this query:
The results are the same (and still correct) if we force a hash or merge join:
SELECT
T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
ON T2.T1ID = T1.T1ID
WHERE
T1.SomeID = 123
OPTION (HASH JOIN);
SELECT
T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
ON T2.T1ID = T1.T1ID
WHERE
T1.SomeID = 123
OPTION (MERGE JOIN);
The Merge Join there is one-to-many, with an explicit sort on T1ID
required for table T2
.
The Descending Index Problem
All is well until one day (for good reasons that need not concern us here) another administrator adds a descending index on the SomeID
column of table 1:
CREATE NONCLUSTERED INDEX [dbo.T1 SomeID]
ON dbo.T1 (SomeID DESC);
Our query continues to produce correct results when the optimizer chooses a Nested Loops or Hash Join, but it is a different story when a Merge Join is used. The following still uses a query hint to force the Merge Join, but this is just a consequence of the low row counts in the example. The optimizer would naturally choose the same Merge Join plan with different table data.
SELECT
T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
ON T2.T1ID = T1.T1ID
WHERE
T1.SomeID = 123
OPTION (MERGE JOIN);
The execution plan is:
The optimizer has chosen to use the new index, but the query now produces only five rows of output:
What happened to the other 9 rows? To be clear, this result is incorrect. The data has not changed, so all 14 rows should be returned (as they still are with a Nested Loops or Hash Join plan).
Cause and Explanation
The new nonclustered index on SomeID
is not declared as unique, so the clustered index key is silently added to all nonclustered index levels. SQL Server adds the T1ID
column (the clustered key) to the nonclustered index just as if we had created the index like this:
CREATE NONCLUSTERED INDEX [dbo.T1 SomeID]
ON dbo.T1 (SomeID DESC, T1ID);
Notice the lack of a DESC
qualifier on the silently-added T1ID
key. Index keys are ASC
by default. This is not a problem in itself (though it contributes). The second thing that happens to our index automatically is that it is partitioned in the same way the base table is. So, the full index specification, if we were to write it out explicitly, would be:
CREATE NONCLUSTERED INDEX [dbo.T1 SomeID]
ON dbo.T1 (SomeID DESC, T1ID ASC)
ON PS (T1ID);
This is now quite a complex structure, with keys in all sorts of different orders. It is complex enough for the query optimizer to get it wrong when reasoning about the sort order provided by the index. To illustrate, consider the following simple query:
SELECT
T1ID,
PartitionID = $PARTITION.PF(T1ID)
FROM dbo.T1
WHERE
SomeID = 123
ORDER BY
T1ID ASC;
The extra column will just show us which partition the current row belongs in. Otherwise, it is just a simple query that returns T1ID
values in ascending order, WHERE SomeID = 123
. Unfortunately, the results are not what is specified by the query:
The query requires that T1ID
values should be returned in ascending order, but that is not what we get. We do get values in ascending order per partition, but the partitions themselves are returned in reverse order! If the partitions were returned in ascending order (and the T1ID
values remained sorted within each partition as shown) the result would be correct.
The query plan shows that the optimizer was confused by the leading DESC
key of the index, and thought it needed to read the partitions in reverse order for correct results:
The partition seek starts at the right-most partition (4) and proceeds backwards to partition 1. You might think we could fix the issue by explicitly sorting on partition number ASC
in the ORDER BY
clause:
SELECT
T1ID,
PartitionID = $PARTITION.PF(T1ID)
FROM dbo.T1
WHERE
SomeID = 123
ORDER BY
PartitionID ASC, -- New!
T1ID ASC;
This query returns the same results (this is not a misprint or a copy/paste error):
The partition id is still in descending order (not ascending, as specified) and T1ID
is only sorted ascending within each partition. Such is the optimizer's confusion, it really does think (take a deep breath now) that scanning the partitioned leading-descending-key index in a forward direction, but with partitions reversed, will result in the order specified by the query.
I don't blame it to be frank, the various sort order considerations make my head hurt too.
As a final example, consider:
SELECT
T1ID
FROM dbo.T1
WHERE
SomeID = 123
ORDER BY
T1ID DESC;
The results are:
Again, the T1ID
sort order within each partition is correctly descending, but the partitions themselves are listed backward (they go from 1 to 3 down the rows). If the partitions were returned in reverse order, the results would correctly be 14, 13, 12, 11, 10, 9, … 5, 4, 3, 2, 1
.
Back to the Merge Join
The cause of the incorrect results with the Merge Join query is now apparent:
SELECT
T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
ON T2.T1ID = T1.T1ID
WHERE
T1.SomeID = 123
OPTION (MERGE JOIN);
The Merge Join requires sorted inputs. The input from T2
is explicitly sorted by T1TD
so that is ok. The optimizer incorrectly reasons that the index on T1
can provide rows in T1ID
order. As we have seen, this is not the case. The Index Seek produces the same output as a query we have already seen:
SELECT
T1ID
FROM dbo.T1
WHERE
SomeID = 123
ORDER BY
T1ID ASC;
Only the first 5 rows are in T1ID
order. The next value (5) is certainly not in ascending order, and the Merge Join interprets this as end-of-stream rather than producing an error (personally I expected a retail assertion here). Anyway, the effect is that the Merge Join incorrectly finishes processing early. As a reminder, the (incomplete) results are:
Conclusion
This is a very serious bug in my view. A simple index seek can return results that do not respect the ORDER BY
clause. More to the point, the optimizer's internal reasoning is completely broken for partitioned non-unique nonclustered indexes with a descending leading key.
Yes, this is a slightly unusual arrangement. But, as we have seen, correct results can be suddenly replaced by incorrect results just because someone added a descending index. Remember the added index looked innocent enough: no explicit ASC/DESC
key mismatch, and no explicit partitioning.
The bug is not limited to Merge Joins. Potentially any query that involves a partitioned table and which relies on index sort order (explicit or implicit) could fall victim. This bug exists in all versions of SQL Server from 2008 to 2014 CTP 1 inclusive. Windows SQL Azure Database does not support partitioning, so the issue does not arise. SQL Server 2005 used a different implementation model for partitioning (based on APPLY
) and does not suffer from this issue either.
If you have a moment, please consider voting on my Connect item for this bug.
Resolution
The fix for this issue is now available and documented in a Knowledge Base article. Please note the fix requires a code update and trace flag 4199, which enables a range of other query processor changes. It is unusual for an incorrect-results bug to be fixed under 4199. I asked for clarification on that and the response was:
Even though this problem involves incorrect results like other hotfixes involving the Query Processor we have only enabled this fix under trace flag 4199 for SQL Server 2008, 2008 R2, and 2012. However, this fix is “on” by default without the trace flag in SQL Server 2014 RTM.
Hi, Paul!
Cool story and excellent work! Thanks!
Thank you for the excellent writeup, Paul, but of course I would expect no less from you!
Hi Paul, thank you for this article.
I was able to reproduce this bug in my production environment.
I also played with this scenario a little bit and, interestingly enough, it looks that the behavior returns back to normal if you add any WHERE-condition that references the partitioned column.
It can be a meaningless condition too, like this one:
Hi Eugene,
Yes, that's true. I hadn't noticed that; thanks for letting me know!
Possibly the ultimate in meaningless predicates:
I added a workaround to the Connect item based on your observation (not that it is an acceptable workaround, of course!)
Thanks again.
Paul
Good article
modifying optimizer plan is not recommended is really what I got from this
I got a headache by reading the example much less trying to do this on mine
but very good to know so when I have a hot shot in my team trying to prove his skill i can point him to this link
thanks
Hi Mathew,
There's a simpler example of this bug at https://www.sql.kiwi/2013/08/incorrect-results-caused-by-adding-an-index.html
Paul
Just to reconfirm, since the fix is "on" by default from SQL 2014 onward, I should not worry at all about data being incorrect with a MERGE join if I am using a higher version of SQL Server (2016, 2017). I do not need to turn on a traceflag for this. Am I correct?
That is correct, yes.