Two Partitioning Peculiarities - SQLPerformance.com

Table partitioning in SQL Server is essentially a way of making multiple physical tables (rowsets) look like a single table. This abstraction is performed entirely by the query processor, a design that makes things simpler for users, but which makes complex demands of the query optimizer. This post looks at two examples which exceed the optimizer’s abilities in SQL Server 2008 onward.

Join Column Order Matters

This first example shows how the textual order of ON clause conditions can affect the query plan produced when joining partitioned tables. To start with, we need a partitioning scheme, a partitioning function, and two tables:

CREATE PARTITION FUNCTION PF (integer) 
AS RANGE RIGHT
FOR VALUES 
	(
	10000, 20000, 30000, 40000, 50000,
	60000, 70000, 80000, 90000, 100000,
	110000, 120000, 130000, 140000, 150000
	);

CREATE PARTITION SCHEME PS 
AS PARTITION PF 
ALL TO ([PRIMARY]);
GO
CREATE TABLE dbo.T1
(
    c1 integer NOT NULL,
    c2 integer NOT NULL,
    c3 integer NOT NULL,
    
    CONSTRAINT PK_T1
    PRIMARY KEY CLUSTERED (c1, c2, c3)
    ON PS (c1)
);

CREATE TABLE dbo.T2
(
    c1 integer NOT NULL,
    c2 integer NOT NULL,
    c3 integer NOT NULL,
    
    CONSTRAINT PK_T2
    PRIMARY KEY CLUSTERED (c1, c2, c3)
    ON PS (c1)
);

Next, we load both tables with 150,000 rows. The data does not matter very much; this example uses a standard Numbers table containing all the integer values from 1 to 150,000 as a data source. Both tables are loaded with the same data.

INSERT dbo.T1 WITH (TABLOCKX)
    (c1, c2, c3)
SELECT
    N.n * 1,
    N.n * 2,
    N.n * 3
FROM dbo.Numbers AS N
WHERE
    N.n BETWEEN 1 AND 150000;

INSERT dbo.T2 WITH (TABLOCKX)
    (c1, c2, c3)
SELECT
    N.n * 1,
    N.n * 2,
    N.n * 3
FROM dbo.Numbers AS N
WHERE
    N.n BETWEEN 1 AND 150000;

Our test query performs a simple inner join of these two tables. Again, the query is not important or intended to be particularly realistic, it is used to demonstrate an odd effect when joining partitioned tables. The first form of the query uses an ON clause written in c3, c2, c1 column order:

SELECT *
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON t1.c3 = t2.c3 
    AND t1.c2 = t2.c2 
    AND t1.c1 = t2.c1;

The execution plan produced for this query (on SQL Server 2008 and later) features a parallel hash join, with an estimated cost of 2.6953:

This is a bit unexpected. Both tables have a clustered index in (c1, c2, c3) order, partitioned by c1, so we would expect a merge join, taking advantage of the index ordering. Let’s try writing the ON clause in (c1, c2, c3) order instead:

SELECT *
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON t1.c1 = t2.c1
    AND t1.c2 = t2.c2
    AND t1.c3 = t2.c3;

The execution plan now uses the expected merge join, with an estimated cost of 1.64119 (down from 2.6953). The optimizer also decides that it is not worth using parallel execution:

Noting that the merge join plan is clearly more efficient, we can attempt to force a merge join for the original ON clause order using a query hint:

SELECT *
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON t1.c3 = t2.c3 
    AND t1.c2 = t2.c2 
    AND t1.c1 = t2.c1
OPTION (MERGE JOIN);

The resulting plan does use a merge join as requested, but it also features sorts on both inputs, and goes back to using parallelism. The estimated cost of this plan is a whopping 8.71063:

Both sort operators have the same properties:

The optimizer thinks the merge join needs its inputs sorted in the strict written order of the ON clause, introducing explicit sorts as a result. The optimizer is aware that a merge join requires its inputs sorted in the same way, but it also knows that the column order does not matter. Merge join on (c1, c2, c3) is equally happy with inputs sorted on (c3, c2, c1) as it is with inputs sorted on (c2, c1, c3) or any other combination.

Unfortunately, this reasoning is broken in the query optimizer when partitioning is involved. This is an optimizer bug that has been fixed in SQL Server 2008 R2 and later, although trace flag 4199 is required to activate the fix:

SELECT *
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON t1.c3 = t2.c3 
    AND t1.c2 = t2.c2 
    AND t1.c1 = t2.c1
OPTION (QUERYTRACEON 4199);

You would normally enable this trace flag using DBCC TRACEON or as a start-up option, because the QUERYTRACEON hint is not documented for use with 4199. The trace flag is required in SQL Server 2008 R2, SQL Server 2012, and SQL Server 2014 CTP1.

Anyway, how ever the flag is enabled, the query now produces the optimal merge join whatever the ON clause ordering:

There is no fix for SQL Server 2008, the workaround is to write the ON clause in the ‘right’ order! If you encounter a query like this on SQL Server 2008, try forcing a merge join and look at the sorts to determine the ‘correct’ way to write your query’s ON clause.

This issue does not arise in SQL Server 2005 because that release implemented partitioned queries using the APPLY model:

The SQL Server 2005 query plan joins one partition from each table at a time, using an in-memory table (the Constant Scan) containing partition numbers to process. Each partition is merge joined separately on the inner side of the join, and the 2005 optimizer is smart enough to see that the ON clause column order does not matter.

This latest plan is an example of a collocated merge join, a facility that was lost when moving from SQL Server 2005 to the new partitioning implementation in SQL Server 2008. A suggestion on Connect to reinstate collocated merge joins has been closed as Won’t Fix.

Group By Order Matters

The second peculiarity I want to look at follows a similar theme, but relates to the order of columns in a GROUP BY clause rather than the ON clause of an inner join. We will need a new table to demonstrate:

CREATE TABLE dbo.T3
(
    RowID       integer IDENTITY NOT NULL,
    UserID      integer NOT NULL,
    SessionID   integer NOT NULL,
    LocationID  integer NOT NULL,
    
    CONSTRAINT PK_T3
    PRIMARY KEY CLUSTERED (RowID)
    ON PS (RowID)
);

INSERT dbo.T3 WITH (TABLOCKX)
    (UserID, SessionID, LocationID)
SELECT
    ABS(CHECKSUM(NEWID())) % 50,
    ABS(CHECKSUM(NEWID())) % 30,
    ABS(CHECKSUM(NEWID())) % 10
FROM dbo.Numbers AS N
WHERE
    N.n BETWEEN 1 AND 150000;

The table has an aligned nonclustered index, where ‘aligned’ simply means it is partitioned in the same way as the clustered index (or heap):

CREATE NONCLUSTERED INDEX nc1
ON dbo.T3 (UserID, SessionID, LocationID)
ON PS (RowID);

Our test query groups data across the three nonclustered index columns and returns a count for each group:

SELECT LocationID, UserID, SessionID, COUNT_BIG(*)
FROM dbo.T3
GROUP BY LocationID, UserID, SessionID;

The query plan scans the nonclustered index and uses a Hash Match Aggregate to count rows in each group:

There are two problems with Hash Aggregate:

It is a blocking operator. No rows are returned to the client until all rows have been aggregated.
It requires a memory grant to hold the hash table.

In many real-world scenarios, we would prefer a Stream Aggregate here because that operator is only blocking per group, and does not require a memory grant. Using this option, the client application would start receiving data earlier, would not have to wait for memory to be granted, and the SQL Server can use the memory for other purposes.

We can require the query optimizer to use a Stream Aggregate for this query by adding an OPTION (ORDER GROUP) query hint. This results in the following execution plan:

The Sort operator is fully blocking and also requires a memory grant, so this plan appears to be worse than simply using a hash aggregate. But why is the sort needed? The properties show that the rows are being sorted in the order specified by our GROUP BY clause:

This sort is expected because partition-aligning the index (in SQL Server 2008 onward) means the partition number is added as a leading column of the index. In effect, the nonclustered index keys are (partition, user, session, location) due to the partitioning. Rows in the index are still sorted by user, session, and location, but only within each partition.

If we restrict the query to a single partition, the optimizer ought to be able to use the index to feed a Stream Aggregate without sorting. In case that requires some explanation, specifying a single partition means the query plan can eliminate all other partitions from the nonclustered index scan, resulting in a stream of rows that is ordered by (user, session, location).

We can achieve this partition elimination explicitly using the $PARTITION function:

SELECT LocationID, UserID, SessionID, COUNT_BIG(*)
FROM dbo.T3
WHERE $PARTITION.PF(RowID) = 1
GROUP BY LocationID, UserID, SessionID;

Unfortunately, this query still uses a Hash Aggregate, with an estimated plan cost of 0.287878:

The scan is now just over one partition, but the (user, session, location) ordering has not helped the optimizer use a Stream Aggregate. You might object that (user, session, location) ordering is not helpful because the GROUP BY clause is (location, user, session), but the key order does not matter for a grouping operation.

Let’s add an ORDER BY clause in the order of the index keys to prove the point:

SELECT LocationID, UserID, SessionID, COUNT_BIG(*)
FROM dbo.T3
WHERE $PARTITION.PF(RowID) = 1
GROUP BY LocationID, UserID, SessionID
ORDER BY UserID, SessionID, LocationID;

Notice that the ORDER BY clause matches the nonclustered index key order, though the GROUP BY clause does not. The execution plan for this query is:

Now we have the Stream Aggregate we were after, with an estimated plan cost of 0.0423925 (compared with 0.287878 for the Hash Aggregate plan – almost 7 times more).

The other way to achieve a Stream Aggregate here is to reorder the GROUP BY columns to match the nonclustered index keys:

SELECT LocationID, UserID, SessionID, COUNT_BIG(*)
FROM dbo.T3 AS T1
WHERE $PARTITION.PF(RowID) = 1
GROUP BY UserID, SessionID, LocationID;

This query produces the same Stream Aggregate plan shown immediately above, with exactly the same cost. This sensitivity to GROUP BY column order is specific to partitioned table queries in SQL Server 2008 and later.

You may recognize that the root cause of the problem here is similar to the previous case involving a Merge Join. Both Merge Join and Stream Aggregate require input sorted on the join or aggregation keys, but neither cares about the order of those keys. A Merge Join on (x, y, z) is just as happy receiving rows ordered by (y, z, x) or (z, y, x) and the same is true for Stream Aggregate.

This optimizer limitation also applies to DISTINCT in the same circumstances. The following query results in a Hash Aggregate plan with an estimated cost of 0.286539:

SELECT DISTINCT LocationID, UserID, SessionID
FROM dbo.T3 AS T1
WHERE $PARTITION.PF(RowID) = 1;

If we write the DISTINCT columns in the order of the nonclustered index keys…

SELECT DISTINCT UserID, SessionID, LocationID
FROM dbo.T3 AS T1
WHERE $PARTITION.PF(RowID) = 1;

…we are rewarded with a Stream Aggregate plan with a cost of 0.041455:

To summarize, this is a limitation of the query optimizer in SQL Server 2008 and later (including SQL Server 2014 CTP 1) that is not resolved by using trace flag 4199 as was the case for the Merge Join example. The problem only occurs with partitioned tables with a GROUP BY or DISTINCT over three or more columns using an aligned partitioned index, where a single partition is processed.

As with the Merge Join example, this represents a backward step from the SQL Server 2005 behaviour. SQL Server 2005 did not add an implied leading key to partitioned indexes, using an APPLY technique instead. In SQL Server 2005, all the queries presented here using $PARTITION to specify a single partition result in query plans that performs partition elimination and use Stream Aggregates without any query text reordering.

The changes to partitioned table processing in SQL Server 2008 improved performance in several important areas, primarily related to the efficient parallel processing of partitions. Unfortunately, these changes had side effects which have not all been resolved in later releases.

9 thoughts on “Two Partitioning Peculiarities”

SomewhereSomehow says:

July 25, 2013 at 10:32 AM

Hi, Paul!
Thanks for interesting reading. It is always exciting to watch how writing order takes effect on query plan =)
Maybe it will be interesting and somehow relevant to the topic, some time ago I found how query tree redundant predicate simplification of the "in" clause leads to different trees and may lead to deifferent plans if we simply change "where a=b" to "where b=a".
Here is repro:
use AdventureWorks2008;
go
set showplan_xml on
go
select * from Sales.SalesOrderHeader soh
where soh.CustomerID in (select sc.CustomerID from Sales.Customer sc where sc.CustomerID = soh.CustomerID)
go
select * from Sales.SalesOrderHeader soh
where soh.CustomerID in (select sc.CustomerID from Sales.Customer sc where soh.CustomerID = sc.CustomerID)
go
set showplan_xml off
go
Funny =)
It is also reproducable in 2012.
tobi says:

July 25, 2013 at 9:34 PM

These bugs are disappointing. They indicate poor code inside the QO. I'd expect that tests for required guarantees like ordering compatibility would be resolved by a common set of well-tested utility functions. Instead, they sometimes appear to be ad-hoc solutions showing different behavior under different circumstances. There does not seem to be a coherent symbolic reasoning framework in place.

Here is another instance were a very localized algebraic rewrite triggers a bug: http://stackoverflow.com/questions/17323547/should-the-order-of-linq-query-clauses-affect-entity-framework-performance/ The QO did not even manage to simplify these simple and common expressions. It does not seem to do symbolic reasoning but to respond to very narrow, hard-coded patterns. Disappointing.
Paul White says:

July 26, 2013 at 5:49 AM

Cheer up, tobi!

Goodness me, I write just as often about the positive features in the query optimizer as I do about interesting edge-case behaviours and missed opportunities. The first one in this article has already been fixed in recent versions (so long as you enable the trace flag) and the second one is quite rare, though hopefully interesting and entertaining to read about. I certainly don't intend to depress anyone.

The optimizer does contain a quite general reasoning framework, though as with any highly complex piece of code, there will always be edge-cases. It is of course easy to criticize and assume that there isn't a "coherent symbolic reasoning framework" (whatever one of those is!) but by-and-large people don't write about the millions of successful query plan compilations they encounter, do they? It is true that much of the optimizer's work is based on elementary transformations rather than semantic analysis, but that is a very hard general problem; I am not aware of any commercial product that does that well. Remember the query optimizer faces quite different challenges from (for example) a .NET optimizing compiler.

Regarding the stack overflow question, I added some more information there. Yes, it would be nice if the EF and SQL Server teams had worked together a bit closer on that one, but the questioner is using a beta version of EF, and the database design could certainly bear some improvements that would avoid the issue as well.
Paul White says:

July 26, 2013 at 5:52 AM

Yes that's a nice one (though I would never write an IN clause that way myself). There are all sorts of ways to defeat the join-removing simplifications, but they are still a very useful optimizer feature.
wayne says:

August 13, 2013 at 9:55 AM

Paul i have often seen this join order issue and wondered why the optimiser didnt just reorder those internally.
Nick H says:

September 23, 2015 at 5:48 PM

Paul,
This white paper (https://technet.microsoft.com/en-us/library/dd578580%28v=sql.100%29.aspx) aimed at the 2008 release of the engine says that the partition number is added as an INCLUDED column in a non-clustered index (Quote: "SQL Server automatically adds the partition column to a secondary nonunique index as an included column if the CREATE INDEX statement does not already contain the partition column"). Is that a mistake (potentially in addition to some others)?
Paul White says:

September 26, 2015 at 8:59 AM

I haven't read the whole paper, but that statement seems accurate to me in context.
Nick H says:

September 26, 2015 at 4:30 PM

Paul, but there is a major difference between the partition number being "added as a leading column of the index" (per what you wrote), which would essentially make it a part of the key itself, and it being added merely as an included column. Wouldn't you agree? :)
Paul White says:

September 26, 2015 at 11:23 PM

The crucial difference is between partition *number* and the *value* of the partitioning column used to generate that partition number via the partitioning function. The partition number is effectively the leading key of the index; the specific value used for that partitioning may end up in the key or include list.

Comments are closed.