Table partitioning in SQL Server is essentially a way of making multiple physical tables (rowsets) look like a single table. This abstraction is performed entirely by the query processor, a design that makes things simpler for users, but which makes complex demands of the query optimizer. This post looks at two examples which exceed the optimizer's abilities in SQL Server 2008 onward.
Join Column Order Matters
This first example shows how the textual order of
ON clause conditions can affect the query plan produced when joining partitioned tables. To start with, we need a partitioning scheme, a partitioning function, and two tables:
CREATE PARTITION FUNCTION PF (integer) AS RANGE RIGHT FOR VALUES ( 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000 ); CREATE PARTITION SCHEME PS AS PARTITION PF ALL TO ([PRIMARY]); GO CREATE TABLE dbo.T1 ( c1 integer NOT NULL, c2 integer NOT NULL, c3 integer NOT NULL, CONSTRAINT PK_T1 PRIMARY KEY CLUSTERED (c1, c2, c3) ON PS (c1) ); CREATE TABLE dbo.T2 ( c1 integer NOT NULL, c2 integer NOT NULL, c3 integer NOT NULL, CONSTRAINT PK_T2 PRIMARY KEY CLUSTERED (c1, c2, c3) ON PS (c1) );
Next, we load both tables with 150,000 rows. The data does not matter very much; this example uses a standard Numbers table containing all the integer values from 1 to 150,000 as a data source. Both tables are loaded with the same data.
INSERT dbo.T1 WITH (TABLOCKX) (c1, c2, c3) SELECT N.n * 1, N.n * 2, N.n * 3 FROM dbo.Numbers AS N WHERE N.n BETWEEN 1 AND 150000; INSERT dbo.T2 WITH (TABLOCKX) (c1, c2, c3) SELECT N.n * 1, N.n * 2, N.n * 3 FROM dbo.Numbers AS N WHERE N.n BETWEEN 1 AND 150000;
Our test query performs a simple inner join of these two tables. Again, the query is not important or intended to be particularly realistic, it is used to demonstrate an odd effect when joining partitioned tables. The first form of the query uses an
ON clause written in c3, c2, c1 column order:
SELECT * FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON t1.c3 = t2.c3 AND t1.c2 = t2.c2 AND t1.c1 = t2.c1;
The execution plan produced for this query (on SQL Server 2008 and later) features a parallel hash join, with an estimated cost of 2.6953:
This is a bit unexpected. Both tables have a clustered index in (c1, c2, c3) order, partitioned by c1, so we would expect a merge join, taking advantage of the index ordering. Let's try writing the
ON clause in (c1, c2, c3) order instead:
SELECT * FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON t1.c1 = t2.c1 AND t1.c2 = t2.c2 AND t1.c3 = t2.c3;
The execution plan now uses the expected merge join, with an estimated cost of 1.64119 (down from 2.6953). The optimizer also decides that it is not worth using parallel execution:
Noting that the merge join plan is clearly more efficient, we can attempt to force a merge join for the original
ON clause order using a query hint:
SELECT * FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON t1.c3 = t2.c3 AND t1.c2 = t2.c2 AND t1.c1 = t2.c1 OPTION (MERGE JOIN);
The resulting plan does use a merge join as requested, but it also features sorts on both inputs, and goes back to using parallelism. The estimated cost of this plan is a whopping 8.71063:
Both sort operators have the same properties:
The optimizer thinks the merge join needs its inputs sorted in the strict written order of the
ON clause, introducing explicit sorts as a result. The optimizer is aware that a merge join requires its inputs sorted in the same way, but it also knows that the column order does not matter. Merge join on (c1, c2, c3) is equally happy with inputs sorted on (c3, c2, c1) as it is with inputs sorted on (c2, c1, c3) or any other combination.
Unfortunately, this reasoning is broken in the query optimizer when partitioning is involved. This is an optimizer bug that has been fixed in SQL Server 2008 R2 and later, although trace flag 4199 is required to activate the fix:
SELECT * FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON t1.c3 = t2.c3 AND t1.c2 = t2.c2 AND t1.c1 = t2.c1 OPTION (QUERYTRACEON 4199);
You would normally enable this trace flag using
DBCC TRACEON or as a start-up option, because the
QUERYTRACEON hint is not documented for use with 4199. The trace flag is required in SQL Server 2008 R2, SQL Server 2012, and SQL Server 2014 CTP1.
Anyway, how ever the flag is enabled, the query now produces the optimal merge join whatever the
ON clause ordering:
There is no fix for SQL Server 2008, the workaround is to write the
ON clause in the 'right' order! If you encounter a query like this on SQL Server 2008, try forcing a merge join and look at the sorts to determine the 'correct' way to write your query's
This issue does not arise in SQL Server 2005 because that release implemented partitioned queries using the
The SQL Server 2005 query plan joins one partition from each table at a time, using an in-memory table (the Constant Scan) containing partition numbers to process. Each partition is merge joined separately on the inner side of the join, and the 2005 optimizer is smart enough to see that the
ON clause column order does not matter.
This latest plan is an example of a collocated merge join, a facility that was lost when moving from SQL Server 2005 to the new partitioning implementation in SQL Server 2008. A suggestion on Connect to reinstate collocated merge joins has been closed as Won't Fix.
Group By Order Matters
The second peculiarity I want to look at follows a similar theme, but relates to the order of columns in a
GROUP BY clause rather than the
ON clause of an inner join. We will need a new table to demonstrate:
CREATE TABLE dbo.T3 ( RowID integer IDENTITY NOT NULL, UserID integer NOT NULL, SessionID integer NOT NULL, LocationID integer NOT NULL, CONSTRAINT PK_T3 PRIMARY KEY CLUSTERED (RowID) ON PS (RowID) ); INSERT dbo.T3 WITH (TABLOCKX) (UserID, SessionID, LocationID) SELECT ABS(CHECKSUM(NEWID())) % 50, ABS(CHECKSUM(NEWID())) % 30, ABS(CHECKSUM(NEWID())) % 10 FROM dbo.Numbers AS N WHERE N.n BETWEEN 1 AND 150000;
The table has an aligned nonclustered index, where 'aligned' simply means it is partitioned in the same way as the clustered index (or heap):
CREATE NONCLUSTERED INDEX nc1 ON dbo.T3 (UserID, SessionID, LocationID) ON PS (RowID);
Our test query groups data across the three nonclustered index columns and returns a count for each group:
SELECT LocationID, UserID, SessionID, COUNT_BIG(*) FROM dbo.T3 GROUP BY LocationID, UserID, SessionID;
The query plan scans the nonclustered index and uses a Hash Match Aggregate to count rows in each group:
There are two problems with Hash Aggregate:
- It is a blocking operator. No rows are returned to the client until all rows have been aggregated.
- It requires a memory grant to hold the hash table.
In many real-world scenarios, we would prefer a Stream Aggregate here because that operator is only blocking per group, and does not require a memory grant. Using this option, the client application would start receiving data earlier, would not have to wait for memory to be granted, and the SQL Server can use the memory for other purposes.
We can require the query optimizer to use a Stream Aggregate for this query by adding an
OPTION (ORDER GROUP) query hint. This results in the following execution plan:
The Sort operator is fully blocking and also requires a memory grant, so this plan appears to be worse than simply using a hash aggregate. But why is the sort needed? The properties show that the rows are being sorted in the order specified by our
GROUP BY clause:
This sort is expected because partition-aligning the index (in SQL Server 2008 onward) means the partition number is added as a leading column of the index. In effect, the nonclustered index keys are (partition, user, session, location) due to the partitioning. Rows in the index are still sorted by user, session, and location, but only within each partition.
If we restrict the query to a single partition, the optimizer ought to be able to use the index to feed a Stream Aggregate without sorting. In case that requires some explanation, specifying a single partition means the query plan can eliminate all other partitions from the nonclustered index scan, resulting in a stream of rows that is ordered by (user, session, location).
We can achieve this partition elimination explicitly using the
SELECT LocationID, UserID, SessionID, COUNT_BIG(*) FROM dbo.T3 WHERE $PARTITION.PF(RowID) = 1 GROUP BY LocationID, UserID, SessionID;
Unfortunately, this query still uses a Hash Aggregate, with an estimated plan cost of 0.287878:
The scan is now just over one partition, but the (user, session, location) ordering has not helped the optimizer use a Stream Aggregate. You might object that (user, session, location) ordering is not helpful because the
GROUP BY clause is (location, user, session), but the key order does not matter for a grouping operation.
Let's add an
ORDER BY clause in the order of the index keys to prove the point:
SELECT LocationID, UserID, SessionID, COUNT_BIG(*) FROM dbo.T3 WHERE $PARTITION.PF(RowID) = 1 GROUP BY LocationID, UserID, SessionID ORDER BY UserID, SessionID, LocationID;
Notice that the
ORDER BY clause matches the nonclustered index key order, though the
GROUP BY clause does not. The execution plan for this query is:
Now we have the Stream Aggregate we were after, with an estimated plan cost of 0.0423925 (compared with 0.287878 for the Hash Aggregate plan – almost 7 times more).
The other way to achieve a Stream Aggregate here is to reorder the
GROUP BY columns to match the nonclustered index keys:
SELECT LocationID, UserID, SessionID, COUNT_BIG(*) FROM dbo.T3 AS T1 WHERE $PARTITION.PF(RowID) = 1 GROUP BY UserID, SessionID, LocationID;
This query produces the same Stream Aggregate plan shown immediately above, with exactly the same cost. This sensitivity to
GROUP BY column order is specific to partitioned table queries in SQL Server 2008 and later.
You may recognize that the root cause of the problem here is similar to the previous case involving a Merge Join. Both Merge Join and Stream Aggregate require input sorted on the join or aggregation keys, but neither cares about the order of those keys. A Merge Join on (x, y, z) is just as happy receiving rows ordered by (y, z, x) or (z, y, x) and the same is true for Stream Aggregate.
This optimizer limitation also applies to
DISTINCT in the same circumstances. The following query results in a Hash Aggregate plan with an estimated cost of 0.286539:
SELECT DISTINCT LocationID, UserID, SessionID FROM dbo.T3 AS T1 WHERE $PARTITION.PF(RowID) = 1;
If we write the
DISTINCT columns in the order of the nonclustered index keys…
SELECT DISTINCT UserID, SessionID, LocationID FROM dbo.T3 AS T1 WHERE $PARTITION.PF(RowID) = 1;
…we are rewarded with a Stream Aggregate plan with a cost of 0.041455:
To summarize, this is a limitation of the query optimizer in SQL Server 2008 and later (including SQL Server 2014 CTP 1) that is not resolved by using trace flag 4199 as was the case for the Merge Join example. The problem only occurs with partitioned tables with a
GROUP BY or
DISTINCT over three or more columns using an aligned partitioned index, where a single partition is processed.
As with the Merge Join example, this represents a backward step from the SQL Server 2005 behaviour. SQL Server 2005 did not add an implied leading key to partitioned indexes, using an
APPLY technique instead. In SQL Server 2005, all the queries presented here using
$PARTITION to specify a single partition result in query plans that performs partition elimination and use Stream Aggregates without any query text reordering.
The changes to partitioned table processing in SQL Server 2008 improved performance in several important areas, primarily related to the efficient parallel processing of partitions. Unfortunately, these changes had side effects which have not all been resolved in later releases.