ANY aggregate is not something we can write directly in Transact SQL. It is an internal only feature used by the query optimizer and execution engine.
I am personally quite fond of the
ANY aggregate, so it was a bit disappointing to learn that it is broken in quite a fundamental way. The particular flavour of 'broken' I am referring to here is the wrong-results variety.
In this post, I take a look at two particular places where the
ANY aggregate commonly shows up, demonstrate the wrong results problem, and suggest workarounds where necessary.
For background on the
ANY aggregate, please see my previous post Undocumented Query Plans: The ANY Aggregate.
1. One row per group queries
This must be one of the most common day-today query requirements, with a very well-known solution. You probably write this sort of query every day, automatically following the pattern, without really thinking about it.
The idea is to number the input set of rows using the
ROW_NUMBER window function, partitioned by the grouping column or columns. That is wrapped in a Common Table Expression or derived table, and filtered down to rows where the computed row number is equal to one. Since the
ROW_NUMBER restarts at one for each group, this gives us the required one row per group.
There is no problem with that general pattern. The type of one row per group query that is subject to the
ANY aggregate problem is the one where we do not care which particular row is selected from each group.
In that case, it is not clear which column should be used in the mandatory
ORDER BY clause of the
ROW_NUMBER window function. After all, we explicitly do not care which row is selected. One common approach is to reuse the
PARTITION BY column(s) in the
ORDER BY clause. This is where the problem might occur.
Let us look at an example using a toy data set:
CREATE TABLE #Data ( c1 integer NULL, c2 integer NULL, c3 integer NULL ); INSERT #Data (c1, c2, c3) VALUES -- Group 1 (1, NULL, 1), (1, 1, NULL), (1, 111, 111), -- Group 2 (2, NULL, 2), (2, 2, NULL), (2, 222, 222);
The requirement is to return any one complete row of data from each group, where group membership is defined by the value in column
ROW_NUMBER pattern, we might write a query like the following (notice the
ORDER BY clause of the
ROW_NUMBER window function matches the
PARTITION BY clause):
WITH Numbered AS ( SELECT D.*, rn = ROW_NUMBER() OVER ( PARTITION BY D.c1 ORDER BY D.c1) FROM #Data AS D ) SELECT N.c1, N.c2, N.c3 FROM Numbered AS N WHERE N.rn = 1;
As presented, this query executes successfully, with correct results. The results are technically non-deterministic since SQL Server could validly return any one of the rows in each group. Nevertheless, if you run this query yourself, you are quite likely to see the same result I do:
The execution plan depends on the version of SQL Server used, and does not depend on database compatibility level.
On SQL Server 2014 and earlier, the plan is:
For SQL Server 2016 or later, you will see:
Both plans are safe, but for different reasons. The Distinct Sort plan contains an
ANY aggregate, but the Distinct Sort operator implementation does not manifest the bug.
The more complex SQL Server 2016+ plan does not use the
ANY aggregate at all. The Sort puts the rows into the order needed for the row numbering operation. The Segment operator sets a flag at the start of each new group. The Sequence Project computes the row number. Finally, the Filter operator passes on only those rows that have a computed row number of one.
To get incorrect results with this data set, we need to be using SQL Server 2014 or earlier, and the
ANY aggregates need to be implemented in a Stream Aggregate or Eager Hash Aggregate operator (Flow Distinct Hash Match Aggregate does not produce the bug).
One way to encourage the optimizer to choose a Stream Aggregate instead of Distinct Sort is to add a clustered index to provide ordering by column
CREATE CLUSTERED INDEX c ON #Data (c1);
After that change, the execution plan becomes:
ANY aggregates are visible in the Properties window when the Stream Aggregate operator is selected:
The result of the query is:
This is wrong. SQL Server has returned rows that do not exist in the source data. There are no source rows where
c2 = 1 and
c3 = 1 for example. As a reminder, the source data is:
The execution plan erroneously computes separate
ANY aggregates for the
c3 columns, ignoring nulls. Each aggregate independently returns the first non-null value it encounters, giving a result where the values for
c3 come from different source rows. This is not what the original SQL query specification requested.
The same wrong result can be produced with or without the clustered index by adding an
OPTION (HASH GROUP) hint to produce a plan with an Eager Hash Aggregate instead of a Stream Aggregate.
This problem can only occur when multiple
ANY aggregates are present, and the aggregated data contains nulls. As noted, the issue only affects Stream Aggregate and Eager Hash Aggregate operators; Distinct Sort and Flow Distinct are not affected.
SQL Server 2016 onward makes an effort to avoid introducing multiple
ANY aggregates for the any one row per group row numbering query pattern when the source columns are nullable. When this happens, the execution plan will contain Segment, Sequence Project, and Filter operators instead of an aggregate. This plan shape is always safe, since no
ANY aggregates are used.
Reproducing the bug in SQL Server 2016+
The SQL Server optimizer is not perfect at detecting when a column originally constrained to be
NOT NULL might still produce a null intermediate value through data manipulations.
To reproduce this, we will start with a table where all columns are declared as
IF OBJECT_ID(N'tempdb..#Data', N'U') IS NOT NULL BEGIN DROP TABLE #Data; END; CREATE TABLE #Data ( c1 integer NOT NULL, c2 integer NOT NULL, c3 integer NOT NULL ); CREATE CLUSTERED INDEX c ON #Data (c1); INSERT #Data (c1, c2, c3) VALUES -- Group 1 (1, 1, 1), (1, 2, 2), (1, 3, 3), -- Group 2 (2, 1, 1), (2, 2, 2), (2, 3, 3);
We can produce nulls from this data set in many ways, most of which the optimizer can successfully detect, and so avoid introducing
ANY aggregates during optimization.
One way to add nulls that happens to slip under the radar is shown below:
SELECT D.c1, OA1.c2, OA2.c3 FROM #Data AS D OUTER APPLY (SELECT D.c2 WHERE D.c2 <> 1) AS OA1 OUTER APPLY (SELECT D.c3 WHERE D.c3 <> 2) AS OA2;
That query produces the following output:
The next step is to use that query specification as the source data for the standard "any one row per group" query:
WITH SneakyNulls AS ( -- Introduce nulls the optimizer can't see SELECT D.c1, OA1.c2, OA2.c3 FROM #Data AS D OUTER APPLY (SELECT D.c2 WHERE D.c2 <> 1) AS OA1 OUTER APPLY (SELECT D.c3 WHERE D.c3 <> 2) AS OA2 ), Numbered AS ( SELECT D.c1, D.c2, D.c3, rn = ROW_NUMBER() OVER ( PARTITION BY D.c1 ORDER BY D.c1) FROM SneakyNulls AS D ) SELECT N.c1, N.c2, N.c3 FROM Numbered AS N WHERE N.rn = 1;
On any version of SQL Server, that produces the following plan:
The Stream Aggregate contains multiple
ANY aggregates, and the result is wrong. Neither of the returned rows appear in the source data set:
The only fully reliable workaround until this bug is fixed is to avoid the pattern where the
ROW_NUMBER has the same column in the
ORDER BY clause as is in the
PARTITION BY clause.
When we do not care which one row is selected from each group, it is unfortunate that an
ORDER BY clause is needed at all. One way to side-step the issue is to use a run time constant like
ORDER BY @@SPID in the window function.
2. Non-deterministic update
The issue with multiple
ANY aggregates on nullable inputs is not restricted to the any one row per group query pattern. The query optimizer can introduce an internal
ANY aggregate in a number of circumstances. One of those case is a non-deterministic update.
A non-deterministic update is where the statement does not guarantee that each target row will be updated at most once. In other words, there are multiple source rows for at least one target row. The documentation explicitly warns about this:
The results of an UPDATE statement are undefined if the statement includes a FROM clause that is not specified in such a way that only one value is available for each column occurrence that is updated, that is if the UPDATE statement is not deterministic.
To handle a non-deterministic update, the optimizer groups the rows by a key (index or RID) and applies
ANY aggregates to the remaining columns. The basic idea there is to choose one row from multiple candidates, and use values from that row to perform the update. There are obvious parallels to the previous
ROW_NUMBER issue, so it is no surprise that it is quite easy to demonstrate an incorrect update.
Unlike the previous issue, SQL Server currently takes no special steps to avoid multiple
ANY aggregates on nullable columns when performing a non-deterministic update. The following therefore relates to all SQL Server versions, including SQL Server 2019 CTP 3.0.
DECLARE @Target table ( c1 integer PRIMARY KEY, c2 integer NOT NULL, c3 integer NOT NULL ); DECLARE @Source table ( c1 integer NULL, c2 integer NULL, c3 integer NULL, INDEX c CLUSTERED (c1) ); INSERT @Target (c1, c2, c3) VALUES (1, 0, 0); INSERT @Source (c1, c2, c3) VALUES (1, 2, NULL), (1, NULL, 3); UPDATE T SET T.c2 = S.c2, T.c3 = S.c3 FROM @Target AS T JOIN @Source AS S ON S.c1 = T.c1; SELECT * FROM @Target AS T;
Logically, this update should always produce an error: The target table does not allow nulls in any column. Whichever matching row is chosen from the source table, an attempt to update column
c3 to null must occur.
Unfortunately, the update succeeds, and the final state of the target table is inconsistent with the supplied data:
I have reported this as a bug. The work around is to avoid writing non-deterministic
UPDATE statements, so
ANY aggregates are not needed to resolve the ambiguity.
As mentioned, SQL Server can introduce
ANY aggregates in more circumstances than the two examples given here. If this happens when the aggregated column contains nulls, there is the potential for wrong results.