Itzik Ben-Gan

Nested window functions in SQL

SentryOne Monitor : Save 40% for a Limited Time!
SentryOne eBooks

In these books, you will find useful, hand-picked articles that will help give insight into some of your most vexing performance problems. These articles were written by several of the SQL Server industry’s leading experts, including Paul White, Paul Randal, Jonathan Kehayias, Erin Stellato, Glenn Berry, Aaron Bertrand, and Joe Sack.

Register to Download

Featured Author

Erin Stellato is a Principal Consultant with SQLskills and a Microsoft Data Platform MVP.

Erin’s Posts

The ISO/IEC 9075:2016 standard (SQL:2016) defines a feature called nested window functions. This feature allows you to nest two kinds of window functions as an argument of a window aggregate function. The idea is to allow you to refer to either a row number, or to a value of an expression, at strategic markers in windowing elements. The markers give you access to the first or last row in the partition, the first or last row in the frame, the current outer row, and the current frame row. This idea is very powerful, enabling you to apply filtering and other kinds of manipulations within your window function that are sometimes difficult to achieve otherwise. You can also use nested window functions to easily emulate other features, like RANGE-based frames. This feature is currently not available in T-SQL. I posted a suggestion to improve SQL Server by adding support for nested window functions. Make sure to add your vote if you feel that this feature could be beneficial to you.

What nested window functions are not about

At the date of this writing, there's not much information available out there about the true standard nested window functions. What makes it harder is that I don't know of any platform that implemented this feature yet. In fact, running a web search for nested window functions returns mostly coverage of and discussions about nesting grouped aggregate functions within windowed aggregate functions. For example, suppose that you want to query the Sales.OrderValues view in the TSQLV5 sample database, and return for each customer and order date, the daily total of the order values, and the running total until the current day. Such a task involves both grouping and windowing. You group the rows by the customer ID and the order date, and apply a running sum on top of the group sum of the order values, like so:

  USE TSQLV5; -- http://tsql.solidq.com/SampleDatabases/TSQLV5.zip
 
  SELECT custid, orderdate, SUM(val) AS daytotal,
    SUM(SUM(val)) OVER(PARTITION BY custid
                       ORDER BY orderdate
                       ROWS UNBOUNDED PRECEDING) AS runningsum
  FROM Sales.OrderValues
  GROUP BY custid, orderdate;

This query generates the following output, shown here in abbreviated form:

  custid  orderdate   daytotal runningsum
  ------- ----------  -------- ----------
  1       2018-08-25    814.50     814.50
  1       2018-10-03    878.00    1692.50
  1       2018-10-13    330.00    2022.50
  1       2019-01-15    845.80    2868.30
  1       2019-03-16    471.20    3339.50
  1       2019-04-09    933.50    4273.00
  2       2017-09-18     88.80      88.80
  2       2018-08-08    479.75     568.55
  2       2018-11-28    320.00     888.55
  2       2019-03-04    514.40    1402.95
  ...

Even though this technique is pretty cool, and even though web searches for nested window functions return mainly such techniques, that's not what the SQL standard means by nested window functions. Since I couldn't find any information out there on the topic, I just had to figure it out from the standard itself. Hopefully, this article will increase awareness of the true nested window functions feature, and cause people to turn to Microsoft and ask to add support for it in SQL Server.

What nested window functions are about

Nested window functions include two functions that you can nest as an argument of a window aggregate function. Those are the nested row number function, and the nested value_of expression at row function.

Nested row number function

The nested row number function allows you to refer to the row number of strategic markers in windowing elements. Here's the syntax of the function:

<agg_func>(< argument with ROW_NUMBER(<row marker>) >) OVER(<specification>)

The row markers that you can specify are:

  • BEGIN_PARTITION
  • END_PARTITION
  • BEGIN_FRAME
  • END_FRAME
  • CURRENT_ROW
  • FRAME_ROW

The first four markers are self-explanatory. As for the last two, the CURRENT_ROW marker represents the current outer row, and the FRAME_ROW represents the current inner frame row.

As an example for using the nested row number function, consider the following task. You need to query the Sales.OrderValues view, and return for each order some of its attributes, as well as the difference between the current order value and the customer average, but excluding the first and last customer orders from the average.

This task is achievable without nested window functions, but the solution involves quite a few steps:

  WITH C1 AS
  (
    SELECT custid, val,
      ROW_NUMBER() OVER( PARTITION BY custid
                         ORDER BY orderdate, orderid ) AS rownumasc,
      ROW_NUMBER() OVER( PARTITION BY custid
                         ORDER BY orderdate DESC, orderid DESC ) AS rownumdesc
    FROM Sales.OrderValues
  ),
  C2 AS
  (
    SELECT custid, AVG(val) AS avgval
    FROM C1
    WHERE 1 NOT IN (rownumasc, rownumdesc)
    GROUP BY custid
  )
  SELECT O.orderid, O.custid, O.orderdate, O.val,
    O.val - C2.avgval AS diff
  FROM Sales.OrderValues AS O
    LEFT OUTER JOIN C2
      ON O.custid = C2.custid;

Here's the output of this query, shown here in abbreviated form:

  orderid  custid  orderdate  val       diff
  -------- ------- ---------- --------  ------------
  10411    10      2018-01-10   966.80   -570.184166
  10743    4       2018-11-17   319.20   -809.813636
  11075    68      2019-05-06   498.10  -1546.297500
  10388    72      2017-12-19  1228.80   -358.864285
  10720    61      2018-10-28   550.00   -144.744285
  11052    34      2019-04-27  1332.00  -1164.397500
  10457    39      2018-02-25  1584.00   -797.999166
  10789    23      2018-12-22  3687.00   1567.833334
  10434    24      2018-02-03   321.12  -1329.582352
  10766    56      2018-12-05  2310.00   1015.105000
  ...

Using nested row number functions, the task is achievable with a single query, like so:

  SELECT orderid, custid, orderdate, val,
    val - AVG( CASE
                 WHEN ROW_NUMBER(FRAME_ROW) NOT IN
                        ( ROW_NUMBER(BEGIN_PARTITION), ROW_NUMBER(END_PARTITION) ) THEN val
               END )
            OVER( PARTITION BY custid
                  ORDER BY orderdate, orderid
                  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS diff
  FROM Sales.OrderValues;

Also, the currently supported solution requires at least one sort in the plan, and multiple passes over the data. The solution using nested row number functions has all the potential to get optimized with reliance on index order, and a reduced number of passes over the data. This, of course, is implementation dependent though.

Nested value_of expression at row function

The nested value_of expression at row function enables you to interact with a value of an expression at the same strategic row markers mentioned earlier in an argument of a window aggregate function. Here's the syntax of this function:

<agg_func>(< argument with
VALUE OF <expression> AT <row marker> [<delta>] [, <default>]
>) OVER(<specification>)

As you can see, you can specify a certain negative or positive delta with respect to the row marker, and optionally provide a default value in case a row doesn't exist at the specified position.

This capability gives you a lot of power when you need to interact with different points in windowing elements. Consider the fact that as powerful as window functions may be compared to alternative tools like subqueries, what window functions don't support is a basic concept of a correlation. Using the CURRENT_ROW marker you get access to the outer row, and this way emulate correlations. At the same time you get to benefit from all of the advantages that window functions have compared to subqueries.

As an example, suppose that you need to query the Sales.OrderValues view, and return for each order some of its attributes, as well as the difference between the current order value and the customer average, but excluding orders placed on the same date as the current order date. This requires a capability similar to a correlation. With the nested value_of expression at row function, using the CURRENT_ROW marker, this is achievable easily like so:

  SELECT orderid, custid, orderdate, val,
    val - AVG( CASE WHEN orderdate <> VALUE OF orderdate AT CURRENT_ROW THEN val END )
            OVER( PARTITION BY custid ) AS diff
  FROM Sales.OrderValues;

This query is supposed to generate the following output:

  orderid  custid  orderdate  val       diff
  -------- ------- ---------- --------  ------------
  10248    85      2017-07-04   440.00    180.000000
  10249    79      2017-07-05  1863.40   1280.452000
  10250    34      2017-07-08  1552.60   -854.228461
  10251    84      2017-07-08   654.06   -293.536666
  10252    76      2017-07-09  3597.90   1735.092728
  10253    34      2017-07-10  1444.80   -970.320769
  10254    14      2017-07-11   556.62  -1127.988571
  10255    68      2017-07-12  2490.50    617.913334
  10256    88      2017-07-15   517.80   -176.000000
  10257    35      2017-07-16  1119.90   -153.562352
  ...

If you're thinking that this task is achievable just as easily with correlated subqueries, in this simplistic case you'd be right. The same can be achieved with the following query:

  SELECT O1.orderid, O1.custid, O1.orderdate, O1.val,
    O1.val - ( SELECT AVG(O2.val)
               FROM Sales.OrderValues AS O2
               WHERE O2.custid = O1.custid
                 AND O2.orderdate <> O1.orderdate ) AS diff
  FROM Sales.OrderValues AS O1;

However, remember that a subquery operates on an independent view of the data, whereas a window function operates on the set that is provided as input to the logical query processing step that handles the SELECT clause. Usually, the underlying query has extra logic like joins, filters, grouping, and such. With subqueries, you either need to prepare a preliminary CTE, or repeat the logic of the underlying query also in the subquery. With window functions, there's no need to repeat any of the logic.

For example, say that you were supposed to operate only on shipped orders (where the shipped date is not NULL) that were handled by employee 3. The solution with the window function needs to add the filter predicates only once, like so:

   SELECT orderid, custid, orderdate, val,
    val - AVG( CASE WHEN orderdate <> VALUE OF orderdate AT CURRENT_ROW THEN val END )
            OVER( PARTITION BY custid ) AS diff
  FROM Sales.OrderValues
  WHERE empid = 3 AND shippeddate IS NOT NULL;

This query is supposed to generate the following output:

  orderid  custid  orderdate  val      diff
  -------- ------- ---------- -------- -------------
  10251    84      2017-07-08   654.06   -459.965000
  10253    34      2017-07-10  1444.80    531.733334
  10256    88      2017-07-15   517.80  -1022.020000
  10266    87      2017-07-26   346.56          NULL
  10273    63      2017-08-05  2037.28  -3149.075000
  10283    46      2017-08-16  1414.80    534.300000
  10309    37      2017-09-19  1762.00  -1951.262500
  10321    38      2017-10-03   144.00          NULL
  10330    46      2017-10-16  1649.00    885.600000
  10332    51      2017-10-17  1786.88    495.830000
  ...

The solution with the subquery needs to add the filter predicates twice—once in the outer query and once in the subquery—like so:

  SELECT O1.orderid, O1.custid, O1.orderdate, O1.val,
    O1.val - ( SELECT AVG(O2.val)
               FROM Sales.OrderValues AS O2
               WHERE O2.custid = O1.custid
                 AND O2.orderdate <> O1.orderdate
                 AND empid = 3
                 AND shippeddate IS NOT NULL) AS diff
  FROM Sales.OrderValues AS O1
  WHERE empid = 3 AND shippeddate IS NOT NULL;

It's either this, or adding a preliminary CTE that takes care of all of the filtering and any other logic. Anyway you look at it, with subqueries, there are more complexity layers involved.

The other benefit in nested window functions is that if we had support for those in T-SQL, it would have been easy to emulate the missing full support for the RANGE window frame unit. The RANGE option is supposed to enable you to define dynamic frames that are based on an offset from the ordering value in the current row. For example, suppose that you need to compute for each customer order from the Sales.OrderValues view the moving average value of the last 14 days. According to the SQL standard, you can achieve this using the RANGE option and the INTERVAL type, like so:

  SELECT orderid, custid, orderdate, val,
    AVG(val) OVER( PARTITION BY custid
                   ORDER BY orderdate
                   RANGE BETWEEN INTERVAL '13' DAY PRECEDING
                             AND CURRENT ROW ) AS movingavg14days
  FROM Sales.OrderValues;

This query is supposed to generate the following output:

  orderid  custid  orderdate  val     movingavg14days
  -------- ------- ---------- ------- ---------------
  10643    1       2018-08-25  814.50      814.500000
  10692    1       2018-10-03  878.00      878.000000
  10702    1       2018-10-13  330.00      604.000000
  10835    1       2019-01-15  845.80      845.800000
  10952    1       2019-03-16  471.20      471.200000
  11011    1       2019-04-09  933.50      933.500000
  10308    2       2017-09-18   88.80       88.800000
  10625    2       2018-08-08  479.75      479.750000
  10759    2       2018-11-28  320.00      320.000000
  10926    2       2019-03-04  514.40      514.400000
  10365    3       2017-11-27  403.20      403.200000
  10507    3       2018-04-15  749.06      749.060000
  10535    3       2018-05-13 1940.85     1940.850000
  10573    3       2018-06-19 2082.00     2082.000000
  10677    3       2018-09-22  813.37      813.370000
  10682    3       2018-09-25  375.50      594.435000
  10856    3       2019-01-28  660.00      660.000000
  ...

At the date of this writing, this syntax is not supported in T-SQL. If we had support for nested window functions in T-SQL, you would have been able to emulate this query with the following code:

  SELECT orderid, custid, orderdate, val,
    AVG( CASE WHEN DATEDIFF(day, orderdate, VALUE OF orderdate AT CURRENT_ROW) 
                     BETWEEN 0 AND 13
                THEN val END )
      OVER( PARTITION BY custid
            ORDER BY orderdate
            RANGE UNBOUNDED PRECEDING ) AS movingavg14days
  FROM Sales.OrderValues;

What's not to like?

Cast your vote

The standard nested window functions seem like a very powerful concept that enables a lot of flexibility in interacting with different points in windowing elements. I'm quite surprised that I cannot find any coverage of the concept other than in the standard itself, and that I don't see many platforms implementing it. Hopefully this article will increase awareness for this feature. If you feel that it could be useful for you to have it available in T-SQL, make sure to cast your vote!