The challenge is on! Community call for creating the fastest number series generator

December 9, 2020 by Itzik Ben-Gan in T-SQL Queries | 72 Comments

In Part 5 of my series on table expressions I provided the following solution for generating a series of numbers using CTEs, a table value constructor and cross joins:

DECLARE @low AS BIGINT = 1001, @high AS BIGINT = 1010;

WITH
  L0 AS ( SELECT 1 AS c FROM (VALUES(1),(1)) AS D(c) ),
  L1 AS ( SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B ),
  L2 AS ( SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B ),
  L3 AS ( SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B ),
  L4 AS ( SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B ),
  L5 AS ( SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B ),
  Nums AS ( SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
            FROM L5 )
SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
FROM Nums
ORDER BY rownum;

There are many practical use cases for such a tool, including generating a series of date and time values, creating sample data, and more. Recognizing the common need, some platforms provide a built-in tool, such as PostgreSQL’s generate_series function. At the time of writing, T-SQL doesn’t provide such a built-in tool, but one can always hope and vote for such a tool to be added in the future.

In a comment to my article, Marcos Kirchner mentioned that he tested my solution with varying table value constructor cardinalities, and got different execution times for the different cardinalities.

I always used my solution with a base table value constructor cardinality of 2, but Marcos’ comment made me think. This tool is so useful that we as a community should join forces to try and create the fastest version that we possibly can. Testing different base table cardinalities is just one dimension to try. There could be many others. I’ll present the performance tests that I’ve done with my solution. I mainly experimented with different table value constructor cardinalities, with serial versus parallel processing, and with row mode versus batch mode processing. However, it could be that an entirely different solution is even faster than my best version. So, the challenge is on! I’m calling all jedi, padawan, wizard and apprentice alike. What’s the best performing solution that you can conjure? Do you have it within you to beat the fastest solution posted thus far? If so, share yours as a comment to this article, and feel free to improve any solution posted by others.

Requirements:

Implement your solution as an inline table-valued function (iTVF) named dbo.GetNumsYourName with parameters @low AS BIGINT and @high AS BIGINT. As an example, see the ones I submit at the end of this article.
You can create supporting tables in the user database if needed.
You can add hints as needed.
As mentioned, the solution should support delimiters of the BIGINT type, but you can assume a maximum series cardinality of 4,294,967,296.
To evaluate the performance of your solution and compare it with others, I’ll test it with the range 1 through 100,000,000, with Discard results after execution enabled in SSMS.

Good luck to all of us! May the best community win. ;)

Different cardinalities for base table value constructor

I experimented with varying cardinalities of the base CTE, starting with 2 and advancing in a logarithmic scale, squaring the previous cardinality in each step: 2, 4, 16 and 256.

Before you start experimenting with different base cardinalities it could be helpful to have a formula that given the base cardinality and maximum range cardinality would tell you how many levels of CTEs you need. As a preliminary step, it’s easier to first come up with a formula that given the base cardinality and number of levels of CTEs, computes what’s the maximum resultant range cardinality. Here’s such a formula expressed in T-SQL:

DECLARE @basecardinality AS INT = 2, @levels AS INT = 5;

SELECT POWER(1.*@basecardinality, POWER(2., @levels));

With the above sample input values this expression yields a maximum range cardinality of 4,294,967,296.

Then, the inverse formula for computing the number of CTE levels needed involves nesting two log functions, like so:

DECLARE @basecardinality AS INT = 2, @seriescardinality AS BIGINT = 4294967296;

SELECT CEILING(LOG(LOG(@seriescardinality, @basecardinality), 2));

With the above sample input values this expression yields 5. Note that this number is in addition to the base CTE that has the table value constructor, which I named L0 (for level 0) in my solution.

Don’t ask me how I got to these formulas. The story I’m sticking to is that Gandalf uttered them to me in Elvish in my dreams.

Let’s proceed to performance testing. Make sure that you enable Discard results after execution in your SSMS Query Options dialog under Grid, Results. Use the following code to run a test with base CTE cardinality of 2 (requires 5 additional levels of CTEs):

DECLARE @low AS BIGINT = 1, @high AS BIGINT = 100000000;

WITH
  L0 AS ( SELECT 1 AS c FROM (VALUES(1),(1)) AS D(c) ),
  L1 AS ( SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B ),
  L2 AS ( SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B ),
  L3 AS ( SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B ),
  L4 AS ( SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B ),
  L5 AS ( SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B ),
  Nums AS ( SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
            FROM L5 )
SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
FROM Nums
ORDER BY rownum;

I got the plan shown in Figure 1 for this execution.

Figure 1: Plan for base CTE cardinality of 2

The plan is serial, and all operators in the plan use row mode processing by default. If you’re getting a parallel plan by default, e.g., when encapsulating the solution in an iTVF and using a large range, for now force a serial plan with a MAXDOP 1 hint.

Observe how the unpacking of the CTEs resulted in 32 instances of the Constant Scan operator, each representing a table with two rows.

I got the following performance statistics for this execution:

CPU time = 30188 ms,  elapsed time = 32844 ms.

Use the following code to test the solution with a base CTE cardinality of 4, which per our formula requires four levels of CTEs:

DECLARE @low AS BIGINT = 1, @high AS BIGINT = 100000000;

WITH
  L0 AS ( SELECT 1 AS c FROM (VALUES(1),(1),(1),(1)) AS D(c) ),
  L1 AS ( SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B ),
  L2 AS ( SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B ),
  L3 AS ( SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B ),
  L4 AS ( SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B ),
  Nums AS ( SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
            FROM L4 )
SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
FROM Nums
ORDER BY rownum;

I got the plan shown in Figure 2 for this execution.

Figure 2: Plan for base CTE cardinality of 4

The unpacking of the CTEs resulted in 16 Constant Scan operators, each representing a table of 4 rows.

I got the following performance statistics for this execution:

CPU time = 23781 ms,  elapsed time = 25435 ms.

This is a decent improvement of 22.5 percent over the previous solution.

Examining wait stats reported for the query, the dominant wait type is SOS_SCHEDULER_YIELD. Indeed, the wait count curiously dropped by 22.8 percent compared to the first solution (wait count 15,280 versus 19,800).

Use the following code to test the solution with a base CTE cardinality of 16, which per our formula requires three levels of CTEs:

DECLARE @low AS BIGINT = 1, @high AS BIGINT = 100000000;

WITH
  L0 AS ( SELECT 1 AS c 
          FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1)) AS D(c) ),
  L1 AS ( SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B ),
  L2 AS ( SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B ),
  L3 AS ( SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B ),
  Nums AS ( SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
            FROM L3 )
SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
FROM Nums
ORDER BY rownum;

I got the plan shown in Figure 3 for this execution.

Figure 3: Plan for base CTE cardinality of 16

This time the unpacking of the CTEs resulted in 8 Constant Scan operators, each representing a table with 16 rows.

I got the following performance statistics for this execution:

CPU time = 22968 ms,  elapsed time = 24409 ms.

This solution further reduces the elapsed time, albeit by just a few additional percent, amounting to a reduction of 25.7 percent compared to the first solution. Again, the wait count of the SOS_SCHEDULER_YIELD wait type keeps dropping (12,938).

Advancing in our logarithmic scale, the next test involves a base CTE cardinality of 256. It’s long and ugly, but give it a try:

DECLARE @low AS BIGINT = 1, @high AS BIGINT = 100000000;

WITH
  L0 AS ( SELECT 1 AS c 
          FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1)) AS D(c) ),
  L1 AS ( SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B ),
  L2 AS ( SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B ),
  Nums AS ( SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
            FROM L2 )
SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
FROM Nums
ORDER BY rownum;

I got the plan shown in Figure 4 for this execution.

Figure 4: Plan for base CTE cardinality of 256

This time the unpacking of the CTEs resulted in only four Constant Scan operators, each with 256 rows.

I got the following performance numbers for this execution:

CPU time = 23516 ms,  elapsed time = 25529 ms.

This time it seems like the performance degraded a bit compared to the previous solution with a base CTE cardinality of 16. Indeed, the wait count of the SOS_SCHEDULER_YIELD wait type increased a bit to 13,176. So, it would seem that we found our golden number—16!

Parallel versus serial plans

I experimented with forcing a parallel plan using the hint ENABLE_PARALLEL_PLAN_PREFERENCE, but it ended up hurting performance. In fact, when implementing the solution as an iTVF, I got a parallel plan on my machine by default for large ranges, and had to force a serial plan with a MAXDOP 1 hint to get optimal performance.

Batch processing

The main resource used in the plans for my solutions is CPU. Given that batch processing is all about improving CPU efficiency, especially when dealing with large numbers of rows, it’s worthwhile to try this option. The main activity here that can benefit from batch processing is the row number computation. I tested my solutions in SQL Server 2019 Enterprise edition. SQL Server chose row mode processing for all previously shown solutions by default. Apparently, this solution didn’t pass the heuristics required to enable batch mode on rowstore. There are a couple of ways to get SQL Server to use batch processing here.

Option 1 is to involve a table with a columnstore index in the solution. You can achieve this by creating a dummy table with a columnstore index and introduce a dummy left join in the outermost query between our Nums CTE and that table. Here’s the dummy table definition:

CREATE TABLE dbo.BatchMe(col1 INT NOT NULL, INDEX idx_cs CLUSTERED COLUMNSTORE);

Then revise the outer query against Nums to use FROM Nums LEFT OUTER JOIN dbo.BatchMe ON 1 = 0. Here’s an example with a base CTE cardinality of 16:

DECLARE @low AS BIGINT = 1, @high AS BIGINT = 100000000;

WITH
  L0 AS ( SELECT 1 AS c 
          FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1)) AS D(c) ),
  L1 AS ( SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B ),
  L2 AS ( SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B ),
  L3 AS ( SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B ),
  Nums AS ( SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
            FROM L3 )
SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
FROM Nums LEFT OUTER JOIN dbo.BatchMe ON 1 = 0
ORDER BY rownum;

I got the plan shown in Figure 5 for this execution.

Figure 5: Plan with batch processing

Observe the use of the batch mode Window Aggregate operator to compute the row numbers. Also observe that the plan doesn’t involve the dummy table. The optimizer optimized it out.

The upside of option 1 is that it works in all SQL Server editions and is relevant in SQL Server 2016 or later, since the batch mode Window Aggregate operator was introduced in SQL Server 2016. The downside is the need to create the dummy table and include it in the solution.

Option 2 to get batch processing for our solution, provided that you’re using SQL Server 2019 Enterprise edition, is to use the undocumented self-explanatory hint OVERRIDE_BATCH_MODE_HEURISTICS (details in Dmitry Pilugin’s article), like so:

DECLARE @low AS BIGINT = 1, @high AS BIGINT = 100000000;

WITH
  L0 AS ( SELECT 1 AS c 
          FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),
                      (1),(1),(1),(1),(1),(1),(1),(1)) AS D(c) ),
  L1 AS ( SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B ),
  L2 AS ( SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B ),
  L3 AS ( SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B ),
  Nums AS ( SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
            FROM L3 )
SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
FROM Nums
ORDER BY rownum
OPTION(USE HINT('OVERRIDE_BATCH_MODE_HEURISTICS'));

The upside of option 2 is that you don’t need to create a dummy table and involve it in your solution. The downsides are that you need to use Enterprise edition, use at minimum SQL Server 2019 where batch mode on rowstore was introduced, and the solution involves using an undocumented hint. For these reasons, I prefer option 1.

Here are the performance numbers that I got for the various base CTE cardinalities:

Cardinality 2:   CPU time = 21594 ms,  elapsed time = 22743 ms (down from 32844).

Cardinality 4:   CPU time = 18375 ms,  elapsed time = 19394 ms (down from 25435).

Cardinality 16:  CPU time = 17640 ms,  elapsed time = 18568 ms (down from 24409).

Cardinality 256: CPU time = 17109 ms,  elapsed time = 18507 ms (down from 25529).

Figure 6 has a performance comparison between the different solutions:

Figure 6: Performance comparison

You can observe a decent performance improvement of 20-30 percent over the row mode counterparts.

Curiously, with batch mode processing the solution with base CTE cardinality of 256 did best. However, it’s just a tiny bit faster than the solution with base CTE cardinality of 16. The difference is so minor, and the latter has a clear advantage in terms of code brevity, that I’d stick to 16.

So, my tuning efforts ended up yielding an improvement of 43.5 percent from the original solution with the base cardinality of 2 using row mode processing.