Aaron Bertrand

SQL Server v.Next : STRING_AGG() performance

December 28, 2016 by in SQL Performance | 14 Comments
SentryOne Newsletters

The SQLPerformance.com bi-weekly newsletter keeps you up to speed on the most recent blog posts and forum discussions in the SQL Server community.

eNews is a bi-monthly newsletter with fun information about SentryOne, tips to help improve your productivity, and much more.

Subscribe

Featured Author

Itzik is a T-SQL trainer, a co-founder of SolidQ, and blogs about T-SQL fundamentals and query tuning.

Itzik’s Posts

While SQL Server on Linux has stolen almost all of the headlines about v.Next, there are some other interesting advancements coming in the next version of our favorite database platform. On the T-SQL front, we finally have a built-in way to perform grouped string concatenation: STRING_AGG().

Let's say we have the following simple table structure:

CREATE TABLE dbo.Objects
(
  [object_id]   int, 
  [object_name] nvarchar(261),
  CONSTRAINT PK_Objects PRIMARY KEY([object_id])
);

CREATE TABLE dbo.Columns
(
  [object_id] int NOT NULL
    FOREIGN KEY REFERENCES dbo.Objects([object_id]), 
  column_name sysname,
  CONSTRAINT PK_Columns PRIMARY KEY ([object_id],column_name)
);

For performance tests, we're going to populate this using sys.all_objects and sys.all_columns. But for a simple demonstration first, let's add the following rows:

INSERT dbo.Objects([object_id],[object_name])
  VALUES(1,N'Employees'),(2,N'Orders');

INSERT dbo.Columns([object_id],column_name)
  VALUES(1,N'EmployeeID'),(1,N'CurrentStatus'),
        (2,N'OrderID'),(2,N'OrderDate'),(2,N'CustomerID');

If the forums are any indication, it is a very common requirement to return a row for each object, along with a comma-separated list of column names. (Extrapolate that to whatever entity types you model this way – product names associated with an order, part names involved in the assembly of a product, subordinates reporting to a manager, etc.) So, for example, with the above data we'd want output like this:

object       columns
---------    ----------------------------
Employees    EmployeeID,CurrentStatus
Orders       OrderID,OrderDate,CustomerID

The way we would accomplish this in current versions of SQL Server is probably to use FOR XML PATH, as I demonstrated to be the most efficient outside of CLR in this earlier post. In this example, it would look like this:

SELECT [object]  = o.[object_name],
       [columns] = STUFF(
                    (SELECT N',' + c.column_name
                       FROM dbo.Columns AS c
                       WHERE c.[object_id] = o.[object_id]
                       FOR XML PATH, TYPE
                    ).value(N'.[1]',N'nvarchar(max)'),1,1,N'')
FROM dbo.Objects AS o;

Predictably, we get the same output demonstrated above. In SQL Server v.Next, we will be able to express this more simply:

SELECT [object]  = o.[object_name],
       [columns] = STRING_AGG(c.column_name, N',')
FROM dbo.Objects AS o
INNER JOIN dbo.Columns AS c
ON o.[object_id] = c.[object_id]
GROUP BY o.[object_name];

Again, this produces the exact same output. And we were able to do this with a native function, avoiding both the expensive FOR XML PATH scaffolding, and the STUFF() function used to remove the first comma (this happens automatically).

What About Order?

One of the problems with many of the kludge solutions to grouped concatenation is that the ordering of the comma-separated list should be considered arbitrary and non-deterministic.

For the XML PATH solution, I demonstrated in another earlier post that adding an ORDER BY is trivial and guaranteed. So in this example, we could order the column list by column name alphabetically instead of leaving it to SQL Server to sort (or not):

SELECT [object]  = [object_name],
       [columns] = STUFF(
                    (SELECT N',' +c.column_name
                       FROM dbo.Columns AS c
                       WHERE c.[object_id] = o.[object_id]
                       ORDER BY c.column_name -- only change
                       FOR XML PATH, TYPE
                    ).value(N'.[1]',N'nvarchar(max)'),1,1,N'')
FROM dbo.Objects AS o;

Output:

object       columns
---------    ----------------------------
Employees    CurrentStatus,EmployeeID
Order        CustomerID,OrderDate,OrderID

CTP 1.1 adds WITHIN GROUP to STRING_AGG(), so using the new approach, we can say:

SELECT [object]  = o.[object_name],
       [columns] = STRING_AGG(c.column_name, N',')
                   WITHIN GROUP (ORDER BY c.column_name) -- only change
FROM dbo.Objects AS o
INNER JOIN dbo.Columns AS c
ON o.[object_id] = c.[object_id]
GROUP BY o.[object_name];

Now we get the same results. Note that, just like a normal ORDER BY clause, you can add multiple ordering columns or expressions inside WITHIN GROUP ().

All Right, Performance Already!

Using quad-core 2.6 GHz processors, 8 GB of memory, and SQL Server CTP1.1 (14.0.100.187), I created a new database, re-created these tables, and added rows from sys.all_objects and sys.all_columns. I made sure to only include objects that had at least one column:

INSERT dbo.Objects([object_id], [object_name]) -- 656 rows
  SELECT [object_id], QUOTENAME(s.name) + N'.' + QUOTENAME(o.name) 
    FROM sys.all_objects AS o
    INNER JOIN sys.schemas AS s 
    ON o.[schema_id] = s.[schema_id]
    WHERE EXISTS
    (
      SELECT 1 FROM sys.all_columns 
      WHERE [object_id] = o.[object_id]
    );

INSERT dbo.Columns([object_id], column_name) -- 8,085 rows 
  SELECT [object_id], name 
    FROM sys.all_columns AS c  
    WHERE EXISTS
    (
      SELECT 1 FROM dbo.Objects 
      WHERE [object_id] = c.[object_id]
    );

On my system, this yielded 656 objects and 8,085 columns (your system may yield slightly different numbers).

The Plans

First, let's compare the plans and Table I/O tabs for our two unordered queries, using Plan Explorer. Here are the overall runtime metrics:

Runtime metrics for XML PATH (top) and STRING_AGG() (bottom)Runtime metrics for XML PATH (top) and STRING_AGG() (bottom)

The graphical plan and Table I/O from the FOR XML PATH query:

Plan for XML PATH, no order
Table I/O for XML PATH, no orderPlan and Table I/O for XML PATH, no order

And from the STRING_AGG version:

Plan for STRING_AGG, no ordering
Table I/O for STRING_AGG, no orderingPlan and Table I/O for STRING_AGG, no ordering

For the latter, the clustered index seek seems a little troubling to me. This seemed like a good case for testing out the seldom-used FORCESCAN hint (and no, this would certainly not help out the FOR XML PATH query):

SELECT [object]  = o.[object_name],
       [columns] = STRING_AGG(c.column_name, N',')
FROM dbo.Objects AS o
INNER JOIN dbo.Columns AS c WITH (FORCESCAN) -- added hint
ON o.[object_id] = c.[object_id]
GROUP BY o.[object_name];

Now the plan and Table I/O tab look a lot better, at least on first glance:

Plan for STRING_AGG(), no ordering, with FORCESCAN
Table I/O for STRING_AGG(), no ordering, with FORCESCANPlan and Table I/O for STRING_AGG(), no ordering, with FORCESCAN

The ordered versions of the queries generate roughly the same plans. For the FOR XML PATH version, a sort is added:

Added sort in FOR XML PATH versionAdded sort in FOR XML PATH version

For STRING_AGG(), a scan is chosen in this case, even without the FORCESCAN hint, and no additional sort operation is required – so the plan looks identical to the FORCESCAN version.

At Scale

Looking at a plan and one-off runtime metrics might give us some idea about whether STRING_AGG() performs better than the existing FOR XML PATH solution, but a larger test might make more sense. What happens when we perform the grouped concatenation 5,000 times?

SELECT SYSDATETIME();
GO

DECLARE @x nvarchar(max);
SELECT @x = STRING_AGG(c.column_name, N',')
  FROM dbo.Objects AS o
  INNER JOIN dbo.Columns AS c
  ON o.[object_id] = c.[object_id]
  GROUP BY o.[object_name];
GO 5000
SELECT [string_agg, unordered] = SYSDATETIME();
GO

DECLARE @x nvarchar(max);
SELECT @x = STRING_AGG(c.column_name, N',')
  FROM dbo.Objects AS o
  INNER JOIN dbo.Columns AS c WITH (FORCESCAN)
  ON o.[object_id] = c.[object_id]
  GROUP BY o.[object_name];
GO 5000
SELECT [string_agg, unordered, forcescan] = SYSDATETIME();

GO
DECLARE @x nvarchar(max);
SELECT @x = STUFF((SELECT N',' +c.column_name
  FROM dbo.Columns AS c
  WHERE c.[object_id] = o.[object_id]
  FOR XML PATH, TYPE).value(N'.[1]',N'nvarchar(max)'),1,1,N'')
FROM dbo.Objects AS o;
GO 5000
SELECT [for xml path, unordered] = SYSDATETIME();

GO
DECLARE @x nvarchar(max);
SELECT @x = STRING_AGG(c.column_name, N',')
  WITHIN GROUP (ORDER BY c.column_name)
  FROM dbo.Objects AS o
  INNER JOIN dbo.Columns AS c
  ON o.[object_id] = c.[object_id]
  GROUP BY o.[object_name];
GO 5000
SELECT [string_agg, ordered] = SYSDATETIME();

GO
DECLARE @x nvarchar(max);
SELECT @x = STUFF((SELECT N',' +c.column_name
  FROM dbo.Columns AS c
  WHERE c.[object_id] = o.[object_id]
  ORDER BY c.column_name
  FOR XML PATH, TYPE).value(N'.[1]',N'nvarchar(max)'),1,1,N'')
FROM dbo.Objects AS o
ORDER BY o.[object_name];
GO 5000
SELECT [for xml path, ordered] = SYSDATETIME();

After running this script five times, I averaged the duration numbers and here are the results:

Duration (milliseconds) for various grouped concatenation approachesDuration (milliseconds) for various grouped concatenation approaches

We can see that our FORCESCAN hint really did make things worse – while we shifted the cost away from the clustered index seek, the sort was actually much worse, even though the estimated costs deemed them relatively equivalent. More importantly, we can see that STRING_AGG() does offer a performance benefit, whether or not the concatenated strings need to be ordered in a specific way. As with STRING_SPLIT(), which I looked at back in March, I am quite impressed that this function scales well prior to "v1."

I have further tests planned, perhaps for a future post:

  • When all the data comes from a single table, with and without an index that supports ordering
  • Similar performance tests on Linux

In the meantime, if you have specific use cases for grouped concatenation, please share them below (or e-mail me at abertrand@sentryone.com). I'm always open to making sure my tests are as real-world as possible.