Aaron Bertrand

Grouped Concatenation in SQL Server

August 6, 2014 by in T-SQL Queries | 22 Comments
SentryOne eBooks

In these books, you will find useful, hand-picked articles that will help give insight into some of your most vexing performance problems. These articles were written by several of the SQL Server industry’s leading experts, including Paul White, Paul Randal, Jonathan Kehayias, Erin Stellato, Glenn Berry, Aaron Bertrand, and Joe Sack.

Free Download

Featured Author

Erin Stellato is a Principal Consultant with SQLskills and a Microsoft Data Platform MVP.

Erin’s Posts

Grouped concatenation is a common problem in SQL Server, with no direct and intentional features to support it (like XMLAGG in Oracle, STRING_AGG or ARRAY_TO_STRING(ARRAY_AGG()) in PostgreSQL, and GROUP_CONCAT in MySQL). It has been requested, but no success yet, as evidenced in these Connect items:

** UPDATE January 2017 **: STRING_AGG() will be in SQL Server 2017; read about it here, here, and here.

What is Grouped Concatenation?

For the uninitiated, grouped concatenation is when you want to take multiple rows of data and compress them into a single string (usually with delimiters like commas, tabs, or spaces). Some might call this a "horizontal join." A quick visual example demonstrating how we would compress a list of pets belonging to each family member, from the normalized source to the "flattened" output:

Grouped Concatenation simple example

There have been many ways to solve this problem over the years; here are just a few, based on the following sample data:

    CREATE TABLE dbo.FamilyMemberPets
    (
      Name SYSNAME,
      Pet SYSNAME,
      PRIMARY KEY(Name,Pet)
    );
    
    INSERT dbo.FamilyMemberPets(Name,Pet) VALUES
    (N'Madeline',N'Kirby'),
    (N'Madeline',N'Quigley'),
    (N'Henry',   N'Piglet'),
    (N'Lisa',    N'Snowball'),
    (N'Lisa',    N'Snowball II');

    I am not going to demonstrate an exhaustive list of every grouped concatenation approach ever conceived, as I want to focus on a few aspects of my recommended approach, but I do want to point out a few of the more common ones:

    Scalar UDF
    CREATE FUNCTION dbo.ConcatFunction
    (
      @Name SYSNAME
    )
    RETURNS NVARCHAR(MAX)
    WITH SCHEMABINDING 
    AS 
    BEGIN
      DECLARE @s NVARCHAR(MAX);
    
      SELECT @s = COALESCE(@s + N', ', N'') + Pet
        FROM dbo.FamilyMemberPets
    	WHERE Name = @Name
    	ORDER BY Pet;
    
      RETURN (@s);
    END
    GO
    
    SELECT Name, Pets = dbo.ConcatFunction(Name)
      FROM dbo.FamilyMemberPets
      GROUP BY Name
      ORDER BY Name;

    Note: there is a reason we don't do this:

    SELECT DISTINCT Name, Pets = dbo.ConcatFunction(Name)
      FROM dbo.FamilyMemberPets
      ORDER BY Name;

    With DISTINCT, the function is run for every single row, then duplicates are removed; with GROUP BY, the duplicates are removed first.

    Common Language Runtime (CLR)

    This uses the GROUP_CONCAT_S function found at http://groupconcat.codeplex.com/:

    SELECT Name, Pets = dbo.GROUP_CONCAT_S(Pet, 1)
      FROM dbo.FamilyMemberPets
      GROUP BY Name
      ORDER BY Name;
    Recursive CTE

    There are several variations on this recursion; this one pulls out a set of distinct names as the anchor:

    ;WITH x as 
    (
      SELECT Name, Pet = CONVERT(NVARCHAR(MAX), Pet),
        r1 = ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Pet)
      FROM dbo.FamilyMemberPets
    ),
    a AS 
    (
      SELECT Name, Pet, r1 FROM x WHERE r1 = 1
    ),
    r AS
    (
      SELECT Name, Pet, r1 FROM a WHERE r1 = 1
      UNION ALL
      SELECT x.Name, r.Pet + N', ' + x.Pet, x.r1
        FROM x INNER JOIN r
    	ON r.Name = x.Name
    	AND x.r1 = r.r1 + 1
    )
    SELECT Name, Pets = MAX(Pet)
      FROM r
      GROUP BY Name 
      ORDER BY Name
      OPTION (MAXRECURSION 0);
    Cursor

    Not much to say here; cursors are usually not the optimal approach, but this may be your only choice if you are stuck on SQL Server 2000:

    DECLARE @t TABLE(Name SYSNAME, Pets NVARCHAR(MAX),
      PRIMARY KEY (Name));
    
    INSERT @t(Name, Pets)
      SELECT Name, N'' 
      FROM dbo.FamilyMemberPets GROUP BY Name;
    
    DECLARE @name SYSNAME, @pet SYSNAME, @pets NVARCHAR(MAX);
    
    DECLARE c CURSOR LOCAL FAST_FORWARD
      FOR SELECT Name, Pet 
      FROM dbo.FamilyMemberPets
      ORDER BY Name, Pet;
    
    OPEN c;
    
    FETCH c INTO @name, @pet;
    
    WHILE @@FETCH_STATUS = 0
    BEGIN
      UPDATE @t SET Pets += N', ' + @pet
        WHERE Name = @name;
    
      FETCH c INTO @name, @pet;
    END
    
    CLOSE c; DEALLOCATE c;
    
    SELECT Name, Pets = STUFF(Pets, 1, 1, N'') 
      FROM @t
      ORDER BY Name;
    GO
    Quirky Update

    Some people *love* this approach; I don't comprehend the attraction at all.

    DECLARE @Name SYSNAME, @Pets NVARCHAR(MAX);
    
    DECLARE @t TABLE(Name SYSNAME, Pet SYSNAME, Pets NVARCHAR(MAX),
      PRIMARY KEY (Name, Pet));
    
    INSERT @t(Name, Pet)
      SELECT Name, Pet FROM dbo.FamilyMemberPets
      ORDER BY Name, Pet;
    
    UPDATE @t SET @Pets = Pets = COALESCE(
        CASE COALESCE(@Name, N'') 
          WHEN Name THEN @Pets + N', ' + Pet
          ELSE Pet END, N''), 
    	@Name = Name;
    
    SELECT Name, Pets = MAX(Pets)
      FROM @t
      GROUP BY Name
      ORDER BY Name;
    FOR XML PATH

    Quite easily my preferred method, at least in part because it is the only way to *guarantee* order without using a cursor or CLR. That said, this is a very raw version that fails to address a couple of other inherent problems I will discuss further on:

    SELECT Name, Pets = STUFF((SELECT N', ' + Pet 
      FROM dbo.FamilyMemberPets AS p2
       WHERE p2.name = p.name 
       ORDER BY Pet
       FOR XML PATH(N'')), 1, 2, N'')
    FROM dbo.FamilyMemberPets AS p
    GROUP BY Name
    ORDER BY Name;

I've seen a lot of people mistakenly assume that the new CONCAT() function introduced in SQL Server 2012 was the answer to these feature requests. That function is only meant to operate against columns or variables in a single row; it cannot be used to concatenate values across rows.

More on FOR XML PATH

FOR XML PATH('') on its own is not good enough – it has known problems with XML entitization. For example, if you update one of the pet names to include an HTML bracket or an ampersand:

UPDATE dbo.FamilyMemberPets
  SET Pet = N'Qui>gle&y'
  WHERE Pet = N'Quigley';

These get translated to XML-safe entities somewhere along the way:

Qui>gle&y

So I always use PATH, TYPE).value(), as follows:

SELECT Name, Pets = STUFF((SELECT N', ' + Pet 
  FROM dbo.FamilyMemberPets AS p2
   WHERE p2.name = p.name 
   ORDER BY Pet
   FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
FROM dbo.FamilyMemberPets AS p
GROUP BY Name
ORDER BY Name;

I also always use NVARCHAR, because you never know when some underlying column will contain Unicode (or later be changed to do so).

You may see the following varieties inside .value(), or even others:

... TYPE).value(N'.', ...
... TYPE).value(N'(./text())[1]', ...

These are interchangeable, all ultimately representing the same string; the performance differences between them (more below) were negligible and possibly completely nondeterministic.

Another issue you may come across is certain ASCII characters that are not possible to represent in XML; for example, if the string contains the character 0x001A (CHAR(26)), you will get this error message:

Msg 6841, Level 16, State 1, Line 51
FOR XML could not serialize the data for node 'NoName' because it contains a character (0x001A) which is not allowed in XML. To retrieve this data using FOR XML, convert it to binary, varbinary or image data type and use the BINARY BASE64 directive.

This seems pretty complicated to me, but hopefully you don't have to worry about it because you're not storing data like this or at least you're not trying to use it in grouped concatenation. If you are, you may have to fall back to one of the other approaches.

Performance

The above sample data makes it easy to prove that these methods all do what we expect, but it is hard to compare them meaningfully. So I populated the table with a much larger set:

TRUNCATE TABLE dbo.FamilyMemberPets;

INSERT dbo.FamilyMemberPets(Name,Pet)
  SELECT o.name, c.name
  FROM sys.all_objects AS o
  INNER JOIN sys.all_columns AS c
  ON o.[object_id] = c.[object_id]
  ORDER BY o.name, c.name;

For me, this was 575 objects, with 7,080 total rows; the widest object had 142 columns. Now again, admittedly, I did not set out to compare every single approach conceived in the history of SQL Server; just the few highlights I posted above. Here were the results:

Top 5 Grouped Concatenation Performers

You may notice a couple of contenders missing; the UDF using DISTINCT and the recursive CTE were so off the charts that they would skew the scale. Here are the results of all seven approaches in tabular form:

Approach Duration
(milliseconds)
FOR XML PATH 108.58
CLR 80.67
Quirky Update 278.83
UDF (GROUP BY) 452.67
UDF (DISTINCT) 5,893.67
Cursor 2,210.83
Recursive CTE 70,240.58

Average duration, in milliseconds, for all approaches

Also note that the variations on FOR XML PATH were tested independently but showed very minor differences so I just combined them for the average. If you really want to know, the .[1] notation worked out fastest in my tests; YMMV.

Conclusion

If you are not in a shop where CLR is a roadblock in any way, and especially if you're not just dealing with simple names or other strings, you should definitely consider the CodePlex project. Don't try and re-invent the wheel, don't try unintuitive tricks and hacks to make CROSS APPLY or other constructs work just a little faster than the non-CLR approaches above. Just take what works and plug it in. And heck, since you get the source code too, you can improve upon it or extend it if you like.

If CLR is an issue, then FOR XML PATH is likely your best option, but you'll still need to watch out for tricky characters. If you are stuck on SQL Server 2000, your only feasible option is the UDF (or similar code not wrapped in a UDF).

Next Time

A couple of things I want to explore in a follow-on post: removing duplicates from the list, ordering the list by something other than the value itself, cases where putting any of these approaches into a UDF can be painful, and practical use cases for this functionality.