Grouped concatenation is a common problem in SQL Server, with no direct and intentional features to support it (like XMLAGG in Oracle, STRING_AGG or ARRAY_TO_STRING(ARRAY_AGG()) in PostgreSQL, and GROUP_CONCAT in MySQL). It has been requested, but no success yet, as evidenced in these Connect items:
- Connect #247118 : SQL needs version of MySQL group_Concat function (Postponed)
- Connect #728969 : Ordered Set Functions – WITHIN GROUP Clause (Closed as Won't Fix)
What is Grouped Concatenation?
For the uninitiated, grouped concatenation is when you want to take multiple rows of data and compress them into a single string (usually with delimiters like commas, tabs, or spaces). Some might call this a "horizontal join." A quick visual example demonstrating how we would compress a list of pets belonging to each family member, from the normalized source to the "flattened" output:
There have been many ways to solve this problem over the years; here are just a few, based on the following sample data:
CREATE TABLE dbo.FamilyMemberPets ( Name SYSNAME, Pet SYSNAME, PRIMARY KEY(Name,Pet) ); INSERT dbo.FamilyMemberPets(Name,Pet) VALUES (N'Madeline',N'Kirby'), (N'Madeline',N'Quigley'), (N'Henry', N'Piglet'), (N'Lisa', N'Snowball'), (N'Lisa', N'Snowball II');
I am not going to demonstrate an exhaustive list of every grouped concatenation approach ever conceived, as I want to focus on a few aspects of my recommended approach, but I do want to point out a few of the more common ones:
CREATE FUNCTION dbo.ConcatFunction ( @Name SYSNAME ) RETURNS NVARCHAR(MAX) WITH SCHEMABINDING AS BEGIN DECLARE @s NVARCHAR(MAX); SELECT @s = COALESCE(@s + N', ', N'') + Pet FROM dbo.FamilyMemberPets WHERE Name = @Name ORDER BY Pet; RETURN (@s); END GO SELECT Name, Pets = dbo.ConcatFunction(Name) FROM dbo.FamilyMemberPets GROUP BY Name ORDER BY Name;
Note: there is a reason we don't do this:
SELECT DISTINCT Name, Pets = dbo.ConcatFunction(Name) FROM dbo.FamilyMemberPets ORDER BY Name;
DISTINCT, the function is run for every single row, then duplicates are removed; with
GROUP BY, the duplicates are removed first.
Common Language Runtime (CLR)
This uses the
GROUP_CONCAT_S function found at http://groupconcat.codeplex.com/:
SELECT Name, Pets = dbo.GROUP_CONCAT_S(Pet, 1) FROM dbo.FamilyMemberPets GROUP BY Name ORDER BY Name;
There are several variations on this recursion; this one pulls out a set of distinct names as the anchor:
;WITH x as ( SELECT Name, Pet = CONVERT(NVARCHAR(MAX), Pet), r1 = ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Pet) FROM dbo.FamilyMemberPets ), a AS ( SELECT Name, Pet, r1 FROM x WHERE r1 = 1 ), r AS ( SELECT Name, Pet, r1 FROM a WHERE r1 = 1 UNION ALL SELECT x.Name, r.Pet + N', ' + x.Pet, x.r1 FROM x INNER JOIN r ON r.Name = x.Name AND x.r1 = r.r1 + 1 ) SELECT Name, Pets = MAX(Pet) FROM r GROUP BY Name ORDER BY Name OPTION (MAXRECURSION 0);
Not much to say here; cursors are usually not the optimal approach, but this may be your only choice if you are stuck on SQL Server 2000:
DECLARE @t TABLE(Name SYSNAME, Pets NVARCHAR(MAX), PRIMARY KEY (Name)); INSERT @t(Name, Pets) SELECT Name, N'' FROM dbo.FamilyMemberPets GROUP BY Name; DECLARE @name SYSNAME, @pet SYSNAME, @pets NVARCHAR(MAX); DECLARE c CURSOR LOCAL FAST_FORWARD FOR SELECT Name, Pet FROM dbo.FamilyMemberPets ORDER BY Name, Pet; OPEN c; FETCH c INTO @name, @pet; WHILE @@FETCH_STATUS = 0 BEGIN UPDATE @t SET Pets += N', ' + @pet WHERE Name = @name; FETCH c INTO @name, @pet; END CLOSE c; DEALLOCATE c; SELECT Name, Pets = STUFF(Pets, 1, 1, N'') FROM @t ORDER BY Name; GO
Some people *love* this approach; I don't comprehend the attraction at all.
DECLARE @Name SYSNAME, @Pets NVARCHAR(MAX); DECLARE @t TABLE(Name SYSNAME, Pet SYSNAME, Pets NVARCHAR(MAX), PRIMARY KEY (Name, Pet)); INSERT @t(Name, Pet) SELECT Name, Pet FROM dbo.FamilyMemberPets ORDER BY Name, Pet; UPDATE @t SET @Pets = Pets = COALESCE( CASE COALESCE(@Name, N'') WHEN Name THEN @Pets + N', ' + Pet ELSE Pet END, N''), @Name = Name; SELECT Name, Pets = MAX(Pets) FROM @t GROUP BY Name ORDER BY Name;
FOR XML PATH
Quite easily my preferred method, at least in part because it is the only way to *guarantee* order without using a cursor or CLR. That said, this is a very raw version that fails to address a couple of other inherent problems I will discuss further on:
SELECT Name, Pets = STUFF((SELECT N', ' + Pet FROM dbo.FamilyMemberPets AS p2 WHERE p2.name = p.name ORDER BY Pet FOR XML PATH(N'')), 1, 2, N'') FROM dbo.FamilyMemberPets AS p GROUP BY Name ORDER BY Name;
I've seen a lot of people mistakenly assume that the new
CONCAT() function introduced in SQL Server 2012 was the answer to these feature requests. That function is only meant to operate against columns or variables in a single row; it cannot be used to concatenate values across rows.
More on FOR XML PATH
FOR XML PATH('') on its own is not good enough – it has known problems with XML entitization. For example, if you update one of the pet names to include an HTML bracket or an ampersand:
UPDATE dbo.FamilyMemberPets SET Pet = N'Qui>gle&y' WHERE Pet = N'Quigley';
These get translated to XML-safe entities somewhere along the way:
So I always use
PATH, TYPE).value(), as follows:
SELECT Name, Pets = STUFF((SELECT N', ' + Pet FROM dbo.FamilyMemberPets AS p2 WHERE p2.name = p.name ORDER BY Pet FOR XML PATH(N''), TYPE).value(N'.', N'nvarchar(max)'), 1, 2, N'') FROM dbo.FamilyMemberPets AS p GROUP BY Name ORDER BY Name;
I also always use
NVARCHAR, because you never know when some underlying column will contain Unicode (or later be changed to do so).
You may see the following varieties inside
.value(), or even others:
... TYPE).value(N'.', ... ... TYPE).value(N'(./text())', ...
These are interchangeable, all ultimately representing the same string; the performance differences between them (more below) were negligible and possibly completely nondeterministic.
Another issue you may come across is certain ASCII characters that are not possible to represent in XML; for example, if the string contains the character
CHAR(26)), you will get this error message:
FOR XML could not serialize the data for node 'NoName' because it contains a character (0x001A) which is not allowed in XML. To retrieve this data using FOR XML, convert it to binary, varbinary or image data type and use the BINARY BASE64 directive.
This seems pretty complicated to me, but hopefully you don't have to worry about it because you're not storing data like this or at least you're not trying to use it in grouped concatenation. If you are, you may have to fall back to one of the other approaches.
The above sample data makes it easy to prove that these methods all do what we expect, but it is hard to compare them meaningfully. So I populated the table with a much larger set:
TRUNCATE TABLE dbo.FamilyMemberPets; INSERT dbo.FamilyMemberPets(Name,Pet) SELECT o.name, c.name FROM sys.all_objects AS o INNER JOIN sys.all_columns AS c ON o.[object_id] = c.[object_id] ORDER BY o.name, c.name;
For me, this was 575 objects, with 7,080 total rows; the widest object had 142 columns. Now again, admittedly, I did not set out to compare every single approach conceived in the history of SQL Server; just the few highlights I posted above. Here were the results:
You may notice a couple of contenders missing; the UDF using
DISTINCT and the recursive CTE were so off the charts that they would skew the scale. Here are the results of all seven approaches in tabular form:
|FOR XML PATH||108.58|
|UDF (GROUP BY)||452.67|
Average duration, in milliseconds, for all approaches
Also note that the variations on
FOR XML PATH were tested independently but showed very minor differences so I just combined them for the average. If you really want to know, the
. notation worked out fastest in my tests; YMMV.
If you are not in a shop where CLR is a roadblock in any way, and especially if you're not just dealing with simple names or other strings, you should definitely consider the CodePlex project. Don't try and re-invent the wheel, don't try unintuitive tricks and hacks to make
CROSS APPLY or other constructs work just a little faster than the non-CLR approaches above. Just take what works and plug it in. And heck, since you get the source code too, you can improve upon it or extend it if you like.
If CLR is an issue, then
FOR XML PATH is likely your best option, but you'll still need to watch out for tricky characters. If you are stuck on SQL Server 2000, your only feasible option is the UDF (or similar code not wrapped in a UDF).
A couple of things I want to explore in a follow-on post: removing duplicates from the list, ordering the list by something other than the value itself, cases where putting any of these approaches into a UDF can be painful, and practical use cases for this functionality.