SQL Server Grouped Concatenation: Ordering & Duplicates

In my last post, I showed some efficient approaches to grouped concatenation. This time around, I wanted to talk about a couple of additional facets of this problem that we can accomplish easily with the FOR XML PATH approach: ordering the list, and removing duplicates.

There are a few ways that I have seen people want the comma-separated list to be ordered. Sometimes they want the item in the list to be ordered alphabetically; I showed that already in my previous post. But sometimes they want it sorted by some other attribute that's actually not being introduced in the output; for example, maybe I want to order the list by most recent item first. Let's take a simple example, where we have an Employees table and a CoffeeOrders table. Let's just populate one person's orders for a few days:

CREATE TABLE dbo.Employees
(
  EmployeeID INT PRIMARY KEY,
  Name NVARCHAR(128)
);

INSERT dbo.Employees(EmployeeID, Name) VALUES(1, N'Jack');

CREATE TABLE dbo.CoffeeOrders
(
  EmployeeID INT NOT NULL REFERENCES dbo.Employees(EmployeeID),
  OrderDate DATE NOT NULL,
  OrderDetails NVARCHAR(64)
);

INSERT dbo.CoffeeOrders(EmployeeID, OrderDate, OrderDetails)
  VALUES(1,'20140801',N'Large double double'),
        (1,'20140802',N'Medium double double'),
        (1,'20140803',N'Large Vanilla Latte'),
        (1,'20140804',N'Medium double double');

If we use the existing approach without specifying an ORDER BY, we get an arbitrary ordering (in this case, it is most likely the case that you will see the rows in the order they were inserted, but don't depend on that with larger data sets, more indexes, etc.):

SELECT e.Name, Orders = STUFF((SELECT N', ' + c.OrderDetails
  FROM dbo.CoffeeOrders AS c
  WHERE c.EmployeeID = e.EmployeeID
  FOR XML PATH, TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
FROM dbo.Employees AS e
GROUP BY e.EmployeeID, e.Name;

Results (remember, you may get *different* results unless you specify an ORDER BY):

Name | Orders
Jack | Large double double, Medium double double, Large Vanilla Latte, Medium double double

If we want to order the list alphabetically, it's simple; we just add ORDER BY c.OrderDetails:

SELECT e.Name, Orders = STUFF((SELECT N', ' + c.OrderDetails
  FROM dbo.CoffeeOrders AS c
  WHERE c.EmployeeID = e.EmployeeID
  ORDER BY c.OrderDetails  -- only change
  FOR XML PATH, TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
FROM dbo.Employees AS e
GROUP BY e.EmployeeID, e.Name;

Results:

Name | Orders
Jack | Large double double, Large Vanilla Latte, Medium double double, Medium double double

We can also order by a column that does not appear in the result set; for example, we can order by most recent coffee order first:

SELECT e.Name, Orders = STUFF((SELECT N', ' + c.OrderDetails
  FROM dbo.CoffeeOrders AS c
  WHERE c.EmployeeID = e.EmployeeID
  ORDER BY c.OrderDate DESC  -- only change
  FOR XML PATH, TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
FROM dbo.Employees AS e
GROUP BY e.EmployeeID, e.Name;

Results:

Name | Orders
Jack | Medium double double, Large Vanilla Latte, Medium double double, Large double double

Another thing we often want to do is remove duplicates; after all, there is little reason to see "Medium double double" twice. We can eliminate that by using GROUP BY:

SELECT e.Name, Orders = STUFF((SELECT N', ' + c.OrderDetails
  FROM dbo.CoffeeOrders AS c
  WHERE c.EmployeeID = e.EmployeeID
  GROUP BY c.OrderDetails  -- removed ORDER BY and added GROUP BY here
  FOR XML PATH, TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
FROM dbo.Employees AS e
GROUP BY e.EmployeeID, e.Name;

Now, this *happens* to order the output alphabetically, but again you can't rely on this:

Name | Orders
Jack | Large double double, Large Vanilla Latte, Medium double double

If you want to guarantee that ordering this way, you can simply add an ORDER BY again:

SELECT e.Name, Orders = STUFF((SELECT N', ' + c.OrderDetails
  FROM dbo.CoffeeOrders AS c
  WHERE c.EmployeeID = e.EmployeeID
  GROUP BY c.OrderDetails
  ORDER BY c.OrderDetails  -- added ORDER BY
  FOR XML PATH, TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
FROM dbo.Employees AS e
GROUP BY e.EmployeeID, e.Name;

Results are the same (but I'll repeat, this is just a coincidence in this case; if you want this order, always say so):

Name | Orders
Jack | Large double double, Large Vanilla Latte, Medium double double

But what if we want to eliminate duplicates *and* sort the list by most recent coffee order first? Your first inclination might be to keep the GROUP BY and just change the ORDER BY, like this:

SELECT e.Name, Orders = STUFF((SELECT N', ' + c.OrderDetails
  FROM dbo.CoffeeOrders AS c
  WHERE c.EmployeeID = e.EmployeeID
  GROUP BY c.OrderDetails
  ORDER BY c.OrderDate DESC  -- changed ORDER BY
  FOR XML PATH, TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
FROM dbo.Employees AS e
GROUP BY e.EmployeeID, e.Name;

That won't work, since the OrderDate is not grouped or aggregated as part of the query:

Msg 8127, Level 16, State 1, Line 64
Column "dbo.CoffeeOrders.OrderDate" is invalid in the ORDER BY clause because it is not contained in either an aggregate function or the GROUP BY clause.

A workaround, which admittedly makes the query a little uglier, is to group the orders separately first, and then only take the rows with the max date for that coffee order per employee:

;WITH grouped AS
(
  SELECT EmployeeID, OrderDetails, OrderDate = MAX(OrderDate)
   FROM dbo.CoffeeOrders
   GROUP BY EmployeeID, OrderDetails
)
SELECT e.Name, Orders = STUFF((SELECT N', ' + g.OrderDetails
  FROM grouped AS g
  WHERE g.EmployeeID = e.EmployeeID
  ORDER BY g.OrderDate DESC
  FOR XML PATH, TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
FROM dbo.Employees AS e
GROUP BY e.EmployeeID, e.Name;

Results:

Name | Orders
Jack | Medium double double, Large Vanilla Latte, Large double double

This accomplishes both of our goals: we've eliminated duplicates, and we've ordered the list by something that's not actually in the list.

Performance

You might be wondering how badly these methods perform against a more robust data set. I'm going to populate our table with 100,000 rows, see how they do without any additional indexes, and then run the same queries again with a little bit of index tuning to support our queries. So first, getting 100,000 rows spread across 1,000 employees:

-- clear out our tiny sample data
DELETE dbo.CoffeeOrders;
DELETE dbo.Employees;

-- create 1000 fake employees
INSERT dbo.Employees(EmployeeID, Name) 
SELECT TOP (1000) 
  EmployeeID = ROW_NUMBER() OVER (ORDER BY t.[object_id]),
  Name = LEFT(t.name + c.name, 128)
FROM sys.all_objects AS t
INNER JOIN sys.all_columns AS c
ON t.[object_id] = c.[object_id];

-- create 100 fake coffee orders for each employee
-- we may get duplicates in here for name
INSERT dbo.CoffeeOrders(EmployeeID, OrderDate, OrderDetails)
SELECT e.EmployeeID, 
  OrderDate = DATEADD(DAY, ROW_NUMBER() OVER 
    (PARTITION BY e.EmployeeID ORDER BY c.[guid]), '20140630'),
  LEFT(c.name, 64)
 FROM dbo.Employees AS e
 CROSS APPLY 
 (
   SELECT TOP (100) name, [guid] = NEWID() 
     FROM sys.all_columns 
     WHERE [object_id] < e.EmployeeID
     ORDER BY NEWID()
 ) AS c;

Now let's just run each of our queries twice, and see what the timing is like on the second try (we'll take a leap of faith here, and assume that - in an ideal world - we'll be working with a primed cache). I ran these in SQL Sentry Plan Explorer, since it's the easiest way I know of to time and compare a bunch of individual queries:

Duration and other runtime metrics for different FOR XML PATH approaches

These timings (duration is in milliseconds) really aren't that bad at all IMHO, when you think about what's actually being done here. The most complicated plan, at least visually, seemed to be the one where we removed duplicates and sorted by most recent order:

Execution plan for grouped and sorted query

But even the most expensive operator here - the XML table-valued function - seems to be all CPU (even though I will freely admit that I'm not sure how much of the actual work is exposed in the query plan details):

Operator properties for the XML table-valued function

"All CPU" is typically okay, since most systems are I/O-bound and/or memory-bound, not CPU-bound. As I say quite often, in most systems I'll trade some of my CPU headroom for memory or disk any day of the week (one of the reasons I like OPTION (RECOMPILE) as a solution to pervasive parameter sniffing issues).

That said, I do strongly encourage you to test these approaches against similar results you can get from the GROUP_CONCAT CLR approach on CodePlex, as well as performing the aggregation and sorting at the presentation tier (particularly if you are keeping the normalized data in some sort of caching layer).

10 thoughts on “Grouped Concatenation : Ordering and Removing Duplicates”

Sviridov Konstantin says:

August 21, 2014 at 8:21 AM

You could use aggregate in order by clause.

SELECT e.Name, Orders = STUFF((SELECT N', ' + c.OrderDetails
FROM dbo.CoffeeOrders AS c
WHERE c.EmployeeID = e.EmployeeID
GROUP BY c.OrderDetails
ORDER BY Max(c.OrderDate) DESC — changed ORDER BY
FOR XML PATH, TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N")
FROM dbo.Employees AS e
GROUP BY e.EmployeeID, e.Name
Aaron Bertrand says:

August 21, 2014 at 1:47 PM

Hi Sviridov, thanks, that is an alternative too; while it gets the same results and generates the same plan, in my testing the duration and CPU were always a bit higher than the CTE variation I proposed. YMMV. :-)
-=Todd Fox=- says:

August 28, 2015 at 4:06 PM

I know I'm late to the party, but is there any significant performance increase using the GROUP BY function in the sub query over simply SELECT DISTINCT N', ' + c.OrderDetails ?
Please forgive any holes in my understanding here.
Aaron Bertrand says:

August 28, 2015 at 4:15 PM

There can be, yes. Primarily because GROUP BY will work first, whereas DISTINCT applies afterward. So depending on the data you could be performing the grouped concatenation many more times, and then only weeding out duplicates after all that work is done. GROUP BY *should* weed the duplicates out first, but in the worst possible case, they'll be the same.
Balaji says:

November 14, 2015 at 4:12 PM

This approach works for Text or single digit number, but it does not work with double digit duplicate numbers. Can you please help?
Aaron Bertrand says:

November 16, 2015 at 1:48 AM

Can you elaborate? Show some sample data and desired results?
Orlando Colamatteo says:

December 17, 2015 at 12:03 PM

All of the scenarios shown can be done using the GROUP_CONCAT SQLCLR objects on CodePlex *except* sorting by something other than the column being concatenated. However, that functionality is in development.

For output sorted by the grouped string you can use GROUP_CONCAT_S:

SELECT e.Name,
Orders = dbo.GROUP_CONCAT_S(c.OrderDetails, 1) — sorted list
FROM dbo.Employees AS e
JOIN dbo.CoffeeOrders AS c ON e.EmployeeID = c.EmployeeID
GROUP BY e.Name
ORDER BY e.Name;

For unique values you can declare DISTINCT on the column being passed into the Aggregate, like this:

SELECT e.Name,
Orders = dbo.GROUP_CONCAT_S(DISTINCT c.OrderDetails, 1) — DISTINCT, sorted list
FROM dbo.Employees AS e
JOIN dbo.CoffeeOrders AS c ON e.EmployeeID = c.EmployeeID
GROUP BY e.Name
ORDER BY e.Name;

Full disclosure, I developed the GROUP_CONCAT objects on CodePlex.
G says:

February 6, 2017 at 10:49 AM

too good
Kevin says:

May 18, 2017 at 2:39 PM

Great article. Easy to follow and very thorough! Thanks this helped me out in my job today.
Pingback: SQL Server 2017 STRING_AGG - SQL Masters Consulting

Comments are closed.