Aaron Bertrand

Performance Surprises and Assumptions : Arbitrary TOP 1


In a recent thread on StackExchange, a user had the following issue:

I want a query that returns the first person in the table with a GroupID = 2. If nobody with a GroupID = 2 exists, I want the first person with a RoleID = 2.

Let's set aside, for now, the fact that "first" is poorly defined. As it turns out, the user didn't care which person they got, whether it was chosen randomly, arbitrarily, or through some explicit logic in addition to their main criteria. Ignoring that, let's say you have a basic table:

CREATE TABLE dbo.Users
(
  UserID  INT PRIMARY KEY,
  GroupID INT,
  RoleID  INT
);

In the real world there are probably other columns, additional constraints, maybe foreign keys to other tables, and certainly other indexes. But let's keep this simple, and come up with a query.

Likely Solutions

With that table design, solving the problem seems straightforward, right? The first attempt you would probably make is:

SELECT TOP (1) UserID, GroupID, RoleID
  FROM dbo.Users
  WHERE GroupID = 2 OR RoleID = 2
  ORDER BY CASE GroupID WHEN 2 THEN 1 ELSE 2 END;

This uses TOP and a conditional ORDER BY to treat those users with a GroupID = 2 as higher priority. The plan for this query is pretty simple, with most of the cost happening in a sort operation. Here are runtime metrics against an empty table:

[Figure TOP1_A: plan and runtime metrics for the conditional ORDER BY query against an empty table]

This looks to be about as good as you can do: a simple plan that scans the table only once, and other than a pesky sort you should be able to live with, no problem, right?
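
(As an aside: since "first" is arbitrary here, you could make the result deterministic by extending the ORDER BY with a tiebreaker. A minimal sketch, assuming the lowest UserID is an acceptable definition of "first":)

SELECT TOP (1) UserID, GroupID, RoleID
  FROM dbo.Users
  WHERE GroupID = 2 OR RoleID = 2
  ORDER BY CASE GroupID WHEN 2 THEN 1 ELSE 2 END,
           UserID; -- tiebreaker: lowest UserID wins within each priority group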

Well, another answer in the thread offered up this more complex variation:

SELECT TOP (1) UserID, GroupID, RoleID FROM
(
  SELECT TOP (1) UserID, GroupID, RoleID, o = 1
  FROM dbo.Users
  WHERE GroupID = 2

  UNION ALL

  SELECT TOP (1) UserID, GroupID, RoleID, o = 2
  FROM dbo.Users
  WHERE RoleID = 2
) AS x
ORDER BY o;

At first glance, you would probably think that this query is far less efficient, since it requires two clustered index scans. You'd definitely be right about that; here are the plan and runtime metrics against an empty table:

[Figure TOP1_B: plan and runtime metrics for the UNION ALL query against an empty table]
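
Before judging, it may help to spell out what this query is doing: each branch is a cheap TOP (1) probe against the table, and the outer ORDER BY o only ever has to sort at most two rows (one per branch). The same fallback logic could be expressed procedurally; this is just an illustration of the semantics, not something I benchmarked here:

-- Probe for any GroupID = 2 match first.
SELECT TOP (1) UserID, GroupID, RoleID
  FROM dbo.Users
  WHERE GroupID = 2;

-- Only if no such row exists, fall back to a RoleID = 2 match.
IF @@ROWCOUNT = 0
BEGIN
  SELECT TOP (1) UserID, GroupID, RoleID
    FROM dbo.Users
    WHERE RoleID = 2;
END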

But now, let's add data

In order to test these queries, I wanted to use some realistic data, so first I populated the table with 1,000 rows from sys.all_objects, using modulo operations against object_id to get a decent distribution:

INSERT dbo.Users(UserID, GroupID, RoleID)
SELECT TOP (1000) ABS([object_id]), ABS([object_id]) % 7, ABS([object_id]) % 4
FROM sys.all_objects
ORDER BY [object_id]; 

SELECT COUNT(*) FROM dbo.Users WHERE GroupID = 2; -- 126
SELECT COUNT(*) FROM dbo.Users WHERE RoleID = 2;  -- 248
SELECT COUNT(*) FROM dbo.Users WHERE GroupID = 2 AND RoleID = 2; -- 26 overlap

Now when I run the two queries, here are the runtime metrics:

[Figure TOP1_C: runtime metrics for both queries against 1,000 rows]

The UNION ALL version comes in with slightly lower I/O (4 reads vs. 5), lower duration, and lower estimated overall cost, while the conditional ORDER BY version has lower estimated CPU cost. The data set here is far too small to draw any conclusions from; I just wanted it as a stake in the ground. Now, let's change the distribution so that most rows meet at least one of the criteria (and sometimes both):

DROP TABLE dbo.Users;
GO

CREATE TABLE dbo.Users
(
  UserID INT PRIMARY KEY,
  GroupID INT,
  RoleID INT
);
GO

INSERT dbo.Users(UserID, GroupID, RoleID)
SELECT TOP (1000) ABS([object_id]), ABS([object_id]) % 2 + 1, 
  SUBSTRING(RTRIM([object_id]),7,1) % 2 + 1
FROM sys.all_objects
WHERE ABS([object_id]) > 9999999
ORDER BY [object_id]; 

SELECT COUNT(*) FROM dbo.Users WHERE GroupID = 2; -- 500
SELECT COUNT(*) FROM dbo.Users WHERE RoleID = 2;  -- 475
SELECT COUNT(*) FROM dbo.Users WHERE GroupID = 2 AND RoleID = 2; -- 221 overlap

This time, the conditional ORDER BY version has the higher estimated costs in both CPU and I/O:

[Figure TOP1_D: runtime metrics with most rows meeting at least one criterion]

But again, at this data size, the impact on duration and reads is relatively inconsequential, and aside from the estimated costs (which are largely made up anyway), it is hard to declare a winner here.
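
(If you want to capture reads, CPU, and duration yourself, the low-tech way is the standard SET STATISTICS session options; a minimal sketch:)

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run either query; logical reads, CPU time, and elapsed time
-- are reported back to the client (e.g. the Messages tab in SSMS)
SELECT TOP (1) UserID, GroupID, RoleID
  FROM dbo.Users
  WHERE GroupID = 2 OR RoleID = 2
  ORDER BY CASE GroupID WHEN 2 THEN 1 ELSE 2 END;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;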

So, let's add a lot more data

While I rather enjoy building sample data from the catalog views, since everybody has those, this time I'm going to draw on the table Sales.SalesOrderHeaderEnlarged from AdventureWorks2012, expanded using this script from Jonathan Kehayias. On my system, this table has 1,258,600 rows. The following script will insert a million of those rows into our dbo.Users table:

-- DROP and CREATE, as before

INSERT dbo.Users(UserID, GroupID, RoleID)
SELECT TOP (1000000) SalesOrderID, SalesOrderID % 7, SalesOrderID % 4
FROM Sales.SalesOrderHeaderEnlarged;

SELECT COUNT(*) FROM dbo.Users WHERE GroupID = 2; -- 142,857
SELECT COUNT(*) FROM dbo.Users WHERE RoleID = 2;  -- 250,000
SELECT COUNT(*) FROM dbo.Users WHERE GroupID = 2 AND RoleID = 2; -- 35,714 overlap

Okay, now when we run the queries, we see a problem: the ORDER BY variation has gone parallel, and its reads and CPU have ballooned, yielding a nearly 120X difference in duration:

[Figure TOP1_E: runtime metrics against one million rows; the conditional ORDER BY plan goes parallel]

Eliminating parallelism (using MAXDOP) did not help:

[Figure TOP1_F: runtime metrics for the conditional ORDER BY query with parallelism eliminated]

(The UNION ALL plan still looks the same.)
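
(For reference, parallelism was eliminated with a query hint along these lines:)

SELECT TOP (1) UserID, GroupID, RoleID
  FROM dbo.Users
  WHERE GroupID = 2 OR RoleID = 2
  ORDER BY CASE GroupID WHEN 2 THEN 1 ELSE 2 END
OPTION (MAXDOP 1); -- force a serial plan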

And if we even out the skew, so that 95% of the rows meet at least one criterion:

-- DROP and CREATE, as before

INSERT dbo.Users(UserID, GroupID, RoleID)
SELECT TOP (475000) SalesOrderID, 2, SalesOrderID % 7
FROM Sales.SalesOrderHeaderEnlarged
WHERE SalesOrderID % 2 = 1
UNION ALL
SELECT TOP (475000) SalesOrderID, SalesOrderID % 7, 2
FROM Sales.SalesOrderHeaderEnlarged
WHERE SalesOrderID % 2 = 0;

INSERT dbo.Users(UserID, GroupID, RoleID)
SELECT TOP (50000) SalesOrderID, 1, 1
FROM Sales.SalesOrderHeaderEnlarged AS h
WHERE NOT EXISTS (SELECT 1 FROM dbo.Users
  WHERE UserID = h.SalesOrderID);

SELECT COUNT(*) FROM dbo.Users WHERE GroupID = 2; -- 542,851
SELECT COUNT(*) FROM dbo.Users WHERE RoleID = 2;  -- 542,851
SELECT COUNT(*) FROM dbo.Users WHERE GroupID = 2 AND RoleID = 2; -- 135,702 overlap

The queries still show that the sort is prohibitively expensive:

[Figure TOP1_G: runtime metrics with an even 95% skew]

And with MAXDOP = 1 it was much worse (just look at duration):

[Figure TOP1_H: runtime metrics for the even 95% skew with MAXDOP = 1]

Finally, how about 95% skew in either direction (e.g. most rows satisfy the GroupID criterion, or most rows satisfy the RoleID criterion)? This script will ensure at least 95% of the data has GroupID = 2:

-- DROP and CREATE, as before

INSERT dbo.Users(UserID, GroupID, RoleID)
SELECT TOP (950000) SalesOrderID, 2, SalesOrderID % 7
FROM Sales.SalesOrderHeaderEnlarged;

INSERT dbo.Users(UserID, GroupID, RoleID)
SELECT TOP (50000) SalesOrderID, SalesOrderID % 7, 2
FROM Sales.SalesOrderHeaderEnlarged AS h
WHERE NOT EXISTS (SELECT 1 FROM dbo.Users
  WHERE UserID = h.SalesOrderID);

SELECT COUNT(*) FROM dbo.Users WHERE GroupID = 2; -- 957,143
SELECT COUNT(*) FROM dbo.Users WHERE RoleID = 2;  -- 185,714
SELECT COUNT(*) FROM dbo.Users WHERE GroupID = 2 AND RoleID = 2; -- 142,857 overlap

Results are quite similar (I'm just going to stop trying the MAXDOP thing from now on):

[Figure TOP1_I: runtime metrics with 95% of rows having GroupID = 2]

And then if we skew the other way, where at least 95% of the data has RoleID = 2:

-- DROP and CREATE, as before

INSERT dbo.Users(UserID, GroupID, RoleID)
SELECT TOP (950000) SalesOrderID, SalesOrderID % 7, 2
FROM Sales.SalesOrderHeaderEnlarged;

INSERT dbo.Users(UserID, GroupID, RoleID)
SELECT TOP (50000) SalesOrderID, 2, SalesOrderID % 7
FROM Sales.SalesOrderHeaderEnlarged AS h
WHERE NOT EXISTS (SELECT 1 FROM dbo.Users
  WHERE UserID = h.SalesOrderID);

SELECT COUNT(*) FROM dbo.Users WHERE GroupID = 2; -- 185,714
SELECT COUNT(*) FROM dbo.Users WHERE RoleID = 2;  -- 957,143
SELECT COUNT(*) FROM dbo.Users WHERE GroupID = 2 AND RoleID = 2; -- 142,857 overlap

Results:

[Figure TOP1_J: runtime metrics with 95% of rows having RoleID = 2]

Conclusion

In not one single case that I could manufacture did the "simpler" ORDER BY query – even with one less clustered index scan – outperform the more complex UNION ALL query. Sometimes you have to be very wary about what SQL Server has to do when you introduce operations like sorts into your query semantics, and not rely on simplicity of the plan alone (never mind any bias you might have based on previous scenarios).
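
(A follow-up experiment I haven't run here: with narrow nonclustered indexes on the two predicate columns, each TOP (1) branch of the UNION ALL version could potentially be satisfied by an index seek rather than a clustered index scan. A sketch, untested against the data sets above, with index names of my own invention:)

CREATE NONCLUSTERED INDEX IX_Users_GroupID -- hypothetical supporting index
  ON dbo.Users(GroupID) INCLUDE (RoleID);

CREATE NONCLUSTERED INDEX IX_Users_RoleID  -- hypothetical supporting index
  ON dbo.Users(RoleID) INCLUDE (GroupID);

Since UserID is the clustering key, it rides along in both indexes automatically, so each branch of the UNION ALL would be covered.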

Your first instinct might often be correct, but I bet there are times when there is a better option that looks, on the surface, like it couldn't possibly work out better, as in this example. I'm getting quite a bit better at questioning assumptions I've made from observations, and at avoiding blanket statements like "scans never perform well" and "simpler queries always run faster." If you eliminate the words never and always from your vocabulary, you may find yourself putting more of those assumptions and blanket statements to the test, and ending up much better off.