Even with its warts and limitations, Transact-SQL is a beautiful language, allowing for flexible declarative expression about what you’re asking the database engine to do for you.
Itzik Ben-Gan has shown time after time that you can get the same results in many different ways—look what he demonstrated recently with window ordering. I have also discussed this when dealing with anti-semi-joins, which can be solved using APPLY
or EXCEPT
or NOT EXISTS
or LEFT JOIN
. In some of those cases, different queries led to different performance because query semantics changed (either obviously or, at least, according to the engine). But writing a query differently—even one more verbose—can have any of the following impacts, some subjective, some not:
- More or less intuitive code
- Zero or negligible performance differences
- Meaningful performance differences
- Trade one resource for another (e.g., use more memory but less CPU, or more CPU but less I/O)
And in each of those cases, you can decide which version of the query is most beneficial for you based on your priorities. Maybe you have a memory- or CPU-bound system, or maybe you prefer one syntax over another for subjective reasons, or maybe you think one form will be easier for future maintainers and newcomers.
As an analogy, there are many routes you can take from New York City to Dallas. Some may be faster than others, some may be fewer miles but take longer, some are more fuel-efficient due to average speed limits, some more scenic, and some more toll-friendly. The beauty is that if you and I are independently planning the same trip, we can choose our routes based on our individual priorities. I may not like interstates, or I may prefer to drive more westerly until the sun starts setting, and you may want to see a particular tourist attraction, visit an uncle, or stop in a certain city.
A query is similar. Usually, performance is of utmost importance, but even that isn’t always true. When two or more queries give the same answer and have identical (or “close enough”) performance, the choice can come down to other factors, as mentioned above. I recently answered a question on Stack Overflow where the user was asking how to filter a grouping where an aggregate condition was true.
For some context, Stack Overflow is a place where I tend to cater to people with a wide variety of experience with T-SQL or queries in general. Sometimes solutions are necessarily complex, or they need less-used syntax that is not universally understood, so it can take more explanation and a better breakdown of the code for the user to benefit. One of the ways I like to help with this breakdown is to isolate different aspects of the query in a derived table or, more commonly, a common table expression (CTE) because it can be a lightbulb moment to think about that part independently. With that in mind, let’s look at a boiled-down version of the question:
- Given this table dbo.tablename, I want to return a single row for each name and division combination, but only where there is both a row with source = 'comp' and a row where source = 'manual':
name | division | source |
---|---|---|
host1 | abc | comp |
host2 | xy | manual |
host3 | zyx | comp |
host3 | zyx | manual |
host2 | xy | manual |
I’ve highlighted the only rows they want to consider for aggregation, with the desired output being a single row:
name | division |
---|---|
host3 | zyx |
In T-SQL, setting up this sample would look like this:
CREATE TABLE dbo.tablename
(
name varchar(128),
division varchar(128),
source varchar(128)
);
INSERT dbo.tablename(name, division, source) VALUES
('host1', 'abc', 'comp'),
('host2', 'xy', 'manual'),
('host3', 'zyx', 'comp'),
('host3', 'zyx', 'manual'),
('host2', 'xy', 'manual');
To get the desired result, the first (and later accepted) answer used this syntax, which is perfectly adequate:
SELECT name, division
FROM dbo.tablename
WHERE source in ('comp', 'manual')
GROUP BY name, division
HAVING COUNT(DISTINCT source) > 1; -- or = 2
Since I know users commonly have difficulty with the HAVING
clause, I offered a different approach, one that breaks the logic down, as I mentioned earlier.
Another way to think about it is to calculate the counts inside a CTE and then filter:
;WITH cte AS
(
SELECT name, division, SourceCount = COUNT(DISTINCT source)
FROM dbo.tablename
WHERE source IN ('comp', 'manual')
GROUP BY name, division
)
SELECT name, division FROM cte
WHERE SourceCount = 2;
And yes, my CTEs always start with ;WITH
– see why
Or, if you don’t like CTEs:
SELECT name, division FROM
(
SELECT name, division, SourceCount = COUNT(DISTINCT source)
FROM dbo.tablename
WHERE source IN ('comp', 'manual')
GROUP BY name, division
) AS q WHERE SourceCount = 2;
Yes, it’s more typing, but the intention is to think about the counting and grouping separate from the filtering, like how in an INNER JOIN
you can logically think about the joining conditions (in the ON
clause) separate from the filter conditions (in the WHERE
clause).
As for performance, they all perform the same because SQL Server is smart and can generate the same plan. I inserted 50,000 rows into the table and ran all three queries; each had a duration of 30 – 33ms, a memory grant of 1,584 KB, and an estimated subtree cost of 0.5972860. Here is the plan shape in all three cases:
The plan would look different if the table had a clustered index; or let’s try an index designed to support this query specifically:
CREATE INDEX testing ON dbo.tablename (source) INCLUDE (name, division);
Now the time is down to 26 – 28ms, the memory grant is still 1.5MB, and the estimated subtree cost has dropped by a whopping amount, to 0.5769890. Here is the new plan (again, identical for all three queries):
This is not a complex example but illustrates that we can often find various ways to get to a final destination using the most expressive format we like. Variations in syntax that are identical in results and underlying meaning can help give someone that “lightbulb” moment and provide a more natural tendency to test “identical” variations for cases where the performance might be different.