Aaron Bertrand

T-SQL Can Be Expressive Without Sacrificing Performance

SentryOne Newsletters

The SQLPerformance.com bi-weekly newsletter keeps you up to speed on the most recent blog posts and forum discussions in the SQL Server community.

eNews is a bi-monthly newsletter with fun information about SentryOne, tips to help improve your productivity, and much more.

Subscribe

Featured Author

Paul White is an independent SQL Server consultant specializing in performance tuning, execution plans, and the query optimizer.

Paul’s Posts

Even with its warts and limitations, Transact-SQL is a beautiful language, allowing for flexible declarative expression about what you’re asking the database engine to do for you.

Itzik Ben-Gan has shown time after time that you can get the same results in many different ways—look what he demonstrated recently with window ordering. I have also discussed this when dealing with anti-semi-joins, which can be solved using APPLY or EXCEPT or NOT EXISTS or LEFT JOIN. In some of those cases, different queries led to different performance because query semantics changed (either obviously or, at least, according to the engine). But writing a query differently—even one more verbose—can have any of the following impacts, some subjective, some not:

  • More or less intuitive code
  • Zero or negligible performance differences
  • Meaningful performance differences
  • Trade one resource for another (e.g., use more memory but less CPU, or more CPU but less I/O)

And in each of those cases, you can decide which version of the query is most beneficial for you based on your priorities. Maybe you have a memory- or CPU-bound system, or maybe you prefer one syntax over another for subjective reasons, or maybe you think one form will be easier for future maintainers and newcomers.

As an analogy, there are many routes you can take from New York City to Dallas. Some may be faster than others, some may be fewer miles but take longer, some are more fuel-efficient due to average speed limits, some more scenic, and some more toll-friendly. The beauty is that if you and I are independently planning the same trip, we can choose our routes based on our individual priorities. I may not like interstates, or I may prefer to drive more westerly until the sun starts setting, and you may want to see a particular tourist attraction, visit an uncle, or stop in a certain city.

A query is similar. Usually, performance is of utmost importance, but even that isn’t always true. When two or more queries give the same answer and have identical (or “close enough”) performance, the choice can come down to other factors, as mentioned above. I recently answered a question on Stack Overflow where the user was asking how to filter a grouping where an aggregate condition was true.

For some context, Stack Overflow is a place where I tend to cater to people with a wide variety of experience with T-SQL or queries in general. Sometimes solutions are necessarily complex, or they need less-used syntax that is not universally understood, so it can take more explanation and a better breakdown of the code for the user to benefit. One of the ways I like to help with this breakdown is to isolate different aspects of the query in a derived table or, more commonly, a common table expression (CTE) because it can be a lightbulb moment to think about that part independently. With that in mind, let’s look at a boiled-down version of the question:

    Given this table dbo.tablename, I want to return a single row for each name and division combination, but only where there is both a row with source = 'comp' and a row where source = 'manual':

    name division source
    host1 abc comp
    host2 xy manual
    host3 zyx comp
    host3 zyx manual
    host2 xy manual

    I’ve highlighted the only rows they want to consider for aggregation, with the desired output being a single row:

    name division
    host3 zyx

In T-SQL, setting up this sample would look like this:

  CREATE TABLE dbo.tablename
  (
    name     varchar(128),
    division varchar(128),
    source   varchar(128)
  );

  INSERT dbo.tablename(name, division, source) VALUES
  ('host1',  'abc',  'comp'),
  ('host2',  'xy',   'manual'),
  ('host3',  'zyx',  'comp'),
  ('host3',  'zyx',  'manual'),
  ('host2',  'xy',   'manual');

To get the desired result, the first (and later accepted) answer used this syntax, which is perfectly adequate:

  SELECT name, division  
  FROM dbo.tablename
  WHERE source in ('comp', 'manual')
  GROUP BY name, division
  HAVING COUNT(DISTINCT source) > 1; -- or = 2

Since I know users commonly have difficulty with the HAVING clause, I offered a different approach, one that breaks the logic down, as I mentioned earlier.

Another way to think about it is to calculate the counts inside a CTE and then filter:

  ;WITH cte AS
  (
    SELECT name, division, SourceCount = COUNT(DISTINCT source)
    FROM dbo.tablename
    WHERE source IN ('comp', 'manual')
    GROUP BY name, division
  )
  SELECT name, division FROM cte 
    WHERE SourceCount = 2;

And yes, my CTEs always start with ;WITHsee why

Or, if you don’t like CTEs:

  SELECT name, division FROM
  (
    SELECT name, division, SourceCount = COUNT(DISTINCT source)
    FROM dbo.tablename
    WHERE source IN ('comp', 'manual')
    GROUP BY name, division
  ) AS q WHERE SourceCount = 2;

Yes, it’s more typing, but the intention is to think about the counting and grouping separate from the filtering, like how in an INNER JOIN you can logically think about the joining conditions (in the ON clause) separate from the filter conditions (in the WHERE clause).

As for performance, they all perform the same because SQL Server is smart and can generate the same plan. I inserted 50,000 rows into the table and ran all three queries; each had a duration of 30 ­– 33ms, a memory grant of 1,584 KB, and an estimated subtree cost of 0.5972860. Here is the plan shape in all three cases:

Identical plan for three different queries

The plan would look different if the table had a clustered index; or let’s try an index designed to support this query specifically:

  CREATE INDEX testing ON dbo.tablename (source) INCLUDE (name, division);

Now the time is down to 26 – 28ms, the memory grant is still 1.5MB, and the estimated subtree cost has dropped by a whopping amount, to 0.5769890. Here is the new plan (again, identical for all three queries):

Plan for index-supported version

This is not a complex example but illustrates that we can often find various ways to get to a final destination using the most expressive format we like. Variations in syntax that are identical in results and underlying meaning can help give someone that “lightbulb” moment and provide a more natural tendency to test “identical” variations for cases where the performance might be different.