Paul White

Row Goals, Part 3: Anti Joins

Pragmatic Works Software is now part of SentryOne
SentryOne eBooks

In these books, you will find useful, hand-picked articles that will help give insight into some of your most vexing performance problems. These articles were written by several of the SQL Server industry’s leading experts, including Paul White, Paul Randal, Jonathan Kehayias, Erin Stellato, Glenn Berry, Aaron Bertrand, and Joe Sack.

Register to Download

Featured Author

Erin Stellato is a Principal Consultant with SQLskills and a Microsoft Data Platform MVP.

Erin’s Posts

This post is part of a series of articles about row goals. You can find the other parts here:

This part covers when and why the optimizer introduces a row goal for an anti join.

Introduction

An anti join is also known as an anti semi join. It returns each row from join input A for which no match can be found on input B.

For an anti join:

  • The optimizer may add an inner-side row goal to an apply (correlated nested loops join) anti join only.
  • A row goal is not added for non-correlated nested loops anti join, hash anti join, or merge anti join.
  • As always, any row goal is only added if it is lower than the estimate without a row goal applied.
  • Redundant inner-side TOP clauses and DISTINCT/GROUP BY operations can be simplified away.

Expanding on the first bullet above, the main difference between apply semi join and apply anti join row goals is:

  • An apply semi join always features a row goal (so long as it is less than the estimate without the goal).
  • An apply anti join may feature a row goal, but only if a logical anti join is transformed into an apply during cost-based optimization.

I do apologise that these rules aren't simpler, but I didn't make them. Hopefully some discussion and examples will make it all clearer.

No Anti Join Row Goal by Default

The optimizer assumes that people write a semi join (indirectly e.g. using EXISTS) with the expectation that the row being searched for will be found. An apply semi join row goal is set by the optimizer to help find that expected matching row quickly.

For anti join (expressed e.g. using NOT EXISTS) the optimizer's assumption is that a matching row will not be found. An apply anti join row goal is not set by the optimizer, because it expects to have to check all rows to confirm there is no match.

If there does turn out to be a matching row, the apply anti join might take longer to locate this row than it would if a row goal had been used. Nevertheless, the anti join will still terminate its search as soon as the (unexpected) match is encountered.

Apply Anti Join Row Goal Conditions

SQL Server does not provide us with a way to write an anti join directly, so we have to use workarounds like NOT EXISTS, NOT IN/ANY/SOME, or EXCEPT. Each of these forms results in a subquery representation in the parsed tree at the beginning of query compilation. This subquery is always unrolled into an apply, and then transformed into a logical anti join where possible (the details are the same as for semi join discussed in part 2). This all happens before even a trivial plan is considered.

In order for an anti join to get a row goal it must enter cost-based optimization as a logical anti join (meaning the transformation from an apply above must have been successful). Then, the cost-based optimizer must choose to implement the logical anti join as an apply. For this to happen, the optimizer must first choose to explore the apply option; then it must select that as the cheapest option (for that part of the plan).

An anti join row goal is set by any one of the cost-based optimization rules that can transform a join to an apply. An anti join that enters cost-based optimization as an apply (because the transform to logical anti join failed) will not have a row goal applied.

The cost-based optimizer will only explore and select the join-to-apply option if there is a efficient way to locate any matching inner side rows (e.g. using an index). The optimizer will not explore the option if the join inner side subtree lacks anything useful for the apply predicate to "latch on to". This might be an index, temporary index (via an eager index spool), or other logical key. Adding the row goal helps the optimizer assess the cost of the join-to-apply option given that a maximum of one row needs to be located.

Note that an apply anti join may appear in an execution plan without a row goal. This occurs when the initial transform from apply to join fails, as is relatively common. When this happens, the anti join starts life in the cost-based optimizer as an apply, and so never has a row goal added by one of the join-to-apply rules.

Of course a row goal may also be introduced on the inner side of this apply via a different mechanism (not associated with the apply), for example by a separate Top operator.

To summarize:

  • An anti join can only gain a row goal during cost-based optimization (CBO).
  • Rules that translate an anti join to an apply add a row goal.
  • The anti join must enter CBO as a join, not an apply.
  • To enter CBO as a join, earlier phases must be able to rewrite the subquery as a join (via an apply stage).
  • CBO only explores the join to apply transform in promising cases.

Example

It is a bit more tricky to demonstrate all this for apply anti join than was the case for apply semi join. The reasons for this will be covered in part 4.

Meanwhile, here is an AdventureWorks example showing how an apply anti join with row goal arises, using the same undocumented trace flags as for semi join. Trace flag 8608 is added to show the initial memo structure at the start of cost-based optimization.

SELECT P.ProductID 
FROM Production.Product AS P
WHERE 
    NOT EXISTS 
    (
        SELECT 1
        FROM Production.TransactionHistoryArchive AS THA 
        WHERE THA.ProductID = P.ProductID
 
        UNION ALL
 
        SELECT 1
        FROM Production.TransactionHistory AS TH 
        WHERE TH.ProductID = P.ProductID
    )
OPTION (QUERYTRACEON 3604, QUERYTRACEON 8607, QUERYTRACEON 8608, QUERYTRACEON 8612, QUERYTRACEON 8621);

The exists subquery is first transformed to an apply:

Subquery to apply

The apply is then successfully rewritten as a join (in this case):

Apply to Join

The initial memo shows the logical anti join presented to the cost-based optimizer (only the root group shown for brevity):

Initial Memo Root Group

The optimizer output tree is:

Trace flag output for anti join with row goal

Notice the input logical anti join has been implemented as an apply anti join. A row goal of 1 has been set for the Concatenation operator, and each of the two table accesses.

The estimated execution plan shows an apply anti join with one row estimated on the inner side of the anti join for each outer row:

Apply row goal execution plan

For those using SQL Server 2017 CU3 and SSMS 17.5, the Properties window for each of the Concatenation and Index Seek operators above will show an EstimateRowsWithoutRowGoal value corresponding to the cardinality value shown in the trace flag output. This attribute is not visible in tooltips, and is simply omitted for operators without a row goal. I personally think it would be useful to add some subtle shading to regions of a plan affected by a row goal, so we would not need to click on each operator individually.

Observant readers may have noticed that the optimizer output tree showed the Archive table being accessed first, then the History table. The graphical plan has them the other way around. A post-optimization rewrite considers reordering the inputs to the Concatenation operator on the basis of expected cost. In this case, the tables are reordered in the final plan since the History table access is expected to be cheaper (0.0850383 units) than the Archive table access (0.085441 units). This optimization only takes place in the presence of a single-row row goal as I described in UNION ALL Optimization. You also need to be running SQL Server 2008 R2 or earlier, patched versions of SQL Server 2014 and 2016 with trace flag 4199 enabled, or SQL Server 2017.

Summary

An apply semi join always comes with a row goal on its inner side. The optimizer assumes that a matching row exists, so it can be worth setting a row goal to locate that row quickly. The row goal also reflects the fact that a maximum of one row will be encountered per loop iteration.

For apply anti join, a row goal is not set by default. The optimizer's assumption for anti join is that a matching row will not be found, so it would be counter-productive to set a row goal. A part of a plan that needs to consume all rows does not benefit from row goal logic.

Nevertheless, a row goal may still be set for an apply anti join during cost-based optimization. This only happens when the optimizer considers rewriting a logical anti join to an apply. A logical anti join (written using indirect T-SQL syntax) that cannot be rewritten as a regular (non-correlated) anti join before cost-based optimization begins will not have a row goal set.

For the rules that can transform a logical anti join to an apply anti join to match, there must appear to be a reasonable way to locate any matching rows on the inner side on the join. In other words, there must be some potential benefit to pushing the join predicate down.

The final part of this series will take a close look at an anti join execution plan shape that I consider to be an anti-pattern.