Aaron Bertrand

Dirty Secrets of the CASE Expression

June 12, 2014 by in T-SQL Queries | 38 Comments
SentryOne Newsletters

The SQLPerformance.com bi-weekly newsletter keeps you up to speed on the most recent blog posts and forum discussions in the SQL Server community.

eNews is a bi-monthly newsletter with fun information about SentryOne, tips to help improve your productivity, and much more.

Subscribe

Featured Author

Paul White is an independent SQL Server consultant specializing in performance tuning, execution plans, and the query optimizer.

Paul’s Posts

The CASE expression is one of my favorite constructs in T-SQL. It is quite flexible, and is sometimes the only way to control the order in which SQL Server will evaluate predicates.
However, it is often misunderstood.

What is the T-SQL CASE Expression?

In T-SQL, CASE is an expression that evaluates one or more possible expressions and returns the first appropriate expression. The term expression may be a bit overloaded here, but basically it is anything that can be evaluated as a single, scalar value, such as a variable, a column, a string literal, or even the output of a built-in or scalar function.

There are two forms of CASE in T-SQL:

  • Simple CASE expression – when you only need to evaluate equality:

    CASE <input> WHEN <eval> THEN <return> … [ELSE <return>] END

  • Searched CASE expression – when you need to evaluate more complex expressions, such as inequality, LIKE, or IS NOT NULL:

    CASE WHEN <input_bool> THEN <return> … [ELSE <return>] END

The return expression is always a single value, and the output data type is determined by data type precedence.

As I said, the CASE expression is often misunderstood; here are some examples:

CASE is an expression, not a statement

Likely not important to most people, and perhaps this is just my pedantic side, but a lot of people call it a CASE statement – including Microsoft, whose documentation uses statement and expression interchangeably at times. I find this mildly annoying (like row/record and column/field) and, while it's mostly semantics, but there is an important distinction between an expression and a statement: an expression returns a result. When people think of CASE as a statement, it leads to experiments in code shortening like this:

SELECT CASE [status]
    WHEN 'A' THEN
        StatusLabel      = 'Authorized',
        LastEvent        = AuthorizedTime
    WHEN 'C' THEN
        StatusLabel      = 'Completed',
        LastEvent        = CompletedTime
    END
FROM dbo.some_table;

Or this:

SELECT CASE WHEN @foo = 1 THEN
    (SELECT foo, bar FROM dbo.fizzbuzz)
ELSE
    (SELECT blat, mort FROM dbo.splunge)
END;

This type of control-of-flow logic may be possible with CASE statements in other languages (like VBScript), but not in Transact-SQL's CASE expression. To use CASE within the same query logic, you would have to use a CASE expression for each output column:

SELECT 
  StatusLabel = CASE [status]
      WHEN 'A' THEN 'Authorized' 
      WHEN 'C' THEN 'Completed' END,
  LastEvent = CASE [status]
      WHEN 'A' THEN AuthorizedTime
      WHEN 'C' THEN CompletedTime END
FROM dbo.some_table;

CASE will not always short circuit

The official documentation once implied that the entire expression will short-circuit, meaning it will evaluate the expression from left-to-right, and stop evaluating when it hits a match:

The CASE statement [sic!] evaluates its conditions sequentially and stops with the first condition whose condition is satisfied.

However, this isn't always true. And to its credit, in a more current version, the page went on to try to explain one scenario where this isn't guaranteed. But it only gets part of the story:

In some situations, an expression is evaluated before a CASE statement [sic!] receives the results of the expression as its input. Errors in evaluating these expressions are possible. Aggregate expressions that appear in WHEN arguments to a CASE statement [sic!] are evaluated first, then provided to the CASE statement [sic!]. For example, the following query produces a divide by zero error when producing the value of the MAX aggregate. This occurs prior to evaluating the CASE expression.

The divide by zero example is pretty easy to reproduce, and I demonstrated it in this answer on dba.stackexchange.com:

DECLARE @i INT = 1;
SELECT CASE WHEN @i = 1 THEN 1 ELSE MIN(1/0) END;

Result:

Msg 8134, Level 16, State 1
Divide by zero error encountered.

There are trivial workarounds (such as ELSE (SELECT MIN(1/0)) END), but this comes as a real surprise to many who haven't memorized the above sentences from Books Online. I was first made aware of this specific scenario in a conversation on a private e-mail distribution list by Itzik Ben-Gan (@ItzikBenGan), who in turn was initially notified by Jaime Lafargue. I reported the bug in Connect #690017 : CASE / COALESCE won't always evaluate in textual order; it was swiftly closed as "By Design." Paul White (blog | @SQL_Kiwi) subsequently filed Connect #691535 : Aggregates Don't Follow the Semantics Of CASE, and it was closed as "Fixed." The fix, in this case, was clarification in the Books Online article; namely, the snippet I copied above.

This behavior can yield itself in some other, less obvious scenarios, too. For example, Connect #780132 : FREETEXT() does not honor order of evaluation in CASE statements (no aggregates involved) shows that, well, CASE evaluation order is not guaranteed to be left-to-right when using certain full-text functions either. On that item, Paul White commented that he also observed something similar using the new LAG() function introduced in SQL Server 2012. I don't have a repro handy, but I do believe him, and I don't think we've unearthed all of the edge cases where this may occur.

So, when aggregates or non-native services like Full-Text Search are involved, please do not make any assumptions about short circuiting in a CASE expression.

RAND() can be evaluated more than once

I often see people writing a simple CASE expression, like this:

SELECT CASE @variable 
  WHEN 1 THEN 'foo'
  WHEN 2 THEN 'bar'
END

It is important to understand that this will be executed as a searched CASE expression, like this:

SELECT CASE  
  WHEN @variable = 1 THEN 'foo'
  WHEN @variable = 2 THEN 'bar'
END

The reason it is important to understand that the expression being evaluated will be evaluated multiple times, is because it can actually be evaluated multiple times. When this is a variable, or a constant, or a column reference, this is unlikely to be a real problem; however, things can change quickly when it's a non-deterministic function. Consider that this expression yields a SMALLINT between 1 and 3; go ahead and run it many times, and you will always get one of those three values:

SELECT CONVERT(SMALLINT, 1+RAND()*3);

Now, put this into a simple CASE expression, and run it a dozen times – eventually you will get a result of NULL:

SELECT [result] = CASE CONVERT(SMALLINT, 1+RAND()*3)
  WHEN 1 THEN 'one'
  WHEN 2 THEN 'two'
  WHEN 3 THEN 'three'
END;

How does this happen? Well, the entire CASE expression is expanded to a searched expression, as follows:

SELECT [result] = CASE 
  WHEN CONVERT(SMALLINT, 1+RAND()*3) = 1 THEN 'one'
  WHEN CONVERT(SMALLINT, 1+RAND()*3) = 2 THEN 'two'
  WHEN CONVERT(SMALLINT, 1+RAND()*3) = 3 THEN 'three'
  ELSE NULL -- this is always implicitly there
END;

In turn, what happens is that each WHEN clause evaluates and invokes RAND() independently – and in each case it could yield a different value. Let's say we enter the expression, and we check the first WHEN clause, and the result is 3; we skip that clause and move on. It is conceivable that the next two clauses will both return 1 when RAND() is evaluated again – in which case none of the conditions are evaluated to true, so the ELSE takes over.

Other expressions can be evaluated more than once

This problem is not limited to the RAND() function. Imagine the same style of non-determinism coming from these moving targets:

SELECT 
  [crypt_gen]   = 1+ABS(CRYPT_GEN_RANDOM(10) % 20),
  [newid]       = LEFT(NEWID(),2),
  [checksum]    = ABS(CHECKSUM(NEWID())%3);

These expressions can obviously yield a different value if evaluated multiple times. And with a searched CASE expression, there will be times when every re-evaluation happens to fall out of the search specific to the current WHEN, and ultimately hit the ELSE clause. To protect yourself from this, one option is to always hard-code your own explicit ELSE; just be careful about the fallback value you choose to return, because this will have some skew effect if you are looking for even distribution. Another option is to just change the last WHEN clause to ELSE, but this will still lead to uneven distribution. The preferred option, in my opinion, is to try and coerce SQL Server to evaluate the condition once (though this isn't always possible within a single query). For example, compare these two results:

-- Query A: expression referenced directly in CASE; no ELSE:
SELECT x, COUNT(*) FROM
(
  SELECT x = CASE ABS(CHECKSUM(NEWID())%3) 
  WHEN 0 THEN '0' WHEN 1 THEN '1' WHEN 2 THEN '2' END 
  FROM sys.all_columns
) AS y GROUP BY x;

-- Query B: additional ELSE clause:
SELECT x, COUNT(*) FROM
(
  SELECT x = CASE ABS(CHECKSUM(NEWID())%3) 
  WHEN 0 THEN '0' WHEN 1 THEN '1' WHEN 2 THEN '2' ELSE '2' END 
  FROM sys.all_columns
) AS y GROUP BY x;

-- Query C: Final WHEN converted to ELSE:
SELECT x, COUNT(*) FROM
(
  SELECT x = CASE ABS(CHECKSUM(NEWID())%3) 
  WHEN 0 THEN '0' WHEN 1 THEN '1' ELSE '2' END 
  FROM sys.all_columns
) AS y GROUP BY x;

-- Query D: Push evaluation of NEWID() to subquery:
SELECT x, COUNT(*) FROM
(
  SELECT x = CASE x WHEN 0 THEN '0' WHEN 1 THEN '1' WHEN 2 THEN '2' END 
  FROM 
  (
    SELECT x = ABS(CHECKSUM(NEWID())%3) FROM sys.all_columns
  ) AS x
) AS y GROUP BY x;

Distribution:

Value Query A Query B Query C Query D
NULL 2,572
0 2,923 2,900 2,928 2,949
1 1,946 1,959 1,927 2,896
2 1,295 3,877 3,881 2,891

Distribution of values with different query techniques

In this case, I am relying on the fact that SQL Server chose to evaluate the expression in the subquery and not introduce it to the searched CASE expression, but this is merely to demonstrate that distribution can be coerced to be more even. In reality, this may not always be the choice the optimizer makes, so please don't learn from this little trick. :-)

CHOOSE() is also affected

You will observe that if you replace the CHECKSUM(NEWID()) expression with the RAND() expression, you'll get entirely different results; most notably, the latter will only ever return one value. This is because RAND(), like GETDATE() and some other built-in functions, is given special treatment as a runtime constant, and only evaluated once per reference for the entire row. Note that it can still return NULL just like the first query in the preceding code sample.

This problem is also not limited to the CASE expression; you can see similar behavior with other built-in functions that use the same underlying semantics. For example, CHOOSE is merely syntactic sugar for a more elaborate searched CASE expression, and this will also yield NULL occasionally:

SELECT [choose] = CHOOSE(CONVERT(SMALLINT, 1+RAND()*3),'one','two','three');

IIF() is a function that I expected to fall into this same trap, but this function is really just a searched CASE expression with only two possible outcomes, and no ELSE – so it is tough, without nesting and introducing other functions, to envision a scenario where this can break unexpectedly. While in the simple case it is decent shorthand for CASE, it is also tough to do anything useful with it if you need more than two possible outcomes. :-)

COALESCE() is also affected

Finally, we should examine that COALESCE can have similar issues. Let's consider that these expressions are equivalent:

SELECT COALESCE(@variable, 'constant');

SELECT CASE WHEN @variable IS NOT NULL THEN @variable ELSE 'constant' END);

In this case, @variable would be evaluated twice (as would any function or subquery, as described in this Connect item).

I was really able to get some puzzled looks when I brought the following example up in a recent forum discussion. Let's say I want to populate a table with a distribution of values from 1-5, but whenever a 3 is encountered, I want to use -1 instead. Not a very real-world scenario, but easy to construct and follow. One way to write this expression is:

SELECT COALESCE(NULLIF(CONVERT(SMALLINT,1+RAND()*5),3),-1);

(In English, working from the inside out: convert the result of the expression 1+RAND()*5 to a smallint; if the result of that conversion is 3, set it to NULL; if the result of that is NULL, set it to -1. You could write this with a more verbose CASE expression, but concise seems to be king.)

If you run that a bunch of times, you should see a range of values from 1-5, as well as -1. You will see some instances of 3, and you may have also noticed that you occasionally see NULL, though you might not expect either of those results. Let's check the distribution:

USE tempdb;
GO
CREATE TABLE dbo.dist(TheNumber SMALLINT);
GO
INSERT dbo.dist(TheNumber) SELECT COALESCE(NULLIF(CONVERT(SMALLINT,1+RAND()*5),3),-1);
GO 10000
SELECT TheNumber, occurences = COUNT(*) FROM dbo.dist 
  GROUP BY TheNumber ORDER BY TheNumber;
GO
DROP TABLE dbo.dist;

Results (your results will certainly vary, but the basic trend should be similar):

TheNumber occurences
NULL 1,654
-1 2,002
1 1,290
2 1,266
3 1,287
4 1,251
5 1,250

Distribution of TheNumber using COALESCE

Breaking down a searched CASE expression

Are you scratching your head yet? How do the values NULL and 3 show up, and why is the distribution for NULL and -1 substantially higher? Well, I'll answer the former directly, and invite hypotheses for the latter.

The expression roughly expands to the following, logically, since RAND() is evaluated twice inside NULLIF, and then multiply that by two evaluations for each branch of the COALESCE function. I don't have a debugger handy, so this isn't necessarily *exactly* what is done inside of SQL Server, but it should be equivalent enough to explain the point:

SELECT 
  CASE WHEN 
      CASE WHEN CONVERT(SMALLINT,1+RAND()*5) = 3 THEN NULL 
      ELSE CONVERT(SMALLINT,1+RAND()*5) 
      END 
    IS NOT NULL 
    THEN
      CASE WHEN CONVERT(SMALLINT,1+RAND()*5) = 3 THEN NULL 
      ELSE CONVERT(SMALLINT,1+RAND()*5) 
      END
    ELSE -1
  END
END

So you can see that being evaluated multiple times can quickly become a Choose Your Own Adventure™ book, and how both NULL and 3 are possible outcomes that don't seem possible when examining the original statement. An interesting side note: this doesn't happen quite the same if you take the above distribution script and replace COALESCE with ISNULL. In that case, there is no possibility for a NULL output; the distribution is roughly as follows:

TheNumber occurences
-1 1,966
1 1,585
2 1,644
3 1,573
4 1,598
5 1,634

Distribution of TheNumber using ISNULL

Again, your actual results will certainly vary, but shouldn't by much. The point is that we can still see that 3 falls through the cracks quite often, but ISNULL magically eliminates the potential for NULL to make it all the way through.

I talked about some of the other differences between COALESCE and ISNULL in a tip, entitled "Deciding between COALESCE and ISNULL in SQL Server." When I wrote that, I was heavily in favor of using COALESCE except in the case where the first argument was a subquery (again, due to this bug "feature gap"). Now I'm not so sure I feel as strongly about that.

Simple CASE expressions can become nested over linked servers

One of the few limitations of the CASE expression is that is restricted to 10 nest levels. In this example over on dba.stackexchange.com, Paul White demonstrates (using Plan Explorer) that a simple expression like this:

SELECT CASE column_name
  WHEN '1' THEN 'a' 
  WHEN '2' THEN 'b'
  WHEN '3' THEN 'c'
  ...
END
FROM ...

Gets expanded by the parser to the searched form:

SELECT CASE 
  WHEN column_name = '1' THEN 'a' 
  WHEN column_name = '2' THEN 'b'
  WHEN column_name = '3' THEN 'c'
  ...
END
FROM ...

But can actually be transmitted over a linked server connection as the following, much more verbose query:

SELECT 
  CASE WHEN column_name = '1' THEN 'a' ELSE
    CASE WHEN column_name = '2' THEN 'b' ELSE
      CASE WHEN column_name = '3' THEN 'c' ELSE 
      ... 
      ELSE NULL
      END
    END
  END
FROM ...

In this situation, even though the original query only had a single CASE expression with 10+ possible outcomes, when sent to the linked server, it had 10+ nested CASE expressions. As such, as you might expect, it returned an error:

Msg 8180, Level 16, State 1
Statement(s) could not be prepared.
Msg 125, Level 15, State 4
Case expressions may only be nested to level 10.

In some cases, you can rewrite it as Paul suggested, with an expression like this (assuming column_name is a varchar column):

SELECT CASE CONVERT(VARCHAR(MAX), SUBSTRING(column_name, 1, 255))
  WHEN 'a' THEN '1'
  WHEN 'b' THEN '2'
  WHEN 'c' THEN '3'
  ...
END
FROM ...

In some cases, only the SUBSTRING may be required to alter the location where the expression is evaluated; in others, only the CONVERT. I did not perform exhaustive testing, but this may have to do with the linked server provider, options like Collation Compatible and Use Remote Collation, and the version of SQL Server at either end of the pipe.

Long story short, it is important to remember that your CASE expression can be re-written for you without warning, and that any workaround you use may later be overruled by the optimizer, even if it works for you now.

CASE Expression Final Thoughts and Additional Resources

I hope I've given some food for thought on some of the lesser-known aspects of the CASE expression, and some insight into situations where CASE – and some of the functions that use the same underlying logic – return unexpected results. Some other interesting scenarios where this type of problem has cropped up: