CASE expression is one of my favorite constructs in T-SQL. It is quite flexible, and is sometimes the only way to control the order in which SQL Server will evaluate predicates.
However, it is often misunderstood. Not surprisingly, I have a few examples.
CASE is an expression, not a statement
Likely not important to most people, and perhaps this is just my pedantic side, but a lot of people call it a
CASE statement – including Microsoft, whose documentation uses statement and expression interchangeably at times. I find this mildly annoying (like row/record and column/field), and it's mostly semantics, but there is an important distinction: an expression returns a result. When people think of
CASE as a statement, it leads to experiments in code shortening like this:
SELECT CASE [status] WHEN 'A' THEN StatusLabel = 'Authorized', LastEvent = AuthorizedTime WHEN 'C' THEN StatusLabel = 'Completed', LastEvent = CompletedTime END FROM dbo.some_table;
SELECT CASE WHEN @foo = 1 THEN (SELECT foo, bar FROM dbo.fizzbuzz) ELSE (SELECT blat, mort FROM dbo.splunge) END;
This type of control-of-flow logic may be possible with
CASE statements in other languages (like VBScript), but not in Transact-SQL's
CASE expression. To use
CASE within the same query logic, you would have to use a
CASE expression for each output column:
SELECT StatusLabel = CASE [status] WHEN 'A' THEN 'Authorized' WHEN 'C' THEN 'Completed' END, LastEvent = CASE [status] WHEN 'A' THEN AuthorizedTime WHEN 'C' THEN CompletedTime END FROM dbo.some_table;
CASE will not always short circuit
The official documentation implies that the entire expression will short-circuit, meaning it will evaluate the expression from left-to-right, and stop evaluating when it hits a match:
However, this isn't always true. And to its credit, at least in the 2014 docs, the page goes on to try to explain one scenario where this isn't guaranteed. But it only gets part of the story:
The divide by zero example is pretty easy to reproduce, and I demonstrated it in this answer on dba.stackexchange.com:
DECLARE @i INT = 1; SELECT CASE WHEN @i = 1 THEN 1 ELSE MIN(1/0) END;
Divide by zero error encountered.
There are trivial workarounds (such as
ELSE (SELECT MIN(1/0)) END), but this comes as a real surprise to many who haven't memorized the above sentences from Books Online. I was first made aware of this specific scenario in a conversation on a private e-mail distribution list by Itzik Ben-Gan (@ItzikBenGan), who in turn was initially notified by Jaime Lafargue. I reported the bug in Connect #690017 : CASE / COALESCE won't always evaluate in textual order; it was swiftly closed as "By Design." Paul White (blog | @SQL_Kiwi) subsequently filed Connect #691535 : Aggregates Don't Follow the Semantics Of CASE, and it was closed as "Fixed." The fix, in this case, was clarification in the Books Online article; namely, the snippet I copied above.
This behavior can yield itself in some other, less obvious scenarios, too. For example, Connect #780132 : FREETEXT() does not honor order of evaluation in CASE statements (no aggregates involved) shows that, well,
CASE evaluation order is not guaranteed to be left-to-right when using certain full-text functions either. On that item, Paul White commented that he also observed something similar using the new
LAG() function introduced in SQL Server 2012. I don't have a repro handy, but I do believe him, and I don't think we've unearthed all of the edge cases where this may occur.
So, when aggregates or non-native services like Full-Text Search are involved, please do not make any assumptions about short circuiting in a
Expressions can be evaluated more than once
I often see people writing a simple
CASE expression, like this:
SELECT CASE @variable WHEN 1 THEN 'foo' WHEN 2 THEN 'bar' END
It is important to understand that this will actually be executed as a searched
CASE expression, like this:
SELECT CASE WHEN @variable = 1 THEN 'foo' WHEN @variable = 2 THEN 'bar' END
The reason it is important to understand that the expression being evaluated will actually be referenced multiple times, is because it can actually be evaluated multiple times. When this is a variable, or a constant, or a column reference, this is unlikely to be a real problem; however, things can change quickly when it's a non-deterministic function. Consider that this expression yields a
SMALLINT between 1 and 3; go ahead and run it many times, and you will always get one of those three values:
SELECT CONVERT(SMALLINT, 1+RAND()*3);
Now, put this into a simple
CASE expression, and run it a dozen times – eventually you will get a result of
SELECT [result] = CASE CONVERT(SMALLINT, 1+RAND()*3) WHEN 1 THEN 'one' WHEN 2 THEN 'two' WHEN 3 THEN 'three' END;
How does this happen? Well, the entire
CASE expression is actually expanded to a searched expression, as follows:
SELECT [result] = CASE WHEN CONVERT(SMALLINT, 1+RAND()*3) = 1 THEN 'one' WHEN CONVERT(SMALLINT, 1+RAND()*3) = 2 THEN 'two' WHEN CONVERT(SMALLINT, 1+RAND()*3) = 3 THEN 'three' ELSE NULL -- this is always implicitly there END;
So, what happens is that each
WHEN clause evaluates and invokes
RAND() independently – and in each case it could yield a different value. Let's say we enter the expression, and we check the first
WHEN clause, and the result is 3. So we skip that clause and move on. It is conceivable that the next two clauses will both return 1 when
RAND() is evaluated again – in which case none of the conditions are evaluated to true, so the
ELSE takes over.
This problem is not limited to the
RAND() function. Imagine the same style of non-determinism coming from these moving targets:
SELECT [crypt_gen] = 1+ABS(CRYPT_GEN_RANDOM(10) % 20), [newid] = LEFT(NEWID(),2), [checksum] = ABS(CHECKSUM(NEWID())%3);
These expressions can obviously yield a different value if evaluated multiple times. And with a searched
CASE expression, there will be times when every re-evaluation happens to fall out of the search specific to the current
WHEN, and ultimately hit the
ELSE clause. To protect yourself from this, one option is to always hard-code your own explicit
ELSE; just be careful about the fallback value you choose to return, because this will have some skew effect if you are looking for even distribution. Another option is to just change the last
WHEN clause to
ELSE; however this will still lead to uneven distribution. The preferred option, in my opinion, is to try and coerce SQL Server to evaluate the condition once (though this isn't always possible within a single query). For example, compare these two results:
-- Query A: expression referenced directly in CASE; no ELSE: SELECT x, COUNT(*) FROM ( SELECT x = CASE ABS(CHECKSUM(NEWID())%3) WHEN 0 THEN '0' WHEN 1 THEN '1' WHEN 2 THEN '2' END FROM sys.all_columns ) AS y GROUP BY x; -- Query B: additional ELSE clause: SELECT x, COUNT(*) FROM ( SELECT x = CASE ABS(CHECKSUM(NEWID())%3) WHEN 0 THEN '0' WHEN 1 THEN '1' WHEN 2 THEN '2' ELSE '2' END FROM sys.all_columns ) AS y GROUP BY x; -- Query C: Final WHEN converted to ELSE: SELECT x, COUNT(*) FROM ( SELECT x = CASE ABS(CHECKSUM(NEWID())%3) WHEN 0 THEN '0' WHEN 1 THEN '1' ELSE '2' END FROM sys.all_columns ) AS y GROUP BY x; -- Query D: Push evaluation of NEWID() to subquery: SELECT x, COUNT(*) FROM ( SELECT x = CASE x WHEN 0 THEN '0' WHEN 1 THEN '1' WHEN 2 THEN '2' END FROM ( SELECT x = ABS(CHECKSUM(NEWID())%3) FROM sys.all_columns ) AS x ) AS y GROUP BY x;
|Value||Query A||Query B||Query C||Query D|
Distribution of values with different query techniques
In this case I am relying on the fact that SQL Server chose to evaluate the expression in the subquery and not introduce it to the searched
CASE expression, but this is merely to demonstrate that distribution can be coerced to be more even. In reality this may not always be the choice the optimizer makes, so please don't learn from this little trick. :-)
You will observe that if you replace the
CHECKSUM(NEWID()) expression with the
RAND() expression, you'll get entirely different results; most notably, the latter will only ever return one value. This is because
GETDATE() and some other built-in functions, is given special treatment as a runtime constant, and only evaluated once per reference for the entire row. Note that it can still return
NULL just like the first query in the preceding code sample.
This problem is also not limited to the
CASE expression; you can see similar behavior with other built-in functions that use the same underlying semantics. For example,
CHOOSE is merely syntactic sugar for a more elaborate searched
CASE expression, and this will also yield
SELECT [choose] = CHOOSE(CONVERT(SMALLINT, 1+RAND()*3),'one','two','three');
IIF() is a function that I expected to fall into this same trap, but this function is really just a searched
CASE expression with only two possible outcomes, and no
ELSE – so it is tough, without nesting and introducing other functions, to envision a scenario where this can break unexpectedly. While in the simple case it is decent shorthand for
CASE, it is also tough to do anything useful with it if you need more than two possible outcomes. :-)
Finally, we should examine that
COALESCE can have similar issues. Let's consider that these expressions are equivalent:
SELECT COALESCE(@variable, 'constant'); SELECT CASE WHEN @variable IS NOT NULL THEN @variable ELSE 'constant' END);
In this case,
@variable would be evaluated twice (as would any function or subquery, as described in this Connect item).
I was really able to get some puzzled looks when I brought the following example up in a recent forum discussion. Let's say I want to populate a table with a distribution of values from 1-5, but whenever a 3 is encountered, I want to use -1 instead. Not a very real-world scenario, but easy to construct and follow. One way to write this expression is:
(In English working from the inside out: convert the result of the expression
1+RAND()*5 to a smallint; if the result of that conversion is 3, set it to
NULL; if the result of that is
NULL, set it to -1. You could write this with a more verbose
CASE expression, but concise seems to be king.)
If you run that a bunch of times, you should see a range of values from 1-5, as well as -1. You will see some instances of 3, and you may have also noticed that you occasionally see
NULL, though you might not expect either of those results. Let's check the distribution:
USE tempdb; GO CREATE TABLE dbo.dist(TheNumber SMALLINT); GO INSERT dbo.dist(TheNumber) SELECT COALESCE(NULLIF(CONVERT(SMALLINT,1+RAND()*5),3),-1); GO 10000 SELECT TheNumber, occurences = COUNT(*) FROM dbo.dist GROUP BY TheNumber ORDER BY TheNumber; GO DROP TABLE dbo.dist;
Results (your results will certainly vary, but the basic trend should be similar):
Distribution of TheNumber using COALESCE
Are you scratching your head yet? How do the values
NULL and 3 show up, and why is the distribution for
NULL and -1 substantially higher? Well, I'll answer the former directly, and invite hypotheses for the latter.
The expression roughly expands to the following, logically, since
RAND() is evaluated twice inside
NULLIF, and then multiply that by two evaluations for each branch of the
COALESCE function. I don't have a debugger handy, so this isn't necessarily *exactly* what is done inside of SQL Server, but it should be equivalent enough to explain the point:
SELECT CASE WHEN CASE WHEN CONVERT(SMALLINT,1+RAND()*5) = 3 THEN NULL ELSE CONVERT(SMALLINT,1+RAND()*5) END IS NOT NULL THEN CASE WHEN CONVERT(SMALLINT,1+RAND()*5) = 3 THEN NULL ELSE CONVERT(SMALLINT,1+RAND()*5) END ELSE -1 END END
So you can see that being evaluated multiple times can quickly become a Choose Your Own Adventure™ book, and how both
NULL and 3 are possible outcomes that don't seem possible when examining the original statement. An interesting side note: this doesn't happen quite the same if you take the above distribution script and replace
ISNULL. In that case, there is no possibility for a
NULL output; the distribution is roughly as follows:
Distribution of TheNumber using ISNULL
Again, your actual results will certainly vary, but shouldn't by much. The point is that we can still see that 3 falls through the cracks quite often, but
ISNULL magically eliminates the potential for
NULL to make it all the way through.
I talked about some of the other differences between
ISNULL in a tip, entitled "Deciding between COALESCE and ISNULL in SQL Server." When I wrote that, I was heavily in favor of using
COALESCE except in the case where the first argument was a subquery (again, due to this
bug "feature gap"). Now I'm not so sure I feel as strongly about that.
Simple CASE expressions can become nested over linked servers
One of the few limitations of the
CASE expression is that is restricted to 10 nest levels. In this example over on dba.stackexchange.com, Paul White demonstrates (using SQL Sentry Plan Explorer) that a simple expression like this:
SELECT CASE column_name WHEN '1' THEN 'a' WHEN '2' THEN 'b' WHEN '3' THEN 'c' ... END FROM ...
Gets expanded by the parser to the searched form:
SELECT CASE WHEN column_name = '1' THEN 'a' WHEN column_name = '2' THEN 'b' WHEN column_name = '3' THEN 'c' ... END FROM ...
But can actually be transmitted over a linked server connection as the following, much more verbose query:
SELECT CASE WHEN column_name = '1' THEN 'a' ELSE CASE WHEN column_name = '2' THEN 'b' ELSE CASE WHEN column_name = '3' THEN 'c' ELSE ... ELSE NULL END END END FROM ...
In this situation, even though the original query only had a single
CASE expression with 10+ possible outcomes, when sent to the linked server, it had 10+ nested
CASE expressions. As such, as you might expect, it returned an error:
Statement(s) could not be prepared.
Msg 125, Level 15, State 4
Case expressions may only be nested to level 10.
In some cases, you can rewrite it as Paul suggested, with an expression like this (assuming
column_name is a string):
SELECT CASE CONVERT(VARCHAR(MAX), SUBSTRING(column_name, 1, 255)) WHEN 'a' THEN '1' WHEN 'b' THEN '2' WHEN 'c' THEN '3' ... END FROM ...
In some cases, only the
SUBSTRING may be required to alter the location where the expression is evaluated; in others, only the
CONVERT. I did not perform exhaustive testing, but this may have to do with the linked server provider, options like Collation Compatible and Use Remote Collation, and the version of SQL Server at either end of the pipe.
Long story short, it is important to remember that your
CASE expression can be re-written for you without warning, and that any workaround you use may later be overruled by the optimizer, even if it works for you now.
I hope I've given some food for thought on some of the lesser-known aspects of the
CASE expression, and some insight into situations where
CASE – and some of the functions that use the same underlying logic – return unexpected results. Some other interesting scenarios where this type of problem has cropped up: