Itzik Ben-Gan

NULL Complexities – Part 2

January 8, 2020 by in T-SQL Queries | No Comments
SentryOne Monitor : Save 40% for a Limited Time!
SentryOne eBooks

In these books, you will find useful, hand-picked articles that will help give insight into some of your most vexing performance problems. These articles were written by several of the SQL Server industry’s leading experts, including Paul White, Paul Randal, Jonathan Kehayias, Erin Stellato, Glenn Berry, Aaron Bertrand, and Joe Sack.

Register to Download

Featured Author

Itzik is a T-SQL trainer, a co-founder of SolidQ, and blogs about T-SQL fundamentals and query tuning.

Itzik’s Posts

This article is the second in a series on NULL complexities. Last month I introduced the NULL as SQL's marker for any kind of missing value. I explained that SQL doesn't provide you with the ability to distinguish between missing and applicable (A-values) and missing and inapplicable (I-values) markers. I also explained how comparisons involving NULLs work with constants, variables, parameters and columns. This month I continue the discussion by covering NULL treatment inconsistencies in different T-SQL elements.

I'll continue using the sample database TSQLV5 like last month in some of my examples. You can find the script that creates and populates this database here, and its ER diagram here.

NULL treatment inconsistencies

As you've already gathered, NULL treatment is not trivial. Some of the confusion and complexity has to do with the fact that the treatment of NULLs can be inconsistent between different elements of T-SQL for similar operations. In the upcoming sections I describe NULL handling in linear versus aggregate computations, ON/WHERE/HAVING clauses, CHECK constraint versus CHECK option, IF/WHILE/CASE elements, the MERGE statement, distinctness and grouping, as well as ordering and uniqueness.

Linear versus aggregate computations

T-SQL, and same goes for standard SQL, uses different NULL handling logic when applying an actual aggregate function such as SUM, MIN and MAX across rows versus when applying the same computation as a linear one across columns. To demonstrate this difference, I'll use two sample tables called #T1 and #T2 which you create and populate by running the following code:

DROP TABLE IF EXISTS #T1, #T2;
 
SELECT * INTO #T1 FROM ( VALUES(10, 5, NULL) ) AS D(col1, col2, col3);
 
SELECT * INTO #T2 FROM ( VALUES(10),(5),(NULL) ) AS D(col1);

The table #T1 has three columns called col1, col2 and col3. It currently has one row with the column values 10, 5 and NULL, respectively:

SELECT * FROM #T1;
col1        col2        col3
----------- ----------- -----------
10          5           NULL

The table #T2 has one column called col1. It currently has three rows with the values 10, 5 and NULL in col1:

SELECT * FROM #T2;
col1
-----------
10
5
NULL

When applying what is ultimately an aggregate computation such as addition as a linear one across columns, the presence of any NULL input yields a NULL result. The following query demonstrates this behavior:

SELECT col1 + col2 + col3 AS total
FROM #T1;

This query generates the following output:

total
-----------
NULL

Conversely, actual aggregate functions, which are applied across rows, are designed to ignore NULL inputs. The following query demonstrates this behavior using the SUM function:

SELECT SUM(col1) AS total
FROM #T2;

This query generates the following output:

total
-----------
15

Warning: Null value is eliminated by an aggregate or other SET operation.

Notice the warning mandated by the SQL standard indicating the presence of NULL inputs which were ignored. You can suppress such warnings by turning off the ANSI_WARNINGS session option.

Similarly, when applied to an input expression, the COUNT function counts the number of rows with non-NULL input values (as opposed to COUNT(*) which simply counts the number of rows). For example, replacing SUM(col1) with COUNT(col1) in the above query returns the count of 2.

Curiously, if you apply a COUNT aggregate to a column that is defined as not allowing NULLs, the optimizer converts the expression COUNT() to COUNT(*). This enables the use of any index for the purpose of counting as opposed to requiring the use of an index that contains the column in question. That's one more reason beyond ensuring the consistency and integrity of your data that should encourage you to enforce constraints such as NOT NULL and others. Such constraints allow the optimizer more flexibility in considering more optimal alternatives, and avoiding unnecessary work.

Based on this logic, the AVG function divides the sum of non-NULL values by the count of non-NULL values. Consider the following query as an example:

SELECT AVG(1.0 * col1) AS avgall
FROM #T2;

Here the sum of the non-NULL col1 values 15 is divided by the count of non-NULL values 2. You multiply col1 by the numeric literal 1.0 to force implicit conversion of the integer input values to numeric ones to get numeric division and not integer division. This query generates the following output:

avgall
---------
7.500000

Similarly, the MIN and MAX aggregates ignore NULL inputs. Consider the following query:

SELECT MIN(col1) AS mincol1, MAX(col1) AS maxcol1
FROM #T2;

This query generates the following output:

mincol1     maxcol1
----------- -----------
5           10

Attempting to apply linear calculations but emulating aggregate function semantics (ignore NULLs) is not pretty. Emulating SUM, COUNT and AVG is not too complex, but it does require you to check every input for NULLs, like so:

SELECT col1, col2, col3,
  CASE
    WHEN COALESCE(col1, col2, col3) IS NULL THEN NULL
    ELSE COALESCE(col1, 0) + COALESCE(col2, 0) + COALESCE(col3, 0)
  END AS sumall,
  CASE WHEN col1 IS NOT NULL THEN 1 ELSE 0 END
    + CASE WHEN col2 IS NOT NULL THEN 1 ELSE 0 END
    + CASE WHEN col3 IS NOT NULL THEN 1 ELSE 0 END AS cntall,
  CASE
    WHEN COALESCE(col1, col2, col3) IS NULL THEN NULL
    ELSE 1.0 * (COALESCE(col1, 0) + COALESCE(col2, 0) + COALESCE(col3, 0))
           / (CASE WHEN col1 IS NOT NULL THEN 1 ELSE 0 END
                + CASE WHEN col2 IS NOT NULL THEN 1 ELSE 0 END
                + CASE WHEN col3 IS NOT NULL THEN 1 ELSE 0 END)
  END AS avgall
FROM #T1;

This query generates the following output:

col1        col2        col3        sumall      cntall      avgall
----------- ----------- ----------- ----------- ----------- ---------------
10          5           NULL        15          2           7.500000000000

Attempting to apply a minimum or maximum as a linear computation to more than two input columns is quite tricky even before you add the logic to ignore NULLs since it involves nesting multiple CASE expressions either directly or indirectly (when you reuse column aliases). For instance, here's a query computing the maximum among col1, col2 and col3 in #T1, without the part that ignores NULLs:

SELECT col1, col2, col3, 
  CASE WHEN col1 IS NULL OR col2 IS NULL OR col3 IS NULL THEN NULL ELSE max2 END AS maxall
FROM #T1
  CROSS APPLY (VALUES(CASE WHEN col1 >= col2 THEN col1 ELSE col2 END)) AS A1(max1)
  CROSS APPLY (VALUES(CASE WHEN max1 >= col3 THEN max1 ELSE col3 END)) AS A2(max2);

This query generates the following output:

col1        col2        col3        maxall
----------- ----------- ----------- -----------
10          5           NULL        NULL

If you examine the query plan, you will find the following expanded expression computing the final result:

[Expr1005] = Scalar Operator(CASE WHEN CASE WHEN [#T1].[col1] IS NOT NULL THEN [#T1].[col1] ELSE 
  CASE WHEN [#T1].[col2] IS NOT NULL THEN [#T1].[col2] 
    ELSE [#T1].[col3] END END IS NULL THEN NULL ELSE 
  CASE WHEN CASE WHEN [#T1].[col1]>=[#T1].[col2] THEN [#T1].[col1] 
    ELSE [#T1].[col2] END>=[#T1].[col3] THEN 
  CASE WHEN [#T1].[col1]>=[#T1].[col2] THEN [#T1].[col1] 
    ELSE [#T1].[col2] END ELSE [#T1].[col3] END END)

And that's when there are only three columns involved. Imagine having a dozen columns involved!

Now add to this the logic to ignore NULLs:

SELECT col1, col2, col3, max2 AS maxall
FROM #T1
  CROSS APPLY (VALUES(CASE WHEN col1 >= col2 OR col2 IS NULL THEN col1 ELSE col2 END)) AS A1(max1)
  CROSS APPLY (VALUES(CASE WHEN max1 >= col3 OR col3 IS NULL THEN max1 ELSE col3 END)) AS A2(max2);

This query generates the following output:

col1        col2        col3        maxall
----------- ----------- ----------- -----------
10          5           NULL        10

Oracle has a pair of functions called GREATEST and LEAST that apply minimum and maximum calculations, respectively, as linear ones to the input values. These functions return a NULL given any NULL input like most linear calculations do. There's an open feedback item asking to get similar functions in T-SQL. If Microsoft does add such functions to T-SQL, it would be great to have an option controlling whether to ignore NULLs or not.

In the meanwhile, there's a much more elegant technique compared to the aforementioned ones that computes any kind of aggregate as a linear one across columns using actual aggregate function semantics ignoring NULLs. You use a combination of the CROSS APPLY operator and a derived table query against a table-value constructor that rotates columns to rows and applies the aggregate as an actual aggregate function. Here's an example demonstrating the MIN and MAX computations, but you can use this technique with any aggregate function that you like:

SELECT col1, col2, col3, maxall, minall
FROM #T1 CROSS APPLY
  (SELECT MAX(mycol), MIN(mycol)
   FROM (VALUES(col1),(col2),(col3)) AS D1(mycol)) AS D2(maxall, minall);

This query generates the following output:

col1        col2        col3        maxall      minall
----------- ----------- ----------- ----------- -----------
10          5           NULL        10          5

What if you want the opposite? What if you need to compute an aggregate across rows, but produce a NULL if there's any NULL input? For instance, suppose that you need to sum all col1 values from #T1, but return NULL if any of the inputs is NULL. This can be achieved with the following technique:

SELECT SUM(col1) * NULLIF(MIN(CASE WHEN col1 IS NULL THEN 0 ELSE 1 END), 0) AS sumall
FROM #T2;

You apply a MIN aggregate to a CASE expression that returns zeros for NULL inputs and ones for non-NULL inputs. If there's any NULL input, the result of the MIN function is 0, otherwise it's 1. Then using the NULLIF function you convert a 0 result to a NULL. You then multiply the result of the NULLIF function by the original sum. If there's any NULL input, you multiply the original sum by a NULL yielding a NULL. If there's no NULL input, you multiply the result of the original sum by 1, yielding the original sum.

Back to linear computations yielding a NULL for any NULL input, the same logic applies to string concatenation using the + operator, as the following query demonstrates:

USE TSQLV5;
 
SELECT empid, country, region, city,
  country + N',' + region + N',' + city AS emplocation
FROM HR.Employees;

This query generates the following output:

empid       country         region          city            emplocation
----------- --------------- --------------- --------------- ----------------
1           USA             WA              Seattle         USA,WA,Seattle
2           USA             WA              Tacoma          USA,WA,Tacoma
3           USA             WA              Kirkland        USA,WA,Kirkland
4           USA             WA              Redmond         USA,WA,Redmond
5           UK              NULL            London          NULL
6           UK              NULL            London          NULL
7           UK              NULL            London          NULL
8           USA             WA              Seattle         USA,WA,Seattle
9           UK              NULL            London          NULL

You want to concatenate the location parts of employees into one string, using a comma as a separator. But you want to ignore NULL inputs. Instead, when any of the inputs is a NULL, you get a NULL as the result. Some turn off the CONCAT_NULL_YIELDS_NULL session option, which causes a NULL input to get converted to an empty string for concatenation purposes, but this option is not recommended since it applies nonstandard behavior. Moreover, you'll be left with multiple consecutive separators when there are NULL inputs, which is typically not the desired behavior. Another option is to explicitly replace NULL inputs with an empty string using the ISNULL or COALESCE functions, but this usually results in lengthy verbose code. A much more elegant option is to use the CONCAT_WS function, which was introduced in SQL Server 2017. This function concatenates the inputs, ignoring NULLs, using the separator provided as the first input. Here's the solution query using this function:

SELECT empid, country, region, city,
  CONCAT_WS(N',', country, region, city) AS emplocation
FROM HR.Employees;

This query generates the following output:

empid       country         region          city            emplocation
----------- --------------- --------------- --------------- ----------------
1           USA             WA              Seattle         USA,WA,Seattle
2           USA             WA              Tacoma          USA,WA,Tacoma
3           USA             WA              Kirkland        USA,WA,Kirkland
4           USA             WA              Redmond         USA,WA,Redmond
5           UK              NULL            London          UK,London
6           UK              NULL            London          UK,London
7           UK              NULL            London          UK,London
8           USA             WA              Seattle         USA,WA,Seattle
9           UK              NULL            London          UK,London

ON/WHERE/HAVING

When using the WHERE, HAVING and ON query clauses for filtering/matching purposes, it's important to remember that they use three-valued predicate logic. When you have three-valued-logic involved, you want to accurately identify how the clause handles TRUE, FALSE and UNKNOWN cases. These three clauses are designed to accept TRUE cases, and reject FALSE and UNKNOWN cases.

To demonstrate this behavior I'll use a table called Contacts which you create and populate by running the following code:.

DROP TABLE IF EXISTS dbo.Contacts;
GO
 
CREATE TABLE dbo.Contacts
(
  id INT NOT NULL 
    CONSTRAINT PK_Contacts PRIMARY KEY,
  name VARCHAR(10) NOT NULL,
  hourlyrate NUMERIC(12, 2) NULL
    CONSTRAINT CHK_Contacts_hourlyrate CHECK(hourlyrate > 0.00)
);
 
INSERT INTO dbo.Contacts(id, name, hourlyrate) VALUES
  (1, 'A', 100.00),(2, 'B', 200.00),(3, 'C', NULL);

Notice that contacts 1 and 2 have applicable hourly rates, and contact 3 doesn't, so its hourly rate is set to NULL. Consider the following query looking for contacts with a positive hourly rate:

SELECT id, name, hourlyrate
FROM dbo.Contacts
WHERE hourlyrate > 0.00;

This predicate evaluates to TRUE for contacts 1 and 2, and to UNKNOWN for contact 3, hence the output contains only contacts 1 and 2:

id          name       hourlyrate
----------- ---------- -----------
1           A          100.00
2           B          200.00

The thinking here is that when you're certain that the predicate is true, you want to return the row, otherwise you want to discard it. This might seem trivial at first, until you realize that some language elements that also use predicates work differently.

CHECK constraint versus CHECK option

A CHECK constraint is a tool that you use to enforce integrity in a table based on a predicate. The predicate is evaluated when you attempt to insert or update rows in the table. Unlike query filtering and matching clauses which accept TRUE cases and reject FALSE and UNKNOWN cases, a CHECK constraint is designed to accept TRUE and UNKNOWN cases and reject FALSE cases. The thinking here is that when you're certain that the predicate is false, you want to reject the attempted change, otherwise you want to allow it.

If you examine the definition of our Contacts table, you will notice that it has the following CHECK constraint, rejecting contacts with nonpositive hourly rates:

CONSTRAINT CHK_Contacts_hourlyrate CHECK(hourlyrate > 0.00)

Notice that the constraint uses the same predicate like the one you used in the previous query filter.

Try to add a contact with a positive hourly rate:

INSERT INTO dbo.Contacts(id, name, hourlyrate) VALUES (4, 'D', 150.00);

This attempt succeeds.

Try to add a contact with a NULL hourly rate:

INSERT INTO dbo.Contacts(id, name, hourlyrate) VALUES (5, 'E', NULL);

This attempt succeeds as well, since a CHECK constraint is designed to accept TRUE and UNKNOWN cases. That's the case where a query filter and a CHECK constraint are designed to work differently.

Try to add a contact with a nonpositive hourly rate:

INSERT INTO dbo.Contacts(id, name, hourlyrate) VALUES (6, 'F', -100.00);

This attempt fails with the following error:

Msg 547, Level 16, State 0, Line 454
The INSERT statement conflicted with the CHECK constraint "CHK_Contacts_hourlyrate". The conflict occurred in database "TSQLV5", table "dbo.Contacts", column 'hourlyrate'.

T-SQL also allows you to enforce integrity of modifications through views using a CHECK option. Some think of this option as serving a similar purpose to a CHECK constraint so long as you apply the modification through the view. For example, consider the following view, which uses a filter based on the predicate hourlyrate > 0.00 and is defined with the CHECK option:

CREATE OR ALTER VIEW dbo.MyContacts
AS
SELECT id, name, hourlyrate
FROM dbo.Contacts
WHERE hourlyrate > 0.00
WITH CHECK OPTION;

As it turns out, unlike a CHECK constraint, the view CHECK option is designed to accept TRUE cases and reject both FALSE and UNKNOWN cases. So it's actually designed to behave more like the query filter normally does also for the purpose of enforcing integrity.

Try inserting a row with a positive hourly rate through the view:

INSERT INTO dbo.MyContacts(id, name, hourlyrate) VALUES (7, 'G', 300.00);

This attempt succeeds.

Try inserting a row with a NULL hourly rate through the view:

INSERT INTO dbo.MyContacts(id, name, hourlyrate) VALUES (8, 'H', NULL);

This attempt fails with the following error:

Msg 550, Level 16, State 1, Line 473
The attempted insert or update failed because the target view either specifies WITH CHECK OPTION or spans a view that specifies WITH CHECK OPTION and one or more rows resulting from the operation did not qualify under the CHECK OPTION constraint.

The thinking here is that once you add the CHECK option to the view, you only want to allow modifications resulting in rows that are returned by the view. That's a bit different than the thinking with a CHECK constraint—reject changes for which you are certain that the predicate is false. This can be a bit confusing. If you want the view to allow modifications that set the hourly rate to NULL, you need the query filter to allow those as well by adding OR hourlyrate IS NULL. You just need to realize that a CHECK constraint and a CHECK option are designed to work differently with respect to the UNKNOWN case. The former accepts it whereas the latter rejects it.

Query the Contacts table after all of the above changes:

SELECT id, name, hourlyrate
FROM dbo.Contacts;

You should get the following output at this point:

id          name       hourlyrate
----------- ---------- -----------
1           A          100.00
2           B          200.00
3           C          NULL
4           D          150.00
5           E          NULL
7           G          300.00

IF/WHILE/CASE

The IF, WHILE and CASE language elements work with predicates.

The IF statement is designed as follows:

IF <predicate>
  <statement or BEGIN-END block when TRUE>
ELSE
  <statement or BEGIN-END block when FALSE or UNKNOWN>

It's intuitive to expect to have a TRUE block following the IF clause and a FALSE block following the ELSE clause, but you need to realize that the ELSE clause actually gets activated when the predicate is FALSE or UNKNOWN. Theoretically, a three-valued-logic language could have had an IF statement with a separation of the three cases. Something like this:

IF <predicate>
  WHEN TRUE
    <statement or BEGIN-END block when TRUE>
  WHEN FALSE
    <statement or BEGIN-END block when FALSE>
  WHEN UNKNOWN
    <statement or BEGIN-END block when UNKNOWN>

And even allow combinations of logical outcomes so that if you wanted to combine FALSE and UNKNOWN into one section, you could use something like this:

IF <predicate>
  WHEN TRUE
    <statement or BEGIN-END block when TRUE>
  WHEN FALSE OR UNKNOWN
    <statement or BEGIN-END block when FALSE OR UNKNOWN>

In the meanwhile, you can emulate such constructs by nesting IF-ELSE statements, and explicitly looking for NULLs in the operands with the IS NULL operator.

The WHILE statement only has a TRUE block. It's designed as follows:

WHILE <predicate>
  <statement or BEGIN-END block when TRUE>

The statement or BEGIN-END block forming the loop's body is activated while the predicate is TURE. As soon as the predicate is FALSE or UNKNOWN, control passes to the statement following the WHILE loop.

Unlike IF and WHILE, which are statements executing code, CASE is an expression returning a value. The syntax of a searched CASE expression is as follows:

CASE
  WHEN <predicate 1> THEN <expression 1 when TRUE>
  WHEN <predicate 2> THEN <expression 2 when TRUE >
  ...
  WHEN <predicate n> THEN <expression n when TRUE >
  ELSE <else expression when all are FALSE or UNKNOWN>
END

A CASE expression is designed to return the expression following the THEN clause that corresponds to the first WHEN predicate that evaluates to TRUE. If there is an ELSE clause, it's activated if no WHEN predicate is TRUE (all are FALSE or UNKNOWN). Absent an explicit ELSE clause, an implicit ELSE NULL is used. If you want to handle an UNKNOWN case separately, you can explicitly look for NULLs in the predicate's operands using the IS NULL operator.

A simple CASE expression uses implicit equality-based comparisons between the source expression and the compared expressions:

CASE <source expression>
  WHEN <comp expression 1> THEN <result expression 1 when TRUE>
  WHEN <comp expression 2> THEN <result expression 2 when TRUE >
  ...
  WHEN <comp expression n> THEN <result expression n when TRUE >
  ELSE <else result expression when all are FALSE or UNKNOWN>
END

The simple CASE expression is designed similar to the searched CASE expression in terms of the handling of the three-valued logic, but since the comparisons use an implicit equality-based comparison, you cannot handle the UNKNOWN case separately. An attempt to use a NULL in one of the compared expressions in the WHEN clauses is meaningless since the comparison will not result in TRUE even when the source expression is NULL. Consider the following example:

DECLARE @input AS INT = NULL;
 
SELECT CASE @input WHEN NULL THEN 'Input is NULL' ELSE 'Input is not NULL' END;

This gets converted implicitly to the following:

DECLARE @input AS INT = NULL;
 
SELECT CASE WHEN @input = NULL THEN 'Input is NULL' ELSE 'Input is not NULL' END;

Consequently, the outcome is:

Input is not NULL

To detect a NULL input, you need to use the searched CASE expression syntax and the IS NULL operator, like so:

DECLARE @input AS INT = NULL;
 
SELECT CASE WHEN @input IS NULL THEN 'Input is NULL' ELSE 'Input is not NULL' END;

This time the outcome is:

Input is NULL

MERGE

The MERGE statement is used to merge data from a source into a target. You use a merge predicate to identify the following cases and apply an action against the target:

  • A source row is matched by a target row (activated when a match is found for the source row where the merge predicate is TRUE): apply UPDATE or DELETE against target
  • A source row is not matched by a target row (activated when no matches are found for the source row where the merge predicate is TRUE, rather for all the predicate is FALSE or UNKNOWN): apply an INSERT against target
  • A target row is not matched by a source row (activated when no matches are found for the target row where the merge predicate is TRUE, rather for all the predicate is FALSE or UNKNOWN): apply UPDATE or DELETE against target

All three scenarios separate TRUE to one group and FALSE or UNKNOWN to another. You don't get separate sections for handling TRUE, handling FALSE and handling UNKNOWN cases.

To demonstrate this, I'll use a table called T3 which you create and populate by running the following code:

DROP TABLE IF EXISTS dbo.T3;
GO
 
CREATE TABLE dbo.T3(col1 INT NULL, col2 INT NULL, CONSTRAINT UNQ_T3 UNIQUE(col1));
 
INSERT INTO dbo.T3(col1) VALUES(1),(2),(NULL);

Consider the following MERGE statement:

MERGE INTO dbo.T3 AS TGT
USING (VALUES(1, 100), (3, 300)) AS SRC(col1, col2)
  ON SRC.col1 = TGT.col1
WHEN MATCHED THEN UPDATE
  SET TGT.col2 = SRC.col2
WHEN NOT MATCHED THEN INSERT(col1, col2) VALUES(SRC.col1, SRC.col2)
WHEN NOT MATCHED BY SOURCE THEN UPDATE
  SET col2 = -1;
 
SELECT col1, col2 FROM dbo.T3;

The source row where col1 is 1 is matched by the target row where col1 is 1 (predicate is TRUE) and therefore the target row's col2 is set to 100.

The source row where col1 is 3 is not matched by any target row (for all the predicate is FALSE or UNKNOWN) and therefore a new row is inserted into T3 with 3 as the col1 value and 300 as the col2 value.

The target rows where col1 is 2 and where col1 is NULL are not matched by any source row (for all rows the predicate is FALSE or UNKNOWN) and therefore in both cases col2 in the target rows is set to -1.

The query against T3 returns the following output after executing the above MERGE statement:

col1        col2
----------- -----------
1           100
2           -1
NULL        -1
3           300

Keep table T3 around; it's used later.

Distinctness and grouping

Unlike comparisons that are done using equality and inequality operators, comparisons done for distinctness and grouping purposes group NULLs together. One NULL is considered to be not distinct from another NULL, but a NULL is considered to be distinct from a non-NULL value. Consequently, applying a DISTINCT clause removes duplicate occurrences of NULLs. The following query demonstrates this:

SELECT DISTINCT country, region FROM HR.Employees;

This query generates the following output:

country         region
--------------- ---------------
UK              NULL
USA             WA

There are multiple employees with the country USA and the region NULL, and after the removal of duplicates the result shows only one occurrence of the combination.

Like distinctness, grouping also groups NULLs together, as the following query demonstrates:

SELECT country, region, COUNT(*) AS numemps
FROM HR.Employees
GROUP BY country, region;

This query generates the following output:

country         region          numemps
--------------- --------------- -----------
UK              NULL            4
USA             WA              5

Again, all four employees with the country UK and region NULL were grouped together.

Ordering

Ordering treats multiple NULLs as having the same ordering value. The SQL standard leaves it to the implementation to choose whether to order NULLs first or last compared to non-NULL values. Microsoft chose to consider NULLs as having lower ordering values compared to non-NULLs in SQL Server, so when using ascending order direction, T-SQL orders NULLs first. The following query demonstrates this:

SELECT id, name, hourlyrate
FROM dbo.Contacts
ORDER BY hourlyrate;

This query generates the following output:

id          name       hourlyrate
----------- ---------- -----------
3           C          NULL
5           E          NULL
1           A          100.00
4           D          150.00
2           B          200.00
7           G          300.00

Next month I'll add more on this topic, discussing standard elements that give you control over NULL ordering behavior and the workarounds for those elements in T-SQL.

Uniqueness

When enforcing uniqueness on a NULLable column using either a UNIQUE constraint or a unique index, T-SQL treats NULLs just like non-NULL values. It rejects duplicate NULLs as if one NULL isn't unique from another NULL.

Recall that our table T3 has a UNIQUE constraint defined on col1. Here's its definition:

CONSTRAINT UNQ_T3 UNIQUE(col1)

Query T3 to see its current contents:

SELECT * FROM dbo.T3;

If you ran all of the modifications against T3 from the earlier examples in this article, you should get the following output:

col1        col2
----------- -----------
1           100
2           -1
NULL        -1
3           300

Attempt to add a second row with a NULL in col1:

INSERT INTO dbo.T3(col1, col2) VALUES(NULL, 400);

You get the following error:

Msg 2627, Level 14, State 1, Line 558
Violation of UNIQUE KEY constraint 'UNQ_T3'. Cannot insert duplicate key in object 'dbo.T3'. The duplicate key value is (<NULL>).

This behavior is actually nonstandard. Next month I'll describe the standard specification, and how to emulate it in T-SQL.

Conclusion

In this second part of the series on NULL complexities I focused on NULL treatment inconsistencies among different T-SQL elements. I covered linear versus aggregate computations, filtering and matching clauses, the CHECK constraint versus the CHECK option, IF, WHILE and CASE elements, the MERGE statement, distinctness and grouping, ordering, and uniqueness. The inconsistencies I covered further stress how important it is to correctly understand the treatment of NULLs in the platform you're using, to make sure that you write correct and robust code. Next month I'll continue the series by covering the SQL standard NULL treatment options that aren't available in T-SQL, and provide workarounds that are supported in T-SQL.