Fundamentals of table expressions, Part 2 – Derived tables, logical considerations

Last month I provided a background to table expressions in T-SQL. I explained the context from relational theory and the SQL standard. I explained how a table in SQL is an attempt to represent a relation from relational theory. I also explained that a relational expression is an expression operating on one or more relations as inputs and resulting in a relation. Similarly, in SQL, a table expression is an expression operating on one or more input tables, and resulting in a table. The expression can be a query, but doesn’t have to be. For example, the expression can be a table value constructor, as I will explain later in this article. I also explained that in this series, I focus on four specific types of named table expressions that T-SQL supports: derived tables, common table expressions (CTEs), views, and inline table-valued functions (TVFs).

If you’ve been working with T-SQL for some time, you probably stumbled into quite a few cases where you either had to use table expressions, or it was somehow more convenient compared to alternative solutions that don’t use them. Here are just a few examples for use cases that come to mind:

Create a modular solution by breaking down complex tasks into steps, each represented by a different table expression.
Mixing results of grouped queries and detail, in case you decide not to use window functions for this purpose.
Logical query processing handles query clauses in the following order: FROM>WHERE>GROUP BY>HAVING>SELECT>ORDER BY. As a result, in the same level of nesting, column aliases that you define in the SELECT clause are only available to the ORDER BY clause. They are not available to the rest of the query clauses. With table expressions you can reuse aliases that you define in an inner query in any clause of the outer query, and this way avoid repetition of lengthy/complex expressions.
Window functions can appear only in a query’s SELECT and ORDER BY clauses. With table expressions, you can assign an alias to an expression based on a window function, and then use that alias in a query against the table expression.
A PIVOT operator involves three elements: grouping, spreading and aggregation. This operator identifies the grouping element implicitly by elimination. Using a table expression, you can project exactly the three elements that are supposed to be involved, and have the outer query use the table expression as the PIVOT operator’s input table, thus controlling which element is the grouping element.
Modifications with TOP do not support an ORDER BY clause. You can control which rows get chosen indirectly by defining a table expression based on a SELECT query with the TOP or OFFSET-FETCH filter and an ORDER BY clause, and apply the modification against the table expression.

This is far from being an exhaustive list. I will demonstrate some of the above use cases and others in this series. I just wanted to mention some use cases here to illustrate how important table expressions are in our T-SQL code, and why it’s worthwhile to invest in understanding their fundamentals well.

In this month’s article I focus on the logical treatment of derived tables specifically.

In my examples I’ll use a sample database called TSQLV5. You can find the script that creates and populates it here, and its ER diagram here.

Derived tables

The term derived table is used in SQL and T-SQL with more than one meaning. So first I want to make it clear which one I’m referring to in this article. I’m referring to a specific language construct that you define typically, but not only, in the FROM clause of an outer query. I’ll provide the syntax for this construct shortly.

The more general use of the term derived table in SQL is the counterpart to a derived relation from relational theory. A derived relation is a result relation that is derived from one or more input base relations, by applying relational operators from relational algebra like projection, intersection and others to those base relations. Similarly, in the general sense, a derived table in SQL is a result table that is derived from one or more base tables, by evaluating expressions against those input base tables.

As an aside, I checked how the SQL standard defines a base table and was immediately sorry I bothered.

4.15.2 Base tables

A base table is either a persistent base table or a temporary table.

A persistent base table is either a regular persistent base table or a system-versioned table.

A regular base table is either a regular persistent base table or a temporary table.”

Added here with no further comments…

In T-SQL, you can create a base table with a CREATE TABLE statement, but there are other options, e.g., SELECT INTO and DECLARE @T AS TABLE.

Here’s the standard’s definition for derived tables in the general sense:

4.15.3 Derived tables

A derived table is a table derived directly or indirectly from one or more other tables by the evaluation of an expression, such as a <joined table>, <data change delta table>, <query expression>, or <table expression>. A <query expression> can contain an optional <order by clause>. The ordering of the rows of the table specified by the <query expression> is guaranteed only for the <query expression> that immediately contains the <order by clause>.”

There are a couple of interesting things to note here about derived tables in the general sense. One has to do with the comment about ordering. I’ll get to this one later in the article. Another is that a derived table in SQL can be a valid stand-alone table expression, but doesn’t have to be. For example, the following expression represents a derived table, and is also considered a valid stand-alone table expression (you can run it):

SELECT custid, companyname
FROM Sales.Customers
WHERE country = N'USA'

Conversely, the following expression represents a derived table, but is not a valid stand-alone table expression:

T1 INNER JOIN T2
  ON T1.keycol = T2.keycol

T-SQL supports a number of table operators that yield a derived table, but are not supported as stand-alone expressions. Those are: JOIN, PIVOT, UNPIVOT and APPLY. You do need a clause for them to operate within (typically FROM, but also the MERGE statement’s USING clause), and a host query.

From here on, I’ll be using the term derived table to describe a more specific language construct and not in the general sense described above.

Syntax

A derived table can be defined as part of an outer SELECT statement in its FROM clause. It can also be defined as part of DELETE and UPDATE statements in their FROM clause, and as part of a MERGE statement in its USING clause. I’ll provide more details on the syntax when used in modification statements later in this article.

Here’s the syntax for a simplified SELECT query against a derived table:

SELECT < select list >
FROM ( < table expression > ) [ AS ] < table name >[ (< target columns >) ];

The derived table definition appears where a base table can normally appear, in the outer query’s FROM clause. It can be an input to a table operator such as JOIN, APPLY, PIVOT and UNPIVOT. When used as the right input to an APPLY operator, the < table expression > part of the derived table is allowed to have correlations to columns from an outer table (more on this in a dedicated future article in the series). Otherwise, the table expression must be self-contained.

The outer statement can have all the usual querying elements. In a SELECT statement case: WHERE, GROUP BY, HAVING, ORDER BY and as mentioned, table operators in the FROM clause.

Here’s an example for a simple query against a derived table representing USA customers:

SELECT custid, companyname
FROM ( SELECT custid, companyname
       FROM Sales.Customers
       WHERE country = N'USA' ) AS UC;

This query generates the following output:

custid  companyname
------- ---------------
32      Customer YSIQX
36      Customer LVJSO
43      Customer UISOJ
45      Customer QXPPT
48      Customer DVFMB
55      Customer KZQZT
65      Customer NYUHS
71      Customer LCOUJ
75      Customer XOJYP
77      Customer LCYBZ
78      Customer NLTYP
82      Customer EYHKM
89      Customer YBQTI

There are three main parts to identify in a statement involving a derived table definition:

The table expression (the inner query)
The derived table name, or more accurately, what in relational theory is considered a range variable
The outer statement

The table expression is supposed to represent a table, and as such, must satisfy certain requirements that a normal query doesn’t necessarily need to satisfy. I’ll provide the details shortly in the section “A table expression is a table”.

As for the target derived table name; a common assumption among T-SQL developers is that it is merely a name, or alias that you assign to the target table. Similarly, consider the following query:

SELECT custid, companyname
FROM Sales.Customers AS C
WHERE country = N'USA';

Also here, the common assumption is that AS C is just a way to rename, or alias, the table Customers for the purposes of this query, starting with the logical query processing step where the name is assigned and onwards. However, from relational theory’s standpoint, there’s a deeper meaning to what C represents. C is what’s known as a range variable. C is a derived relation variable that ranges over the tuples in the input relation variable Customers. In the above example, C ranges over the tuples in Customers and evaluates the predicate country = N'USA'. Tuples for which the predicate evaluates to true become part of the result relation C.

A table expression is a table

With the background that I provided so far, what I’m about to explain next should be of little surprise. The < table expression > part of a derived table definition is a table. That’s the case even if it’s expressed as a query. Remember the closure property of relational algebra? The same applies to the rest of the aforementioned named table expressions (CTEs, views and inline TVFs). As you already learned, SQL’s table is the counterpart to relational theory’s relation, albeit not a perfect counterpart. Thus, a table expression needs to satisfy certain requirements to ensure that the result is a table—ones that a query that is not used as a table expression doesn’t necessarily have to. Here are three specific requirements:

All of the table expression’s columns must have names
All of the table expression’s column names must be unique
The table expression’s rows have no order

Let’s break these requirements down one by one, discussing the relevance to both relational theory and SQL.

All columns must have names

Remember that a relation has a heading and a body. The heading of a relation is a set of attributes (columns in SQL). An attribute has a name and a type name, and is identified by its name. A query that is not used as a table expression doesn’t necessarily have to assign names to all target columns. Consider the following query as an example:

SELECT empid, firstname, lastname,
  CONCAT_WS(N'/', country, region, city)
FROM HR.Employees;

This query generates the following output:

empid  firstname  lastname   (No column name)
------ ---------- ---------- -----------------
1      Sara       Davis      USA/WA/Seattle
2      Don        Funk       USA/WA/Tacoma
3      Judy       Lew        USA/WA/Kirkland
4      Yael       Peled      USA/WA/Redmond
5      Sven       Mortensen  UK/London
6      Paul       Suurs      UK/London
7      Russell    King       UK/London
8      Maria      Cameron    USA/WA/Seattle
9      Patricia   Doyle      UK/London

The query output has an anonymous column resulting from the concatenation of the location attributes using the CONCAT_WS function. (By the way, this function was added in SQL Server 2017, so if you’re running the code in an earlier version, feel free to replace this computation with an alternative computation of your choosing.) This query, therefore, doesn’t return a table, not to speak of a relation. Hence, it’s not valid to use such a query as the table expression/inner query part of a derived table definition.

Try it:

SELECT *
FROM ( SELECT empid, firstname, lastname,
         CONCAT_WS(N'/', country, region, city)
       FROM HR.Employees ) AS D;

You get the following error:

Msg 8155, Level 16, State 2, Line 50
No column name was specified for column 4 of 'D'.

As an aside, notice something interesting about the error message? It complains about column 4, highlighting the difference between columns in SQL and attributes in relational theory.

The solution is, of course, to make sure that you explicitly assign names to columns that result from computations. T-SQL supports quite a few column naming techniques. I’ll mention two of them.

You can use an inline naming technique where you assign the target column name after the computation and an optional AS clause, as in < expression > [ AS ] < column name >, like so:

SELECT empid, firstname, lastname, custlocation
FROM ( SELECT empid, firstname, lastname,
         CONCAT_WS(N'/', country, region, city) AS custlocation
       FROM HR.Employees ) AS D;

This query generates the following output:

empid  firstname  lastname   custlocation
------ ---------- ---------- ----------------
1      Sara       Davis      USA/WA/Seattle
2      Don        Funk       USA/WA/Tacoma
3      Judy       Lew        USA/WA/Kirkland
4      Yael       Peled      USA/WA/Redmond
5      Sven       Mortensen  UK/London
6      Paul       Suurs      UK/London
7      Russell    King       UK/London
8      Maria      Cameron    USA/WA/Seattle
9      Patricia   Doyle      UK/London

Using this technique, it’s very easy when reviewing the code to tell which target column name is assigned to which expression. Also, you only need to name columns that don’t already have names otherwise.

You can also use a more external column naming technique where you specify the target column names in parentheses right after the derived table name, like so:

SELECT empid, firstname, lastname, custlocation
FROM ( SELECT empid, firstname, lastname,
         CONCAT_WS(N'/', country, region, city)
       FROM HR.Employees ) AS D(empid, firstname, lastname, custlocation);

With this technique though, you have to list names for all columns—including ones that already have names. The assignment of the target column names is done by position, left to right, i.e., the first target column name represents the first expression in the inner query’s SELECT list; the second target column name represents the second expression; and so on.

Note that in case of inconsistency between the inner and outer column names, say, due to a bug in the code, the scope of the inner names is the inner query—or, more precisely, the inner range variable (here implicitly HR.Employees AS Employees)—and the scope of the outer names is the outer range variable (D in our case). There’s a bit more involved in the scoping of column names that has to do with logical query processing, but that’s an item for later discussions.

The potential for bugs with the external naming syntax is best explained with an example.

Examine the output of the previous query, with the full set of employees from the HR.Employees table. Then, consider the following query, and before running it, try to figure out which employees you expect to see in the result:

SELECT empid, firstname, lastname, custlocation
FROM ( SELECT empid, firstname, lastname,
         CONCAT_WS(N'/', country, region, city)
       FROM HR.Employees
       WHERE lastname LIKE N'D%' ) AS D(empid, lastname, firstname, custlocation)
WHERE firstname LIKE N'D%';

If you expect the query to return an empty set for the given sample data, since there are no employees currently with both a last name and a first name that start with the letter D, you’re missing the bug in the code.

Now run the query, and examine the actual output:

empid  firstname  lastname  custlocation
------ ---------- --------- ---------------
1      Davis      Sara      USA/WA/Seattle
9      Doyle      Patricia  UK/London

What happened?

The inner query specifies firstname as the second column, and lastname as the third column in the SELECT list. The code that assigns the derived table’s target column names in the outer query specifies lastname second and firstname third. The code names firstname as lastname and lastname as firstname in the range variable D. Effectively, you’re just filtering employees whose last name starts with the letter D. You’re not filtering employees with both a last name and a first name that start with the letter D.

The inline aliasing syntax is not prone to such bugs. For one, you normally don’t alias a column that already has a name you’re happy with. Second, even if you do want to assign a different alias for a column that already has a name, it’s not very likely that with the syntax < column > AS < alias > you will assign the wrong alias. Think about it; how likely is it that you will write like this:

SELECT empid, firstname, lastname, custlocation
FROM ( SELECT empid AS empid, firstname AS lastname, lastname AS firstname,
         CONCAT_WS(N'/', country, region, city) AS custlocation
       FROM HR.Employees
       WHERE lastname LIKE N'D%' ) AS D
WHERE firstname LIKE N'D%';

Obviously, not very likely.

All column names must be unique

Back to the fact that the heading of a relation is a set of attributes, and given that an attribute is identified by name, attribute names must be unique for the same relation. In a given query, you can always refer to an attribute using a two-part name with the range variable name as qualifier, as in < range variable name >.< column name >. When the column name without the qualifier is unambiguous, you can omit the range variable name prefix. What’s important to remember though is what I said earlier about the scope of the column names. In code that involves a named table expression, with both an inner query (the table expression) and an outer query, the scope of the column names in the inner query is the inner range variables, and the scope of the column names in the outer query are the outer range variables. If the inner query involves multiple source tables with the same column name, you can still refer to those columns in an unambiguous way by adding the range variable name as a prefix. If you don’t assign a range variable name explicitly, you get one assigned implicitly, as if you used < table name > AS < table name >.

Consider the following standalone query as an example:

SELECT C.custid, O.custid, O.orderid
FROM Sales.Customers AS C
  LEFT OUTER JOIN Sales.Orders AS O
    ON C.custid = O.custid;

This query doesn’t fail with a duplicate column name error since one custid column is actually named C.custid and the other O.custid within the current query’s scope. This query generates the following output:

custid      custid      orderid
----------- ----------- -----------
1           1           10643
1           1           10692
1           1           10702
1           1           10835
1           1           10952
1           1           11011
2           2           10308
2           2           10625
2           2           10759
2           2           10926
...

However, try using this query as a table expression in the definition of a derived table named CO, like so:

SELECT *
FROM ( SELECT C.custid, O.custid, O.orderid
       FROM Sales.Customers AS C
         LEFT OUTER JOIN Sales.Orders AS O
           ON C.custid = O.custid ) AS CO;

As far as the outer query is concerned, you have one range variable named CO, and the scope of all column names in the outer query is that range variable. The names of all columns in a given range variable (remember, a range variable is a relation variable) must be unique. Hence, you get the following error:

Msg 8156, Level 16, State 1, Line 80
The column 'custid' was specified multiple times for 'CO'.

The fix is of course to assign different column names to the two custid columns as far as the range variable CO is concerned, like so:

SELECT *
FROM ( SELECT C.custid AS custcustid, O.custid AS ordercustid, O.orderid
       FROM Sales.Customers AS C
         LEFT OUTER JOIN Sales.Orders AS O
           ON C.custid = O.custid ) AS CO;

This query generates the following output:

custcustid  ordercustid orderid
----------- ----------- -----------
1           1           10643
1           1           10692
1           1           10702
1           1           10835
1           1           10952
1           1           11011
2           2           10308
2           2           10625
2           2           10759
2           2           10926
...

If you follow good practices, you explicitly list the column names in the outermost query’s SELECT list. Since there’s only one range variable involved, you don’t have to use the two-part name for the outer column references. If you do wish to use the two-part name, you prefix the column names with the outer range variable name CO, like so:

SELECT CO.custcustid, CO.ordercustid, CO.orderid
FROM ( SELECT C.custid AS custcustid, O.custid AS ordercustid, O.orderid
       FROM Sales.Customers AS C
         LEFT OUTER JOIN Sales.Orders AS O
           ON C.custid = O.custid ) AS CO;

No order

There’s quite a lot I have to say about named table expressions and ordering—enough for an article in its own right—so I will dedicate a future article to this topic. Still, I wanted to touch the topic briefly here since it is so important. Recall that the body of a relation is a set of tuples, and similarly, the body of a table is a set of rows. A set has no order. Still, SQL allows the outermost query to have an ORDER BY clause serving a presentation ordering meaning, as the following query demonstrates:

SELECT orderid, val
FROM Sales.OrderValues
ORDER BY val DESC;

What you need to understand, though, is that this query doesn’t return a relation as a result. Even from SQL’s perspective, the query doesn’t return a table as a result, and hence it is not considered a table expression. Consequently, it’s invalid to use such a query as the table expression part of a derived table definition.

Try running the following code:

SELECT orderid, val
FROM ( SELECT orderid, val
       FROM Sales.OrderValues
       ORDER BY val DESC ) AS D;

You get the following error:

Msg 1033, Level 15, State 1, Line 124
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.

I’ll address the unless part of the error message shortly.

If you want the outermost query to return an ordered result, you need the specify the ORDER BY clause in the outermost query, like so:

SELECT orderid, val
FROM ( SELECT orderid, val
       FROM Sales.OrderValues ) AS D
ORDER BY val DESC;

As for the unless part of the error message; T-SQL supports the proprietary TOP filter as well as the standard OFFSET-FETCH filter. Both filters rely on an ORDER BY clause in the same query scope to define for them which top rows to filter. This is unfortunately the outcome of a trap in the design of these features, which doesn’t separate presentation ordering from the filter ordering. Be that as it may, both Microsoft with its TOP filter, and the standard with its OFFSET-FETCH filter, allow specifying an ORDER BY clause in the inner query so long as it also specifies the TOP or OFFSET-FETCH filter, respectively. So, this query is valid, for example:

SELECT orderid, val
FROM ( SELECT TOP (3) orderid, val
       FROM Sales.OrderValues
       ORDER BY val DESC ) AS D;

When I ran this query on my system, it generated the following output:

orderid  val
-------- ---------
10865    16387.50
10981    15810.00
11030    12615.05

What’s important to stress though, is that the only reason that the ORDER BY clause is allowed in the inner query is to support the TOP filter. That’s the only guarantee that you get as far as ordering is concerned. Since the outer query doesn’t have an ORDER BY clause as well, you do not get a guarantee for any specific presentation ordering from this query, despite whatever is the observed behavior. That’s both the case in T-SQL, as well as in the standard. Here’s a quote from the standard addressing this part:

“The ordering of the rows of the table specified by the <query expression> is guaranteed only for the <query expression> that immediately contains the <order by clause>.”

As mentioned, there’s a lot more to say about table expressions and ordering, which I’ll do in a future article. I’ll also provide examples demonstrating how the lack of ORDER BY clause in the outer query means that you don’t get any presentation ordering guarantees.

So, a table expression, e.g., an inner query in a derived table definition, is a table. Similarly, a derived table (in the specific sense) itself is also a table. It’s not a base table, but it is nevertheless a table. The same applies to CTEs, views and inline TVFs. They are not base tables, rather derived ones (in the more general sense), but they are nevertheless tables.

Design flaws

Derived tables have two main shortcomings in their design. Both have to do with the fact that the derived table is defined in the FROM clause of the outer query.

One design flaw has to do with the fact that if you need to query a derived table from an outer query, and in turn use that query as a table expression in another derived table definition, you end up nesting those derived table queries. In computing, explicit nesting of code involving multiple levels of nesting tends to result in complex code that is hard to maintain.

Here’s a very basic example demonstrating this:

SELECT orderyear, numcusts
FROM ( SELECT orderyear, COUNT(DISTINCT custid) AS numcusts
       FROM ( SELECT YEAR(orderdate) AS orderyear, custid
              FROM Sales.Orders ) AS D1
       GROUP BY orderyear ) AS D2
WHERE numcusts > 70;

This code returns order years and the number of customers who placed orders during each year, only for years where the number of customers who placed orders was greater than 70.

The main motivation for using table expressions here is in order to be able to refer to a column alias multiple times. The innermost query used as a table expression for the derived table D1 queries the Sales.Orders table, and assigns the column name orderyear to the expression YEAR(orderdate), and also returns the custid column. The query against D1 groups the rows from D1 by orderyear, and returns orderyear as well as the distinct number of customers who placed orders during the year in question aliased as numcusts. The code defines a derived table called D2 based on this query. The outermost query than queries D2 and filters only years where the number of customers who placed orders was greater than 70.

An attempt to review this code or troubleshoot it in case of problems is tricky due to the multiple levels of nesting. Instead of reviewing the code in the more natural top-to-bottom manner, you find yourself having to analyze it starting with the innermost unit and gradually going outwards, since that’s more practical.

The whole point about using derived tables in this example was to simplify the code by avoiding the need to repeat expressions. But I’m not sure that this solution achievs this goal. In this case, you’re probably better off repeating some expressions, avoiding the need to use derived tables altogether, like so:

SELECT YEAR(orderdate) AS orderyear, COUNT(DISTINCT custid) AS numcusts
FROM Sales.Orders
GROUP BY YEAR(orderdate)
HAVING COUNT(DISTINCT custid) > 70;

Keep in mind that I’m showing a very simple example here for illustration purposes. Imagine production code with more levels of nesting, and with longer, more elaborate code, and you can see how it becomes substantially more complicated to maintain.

Another flaw in the design of derived tables has to do with cases where you need to interact with multiple instances of the same derived table. Consider the following query as an example:

SELECT CUR.orderyear, CUR.numorders,
  CUR.numorders - PRV.numorders AS diff
FROM ( SELECT YEAR(orderdate) AS orderyear, COUNT(*) AS numorders
       FROM Sales.Orders
       GROUP BY YEAR(orderdate) ) AS CUR
  LEFT OUTER JOIN
     ( SELECT YEAR(orderdate) AS orderyear, COUNT(*) AS numorders
       FROM Sales.Orders
       GROUP BY YEAR(orderdate) ) AS PRV
    ON CUR.orderyear = PRV.orderyear + 1;

This code computes the number of orders processed in each year, as well as the difference from the previous year. Ignore the fact that there are simpler ways to achieve the same task with window functions—I’m using this code to illustrate a certain point, so the task itself and the different ways to solve it are not significant.

A join is a table operator that treats its two inputs as a set—meaning that there’s no order among them. They are referred to as the left and right inputs so that you can mark one of them (or both) as a preserved table in an outer join, but still, there’s no first and second among them. You are allowed to use derived tables as join inputs, but the range variable name that you assign to the left input is not accessible in the definition of the right input. That’s because both are conceptually defined in the same logical step, as if at the same point in time. Consequently, when joining derived tables, you can’t define two range variables based on one table expression. Unfortunately, you have to repeat the code, defining two range variables based on two identical copies of the code. This of course complicates the maintainability of the code, and increases the likelihood for bugs. Every change that you make to one table expression needs to be applied to the other as well.

As I’ll explain in a future article, CTEs, in their design, do not incur these two flaws that derived tables incur.

Table value constructor

A table value constructor allows you to construct a table value based on self-contained scalar expressions. You can then use such a table in an outer query just like you use a derived table that is based on an inner query. In a future article I discuss lateral derived tables and correlations in detail, and I’ll show more sophisticated forms of table value constructors. In this article, though, I’ll focus on a simple form that is based purely on self-contained scalar expressions.

The general syntax for a query against a table value constructor is as follows:

SELECT < select list >
FROM ( < table body > ) AS < table name >(< table heading >);

The table value constructor is defined in the FROM clause of the outer query.

The table’s body is made of a VALUES clause, followed by a comma separated list of pairs of parentheses, each defining a row with a comma separated list of expressions forming the row’s values.

The table’s heading is a comma separated list of the target column names. I’ll talk about a shortcoming of this syntax regarding the table’s heading shortly.

The following code uses a table value constructor to define a table called MyCusts with three columns called custid, companyname and contractdate, and three rows:

SELECT custid, companyname, contractdate
FROM ( VALUES( 2, 'Cust 2', '20200212' ),
             ( 3, 'Cust 3', '20200118' ),
             ( 5, 'Cust 5', '20200401' ) )
       AS MyCusts(custid, companyname, contractdate);

The above code is equivalent (both logically and in performance terms) in T-SQL to the following alternative:

SELECT custid, companyname, contractdate
FROM ( SELECT 2, 'Cust 2', '20200212' UNION ALL
       SELECT 3, 'Cust 3', '20200118' UNION ALL
       SELECT 5, 'Cust 5', '20200401' )
       AS MyCusts(custid, companyname, contractdate);

The two are internally algebrized the same way. The syntax with the VALUES clause is standard whereas the syntax with the unified FROMless queries isn’t, hence I prefer the former.

There is a shortcoming in the design of table value constructors in both standard SQL and in T-SQL. Remember that the heading of a relation is made of a set of attributes, and an attribute has a name and a type name. In the table value constructor’s syntax, you specify the column names, but not their data types. Suppose that you need the custid column to be of a SMALLINT type, the companyname column of a VARCHAR(50) type, and the contractdate column of a DATE type. It would have been good if we were able to define the column types as part of the definition of the table’s heading, like so (this syntax isn’t supported):

SELECT custid, companyname, contractdate
FROM ( VALUES( 2, 'Cust 2', '20200212' ),
             ( 3, 'Cust 3', '20200118' ),
             ( 5, 'Cust 5', '20200401' ) )
       AS MyCusts(custid SMALLINT, companyname VARCHAR(50), contractdate DATE);

That’s of course just wishful thinking.

The way it works in T-SQL, is that each literal that is based on a constant has a predetermined type irrespective of context. For instance, can you guess what the types of the following literals are:

1
2147483647
2147483648
1E
'1E'
'20200212'

Is 1 considered BIT, INT, SMALLINT, other?

Is 1E considered VARBINARY(1), VARCHAR(2), other?

Is '20200212' considered DATE, DATETIME, VARCHAR(8), CHAR(8), other?

There’s a simple trick to figure out the default type of a literal, using the SQL_VARIANT_PROPERTY function with the 'BaseType' property, like so:

SELECT SQL_VARIANT_PROPERTY(2147483648, 'BaseType');

What happens is that SQL Server implicitly converts the literal to SQL_VARIANT—since that’s what the function expects—but preserves its base type. It then reports the base type as requested.

Similarly, you can query other properties of the input value, like the maximum length (MaxLength), Precision, Scale, and so on.

Try it with the aforementioned literal values, and you will get the following:

1: INT
2147483647: INT
2147483648: NUMERIC(10, 0)
1E: FLOAT
'1E': VARCHAR(2)
'20200212': VARCHAR(8)

As you can see, SQL Server has default assumptions about the data type, maximum length, precision, scale, and so on.

There are some cases where you need to specify a literal of a certain type, but you cannot do it directly in T-SQL. For example, you cannot specify a literal of the following types directly: BIT, TINYINT, BIGINT, all date and time types, and quite a few others. Unfortunately, T-SQL doesn’t provide a selector property for its types, which would have served exactly the needed purpose of selecting a value of the given type. Of course, you can always convert an expression’s type explicitly using the CAST or CONVERT function, as in CAST(5 AS SMALLINT). If you don’t, SQL Server will sometimes need to implicitly convert some of your expressions to a different type based on its implicit conversion rules. For example, when you try to compare values of different types, e.g., WHERE datecol = '20200212', assuming datecol is of a DATE type. Another example is when you specify a literal in an INSERT or an UPDATE statement, and the literal’s type is different than the target column’s type.

If all this is not confusing enough, set operators like UNION ALL rely on data type precedence to define the target column types—and remember, a table value constructor is algebrized like a series of UNION ALL operations. Consider the table value constructor shown earlier:

SELECT custid, companyname, contractdate
FROM ( VALUES( 2, 'Cust 2', '20200212' ),
             ( 3, 'Cust 3', '20200118' ),
             ( 5, 'Cust 5', '20200401' ) )
       AS MyCusts(custid, companyname, contractdate);

Each literal here has a predetermined type. 2, 3 and 5 are all of an INT type, so clearly the custid target column type is INT. If you had the values 1000000000, 3000000000 and 2000000000, the first and the third are considered INT and the second is considered NUMERIC(10, 0). According to data type precedence NUMERIC (same as DECIMAL) is stronger than INT, hence in such a case the target column type would be NUMERIC(10, 0).

If you want to figure out which data types SQL Server chooses for the target columns in your table value constructor, you have a few options. One is to use a SELECT INTO statement to write the table value constructor’s data into a temporary table, and then query the metadata for the temporary table, like so:

SELECT custid, companyname, contractdate
INTO #MyCusts
FROM ( VALUES( 2, 'Cust 2', '20200212' ),
             ( 3, 'Cust 3', '20200118' ),
             ( 5, 'Cust 5', '20200401' ) )
       AS MyCusts(custid, companyname, contractdate);

SELECT name AS colname, TYPE_NAME(system_type_id) AS typename, max_length AS maxlength
FROM tempdb.sys.columns
WHERE OBJECT_ID = OBJECT_ID(N'tempdb..#MyCusts');

Here’s the output of this code:

colname       typename   maxlength
------------- ---------- ---------
custid        int        4
companyname   varchar    6
contractdate  varchar    8

You can then drop the temporary table for cleanup:

DROP TABLE IF EXISTS #MyCusts;

Another option is to use the SQL_VARIANT_PROPERTY, which I mentioned earlier, like so:

SELECT TOP (1)
  SQL_VARIANT_PROPERTY(custid, 'BaseType')        AS custid_typename,
  SQL_VARIANT_PROPERTY(custid, 'MaxLength')       AS custid_maxlength,
  SQL_VARIANT_PROPERTY(companyname, 'BaseType')   AS companyname_typename,
  SQL_VARIANT_PROPERTY(companyname, 'MaxLength')  AS companyname_maxlength,
  SQL_VARIANT_PROPERTY(contractdate, 'BaseType')  AS contractdate_typename,
  SQL_VARIANT_PROPERTY(contractdate, 'MaxLength') AS contractdate_maxlength
FROM ( VALUES( 2, 'Cust 2', '20200212' ),
             ( 3, 'Cust 3', '20200118' ),
             ( 5, 'Cust 5', '20200401' ) )
       AS MyCusts(custid, companyname, contractdate);

This code generates the following output (formatted for readability):

custid_typename       custid_maxlength
--------------------  ---------------- 
int                   4                

companyname_typename  companyname_maxlength 
--------------------  --------------------- 
varchar               6                     

contractdate_typename contractdate_maxlength
--------------------- ----------------------
varchar               8

So, what if you need to control the types of the target columns? As mentioned earlier, say you need custid to be SMALLINT, companyname VARCHAR(50), and contractdate DATE.

Don’t be misled to think that it’s enough to explicitly convert just one row’s values. If a corresponding value’s type in any other row is considered stronger, it would dictate the target column’s type. Here’s an example demonstrating this:

SELECT custid, companyname, contractdate
INTO #MyCusts1
FROM ( VALUES( CAST(2 AS SMALLINT), CAST('Cust 2' AS VARCHAR(50)), CAST('20200212' AS DATE)),
             ( 3, 'Cust 3', '20200118' ),
             ( 5, 'Cust 5', '20200401' ) )
       AS MyCusts(custid, companyname, contractdate);

SELECT name AS colname, TYPE_NAME(system_type_id) AS typename, max_length AS maxlength
FROM tempdb.sys.columns
WHERE OBJECT_ID = OBJECT_ID(N'tempdb..#MyCusts1');

This code generates the following output:

colname       typename  maxlength
------------- --------- ---------
custid        int       4
companyname   varchar   50
contractdate  date      3

Notice that the type for custid is INT.

The same applies never mind which row’s values you explicitly convert, if you don’t convert all of them. For example, here the code explicitly converts the types of the values in the second row:

SELECT custid, companyname, contractdate
INTO #MyCusts2
FROM ( VALUES( 2, 'Cust 2', '20200212'),
             ( CAST(3 AS SMALLINT), CAST('Cust 3' AS VARCHAR(50)), CAST('20200118' AS DATE) ),
             ( 5, 'Cust 5', '20200401' ) )
       AS MyCusts(custid, companyname, contractdate);

SELECT name AS colname, TYPE_NAME(system_type_id) AS typename, max_length AS maxlength
FROM tempdb.sys.columns
WHERE OBJECT_ID = OBJECT_ID(N'tempdb..#MyCusts2');

This code generates the following output:

colname       typename  maxlength
------------- --------- ---------
custid        int       4
companyname   varchar   50
contractdate  date      3

As you can see, custid is still of an INT type.

You basically have two main options. One is to explicitly convert all values, like so:

SELECT custid, companyname, contractdate
INTO #MyCusts3
FROM ( VALUES( CAST(2 AS SMALLINT), CAST('Cust 2' AS VARCHAR(50)), CAST('20200212' AS DATE)),
             ( CAST(3 AS SMALLINT), CAST('Cust 3' AS VARCHAR(50)), CAST('20200118' AS DATE)),
             ( CAST(5 AS SMALLINT), CAST('Cust 5' AS VARCHAR(50)), CAST('20200401' AS DATE)) )
       AS MyCusts(custid, companyname, contractdate);

SELECT name AS colname, TYPE_NAME(system_type_id) AS typename, max_length AS maxlength
FROM tempdb.sys.columns
WHERE OBJECT_ID = OBJECT_ID(N'tempdb..#MyCusts3');

This code generates the following output, showing all target columns have the desired types:

colname       typename  maxlength
------------- --------- ---------
custid        smallint  2
companyname   varchar   50
contractdate  date      3

That’s a lot of coding, though. Another option is to apply the conversions in the SELECT list of the query against the table value constructor, and then define a derived table against the query that applies the conversions, like so:

SELECT custid, companyname, contractdate
INTO #MyCusts4
FROM ( SELECT
         CAST(custid AS SMALLINT) AS custid,
         CAST(companyname AS VARCHAR(50)) AS companyname,
         CAST(contractdate AS DATE) AS contractdate
       FROM ( VALUES( 2, 'Cust 2', '20200212' ),
                    ( 3, 'Cust 3', '20200118' ),
                    ( 5, 'Cust 5', '20200401' ) )
              AS D(custid, companyname, contractdate) ) AS MyCusts;

SELECT name AS colname, TYPE_NAME(system_type_id) AS typename, max_length AS maxlength
FROM tempdb.sys.columns
WHERE OBJECT_ID = OBJECT_ID(N'tempdb..#MyCusts4');

This code generates the following output:

colname       typename  maxlength
------------- --------- ---------
custid        smallint  2
companyname   varchar   50
contractdate  date      3

The reasoning for using the additional derived table is due to how logical query processing is designed. The SELECT clause is evaluated after FROM, WHERE, GROUP BY and HAVING. By applying the conversions in the SELECT list of the inner query, you allow expressions in all clauses of the outermost query to interact with the columns with the proper types.

Back to our wishful thinking, clearly, it would be good if we ever get a syntax that allows explicit control of the types in the definition of the table value constructor’s heading, like so:

SELECT custid, companyname, contractdate
FROM ( VALUES( 2, 'Cust 2', '20200212' ),
             ( 3, 'Cust 3', '20200118' ),
             ( 5, 'Cust 5', '20200401' ) )
       AS MyCusts(custid SMALLINT, companyname VARCHAR(50), contractdate DATE);

When you’re done, run the following code for cleanup:

DROP TABLE IF EXISTS #MyCusts1, #MyCusts2, #MyCusts3, #MyCusts4;

Used in modification statements

T-SQL allows you to modify data through table expressions. That’s true for derived tables, CTEs, views and inline TVFs. What gets modified in practice is some underlying base table that is used by the table expression. I have much to say about modifying data through table expressions, and I will in a future article dedicated to this topic. Here, I just wanted to briefly mention the types of modification statements that specifically support derived tables, and provide the syntax.

Derived tables can be used as the target table in DELETE and UPDATE statements, and also as the source table in the MERGE statement (in the USING clause). They cannot be used in the TRUNCATE statement, and as the target in the INSERT and MERGE statements.

For the DELETE and UPDATE statements, the syntax for defining the derived table is a bit awkward. You don’t define the derived table in the DELETE and UPDATE clauses, like you would expect, but rather in a separate FROM clause. You then specify the derived table name in the DELETE or UPDATE clause.

Here’s the general syntax of a DELETE statement against a derived table:

DELETE [ FROM ] < table name >
FROM ( < table expression > ) [ AS ] < table name >[ (< target columns >) ]
[ WHERE < filter predicate > ];

As an example (don’t actually run it), the following code deletes all US customers with a customer ID that is greater than the minimum for the same region (the region column represents the state for US customers):

DELETE FROM UC
FROM ( SELECT *, ROW_NUMBER() OVER(PARTITION BY region ORDER BY custid) AS rownum
       FROM Sales.Customers
       WHERE country = N'USA' ) AS UC
WHERE rownum > 1;

Here’s the general syntax of an UPDATE statement against a derived table:

UPDATE < table name >
SET < assignments >
FROM ( < table expression > ) [ AS ] < table name >[ (< target columns >) ]
[ WHERE < filter predicate > ];

As you can see, from the perspective of the definition of the derived table, it’s quite similar to the syntax of the DELETE statement.

As an example, the following code changes the company names of US customers to one using the format N'USA Cust ' + rownum, where rownum represents a position based on customer ID ordering:

BEGIN TRAN;

UPDATE UC
  SET companyname = newcompanyname
    OUTPUT
      inserted.custid,
      deleted.companyname AS oldcompanyname,
      inserted.companyname AS newcompanyname
FROM ( SELECT custid, companyname,
         N'USA Cust ' + CAST(ROW_NUMBER() OVER(ORDER BY custid) AS NVARCHAR(10)) AS newcompanyname 
       FROM Sales.Customers
       WHERE country = N'USA' ) AS UC;

ROLLBACK TRAN;

The code applies the update in a transaction that it then rolls back so that the change won't stick.

This code generates the following output, showing both the old and the new company names:

custid  oldcompanyname  newcompanyname
------- --------------- ----------------
32      Customer YSIQX  USA Cust 1
36      Customer LVJSO  USA Cust 2
43      Customer UISOJ  USA Cust 3
45      Customer QXPPT  USA Cust 4
48      Customer DVFMB  USA Cust 5
55      Customer KZQZT  USA Cust 6
65      Customer NYUHS  USA Cust 7
71      Customer LCOUJ  USA Cust 8
75      Customer XOJYP  USA Cust 9
77      Customer LCYBZ  USA Cust 10
78      Customer NLTYP  USA Cust 11
82      Customer EYHKM  USA Cust 12
89      Customer YBQTI  USA Cust 13

That’s it for now on the topic.

Summary

Derived tables are one of the four main types of named table expressions that T-SQL supports. In this article I focused on the logical aspects of derived tables. I described the syntax for defining them and their scope.

Remember that a table expression is a table and as such, all of its columns must have names, all column names must be unique, and the table has no order.

The design of derived tables incurs two main flaws. In order to query one derived table from another, you need to nest your code, causing it to be more complex to maintain and troubleshoot. If you need to interact with multiple occurrences of the same table expression, using derived tables you are forced to duplicate your code, which hurts the maintainability of your solution.

You can use a table value constructor to define a table based on self-contained expressions as opposed to querying some existing base tables.

You can use derived tables in modification statements like DELETE and UPDATE, though the syntax for doing so is a bit awkward.

3 thoughts on “Fundamentals of table expressions, Part 2 – Derived tables, logical considerations”

André Cardoso says:

May 22, 2020 at 9:02 AM

Thanks for what is already a great series, and looking forward to the next instalment.

One doubt that I had in last month article was that a relation is unique (tuples are unique), but in SQL it may not be (projection that returns duplicate rows). How is that handled in relational theory (or it isn't possible)?

Best regards,
André Cardoso
Itzik Ben-Gan says:

May 22, 2020 at 11:13 PM

Thanks André!

Projection is a relational operator in relational algebra and as such it results in a relation. Remember the closure property? So there's no such thing as relational projection that returns duplicate tuples. Projection always returns distinct tuples.
Charlie says:

June 8, 2020 at 2:44 AM

An interesting idea I've used before is to construct a derived table using a few SELECTs without FROM, joined together with UNION ALL, usually inside an APPLY. This allows you to pull tricks such as generating entirely different rows based on the WHERE, without repeating CASE on each column, or no rows at all. Also you can UNION a parent row with the children, by selecting the correlated parent data in the apply, then unioning with the child data.

Comments are closed.