Itzik Ben-Gan

Fundamentals of table expressions, Part 1

April 8, 2020 by in T-SQL Queries | No Comments
SentryOne eBooks

In these books, you will find useful, hand-picked articles that will help give insight into some of your most vexing performance problems. These articles were written by several of the SQL Server industry’s leading experts, including Paul White, Paul Randal, Jonathan Kehayias, Erin Stellato, Glenn Berry, Aaron Bertrand, and Joe Sack.

Free Download

Featured Author

Jonathan Kehayias is a Principal Consultant with SQLskills and the youngest MCM ever.

Jonathan’s Posts

This article is the first in a series about the fundamentals of table expressions in T-SQL. I will mainly focus on four types of named table expressions, which are known in T-SQL as derived tables, common table expressions (CTEs), views, and inline table-valued functions (inline TVFs).

I was inspired to write this series by my good friend, Grant Fritchey, whom I have known for many years. As Grant repeatedly points out, many who use common table expressions in T-SQL think that SQL Server persists the inner query result set, and that the reason for this belief is the use of the term table in the construct’s name. When this topic comes up in community discussions, often people argue that the use of the term table in the construct’s name is inappropriate since it’s not really a table. There are even suggestions to start a naming campaign in hope to see a future name change for this construct, at least in T-SQL. Some of the suggestions include query expression, inline view, statement-level view, and others.

Perhaps this will come as a surprise to some, but I actually do find the use of the term table in common table expression as very appropriate. In fact, I find the use of the term table expression as appropriate. To me, the best way to describe what a CTE is in T-SQL, it’s a named table expression. The same applies to what T-SQL calls derived tables (the specific language construct as opposed to the general idea), views and inline TVFs. They are all named table expressions.

If you can bear with me a bit, I’ll provide the reasoning for my view of things in this article. It occurred to me that both the naming confusion, and the confusion around whether there’s a persistency aspect to table expressions, can be cleared with a better understanding of the fundamentals of our field of relational database management systems. Those fundamentals being, relational theory, how SQL (the standard language) relates to it, and how the T-SQL dialect used in the SQL Server and Azure SQL Database implementations relates to both.

As a starting point, you want to be able to answer the following questions:

  • What does the physical data independence principle in the relational model mean?
  • What is a table in SQL and what is the counterpart in the relational model?
  • What is the closure property of relational algebra?
  • What is a table expression and what is the counterpart in the relational model?

Once you’re able to answer the above questions correctly, you will very likely find the use of the term named table expression as appropriate for the aforementioned constructs (what T-SQL calls derived tables, CTEs, views and inline TVFs).

I don’t want to sound like I have a very deep understanding of relational theory. My expertise is T-SQL. I acknowledge that there’s much more that I don’t know about relational theory than I do, and that some things that I think I know, are just not so. When I read C. J. Dates’ writings on the topic, I feel that I’m barely scratching the surface of what there is to know, and that I could, and should, strive to understand it better. I do recognize and firmly believe that a good understanding of relational theory translates directly to a better understanding of SQL and T-SQL, and to writing better, more accurate, and more robust T-SQL code. For anyone who choses data as their career, I recommend reading SQL and Relational Theory: How to Write Accurate SQL Code 3rd Edition by C. J. Date (O'Reilly 2015).

In the first part of this series I want to establish an understanding of my use of the terms table expression and named table expression, which is in accord with Date’s use of this term, and unfortunately not in accord with the SQL Standard’s use of this term. To achieve this, I will provide a bit of background from relational theory and the SQL standard. But like I said, I recommend reading Date’s book for a truly detailed coverage of this topic.

I’ll start by explaining what the physical data independence principle means. Next, I’ll explain what a table is in SQL and its counterpart in relational theory. I’ll then explain what the closure property of relational algebra means. Once you have a reasonable idea of what a table is, and what the closure property means, it becomes pretty straightforward to understand what a table expression is. My focus will then turn to the specifics in T-SQL. I have a lot to say about the fundamentals of table expressions in T-SQL—both in terms of the conceptual treatment and in terms of the implementation details, including physical representation and query tuning considerations.

I find this topic fascinating and very practical once you delve into the implementation details. In fact, I have so much to say about it that I’m not sure how many parts this series will eventually entail. What I can tell you with a great degree of confidence is that there will be multiple parts. Probably more than one and fewer than 100. In future parts I will delve into the individual types of named table expressions, modification considerations, inlining aspects, ordering aspects, correlations, and more.

In my examples I’ll use a sample database called TSQLV5. You can find the script that creates and populates this database here, and its ER diagram here.

Physical data independence

Physical data independence is a principle in relational theory that says that the physical implementation details should be hidden from, or transparent to, the user submitting the queries against the relational database management system. In the queries, users are supposed to focus on what they need using logical operations that are based on relational algebra, as opposed to how to obtain the data. They’re not supposed to worry about how the data is structured, accessed and processed. Such physical implementation details tend to differ substantially between different implementations (RDBMS products). Even with the same RDBMS, the physical implementation details sometimes change between different versions and builds. The idea behind the physical data independence principle in theory is to protect the user investment by removing the need to revise your solutions when you upgrade your RDBMS to a new version, or even when you migrate from one RDBMS to another. As you probably know well, in practice things are not that simple, but that’s a topic for a different discussion.

What is a table?

If you’ve been working with T-SQL or any other dialect of SQL for a while, you develop an intuitive understanding of what a table is. The problem is that without some background of relational theory, often the intuitive understanding is not very accurate. One typical mistake is that we intuitively tend to focus on physical implementation details. For example, when you’re thinking about what a table is, are you thinking of a table as a logical structure (a set of rows) or are you thinking of physical implementation details in the platform that you’re using (in SQL Server, pages, extents, heap versus clustered index, nonclustered indexes, and so on)? As a user writing SQL code to query a table, following the physical data independence principle, you are supposed to be thinking of the table as a logical structure, and let the RDBMS worry about the physical implementation details. So, let’s take a step back and try to figure out what a table is.

A table is SQL’s counterpart to the main structure in relational theory — a relation. To keep things simple and limit the scope of my coverage, I’m not going to go into the distinction between a relation variable and a relation value. If you follow my recommendation and read Date’s book, you will very quickly have a clear picture of such subtleties.

A relation has a heading and a body.

The heading of the relation is a set of attributes. In mathematical set theory, a set has no order and no duplicates. You’re supposed to identify an attribute by name and not by some position. Consequently, attribute names must be unique.

Can you identify what’s the counterpart to an attribute in SQL? You’ve probably guessed that it’s a column. However, SQL actually does have a notion of order to its columns based on their order of appearance in the CREATE TABLE statement. For example, here’s the CREATE TABLE statement for the Sales.Shippers table in the TSQLV5 database:

CREATE TABLE Sales.Shippers
(
  shipperid   INT          NOT NULL IDENTITY,
  companyname NVARCHAR(40) NOT NULL,
  phone       NVARCHAR(24) NOT NULL,
  CONSTRAINT  PK_Shippers  PRIMARY KEY(shipperid)
);

Query the table using the notorious SELECT *, like so:

SELECT * FROM Sales.Shippers;

When I ran this query in my system, I got the following output:

shipperid  companyname    phone
---------- -------------- ---------------
1          Shipper GVSUA  (503) 555-0137
2          Shipper ETYNR  (425) 555-0136
3          Shipper ZHISN  (415) 555-0138

SQL guarantees that the columns will be returned from left-to-right based on definition order. I’ll explain what happens with the rows shortly. SQL even allows you to refer to the ordinal position of the column from the SELECT list in the ORDER BY clause, like so (not that I’m recommending this practice, nor does Aaron Bertrand):

SELECT shipperid, companyname, phone
FROM Sales.Shippers
ORDER BY 2;

This query generates the following output:

shipperid  companyname    phone
---------- -------------- ---------------
2          Shipper ETYNR  (425) 555-0136
1          Shipper GVSUA  (503) 555-0137
3          Shipper ZHISN  (415) 555-0138

The body of a relation is a set of tuples. Again, recall that a set has no order and no duplicates. Therefore, a relation must have at least one candidate key that allows you to uniquely identify a tuple. SQL’s counterpart to a tuple is a row. However, in SQL you’re not forced to define a key in a table, and if you don’t, you can end up with duplicate rows. Even if you do have a key defined in your table, you can get duplicate rows returned from a query against the table. Here’s an example:

SELECT country FROM HR.Employees;

This query generates the following output:

country
--------
USA
USA
USA
USA
UK
UK
UK
USA
UK

This query doesn’t produce a relational result due to the possibility of duplicate rows. Whereas relational theory is based on set theory, SQL is based on multiset theory. A multiset (aka a superset or a bag) can have duplicates. SQL does give you a tool to eliminate duplicates with a DISTINCT clause, like so:

SELECT DISTINCT country FROM HR.Employees;

This query generates the following output:

country
--------
UK
USA

What SQL does maintain from relational theory in terms of the table’s body is the no-order property. Unless you add an ORDER BY clause in the query, you don’t have any assurances that the result will have any specific order among the rows. So, the body of the above query result is relational, at least in the sense that it doesn’t have duplicates and it doesn’t have guaranteed order.

Suppose that you query a table in SQL Server, and you don’t include an ORDER BY clause in the query. Do you expect SQL Server to always return the rows in some specific order as a guaranteed behavior? Many people do. Many think that you will always get the rows back based on clustered index order. That’s a good example of ignoring the physical data independence principle, and making assumptions based on intuition, and perhaps based on past observed behavior. Microsoft knows that an SQL query without an ORDER BY clause doesn’t guarantee any order among the result rows, and hence even if at the physical level the data resides in an index structure, SQL Server doesn’t have to process the data in index order. It may choose, under certain physical conditions, to do so, but it may choose not to under other physical conditions. Also recall that the physical implementation details can change between different versions and builds of the product. If you want to guarantee that the query will return the result rows in some specific order, your only way to guarantee this is to introduce an ORDER BY clause in the outermost query.

As you’ve probably gathered, the designers of SQL didn’t really see it as a priority to follow relational theory. And what I described here are just a few examples. There are many more. As mentioned earlier, my goal in this article is just to provide enough of the critical theoretical background to clear the confusion surrounding table expressions, before I start delving into the specifics in T-SQL in future articles.

What is a table expression?

Relational algebra (the algebra that defines operations on relations in relational theory) has a closure property. What it means is that an operation on relations yield a relation. A relational operator operates on one or more relations as input and yields a single relation as output. The closure property allows you to nest operations. A relational expression is an expression that operates on relations and returns a relation. A relational expression therefore can be used where relational algebra expects a relation.

If you think about it, it’s no different than operations on integers that yield an integer result. Suppose that the variable @i is an integer variable. The expression @i + 42 yields an integer and therefore can be used where an integer is expected, as in (@i + 42) * 2.

Given that a table in SQL is the counterpart of a relation in relational theory, albeit not a very successful one at that, a table expression in SQL is the counterpart of a relational expression. As mentioned earlier, I use the term table expression following C. J. Dates’ use of this term. The SQL standard has a slew of confusing terms, some of which I’m afraid are not very appropriate. For example, the SQL Standard uses the term table expression to describe specifically an expression based on the query clauses starting with a mandatory FROM clause, and including optionally the clauses WHERE, GROUP BY, HAVING and WINDOW (the last isn’t supported in T-SQL), and excluding the SELECT clause. Here’s the standard’s specification:

7.4 <table expression>
 

Function
Specify a table or a grouped table.
 

Format
   <table expression> ::=
     <from clause>
       [ <where clause> ]
       [ <group by clause> ]
       [ <having clause> ]
       [ <window clause> ]

It is true that the outcome of what the standard calls a table expression is considered a table, but you cannot use such an expression as a stand-alone query. Date’s version of the term table expression is actually closer to what the SQL standard calls query expression. Here’s the standard’s specification for what it calls query expression:

7.17 <query expression>
 

Function
Specify a table.
 

Format
  <query expression> ::=
    [ <with clause> ] <query expression body>
      [ <order by clause> ] [ <result offset clause> ] [ <fetch first clause> ]
  <with clause> ::=
    WITH [ RECURSIVE ] <with list>
  <with list> ::=
    <with list element> [ { <comma> <with list element> }… ]
  <with list element> ::=
    <query name> [ <left paren> <with column list> <right paren> ]
      AS <table subquery> [ <search or cycle clause> ]
  <with column list> ::=
    <column name list>
  <query expression body> ::=
      <query term>
    | <query expression body> UNION [ ALL | DISTINCT ]
        [ <corresponding spec> ] <query term>
    | <query expression body> EXCEPT [ ALL | DISTINCT ]
        [ <corresponding spec> ] <query term>
  <query term> ::=
      <query primary>
    | <query term> INTERSECT [ ALL | DISTINCT ]
        [ <corresponding spec> ] <query primary>
  <query primary> ::=
      <simple table>
    | <left paren> <query expression body>
        [ <order by clause> ] [ <result offset clause> ] [ <fetch first clause> ]
        <right paren>
  <simple table> ::=
      <query specification>
    | <table value constructor>
    | <explicit table>
  <explicit table> ::=
    TABLE <table or query name>
  <corresponding spec> ::=
    CORRESPONDING [ BY <left paren> <corresponding column list> <right paren> ]
  <corresponding column list> ::=
    <column name list>
  <order by clause> ::=
    ORDER BY <sort specification list>
  <result offset clause> ::=
    OFFSET <offset row count> { ROW | ROWS }
  <fetch first clause> ::=
    FETCH { FIRST | NEXT } [ <fetch first quantity> ] { ROW | ROWS } { ONLY | WITH TIES }
  <fetch first quantity> ::=
      <fetch first row count>
    | <fetch first percentage>
  <offset row count> ::=
    <simple value specification>
  <fetch first row count> ::=
    <simple value specification>
  <fetch first percentage> ::=
    <simple value specification> PERCENT

7.3 <table value constructor>
 

Function
Specify a set of <row value expression>s to be constructed into a table.
 

Format
  <table value constructor> ::=
    VALUES <row value expression list>
  <row value expression list> ::=
    <table row value expression> [ { <comma> <table row value expression> }… ]
  <contextually typed table value constructor> ::=
    VALUES <contextually typed row value expression list>
  <contextually typed row value expression list> ::=
    <contextually typed row value expression>
      [ { <comma> <contextually typed row value expression> }… ]

Observe that this specification includes what T-SQL calls common table expression, even though the standard doesn’t really use this term, rather just calls it with list element. Also observe that the so-called query expression doesn’t have to be based on a query, rather could be based on what’s called a table value constructor (the use of a VALUES clause to construct a set of rows). Lastly, even though the standard’s query expression is based on an expression, it returns a table, and can be used where a table is normally expected. For these reasons, I find Date’s use of the term table expression much more appropriate.

Conclusion

I can see why some might find the dwelling on naming and terminology as a bit pedantic and perhaps even a waste of time. I feel very differently, though. I believe that in any field, an aspiration to use proper names and terminology forces you to study the foundations well, and reflects on your knowledge. In hope that in this article I didn’t manage to alienate you enough to not want to proceed to the upcoming parts in the series, starting with next month’s article, I’m going to turn my focus to the way the different types of named table expressions are handled using T-SQL in SQL Server and Azure SQL Database.