Closest Match, Part 3 - SQLPerformance.com

In Closest Match, Part 1, I covered a T-SQL challenge that was provided to me by Karen Ly, a Jr. Fixed Income Analyst at RBC. The challenge involved two tables — T1 and T2, each with a key column (keycol), a value (val), and some other columns (represented with a column called othercols). The task was to match to each row from T1 the row from T2 where T2.val is closest to T1.val (the absolute difference is the lowest lowest), using the lowest T2.keycol value as a tiebreaker. I provided a couple of relational solutions, and discussed their performance characteristics. I also challenged you to try and find a reasonably performing solution in cases where you are not allowed/able to create supporting indexes. In Part 2, I demonstrated how this can be achieved by using iterative solutions. I got several very interesting reader solutions from Kamil Konsno, Alejandro Mesa and Radek Celuch, and those solutions are the focus of this month’s article.

Sample Data

As a reminder, I provided both small and large sets of sample data for you to work with. Small sets to check the validity of your solution and large sets to check its performance. Use the following code to create the sample database testdb and within it the sample tables T1 and T2:

-- Sample db
SET NOCOUNT ON;

IF DB_ID('testdb') IS NULL
  CREATE DATABASE testdb;
GO

USE testdb;
GO

-- Sample tables T1 and T2
DROP TABLE IF EXISTS dbo.T1, dbo.T2;

CREATE TABLE dbo.T1
(
  keycol INT NOT NULL IDENTITY
    CONSTRAINT PK_T1 PRIMARY KEY,
  val INT NOT NULL,
  othercols BINARY(100) NOT NULL
    CONSTRAINT DFT_T1_col1 DEFAULT(0xAA)
);

CREATE TABLE dbo.T2
(
  keycol INT NOT NULL IDENTITY
    CONSTRAINT PK_T2 PRIMARY KEY,
  val INT NOT NULL,
  othercols BINARY(100) NOT NULL
    CONSTRAINT DFT_T2_col1 DEFAULT(0xBB)
);

Use the following code to populate the tables with small sets of sample data:

-- Populate T1 and T2 with small sets of sample data

TRUNCATE TABLE dbo.T1;
TRUNCATE TABLE dbo.T2;

INSERT INTO dbo.T1 (val)
  VALUES(1),(1),(3),(3),(5),(8),(13),(16),(18),(20),(21);

INSERT INTO dbo.T2 (val)
  VALUES(2),(2),(7),(3),(3),(11),(11),(13),(17),(19);

I’ll use the small sets of sample data when describing the logic of the different solutions and provide intermediate result sets for the steps of the solutions.

Use the following code to create the helper function GetNums and to populate the tables with large sets of sample data:

-- Helper function to generate a sequence of integers
DROP FUNCTION IF EXISTS dbo.GetNums;
GO

CREATE OR ALTER FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) 
RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
             FROM L5)
  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum;
GO

-- Populate T1 and T2 with large sets of sample data
DECLARE
  @numrowsT1 AS INT = 1000000,
  @maxvalT1  AS INT = 10000000,
  @numrowsT2 AS INT = 1000000,
  @maxvalT2  AS INT = 10000000;

TRUNCATE TABLE dbo.T1;
TRUNCATE TABLE dbo.T2;

INSERT INTO dbo.T1 WITH(TABLOCK) (val)
  SELECT ABS(CHECKSUM(NEWID())) % @maxvalT1 + 1 AS val
  FROM dbo.GetNums(1, @numrowsT1) AS Nums;

INSERT INTO dbo.T2 WITH(TABLOCK) (val)
  SELECT ABS(CHECKSUM(NEWID())) % @maxvalT2 + 1 AS val
  FROM dbo.GetNums(1, @numrowsT2) AS Nums;

When you want to test your solution with supporting indexes, use the following code to create those indexes:

CREATE INDEX idx_val_key ON dbo.T1(val, keycol) INCLUDE(othercols);
CREATE INDEX idx_val_key ON dbo.T2(val, keycol) INCLUDE(othercols);
CREATE INDEX idx_valD_key ON dbo.T2(val DESC, keycol) INCLUDE(othercols);

When you want to test your solution without supporting indexes, use the following code to remove those indexes:

DROP INDEX IF EXISTS idx_val_key ON dbo.T1;
DROP INDEX IF EXISTS idx_val_key ON dbo.T2;
DROP INDEX IF EXISTS idx_valD_key ON dbo.T2;

Solution 1 by Kamil Kosno – Using a temporary table

Kamil Kosno submitted a few — two with his original ideas, and two variations on Alejandro’s and Radek’s solutions. Kamil’s first solution uses an indexed temporary table. Often it is the case that even if you’re not allowed to create supporting indexes on the original tables, you are allowed to work with temporary tables, and with temporary tables you can always create your own indexes.

Here’s the complete solution’s code:

DROP TABLE IF EXISTS #T2;

SELECT MIN(keycol) AS keycol, val
INTO #T2
FROM dbo.T2
GROUP BY val;

CREATE UNIQUE INDEX idx_nc_val_key ON #T2(val, keycol);

WITH bestvals AS
(
  SELECT
    keycol AS keycol1,
    val AS val1,
    othercols AS othercols1,
    CASE
      WHEN (prev + nxt IS NULL) THEN COALESCE(prev, nxt)
      WHEN ABS(val - prev) < ABS(val - nxt) THEN prev
      WHEN ABS(val - nxt) < ABS(val - prev) THEN nxt
      ELSE  (CASE WHEN nxt < prev THEN nxt ELSE prev END)
    END AS val2
  FROM dbo.T1
    CROSS APPLY ( SELECT MAX(val) AS prev
                  FROM #T2
                  WHERE val <= T1.val ) AS prev
    CROSS APPLY ( SELECT MIN(val) AS nxt
                  FROM #T2
                  WHERE T1.val <= val ) AS nxt
)
SELECT
  keycol1, val1,
  SUBSTRING(othercols1, 1, 1) AS othercols1,
  tempT2.keycol AS keycol2, val2,
  SUBSTRING(T2.othercols, 1, 1) AS othercols2
FROM bestvals AS B
  INNER JOIN #T2 AS tempT2
    ON tempT2.val = B.val2
  INNER JOIN dbo.T2
    ON T2.keycol = tempT2.keycol
    
DROP TABLE #T2;

In Step 1 of the solution you store the minimum keycol for each unique val in a temporary table called #T2, and create a supporting index on (val, keycol). Here’s the code implementing this step:

DROP TABLE IF EXISTS #T2;

SELECT MIN(keycol) AS keycol, val
INTO #T2
FROM dbo.T2
GROUP BY val;

CREATE UNIQUE INDEX idx_nc_val_key ON #T2(val, keycol);

SELECT * FROM #T2;

The temporary table is populated with the following data:

keycol      val
----------- ----------
1           2
4           3
3           7
6           11
8           13
9           17
10          19

In Step 2 of the solution you apply the following:

For each row in T1 compute prev and nxt values from #T2. Compute prev as the maximum val in #T2 that is less than or equal to val in T1. Compute nxt as the minimum val in #T2 that is greater than or equal to val in T1.

Use a CASE expression to compute val2 based on the following logic:

When prev or nxt is null, return the first nonnull of the two, or NULL if both are NULL.
When the absolute difference between val and prev is less than the absolute difference between val and nxt, return prev.
When if the absolute difference between val and nxt is less than the absolute difference between val and prv, return nxt.
Else, if nxt is less than prev, return nxt, otherwise return prev.

Here’s the code implementing this step:

SELECT
  keycol AS keycol1,
  val AS val1,
  othercols AS othercols1,
  CASE
    WHEN (prev + nxt IS NULL) THEN COALESCE(prev, nxt)
    WHEN ABS(val - prev) < ABS(val - nxt) THEN prev
    WHEN ABS(val - nxt) < ABS(val - prev) THEN nxt
    ELSE  (CASE WHEN nxt < prev THEN nxt ELSE prev END)
  END AS val2
FROM dbo.T1
  CROSS APPLY ( SELECT MAX(val) AS prev
                FROM #T2
                WHERE val <= T1.val ) AS prev
  CROSS APPLY ( SELECT MIN(val) AS nxt
                FROM #T2
                WHERE T1.val <= val ) AS nxt;

This code generates the following output:

keycol1  val1  othercols1  val2
-------- ----- ----------- -----
1        1     0xAA        2
2        1     0xAA        2
3        3     0xAA        3
4        3     0xAA        3
5        5     0xAA        3
6        8     0xAA        7
7        13    0xAA        13
8        16    0xAA        17
9        18    0xAA        17
10       20    0xAA        19
11       21    0xAA        19

In Step 3 of the solution, you define a CTE called bestvals based on the query from Step2. You then join bestvals with #T2 to get the keys, and join the result with T2 to get the rest of the data (othercols).

Here’s the code implementing this step:

SELECT
  keycol1, val1,
  SUBSTRING(othercols1, 1, 1) AS othercols1,
  tempT2.keycol AS keycol2, val2,
  SUBSTRING(T2.othercols, 1, 1) AS othercols2
FROM bestvals AS B
  INNER JOIN #T2 AS tempT2
    ON tempT2.val = B.val2
  INNER JOIN dbo.T2
    ON T2.keycol = tempT2.keycol;

The plan for solution 1 by Kamil with supporting indexes is shown in Figure 1.

Figure 1: Plan for solution 1 by Kamil with indexes

The first branch of the plan groups and aggregates the data from T2 and writes the result to the temporary table #T2. Notice that this branch uses a preordered Stream Aggregate algorithm that relies on an ordered scan of the index idx_valD_key on T2.

Here are the performance stats for this plan:

run time (sec): 5.55, CPU (sec): 16.6, logical reads: 6,687,172

The plan for solution 1 by Kamil without supporting indexes is shown in Figure 2.

Figure 2: Plan for solution 1 by Kamil without indexes

The main difference between this plan and the previous one is that the first branch of the plan, which handles the grouping and aggregation part in Step 1, this time cannot pull the data preordered from a supporting index, so it pulls it unordered from the clustered index on T2 and then uses a Hash Aggregate algorithm.

Here are the performance stats for this plan:

run time: 6.44, CPU: 19.3, logical reads: 6,688,277

Solution by Alejandro Mesa – Using last nonnull technique

Alejandro’s solution uses the last nonnull technique described here.

Here’s the complete solution’s code (provided here without returning othercols):

WITH R1 AS
(
  SELECT keycol, val, tid,
    MAX( CASE WHEN tid = 0 THEN CAST(val AS binary(4)) + CAST(keycol AS binary(4)) END )
      OVER( ORDER BY val, tid, keycol
            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) AS below_binstr,
    MIN( CASE WHEN tid = 0 THEN CAST(val AS binary(4)) + CAST(keycol AS binary(4)) END )
      OVER( ORDER BY val, tid, keycol
            ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING ) AS above_binstr
  FROM ( SELECT keycol, val, 1 AS tid
         FROM dbo.T1
  
         UNION ALL
  
         SELECT MIN(keycol) AS keycol, val, 0 AS tid
         FROM dbo.T2
         GROUP BY val ) AS T
),
R2 AS
(
  SELECT keycol, val,
    CAST(SUBSTRING(below_binstr, 1, 4) AS int) AS below_T2_val,
    CAST(SUBSTRING(below_binstr, 5, 4) AS int) AS below_T2_keycol,
    CAST(SUBSTRING(above_binstr, 1, 4) AS int) AS above_T2_val,
    CAST(SUBSTRING(above_binstr, 5, 4) AS int) AS above_T2_keycol
  FROM R1
  WHERE tid = 1
),
R3 AS
(
  SELECT keycol, val,
    COALESCE(below_T2_val, above_T2_val) AS below_T2_val,
    COALESCE(below_T2_keycol, above_T2_keycol) AS below_T2_keycol,
    COALESCE(above_T2_val, below_T2_val) AS above_T2_val,
    COALESCE(above_T2_keycol, below_T2_keycol) AS above_T2_keycol
  FROM R2
)
SELECT keycol AS keycol1, val AS val1,
  CASE
    WHEN ABS(val - below_T2_val) <= ABS(above_T2_val - val) THEN below_T2_keycol
    ELSE above_T2_keycol
  END AS keycol2,
  CASE
    WHEN ABS(val - below_T2_val) <= ABS(above_T2_val - val) THEN below_T2_val
    ELSE above_T2_val
  END AS val2
FROM R3;

In Step 1 of the solution, you unify the rows from T1 (assigning a result column called tid to 1) with the rows from T2 (assigning tid = 0), and define a derived table called T based on the result. Using the last nonnull technique, based on the order of val, tid, keycol, for each current row, you retrieve the last tid = 0 row's values concatenated into a binary column called below_binstr. Similarly, you retrieve the next tid = 0 row's values concatenated into a binary column called above_binstr.

Here’s the code implementing this step:

SELECT keycol, val, tid,
  MAX( CASE WHEN tid = 0 THEN CAST(val AS binary(4)) + CAST(keycol AS binary(4)) END )
    OVER( ORDER BY val, tid, keycol
          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) AS below_binstr,
  MIN( CASE WHEN tid = 0 THEN CAST(val AS binary(4)) + CAST(keycol AS binary(4)) END )
    OVER( ORDER BY val, tid, keycol
          ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING ) AS above_binstr
FROM ( SELECT keycol, val, 1 AS tid
       FROM dbo.T1
       UNION ALL
       SELECT MIN(keycol) AS keycol, val, 0 AS tid
       FROM dbo.T2
       GROUP BY val ) AS T;

This code generates the following output:

keycol      val         tid         below_binstr       above_binstr
----------- ----------- ----------- ------------------ ------------------
1           1           1           NULL               0x0000000200000001
2           1           1           NULL               0x0000000200000001
1           2           0           NULL               0x0000000300000004
4           3           0           0x0000000200000001 0x0000000700000003
3           3           1           0x0000000300000004 0x0000000700000003
4           3           1           0x0000000300000004 0x0000000700000003
5           5           1           0x0000000300000004 0x0000000700000003
3           7           0           0x0000000300000004 0x0000000B00000006
6           8           1           0x0000000700000003 0x0000000B00000006
6           11          0           0x0000000700000003 0x0000000D00000008
8           13          0           0x0000000B00000006 0x0000001100000009
7           13          1           0x0000000D00000008 0x0000001100000009
8           16          1           0x0000000D00000008 0x0000001100000009
9           17          0           0x0000000D00000008 0x000000130000000A
9           18          1           0x0000001100000009 0x000000130000000A
10          19          0           0x0000001100000009 NULL
10          20          1           0x000000130000000A NULL
11          21          1           0x000000130000000A NULL

In Step 2 of the solution you define a CTE called R1 based on the query from Step 1. You query R1, filter only rows where tid = 1, and extract the individual values from the concatenated binary strings.

Here’s the code implementing this step:

SELECT keycol, val,
  CAST(SUBSTRING(below_binstr, 1, 4) AS int) AS below_T2_val,
  CAST(SUBSTRING(below_binstr, 5, 4) AS int) AS below_T2_keycol,
  CAST(SUBSTRING(above_binstr, 1, 4) AS int) AS above_T2_val,
  CAST(SUBSTRING(above_binstr, 5, 4) AS int) AS above_T2_keycol
FROM R1
WHERE tid = 1;

This code generates the following output:

keycol      val         below_T2_val below_T2_keycol above_T2_val above_T2_keycol
----------- ----------- ------------ --------------- ------------ ---------------
1           1           NULL         NULL            2            1
2           1           NULL         NULL            2            1
3           3           3            4               7            3
4           3           3            4               7            3
5           5           3            4               7            3
6           8           7            3               11           6
7           13          13           8               17           9
8           16          13           8               17           9
9           18          17           9               19           10
10          20          19           10              NULL         NULL
11          21          19           10              NULL         NULL

In Step 3 of the solution you define a CTE called R2 based on the query from Step 2. You then compute the following result columns:

below_T2_val: the first nonnull between below_T2_val and above_T2_val
below_T2_keycol: the first nonnull between below_T2_keycol and above_T2_keycol
above_T2_val: the first nonnull between above_T2_val and below_T2_val
above_T2_keycol: the first nonnull between above_T2_keycol and below_T2_keycol

Here’s the code implementing this step:

SELECT keycol, val,
  COALESCE(below_T2_val, above_T2_val) AS below_T2_val,
  COALESCE(below_T2_keycol, above_T2_keycol) AS below_T2_keycol,
  COALESCE(above_T2_val, below_T2_val) AS above_T2_val,
  COALESCE(above_T2_keycol, below_T2_keycol) AS above_T2_keycol
FROM R2;

This code generates the following output:

keycol      val         below_T2_val below_T2_keycol above_T2_val above_T2_keycol
----------- ----------- ------------ --------------- ------------ ---------------
1           1           2            1               2            1
2           1           2            1               2            1
3           3           3            4               7            3
4           3           3            4               7            3
5           5           3            4               7            3
6           8           7            3               11           6
7           13          13           8               17           9
8           16          13           8               17           9
9           18          17           9               19           10
10          20          19           10              19           10
11          21          19           10              19           10

In Step 4 of the solution you define a CTE called R3 based on the query from Step 3. You return keycol aliased as T1_keycol and val aliased as T1_val. Compute the result column T2_keycol as: if the absolute difference between val and below_T2_val is less than or equal to the absolute difference between above_T2_val and val, return below_T2_keycol, otherwise above_T2_keycol. Compute the result column T2_val as: if the absolute difference between val and below_T2_val is less than or equal to the absolute difference between above_T2_val and val, return below_T2_val, otherwise above_T2_val.

Here’s the code implementing this step:

SELECT keycol AS keycol1, val AS val1,
  CASE
    WHEN ABS(val - below_T2_val) <= ABS(above_T2_val - val) THEN below_T2_keycol
    ELSE above_T2_keycol
  END AS keycol2,
  CASE
    WHEN ABS(val - below_T2_val) <= ABS(above_T2_val - val) THEN below_T2_val
    ELSE above_T2_val
  END AS val2
FROM R3;

The plan for Alejandro’s solution with supporting indexes is shown in Figure 3.

Figure 3: Plan for solution by Alejandro with indexes

Here are the performance stats for this plan:

run time: 7.8, CPU: 12.6, logical reads: 35,886

The plan for Alejandro’s solution without supporting indexes is shown in Figure 4.

Figure 4: Plan for solution by Alejandro without indexes

Here are the performance stats for this plan:

run time: 8.19, CPU: 14.8, logical reads: 37,091

Notice that in the first two branches of the plan in Figure 4 there are two sorts prior to the Merge Join (Concatenation) operator due to the lack of supporting indexes. Those sorts are not needed in the plan in Figure 3 since the data is retrieved preordered from the supporting indexes.

Kamil's variation on Alejandro's solution

In this solution, Kamil also relied on the last nonnull technique. Here’s the complete solution’s code:

WITH all_vals AS
(
  SELECT keycol, val 
  FROM dbo.T1
  UNION ALL
  SELECT DISTINCT NULL, val
  FROM dbo.T2
),
ranges AS
(
  SELECT keycol, val,  
    MAX(CASE WHEN keycol IS NULL THEN val END) OVER (ORDER BY val, keycol ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev,
    MIN(CASE WHEN keycol IS NULL THEN val END) OVER (ORDER BY val, keycol ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) AS nxt
  FROM all_vals
),
matched_vals AS
(
  SELECT keycol AS keycol1, val AS val1, 
    CASE 
      WHEN ABS(val - prev) <= ISNULL(ABS(val - nxt), prev) THEN prev
      ELSE nxt
    END AS val2
  FROM ranges
  WHERE keycol IS NOT NULL
)
SELECT
  keycol1, val1, SUBSTRING(T1.othercols, 1, 1) AS othercols1,
  val2, T2.keycol AS keycol2, SUBSTRING(T2.othercols, 1, 1) AS othercols2 
FROM matched_vals AS M
  INNER JOIN ( SELECT *, ROW_NUMBER() OVER (PARTITION BY val ORDER BY keycol) AS rn
               FROM dbo.T2 ) AS T2
    ON T2.val = M.val2 AND rn = 1
  INNER JOIN dbo.T1
    ON T1.keycol = keycol1;

In Step 1 of the solution you define CTEs called all_vals and ranges, where you use the last nonnull technique to identify for each value in T1 (where keycol is not NULL) ranges of below (prev) and above (nxt) values from T2 (where keycol is null).

Here’s the code implementing this step:

WITH all_vals AS
(
  SELECT keycol, val 
  FROM dbo.T1
  UNION ALL
  SELECT DISTINCT NULL, val
  FROM dbo.T2
),
ranges AS
(
  SELECT keycol, val,  
    MAX(CASE WHEN keycol IS NULL THEN val END) OVER (ORDER BY val, keycol ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev,
    MIN(CASE WHEN keycol IS NULL THEN val END) OVER (ORDER BY val, keycol ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) AS nxt
  FROM all_vals
)
SELECT * FROM ranges;

This code generates the following output:

keycol      val         prev        nxt
----------- ----------- ----------- -----------
1           1           NULL        2
2           1           NULL        2
NULL        2           NULL        3
NULL        3           2           7
3           3           3           7
4           3           3           7
5           5           3           7
NULL        7           3           11
6           8           7           11
NULL        11          7           13
NULL        13          11          17
7           13          13          17
8           16          13          17
NULL        17          13          19
9           18          17          19
NULL        19          17          NULL
10          20          19          NULL
11          21          19          NULL

In Step 2 of the solution you query the CTE ranges and filter only rows where keycol is not null. You return the keycol and val columns from T1 as keycol1 and val1, respectively. Then, between prev and nxt, you return the closest to val1 as val2.

Here’s the code implementing this step:

SELECT keycol AS keycol1, val AS val1, 
  CASE 
    WHEN ABS(val - prev) <= ISNULL(ABS(val - nxt), prev) THEN prev
    ELSE nxt
  END AS val2
FROM ranges
WHERE keycol IS NOT NULL;

This code generates the following output:

keycol1     val1        val2
----------- ----------- -----------
1           1           2
2           1           2
3           3           3
4           3           3
5           5           3
6           8           7
7           13          13
8           16          17
9           18          17
10          20          19
11          21          19

In Step 3 of the solution you define a CTE called matched_vals based on the query from Step 2. You also define a derived table called T2 where using partitioned row numbers you handle the tiebreaking logic for the rows from dbo.T2. You then join matched_vals, the CTE T2 and the table T1 to handle the final matching logic.

Here’s the code implementing this step:

SELECT
  keycol1, val1, SUBSTRING(T1.othercols, 1, 1) AS othercols1,
  val2, T2.keycol AS keycol2, SUBSTRING(T2.othercols, 1, 1) AS othercols2 
FROM matched_vals AS M
  INNER JOIN ( SELECT *, ROW_NUMBER() OVER (PARTITION BY val ORDER BY keycol) AS rn
               FROM dbo.T2 ) AS T2
    ON T2.val = M.val2 AND rn = 1
  INNER JOIN dbo.T1
    ON T1.keycol = keycol1;

The plan for Kamil's variation on Alejandro's solution with supporting indexes is shown in Figure 5.

Figure 5: Plan for Kamil’s variation on Alejandro’s solution with indexes

Here are the performance stats for this plan:

run time: 11.5, CPU: 18.3, logical reads: 59,218

The plan for Kamil's variation on Alejandro's solution without supporting indexes is shown in Figure 6.

Figure 6: Plan for Kamil’s variation on Alejandro’s solution without indexes

Here are the performance stats for this plan:

run time: 12.6, CPU: 23.5, logical reads: 61,432

Similar to the plans for Alejandro’s solution, also in this case the plan in Figure 5 doesn’t require explicit sorts prior to applying the Merge Join (concatenation) operator since the data is retrieved preordered from supporting indexes, and the plan in Figure 6 does require explicit sorts since there are no supporting indexes.

Solution by Radek Celuch – Using intervals

Radek came up with a simple and beautiful idea. For each distinct value in T2.val you compute an interval representing the range of values from T1.val that would match the current T2.val.

Here’s the complete solution’s code:

WITH Pre1 AS
(
  SELECT keycol, val, othercols,
    ROW_NUMBER() OVER(PARTITION BY val ORDER BY keycol) AS rn
  FROM dbo.T2
),
Pre2 AS
(
  SELECT keycol, val, othercols,
    LAG(val) OVER(ORDER BY val) AS prev,
    LEAD(val) OVER(ORDER BY val) AS next
  FROM Pre1
  WHERE rn = 1
),
T2 AS
(
  SELECT keycol, val, othercols,
    ISNULL(val - (val - prev) / 2 + (1 - (val - prev) % 2), 0) AS low,
    ISNULL(val + (next - val) / 2, 2147483647) AS high
  FROM Pre2
)
SELECT
  T1.keycol AS keycol1, T1.val AS val1, SUBSTRING(T1.othercols, 1, 1) AS othercols1,
  T2.keycol AS keycol2, T2.val AS val2, SUBSTRING(T2.othercols, 1, 1) AS othercols2
FROM dbo.T1
  INNER JOIN T2
    ON T1.val BETWEEN T2.low AND T2.high;

In Step 1 of the solution you query T2 and compute row numbers (call the column rn), partitioned by val and ordered by keycol. You define a CTE called Pre1 based on this query. You then query Pre1 and filter only the rows where rn = 1. In the same query, using LAG and LEAD, you return for each row the value of the val column from the previous row (call it prev) and from the next row (call it next).

Here’s the code implementing this step:

WITH Pre1 AS
(
  SELECT keycol, val, othercols,
    ROW_NUMBER() OVER(PARTITION BY val ORDER BY keycol) AS rn
  FROM dbo.T2
)
SELECT keycol, val, othercols,
  LAG(val) OVER(ORDER BY val) AS prev,
  LEAD(val) OVER(ORDER BY val) AS next
FROM Pre1
WHERE rn = 1;

This code generates the following output:

keycol  val  othercols  prev        next
------- ---- ---------- ----------- -----------
1       2    0xBB       NULL        3
4       3    0xBB       2           7
3       7    0xBB       3           11
6       11   0xBB       7           13
8       13   0xBB       11          17
9       17   0xBB       13          19
10      19   0xBB       17          NULL

In Step 2 of the solution you define a CTE called Pre2 based on the query from Step 1. You query Pre2 and compute for each row an interval [low, high] where val from T1 would fall in. Here are the calculations for low and high:

low = ISNULL(val - (val - prev) / 2 + (1 - (val - prev) % 2), 0)
high = ISNULL(val + (next - val) / 2, 2147483647)

The mod 2 check on calculating the low boundary is used to meet the lower T2.val requirement, and the row number filter is used to meet the lower T2.keycol requirement. The values 0 and 2147483647 are used as the extreme lower and upper limits.

Here’s the code implementing this step:

SELECT keycol, val, othercols,
  ISNULL(val - (val - prev) / 2 + (1 - (val - prev) % 2), 0) AS low,
  ISNULL(val + (next - val) / 2, 2147483647) AS high
FROM Pre2;

This code generates the following output:

keycol  val  othercols  low  high
------- ---- ---------- ---- -----------
1       2    0xBB       0    2
4       3    0xBB       3    5
3       7    0xBB       6    9
6       11   0xBB       10   12
8       13   0xBB       13   15
9       17   0xBB       16   18
10      19   0xBB       19   2147483647

In Step 3 of the solution you define a CTE called T2 based on the query from Step 2. You Then join the table T1 with the CTE T2 matching rows based on val from T1 being within the interval [low, high] in T2.

Here’s the code implementing this step:

SELECT
  T1.keycol AS keycol1, T1.val AS val1, SUBSTRING(T1.othercols, 1, 1) AS othercols1,
  T2.keycol AS keycol2, T2.val AS val2, SUBSTRING(T2.othercols, 1, 1) AS othercols2
FROM dbo.T1
  INNER JOIN T2
    ON T1.val BETWEEN T2.low AND T2.high;

The plan for Radek’s solution with supporting indexes is shown in Figure 7.

Figure 7: Plan for solution by Radek with indexes

Here are the performance stats for this plan:

run time: 10.6, CPU: 7.63, logical reads: 3,193,125

The plan for Radek’s solution without supporting indexes is shown in Figure 8.

Figure 8: Plan for solution by Radek without indexes

Here are the performance stats for this plan:

run time: 18.1, CPU: 27.4, logical reads: 5,827,360

Here the performance difference between the two plans is quite substantial. The plan in Figure 8 requires a preliminary sort before the computation of the row numbers, which the plan in Figure 7 doesn’t. More importantly, to match each interval with the respective rows from T1, the plan in Figure 8 creates an Index Spool essentially as an alternative to the missing index, whereas the plan in Figure 7 simply uses the existing index idx_val_key on T1. That’s the main reason for the big differences across all performance measures. Still, the performance without supporting indexes is reasonable.

Kamil's variation on Radek's solution

Kamil came up with a variation on Radek’s solution. Here’s the complete solution’s code:

WITH T2_collapsed AS
(
  SELECT 
    keycol AS keycol2, 
    val AS val2 ,
    ROW_NUMBER() OVER (PARTITION BY val ORDER BY keycol) AS rn
  FROM dbo.T2
),
ranges AS
(
  SELECT 
    keycol2 AS prevkey, 
    val2 AS prevval, 
    LEAD(keycol2) OVER (ORDER BY val2) AS nxtkey, 
    LEAD(val2) OVER (ORDER BY val2) AS nxtval 
  FROM T2_collapsed 
  WHERE rn = 1
),
bestmatches AS
(
  SELECT 
    T1.keycol AS keycol1, 
    T1.val AS val1, 
    SUBSTRING(T1.othercols, 1, 1) AS othercols1,
    CASE WHEN ABS(T1.val - prevval) <= ABS(T1.val - nxtval) THEN prevkey ELSE nxtkey END AS keycol2,
    CASE WHEN ABS(T1.val - prevval) <= ABS(T1.val - nxtval) THEN prevval ELSE nxtval END AS val2
  FROM ranges
    INNER JOIN dbo.T1
      ON prevval <= T1.val AND T1.val < nxtval
  UNION ALL
  SELECT 
    T1.keycol, 
    T1.val, 
    SUBSTRING(T1.othercols, 1, 1),
    CASE WHEN T1.val < valmin THEN keycolmin ELSE keycolmax END AS keycol2,
    CASE WHEN T1.val < valmin THEN valmin ELSE valmax END AS val2
  FROM dbo.T1
    CROSS APPLY(SELECT TOP(1) keycol AS keycolmin, val AS valmin
                FROM dbo.T2
                ORDER BY val, keycol) AS floorvals
    CROSS APPLY(SELECT TOP(1) keycol AS keycolmax, val AS valmax
                FROM dbo.T2
                ORDER BY val DESC, keycol) AS ceilingvals
  WHERE  T1.val < floorvals.valmin OR ceilingvals.valmax <= T1.val
)
SELECT keycol1, val1, othercols1,  keycol2, val2, 
  SUBSTRING(T2.othercols, 1, 1) AS othercols2
FROM bestmatches AS B 
  INNER JOIN dbo.T2 
    ON T2.keycol = B.keycol2;

Here’s Kamil’s own description of this solution:

This solution is close to Radek's idea of checking against low and high range, although the ranges are based purely on actual T2 values. We get each row and the lead values only for each row in collapsed T2 (first step always gets rid of duplicate keys if found, by using row number or min(keycol) grouped by val on t2). The key concepts are how to check low and high range not to get duplicates – lower range inclusive, higher range exclusive, as well as handling the outliers (if present) by creating a separate set looking at the lowest and max values in T2 (UNION ALL bit). The check against the max value is done with inclusive range to complement the case of T1 vals being equal to the top T2 vals.

The plan for Kamil’s variation on Radek’s solution with supporting indexes is shown in Figure 9.

Figure 9: Plan for Kamil’s variation on Radek’s solution with indexes

Here are the performance stats for this plan:

run time: 8.59, CPU: 8.5, logical reads: 3,206,849

The plan for Kamil’s variation on Radek’s solution without supporting indexes is shown in Figure 10.

Figure 10: Plan for Kamil’s variation on Radek’s solution without indexes

Here are the performance stats for this plan:

run time: 14, CPU: 24.5, logical reads: 5,870,596

The plan in Figure 9 is serial and most of the calculations are done based on preordered data obtained from the supporting indexes. The plan in Figure 10 is parallel, plus there are a few sorts, and even an index spool activity. The performance measures of the latter plan are substantially higher than the former plan, but the performance in both cases is reasonable.

Solution 2 by Kamil Kosno – Using ranking, offset and aggregate window functions

Kamil came up with another original approach that is based on a combination of ranking, offset and aggregate window functions. Here’s the complete solution’s code:

WITH A AS
(
  SELECT 1 AS t, keycol,val,
    DENSE_RANK() OVER(ORDER BY val) AS rnk
  FROM dbo.T1
  
  UNION ALL
  
  SELECT NULL, MIN(keycol), val, 0
  FROM dbo.T2
  GROUP BY val
),
B AS
( 
  SELECT t, keycol, val,
    CASE
      WHEN t = 1 THEN DENSE_RANK() OVER (ORDER BY val, t) - rnk
      ELSE 0
    END AS grp,
    LAG(CASE WHEN t IS NULL THEN val END) OVER(ORDER BY val, t) AS prev,
    LAG(CASE WHEN t IS NULL THEN keycol END) OVER(ORDER BY val, t) AS prevkey,
    LEAD(CASE WHEN t IS NULL THEN val END) OVER(ORDER BY val, t) AS nxt,
    LEAD(CASE WHEN t IS NULL THEN keycol END) OVER(ORDER BY val, t) AS nxtkey
  FROM A
),
C AS
(
  SELECT keycol AS keycol1, val AS val1, 
    MAX (prev) OVER(PARTITION BY grp
                    ORDER BY prev DESC
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS mxprev,
    MAX (prevkey) OVER(PARTITION BY grp 
                       ORDER BY prev DESC
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS mxprevkey,
    MAX (nxt) OVER(PARTITION BY grp 
                   ORDER BY nxt
                   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS mxnxt,
    MAX (nxtkey) OVER(PARTITION BY grp
                      ORDER BY nxt 
                      ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS mxnxtkey
  FROM B
  WHERE t = 1
),
D AS
(
  SELECT keycol1, val1,
    CASE WHEN ABS(val1 - mxprev) <= ISNULL(ABS(val1 - mxnxt), mxprev) THEN mxprevkey ELSE mxnxtkey END AS keycol2,
    CASE WHEN ABS(val1 - mxprev) <= ISNULL(ABS(val1 - mxnxt), mxprev) THEN mxprev ELSE mxnxt END AS val2
  FROM C
)
SELECT D.keycol1, D.val1, D.keycol2, D.val2,
  SUBSTRING(T1.othercols, 1, 1) AS othercols1,
  SUBSTRING(T2.othercols, 1, 1) AS othercols2
FROM D 
  INNER JOIN dbo.T1 
    ON T1.keycol = D.keycol1
  INNER JOIN dbo.T2
    ON T2.keycol = D.keycol2;

In Step 1 of the solution you unify the following result sets:

Rows from T1 with a result column called t assigned with the constant 1 and column rnk representing dense rank values ordered by val
Group rows from T2 by val and return val and min keycol for each group, with the result column t assigned with NULL and rnk with 0

Here’s the code implementing this step:

SELECT 1 AS t, keycol,val,
  DENSE_RANK() OVER(ORDER BY val) AS rnk
FROM dbo.T1
UNION ALL
SELECT NULL, MIN(keycol), val, 0
FROM dbo.T2
GROUP BY val;

This code generates the following output:

t           keycol      val         rnk
----------- ----------- ----------- --------------------
1           1           1           1
1           2           1           1
1           3           3           2
1           4           3           2
1           5           5           3
1           6           8           4
1           7           13          5
1           8           16          6
1           9           18          7
1           10          20          8
1           11          21          9
NULL        1           2           0
NULL        4           3           0
NULL        3           7           0
NULL        6           11          0
NULL        8           13          0
NULL        9           17          0
NULL        10          19          0

In Step 2 of the solution you define a CTE called A based on the query from Step 1. You query A and compute group identifiers (grp) for T1 values that fall between ranges of distinct T2 values. This is done using an islands technique where you subtract rnk (dense ranks for only T1 values) from rnk2 (dense ranks for unified T1 values and distinct T2 values).

Using the LAG and LEAD functions, you compute for each T1 row the prev/next val and keycol values from T2, if present, and NULL otherwise. These calculations return values for the first and last rows in each group, if present, but NULLs for intermediate rows in groups.

Here’s the code implementing this step:

SELECT t, keycol, val, rnk,
  DENSE_RANK() OVER (ORDER BY val) AS rnk2,
  CASE
    WHEN t = 1 THEN DENSE_RANK() OVER (ORDER BY val, t) - rnk
    ELSE 0
  END AS grp,
  LAG(CASE WHEN t IS NULL THEN val END) OVER(ORDER BY val, t) AS prev,
  LAG(CASE WHEN t IS NULL THEN keycol END) OVER(ORDER BY val, t) AS prevkey,
  LEAD(CASE WHEN t IS NULL THEN val END) OVER(ORDER BY val, t) AS nxt,
  LEAD(CASE WHEN t IS NULL THEN keycol END) OVER(ORDER BY val, t) AS nxtkey
FROM A;

This code generates the following output:

t     keycol val rnk rnk2 grp prev  prevkey nxt   nxtkey
----- ------ --- --- ---- --- ----- ------- ----- -------
1     1      1   1   1    0   NULL  NULL    NULL  NULL
1     2      1   1   1    0   NULL  NULL    2     1
NULL  1      2   0   2    0   NULL  NULL    3     4
NULL  4      3   0   3    0   2     1       NULL  NULL
1     3      3   2   3    2   3     4       NULL  NULL
1     4      3   2   3    2   NULL  NULL    NULL  NULL
1     5      5   3   4    2   NULL  NULL    7     3
NULL  3      7   0   5    0   NULL  NULL    NULL  NULL
1     6      8   4   6    3   7     3       11    6
NULL  6      11  0   7    0   NULL  NULL    13    8
NULL  8      13  0   8    0   11    6       NULL  NULL
1     7      13  5   8    5   13    8       NULL  NULL
1     8      16  6   9    5   NULL  NULL    17    9
NULL  9      17  0   10   0   NULL  NULL    NULL  NULL
1     9      18  7   11   6   17    9       19    10
NULL  10     19  0   12   0   NULL  NULL    NULL  NULL
1     10     20  8   13   7   19    10      NULL  NULL
1     11     21  9   14   7   NULL  NULL    NULL  NULL

In Step 3 you define a CTE called B based on the query from Step 2. You Query B and filter only original T1 rows (where t = 1). In each group, you return the first row's prev and prevkey values, and last row's nxt and nxtkey values. Instead of just using a window partition clause, a window frame specification is used to define the minimal frame with the desired row to reduce I/O against the spool.

Here’s the code implementing this step:

SELECT keycol AS keycol1, val AS val1, 
  MAX (prev) OVER(PARTITION BY grp
                  ORDER BY prev DESC
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS mxprev,
  MAX (prevkey) OVER(PARTITION BY grp 
                     ORDER BY prev DESC
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS mxprevkey,
  MAX (nxt) OVER(PARTITION BY grp 
                 ORDER BY nxt
                 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS mxnxt,
  MAX (nxtkey) OVER(PARTITION BY grp
                    ORDER BY nxt 
                    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS mxnxtkey
FROM B
WHERE t = 1;

This code generates the following output:

keycol1     val1        mxprev      mxprevkey   mxnxt       mxnxtkey
----------- ----------- ----------- ----------- ----------- -----------
2           1           NULL        NULL        2           1
1           1           NULL        NULL        2           1
5           5           3           4           7           3
3           3           3           4           7           3
4           3           3           4           7           3
6           8           7           3           11          6
8           16          13          8           17          9
7           13          13          8           17          9
9           18          17          9           19          10
10          20          19          10          NULL        NULL
11          21          19          10          NULL        NULL

In Step 4 you define a CTE called C based on the query from Step 3. You query C to compute keycol2 and val2 based on the closest match.

Here’s the code implementing this step:

SELECT keycol1, val1,
  CASE WHEN ABS(val1 - mxprev) <= ISNULL(ABS(val1 - mxnxt), mxprev) THEN mxprevkey ELSE mxnxtkey END AS keycol2,
  CASE WHEN ABS(val1 - mxprev) <= ISNULL(ABS(val1 - mxnxt), mxprev) THEN mxprev ELSE mxnxt END AS val2
FROM C;

This code generates the following output:

keycol1     val1        keycol2     val2
----------- ----------- ----------- -----------
2           1           1           2
1           1           1           2
5           5           4           3
3           3           4           3
4           3           4           3
6           8           3           7
8           16          9           17
7           13          8           13
9           18          9           17
10          20          10          19
11          21          10          19

In Step 5 you define a CTE called D based on the query from Step 4. You then join D with T1 and T2 to get the other needed columns.

Here’s the code implementing this step:

SELECT D.keycol1, D.val1, D.keycol2, D.val2,
  SUBSTRING(T1.othercols, 1, 1) AS othercols1,
  SUBSTRING(T2.othercols, 1, 1) AS othercols2
FROM D 
  INNER JOIN dbo.T1 
    ON T1.keycol = D.keycol1
  INNER JOIN dbo.T2
    ON T2.keycol = D.keycol2;

The plan for solution 2 by Kamil with supporting indexes is shown in Figure 11.

Figure 11: Plan for solution 2 by Kamil with indexes

Here are the performance stats for this plan:

run time: 18.1, CPU: 34.4, logical reads: 56,736

The plan for solution 2 by Kamil without supporting indexes is shown in Figure 12.

Figure 12: Plan for solution 2 by Kamil without indexes

Here are the performance stats for this plan:

run time: 20.3, CPU: 38.9, logical reads: 59,012

As you can see, the Plan in Figure 12 involves a couple of extra sorts compared to the plan in Figure 1 to order the data for the two queries in the CTE A—one to support the dense rank calculation and the other to support the grouped query. Other than that, the rest of the work is similar in both. In the grand scheme of things, there’s a bit of extra CPU work and consequentially time penalty, but still both queries perform fairly reasonably.

Conclusion

This article concludes the series on the closest match challenge. It’s great to see the different creative ideas that Kamil, Alejandro and Radek came up with. Thanks to all of you for sending your solutions!

4 thoughts on “Closest Match, Part 3”

Kamil Kosno says:

February 13, 2019 at 4:47 PM

Thanks Itzik, challenges are the best way to learn sql, as you always underline in great books and articles on the subject. Have a great day!
André Cardoso says:

March 12, 2019 at 3:13 PM

Thanks for another great post with some great ideas.
Is there a book, series of blog posts or another source that enumerates these type of solutions/ideas?

I can think of Celko's Tree books (a bit too difficult), Numbers tables, Quirky Update, Dynamic pivot, but there are so many others that I don't even know that I don't know…

Best regards!
Itzik Ben-Gan says:

March 12, 2019 at 3:55 PM

Thanks André!

I cover solutions of this type in T-SQL Querying 3rd Ed, as well as in High Performance T-SQL using Window Functions. In T-SQL Querying it's spread across the book. In High Performance T-SQL using Window Functions the last chapter is dedicated to solutions using window functions.
André Cardoso says:

March 13, 2019 at 1:56 PM

Thank you very much!

I had checked your T-SQL fundamentals & T-SQL Programming books some years ago (maybe 2005?). I'll definitely check those books and set some time aside to reading :)

Comments are closed.