Itzik Ben-Gan

Closest Match, Part 1


Karen Ly, a Jr. Fixed Income Analyst at RBC, gave me a T-SQL challenge that involves finding the closest match, as opposed to finding an exact match. In this article I cover a generalized, simplified form of the challenge.

The challenge

The challenge involves matching rows from two tables, T1 and T2. Use the following code to create a sample database called testdb, and within it the tables T1 and T2:

  SET NOCOUNT ON;

  IF DB_ID('testdb') IS NULL
    CREATE DATABASE testdb;
  GO

  USE testdb;

  DROP TABLE IF EXISTS dbo.T1, dbo.T2;

  CREATE TABLE dbo.T1
  (
    keycol INT NOT NULL IDENTITY
      CONSTRAINT PK_T1 PRIMARY KEY,
    val INT NOT NULL,
    othercols BINARY(100) NOT NULL
      CONSTRAINT DFT_T1_col1 DEFAULT(0xAA)
  );

  CREATE TABLE dbo.T2
  (
    keycol INT NOT NULL IDENTITY
      CONSTRAINT PK_T2 PRIMARY KEY,
    val INT NOT NULL,
    othercols BINARY(100) NOT NULL
      CONSTRAINT DFT_T2_col1 DEFAULT(0xBB)
  );

As you can see, both T1 and T2 have a numeric column (INT type in this example) called val. The challenge is to match to each row from T1 the row from T2 where the absolute difference between T2.val and T1.val is the lowest. In case of ties (multiple matching rows in T2), match the top row based on val ascending, keycol ascending order. That is, the row with the lowest value in the val column, and if you still have ties, the row with the lowest keycol value. The tiebreaker is used to guarantee determinism.

Use the following code to populate T1 and T2 with small sets of sample data in order to be able to check the correctness of your solutions:

  TRUNCATE TABLE dbo.T1;
  TRUNCATE TABLE dbo.T2;

  INSERT INTO dbo.T1 (val)
    VALUES(1),(1),(3),(3),(5),(8),(13),(16),(18),(20),(21);

  INSERT INTO dbo.T2 (val)
    VALUES(2),(2),(7),(3),(3),(11),(11),(13),(17),(19);

Check the contents of T1:

  SELECT keycol, val, SUBSTRING(othercols, 1, 1) AS othercols
  FROM dbo.T1
  ORDER BY val, keycol;

This code generates the following output:

  keycol      val         othercols
  ----------- ----------- ---------
  1           1           0xAA
  2           1           0xAA
  3           3           0xAA
  4           3           0xAA
  5           5           0xAA
  6           8           0xAA
  7           13          0xAA
  8           16          0xAA
  9           18          0xAA
  10          20          0xAA
  11          21          0xAA

Check the contents of T2:

  SELECT keycol, val, SUBSTRING(othercols, 1, 1) AS othercols
  FROM dbo.T2
  ORDER BY val, keycol;

This code generates the following output:

  keycol      val         othercols
  ----------- ----------- ---------
  1           2           0xBB
  2           2           0xBB
  4           3           0xBB
  5           3           0xBB
  3           7           0xBB
  6           11          0xBB
  7           11          0xBB
  8           13          0xBB
  9           17          0xBB
  10          19          0xBB

Here’s the desired result for the given sample data:

  keycol1     val1        othercols1 keycol2     val2        othercols2
  ----------- ----------- ---------- ----------- ----------- ----------
  1           1           0xAA       1           2           0xBB
  2           1           0xAA       1           2           0xBB
  3           3           0xAA       4           3           0xBB
  4           3           0xAA       4           3           0xBB
  5           5           0xAA       4           3           0xBB
  6           8           0xAA       3           7           0xBB
  7           13          0xAA       8           13          0xBB
  8           16          0xAA       9           17          0xBB
  9           18          0xAA       9           17          0xBB
  10          20          0xAA       10          19          0xBB
  11          21          0xAA       10          19          0xBB

Before you start working on the challenge, examine the desired result and make sure you understand the matching logic, including the tiebreaking logic. For instance, consider the row in T1 where keycol is 5 and val is 5. This row has multiple closest matches in T2:

  keycol      val         othercols
  ----------- ----------- ---------
  4           3           0xBB
  5           3           0xBB
  3           7           0xBB

In all these rows the absolute difference between T2.val and T1.val (5) is 2. Using the tiebreaking logic based on the order val ascending, keycol ascending the topmost row here is the one where val is 3 and keycol is 4.
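Before moving on to T-SQL solutions, it can help to see the matching and tiebreaking rule spelled out procedurally. The following is a minimal Python sketch (not part of the article's solutions, just a model of the rule) that, for each T1 row, picks the T2 row ordered by absolute difference, then val, then keycol:

```python
# Sample data as (keycol, val) pairs, mirroring T1 and T2 above
t1 = [(1, 1), (2, 1), (3, 3), (4, 3), (5, 5), (6, 8), (7, 13),
      (8, 16), (9, 18), (10, 20), (11, 21)]
t2 = [(1, 2), (2, 2), (3, 7), (4, 3), (5, 3), (6, 11), (7, 11),
      (8, 13), (9, 17), (10, 19)]

def closest(t1row, t2rows):
    # Order by absolute difference, then val, then keycol; take the top row
    k1, v1 = t1row
    return min(t2rows, key=lambda r: (abs(r[1] - v1), r[1], r[0]))

# Map each T1 keycol to its matching T2 (keycol, val) pair
matches = {k1: closest((k1, v1), t2) for k1, v1 in t1}
```

For the T1 row with keycol 5 and val 5, this model returns the T2 row with keycol 4 and val 3, agreeing with the desired result above.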

You will use the small sets of sample data to check the validity of your solutions. To check performance, you need larger sets. Use the following code to create a helper function called GetNums, which generates a sequence of integers in a requested range:

  DROP FUNCTION IF EXISTS dbo.GetNums;
  GO

  CREATE OR ALTER FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
  AS
  RETURN
    WITH
      L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
      L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
      L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
      L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
      L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
      L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
      Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
               FROM L5)
    SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
    FROM Nums
    ORDER BY rownum;
  GO
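The function builds its sequence with cascaded cross joins: L0 has two rows, and each subsequent level cross-joins the previous level with itself, squaring the row count. A quick Python sketch of that growth (just arithmetic, not part of the function):

```python
# Each level cross-joins the previous level with itself, squaring the count
rows = 2  # L0: two rows produced by the UNION ALL
levels = {}
for level in range(1, 6):
    rows = rows * rows
    levels[f"L{level}"] = rows
```

L5 therefore yields 4,294,967,296 rows, more than enough for the TOP filter to satisfy any practical range.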

Use the following code to populate T1 and T2 with large sets of sample data:

  DECLARE
    @numrowsT1 AS INT = 1000000,
    @maxvalT1  AS INT = 10000000,
    @numrowsT2 AS INT = 1000000,
    @maxvalT2  AS INT = 10000000;
    
  TRUNCATE TABLE dbo.T1;
  TRUNCATE TABLE dbo.T2;

  INSERT INTO dbo.T1 WITH(TABLOCK) (val)
    SELECT ABS(CHECKSUM(NEWID())) % @maxvalT1 + 1 AS val
    FROM dbo.GetNums(1, @numrowsT1) AS Nums;
    
  INSERT INTO dbo.T2 WITH(TABLOCK) (val)
    SELECT ABS(CHECKSUM(NEWID())) % @maxvalT2 + 1 AS val
    FROM dbo.GetNums(1, @numrowsT2) AS Nums;

The variables @numrowsT1 and @numrowsT2 control the numbers of rows you want the tables to be populated with. The variables @maxvalT1 and @maxvalT2 control the range of distinct values in the val column, and therefore affect the density of the column. The above code fills the tables with 1,000,000 rows each, and uses the range of 1 – 10,000,000 for the val column in both tables. This results in low density in the column (large number of distinct values, with a small number of duplicates). Using lower maximum values will result in higher density (smaller number of distinct values, and hence more duplicates).

Solution 1, using one TOP subquery

The simplest and most obvious solution is probably one that queries T1, and using the CROSS APPLY operator applies a query with a TOP (1) filter, ordering the rows by the absolute difference between T1.val and T2.val, using T2.val, T2.keycol as the tiebreaker. Here’s the solution’s code:

  SELECT
     T1.keycol AS keycol1, T1.val AS val1, SUBSTRING(T1.othercols, 1, 1) AS othercols1,
     A.keycol AS keycol2, A.val AS val2, SUBSTRING(A.othercols, 1, 1) AS othercols2
  FROM dbo.T1
    CROSS APPLY
      ( SELECT TOP (1) T2.*
        FROM dbo.T2
        ORDER BY ABS(T2.val - T1.val), T2.val, T2.keycol ) AS A;

Remember, there are 1,000,000 rows in each of the tables. And the density of the val column is low in both tables. Unfortunately, since there’s no filtering predicate in the applied query, and the ordering involves an expression that manipulates columns, there’s no potential here to create supporting indexes. This query generates the plan in Figure 1.

Figure 1: Plan for Solution 1

The outer input of the loop scans 1,000,000 rows from T1. The inner input of the loop performs a full scan of T2 and a TopN sort for each distinct T1.val value. In our plan this happens 998,657 times since we have very low density. The plan places the rows in an index spool, keyed by T1.val, so that it can reuse them for duplicate occurrences of T1.val values, but we have very few duplicates. This plan has quadratic complexity: across all expected executions of the inner branch of the loop, it's expected to process close to a trillion rows. When talking about large numbers of rows for a query to process, once you start mentioning billions of rows, people already know you're dealing with an expensive query. Normally, people don't utter terms like trillions, unless they're discussing the US national debt or stock market crashes; you usually don't deal with such numbers in a query processing context. But plans with quadratic complexity can quickly end up with such numbers. Running the query in SSMS with Include Live Query Statistics enabled took 39.6 seconds to process just 100 rows from T1 on my laptop, which means it should take this query about 4.5 days to complete. The question is, are you really into binge-watching live query plans? Could be an interesting Guinness record to try and set.
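The back-of-the-envelope numbers behind these claims are easy to reproduce. A small sketch, using the execution count reported by the plan and the observed live-query-stats timing:

```python
executions = 998_657         # inner-branch executions reported in the plan
rows_per_scan = 1_000_000    # full scan of T2 per execution
total_rows = executions * rows_per_scan   # close to a trillion rows

seconds_per_100 = 39.6       # observed time to process 100 T1 rows
est_seconds = seconds_per_100 / 100 * 1_000_000  # extrapolate to all of T1
est_days = est_seconds / 86_400                  # roughly 4.5 days
```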

Solution 2, using two TOP subqueries

Assuming you’re after a solution that takes seconds, not days, to complete, here’s an idea. In the applied table expression, unify two inner TOP (1) queries: one filtering a row where T2.val < T1.val (ordered by T2.val DESC, T2.keycol), and another filtering a row where T2.val >= T1.val (ordered by T2.val, T2.keycol). It’s easy to create indexes that support such queries with a single-row seek. Still within the applied table expression, use an outer TOP (1) query against the two-row result, based on the order of ABS(D.val - T1.val), D.val, D.keycol. There will be a TopN sort involved, but it will be applied to only two rows at a time.
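The idea can be prototyped outside of SQL with binary search. In the hypothetical sketch below, the sorted list stands in for the ascending index on T2(val, keycol): the successor seek is a direct lookup, and the predecessor seek (the job of the descending index) walks back to the lowest keycol among ties, mirroring the T2.val DESC, T2.keycol order:

```python
import bisect

# T2 rows as (val, keycol), sorted the way the ascending index is keyed
t2 = sorted([(2, 1), (2, 2), (7, 3), (3, 4), (3, 5), (11, 6), (11, 7),
             (13, 8), (17, 9), (19, 10)])
vals = [v for v, _ in t2]

def closest(v1):
    i = bisect.bisect_left(vals, v1)  # first row with val >= v1
    candidates = []
    if i < len(t2):
        candidates.append(t2[i])      # seek: TOP (1) where val >= v1
    if i > 0:
        below_val = t2[i - 1][0]      # highest val below v1
        j = bisect.bisect_left(vals, below_val)
        candidates.append(t2[j])      # lowest keycol among ties on that val
    # The outer TOP (1): order by absolute difference, then val, then keycol
    return min(candidates, key=lambda r: (abs(r[0] - v1), r[0], r[1]))
```

Two seeks plus a two-row sort per T1 row, instead of a million-row scan and sort, is what turns the quadratic plan into a linear one.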

Here are the recommended indexes to support this solution:

  CREATE INDEX idx_val_key ON dbo.T1(val, keycol) INCLUDE(othercols);
  CREATE INDEX idx_val_key ON dbo.T2(val, keycol) INCLUDE(othercols);
  CREATE INDEX idx_valD_key ON dbo.T2(val DESC, keycol) INCLUDE(othercols);

Here’s the complete solution code:

  SELECT
     T1.keycol AS keycol1, T1.val AS val1, SUBSTRING(T1.othercols, 1, 1) AS othercols1,
     A.keycol AS keycol2, A.val AS val2, SUBSTRING(A.othercols, 1, 1) AS othercols2
  FROM dbo.T1
    CROSS APPLY
      ( SELECT TOP (1) D.*
        FROM ( SELECT TOP (1) *
               FROM dbo.T2
               WHERE T2.val < T1.val
               ORDER BY T2.val DESC, T2.keycol

               UNION ALL

               SELECT TOP (1) *
               FROM dbo.T2
               WHERE T2.val >= T1.val
               ORDER BY T2.val, T2.keycol ) AS D
        ORDER BY ABS(D.val - T1.val), D.val, D.keycol ) AS A;

Remember that we have 1,000,000 rows in each table, with the val column having values in the range 1 – 10,000,000 (low density), and optimal indexes in place.

The plan for this query is shown in Figure 2.

Figure 2: Plan for Solution 2

Observe the optimal use of the indexes on T2, resulting in a plan with linear scaling. This plan uses an index spool in the same manner the previous plan did, i.e., to avoid repeating the work in the inner branch of the loop for duplicate T1.val values. Here are the performance statistics that I got for the execution of this query on my system:

  Elapsed: 27.7 sec, CPU: 27.6 sec, logical reads: 17,304,674

Solution 2, with spooling disabled

You can’t help but wonder whether the index spool is really beneficial here. The point is to avoid repeating work for duplicate T1.val values, but in a situation like ours where we have very low density, there are very few duplicates. This means that most likely the work involved in spooling is greater than just repeating the work for duplicates. There’s a simple way to verify this—using trace flag 8690 you can disable spooling in the plan, like so:

  SELECT
     T1.keycol AS keycol1, T1.val AS val1, SUBSTRING(T1.othercols, 1, 1) AS othercols1,
     A.keycol AS keycol2, A.val AS val2, SUBSTRING(A.othercols, 1, 1) AS othercols2
  FROM dbo.T1
    CROSS APPLY
      ( SELECT TOP (1) D.*
        FROM ( SELECT TOP (1) *
               FROM dbo.T2
               WHERE T2.val < T1.val
               ORDER BY T2.val DESC, T2.keycol
               
               UNION ALL
               
               SELECT TOP (1) *
               FROM dbo.T2
               WHERE T2.val >= T1.val
               ORDER BY T2.val, T2.keycol ) AS D
        ORDER BY ABS(D.val - T1.val), D.val, D.keycol ) AS A
  OPTION(QUERYTRACEON 8690);

I got the plan shown in Figure 3 for this execution:

Figure 3: Plan for Solution 2, with spooling disabled

Observe that the index spool disappeared, and this time the inner branch of the loop gets executed 1,000,000 times. Here are the performance statistics that I got for this execution:

  Elapsed: 19.18 sec, CPU: 19.17 sec, logical reads: 6,042,148

That’s a reduction of about 31 percent in elapsed time, and about 65 percent in logical reads.

Solution 2, with modified value range and indexing

Disabling spooling makes a lot of sense when you have low density in the T1.val values; however, the spooling can be very beneficial when you have high density. For example, run the following code to repopulate T1 and T2 with sample data setting @maxvalT1 and @maxvalT2 to 1000 (1,000 maximum distinct values):

  DECLARE
    @numrowsT1 AS INT = 1000000,
    @maxvalT1  AS INT = 1000,
    @numrowsT2 AS INT = 1000000,
    @maxvalT2  AS INT = 1000;

  TRUNCATE TABLE dbo.T1;
  TRUNCATE TABLE dbo.T2;

  INSERT INTO dbo.T1 WITH(TABLOCK) (val)
    SELECT ABS(CHECKSUM(NEWID())) % @maxvalT1 + 1 AS val
    FROM dbo.GetNums(1, @numrowsT1) AS Nums;

  INSERT INTO dbo.T2 WITH(TABLOCK) (val)
    SELECT ABS(CHECKSUM(NEWID())) % @maxvalT2 + 1 AS val
    FROM dbo.GetNums(1, @numrowsT2) AS Nums;

Rerun Solution 2, first without the trace flag:

  SELECT
     T1.keycol AS keycol1, T1.val AS val1, SUBSTRING(T1.othercols, 1, 1) AS othercols1,
     A.keycol AS keycol2, A.val AS val2, SUBSTRING(A.othercols, 1, 1) AS othercols2
  FROM dbo.T1
    CROSS APPLY
      ( SELECT TOP (1) D.*
        FROM ( SELECT TOP (1) *
               FROM dbo.T2
               WHERE T2.val < T1.val
               ORDER BY T2.val DESC, T2.keycol
            
               UNION ALL
               
               SELECT TOP (1) *
               FROM dbo.T2
               WHERE T2.val >= T1.val
               ORDER BY T2.val, T2.keycol ) AS D
        ORDER BY ABS(D.val - T1.val), D.val, D.keycol ) AS A;

The plan for this execution is shown in Figure 4:

Figure 4: Plan for Solution 2, with value range 1 – 1000

The optimizer decided to use a different plan due to the high density in T1.val. Observe that the index on T1 is scanned in key order. So, for each first occurrence of a distinct T1.val value the inner branch of the loop is executed, and the result is spooled in a regular table spool (rebind). Then, for each nonfirst occurrence of the value, a rewind is applied. With 1,000 distinct values, the inner branch is executed only 1,000 times. This results in excellent performance statistics:

  Elapsed: 1.16 sec, CPU: 1.14 sec, logical reads: 23,278
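The rebind/rewind mechanics behave much like memoization on T1.val. Here's a simplified Python sketch (a hypothetical stand-in for the inner branch; the actual spool reuses a result only while the correlated value repeats, which is why the plan scans T1 in val order so that duplicates are adjacent):

```python
inner_executions = 0
cache = {}

def apply_inner(val):
    # Stands in for the inner branch of the loop (the two seeks + TOP)
    global inner_executions
    if val not in cache:           # rebind: first occurrence of this T1.val
        inner_executions += 1
        cache[val] = f"match-for-{val}"
    return cache[val]              # rewind: reuse the spooled result

# 1,000,000 T1 rows drawn from only 1,000 distinct values
for i in range(1_000_000):
    apply_inner(i % 1000)
```

With 1,000 distinct values, the expensive work runs 1,000 times rather than 1,000,000 times, which is exactly the effect the table spool delivers in Figure 4.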

Now try running the solution with spooling disabled:

  SELECT
     T1.keycol AS keycol1, T1.val AS val1, SUBSTRING(T1.othercols, 1, 1) AS othercols1,
     A.keycol AS keycol2, A.val AS val2, SUBSTRING(A.othercols, 1, 1) AS othercols2
  FROM dbo.T1
    CROSS APPLY
      ( SELECT TOP (1) D.*
        FROM ( SELECT TOP (1) *
               FROM dbo.T2
               WHERE T2.val < T1.val
               ORDER BY T2.val DESC, T2.keycol
               
               UNION ALL
               
               SELECT TOP (1) *
               FROM dbo.T2
               WHERE T2.val >= T1.val
               ORDER BY T2.val, T2.keycol ) AS D
        ORDER BY ABS(D.val - T1.val), D.val, D.keycol ) AS A
  OPTION(QUERYTRACEON 8690);

I got the plan shown in Figure 5.

Figure 5: Plan for Solution 2, with value range 1 – 1,000 and spooling disabled

It’s essentially the same plan shown earlier in Figure 3. The inner branch of the loop gets executed 1,000,000 times. Here are the performance statistics that I got for this execution:

  Elapsed: 24.5 sec, CPU: 24.2 sec, logical reads: 8,012,548

Clearly, you want to be careful not to disable spooling when you have high density in T1.val.

Life is good when your situation is simple enough that you can create supporting indexes. In practice, though, a query sometimes involves enough additional logic to preclude optimal supporting indexes. In such cases, Solution 2 will not do well.

To demonstrate the performance of Solution 2 without supporting indexes, repopulate T1 and T2 with sample data, setting @maxvalT1 and @maxvalT2 to 10000000 (value range 1 – 10M), and also remove the supporting indexes:

  DROP INDEX IF EXISTS idx_val_key ON dbo.T1;
  DROP INDEX IF EXISTS idx_val_key ON dbo.T2;
  DROP INDEX IF EXISTS idx_valD_key ON dbo.T2;

  DECLARE
    @numrowsT1 AS INT = 1000000,
    @maxvalT1  AS INT = 10000000,
    @numrowsT2 AS INT = 1000000,
    @maxvalT2  AS INT = 10000000;

  TRUNCATE TABLE dbo.T1;
  TRUNCATE TABLE dbo.T2;

  INSERT INTO dbo.T1 WITH(TABLOCK) (val)
    SELECT ABS(CHECKSUM(NEWID())) % @maxvalT1 + 1 AS val
    FROM dbo.GetNums(1, @numrowsT1) AS Nums;
    
  INSERT INTO dbo.T2 WITH(TABLOCK) (val)
    SELECT ABS(CHECKSUM(NEWID())) % @maxvalT2 + 1 AS val
    FROM dbo.GetNums(1, @numrowsT2) AS Nums;

Rerun Solution 2, with Include Live Query Statistics enabled in SSMS:

  SELECT
     T1.keycol AS keycol1, T1.val AS val1, SUBSTRING(T1.othercols, 1, 1) AS othercols1,
     A.keycol AS keycol2, A.val AS val2, SUBSTRING(A.othercols, 1, 1) AS othercols2
  FROM dbo.T1
    CROSS APPLY
      ( SELECT TOP (1) D.*
        FROM ( SELECT TOP (1) *
               FROM dbo.T2
               WHERE T2.val < T1.val
               ORDER BY T2.val DESC, T2.keycol
               
               UNION ALL
               
               SELECT TOP (1) *
               FROM dbo.T2
               WHERE T2.val >= T1.val
               ORDER BY T2.val, T2.keycol ) AS D
        ORDER BY ABS(D.val - T1.val), D.val, D.keycol ) AS A;

I got the plan shown in Figure 6 for this query:

Figure 6: Plan for Solution 2, without indexing, with value range 1 – 10,000,000

You can see a pattern very similar to the one shown earlier in Figure 1, only this time the plan scans T2 twice per distinct T1.val value. Again, the plan complexity goes quadratic. Running the query in SSMS with Include Live Query Statistics enabled took 49.6 seconds to process 100 rows from T1 on my laptop, meaning that it should take this query about 5.7 days to complete. This could of course mean good news if you’re trying to break the Guinness world record for binge-watching a live query plan.

Conclusion

I’d like to thank Karen Ly from RBC for presenting me with this nice closest match challenge. I was quite impressed with her code for handling it, which included a lot of extra logic that was specific to her system. In this article I showed reasonably performing solutions when you are able to create optimal supporting indexes. But as you could see, in cases where this is not an option, obviously the execution times that you get are abysmal. Can you think of solutions that will do well even without optimal supporting indexes? To be continued…