Itzik Ben-Gan

Closest Match, Part 2

Last month, I covered a puzzle involving matching each row from one table with the closest match from another table. I got this puzzle from Karen Ly, a Jr. Fixed Income Analyst at RBC. I covered two main relational solutions that combined the APPLY operator with TOP-based subqueries. Solution 1 always had quadratic scaling. Solution 2 did quite well when provided with good supporting indexes, but without those indexes it also had quadratic scaling. In this article I cover iterative solutions, which, despite being generally frowned upon by SQL pros, do provide much better scaling in our case, even without optimal indexing.

The challenge

As a quick reminder, our challenge involves tables called T1 and T2, which you create with the following code:

  SET NOCOUNT ON;

  IF DB_ID('testdb') IS NULL
    CREATE DATABASE testdb;
  GO

  USE testdb;

  DROP TABLE IF EXISTS dbo.T1, dbo.T2;

  CREATE TABLE dbo.T1
  (
    keycol INT NOT NULL IDENTITY
      CONSTRAINT PK_T1 PRIMARY KEY,
    val INT NOT NULL,
    othercols BINARY(100) NOT NULL
      CONSTRAINT DFT_T1_col1 DEFAULT(0xAA)
  );

  CREATE TABLE dbo.T2
  (
    keycol INT NOT NULL IDENTITY
      CONSTRAINT PK_T2 PRIMARY KEY,
    val INT NOT NULL,
    othercols BINARY(100) NOT NULL
      CONSTRAINT DFT_T2_col1 DEFAULT(0xBB)
  );

You then use the following code to populate the tables with small sets of sample data in order to check the correctness of your solutions:

  TRUNCATE TABLE dbo.T1;
  TRUNCATE TABLE dbo.T2;

  INSERT INTO dbo.T1 (val)
    VALUES(1),(1),(3),(3),(5),(8),(13),(16),(18),(20),(21);

  INSERT INTO dbo.T2 (val)
    VALUES(2),(2),(7),(3),(3),(11),(11),(13),(17),(19);

Recall that the challenge was to match, to each row from T1, the row from T2 where the absolute difference between T2.val and T1.val is the lowest. In case of ties, you're supposed to use val ascending, keycol ascending order as the tiebreaker.
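
For reference, the per-row logic can be expressed relationally with a TOP (1) subquery, along the lines of Part 1's APPLY-based solutions. The following is just a sketch to restate the matching logic, not the tuned code from Part 1:

  SELECT
     T1.keycol AS keycol1, T1.val AS val1, T1.othercols AS othercols1,
     A.keycol AS keycol2, A.val AS val2, A.othercols AS othercols2
  FROM dbo.T1
    CROSS APPLY ( SELECT TOP (1) T2.keycol, T2.val, T2.othercols
                  FROM dbo.T2
                  -- closest absolute difference first; val, keycol break ties
                  ORDER BY ABS(T2.val - T1.val), T2.val, T2.keycol ) AS A;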

Here’s the desired result for the given sample data:

  keycol1     val1        othercols1 keycol2     val2        othercols2
  ----------- ----------- ---------- ----------- ----------- ----------
  1           1           0xAA       1           2           0xBB
  2           1           0xAA       1           2           0xBB
  3           3           0xAA       4           3           0xBB
  4           3           0xAA       4           3           0xBB
  5           5           0xAA       4           3           0xBB
  6           8           0xAA       3           7           0xBB
  7           13          0xAA       8           13          0xBB
  8           16          0xAA       9           17          0xBB
  9           18          0xAA       9           17          0xBB
  10          20          0xAA       10          19          0xBB
  11          21          0xAA       10          19          0xBB

To check the performance of your solutions you need larger sets of sample data. You first create the helper function GetNums, which generates a sequence of integers in a requested range, using the following code:

  DROP FUNCTION IF EXISTS dbo.GetNums;
  GO

  CREATE OR ALTER FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
  AS
  RETURN
    WITH
      L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
      L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
      L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
      L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
      L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
      L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
      Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
               FROM L5)
    SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
    FROM Nums
    ORDER BY rownum;
  GO
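
As a quick sanity check of the helper function, the following call should return the integers 11 through 15, in order:

  -- Returns 11, 12, 13, 14, 15
  SELECT n FROM dbo.GetNums(11, 15);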

You then populate T1 and T2 using the following code, adjusting the parameters indicating the numbers of rows and maximum values based on your needs:

  DECLARE
    @numrowsT1 AS INT = 1000000,
    @maxvalT1  AS INT = 10000000,
    @numrowsT2 AS INT = 1000000,
    @maxvalT2  AS INT = 10000000;
    
  TRUNCATE TABLE dbo.T1;
  TRUNCATE TABLE dbo.T2;

  INSERT INTO dbo.T1 WITH(TABLOCK) (val)
    SELECT ABS(CHECKSUM(NEWID())) % @maxvalT1 + 1 AS val
    FROM dbo.GetNums(1, @numrowsT1) AS Nums;
    
  INSERT INTO dbo.T2 WITH(TABLOCK) (val)
    SELECT ABS(CHECKSUM(NEWID())) % @maxvalT2 + 1 AS val
    FROM dbo.GetNums(1, @numrowsT2) AS Nums;

In this example you’re populating the tables with 1,000,000 rows each, with values in the range 1 – 10,000,000 in the val column (low density).
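
If you also want to test a high-density scenario (many duplicates in the val column), you can rerun the population code above with much smaller maximum values; for example, the following parameters draw 1,000,000 rows from only 1,000 distinct values:

  -- High density: rerun the TRUNCATE and INSERT statements above with these values
  DECLARE
    @numrowsT1 AS INT = 1000000,
    @maxvalT1  AS INT = 1000,
    @numrowsT2 AS INT = 1000000,
    @maxvalT2  AS INT = 1000;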

Solution 3, using a cursor and a disk-based table variable

An efficient iterative solution for our closest match challenge is based on an algorithm similar to the merge join algorithm. The idea is to apply only one ordered pass against each table using cursors, evaluating the ordering and tiebreaking elements in each round to decide on which side to advance, and matching the rows along the way.

The ordered pass against each table will certainly benefit from supporting indexes, but the implication of not having those is that explicit sorting will take place. This means that the sorting part will incur n log n scaling, but that’s much less severe than the quadratic scaling that you get from Solution 2 in similar circumstances.

Also, the performance of Solutions 1 and 2 was affected by the density of the val column. With higher density the plan applied fewer rebinds. Conversely, since the iterative solutions perform only one pass against each of the inputs, the density of the val column isn’t a performance-affecting factor.

Use the following code to create supporting indexes:

  CREATE INDEX idx_val_key ON dbo.T1(val, keycol) INCLUDE(othercols);
  CREATE INDEX idx_val_key ON dbo.T2(val, keycol) INCLUDE(othercols);

Make sure you test the solutions both with and without these indexes.
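
To retest without the indexes, drop them first (you can always recreate them with the code above):

  DROP INDEX IF EXISTS idx_val_key ON dbo.T1;
  DROP INDEX IF EXISTS idx_val_key ON dbo.T2;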

Here’s the complete code for Solution 3:

  SET NOCOUNT ON;

  BEGIN TRAN;

  DECLARE
    @keycol1 AS INT, @val1 AS INT, @othercols1 AS BINARY(100),
    @keycol2 AS INT, @val2 AS INT, @othercols2 AS BINARY(100),
    @prevkeycol2 AS INT, @prevval2 AS INT, @prevothercols2 AS BINARY(100),
    @C1 AS CURSOR, @C2 AS CURSOR,
    @C1fetch_status AS INT, @C2fetch_status AS INT;

  DECLARE @Result AS TABLE
  (
    keycol1    INT         NOT NULL PRIMARY KEY,
    val1       INT         NOT NULL,
    othercols1 BINARY(100) NOT NULL,
    keycol2    INT         NULL,
    val2       INT         NULL,
    othercols2 BINARY(100) NULL
  );

  SET @C1 = CURSOR FORWARD_ONLY STATIC READ_ONLY FOR
    SELECT keycol, val, othercols FROM dbo.T1 ORDER BY val, keycol;

  SET @C2 = CURSOR FORWARD_ONLY STATIC READ_ONLY FOR
    SELECT keycol, val, othercols FROM dbo.T2 ORDER BY val, keycol;

  OPEN @C1;
  OPEN @C2;

  FETCH NEXT FROM @C2 INTO @keycol2, @val2, @othercols2;

  SET @C2fetch_status = @@fetch_status;

  SELECT @prevkeycol2 = @keycol2, @prevval2 = @val2, @prevothercols2 = @othercols2;

  FETCH NEXT FROM @C1 INTO @keycol1, @val1, @othercols1;

  SET @C1fetch_status = @@fetch_status;

  WHILE @C1fetch_status = 0
  BEGIN
    IF @val1 <= @val2 OR @C2fetch_status <> 0
    BEGIN
      IF ABS(@val1 - @val2) < ABS(@val1 - @prevval2)
        INSERT INTO @Result(keycol1, val1, othercols1, keycol2, val2, othercols2)
          VALUES(@keycol1, @val1, @othercols1, @keycol2, @val2, @othercols2);
      ELSE
        INSERT INTO @Result(keycol1, val1, othercols1, keycol2, val2, othercols2)
          VALUES(@keycol1, @val1, @othercols1, @prevkeycol2, @prevval2, @prevothercols2);

      FETCH NEXT FROM @C1 INTO @keycol1, @val1, @othercols1;
      SET @C1fetch_status = @@fetch_status;
    END
    ELSE IF @C2fetch_status = 0
    BEGIN
      IF @val2 > @prevval2
        SELECT @prevkeycol2 = @keycol2, @prevval2 = @val2, @prevothercols2 = @othercols2;

      FETCH NEXT FROM @C2 INTO @keycol2, @val2, @othercols2;
      SET @C2fetch_status = @@fetch_status;
    END;  
  END;

  SELECT
     keycol1, val1, SUBSTRING(othercols1, 1, 1) AS othercols1,
     keycol2, val2, SUBSTRING(othercols2, 1, 1) AS othercols2
  FROM @Result;

  COMMIT TRAN;

The code uses a table variable called @Result to store the matches and eventually returns them by querying the table variable. Notice that the code performs the work in one transaction to reduce logging.

The code uses cursor variables called @C1 and @C2 to iterate through rows in T1 and T2, respectively, in both cases ordered by val, keycol. Local variables are used to store the current row values from each cursor (@keycol1, @val1 and @othercols1 for @C1, and @keycol2, @val2 and @othercols2 for @C2). Additional local variables store the previous row values from @C2 (@prevkeycol2, @prevval2 and @prevothercols2). The variables @C1fetch_status and @C2fetch_status hold the status of the last fetch from the respective cursor.

After declaring and opening both cursors, the code fetches a row from each cursor into the respective local variables, and initially stores the current row values from @C2 also in the previous row variables. The code then enters a loop that keeps running while the last fetch from @C1 was successful (@C1fetch_status = 0). The loop’s body applies the following pseudo code in each round:

  If @val1 <= @val2 or reached end of @C2
  Begin
    If absolute difference between @val1 and @val2 is less than between @val1 and @prevval2
      Add row to @Result with current row values from @C1 and current row values from @C2
    Else
      Add row to @Result with current row values from @C1 and previous row values from @C2

    Fetch next row from @C1
  End
  Else if last fetch from @C2 was successful
  Begin
    If @val2 > @prevval2
      Set variables holding @C2’s previous row values to values of current row variables

    Fetch next row from @C2
  End

The code then simply queries the table variable @Result to return all matches.

Using the large sets of sample data (1,000,000 rows in each table), with optimal indexing in place, this solution took 38 seconds to complete on my system, and performed 28,240 logical reads. With such indexing, the scaling of this solution is linear. Without optimal indexing, it took 40 seconds to complete (only 2 seconds extra!), and performed 29,519 logical reads. In that case, the sorting part of this solution has n log n scaling.

Solution 4, using a cursor and a memory-optimized table variable

In an attempt to improve the performance of the iterative approach, one thing you could try is to replace the disk-based table variable with a memory-optimized one. Since the solution involves writing 1,000,000 rows to the table variable, this could result in a non-negligible improvement.

First, you need to enable In-Memory OLTP in the database by creating a filegroup that is marked as CONTAINS MEMORY_OPTIMIZED_DATA, and within it a container that points to a folder in the file system. Assuming you have already created a parent folder called C:\IMOLTP\, use the following code to apply these two steps:

  ALTER DATABASE testdb
    ADD FILEGROUP testdb_MO CONTAINS MEMORY_OPTIMIZED_DATA;

  ALTER DATABASE testdb
    ADD FILE ( NAME = testdb_dir,
               FILENAME = 'C:\IMOLTP\testdb_dir' )
      TO FILEGROUP testdb_MO;
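
If you want to verify that the filegroup and its container were added, you can check the catalog views; this is just an optional sanity check:

  SELECT name, type_desc
  FROM sys.filegroups
  WHERE type = 'FX'; -- memory-optimized data filegroup

  SELECT name, physical_name
  FROM sys.database_files
  WHERE type_desc = 'FILESTREAM'; -- containers are exposed as FILESTREAM files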

The next step is to create a memory-optimized table type as a template for our table variable by running the following code:

  DROP TYPE IF EXISTS dbo.TYPE_closestmatch;
  GO

  CREATE TYPE dbo.TYPE_closestmatch AS TABLE
  (
    keycol1    INT         NOT NULL PRIMARY KEY NONCLUSTERED,
    val1       INT         NOT NULL,
    othercols1 BINARY(100) NOT NULL,
    keycol2    INT         NULL,
    val2       INT         NULL,
    othercols2 BINARY(100) NULL
  )
  WITH (MEMORY_OPTIMIZED = ON);

Then instead of the original declaration of the table variable @Result, you would use the following code:

  DECLARE @Result AS dbo.TYPE_closestmatch;

Here’s the complete solution code:

  SET NOCOUNT ON;

  USE testdb;

  BEGIN TRAN;

  DECLARE
    @keycol1 AS INT, @val1 AS INT, @othercols1 AS BINARY(100),
    @keycol2 AS INT, @val2 AS INT, @othercols2 AS BINARY(100),
    @prevkeycol2 AS INT, @prevval2 AS INT, @prevothercols2 AS BINARY(100),
    @C1 AS CURSOR, @C2 AS CURSOR,
    @C1fetch_status AS INT, @C2fetch_status AS INT;

  DECLARE @Result AS dbo.TYPE_closestmatch;

  SET @C1 = CURSOR FORWARD_ONLY STATIC READ_ONLY FOR
    SELECT keycol, val, othercols FROM dbo.T1 ORDER BY val, keycol;

  SET @C2 = CURSOR FORWARD_ONLY STATIC READ_ONLY FOR
    SELECT keycol, val, othercols FROM dbo.T2 ORDER BY val, keycol;

  OPEN @C1;
  OPEN @C2;

  FETCH NEXT FROM @C2 INTO @keycol2, @val2, @othercols2;

  SET @C2fetch_status = @@fetch_status;

  SELECT @prevkeycol2 = @keycol2, @prevval2 = @val2, @prevothercols2 = @othercols2;

  FETCH NEXT FROM @C1 INTO @keycol1, @val1, @othercols1;

  SET @C1fetch_status = @@fetch_status;

  WHILE @C1fetch_status = 0
  BEGIN
    IF @val1 <= @val2 OR @C2fetch_status <> 0
    BEGIN
      IF ABS(@val1 - @val2) < ABS(@val1 - @prevval2)
        INSERT INTO @Result(keycol1, val1, othercols1, keycol2, val2, othercols2)
          VALUES(@keycol1, @val1, @othercols1, @keycol2, @val2, @othercols2);
      ELSE
        INSERT INTO @Result(keycol1, val1, othercols1, keycol2, val2, othercols2)
          VALUES(@keycol1, @val1, @othercols1, @prevkeycol2, @prevval2, @prevothercols2);
          
      FETCH NEXT FROM @C1 INTO @keycol1, @val1, @othercols1;
      SET @C1fetch_status = @@fetch_status;
    END
    ELSE IF @C2fetch_status = 0
    BEGIN
      IF @val2 > @prevval2
        SELECT @prevkeycol2 = @keycol2, @prevval2 = @val2, @prevothercols2 = @othercols2;
        
      FETCH NEXT FROM @C2 INTO @keycol2, @val2, @othercols2;
      SET @C2fetch_status = @@fetch_status;
    END;  
  END;

  SELECT
     keycol1, val1, SUBSTRING(othercols1, 1, 1) AS othercols1,
     keycol2, val2, SUBSTRING(othercols2, 1, 1) AS othercols2
  FROM @Result;

  COMMIT TRAN;

With optimal indexing in place, this solution took 27 seconds to complete on my machine (compared to 38 seconds with the disk-based table variable), and without optimal indexing it took 29 seconds to complete (compared to 40 seconds). That's close to a 30 percent reduction in run time.

Solution 5, using SQL CLR

Another way to further improve the performance of the iterative approach is to implement the solution using SQL CLR, given that most of the overhead of the T-SQL solution is due to the inefficiencies of cursor fetching and looping in T-SQL.

Here's the complete solution code, which implements in C# the very same algorithm I used in Solutions 3 and 4, using SqlDataReader objects instead of T-SQL cursors:

  using System;
  using System.Data;
  using System.Data.SqlClient;
  using System.Data.SqlTypes;
  using Microsoft.SqlServer.Server;

  public partial class ClosestMatch
  {
      [SqlProcedure]
      public static void GetClosestMatches()
      {
          using (SqlConnection conn = new SqlConnection("data source=MyServer\\MyInstance;Database=testdb;Trusted_Connection=True;MultipleActiveResultSets=true;"))
          {
              SqlCommand comm1 = new SqlCommand();
              SqlCommand comm2 = new SqlCommand();
              comm1.Connection = conn;
              comm2.Connection = conn;
              comm1.CommandText = "SELECT keycol, val, othercols FROM dbo.T1 ORDER BY val, keycol;";
              comm2.CommandText = "SELECT keycol, val, othercols FROM dbo.T2 ORDER BY val, keycol;";
              
              SqlMetaData[] columns = new SqlMetaData[6];
              columns[0] = new SqlMetaData("keycol1", SqlDbType.Int);
              columns[1] = new SqlMetaData("val1", SqlDbType.Int);
              columns[2] = new SqlMetaData("othercols1", SqlDbType.Binary, 100);
              columns[3] = new SqlMetaData("keycol2", SqlDbType.Int);
              columns[4] = new SqlMetaData("val2", SqlDbType.Int);
              columns[5] = new SqlMetaData("othercols2", SqlDbType.Binary, 100);

              SqlDataRecord record = new SqlDataRecord(columns);
              SqlContext.Pipe.SendResultsStart(record);
              conn.Open();
              SqlDataReader reader1 = comm1.ExecuteReader();
              SqlDataReader reader2 = comm2.ExecuteReader();
              SqlInt32 keycol1 = SqlInt32.Null;
              SqlInt32 val1 = SqlInt32.Null;
              SqlBinary othercols1 = SqlBinary.Null;
              SqlInt32 keycol2 = SqlInt32.Null;
              SqlInt32 val2 = SqlInt32.Null;
              SqlBinary othercols2 = SqlBinary.Null;
              SqlInt32 prevkeycol2 = SqlInt32.Null;
              SqlInt32 prevval2 = SqlInt32.Null;
              SqlBinary prevothercols2 = SqlBinary.Null;
              
              Boolean reader2foundrow = reader2.Read();
              
              if (reader2foundrow)
              {
                  keycol2 = reader2.GetSqlInt32(0);
                  val2 = reader2.GetSqlInt32(1);
                  othercols2 = reader2.GetSqlBinary(2);
                  prevkeycol2 = keycol2;
                  prevval2 = val2;
                  prevothercols2 = othercols2;
              }

              Boolean reader1foundrow = reader1.Read();

              if (reader1foundrow)
              {
                  keycol1 = reader1.GetSqlInt32(0);
                  val1 = reader1.GetSqlInt32(1);
                  othercols1 = reader1.GetSqlBinary(2);
              }
              
              while (reader1foundrow)
              {
                  if (val1 <= val2 || !reader2foundrow)
                  {
                      if (Math.Abs((int)(val1 - val2)) < Math.Abs((int)(val1 - prevval2)))
                      {
                          record.SetSqlInt32(0, keycol1);
                          record.SetSqlInt32(1, val1);
                          record.SetSqlBinary(2, othercols1);
                          record.SetSqlInt32(3, keycol2);
                          record.SetSqlInt32(4, val2);
                          record.SetSqlBinary(5, othercols2);
                          SqlContext.Pipe.SendResultsRow(record);
                      }
                      else
                      {
                          record.SetSqlInt32(0, keycol1);
                          record.SetSqlInt32(1, val1);
                          record.SetSqlBinary(2, othercols1);
                          record.SetSqlInt32(3, prevkeycol2);
                          record.SetSqlInt32(4, prevval2);
                          record.SetSqlBinary(5, prevothercols2);
                          SqlContext.Pipe.SendResultsRow(record);                        
                      }

                      reader1foundrow = reader1.Read();

                      if (reader1foundrow)
                      {
                          keycol1 = reader1.GetSqlInt32(0);
                          val1 = reader1.GetSqlInt32(1);

                          othercols1 = reader1.GetSqlBinary(2);
                      }
                  }
                  else if (reader2foundrow)
                  {
                      if (val2 > prevval2)
                      {
                          prevkeycol2 = keycol2;
                          prevval2 = val2;
                          prevothercols2 = othercols2;
                      }
                      
                      reader2foundrow = reader2.Read();

                      if (reader2foundrow)
                      {                      
                          keycol2 = reader2.GetSqlInt32(0);
                          val2 = reader2.GetSqlInt32(1);
                          othercols2 = reader2.GetSqlBinary(2);
                      }
                  }
              }
              SqlContext.Pipe.SendResultsEnd();
          }
      }
  }

To connect to the database you would normally use the option "context connection=true" instead of a full-blown connection string. Unfortunately, this option is not available when you need to work with multiple active result sets. Our solution emulates parallel work with two cursors using two SqlDataReader objects, and therefore you do need a full-blown connection string, with the option MultipleActiveResultSets=true. Here’s the full connection string:

  "data source=MyServer\\MyInstance;Database=testdb;Trusted_Connection=True;MultipleActiveResultSets=true;"

Of course, in your case you would need to replace MyServer\\MyInstance with your server name and instance name (if relevant).

Also, the fact that you used an explicit connection string rather than "context connection=true" means that the assembly needs access to an external resource and therefore needs to be trusted. Normally, you would achieve this by signing it with a certificate or an asymmetric key that has a corresponding login with the right permission, or by whitelisting it using the sp_add_trusted_assembly procedure. For simplicity’s sake, I set the database option TRUSTWORTHY to ON, and specify the EXTERNAL_ACCESS permission set when creating the assembly. The following code deploys the solution in the database:

  EXEC sys.sp_configure 'advanced', 1;
  RECONFIGURE;

  EXEC sys.sp_configure 'clr enabled', 1;
  EXEC sys.sp_configure 'clr strict security', 0;
  RECONFIGURE;

  EXEC sys.sp_configure 'advanced', 0;
  RECONFIGURE;

  ALTER DATABASE testdb SET TRUSTWORTHY ON; 

  USE testdb;

  DROP PROC IF EXISTS dbo.GetClosestMatches;

  DROP ASSEMBLY IF EXISTS ClosestMatch;

  CREATE ASSEMBLY ClosestMatch 
    FROM 'C:\ClosestMatch\ClosestMatch\bin\Debug\ClosestMatch.dll'
  WITH PERMISSION_SET = EXTERNAL_ACCESS;
  GO

  CREATE PROCEDURE dbo.GetClosestMatches
  AS EXTERNAL NAME ClosestMatch.ClosestMatch.GetClosestMatches;

The code enables CLR in the instance, disables the CLR strict security option, sets the database option TRUSTWORTHY to ON, creates the assembly, and creates the procedure GetClosestMatches.
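
If you prefer not to set TRUSTWORTHY to ON, here's a sketch of the sp_add_trusted_assembly alternative mentioned earlier (assuming the same DLL path, and SQL Server 2017 or later); run it before CREATE ASSEMBLY instead of the ALTER DATABASE statement:

  -- Load the assembly bits from the DLL file
  DECLARE @bits AS VARBINARY(MAX) =
    (SELECT BulkColumn
     FROM OPENROWSET(BULK 'C:\ClosestMatch\ClosestMatch\bin\Debug\ClosestMatch.dll',
                     SINGLE_BLOB) AS F);

  -- Whitelist the assembly by its SHA-512 hash
  DECLARE @hash AS VARBINARY(64) = HASHBYTES('SHA2_512', @bits);

  EXEC sys.sp_add_trusted_assembly @hash = @hash, @description = N'ClosestMatch';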

Use the following code to test the stored procedure:

 EXEC dbo.GetClosestMatches;

The CLR solution took 8 seconds to complete on my system with optimal indexing, and 9 seconds without. That’s a pretty impressive performance improvement compared to all other solutions — both relational and iterative.
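
When you're done testing, you may want to revert the instance- and database-level changes; here's a sketch of the cleanup:

  USE testdb;

  DROP PROC IF EXISTS dbo.GetClosestMatches;
  DROP ASSEMBLY IF EXISTS ClosestMatch;

  ALTER DATABASE testdb SET TRUSTWORTHY OFF;

  EXEC sys.sp_configure 'advanced', 1;
  RECONFIGURE;

  EXEC sys.sp_configure 'clr strict security', 1;
  RECONFIGURE;

  EXEC sys.sp_configure 'advanced', 0;
  RECONFIGURE;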

Conclusion

Iterative solutions are typically frowned upon in the SQL community since they don’t follow the relational model. The reality, though, is that sometimes you are not able to create well-performing relational solutions, and performance is a priority. Using an iterative approach, you are not limited to the algorithms that the SQL Server optimizer has access to; rather, you can implement any algorithm that you like. As demonstrated in this article, using a merge-like algorithm, you were able to achieve the task with a single ordered pass against each of the inputs. Using T-SQL cursors and a disk-based table variable you got reasonable performance and scaling. You were able to improve performance by about 30 percent by switching to a memory-optimized table variable, and significantly more by using SQL CLR.