My friend Peter Larsson sent me a T-SQL challenge involving matching supply with demand. As far as challenges go, it's a formidable one. It's a pretty common need in real life; it's simple to describe, and it's quite easy to solve with reasonable performance using a cursor-based solution. The tricky part is to solve it using an efficient set-based solution.
When Peter sends me a puzzle, I typically respond with a suggested solution, and he usually comes up with a better-performing, brilliant solution. I sometimes suspect he sends me puzzles to motivate himself to come up with a great solution.
Since the task represents a common need and is so interesting, I asked Peter if he would mind if I shared it and his solutions with the sqlperformance.com readers, and he was happy to let me.
This month, I'll present the challenge and the classic cursor-based solution. In subsequent articles, I'll present set-based solutions—including the ones Peter and I worked on.
The challenge involves querying a table called Auctions, which you create using the following code:
DROP TABLE IF EXISTS dbo.Auctions; CREATE TABLE dbo.Auctions ( ID INT NOT NULL IDENTITY(1, 1) CONSTRAINT pk_Auctions PRIMARY KEY CLUSTERED, Code CHAR(1) NOT NULL CONSTRAINT ck_Auctions_Code CHECK (Code = 'D' OR Code = 'S'), Quantity DECIMAL(19, 6) NOT NULL CONSTRAINT ck_Auctions_Quantity CHECK (Quantity > 0) );
This table holds demand and supply entries. Demand entries are marked with the code D and supply entries with S. Your task is to match supply and demand quantities based on ID ordering, writing the result into a temporary table. The entries could represent cryptocurrency buy and sell orders such as BTC/USD, stock buy and sell orders such as MSFT/USD, or any other item or good where you need to match supply with demand. Notice the Auctions entries are missing a price attribute. As mentioned, you should match the supply and demand quantities based on ID ordering. This ordering could have been derived from price (ascending for supply entries and descending for demand entries) or any other relevant criteria. In this challenge, the idea was to keep things simple and assume the ID already represents the relevant order for matching.
As an example, use the following code to fill the Auctions table with a small set of sample data:
SET NOCOUNT ON; DELETE FROM dbo.Auctions; SET IDENTITY_INSERT dbo.Auctions ON; INSERT INTO dbo.Auctions(ID, Code, Quantity) VALUES (1, 'D', 5.0), (2, 'D', 3.0), (3, 'D', 8.0), (5, 'D', 2.0), (6, 'D', 8.0), (7, 'D', 4.0), (8, 'D', 2.0), (1000, 'S', 8.0), (2000, 'S', 6.0), (3000, 'S', 2.0), (4000, 'S', 2.0), (5000, 'S', 4.0), (6000, 'S', 3.0), (7000, 'S', 2.0); SET IDENTITY_INSERT dbo.Auctions OFF;
You're supposed to match the supply and demand quantities like so:
- A quantity of 5.0 is matched for Demand 1 and Supply 1000, depleting Demand 1 and leaving 3.0 of Supply 1000
- A quantity of 3.0 is matched for Demand 2 and Supply 1000, depleting both Demand 2 and Supply 1000
- A quantity of 6.0 is matched for Demand 3 and Supply 2000, leaving 2.0 of Demand 3 and depleting supply 2000
- A quantity of 2.0 is matched for Demand 3 and Supply 3000, depleting both Demand 3 and Supply 3000
- A quantity of 2.0 is matched for Demand 5 and Supply 4000, depleting both Demand 5 and Supply 4000
- A quantity of 4.0 is matched for Demand 6 and Supply 5000, leaving 4.0 of Demand 6 and depleting Supply 5000
- A quantity of 3.0 is matched for Demand 6 and Supply 6000, leaving 1.0 of Demand 6 and depleting Supply 6000
- A quantity of 1.0 is matched for Demand 6 and Supply 7000, depleting Demand 6 and leaving 1.0 of Supply 7000
- A quantity of 1.0 is matched for Demand 7 and Supply 7000, leaving 3.0 of Demand 7 and depleting Supply 7000; Demand 8 isn't matched with any Supply entries and therefore is left with the full quantity 2.0
Your solution is supposed to write the following data into the resulting temporary table:
DemandID SupplyID TradeQuantity ----------- ----------- -------------- 1 1000 5.000000 2 1000 3.000000 3 2000 6.000000 3 3000 2.000000 5 4000 2.000000 6 5000 4.000000 6 6000 3.000000 6 7000 1.000000 7 7000 1.000000
Large Set of Sample Data
To test the performance of the solutions, you'll need a larger set of sample data. To help with this, you can use the function GetNums, which you create by running the following code:
CREATE FUNCTION dbo.GetNums(@low AS BIGINT = 1, @high AS BIGINT) RETURNS TABLE AS RETURN WITH L0 AS ( SELECT 1 AS c FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1), (1),(1),(1),(1),(1),(1),(1),(1)) AS D(c) ), L1 AS ( SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B ), L2 AS ( SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B ), L3 AS ( SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B ), Nums AS ( SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum FROM L3 ) SELECT TOP(@high - @low + 1) rownum AS rn, @high + 1 - rownum AS op, @low - 1 + rownum AS n FROM Nums ORDER BY rownum; GO
This function returns a sequence of integers in the requested range.
With this function in place, you can use the following code provided by Peter to populate the Auctions table with sample data, customizing the inputs as needed:
DECLARE -- test with 50K, 100K, 150K, 200K in each of variables @Buyers and @Sellers -- so total rowcount is double (100K, 200K, 300K, 400K) @Buyers AS INT = 200000, @Sellers AS INT = 200000, @BuyerMinQuantity AS DECIMAL(19, 6) = 0.000001, @BuyerMaxQuantity AS DECIMAL(19, 6) = 99.999999, @SellerMinQuantity AS DECIMAL(19, 6) = 0.000001, @SellerMaxQuantity AS DECIMAL(19, 6) = 99.999999; DELETE FROM dbo.Auctions; INSERT INTO dbo.Auctions(Code, Quantity) SELECT Code, Quantity FROM ( SELECT 'D' AS Code, (ABS(CHECKSUM(NEWID())) % (1000000 * (@BuyerMaxQuantity - @BuyerMinQuantity))) / 1000000E + @BuyerMinQuantity AS Quantity FROM dbo.GetNums(1, @Buyers) UNION ALL SELECT 'S' AS Code, (ABS(CHECKSUM(NEWID())) % (1000000 * (@SellerMaxQuantity - @SellerMinQuantity))) / 1000000E + @SellerMinQuantity AS Quantity FROM dbo.GetNums(1, @Sellers) ) AS X ORDER BY NEWID(); SELECT Code, COUNT(*) AS Items FROM dbo.Auctions GROUP BY Code;
As you can see, you can customize the number of buyers and sellers as well as the minimum and maximum buyer and seller quantities. The above code specifies 200,000 buyers and 200,000 sellers, resulting in a total of 400,000 rows in the Auctions table. The last query tells you how many demand (D) and supply (S) entries were written to the table. It returns the following output for the aforementioned inputs:
Code Items ---- ----------- D 200000 S 200000
I'm going to test the performance of various solutions using the above code, setting the number of buyers and sellers each to the following: 50,000, 100,000, 150,000, and 200,000. This means I'll get the following total numbers of rows in the table: 100,000, 200,000, 300,000, and 400,000. Of course, you can customize the inputs as you wish to test the performance of your solutions.
Peter provided the cursor-based solution. It's pretty basic, which is one of its important advantages. It will be used as a benchmark.
The solution uses two cursors: one for demand entries ordered by ID and the other for supply entries ordered by ID. To avoid a full scan and a sort per cursor, create the following index:
CREATE UNIQUE NONCLUSTERED INDEX idx_Code_ID_i_Quantity ON dbo.Auctions(Code, ID) INCLUDE(Quantity);
Here's the complete solution's code:
SET NOCOUNT ON; DROP TABLE IF EXISTS #PairingsCursor; CREATE TABLE #PairingsCursor ( DemandID INT NOT NULL, SupplyID INT NOT NULL, TradeQuantity DECIMAL(19, 6) NOT NULL ); DECLARE curDemand CURSOR LOCAL FORWARD_ONLY READ_ONLY FOR SELECT ID AS DemandID, Quantity FROM dbo.Auctions WHERE Code = 'D' ORDER BY ID; DECLARE curSupply CURSOR LOCAL FORWARD_ONLY READ_ONLY FOR SELECT ID AS SupplyID, Quantity FROM dbo.Auctions WHERE Code = 'S' ORDER BY ID; DECLARE @DemandID AS INT, @DemandQuantity AS DECIMAL(19, 6), @SupplyID AS INT, @SupplyQuantity AS DECIMAL(19, 6); OPEN curDemand; FETCH NEXT FROM curDemand INTO @DemandID, @DemandQuantity; OPEN curSupply; FETCH NEXT FROM curSupply INTO @SupplyID, @SupplyQuantity; WHILE @@FETCH_STATUS = 0 BEGIN IF @DemandQuantity <= @SupplyQuantity BEGIN INSERT #PairingsCursor(DemandID, SupplyID, TradeQuantity) VALUES(@DemandID, @SupplyID, @DemandQuantity); SET @SupplyQuantity -= @DemandQuantity; FETCH NEXT FROM curDemand INTO @DemandID, @DemandQuantity; END; ELSE BEGIN IF @SupplyQuantity > 0 BEGIN INSERT #PairingsCursor(DemandID, SupplyID, TradeQuantity) VALUES(@DemandID, @SupplyID, @SupplyQuantity); SET @DemandQuantity -= @SupplyQuantity; END; FETCH NEXT FROM curSupply INTO @SupplyID, @SupplyQuantity; END; END; CLOSE curDemand; DEALLOCATE curDemand; CLOSE curSupply; DEALLOCATE curSupply;
As you can see, the code starts by creating a temporary table. It then declares the two cursors and fetches a row from each, writing the current demand quantity to the variable @DemandQuantity and the current supply quantity to @SupplyQuantity. It then processes the entries in a loop as long as the last fetch was successful. The code applies the following logic in the loop's body:
If the current demand quantity is less than or equal to the current supply quantity, the code does the following:
- Writes a row into the temp table with the current pairing, with the demand quantity (@DemandQuantity) as the matched quantity
- Subtracts the current demand quantity from the current supply quantity (@SupplyQuantity)
- Fetches the next row from the demand cursor
Otherwise, the code does the following:
Checks if the current supply quantity is greater than zero, and if so, it does the following:
- Writes a row into the temp table with the current pairing, with the supply quantity as the matched quantity
- Subtracts the current supply quantity from the current demand quantity
- Fetches the next row from the supply cursor
Once the loop is done, there are no more pairs to match, so the code just cleans up the cursor resources.
From a performance standpoint, you do get the typical overhead of cursor fetching and looping. Then again, this solution does a single ordered pass over the demand data and a single ordered pass over the supply data, each using a seek and range scan in the index you created earlier. The plans for the cursor queries are shown in Figure 1.
Since the plan performs a seek and ordered range scan of each of the parts (demand and supply) with no sorting involved, it has linear scaling. This was confirmed by the performance numbers I got when testing it, as shown in Figure 2.
If you're interested in the more precise run times, here they are:
- 100,000 rows (50,000 demands and 50,000 supplies): 2.93 seconds
- 200,000 rows: 5.89 seconds
- 300,000 rows: 8.75 seconds
- 400,000 rows: 11.81 seconds
Given the solution has linear scaling, you can easily predict the run time with other scales. For example, with a million rows (say, 500,000 demands and 500,000 supplies) it should take about 29.5 seconds to run.
The Challenge Is On
The cursor-based solution has linear scaling and, as such, it isn't bad. But it does incur the typical cursor fetching and looping overhead. Obviously, you can reduce this overhead by implementing the same algorithm using a CLR-based solution. The question is, can you come up with a well-performing set-based solution for this task?
As mentioned, I'll explore set-based solutions—both poorly performing and well-performing—starting next month. In the meanwhile, if you're up to the challenge, see if you can come up with set-based solutions of your own.
To verify the correctness of your solution, first compare its result with the one shown here for the small set of sample data. You can also check the validity of your solution's result with a large set of sample data by verifying the symmetric difference between the cursor solution's result and yours is empty. Assuming the cursor's result is stored in a temp table called #PairingsCursor and yours in a temp table called #MyPairings, you can use the following code to achieve this:
(SELECT * FROM #PairingsCursor EXCEPT SELECT * FROM #MyPairings) UNION ALL (SELECT * FROM #MyPairings EXCEPT SELECT * FROM #PairingsCursor);
The result should be empty.