Aaron Bertrand

Please stop using this UPSERT anti-pattern

September 2, 2020 by in locking, T-SQL Queries | 5 Comments
SentryOne eBooks

In these books, you will find useful, hand-picked articles that will help give insight into some of your most vexing performance problems. These articles were written by several of the SQL Server industry’s leading experts, including Paul White, Paul Randal, Jonathan Kehayias, Erin Stellato, Glenn Berry, Aaron Bertrand, and Joe Sack.

Free Download

Featured Author

Itzik is a T-SQL trainer, a co-founder of SolidQ, and blogs about T-SQL fundamentals and query tuning.

Itzik’s Posts

I think everyone already knows my opinions about MERGE and why I stay away from it. But here's another (anti-)pattern I see all over the place when people want to perform an upsert (update a row if it exists and insert it if it doesn't):

IF EXISTS (SELECT 1 FROM dbo.t WHERE [key] = @key)
BEGIN
  UPDATE dbo.t SET val = @val WHERE [key] = @key;
END
ELSE
BEGIN
  INSERT dbo.t([key], val) VALUES(@key, @val); 
END

This looks like a pretty logical flow that reflects how we think about this in real life:

  • Does a row already exist for this key?
    • YES: OK, update that row.
    • NO: OK, then add it.

But this is wasteful.

Locating the row to confirm it exists, only to have to locate it again in order to update it, is doing twice the work for nothing. Even if the key is indexed (which I hope is always the case). If I put this logic into a flow chart and associate, at each step, the type of operation that would have to happen within the database, I'd have this:

Notice that all paths will incur two index operations.

More importantly, performance aside, unless you both use an explicit transaction and elevate isolation level, multiple things could go wrong when the row doesn't already exist:

  • If the key exists and two sessions try to update simultaneously, they'll both update successfully (one will "win"; the "loser" will follow with the change that sticks, leading to a "lost update"). This isn't a problem on its own, and is how we should expect a system with concurrency to work. Paul White talks about the internal mechanics in greater detail here, and Martin Smith talks about some other nuances here.
  • If the key doesn't exist, but both sessions pass the existence check the same way, anything could happen when they both try to insert:
    • deadlock because of incompatible locks;
    • raise key violation errors that shouldn't have happened; or,
    • insert duplicate key values if that column isn't properly constrained.

That last one is the worst, IMHO, because it's the one that potentially corrupts data. Deadlocks and exceptions can be handled easily with things like error handling, XACT_ABORT, and retry logic, depending on how frequently you expect collisions. But if you are lulled into a sense of security that the IF EXISTS check protects you from duplicates (or key violations), that is a surprise waiting to happen. If you expect a column to act like a key, make it official and add a constraint.

"Many people are saying…"

Dan Guzman talked about race conditions more than a decade ago in Conditional INSERT/UPDATE Race Condition and later in "UPSERT" Race Condition With MERGE.

Michael Swart has also treated this subject multiple times:

Make sure you read all the comments on all three posts, too.

The Solution

I've fixed many deadlocks in my career by simply adjusting to the following pattern (ditch the redundant check, wrap the sequence in a transaction, and protect the first table access with appropriate locking):

BEGIN TRANSACTION;

UPDATE dbo.t WITH (UPDLOCK, SERIALIZABLE) SET val = @val WHERE [key] = @key;

IF @@ROWCOUNT = 0
BEGIN
  INSERT dbo.t([key], val) VALUES(@key, @val);
END

COMMIT TRANSACTION;

Why do we need two hints? Isn't UPDLOCK enough?

  • UPDLOCK is used to protect against conversion deadlocks at the statement level (let another session wait instead of encouraging a victim to retry).
  • SERIALIZABLE is used to protect against changes to the underlying data throughout the transaction (ensure a row that doesn't exist continues to not exist).

It's a little more code, but it's 1000% safer, and even in the worst case (the row does not already exist), it performs the same as the anti-pattern. In the best case, if you are updating a row that already exists, it will be more efficient to only locate that row once. Combining this logic with the high-level operations that would have to happen in the database, it is slightly simpler:

In this case, one path only incurs a single index operation.

But again, performance aside:

  • If the key exists and two sessions try to update it at the same time, they'll both take turns and update the row successfully, like before.
  • If the key doesn't exist, one session will "win" and insert the row. The other will have to wait until the locks are released to even check for existence, and be forced to update.

In both cases, the writer who won the race loses their data to anything the "loser" updated after them.

Note that overall throughput on a highly concurrent system might suffer, but that is a trade-off you should be willing to make. That you're getting lots of deadlock victims or key violation errors, but they're happening quickly, is not a good performance metric. Some folks would love to see all blocking removed from all scenarios, but some of that is blocking you absolutely want for data integrity.

But what if an update is less likely?

It is clear that the above solution optimizes for updates, and assumes that a key you're trying to write to will already exist in the table as least as often as it doesn't. If you would rather optimize for inserts, knowing or guessing that inserts will be more likely than updates, you can flip the logic around and still have a safe upsert operation:

BEGIN TRANSACTION;

INSERT dbo.t([key], val) 
  SELECT @key, @val
  WHERE NOT EXISTS
  (
    SELECT 1 FROM dbo.t WITH (UPDLOCK, SERIALIZABLE)
      WHERE [key] = @key
  );

IF @@ROWCOUNT = 0
BEGIN
  UPDATE dbo.t SET val = @val WHERE [key] = @key;
END

COMMIT TRANSACTION;

There's also the "just do it" approach, where you blindly insert and let collisions raise exceptions to the caller:

BEGIN TRANSACTION;

BEGIN TRY
  INSERT dbo.t([key], val) VALUES(@key, @val);
END TRY
BEGIN CATCH
  UPDATE dbo.t SET val = @val WHERE [key] = @key;
END CATCH

COMMIT TRANSACTION;

The cost of those exceptions will often outweigh the cost of checking first; you'll have to try it with a roughly accurate guess of hit/miss rate. I wrote about this here and here.

What about upserting multiple rows?

The above deals with singleton insert/update decisions, but Justin Pealing asked what to do when you are processing multiple rows without knowing which of them already exist?

Assuming you are sending a set of rows in using something like a table-valued parameter, you would update using a join, and then insert using NOT EXISTS, but the pattern would still be equivalent to the first approach above:

CREATE PROCEDURE dbo.UpsertTheThings
    @tvp dbo.TableType READONLY
AS
BEGIN
  SET NOCOUNT ON;

  BEGIN TRANSACTION;

  UPDATE t WITH (UPDLOCK, SERIALIZABLE) 
    SET val = tvp.val
  FROM dbo.t AS t
  INNER JOIN @tvp AS tvp
    ON t.[key] = tvp.[key];

  INSERT dbo.t([key], val)
    SELECT [key], val FROM @tvp AS tvp
    WHERE NOT EXISTS (SELECT 1 FROM dbo.t WHERE [key] = tvp.[key]);

  COMMIT TRANSACTION;
END

If you're getting multiple rows together in some other way than a TVP (XML, comma-separated list, voodoo), put them into a table form first, and join to whatever that is. Be careful not to optimize for inserts first in this scenario, otherwise you'll potentially update some rows twice.

Conclusion

These upsert patterns are superior to the ones I see all too often, and I hope you start using them. I will point to this post every time I spot the IF EXISTS pattern in the wild. And, hey, another shoutout to Paul White (sql.kiwi | @SQK_Kiwi), because he is so excellent at making hard concepts easy to understand and, in turn, explain.

And if you feel you have to use MERGE, please don't @ me; either you have a good reason (maybe you need some obscure MERGE-only functionality), or you didn't take the above links seriously.