I think everyone already knows my opinions about
MERGE and why I stay away from it. But here's another (anti-)pattern I see all over the place when people want to perform an upsert (update a row if it exists and insert it if it doesn't):
IF EXISTS (SELECT 1 FROM dbo.t WHERE [key] = @key) BEGIN UPDATE dbo.t SET val = @val WHERE [key] = @key; END ELSE BEGIN INSERT dbo.t([key], val) VALUES(@key, @val); END
This looks like a pretty logical flow that reflects how we think about this in real life:
- Does a row already exist for this key?
- YES: OK, update that row.
- NO: OK, then add it.
But this is wasteful.
Locating the row to confirm it exists, only to have to locate it again in order to update it, is doing twice the work for nothing. Even if the key is indexed (which I hope is always the case). If I put this logic into a flow chart and associate, at each step, the type of operation that would have to happen within the database, I'd have this:
More importantly, performance aside, unless you both use an explicit transaction and elevate isolation level, multiple things could go wrong when the row doesn't already exist:
- If the key exists and two sessions try to update simultaneously, they'll both update successfully (one will "win"; the "loser" will follow with the change that sticks, leading to a "lost update"). This isn't a problem on its own, and is how we should expect a system with concurrency to work. Paul White talks about the internal mechanics in greater detail here, and Martin Smith talks about some other nuances here.
- If the key doesn't exist, but both sessions pass the existence check the same way, anything could happen when they both try to insert:
- deadlock because of incompatible locks;
- raise key violation errors that shouldn't have happened; or,
- insert duplicate key values if that column isn't properly constrained.
That last one is the worst, IMHO, because it's the one that potentially corrupts data. Deadlocks and exceptions can be handled easily with things like error handling,
XACT_ABORT, and retry logic, depending on how frequently you expect collisions. But if you are lulled into a sense of security that the
IF EXISTS check protects you from duplicates (or key violations), that is a surprise waiting to happen. If you expect a column to act like a key, make it official and add a constraint.
"Many people are saying…"
Michael Swart has also treated this subject multiple times:
- Mythbusting: Concurrent Update/Insert Solutions – where he acknowledged that leaving the initial logic in place and only elevating the isolation level just changed key violations to deadlocks;
- Be Careful with the Merge Statement – where he checked his enthusiasm about
- What To Avoid If You Want To Use MERGE – where he confirmed once again that there are still plenty of valid reasons to continue avoiding
Make sure you read all the comments on all three posts, too.
I've fixed many deadlocks in my career by simply adjusting to the following pattern (ditch the redundant check, wrap the sequence in a transaction, and protect the first table access with appropriate locking):
BEGIN TRANSACTION; UPDATE dbo.t WITH (UPDLOCK, SERIALIZABLE) SET val = @val WHERE [key] = @key; IF @@ROWCOUNT = 0 BEGIN INSERT dbo.t([key], val) VALUES(@key, @val); END COMMIT TRANSACTION;
Why do we need two hints? Isn't
UPDLOCKis used to protect against conversion deadlocks at the statement level (let another session wait instead of encouraging a victim to retry).
SERIALIZABLEis used to protect against changes to the underlying data throughout the transaction (ensure a row that doesn't exist continues to not exist).
It's a little more code, but it's 1000% safer, and even in the worst case (the row does not already exist), it performs the same as the anti-pattern. In the best case, if you are updating a row that already exists, it will be more efficient to only locate that row once. Combining this logic with the high-level operations that would have to happen in the database, it is slightly simpler:
But again, performance aside:
- If the key exists and two sessions try to update it at the same time, they'll both take turns and update the row successfully, like before.
- If the key doesn't exist, one session will "win" and insert the row. The other will have to wait until the locks are released to even check for existence, and be forced to update.
In both cases, the writer who won the race loses their data to anything the "loser" updated after them.
Note that overall throughput on a highly concurrent system might suffer, but that is a trade-off you should be willing to make. That you're getting lots of deadlock victims or key violation errors, but they're happening quickly, is not a good performance metric. Some folks would love to see all blocking removed from all scenarios, but some of that is blocking you absolutely want for data integrity.
But what if an update is less likely?
It is clear that the above solution optimizes for updates, and assumes that a key you're trying to write to will already exist in the table as least as often as it doesn't. If you would rather optimize for inserts, knowing or guessing that inserts will be more likely than updates, you can flip the logic around and still have a safe upsert operation:
BEGIN TRANSACTION; INSERT dbo.t([key], val) SELECT @key, @val WHERE NOT EXISTS ( SELECT 1 FROM dbo.t WITH (UPDLOCK, SERIALIZABLE) WHERE [key] = @key ); IF @@ROWCOUNT = 0 BEGIN UPDATE dbo.t SET val = @val WHERE [key] = @key; END COMMIT TRANSACTION;
There's also the "just do it" approach, where you blindly insert and let collisions raise exceptions to the caller:
BEGIN TRANSACTION; BEGIN TRY INSERT dbo.t([key], val) VALUES(@key, @val); END TRY BEGIN CATCH UPDATE dbo.t SET val = @val WHERE [key] = @key; END CATCH COMMIT TRANSACTION;
What about upserting multiple rows?
The above deals with singleton insert/update decisions, but Justin Pealing asked what to do when you are processing multiple rows without knowing which of them already exist?
Assuming you are sending a set of rows in using something like a table-valued parameter, you would update using a join, and then insert using NOT EXISTS, but the pattern would still be equivalent to the first approach above:
CREATE PROCEDURE dbo.UpsertTheThings @tvp dbo.TableType READONLY AS BEGIN SET NOCOUNT ON; BEGIN TRANSACTION; UPDATE t WITH (UPDLOCK, SERIALIZABLE) SET val = tvp.val FROM dbo.t AS t INNER JOIN @tvp AS tvp ON t.[key] = tvp.[key]; INSERT dbo.t([key], val) SELECT [key], val FROM @tvp AS tvp WHERE NOT EXISTS (SELECT 1 FROM dbo.t WHERE [key] = tvp.[key]); COMMIT TRANSACTION; END
If you're getting multiple rows together in some other way than a TVP (XML, comma-separated list, voodoo), put them into a table form first, and join to whatever that is. Be careful not to optimize for inserts first in this scenario, otherwise you'll potentially update some rows twice.
These upsert patterns are superior to the ones I see all too often, and I hope you start using them. I will point to this post every time I spot the
IF EXISTS pattern in the wild. And, hey, another shoutout to Paul White (sql.kiwi | @SQK_Kiwi), because he is so excellent at making hard concepts easy to understand and, in turn, explain.
And if you feel you have to use
MERGE, please don't @ me; either you have a good reason (maybe you need some obscure
MERGE-only functionality), or you didn't take the above links seriously.