Aaron Bertrand

Break large delete operations into chunks

[Image credit: http://www.flickr.com/photos/ladymixy-uk/4059154289/]

Far too often I see folks complaining about how their transaction log took over their hard disk. Many times it turns out that they were performing a large delete operation, such as purging or archiving data, in one large transaction.

I wanted to run some tests to show the impact, on both duration and the transaction log, of performing the same data operation in chunks versus a single transaction. I created a database and populated it with a largish table (SalesOrderDetailEnlarged, built using this AdventureWorks enlarging script from Jonathan Kehayias (blog | @SQLPoolBoy)). This table has 4.85 million rows and a reserved space usage of 711 MB (478 MB in data and 233 MB in indexes).
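(If you want to check those numbers for your own copy of the table, a quick sp_spaceused call returns the row count and the reserved/data/index sizes – just a convenience sketch, not part of the tests themselves.)

-- row count and reserved / data / index space for the test table
EXEC sys.sp_spaceused @objname = N'dbo.SalesOrderDetailEnlarged';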

After populating the table, I backed up the database, backed up the log, and ran a DBCC SHRINKFILE (don't shoot me) so that the impact on the log file could be established from a baseline (knowing full well that these operations *will* cause the transaction log to grow).
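The baseline setup looked roughly like this (a sketch only; the backup paths and the 100 MB shrink target are assumptions, and delete_test / delete_test_log are the database and log file names referenced later in the post):

-- full backup, then a log backup, so subsequent log growth reflects only the tests
BACKUP DATABASE delete_test TO DISK = N'C:\backups\delete_test.bak' WITH INIT;
BACKUP LOG delete_test TO DISK = N'C:\backups\delete_test_log.trn' WITH INIT;

-- shrink the log back to a small, known size (the target, in MB, is an assumption)
DBCC SHRINKFILE (delete_test_log, 100);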

I purposely used a mechanical disk as opposed to an SSD. While the trend toward SSD is gaining momentum, it hasn't happened yet on a large enough scale; in many cases SSD is still cost-prohibitive for large storage volumes.

The Tests

So next I had to determine what I wanted to test for the greatest impact. Since I was involved in a discussion with a co-worker just yesterday about deleting data in chunks, I chose deletes. And since the clustered index on this table is on SalesOrderID, I didn't want to use that – that would be too easy (and would very rarely match the way deletes are handled in real life). So I decided instead to go after a series of ProductID values, which would ensure I would hit a large number of pages and require a lot of logging. I determined which products to delete with the following query:

SELECT TOP (3) 
  ProductID, ProductCount = COUNT(*)
FROM dbo.SalesOrderDetailEnlarged
GROUP BY ProductID
ORDER BY ProductCount DESC;

This yielded the following results:

ProductID  ProductCount
---------  ------------
870        187520
712        135280
873        134160

This would delete 456,960 rows (about 10% of the table), spread across many orders. This isn't a realistic modification in this context, since it will mess with pre-calculated order totals, and you can't really remove a product from an order that has already shipped. But using a database we all know and love, it is analogous to, say, deleting a user from a forum site, and also deleting all of their messages – a real scenario I have seen in the wild.

So one test would be to perform the following, one-shot delete:

DELETE dbo.SalesOrderDetailEnlarged WHERE ProductID IN (712, 870, 873);

I know this is going to require a massive scan and take a huge toll on the transaction log. That's kind of the point. :-)

While that was running, I put together a separate script that would perform the delete in chunks: 25,000, 50,000, 75,000 and 100,000 rows at a time. Each chunk is committed in its own transaction (so that if you need to stop the script, you can, and all previous chunks will already be committed, instead of having to start over), and, depending on the recovery model, is followed by either a CHECKPOINT or a BACKUP LOG to minimize the ongoing impact on the transaction log. (I also tested without these operations.) It looks something like this (I am not going to bother with error handling and other niceties for this test, but you shouldn't be as cavalier):

SET NOCOUNT ON;

DECLARE @r INT;

SET @r = 1;

WHILE @r > 0
BEGIN
  BEGIN TRANSACTION;

  DELETE TOP (100000) -- this will change
    dbo.SalesOrderDetailEnlarged
    WHERE ProductID IN (712, 870, 873);

  SET @r = @@ROWCOUNT;

  COMMIT TRANSACTION;

  -- CHECKPOINT;    -- if simple
  -- BACKUP LOG ... -- if full
END
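For the tests that did perform log maintenance between chunks, the commented lines were filled in along these lines (a sketch; the backup path and WITH INIT option are assumptions – in a real archive process you would typically append to your regular log backup chain instead):

  -- simple recovery: a checkpoint lets the inactive portion of the log be reused
  CHECKPOINT;

  -- full recovery: a log backup is required before the space can be reused
  BACKUP LOG delete_test TO DISK = N'C:\backups\delete_test_chunks.trn' WITH INIT;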

Of course, after each test, I would restore the original backup of the database WITH REPLACE, RECOVERY, set the recovery model accordingly, and run the next test.
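The reset between tests looked something like this (again a sketch, reusing the hypothetical backup path from above):

-- put the database back to the post-population baseline
RESTORE DATABASE delete_test FROM DISK = N'C:\backups\delete_test.bak'
  WITH REPLACE, RECOVERY;

-- set the recovery model for the next test
ALTER DATABASE delete_test SET RECOVERY FULL; -- or SIMPLE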

The Results

The outcome of the first test was not very surprising at all. Performing the delete in a single statement took 42 seconds in full recovery and 43 seconds in simple recovery. In both cases the log grew to 579 MB.

The next set of tests had a couple of surprises for me. One is that, while these chunking methods did significantly reduce the impact on the log file, only a couple of combinations came close in duration, and none were actually faster. Another is that, in general, chunking in full recovery (without performing a log backup between steps) performed better than equivalent operations in simple recovery. Here are the results for duration and log impact:

duration_small
Duration, in seconds, of various delete operations removing 457K rows

log_size_small
Log size, in MB, after various delete operations removing 457K rows

Again, in general, while log size is significantly reduced, duration is increased. You can use this kind of trade-off to decide whether it's more important to reduce the impact to disk space or to minimize the amount of time spent. For a small hit in duration (and after all, most of these processes run in the background), you can get a significant savings (up to 94% in these tests) in log space usage.

Note that I did not try any of these tests with compression enabled (possibly a future test!), and I left the log autogrow settings at the terrible defaults (10%) – partly out of laziness and partly because many environments out there have retained this awful setting.
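If you do want to get away from the percentage-based default, a fixed growth increment is a one-line change (the 512 MB value here is purely illustrative, not a recommendation derived from these tests):

-- replace the 10% default with a fixed growth increment
ALTER DATABASE delete_test MODIFY FILE
  (NAME = delete_test_log, FILEGROWTH = 512MB);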

But what if I have more data?

Next I thought I should test this on a slightly larger database. So I made another database and created a new, larger copy of dbo.SalesOrderDetailEnlarged – roughly ten times larger, in fact. This time, instead of a primary key on SalesOrderID, SalesOrderDetailID, I just made it a clustered index (to allow for duplicates), and populated it this way:

SELECT c.* 
  INTO dbo.SalesOrderDetailReallyReallyEnlarged 
  FROM AdventureWorks2012.Sales.SalesOrderDetailEnlarged AS c
  CROSS JOIN 
  (
    SELECT TOP 10 Number FROM master..spt_values
  ) AS x;

CREATE CLUSTERED INDEX so ON dbo.SalesOrderDetailReallyReallyEnlarged
  (SalesOrderID,SalesOrderDetailID);

-- I also made this index non-unique:
CREATE NONCLUSTERED INDEX rg ON dbo.SalesOrderDetailReallyReallyEnlarged(rowguid);

CREATE NONCLUSTERED INDEX p ON dbo.SalesOrderDetailReallyReallyEnlarged(ProductID);

Due to disk space limitations, I had to move off of my laptop's VM for this test (and chose a 40-core box, with 128 GB of RAM, that just happened to be sitting around quasi-idle :-)), and still it was not a quick process by any means. Population of the table and creation of the indexes took ~24 minutes.

The table has 48.5 million rows and takes up 7.9 GB on disk (4.9 GB in data and 2.9 GB in indexes).

This time, here is my query to determine a good set of candidate ProductID values to delete:

SELECT TOP (3) 
  ProductID, ProductCount = COUNT(*)
FROM dbo.SalesOrderDetailReallyReallyEnlarged
GROUP BY ProductID
ORDER BY ProductCount DESC;

It yielded the following results:

ProductID  ProductCount
---------  ------------
870        1828320
712        1318980
873        1308060

So we are going to delete 4,455,360 rows, a little under 10% of the table. Following a similar pattern to the test above, we're going to delete all in one shot, then in chunks of 500,000, 250,000 and 100,000 rows.
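The statements follow the same pattern as before, just against the bigger table and with the larger chunk sizes substituted into the TOP clause, e.g.:

-- one shot:
DELETE dbo.SalesOrderDetailReallyReallyEnlarged WHERE ProductID IN (712, 870, 873);

-- chunked (500000 / 250000 / 100000 at a time):
DELETE TOP (500000) dbo.SalesOrderDetailReallyReallyEnlarged
  WHERE ProductID IN (712, 870, 873);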

Results:

duration_10x
Duration, in seconds, of various delete operations removing 4.5MM rows

log_size_10x
Log size, in MB, after various delete operations removing 4.5MM rows

So again, we see a significant reduction in log file size (over 97% in cases with the smallest chunk size of 100K); however, at this scale, we see a few cases where we also accomplish the delete in less time, even with all the autogrow events that must have occurred. That sounds an awful lot like win-win to me!

This time with a bigger log

Now, I was curious how these different deletes would compare with a log file pre-sized to accommodate such large operations. Sticking with our larger database, I pre-expanded the log file to 6 GB, backed it up, then ran the tests again:

ALTER DATABASE delete_test MODIFY FILE
(NAME=delete_test_log, SIZE=6000MB);
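To confirm the new size, and to watch how much of the pre-sized log each test actually consumed, DBCC SQLPERF(LOGSPACE) is a convenient check before and after each run:

-- reports log size (MB) and percentage of log space used, per database
DBCC SQLPERF (LOGSPACE);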

Results, comparing duration with a fixed log file to the case where the file had to autogrow continuously:

fixed_log_compare
Duration, in seconds, of various delete operations removing 4.5MM rows, comparing fixed log size and autogrow

Again we see that the methods that chunk the deletes into batches, and do *not* perform a log backup or a checkpoint after each step, rival the equivalent single operation in terms of duration. In fact, most actually complete in less overall time, with the added bonus that other transactions will be able to get in and out between steps. That's a good thing – unless you want this delete operation to block all unrelated transactions.

Conclusion

It is clear that there is no single, correct answer to this problem – there are a lot of inherent "it depends" variables. It may take some experimenting to find your magic number, as there will be a balance between the overhead of backing up the log and how much work and time you save at different chunk sizes. But if you are planning to delete or archive a large number of rows, it is quite likely that you will be better off, overall, performing the changes in chunks rather than in one massive transaction – even though the duration numbers seem to make that a less attractive option. It's not all about duration: if you don't have a sufficiently pre-allocated log file, and don't have the space to accommodate such a massive transaction, it is probably much better to minimize log file growth at the cost of duration. In that case, you'll want to ignore the duration graphs above and pay attention to the log size graphs.

If you can afford the space, you still may or may not want to pre-size your transaction log accordingly. Depending on the scenario, sometimes using the default autogrow settings ended up slightly faster in my tests than using a fixed log file with plenty of room. Plus, it may be tough to guess exactly how much you'll need to accommodate a large transaction you haven't run yet. If you can't test a realistic scenario, try your best to picture your worst case scenario – then, for safety, double it. Kimberly Tripp (blog | @KimberlyLTripp) has some great advice in this post: 8 Steps to better Transaction Log throughput – in this context, specifically, look at point #6. Regardless of how you decide to calculate your log space requirements, if you're going to end up needing the space anyway, better to take it in a controlled fashion well in advance, than to halt your business processes while they wait for an autogrow (never mind multiple!).

Another very important facet of this that I did not measure explicitly is the impact on concurrency – a bunch of shorter transactions will, in theory, have less impact on concurrent operations. While a single delete took slightly less time than the longer, batched operations, it held all of its locks for that entire duration, whereas the chunked operations allow other queued transactions to sneak in between each chunk. In a future post I'll try to take a closer look at this impact (and I have plans for other deeper analysis as well).