Aaron Bertrand

"0 to 60" : Switching to indirect checkpoints

Downtime and Performance Issues? Detect the root cause with SQL Sentry
SentryOne eBooks

In these books, you will find useful, hand-picked articles that will help give insight into some of your most vexing performance problems. These articles were written by several of the SQL Server industry’s leading experts, including Paul White, Paul Randal, Jonathan Kehayias, Erin Stellato, Glenn Berry, Aaron Bertrand, and Joe Sack.

Free Download

Featured Author

Paul White is an independent SQL Server consultant specializing in performance tuning, execution plans, and the query optimizer.

Paul’s Posts

In a recent tip, I described a scenario where a SQL Server 2016 instance seemed to be struggling with checkpoint times. The error log was populated with an alarming number of FlushCache entries like this one:

FlushCache: cleaned up 394031 bufs with 282252 writes in 65544 ms (avoided 21 new dirty bufs) for db 19:0
      average writes per second: 4306.30 writes/sec
      average throughput:        46.96 MB/sec, I/O saturation: 53644, context switches 101117
      last target outstanding:   639360, avgWriteLatency 1

I was a bit perplexed by this issue, since the system was certainly no slouch — plenty of cores, 3TB of memory, and XtremIO storage. And none of these FlushCache messages were ever paired with the telltale 15 second I/O warnings in the error log. Still, if you stack a bunch of high-transaction databases on there, checkpoint processing can get pretty sluggish. Not so much because of the direct I/O, but more reconciliation that has to be done with a massive number of dirty pages (not just from committed transactions) scattered across such a large amount of memory, and potentially waiting on the lazywriter (since there is only one for the whole instance).

I did some quick "freshen-up" reading of some very valuable posts:

I quickly decided that I wanted to track checkpoint durations for a few of these more troublesome databases, before and after changing their target recovery interval from 0 (the old way) to 60 seconds (the new way). Back in January, I borrowed an Extended Events session from friend and fellow Canadian Max Vernon:

CREATE EVENT SESSION CheckpointTracking ON SERVER 
ADD EVENT sqlserver.checkpoint_begin
(
  WHERE 
  (
       sqlserver.database_id = 19 -- db4
    OR sqlserver.database_id = 78 -- db2
    ...
  )
)
, ADD EVENT sqlserver.checkpoint_end
(
  WHERE 
  (
       sqlserver.database_id = 19 -- db4
    OR sqlserver.database_id = 78 -- db2
    ...
  )
)
ADD TARGET package0.event_file
(
  SET filename = N'L:\SQL\CP\CheckPointTracking.xel',
      max_file_size = 50, -- MB
      max_rollover_files = 50
)
WITH 
(
  MAX_MEMORY = 4096 KB,
  MAX_DISPATCH_LATENCY = 30 SECONDS, 
  STARTUP_STATE = ON
);
GO
 
ALTER EVENT SESSION CheckpointTracking ON SERVER 
  STATE = START;

I marked the time that I changed each database, and then analyzed the results from the Extended Events data using a query published in the original tip. The results showed that after changing to indirect checkpoints, each database went from checkpoints averaging 30 seconds to checkpoints averaging less than a tenth of a second (and far fewer checkpoints in most cases, too). There's a lot to unpack from this graphic, but this is the raw data I used to present my argument (click to enlarge):

My evidenceMy evidence

Once I proved my case across these problematic databases, I got the green light to implement this across all of our user databases throughout our environment. In dev first, and then in production, I ran the following via a CMS query to get a gauge for how many databases we were talking about:

DECLARE @sql nvarchar(max) = N'';
 
SELECT @sql += N'ALTER DATABASE ' + QUOTENAME(name) + ' SET TARGET_RECOVERY_TIME = 60 SECONDS;' 
  FROM sys.databases AS d 
  WHERE target_recovery_time_in_seconds = 0
    AND database_id > 4 
    AND [state] = 0
    AND is_read_only = 0
    AND NOT EXISTS 
    (
      SELECT 1 FROM sys.dm_hadr_availability_replica_states AS s
        INNER JOIN sys.availability_databases_cluster AS c
             ON s.group_id  = c.group_id 
          WHERE c.database_name = d.name
            AND (s.role_desc <> 'PRIMARY' OR s.is_local = 0) 
    );
 
SELECT DatabaseCount = @@ROWCOUNT, Version = @@VERSION, cmd = @sql;
 
--EXEC sys.sp_executesql @sql;

Some notes about the query:

  • database_id > 4
    I didn't want to touch master at all, and I didn't want to change tempdb yet because we're not on the latest SQL Server 2017 CU (see KB #4497928 for one reason that detail is important). The latter rules out model, too, because changing model would affect tempdb on the next failover / restart. I could have changed msdb, and I may go back to do that at some point, but my focus here was on user databases.
     
  • [state] / is_read_only
    We need to make sure the databases we're trying to change are online and not read only (I hit one that was currently set to read only, and will have to come back to that one later).
     
  • AND NOT EXISTS (...)
    We want to restrict our actions to databases that are either the primary in an AG or not in an AG at all. Thanks to Erik Darling for helping with this logic.
     
  • If you have instances running older versions like SQL Server 2008 R2 (I found one!), you'll have to tweak this a little bit, since the target_recovery_time_in_seconds column doesn't exist there. I had to use dynamic SQL to get around this in one case, but you could also temporarily move or remove where those instances fall in your CMS hierarchy. You could also not be lazy like me, and run the code in Powershell instead of a CMS query window, where you could easily filter out databases given any number of properties before ever hitting compile-time issues.

In production, there were 102 instances (about half) and 1,590 total databases using the old setting. Everything was on SQL Server 2017, so why was this setting so prevalent? Because they were created before indirect checkpoints became the default in SQL Server 2016. Here is a sample of the results:

Partial results from CMS query.

Then I ran the CMS query again, this time with sys.sp_executesql uncommented. It took about 12 minutes to run that across all 1,590 databases. Within the hour, I was already getting reports of people observing a significant drop in CPU on some of the busier instances.

I still have more to do. For example, I need to test out the potential impact on tempdb, and whether there is any weight in our use case to the horror stories I've heard. And we need to make sure that the 60 second setting is part of our automation and all database creation requests, especially those that are scripted or restored from backups.