"0 to 60" : Switching to indirect checkpoints

In a recent tip, I described a scenario where a SQL Server 2016 instance seemed to be struggling with checkpoint times. The error log was populated with an alarming number of FlushCache entries like this one:

FlushCache: cleaned up 394031 bufs with 282252 writes in 65544 ms (avoided 21 new dirty bufs) for db 19:0
      average writes per second: 4306.30 writes/sec
      average throughput:        46.96 MB/sec, I/O saturation: 53644, context switches 101117
      last target outstanding:   639360, avgWriteLatency 1

I was a bit perplexed by this issue, since the system was certainly no slouch — plenty of cores, 3TB of memory, and XtremIO storage. And none of these FlushCache messages were ever paired with the telltale 15 second I/O warnings in the error log. Still, if you stack a bunch of high-transaction databases on there, checkpoint processing can get pretty sluggish. Not so much because of the direct I/O, but more reconciliation that has to be done with a massive number of dirty pages (not just from committed transactions) scattered across such a large amount of memory, and potentially waiting on the lazywriter (since there is only one for the whole instance).

I did some quick "freshen-up" reading of some very valuable posts:

I quickly decided that I wanted to track checkpoint durations for a few of these more troublesome databases, before and after changing their target recovery interval from 0 (the old way) to 60 seconds (the new way). Back in January, I borrowed an Extended Events session from friend and fellow Canadian Hannah Vernon:

CREATE EVENT SESSION CheckpointTracking ON SERVER 
ADD EVENT sqlserver.checkpoint_begin
(
  WHERE 
  (
       sqlserver.database_id = 19 -- db4
    OR sqlserver.database_id = 78 -- db2
    ...
  )
)
, ADD EVENT sqlserver.checkpoint_end
(
  WHERE 
  (
       sqlserver.database_id = 19 -- db4
    OR sqlserver.database_id = 78 -- db2
    ...
  )
)
ADD TARGET package0.event_file
(
  SET filename = N'L:\SQL\CP\CheckPointTracking.xel',
      max_file_size = 50, -- MB
      max_rollover_files = 50
)
WITH 
(
  MAX_MEMORY = 4096 KB,
  MAX_DISPATCH_LATENCY = 30 SECONDS, 
  TRACK_CAUSALITY = ON,
  STARTUP_STATE = ON
);
GO

ALTER EVENT SESSION CheckpointTracking ON SERVER 
  STATE = START;

I marked the time that I changed each database, and then analyzed the results from the Extended Events data using a query published in the original tip. The results showed that after changing to indirect checkpoints, each database went from checkpoints averaging 30 seconds to checkpoints averaging less than a tenth of a second (and far fewer checkpoints in most cases, too). There's a lot to unpack from this graphic, but this is the raw data I used to present my argument (click to enlarge):

My evidence

Once I proved my case across these problematic databases, I got the green light to implement this across all of our user databases throughout our environment. In dev first, and then in production, I ran the following via a CMS query to get a gauge for how many databases we were talking about:

DECLARE @sql nvarchar(max) = N'';

SELECT @sql += CASE
  WHEN (ag.role = N'PRIMARY' AND ag.ag_status = N'READ_WRITE') OR ag.role IS NULL THEN N'
    ALTER DATABASE ' + QUOTENAME(d.name) + N' SET TARGET_RECOVERY_TIME = 60 SECONDS;' 
  ELSE N'
    PRINT N''-- fix ' + QUOTENAME(d.name) + N' on Primary.'';' 
  END
FROM sys.databases AS d 
OUTER APPLY
(
  SELECT role = s.role_desc, 
    ag_status = DATABASEPROPERTYEX(c.database_name, N'Updateability')
    FROM sys.dm_hadr_availability_replica_states AS s
    INNER JOIN sys.availability_databases_cluster AS c
       ON s.group_id = c.group_id 
       AND d.name = c.database_name
    WHERE s.is_local = 1
) AS ag
WHERE d.target_recovery_time_in_seconds <> 60
  AND d.database_id > 4 
  AND d.[state] = 0 
  AND d.is_in_standby = 0 
  AND d.is_read_only = 0;

SELECT DatabaseCount = @@ROWCOUNT, Version = @@VERSION, cmd = @sql;

--EXEC sys.sp_executesql @sql;

Some notes about the query:

database_id > 4
I didn't want to touch master at all, and I didn't want to change tempdb yet because we're not on the latest SQL Server 2017 CU (see KB #4497928 for one reason that detail is important). The latter rules out model, too, because changing model would affect tempdb on the next failover / restart. I could have changed msdb, and I may go back to do that at some point, but my focus here was on user databases.
[state] / is_read_only / is_in_standby
We need to make sure the databases we're trying to change are online and not read only (I hit one that was currently set to read only, and will have to come back to that one later).
OUTER APPLY (...)
We want to restrict our actions to databases that are either the primary in an AG or not in an AG at all (and also have to account for distributed AGs, where we can be primary and local but still not be writable). If you happen to run the check on a secondary, you can't fix the issue there, but you should still get a warning about it. Thanks to Erik Darling for helping with this logic, and Taylor Martell for motivating improvements.
If you have instances running older versions like SQL Server 2008 R2 (I found one!), you'll have to tweak this a little bit, since the target_recovery_time_in_seconds column doesn't exist there. I had to use dynamic SQL to get around this in one case, but you could also temporarily move or remove where those instances fall in your CMS hierarchy. You could also not be lazy like me, and run the code in Powershell instead of a CMS query window, where you could easily filter out databases given any number of properties before ever hitting compile-time issues.

In production, there were 102 instances (about half) and 1,590 total databases using the old setting. Everything was on SQL Server 2017, so why was this setting so prevalent? Because they were created before indirect checkpoints became the default in SQL Server 2016. Here is a sample of the results:

Partial results from CMS query.

Then I ran the CMS query again, this time with sys.sp_executesql uncommented. It took about 12 minutes to run that across all 1,590 databases. Within the hour, I was already getting reports of people observing a significant drop in CPU on some of the busier instances.

I still have more to do. For example, I need to test out the potential impact on tempdb, and whether there is any weight in our use case to the horror stories I've heard. And we need to make sure that the 60 second setting is part of our automation and all database creation requests, especially those that are scripted or restored from backups.

4 thoughts on “"0 to 60" : Switching to indirect checkpoints”

Dev Bethanasamy says:

May 1, 2020 at 7:59 PM

Why should we leave model DB to 0. Why not set the model DB to 60 as well and If you would like to disable indirect checkpoint on tempdb you can add TF 3468 to the startup parameters of your SQL Server instance.
Aaron Bertrand says:

May 1, 2020 at 8:14 PM

Dev, it was a choice, like any other. You can make your own decision; these are the criteria I used. While setting model to 60 would ensure new databases inherit the new setting:
- We have many configs and options we put into our deployment scripts for creating / restoring databases and can make those a part of our checks and rules, review process, etc., so relying on model for this is not crucial (we currently don't deploy any databases without overriding virtually everything they would have otherwise inherited from model).
- We certainly don't have enough activity going on in model to have it benefit directly from the setting.
- We aren't sure how it might affect tempdb and will simply enable it on both when we have had enough time to test thoroughly (but not until after we've finished getting caught up on latest CUs). There's also some minor benefit to changing fewer things at once. If we changed all user databases and tempdb, and things blew up, it would be harder to diagnose whether one or both contributed.
- We prefer not to enable trace flags when possible, especially when there are other ways to achieve the same thing, because they're less discoverable (and how multiple trace flags behave together can be unpredictable).
If you have different priorities, cool, set model to 60 and use the trace flag. I was merely explaining what I did in my case.
Dev Bethanasamy says:

May 2, 2020 at 12:04 AM

Thanks for the clarification. We have seen benefit on setting the model to 60 every new databases, except tempdb, so we chose to enable at the model db, so we don't have to keep checking for the new databases. But we have seen a drastic poor performance at the tempdb, so we have enabled TF 3468. Moreover, we set all DB properties we require at the model DB and rely on that heavily. Yes I agree, it is a choice based on the requirements.
john kauffman says:

May 29, 2020 at 2:25 PM

Thanks for this! I had been getting those messages in the error log, and didn't know they were related to checkpointing. I captured the xevents for a day before the change, and went from 1000 seconds/hour across all DBs on one of my servers, to 10 seconds/hour.

Comments are closed.