Aaron Bertrand

Minimizing the impact of DBCC CHECKDB : DOs and DON'Ts

SentryOne Newsletters

The SQLPerformance.com bi-weekly newsletter keeps you up to speed on the most recent blog posts and forum discussions in the SQL Server community.

eNews is a bi-monthly newsletter with fun information about SentryOne, tips to help improve your productivity, and much more.

Subscribe

Featured Author

Jonathan Kehayias is a Principal Consultant with SQLskills and the youngest MCM ever.

Jonathan’s Posts

Your responsibilities as a DBA (or <insert role here>) probably include things like performance tuning, capacity planning, and disaster recovery. What many people tend to forget or defer is ensuring the integrity of the structure of their databases (both logical and physical); the most important step being DBCC CHECKDB. You can get partway there by creating a simple maintenance plan with a "Check Database Integrity Task" – however, in my mind, this is just checking a checkbox.

Maintenance Plan

If you look closer, there is very little you can do to control how the task operates. Even the quite expansive Properties panel exposes a whole lot of settings for the maintenance subplan, but virtually nothing about the DBCC commands it will run. Personally I think you should take a much more proactive and controlled approach to how you perform your CHECKDB operations in production environments, by creating your own jobs and manually hand-crafting your DBCC commands. You might tailor your schedule or the commands themselves to different databases – for example the ASP.NET membership database is probably not as crucial as your sales database, and could tolerate less frequent and/or less thorough checks.

But for your crucial databases, I thought I would put together a post to detail some of the things I would investigate in order to minimize the disruption DBCC commands may cause – and what myths and marketing hoopla you should be wary of. And I want to thank Paul "Mr. DBCC" Randal (@PaulRandal) for providing valuable input – not only to this specific post, but also the endless advice he provides on his blog, #sqlhelp and in SQLskills Immersion training.

Please take all of these ideas with a grain of salt, and do your best to perform adequate testing in your environment – not all of these suggestions will yield better performance in all environments. But you owe it to yourself, your users and your stakeholders to at least consider the impact that your CHECKDB operations might have, and take steps to mitigate those effects where feasible – without introducing unnecessary risk by not checking the right things.

Reduce the noise and consume all errors

No matter where you are running CHECKDB, always use the WITH NO_INFOMSGS option. This simply suppresses all the irrelevant output that just tells you how many rows are in each table; if you're interested in that information, you can get it from simple queries against DMVs and not while DBCC is running. Suppressing the output makes it far less likely that you'll miss a critical message buried in all that happy output.

Similarly, you should always use the WITH ALL_ERRORMSGS option, but especially if you are running SQL Server 2008 RTM or SQL Server 2005 (in those cases, you may see the list of per-object errors truncated to 200). For any CHECKDB operations other than quick ad-hoc checks, you should consider directing output to a file. Management Studio is limited to 1000 lines of output from DBCC CHECKDB, so you might miss out on some errors if you exceed this figure.

While not strictly a performance issue, using these options will prevent you from having to run the process again. This is particularly critical if you're in the middle of disaster recovery.

Offload logical checks where possible

In most cases, CHECKDB spends the majority of its time performing logical checks of the data. If you have the ability to perform these checks on a true copy of the data, you can focus your efforts on the physical structure of your production systems, and use the secondary server to handle all of the logical checks and alleviate that load from the primary. By secondary server, I mean only the following:

  • The place where you test your full restores – because you test your restores, right?

Other folks (most notably the behemoth marketing force that is Microsoft) might have convinced you that other forms of secondary servers are suitable for DBCC checks. For example:

  • an AlwaysOn Availability Group readable secondary;
  • a snapshot of a mirrored database;
  • a log shipped secondary;
  • SAN mirroring;
  • or other variations…

Unfortunately, this is not the case, and none of these secondaries are valid, reliable places to perform your checks as an alternative to the primary. Only a one-for-one backup can serve as a true copy; anything else that relies on things like the application of log backups to get to a consistent state is not going to reliably reflect integrity problems on the primary.

So rather than try to offload your logical checks to a secondary and never perform them on the primary, here is what I suggest:

  1. Make sure you are frequently testing the restores of your full backups. And no, this does not include COPY_ONLY backups from from an AG secondary, for the same reasons as above – that would only be valid in the case where you have just initiated the secondary with a full restore.
  2. Run DBCC CHECKDB often against the full restore, before doing anything else. Again, replaying log records at this point will invalidate this database as a true copy of the source.
  3. Run DBCC CHECKDB against your primary, perhaps broken up in ways that Paul Randal suggests, and/or on a less frequent schedule, and/or using PHYSICAL_ONLY more often than not. This can depend on how often and reliably you are performing (2).
  4. Never assume that checks against the secondary are enough. Even with an exact replica of your primary database, there are still physical issues that can occur on the I/O subsystem of your primary that will never propagate to the secondary.
  5. Always analyze DBCC output. Just running it and ignoring it, to check it off some list, is as helpful as running backups and claiming success without ever testing that you can actually restore that backup when needed.

Experiment with trace flags 2549, 2562, and 2566

I've done some thorough testing of two trace flags (2549 and 2562) and have found that they can yield substantial performance improvements, however Lonny reports they are no longer necessary or useful. If you're on 2016 or newer, skip this entire section. If you're on an older version, these two trace flags are described in a lot more detail in KB #2634571, but basically:

  • Trace Flag 2549
    • This optimizes the checkdb process by treating each individual database file as residing on a unique underlying disk. This is okay to use if your database has a single data file, or if you know that each database file is, in fact, on a separate drive. If your database has multiple files and they share a single, direct-attached spindle, you should be wary of this trace flag, as it may do more harm than good.

      IMPORTANT: sql.sasquatch reports a regression in this trace flag behavior in SQL Server 2014.
       

  • Trace Flag 2562
    • This flag treats the entire checkdb process as a single batch, at the cost of higher tempdb utilization (up to 5% of the database size).
    • Uses a better algorithm to determine how to read pages from the database, reducing latch contention (specifically for DBCC_MULTIOBJECT_SCANNER). Note that this specific improvement is in the SQL Server 2012 code path, so you will benefit from it even without the trace flag. This can avoid errors such as:
       
      Timeout occurred while waiting for latch: class 'DBCC_MULTIOBJECT_SCANNER'.
  • The above two trace flags are available in the following versions:
       

      SQL Server 2008 Service Pack 2 Cumulative Update 9+
        (10.00.4330 -> 10.00.5499)

      SQL Server 2008 Service Pack 3 Cumulative Update 4+
        (10.00.5775+)

      SQL Server 2008 R2 RTM Cumulative Update 11+
        (10.50.1809 -> 10.50.2424)

      SQL Server 2008 R2 Service Pack 1 Cumulative Update 4+
        (10.50.2796 -> 10.50.3999)

      SQL Server 2008 R2 Service Pack 2
        (10.50.4000+)

      SQL Server 2012, all versions
        (11.00.2100+)

  • Trace Flag 2566
    • If you are still using SQL Server 2005, this trace flag, introduced in 2005 SP2 CU#9 (9.00.3282) (though not documented in that Cumulative Update's Knowledge Base article, KB #953752), attempts to correct poor performance of DATA_PURITY checks on x64-based systems. At one point, you could see more details in KB #945770, but it seems that article was scrubbed from both Microsoft's support site and the WayBack machine. This trace flag should not be necessary in more modern versions of SQL Server, as the problem in the query processor has been fixed.

If you're going to use any of these trace flags, I highly recommend setting them at the session level using DBCC TRACEON rather than as a startup trace flag. Not only does it enable you to turn them off without having to cycle SQL Server, but it also allows you to implement them only when performing certain CHECKDB commands, as opposed to operations using any type of repair.

Reduce I/O impact: optimize tempdb

DBCC CHECKDB can make heavy use of tempdb, so make sure you plan for resource utilization there. This is usually a good thing to do in any case. For CHECKDB you'll want to properly allocate space to tempdb; the last thing you want is for CHECKDB progress (and any other concurrent operations) to have to wait for an autogrow. You can get an idea for requirements using WITH ESTIMATEONLY, as Paul explains here. Just be aware that the estimate can be quite low due to a bug in SQL Server 2008 R2. Also if you are using trace flag 2562 be sure to accommodate for the additional space requirements.

And of course, all of the typical advice for optimizing tempdb on just about any system is appropriate here as well: make sure tempdb is on its own set of fast spindles, make sure it is sized to accommodate all other concurrent activity without having to grow, make sure you are using an optimal number of data files, etc. A few other resources you might consider:

Reduce I/O impact: control the snapshot

In order to run CHECKDB, modern versions of SQL Server will attempt to create a hidden snapshot of your database on the same drive (or on all of the drives if your data files span multiple drives). You can't control this mechanism, but if you want to control where CHECKDB operates, create your own snapshot first (Enterprise Edition required) on whatever drive you like, and run the DBCC command against the snapshot. In either case, you'll want to run this operation during a relative downtime, to minimize the copy-on-write activity that will go through the snapshot. And you won't want this schedule to conflict with any heavy write operations, like index maintenance or ETL.

You may have seen suggestions to force CHECKDB to run in offline mode using the WITH TABLOCK option. I strongly recommend against this approach. If your database is actively being used, choosing this option will just make users frustrated. And if the database is not actively being used, you're not saving any disk space by avoiding a snapshot, since there will be no copy-on-write activity to store.

Reduce I/O impact: avoid 665 / 1450 / 1452 errors

In some cases you may see one of the following errors:

The operating system returned error 1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x[…] in file with handle 0x[…]. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.

 

The operating system returned error 665(The requested operation could not be completed due to a file system limitation) to SQL Server during a write at offset 0x[…] in file '[file]'

There are some tips here for reducing the risk of these errors during CHECKDB operations, and reducing their impact in general – with several fixes available, depending on your operating system and SQL Server version:

Reduce CPU impact

DBCC CHECKDB is multi-threaded by default (but only in Enterprise Edition). If your system is CPU-bound, or you just want CHECKDB to use less CPU at the cost of running longer, you can consider reducing parallelism in a couple of different ways:

  1. Use Resource Governor on 2008 and above, as long as you are running Enterprise Edition. To target just DBCC commands for a particular resource pool or workload group, you'll have to write a classifier function that can identify the sessions that will be performing this work (e.g. a specific login or a job_id).
  2. Use Trace flag 2528 to turn off parallelism for DBCC CHECKDB (as well as CHECKFILEGROUP and CHECKTABLE). Trace flag 2528 is described here. Of course this is only valid in Enterprise Edition, because in spite of what Books Online currently says, the truth is that CHECKDB does not go parallel in Standard Edition.
  3. While the DBCC command itself does not support MAXDOP (at least prior to SQL Server 2014 SP2), it does respect the global setting max degree of parallelism. Probably not something I would do in production unless I had no other options, but this is one overarching way to control certain DBCC commands if you can't target them more explicitly.

We'd been asking for better control over the number of CPUs that DBCC CHECKDB uses, but they had been repeatedly denied until SQL Server 2014 SP2. So you can now add WITH MAXDOP = n to the command.

My Findings

I wanted to demonstrate a few of these techniques in an environment I could control. I installed AdventureWorks2012, then expanded it using the AW enlarger script written by Jonathan Kehayias (blog | @SQLPoolBoy), which grew the database to about 7 GB. Then I ran a series of CHECKDB commands against it, and timed them. I used a plain vanilla DBCC CHECKDB on its own, then all other commands used WITH NO_INFOMSGS, ALL_ERRORMSGS. Then four tests with (a) no trace flags, (b) 2549, (c) 2562, and (d) both 2549 and 2562. Then I repeated those four tests, but added the PHYSICAL_ONLY option, which bypasses all of the logical checks. The results (averaged over 10 test runs) are telling:

CHECKDB results against 7 GB database
CHECKDB results against 7 GB database

Then I expanded the database some more, making many copies of the two enlarged tables, leading to a database size just north of 70 GB, and ran the tests again. The results, again averaged over 10 test runs:

CHECKDB results against 70 GB database
CHECKDB results against 70 GB database

In these two scenarios, I have learned the following (again, keeping in mind that your mileage may vary, and that you will need to perform your own tests to draw any meaningful conclusions):

  1. When I have to perform logical checks:
    • At small database sizes, the NO_INFOMSGS option can cut processing time significantly when the checks are run in SSMS. On larger databases, however, this benefit diminishes, as the time and work spent relaying the information becomes such an insignificant portion of the overall duration. 21 seconds out of 2 minutes is substantial; 88 seconds out of 35 minutes, not so much.
    • The two trace flags I tested had a significant impact on performance – representing a runtime reduction of 40-60% when both were used together.
       
  2. When I can push logical checks to a secondary server (again, assuming that I am performing logical checks elsewhere against a true copy):
    • I can reduce processing time on my primary instance by 70-90% compared to a standard CHECKDB call with no options.
    • In my scenario, the trace flags had very little impact on duration when performing PHYSICAL_ONLY checks.

Of course, and I can't stress this enough, these are relatively small databases and only used so that I could perform repeated, measured tests in a reasonable amount of time. This server had 80 logical CPUs and 128 GB RAM, and I was the only user. Duration and interaction with other workloads on the system may skew these results quite a bit. Here is a quick glimpse of typical CPU usage, using SQL Sentry, during one of the CHECKDB operations (and none of the options really changed the overall impact on CPU, just duration):

CPU impact during CHECKDB
CPU impact during CHECKDB – sample mode

And here is another view, showing similar CPU profiles for three different sample CHECKDB operations in historical mode (I've overlaid a description of the three tests sampled in this range):

CPU impact during CHECKDB
CPU impact during CHECKDB – historical mode

On even larger databases, hosted on busier servers, you may see different effects, and your mileage is quite likely to vary. So please perform your due diligence and test out these options and trace flags during a typical concurrent workload before deciding how you want to approach CHECKDB.

Conclusion

DBCC CHECKDB is a very important but often undervalued part of your responsibility as a DBA or architect, and crucial to the protection of your company's data. Do not take this responsibility lightly, and do your best to ensure that you do not sacrifice anything in the interest of reducing impact on your production instances. Most importantly: look beyond the marketing data sheets to be sure you fully understand how valid those promises are and whether you are willing to bet your company's data on them. Skimping out on some checks or offloading them to invalid secondary locations could be a disaster waiting to happen.

You should also consider reading these PSS articles:

And this post from Brent Ozar:

Finally, if you have an unresolved question about DBCC CHECKDB, post it to the #sqlhelp hash tag on twitter. Paul checks that tag often and, since his picture should appear in the main Books Online article, it's likely that if anyone can answer it, he can. If it's too complex for 140 characters, you can ask here (and I will make sure Paul sees it at some point), or post to a forum site such as Database Administrators Stack Exchange.