In this post I'll discuss a general methodology for troubleshooting CPU performance issues. I like applying methodologies by default and I also like building efficiencies in how I troubleshoot issues based on past experiences. Without a general framework, it becomes too easy to miss the true root cause in the middle of a crisis.
The steps I'll describe in this post are as follows:
- Define the problem
- Validate the current conditions
- Answer "Is it SQL Server"?
- Identify CPU consumers
- Match the pattern and resolve
This article will cover each of these steps. I'll be making an assumption that you may not be using a third-party monitoring tool. If you are though, the framework here still applies, but your data sources and tools at your disposal will vary from what I describe.
Define the problem
First we need to scope the issue. When someone comes up to you and says they are seeing a CPU performance issue, this could mean any number of different things. So the first task is to understand what the nature of the CPU performance issue currently is.
Some common categories include:
- Availability being impacted due to "pegged CPUs". For example – all schedulers running at 100% across the board and throughput being stalled or significantly reduced.
- Degradation of performance due to "higher than normal" CPU usage. So we're not pegged, but your CPUs are running at a higher percentage than is ordinary and presumably it is impacting performance.
- Another common category of CPU performance issue is the "winners and losers" scenario where workloads are competing against each other. Perhaps you have an OLTP workload that is encountering reduced throughput due to a parallel executing report query.
- Another problem might be the encountering of a tipping point – where the overall capacity and scalability limitations of your system are hit at a certain point.
I mention these over-arching categories as a starting point, but I know that often there can be heavy dependencies across these issues and one categorization can blend into the other. With that said, the first step is to define the symptoms and problems as clearly as possible.
Validate the current conditions
Whether the issue happened in the past or is happening right now, it is important to get as much background information about the system, workload and configurations as possible. If you're using baselines and run-books, ideally you're tracking much of this information already. If not, ask yourself how quickly you could get answers to these questions at 2AM in the middle of a crisis.
The following sub-sections cover important data points that I'm typically interested in for a CPU-performance issue.
- How many sockets and cores?
- Is hyper-threading enabled?
- What is the processor model, architecture (32-bit/64-bit)?
- Is this a virtual guest?
- If so, you're now also going to be interested in details about the host and the other virtual guests you're sharing resources with.
- Are there any CPU-related settings in effect?
- For example, Hyper-V CPU
- How many vCPUs are allocated across guests?
- How many vCPUs does this guest have?
- Was the guest recently migrated to a new host prior to the issue?
- Max degree of parallelism setting
- Cost threshold for parallelism option
- Processor affinity setting
- Priority boost setting
- Max worker threads setting
- Lightweight pooling setting
- What is the power-option setting? (OS level, VM Host or BIOS controlled)
- High Performance, Balanced, Power Saving?
- Is it configured beyond the default settings?
- Do you see any unusual warnings or errors?
Physical server details
Virtual server details
Reserve, VMware CPU Reservation, Hyper-V CPU Relative Weight, and VMware CPU Shares.
SQL Server instance configuration settings
The first three configurations may require further discussion. There are rarely absolutes regarding these settings.
Regarding the last three settings, such as "priority boost", if I see that they are at non-default values I'm definitely going to be pushing for more background information and history.
CPU power-option settings
Power-option settings below "High Performance" are still very common and shouldn't be ignored for servers that host SQL Server instances.
Resource Governor configuration
I still find that it is rare to encounter customers using this feature at all, but it is easy to validate whether it is being used and will be worth it for the times that it is actually configured beyond the default.
SQL Server error log and Windows event logs
Why look in the error and event logs for a CPU issue? Sometimes upstream issues can cause downstream performance issues in SQL Server. You don't want to waste time tuning a query or adding a new index when you're upstream root-cause issue is a hardware component degradation issue.
Answer "Is it SQL Server?"
It sounds obvious when I ask it, but you really don't want to spend a significant amount of time troubleshooting a high CPU issue in SQL Server if the culprit isn't actually SQL Server.
Instead, take a quick moment to check which process is consuming the most CPU. There are several options to choose from, including:
- Process: % User Time (user mode)
- Process: % Privileged Time (kernel mode)
- Task Manager
- Process Explorer
- Recent CPU information via sys.dm_os_ring_buffers or the system health session for the specific SQL Server instances running on the system
If it is SQL Server and you have multiple SQL Server instances to choose from, be sure you're troubleshooting the right SQL Server instance on the host. There are a few ways to do this, including the use of
SELECT SERVERPROPERTY('processid') to get the PID and then associating it to Task Manager or Process Explorer.
Once you've confirmed it is SQL Server, are you seeing high user time or privileged (kernel) time? Again this can be confirmed via Process: % Privileged Time (sqlservr object) and also Windows Task Manager or Process Explorer.
While high kernel time issues should be rare, they still require different troubleshooting paths than standard user time CPU troubleshooting issues. Some potential causes of high kernel time include faulty filter-drivers (anti-virus, encryption services), out-of-date or missing firmware updates and drivers, or defective I/O components.
Identify CPU consumers
Once you've validated which SQL Server instance is driving the user-time CPU usage on the system, there are plenty of pre-canned query examples out on the web that you could use.
Below is a list of DMVs that people commonly use in various forms during a performance issue. I structured this in a Q&A format to help frame why you would want to access them.
- Aggregate by total_worker_time
- Define averages with execution_count
- If ad hoc workloads, you could group by query_hash
- Use the plan_handle with sys.dm_exec_query_plan to grab the plan
- Ordered by session_id, request_id
- Look at plan operators – but keep in mind this is just the estimated plan
- Filter total_elapsed_time less than total_worker_time
- But note that this can be a false negative for blocking scenarios – where duration is inflated due to a wait on resource
What requests are executing right now and what is their status?
What is it executing?
Where is it from?
What is its estimated plan? (but be careful of shredding xml on an already-CPU-constrained system)
Who's waiting on a resource and what are they waiting for?
Which queries have taken up the most CPU time since the last restart?
Is this query using parallelism?
Match the pattern and resolve
You're probably laughing at this particular step – as this one can be the most involved (and is another reason why SQL Server professionals are gainfully employed). There are several different patterns and associated resolutions – so I'll finish this post with a list of the more common CPU performance issue drivers that I've seen over the last few years:
- High I/O operations (and in my experience this is the most common driver of CPU)
- Cardinality estimate issues (and associated poor query plan quality)
- Unexpected parallelism
- Excessive compilation / recompilation
- Calculation-intensive UDF calls, shredding operations
- Row-by-agonizing row operations
- Concurrent maintenance activities (e.g. UPDATE stats with FULLSCAN)
Each area I've identified has a large associated body of work to research. In terms of consolidated resources, I still think one of the better ones is still the "Troubleshooting Performance Problems in SQL Server 2008" technical article written by Sunil Agarwal, Boris Baryshnikov, Keith Elmore, Juergen Thomas, Kun Cheng and Burzin Patel.
As with any methodology, there are boundaries for its utilization and areas where you are justified in improvising. Please note that I'm not suggesting the steps I described in this post be used as a rigid framework, but instead consider it to be a launch-point for your troubleshooting efforts. Even highly experienced SQL Server professionals can make rookie mistakes or be biased by their more recent troubleshooting experiences, so having a minimal methodology can help avoid troubleshooting the wrong issue.