Performance troubleshooting is an art and a science. The art comes from experience (and learning from others’ experiences) and the science comes from well-known guidelines about what to do in which scenarios.
Or at least that’s what I like to think, and teach.
In reality, many DBAs and developers out there practice what I call ‘knee-jerk performance troubleshooting’. This commonly happens when a performance problem has reached the critical stage with, for instance, queries timing out, processes running slowly or failing, users disgruntled, and management wanting answers and action fast!
The ‘knee-jerk’ comes from doing some superficial analysis of the problem and jumping to the conclusion (really it’s grasping at a straw) that the most prevalent symptom must be the root cause and trying to address that, to little or no avail, often using misguided or downright incorrect advice found online. This leads to a lot of frustration and wasted time, and often to wasted money when the organization decides to throw hardware at the problem by upgrading the server and/or the I/O subsystem, only to find the problem is still there, or reappears pretty quickly.
Wait statistics analysis is one of the areas where it’s easiest to knee-jerk, and in this post I’m going to talk about a few of the common wait types and the mistakes that people make around them. There isn’t scope in an article like this to go into great depth about what to do in each case, but I’ll give you enough information to point you in the right direction.
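To make that concrete, here is a minimal sketch of one way to look at the top waits on an instance. It's a starting point rather than a finished diagnostic script, and the list of benign waits filtered out is just an illustrative subset:

```sql
-- Top waits since the stats were last cleared or the instance restarted.
-- The exclusion list is deliberately short and illustrative only.
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN
    (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'SQLTRACE_BUFFER_FLUSH',
     N'WAITFOR', N'XE_TIMER_EVENT', N'XE_DISPATCHER_WAIT',
     N'REQUEST_FOR_DEADLOCK_SEARCH', N'LOGMGR_QUEUE', N'CHECKPOINT_QUEUE',
     N'BROKER_TASK_STOP', N'BROKER_TO_FLUSH', N'CLR_AUTO_EVENT')
ORDER BY wait_time_ms DESC;
```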
LCK_M_XX
Most people assume that if locking waits are the most prevalent, then it must be some kind of blocking problem that is the issue. Often it is, such as a lack of a suitable nonclustered index causing a table scan in REPEATABLE_READ or SERIALIZABLE isolation levels that escalates to an S table lock. (And as a quick hint, if you don’t think you ever use SERIALIZABLE, you do if you use distributed transactions – everything is converted to SERIALIZABLE under the covers, which can lead to unexpected blocking and deadlocks.)
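As an illustration, a rough sketch like the following (not a complete lock-analysis script) will show any table-level (OBJECT) locks currently held or waited on, which is one way to spot that kind of escalation:

```sql
-- Table-level (OBJECT) locks often indicate lock escalation.
SELECT
    tl.request_session_id,
    tl.resource_type,                              -- OBJECT = table-level lock
    DB_NAME(tl.resource_database_id) AS database_name,
    tl.request_mode,                               -- e.g. S, X, IS, IX
    tl.request_status                              -- GRANT vs. WAIT
FROM sys.dm_tran_locks AS tl
WHERE tl.resource_type = N'OBJECT'
ORDER BY tl.request_session_id;
```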
However, it’s often the case that the blocking is caused by something else. Under the default READ_COMMITTED isolation level, locks covering changes are held until the transaction commits, and will block reads and other updates to the same row(s). If anything prevents a transaction from committing, that can cause blocking to show up.
For instance, if the database is synchronously mirrored, then the transaction cannot commit and release its locks until the log records have been sent across to the mirror and written to the mirror’s log drive. If the network is severely congested, or there’s massive I/O contention on the mirror, this could seriously delay the mirroring operation, and so cause the transaction to take much longer to commit. This would look like blocking but the root cause is resource contention to do with mirroring.
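If you suspect that scenario, a quick sketch along these lines (assuming database mirroring is configured) shows the mirroring safety level for each mirrored database and whether time spent sending log to the mirror is accumulating as DBMIRROR_SEND waits:

```sql
-- Mirroring configuration: FULL safety level means synchronous mirroring.
SELECT
    DB_NAME(database_id) AS database_name,
    mirroring_state_desc,
    mirroring_role_desc,
    mirroring_safety_level_desc
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL;

-- Cumulative time spent waiting to send log records to the mirror.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'DBMIRROR_SEND';
```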
For locking waits, unless the cause is obvious from looking at the query plan, the lock resource (e.g. table-level, indicating lock escalation), or the isolation level, follow the blocking chain (using a script that walks the blocking_session_id column in sys.dm_exec_requests) and then look to see what the thread at the head of the blocking chain is waiting for. That’s going to point towards the root cause.
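As a starting point, a minimal sketch of that kind of blocking-chain query looks something like this (note that an idle head blocker has no row in sys.dm_exec_requests, so a production script would also pull in sys.dm_exec_sessions):

```sql
-- Blocked sessions, who is blocking them, and what each request is waiting for.
SELECT
    r.session_id,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time,
    r.wait_resource,
    t.text AS current_statement
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0
   OR r.session_id IN (SELECT blocking_session_id
                       FROM sys.dm_exec_requests
                       WHERE blocking_session_id <> 0);
```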
ASYNC_NETWORK_IO
The name of this one causes lots of confusion. What word do you focus in on? NETWORK. The cause of this wait type usually has nothing to do with the network. It should really be called WAITING_FOR_APP_ACK(nowledgment), or something similar, as that’s exactly what is happening: SQL Server has sent some data to a client and is waiting for the client to acknowledge that it has consumed the data.
One of my favorite demos to do when teaching about wait statistics is to run a query that returns a large resultset in Management Studio and watch the server rack up ASYNC_NETWORK_IO waits. There’s clearly no network involved – it’s just SSMS taking a long time to reply to SQL Server. It’s doing what is known as RBAR (Row-By-Agonizing-Row), where only one row at a time is pulled from the results and processed, instead of caching all the results and then immediately replying to SQL Server and proceeding to process the cached rows.
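If you want to try the demo yourself on a test instance, a sketch like this works; it assumes an AdventureWorks-style table (Sales.SalesOrderDetail here, but any query returning a large result set will do), and be aware that clearing the wait stats resets the cumulative counters for everyone:

```sql
-- Test systems only: clearing wait stats resets the cumulative counters.
DBCC SQLPERF (N'sys.dm_os_wait_stats', CLEAR);
GO

-- Any query that returns a large result set to SSMS will do.
SELECT * FROM Sales.SalesOrderDetail;
GO

-- Look at the ASYNC_NETWORK_IO wait time that accumulated during the SELECT.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'ASYNC_NETWORK_IO';
```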
This is the major cause of ASYNC_NETWORK_IO waits – poor application design. After that, I’d look at whether the server running the application code has a performance issue, even if the application code itself is well designed. Occasionally it’s the network, but that’s rare in my experience.
OLEDB
The common knee-jerk reaction here is to equate this wait type with linked servers. However, this wait type became more common to see when SQL Server 2005 shipped, because 2005 contained a raft of new DMVs, and DMVs mostly use OLE DB under the covers. Before looking for linked server problems, I’d check whether a monitoring tool is constantly running DMV queries against the server.
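A rough way to check is something like the following, which looks for currently executing requests whose text references DMVs (the LIKE filter is deliberately crude, so expect some noise):

```sql
-- Who is currently running DMV-based queries against the instance?
SELECT
    r.session_id,
    s.program_name,
    s.host_name,
    t.text AS query_text
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = r.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE t.text LIKE N'%dm[_]%'       -- crude filter for DMV references
  AND r.session_id <> @@SPID;      -- exclude this query itself
```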
If you do have linked servers, continue troubleshooting by going to the linked server and looking at the wait statistics there to see what the most prevalent issue is, and then continue the same analysis.
One other thing that can cause OLEDB waits is DBCC CHECKDB (and related commands). It uses an OLE DB rowset to communicate information between its Query Processor and Storage Engine subsystems.
Other Waits
Some of the other waits that cause knee-jerk reactions are CXPACKET, PAGEIOLATCH_XX, SOS_SCHEDULER_YIELD, and WRITELOG, and I’ll cover those in my post next month.
Summary
When you have a performance problem, take the time to understand the data you’re looking at and perform further investigations to help narrow down to the root cause of the problem. Don’t just grasp at whatever seems to be the top wait statistic and follow the first piece of advice you come across online (unless it’s from a well-known and reputable source) or you likely won’t solve your problem, and may even make it worse.
As far as general wait statistics are concerned, you can find more information about using them for performance troubleshooting in:
- My blog post series, starting with Wait statistics, or please tell me where it hurts
- My Wait Types and Latch Classes library here
- My Pluralsight online training course SQL Server: Performance Troubleshooting Using Wait Statistics
- SQL Sentry Performance Advisor
This was the first in a series of posts I’ll be doing over the course of this year that talk about knee-jerk (re)actions around SQL Server and why they’re the wrong thing to do. Until next time, happy troubleshooting!
Well said Paul! One of the first things I do with a new client is get them to understand that without knowing the root cause of a problem before taking some corrective action (bouncing SQL Server seems to be a favorite I come across!!), we cannot hope to build a box around said problem to prevent or minimize the probability it will occur again. And like you say, some "corrective actions" can indeed make things worse. But we can also lose the ability to even find out what the root cause was.

Right after explaining this I give them 3 scripts and tell them that when a problem is noted they are to fire off those 3 IMMEDIATELY AND WITHOUT THOUGHT, and then they may proceed with any other checks they may want to pursue. Those 3 are a real-time blocker/workload check, a time-interval I/O-stall check, and a time-interval wait-stats check. Depending on the severity of the outage they can run the time-interval scripts anywhere from tens of seconds to 3-5 minutes or so. Between those 3 there is a VERY high likelihood of precisely finding, or at least narrowing down, the cause(s) of a performance issue.

Once the scripts are going I also have them fire up a very limited RPC/TSQL Batch Completed profiler run to see if something in their code has gone batty. I can't tell you the number of times I have come across infinite loops, hack attempts, code changes that were bad, etc. that the profiler trace shows immediately.
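For readers who want a starting point for the I/O-stall part of that kit, a sketch like the one below (an illustration only, not the commenter's actual script) captures the cumulative stalls per database file; run it at the start and end of an interval and compare the two outputs:

```sql
-- Cumulative I/O stalls per database file since the instance started.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name,
    vfs.num_of_reads,
    vfs.io_stall_read_ms,
    vfs.num_of_writes,
    vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id = vfs.file_id
ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;
```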
It irks me to no end when a regular client calls me in a panic 20 minutes into a severe performance issue and they haven't done those actions and/or have made changes to several things in and out of the database environment to try to fix the problem!!! Often they can't even detail what was done and the effect of said actions! They get a good chewing out from me once the dust has settled! :-)
Oh, I forgot QUESTION ONE to ask when a performance problem pops up: WHAT CHANGED?!? Self-inflicted injuries are sadly all-too-common in our line of work.
A good baseline to compare with (one no older than the latest approved change) is also a good friend to hold hands with when performance goes down the drain, to drill down to the root cause, i.e. what has changed in the current processing load compared to how it was earlier. As is one from before the latest change. But I would follow Paul's lead first, of course. :)
The most fun thing about our jobs as DBAs is that these performance issues come out of nowhere with no warning, and since our company has developers from all over the country it's challenging to ping them all to ask if they changed anything. The first thing I do personally is go to my monitoring tool and dig into the active sessions – something sp_WhoIsActive is outstanding for if you don't have a monitoring tool. At the very least I can determine the obvious – runaway sessions, long-running jobs, etc. I will send an email out to all developers telling them to close any connections to production (SSMS, profiler traces, etc. – and yes, I know what most will be thinking; it's called compromise). If nothing rears its ugly head, I'll fire up my profiler traces and filter on Duration/Reads/CPU to see if anything shows up. Normally something will show up, like high duration with relatively low CPU utilization, or particular statements taking 30,000 milliseconds to execute (basically a timeout). While the profiler data is being collected, I'll run a query to find out what objects have been modified recently – that's often much quicker for me than asking.
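A hypothetical example of that kind of "what changed recently" query (not necessarily the one referred to above) is simply to sort sys.objects by modification date:

```sql
-- Most recently modified objects in the current database.
SELECT TOP (50)
    SCHEMA_NAME(o.schema_id) AS schema_name,
    o.name,
    o.type_desc,
    o.create_date,
    o.modify_date
FROM sys.objects AS o
ORDER BY o.modify_date DESC;
```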
The key is to remain calm and not start pulling levers/restarting instances prematurely, or else you will lose a lot of valuable data that not only would have helped determine the root cause, but quite possibly prevented it from happening in the future.
The problem I think most DBAs have with wait stats is that they forget the counters are cumulative (and include maintenance tasks, etc.). As Kevin noted, you have to gather and compare them over intervals to determine what the issue is between two recent points in time.
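A minimal sketch of that interval approach: snapshot the cumulative counters into a temp table, wait, and then diff against a second read of the DMV:

```sql
-- First snapshot of the cumulative wait stats.
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
INTO #waits_start
FROM sys.dm_os_wait_stats;

WAITFOR DELAY '00:01:00';   -- pick an interval appropriate to the outage

-- Waits that accumulated during the interval.
SELECT
    w.wait_type,
    w.wait_time_ms        - s.wait_time_ms        AS wait_time_ms_delta,
    w.waiting_tasks_count - s.waiting_tasks_count AS waiting_tasks_delta,
    w.signal_wait_time_ms - s.signal_wait_time_ms AS signal_wait_ms_delta
FROM sys.dm_os_wait_stats AS w
JOIN #waits_start AS s
    ON s.wait_type = w.wait_type
WHERE w.wait_time_ms - s.wait_time_ms > 0
ORDER BY wait_time_ms_delta DESC;

DROP TABLE #waits_start;
```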
Great topic – side bar: I actually had a performance issue crop up the other day and it turned out that stats being updated on an index resulted in an entirely new execution plan for a job step, with a dramatic change in its run time (45 minutes!). Introducing an appropriate index fixed the job step, and eventually I found the log entry for the updated stats (from Ola's logging mechanism). It matched up perfectly with the performance degradation. Go figure!