As a DBA, addressing performance issues is often a reactive event; problem occurs, you have to respond. Sometimes you are looking at a SQL Server instance that you know well, other times it may be your first encounter with an environment. This occurs in the consulting world as well. When helping a long-term customer, I already have the details about the environment filed away. However, when we get an email from someone who we have not worked with before, and it’s an emergency situation where they want immediate help, we have no background about the environment and have no idea what we are walking into. We provide assistance without going through the extensive data collection and analysis process that begins every new customer engagement.
For this reason, I have a set of five items that I check immediately when I confront a new environment. The information I collect sets the stage for how I approach troubleshooting going forward, and while it rarely pinpoints THE specific problem, it helps me rule out what is NOT the problem, which is sometimes just as important.
Data Collection Methods
I recognize that everyone has a different approach when tackling a new environment. There are several free, widely available scripts that you can download and run to give you the “lay of the land” for a SQL Server instance (Glenn Berry's DMV scripts come to mind). The focus here is not how you collect the data, it is what data you collect, and what you analyze first.
The very first thing I want to know when I look at an instance is the SQL Server version and edition. The fastest way to get this information is to execute:
With this output, I can check the build to determine the service packs, cumulative updates, and hotfixes applied, and I know what edition is used. I also like to know if the instance is clustered, so I also execute:
I do sometimes have this information from the customer, but it never hurts to verify, as version and edition can affect subsequent troubleshooting steps and recommendations. For example, a client recently contacted us about an intermittent performance issue they saw with SQL Server 2008. A quick check of the version revealed that they were running SQL Server 2008 SP3, and there were several Cumulative Updates released after SP3 that addressed a range of performance issues. While I gathered more information before making the recommendation that they apply the latest CU, this was an immediate red flag as to what may be causing the issue.
This catalog view helps build on the foundation started with server properties, and helps us understand how the instance is configured. With this view, I look for settings that have been changed from the defaults, but should not have been, and those that have not been modified, but should.
SELECT [name], [value], [value_in_use], [description] FROM [sys].[configurations] ORDER BY [name];
Consider the max server memory (MB) setting, which limits the amount of memory available to the buffer pool. The default value is 2147483647, but it should be changed to a value less than the total memory on the server to ensure there is plenty of memory for the OS, other applications, and other SQL Server tasks that require memory not taken from the buffer pool. For guidance on setting the appropriate value for max server memory (MB), I recommend Jonathan's post, How much memory does my SQL Server actually need?
Conversely, the priority boost setting has a default of zero, and should always be left as such. In fact, Microsoft recommends not changing it, and the option will be removed in a future release of SQL Server.
After I understand how the instance is configured, I next look to see what exists at the database level.
SELECT * FROM [sys].[databases] ORDER BY [database_id];
When I check the output of this catalog view, I look for anti-patterns – anything that jumps out as unexpected or atypical – in the data. The output is conducive for quick analysis – many of the settings list a 0 or 1 for the value (off or on) and I make a mental note of what's different. I expect auto-create statistics and auto-update statistics to be enabled (set to 1). I expect auto-close and auto-shrink to be disabled (set to 0). I look to see what the collation is for the user databases, specifically whether they all have the same collation, and if that collation is the same as tempdb. I also note security options such as cross-database chaining and the is_trustworthy option, both disabled (0) by default. If I find that any of these settings deviate from what I expect, I note it, and move on. At no point do I stop my collection or analysis to make a change, as I am simply gathering information as quickly as I can to get a good understanding of the environment.
In addition to checking the settings for the databases, I also take note of the number of user databases. There is no “right number” of user databases for an instance – an instance can perform poorly with one database, and it can perform wonderfully with 100. There are a myriad of factors at play, and the number of databases is simply a data point worth noting.
I admit, I used to neglect the SQL Server ERRORLOG; it was like an after-thought when I investigated a SQL Server problem. Then I realized the error of my ways, and I have not taken it for granted since. I tend to navigate through Management Studio to access the log (within Management | SQL Server Logs), though you can use the sp_readerrorlog stored procedure or browse out to the file and open it your favorite text editor.
Within the ERRORLOG I look for recent errors – for example anything related to memory – and I also look to see what trace flags, if any, are in use. I also check to see if Lock Pages in Memory is enabled, if the cache is being flushed (either purposely or not), and if any other unusual activity occurs with regularity. Depending on how urgent the problem is I also look at the Windows logs (Event, Application, and Security), again not just looking for errors, but also unexpected message patterns.
The final area of SQL Server that I review when looking at a performance issue on an unknown instance is wait statistics. Every SQL Server instance will have waits – no matter how well tuned the code is, no matter how much hardware is behind it. As a DBA you want to know what your typical waits are for an instance, and when I'm looking at a new environment, I don't immediately know if the waits I see are typical, or due the performance issue. I ask the customer if they baseline wait statistics, and if not, I ask if I can clear them and let them start to accumulate while the performance problem occurs. To check wait statistics you can use the script in Paul Randal's oft-referenced post, or the version in Glenn's DMV queries.
Once you review the accumulated wait statistics, you will have the final piece that provides the “big picture” of the SQL Server instance, and the information you need to start troubleshooting. It's not uncommon to check wait statistics first when troubleshooting, but waits alone are not enough information to determine what you need to investigate next unless you also understand the basic SQL Server configuration.
As I alluded to earlier, there is typically no one piece of data that tells you where the performance problem lies, it’s multiple data points obtained that point you in the right direction. How you capture that information is up to you, but once you review the output you should have a good understanding of how the SQL Server environment is configured, and that knowledge, combined with the wait statistics, can help you decide what to investigate next. Troubleshooting works best with a methodical approach, so start with the basics and work up, and when you think you have determined the root cause, dig just a little bit more and find one or two additional pieces of evidence that support your finding. Once you have that data, then you can make a recommendation to help improve or resolve the issue.