Different Plans for "Identical" Servers

In my last post, "Multiple Plans for an 'Identical' Query," I talked about the case where you are getting two different plans for what you think is the same query, as well as the case where you are getting two copies of the same plan (and might not even know it). As we examined there, "identical" can be a pretty strong word.

Another scenario that throws people for a loop is the case where they restore a database to a different server – say, restore a production database to an "identical" test server – and they get different performance characteristics or different plans for the same query (no quotes this time – I'm really talking about truly identical queries).

Are the servers truly "identical"?

http://www.flickr.com/photos/oskay/265900118

These guys may look similar, but they're not quite identical.

If you come across this scenario, the first thing you need to ask yourself is whether these two servers really are identical. Some things to check:

Version – Many optimizer and query behavior changes are pushed through service packs and cumulative updates. Often I have seen people say, "Well, they're both 2008!" – when, in fact, one was 2008 and the other was 2008 R2, or they were at different service packs or even cumulative update levels. Since a lot of people reading @@VERSION mistake the operating system service pack information for the SQL Server service pack information, I would say the following is better:
SELECT SERVERPROPERTY(N'ProductVersion');

I can't stress enough the importance of using the exact same version to perform true, apples-to-apples tests. If you're using SQL Server 2012 or better, you can check our build posts (SQL Server 2012 | SQL Server 2014) to determine the service pack or cumulative update required to make sure the versions match.
Edition – While hopefully you are using the same edition on both servers (or equivalent, since aside from licensing, Developer and Evaluation are the same as Enterprise), mismatches here can lead to very different behavior. For example, different editions have different compute capacities for various features, and then there are subtler things like the ability to use an indexed view without the NOEXPAND hint or perform schema changes or index maintenance online. You can compare editions using:

SELECT SERVERPROPERTY(N'Edition');
CPU count – SQL Server definitely uses the number of schedulers available during the process of producing an execution plan, and there is no denying that the number of cores can affect actual runtime performance (let's leave out clock speed, since that is rarely a significant factor in query performance). Don't just validate the number of cores physically installed in the underlying server, but also check SQL Server's error log for the number of CPUs SQL Server can actually use due to licensing. Even forgetting raw core count, on a NUMA system, artificial restrictions here can lead to very different performance profiles. For more information, see Brent Ozar's recent post, "Why Core-Based Licensing Matters for Performance Tuning." Edition ties in here as well, since in SQL Server 2012 and 2014, Standard Edition can only use 16 cores no matter what your settings or physical hardware might lead you to believe. Other settings that can influence CPU-based plan choice and performance differently include Resource Governor, server-wide MAXDOP, CPU affinity, and cost threshold for parallelism.
Amount of memory – Like CPUs, the optimizer makes plan choices based on the amount of memory available. And like CPUs, I'm not just talking about the amount of RAM installed in the system, but the amount of memory granted to SQL Server, and how much it is truly using. Check the max server memory settings, but also the performance counters for total and target memory, and even DBCC MEMORYSTATUS. Other things you may want to review include Resource Governor settings and Lock Pages in Memory. There is also a setting that, if different between two servers, can have a significant effect on how much of the plan cache is in use for the same set of queries: optimize for ad hoc workloads. Kimberly Tripp has a great post on this: Plan cache and optimizing for adhoc workloads. Finally, if the server is virtual, be aware that the environment can play a part here – especially when VM memory settings do not match production or are dynamic.
Buffer pool / plan cache – When you restore the database on the test server, there are a bunch of things that simply aren't ready for you right away. The buffer pool does not contain any of the data that may have existed in the source server – so there will be additional I/O required to prime the data into memory the first time it's queried. And if the buffer pool is restricted differently than production due to some of the factors above, it may not be possible to achieve the same performance patterns even after running the query multiple times – Paul White (@SQL_Kiwi) talks about this in his answer on Database Administrators. Also, the plan cache won't contain any of the plans that existed in production, so at the very least – even if the same plan ultimately gets compiled (which may not happen due to different parameters than when the plan was compiled on the original server) – you will have additional compilation costs. And those can change if you have any plan-affecting trace flags in place, too.
Disk subsystem – While the speed and size of the disk(s) being used won't directly affect plan choice, they certainly can influence observed performance, which can make you wonder why the same query, with the same plan, runs so much faster on one system than the other. I/O is typically SQL Server's biggest bottleneck, and it is quite rare that a test server really has the exact same underlying subsystem as its production equivalent. So, if you are seeing performance differences between the two systems, and the plans and other hardware elements are the same, this might be the next best place to check. And don't forget that, as of SQL Server 2014, Resource Governor may place constraints on your I/O performance.

Trace flags – Check the list of global trace flags set on both servers; there are several that can affect optimization, plan behavior, and perceived performance, even if all of the above settings are identical. Here are 10 common and notable ones (though this is absolutely not an endorsement to turn any of these on without thorough regression testing):

Flag	Explanation
2301	Coerces the optimizer to spend more time trying to find an optimal plan.
2312	Forces SQL Server 2014's new cardinality estimator.
2335	Causes more conservative memory grants.
2453	Forces OPTION (RECOMPILE) for queries referencing table variables.
2861	Allows SQL Server to cache trivial / zero-cost plans.
4136	Effectively, adds OPTIMIZE FOR UNKNOWN to all queries (to thwart parameter sniffing).
4199	An umbrella containing a whole slew of optimizer fixes.
8744	Disables pre-fetching for nested loops.
9481	Turns off SQL Server 2014's new cardinality estimator.

That list of trace flags is by no means exhaustive; there are many others, including undocumented ones I've been asked not to mention. If you are using others not listed above (and can't explain why), you might find clues in KB #920093, KB #2964518, Trace Flags (MSDN) or Trace Flags in SQL Server (TechNet). You will also find some valuable insight in various posts by Paul White, either here, or over at sql.kiwi.

Concurrency – Presumably the test system is used for things other than whatever you're currently testing. And unless you are performing a replay of some sort, it also likely has a very different workload profile. These differences in workload can obviously have a direct impact on the availability of resources to service the requests you're testing, and in turn the perceived performance of those requests. Don't forget to check for other services that may not exist in production, or exist but are used in different ways (such as Analysis Services, Reporting Services, Windows services, and even your own applications). Conversely there may be services like this in production that affect performance there, or additional overhead on the instance itself that isn't mimicked in test: aside from the actual production workload, think about things like tracing, extended events, high-impact monitoring, change tracking, change data capture, auditing, service broker, index maintenance, backup jobs, DBCC checks, mirroring, replication, availability groups, and the list goes on and on…

Are the databases still "identical"?

Assuming all of the hardware and workload variables match up well enough, it can still be challenging to ensure that the databases remain the same. If you are performing a backup / restore onto the test system, the new database starts out as identical to the source (except for physical location and security). But as soon as you start touching it in any way, it very quickly deviates from the production copy, since you could do any or all of the following:

Change data, schema, or both.
Inadvertently kick off an auto-update of statistics.
Manually add, defragment or rebuild indexes, or create or update statistics.
Change database settings like compatibility level, isolation level, forced parameterization, selective XML indexes, or any of the options named "Auto"-<anything>. (Heck, even data and log file locations and growth settings can affect query performance, and this includes tempdb.)
Empty the plan cache, the buffer pool, or both, directly or as a side effect of other events (such as a RECONFIGURE or a service restart).

Also, once you start generating new query plans, even before any of the above changes take place, you have to remember that they may be based on data that is different than the data used to generate plans for the same queries in production. As an example, cardinality when the plan was compiled in production could have skewed significantly between that point and the time of the backup, meaning the new plan will be generated based on different statistics and histogram information.

These things diverge even further if this is not, in fact, a recent restore – but rather two schemas and data sets you're keeping synced in other ways (such as manual deployments of schema and/or data changes, or even replication). Due to disk space limitations, you may also have taken only a subset of production data, or even a stats-only clone – these differences in data will almost certainly lead to different performance characteristics for all but the simplest of queries, even if you do luck out and get the same plans for some.

Are the queries really "identical"?

Even if everything above checks out, there are still scenarios where you are getting a different plan because of session settings (you may be using a different copy of SSMS, with different settings, or a different client tool altogether), or different default schemas (you may be connecting to the test server as a different Windows or SQL auth login, for example). I talked a lot about these things in my previous post.

Conclusion

While there are ways to mitigate some differences (check out DBCC OPTIMIZER_WHATIF for fooling your test server into believing phenomenal things about the underlying hardware), the truth is that it is going to be very challenging to make two servers perform reliably and consistently identical, and that there are potentially dozens of reasons why you may get different plans or different performance on two similar (or even identical) servers.

Do you have any particular tricks? Do you have any excruciating pain points with the ideas above (or others I neglected to mention)? Please share in the comments below!