A recent consulting engagement focused on blocking issues inside of SQL Server that were causing delays in processing user requests from the application. As we started to dig into the issues, it became clear that, from a SQL Server standpoint, the problem revolved around sessions in a Sleeping status that were holding locks inside of the Engine. This is not typical behavior for SQL Server, so my first thought was that some sort of application design flaw was leaving a transaction active on a session that had been reset for connection pooling. That was quickly proven not to be the case, since the locks were eventually being released automatically; there was just a delay before it happened. So, we had to dig in further.
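A quick way to spot this pattern is to look for sleeping sessions that still have an open transaction and can therefore still be holding locks. A minimal sketch of that kind of check (the column choices are illustrative, not the exact query from the engagement):

```sql
-- Sketch: sleeping sessions that still have an open transaction
-- (and therefore can still be holding locks).
SELECT
    s.session_id,
    s.status,                     -- 'sleeping' = waiting on the client for the next command
    s.last_request_end_time,
    s.open_transaction_count,     -- available in SQL Server 2012 and later
    s.host_name,
    s.program_name
FROM sys.dm_exec_sessions AS s
JOIN sys.dm_tran_session_transactions AS t
    ON t.session_id = s.session_id
WHERE s.is_user_process = 1
  AND s.status = 'sleeping'
ORDER BY s.last_request_end_time;
```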
Understanding Session Status
Depending on which DMV you look at in SQL Server, a session can have a few different statuses. A Sleeping status means the Engine has completed the command, all of the interaction between client and server has finished, and the connection is waiting for the next command from the client. If a sleeping session has an open transaction, the cause is always in the code and not in SQL Server. The transaction being held open can be explained by a couple of things. The first possibility is a procedure with an explicit transaction that doesn't turn on the XACT_ABORT setting and then times out, without the application handling cleanup correctly, as explained in this really old post by the CSS team.
If the procedure had enabled the XACT_ABORT setting, the transaction would have been aborted and rolled back automatically when the timeout occurred. SQL Server is doing exactly what it is required to do under the ANSI standards and to maintain the ACID properties of the command that was executed. The timeout is not SQL Server related; it is set by the .NET client through the CommandTimeout property, so that too is code-related behavior and not the SQL Engine. This is the same kind of problem that I talked about in my Extended Events series as well, in this blog post.
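To illustrate the difference, here is a minimal sketch of the pattern (the procedure and table names are made up for the example). With SET XACT_ABORT ON, a client timeout (attention event) causes the open transaction to be rolled back automatically instead of being left open on the pooled connection:

```sql
-- Hypothetical procedure used only to illustrate the XACT_ABORT behavior.
CREATE PROCEDURE dbo.UpdateWidget
    @WidgetId INT,
    @Name     VARCHAR(100)
AS
BEGIN
    -- With XACT_ABORT ON, a client timeout (attention) or a run-time error
    -- aborts the batch AND rolls back the open transaction automatically.
    -- Without it, the transaction can remain open on the now-sleeping session.
    SET XACT_ABORT ON;

    BEGIN TRANSACTION;

    UPDATE dbo.Widget
    SET Name = @Name
    WHERE WidgetId = @WidgetId;

    -- ... other work that could run long enough to hit the client CommandTimeout ...

    COMMIT TRANSACTION;
END;
```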
However, in this case the application didn’t use stored procedures for access to the database, and all of the code was generated by an ORM. At this point the investigation shifted away from SQL Server and more towards how the application used the ORM and where transactions would be generated by the application code base.
Understanding .NET Transactions
It is general knowledge that SQL Server wraps any data modification in a transaction that is automatically committed unless the IMPLICIT_TRANSACTIONS set option is ON for the session. After verifying that this was not ON for any portion of their code (a quick spot check for this is sketched after the list below), it was pretty safe to assume that any transaction remaining after a session went to Sleeping was the result of an explicit transaction being opened somewhere during the execution of their code. Now it was just a matter of understanding when, where, and most importantly, why it wasn't being closed out immediately. That left a few different scenarios we were going to have to look for inside of their application tier code:
- The application using a TransactionScope() around an operation
- The application enlisting a SqlTransaction() on the connection
- The ORM code wrapping certain calls in a transaction internally that isn’t being committed
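For reference, whether IMPLICIT_TRANSACTIONS is on for the current session can be spot-checked from the @@OPTIONS bitmask (bit value 2); this is just a convenience sketch, not something specific to this engagement:

```sql
-- IMPLICIT_TRANSACTIONS corresponds to bit 2 of the @@OPTIONS bitmask.
SELECT CASE WHEN @@OPTIONS & 2 = 2
            THEN 'IMPLICIT_TRANSACTIONS is ON'
            ELSE 'IMPLICIT_TRANSACTIONS is OFF'
       END AS implicit_transactions_state;

-- DBCC USEROPTIONS lists the SET options active for the current session.
DBCC USEROPTIONS;
```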
The documentation for TransactionScope pretty quickly ruled out the first scenario as a possible cause. If you fail to Complete the transaction scope, it automatically rolls back and aborts the transaction when it is disposed, so it isn't very likely that this would persist across connection resets. Likewise, the SqlTransaction object automatically rolls back if it hasn't been committed when the connection is reset for connection pooling, so that quickly became a non-starter as well. This just left the ORM code generation, or at least that was what I thought, and in my experience it would be incredibly odd for an older version of a very common ORM to exhibit this type of behavior, so we had to dig in further.
The documentation for the ORM they were using clearly states that any multi-entity action is performed inside of a transaction. Multi-entity actions could be recursive saves or saving an entity collection back to the database from the application, and the developers agreed that these types of operations happen all over their code. So yes, the ORM must be using transactions, but why were they all of a sudden becoming a problem?
The Root of the Issue
At this point we took a step back and did a holistic review of the entire environment, using New Relic and the other monitoring tools that had data available from the times the blocking problems were showing up. It started to become clear that the sleeping sessions holding locks only occurred when the IIS application servers were under extreme CPU load, but that on its own wasn't enough to account for the lag being seen before transaction commits released their locks. It also turned out that the application servers were virtual machines running on an overcommitted hypervisor host, and based on the summation values provided by the VM administrator, their CPU Ready wait times were severely elevated at the times of the blocking issues.
The Sleeping status occurs, with an open transaction holding locks, between the individual .SaveEntity calls for the objects completing and the final commit in the ORM's generated code-behind for those objects. If the VM/app server is under pressure or load, that final commit can be delayed and lead to blocking, but the problem isn't in SQL Server; it is doing exactly what it should within the scope of the transaction. The problem is ultimately the result of the delay in processing the application-side commit point. Capturing the timings of the statement completed and RPC completed events from Extended Events, along with the database_transaction_end event timing, shows the round-trip delay for the app tier closing out the transaction on the open connection. In this case everything being seen in SQL Server is the victim of an overloaded application server and an overloaded VM host. Moving or dividing the application load across servers in an NLB or hardware load balanced configuration, using hosts that aren't overcommitted on CPU, would quickly restore the immediate commit of the transactions and remove the Sleeping sessions holding locks in SQL Server.
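As a rough sketch, an event session along these lines can capture those timings so the gap between the last statement completing and the transaction actually ending can be measured (the session name, file name, and the session_id used in the predicates are placeholders for the example):

```sql
-- Sketch of an event session to measure the gap between the last statement
-- completing on a suspect session and the transaction actually committing.
CREATE EVENT SESSION [TrackCommitDelay] ON SERVER
ADD EVENT sqlserver.rpc_completed (
    ACTION (sqlserver.session_id)
    WHERE (sqlserver.session_id = 71)),
ADD EVENT sqlserver.sql_statement_completed (
    ACTION (sqlserver.session_id)
    WHERE (sqlserver.session_id = 71)),
ADD EVENT sqlserver.database_transaction_end (
    ACTION (sqlserver.session_id)
    WHERE (sqlserver.session_id = 71))
ADD TARGET package0.event_file (SET filename = N'TrackCommitDelay');
GO
ALTER EVENT SESSION [TrackCommitDelay] ON SERVER STATE = START;
```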
Yet one more example of an environmental issue causing what looked like a run-of-the-mill blocking problem. It always pays to investigate why the blocking thread isn’t able to release its locks quickly.