This is the fourth part in a five-part series taking a deep dive into the way SQL Server row mode parallel plans start executing. Part 1 initialized execution context zero for the parent task, and part 2 created the query scan tree. Part 3 started the query scan, performed some early phase processing, and started the first additional parallel tasks in branch C.
Branch C execution details
This is the second step of the execution sequence:
- Branch A (parent task).
- Branch C (additional parallel tasks).
- Branch D (additional parallel tasks).
- Branch B (additional parallel tasks).
A reminder of the branches in our parallel plan (click to enlarge)
A short time after the new tasks for branch C are queued, SQL Server attaches a worker to each task, and places the worker on a scheduler ready for execution. Each new task runs inside a new execution context. At DOP 2, there are two new tasks, two worker threads, and two execution contexts for branch C. Each task runs its own copy of the iterators in branch C on its own worker thread:
The two new parallel tasks start running at a sub-procedure entry point, which initially leads to an Open
call on the producer side of the exchange (CQScanXProducerNew::Open
). Both tasks have identical call stacks at the start of their lives:
Exchange synchronization
Meanwhile, the parent task (running on its own worker thread) registers the new sub-processes with the sub-process manager, then waits at the consumer side of the repartition streams exchange at node 5. The parent task waits on CXPACKET
* until all of the branch C parallel tasks complete their Open
calls and return to the producer side of the exchange. The parallel tasks will open every iterator in their subtree (i.e. down to the index seek at node 9 and back) before returning to the repartition streams exchange at node 5. The parent task will wait on CXPACKET
while this happens. Remember the parent task is executing early phases calls.
We can see this wait in the waiting tasks DMV:
Execution context zero (the parent task) is blocked by both of the new execution contexts. These execution contexts are the first additional ones to be created after context zero, so they are assigned the numbers one and two. To emphasise: Both new execution contexts need to open their subtrees and return to the exchange for the parent task’s CXPACKET
wait to end.
You might have been expecting to see CXCONSUMER
waits here, but that wait is reserved for waiting on row data to arrive. The current wait is not for rows — it is for the producer side to open, so we get a generic CXPACKET
* wait.
* Azure SQL Database and Managed Instance use the new CXSYNC_PORT
wait instead of CXPACKET
here, but that improvement hasn’t made its way into SQL Server yet (as of 2019 CU9).
Inspecting the new parallel tasks
We can see the new tasks in the query profiles DMV. Profiling information for the new tasks appears in the DMV because their execution contexts were derived (cloned, then updated) from the parent (execution context zero):
There are now three entries for each iterator in Branch C (highlighted). One for the parent task (execution context zero), and one for each new additional parallel task (contexts 1 and 2). Notice that the per-thread estimated row counts ( see part 1) have arrived now, and are shown only for the parallel tasks. The first and last active times for the parallel tasks represent the time their execution contexts were created. None of the new tasks has opened any iterators yet.
The repartition streams exchange at node 5 still only has a single entry in the DMV output. This is because the associated invisible profiler monitors the consumer side of the exchange. The additional parallel tasks are on the producer side of the exchange. The consumer side of node 5 will eventually have parallel tasks, but we haven’t got to that point yet.
Checkpoint
This seems like a good point to pause for breath and summarize where everything is at the moment. There will be more of these stopping points as we go along.
- The parent task is on the consumer side of the repartition streams exchange at node 5, waiting on
CXPACKET
. It is in the middle of executing early phases calls. It paused to start up Branch C because that branch contains a blocking sort. The parent task’s wait will continue until both parallel tasks complete opening their subtrees. - Two new parallel tasks on the producer side of the node 5 exchange are ready to open the iterators in Branch C.
Nothing outside Branch C of this parallel execution plan can make forward progress until the parent task is released from its CXPACKET
wait. Remember we have only created one set of additional parallel workers so far, for Branch C. The only other thread is the parent task, and that is blocked.
Branch C Parallel Execution
The two parallel tasks start at the producer side of the repartition streams exchange at node 5. Each has a separate (serial) plan with its own stream aggregate, sort, and index seek. The compute scalar does not appear in the runtime plan because its calculations are deferred to the sort.
Each instance of the index seek is parallel-aware and operates on disjoint sets of rows. These sets are generated on demand from the parent rowset created earlier by the parent task (covered in part 1). When either instance of the seek needs a new subrange of rows, it synchronizes with the other worker threads, so that only one is allocating a new subrange at the same time. The synchronization object used was also created earlier by the parent task. When a task waits for exclusive access to the parent rowset to acquire a new subrange, it waits on CXROWSET_SYNC
.
Branch C tasks open
The sequence of Open
calls for each task in Branch C is:
CQScanXProducerNew::Open
. Notice there is no preceding profiler on the producer side of an exchange. This is unfortunate for query tuners.CXTransLocal::Open
CXPort::Register
CXTransLocal::ActivateWorkers
CQScanProfileNew::Open
. The profiler above node 6.CQScanStreamAggregateNew::Open
(node 6)CQScanProfileNew::Open
. The profiler above node 7.CQScanSortNew::Open
(node 7)
The sort is a fully blocking operator. It consumes its entire input during its Open
call. There are a great number of interesting internal details to explore here, but space is short, so I will only cover the highlights:
The sort builds its sort table by opening its subtree and consuming all the rows its children can provide. Once sorting is complete, the sort is ready to transition to output mode, and it returns control to its parent. The sort will later respond to GetRow()
calls, returning the next sorted row each time. An illustrative call stack during sort input is:
Execution continues until each sort has consumed all the (disjoint ranges of) rows available from its child index seek. The sorts then call Close
on the index seeks, and return control to their parent stream aggregate. The stream aggregates initialize their counters and return control to the producer side of the repartition exchange at node 5. The sequence of Open
calls is now complete in this branch.
The profiling DMV at this point shows updated timing numbers, and close times for the parallel index seeks:
More exchange synchronization
Recall the parent task is waiting on the consumer side of node 5 for all producers to open. A similar synchronization process now happens among the parallel tasks on the producer side of the same exchange:
Each producer task synchronizes with the others via CXTransLocal::Synchronize
. The producers call CXPort::Open
, then wait on CXPACKET
for all consumer-side parallel tasks to open. When the first Branch C parallel task arrives back at the producer side of the exchange and waits, the waiting tasks DMV looks like this:
We still have the parent task’s consumer-side waits. The new CXPACKET
highlighted is our first producer-side parallel task waiting for all consumer-side parallel tasks to open the exchange port.
The consumer-side parallel tasks (in Branch B) do not even exist yet, so the producer task displays NULL for the execution context it is blocked by. The task currently waiting on the consumer side of the repartition streams exchange is the parent task (not a parallel task!) running EarlyPhases
code, so it doesn’t count.
Parent task CXPACKET wait ends
When the second parallel task in Branch C arrives back at the producer side of the exchange from its Open
calls, all producers have opened the exchange port, so the parent task on the consumer side of the exchange is released from its CXPACKET
wait.
The workers on the producer side continue to wait for the consumer side parallel tasks to be created and open the exchange port:
Checkpoint
At this point in time:
- There are a total of three tasks: Two in Branch C, plus the parent task.
- Both producers at the node 5 exchange have opened, and are waiting on
CXPACKET
for the consumer side parallel tasks to open. Much of the exchange machinery (including row buffers) is created by the consumer side, so there is nowhere for the producers to put rows yet. - The sorts in Branch C have consumed all their input, and are ready to provide sorted output.
- The index seeks in Branch C have completed their work and closed down.
- The parent task has just been released from waiting on
CXPACKET
at the consumer side of the node 5 repartition streams exchange. It is still executing nestedEarlyPhases
calls.
Branch D Parallel Tasks Start
This is the third step in the execution sequence:
- Branch A (parent task).
- Branch C (additional parallel tasks).
- Branch D (additional parallel tasks).
- Branch B (additional parallel tasks).
Released from its CXPACKET
wait at the consumer side of the repartition streams exchange at node 5, the parent task ascends the Branch B query scan tree. It returns from nested EarlyPhases
calls to the various iterators and profilers on the outer (upper) input of the merge join.
As mentioned, ascending the tree updates the elapsed and CPU times recorded by the invisible profiling iterators. We are executing code using the parent task, so those numbers are recorded against execution context zero. This is the ultimate source of the “thread 0” timing numbers referred to in my previous article, Understanding Execution Plan Operator Timings.
Once back at the merge join, the parent task calls EarlyPhases
for the iterators and profilers on the inner (lower) input to the merge join. These are nodes 10 to 15 (excluding 14, which is deferred):
Once the parent task’s early phases calls reach the index seek at node 15, it begins to ascend the tree again (setting profiling times) until it reaches the repartition streams exchange at node 11.
Then, just as it did on the outer (upper) input to the merge join, it starts the producer side of the exchange at node 11, creating two new parallel tasks.
This sets Branch D in motion (shown below). Branch D executes exactly as already described in detail for Branch C.
Immediately after starting tasks for Branch D, the parent task waits on CXPACKET
at node 11 for the new producers to open the exchange port:
The new CXPACKET
waits are highlighted. Notice the reported node id might be a bit misleading. The parent task really is waiting at the consumer side of node 11 (repartition streams), not node 2 (gather streams). This is a quirk of early phase processing.
Meanwhile, the producer threads in Branch C continue to wait on CXPACKET
for the consumer side of the node 5 repartition streams exchange to open.
Branch D opening
Just after the parent task starts the producers for Branch D, the query profile DMV shows the new execution contexts (3 and 4):
The two new parallel tasks in Branch D proceed exactly as those in Branch C did. The sorts consume all their input, and the Branch D tasks return to the exchange. This releases the parent task from its CXPACKET
wait. The Branch D workers then wait on CXPACKET
at the producer side of node 11 for the consumer side parallel tasks to open. Those parallel workers (in Branch B) do not exist yet.
Checkpoint
The waiting tasks at this point are shown below:
Both sets of parallel tasks in Branches C and D are waiting on CXPACKET
for their parallel task consumers to open, at repartition streams exchange nodes 5 and 11 respectively. The only runnable task in the whole query right now is the parent task.
The query profiler DMV at this point is shown below, with operators in Branches C and D highlighted:
The only parallel tasks we haven’t started yet are in Branch B. All the work in Branch B so far has been early phases calls performed by the parent task.
End of Part 4
In the final part of this series, I will describe how the remainder of this particular parallel execution plan starts up, and briefly cover how the plan returns results. I will conclude with a more general description that applies to parallel plans of arbitrary complexity.