Allocation Order Scans - SQLPerformance.com

When an execution plan includes a scan of a b-tree index structure, the storage engine may be able to choose between two physical access strategies when the plan is executed:

Follow the index b-tree structure; or,
locate pages using internal page allocation information.

Where a choice is available, the storage engine makes the runtime decision on each execution. A plan recompilation is not required for it to change its mind.

The b-tree strategy starts at the root of the tree, descends to an extreme edge of the leaf level (depending on whether the scan is forward or backward), then follows leaf-level page links until the other end of the index is reached. The allocation strategy uses Index Allocation Map (IAM) structures to locate database pages allocated to the index. Each IAM page maps allocations to a 4GB interval in a single physical database file, so scanning the IAM chains associated with an index tends to access index pages in physical file order (at least as far as SQL Server can tell).

The main differences between the two strategies are:

A b-tree scan can deliver rows to the query processor in index key order; an IAM-driven scan cannot;
a b-tree scan may not be able to issue large read-ahead I/O requests if logically contiguous index pages are not also physically contiguous (e.g. as a result of page splitting in the index).

A b-tree scan is always available for an index. The conditions often cited for allocation order scans to be available are:

The query plan must allow an unordered scan of the index;
the index must be at least 64 pages in size; and,
either a TABLOCK or NOLOCK hint must be specified.

The first condition simply means that the query optimizer must have marked the scan with the Ordered:False property. Marking the scan Ordered:False means that correct results from the execution plan do not require the scan to return rows in index key order (though it may do so if it is convenient or otherwise necessary).

The second condition (size) applies only to SQL Server 2005 and later. It reflects the fact that there is a certain start-up cost to performing an IAM-driven scan, so there needs to be a minimum number of pages for the potential savings to repay the initial investment. The “64 pages” refers to the value of data_pages for the IN_ROW_DATA allocation unit only, as reported in sys.allocation_units.

Of course, there can only be a payoff from an allocation order scan if the possibly-larger read-ahead considerations actually come into play, but SQL Server does not currently consider this factor. In particular, it does not account for how much of the index is currently in memory, nor does it care how fragmented the index is.

The third condition is probably the least complete description in the list. Hints are not in fact required, though they can be used to meet the real requirements: The data must be guaranteed not to change during the scan, or (more controversially) we must indicate that we do not care about potentially inaccurate results, by performing the scan at the read uncommitted isolation level.

Even with these clarifications, the list of conditions for an allocation-ordered scan is still not complete. There are a number of important caveats and exceptions, which we will come to shortly.

Demo

The following query uses the AdventureWorks sample database:

CHECKPOINT;
DBCC DROPCLEANBUFFERS;
GO
SELECT
    P.BusinessEntityID,
    P.PersonType
FROM Person.Person AS P;

Note that the Person table contains 3,869 pages. The post-execution (actual) plan is as follows (shown in SQL Sentry Plan Explorer):

Execution Plan

In terms of the allocation-order scanning requirements we have so far:

The plan has the required Ordered:False property; and,
the table has more than 64 pages; but,
we have done nothing to ensure the data cannot change during the scan. Assuming our session is using the default read committed isolation level, the scan is not being performed at the read uncommitted isolation level either.

As a consequence, we would expect this scan to be performed by scanning the b-tree rather than being IAM-driven. The query results indicate that this is likely true:

Query Results

The rows are returned in Clustered Index key order (by BusinessEntityID). I should state clearly that this result ordering is absolutely not guaranteed, and should not be relied on. Ordered results are only guaranteed by an appropriate top-level ORDER BY clause.

Nevertheless, the observed output order is circumstantial evidence that the scan was performed this time by following the clustered index b-tree structure. If more evidence is needed, we can attach a debugger and look at the code path SQL Server is executing during the scan:

Call stack

The call stack clearly shows the scan following the b-tree.

Adding a table lock hint

We now modify the query to include a table-lock hint:

CHECKPOINT;
DBCC DROPCLEANBUFFERS;

SELECT
    P.BusinessEntityID,
    P.PersonType
FROM Person.Person AS P
    WITH (TABLOCK);

At the default locking read committed isolation level, the shared table-level lock prevents any possible concurrent modifications to the data. With all three preconditions for IAM-driven scans met, we would now expect SQL Server to use an allocation-order scan. The execution plan is the same as before, so I won’t repeat it, but the query results certainly look different:

Query Results

The results are still apparently ordered by BusinessEntityID, but the starting point (10866) is different. Indeed, if we scroll down the results, we soon encounter sections that are more obviously out of key order:

More Query Results

The partial ordering is due to the allocation-order scan processing a whole index page at a time. The results within a page happen to be returned ordered by the index key, but the order of the scanned pages is now different. Again, I should stress that the results may look different for you: there is no guarantee of output order, even within a page, without a top-level ORDER BY on the original query.

For comparison with the call stack shown earlier, this is a stack trace obtained while SQL Server was processing the query with the TABLOCK hint:

Call stack

Stepping on a little further through the execution:

Call stack

Clearly, SQL Server is performing an allocation-ordered scan when the table lock is specified. It is a shame there is no indication in a post-execution plan of which type of scan was used at runtime. As a reminder, the type of scan is chosen by the storage engine, and can change between executions without a plan recompilation.

Other ways to meet the third condition

I said before that to get an IAM-driven scan, we need to ensure the data cannot change underneath the scan while it is in progress, or we need to run the query at the read uncommitted isolation level. We have seen that a table lock hint at locking read committed isolation is sufficient to meet the first of those requirements, and it is easy to show that using a NOLOCK/READUNCOMMITTED hint also enables an allocation-order scan with the demo query.

In fact there are many ways to meet the third condition, including:

Altering the index to only allow table locks;
making the database read-only (so data is guaranteed not to change); or,
changing the session isolation level to READ UNCOMMITTED.

There are, however, much more interesting variations on this theme that mean we need to amend the three conditions stated previously…

Row-versioning isolation levels

Enable read committed snapshot isolation (RCSI) on the AdventureWorks database, and run the test with the TABLOCK hint again (at read committed isolation):

ALTER DATABASE AdventureWorks2012
SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
GO
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
GO
SELECT
    P.BusinessEntityID,
    P.PersonType
FROM Person.Person AS P
    WITH (TABLOCK);
GO
ALTER DATABASE AdventureWorks2012
SET READ_COMMITTED_SNAPSHOT OFF
    WITH ROLLBACK IMMEDIATE;

With RCSI active, an index-ordered scan is used with TABLOCK, not the allocation-order scan we saw just before. The reason is the TABLOCK hint specifies a table-level shared lock, but with RCSI enabled, no shared locks are taken. Without the shared table lock, we have not met the requirement to prevent concurrent modifications to the data while the scan is in progress, so an allocation-ordered scan cannot be used.

Achieving an allocation-ordered scan when RCSI is enabled is possible, however. One way is to use a TABLOCKX hint (for a table-level exclusive lock) instead of TABLOCK. We could also retain the TABLOCK hint and add another one like READCOMMITTEDLOCK, or REPEATABLE READ or SERIALIZABLE … and so on. All these work by preventing the possibility of concurrent modifications by taking a shared table lock, at the cost of losing the benefits of RCSI. We can also still achieve an allocation-order scan using a NOLOCK or READUNCOMMITTED hint, of course.

The situation under snapshot isolation (SI) is very similar to RCSI, and not explored in detail for space reasons.

TABLESAMPLE always* performs an allocation-order scan

The TABLESAMPLE clause is an interesting exception to many of the things we have discussed so far.

Specifying a TABLESAMPLE clause always* results in an allocation-order scan, even under RCSI or SI, and even without hints. To be clear about it, the allocation-order scan that results from using TABLESAMPLE retains RCSI/SI semantics – the scan uses row versions and reading does not block writing (and vice versa).

A second surprise is that TABLESAMPLE always* performs an IAM-driven scan even if the table has fewer than 64 pages. This makes some sense because the documentation at least hints that the SYSTEM sampling method uses the IAM structure (so there is no choice but to do an allocation-order scan):

SYSTEM Is an implementation-dependent sampling method specified by ISO standards. In SQL Server, this is the only sampling method available and is applied by default. SYSTEM applies a page-based sampling method in which a random set of pages from the table is chosen for the sample, and all the rows on those pages are returned as the sample subset.

* An exception occurs if the ROWS or PERCENT specification in the TABLESAMPLE clause works out to mean 100% of the table. Specifying more ROWS than the metadata indicates are currently in the table will not work either. Using TABLESAMPLE SYSTEM (100 PERCENT) or equivalent will not force an allocation-order scan.

CHECKPOINT;
DBCC DROPCLEANBUFFERS;
GO
SELECT
    P.BusinessEntityID,
    P.PersonType
FROM Person.Person AS P
    TABLESAMPLE SYSTEM (50 ROWS)
    REPEATABLE (12345678)
    --WITH (TABLOCK);

Results:

Query Results

The effect of TOP and SET ROWCOUNT

In short, neither of these has any effect on the decision to use an allocation-order scan or not. This may seem surprising in cases where it is "obvious" that fewer than 64 pages will be scanned.

For example, the following queries both use an IAM-driven scan to return 5 rows from a scan:

SELECT TOP (5)
    P.BusinessEntityID,
    P.PersonType
FROM Person.Person AS P WITH (TABLOCK)

SET ROWCOUNT 5;

SELECT
    P.BusinessEntityID,
    P.PersonType
FROM Person.Person AS P WITH (TABLOCK)

SET ROWCOUNT 0;

The results are the same for both:

Query Results

This means that TOP and SET ROWCOUNT queries might incur the overhead of setting up an allocation-order scan, even if fewer than 64 pages are scanned. In mitigation, more complex TOP queries with selective predicates pushed into the scan could still benefit from an allocation-order scan. If the scan must process 10,000 pages to find the first 5 rows that match, an allocation-order scan could still be a win.

Preventing all* allocation-order scans instance-wide

This is not something you would ever likely do intentionally, but there is a server setting that will prevent allocation-order scans for all* user queries in all databases.

Unlikely as it may seem, the setting in question is the cursor threshold server configuration option, which has the following description in Books Online:

The cursor threshold option specifies the number of rows in the cursor set at which cursor keysets are generated asynchronously. When cursors generate a keyset for a result set, the query optimizer estimates the number of rows that will be returned for that result set. If the query optimizer estimates that the number of returned rows is greater than this threshold, the cursor is generated asynchronously, allowing the user to fetch rows from the cursor while the cursor continues to be populated. Otherwise, the cursor is generated synchronously, and the query waits until all rows are returned.

If the cursor threshold option is set to anything other than –1 (the default), no allocation-order scans will occur for user queries in any database on the SQL Server instance.

In other words, if asynchronous cursor population is enabled, no IAM-driven scans for you.

* The exception is (non-100%) TABLESAMPLE queries. The internal queries generated by the system for statistics creation and statistics updates also continue to use allocation-ordered scans.

CHECKPOINT;
DBCC DROPCLEANBUFFERS;
GO

-- WARNING! Disables allocation-order scans instance-wide
EXECUTE sys.sp_configure 
    @configname = 'cursor threshold',
    @configvalue = 5000;

RECONFIGURE WITH OVERRIDE;
GO

-- Would normally result in an allocation-order scan
SELECT
    P.BusinessEntityID,
    P.PersonType
FROM Person.Person AS P
    WITH (READUNCOMMITTED);
GO

-- Reset to default allocation-order scans
EXECUTE sys.sp_configure 
    @configname = 'cursor threshold',
    @configvalue = -1;

RECONFIGURE WITH OVERRIDE;

Results (no allocation-order scan):

Query Results

One can only guess that asynchronous cursor population does not work well with allocation-order scans for some reason. It is entirely unexpected that this restriction would affect all non-cursor user queries as well though. Perhaps it is too hard for SQL Server to detect if a query is running as part of an externally-issued API cursor? Who knows.

It would be nice if this side-effect were officially documented somewhere, though it is hard to know exactly where it should go in Books Online. I wonder how many production systems out there are not using allocation-order scans because of this? Maybe not many, but you never know.

To wrap things up, here is a summary. An allocation-ordered scan is available if:

The server option cursor threshold is set to –1 (the default); and,
the query plan scan operator has the Ordered:False property; and,
the total data_pages of the IN_ROW_DATA allocation units is at least 64; and,
either:
1. SQL Server has an acceptable guarantee that concurrent modifications are impossible; or,
2. the scan is running at the read uncommitted isolation level.

Regardless of all the above, a scan with a TABLESAMPLE clause always uses allocation-ordered scans (with the one technical exception noted in the main text).

11 thoughts on “Allocation Order Scans”

tobi says:

January 23, 2015 at 4:25 PM

Why don't read only filegroups enable allocation order scans for indexes purely contained in such filegroups? Or for those partitions that are in a read only file group.
Davos says:

February 17, 2015 at 12:40 PM

If you have a datawarehouse being loaded daily, would it be a good idea to set it to 'readonly' at the end of the ETL process?

I read a lot of places where people warn against using NOLOCK or transaction isolation levels other than the default, but does that really apply to a system not being written-to during operation?

Is retrieving data in allocation order always a good thing? Does the read-ahead reads benefit you mentioned apply to SSDs? What if your query uses the data retrieved from this table and then uses it to join to another table. If you retrieve the data where Ordered=True then perhaps you will get a fast merge join without a sort operator. If you get the data in allocation order maybe it will force a sort, or use a hash instead. What are the pros & cons of these things?

Not that you haven't thoroughly covered this topic in this blog post (which is awesome btw) I just have more questions!
Paul White says:

February 17, 2015 at 2:19 PM

Maybe because the IAM pages that drive the scan aren't guaranteed to be in the same filegroup? I don't really know. It could also simply be an implementation gap. One thing's for sure: it doesn't work that way today.
Paul White says:

February 17, 2015 at 2:25 PM

Setting the DW to read only at the end of the ETL process might be a good idea, yes. One of the factors to consider in making that decision would be the usefulness of allocation-order scans. A database that isn't being written to doesn't have to worry about isolation levels, no. At least, not in terms of transaction isolation guarantees.

No, retrieving data in allocation order is not always a good thing. There is a definite start-up cost, so the strategy only pays off if larger I/O is actually realized, as I mentioned in the main text. Yes, larger I/O is generally beneficial whatever the underlying storage system is. The query optimizer accounts for some ordered-data possibilities when deciding on the final plan, but you're right to mention the possibility of order preservation because it is possible for the QO to miss opportunities in this area. That's not an observation that is specific to allocation-ordered scans though. It is sometimes possible for a skilled human to do better than the QO where order-preservation is a factor.
Davos says:

February 20, 2015 at 5:09 AM

Thanks Paul, very helpful.
Paul Randal says:

March 13, 2015 at 11:16 PM

Just FYI – IAM pages will always be from the same filegroup, but not necessarily from the same file as the GAM interval they represent.
Pingback: Data recovery: investigating weird SELECT failures around corruption - Paul S. Randal
Andrei says:

May 3, 2017 at 11:25 PM

Is there any way to find out the type of the scan – b-tree or allocation order? Seems like there is nothing in execution plan indicating this.
Paul White says:

May 4, 2017 at 6:30 AM

Thank you, Paul!
Paul White says:

May 4, 2017 at 7:25 AM

No way I am aware of. I did look for things – extended events, trace flags etc. – but didn't find anything, hence the debugger shots in the article. Would be great if Microsoft were to add this information somewhere.
neeraj prasad sharma says:

June 2, 2017 at 4:47 PM

hi paul,
I have seen if the storage engine chooses Allocation Order Scan(IAM driven scan) then number of logical reads = number of pages in the table (data+index), when it uses B tree structure its data+index +1 in serial execution. In parallel version logical reads are data+index +x number of pages. can you explain it.

Comments are closed.