The APPEND_ONLY_STORAGE_INSERT_POINT Latch

Continuing my series of articles on latches, this time I’m going to discuss the APPEND_ONLY_STORAGE_INSERT_POINT latch and show how it can be a major bottleneck for heavy update workloads where either form of snapshot isolation is being used.

I strongly recommend you read the initial post in the series before this one, so you have all the general background knowledge about latches.

What Is the APPEND_ONLY_STORAGE_INSERT_POINT Latch?

To explain this latch, I need to explain a little bit about how snapshot isolation works.

When you enable one of the two forms of versioning, SQL Server uses a mechanism called versioning to preserve pre-change versions of a record in the version store in tempdb. This is done as follows:

A record is identified as just about to be changed.
The current record is copied into the version store.
The record is changed.
If the record didn’t already have a 14-byte versioning tag, one is added to the end of the record. The tag contains a timestamp (not a real time) and a pointer to the previous version of the record in the version store.
If the record did already have a versioning tag, it’s updated with the new timestamp and version store pointer.

The instance-wide versioning timestamp is incremented whenever a new statement or batch begins, or a new version of a record is created, in any database where either form of snapshot isolation is enabled. This timestamp is used to make sure a query processes the correct version of a record.

For instance, imagine a database that had read committed snapshot enabled, so each statement is guaranteed to see the records as of the time the statement started. The versioning timestamp is set for when the statement started, so any record it encounters that has a higher timestamp is the “wrong” version, and so the “right” version, with a timestamp before the statement’s timestamp, has to be retrieved from the version store. The mechanics of this process aren’t relevant for the purposes of this post.

So, how are the versions physically stored in the version store? The whole pre-change record, including off-row columns, is copied into the version store, broken down into 8,000-byte chunks, which can span two pages if necessary (e.g., 2,000 bytes at the end of one page and 6,000 bytes at the start of the next). This special-purpose storage is made up of append-only allocation units and is only used for version store operations. It is called that because new data can only ever be appended immediately after the end of the most recently entered version. A new allocation unit is created every so often, and this allows regular version store cleanup to be very efficient—as an unneeded allocation unit can simply be dropped. Again, the mechanics of that are beyond the scope of this post.

And now we get to the definition of the latch: any thread that needs to copy a pre-change record into the version store needs to know where the insertion point is in the current append-only allocation unit. This information is protected by the APPEND_ONLY_STORAGE_INSERT_POINT latch.

How Does the Latch Become a Bottleneck?

Here’s the problem: there’s only one acceptable mode in which the APPEND_ONLY_STORAGE_INSERT_POINT latch can be acquired: EX mode (exclusive). And as you’ll know from reading the intro post to the series, only one thread at a time can hold the latch in EX mode.

Pulling all this information together: when one or more databases has snapshot isolation enabled, and there’s a high-enough concurrent workload of updates to those databases, there will be a lot of versions being generated by the various connections, and this latch will become a bit of a bottleneck, with the bottleneck size increasing as the update workload increases where versioning is involved.

Showing the Bottleneck

You can easily reproduce the bottleneck for yourself. I did so as follows:

Created a table with a bunch of integer columns named cXXX where XXX is a number and a clustered index on an int identity column named DocID
Inserted 100,000 records, with random values for all the columns
Created a script with an infinite loop to select a random DocID in the range 1 to 10,000, select a random column name, and increment the column value by 1 (hence creating a version)
Created nine identical scripts, but each selecting from a different 10,000 value cluster key range
Set DELAYED_DURABILITY to FORCED to reduce WRITELOG waits (admittedly you’d rarely be doing this, but it helps to exacerbate the bottleneck for demo purposes)

I then ran all ten scripts simultaneously and measured the Access Methods: Index Searches/sec counter to track how many updates were occurring. I couldn’t use Databases: Batch Requests/sec as each script only had one batch (the infinite loop), and I didn’t want to use Transactions/sec as it might count internal transactions as well as the one wrapping each update.

When snapshot isolation wasn’t enabled, on my Windows 10 laptop running SQL Server 2019, I was getting about 80,000 updates per second across the ten connections. Then when I turned the setting READ_COMMMITED_SNAPSHOT to ON for the database and re-ran the test, the workload throughput dropped to about 60,000 updates per second (a 25% drop in throughput). From looking at wait statistics, 85% of all waits were LATCH_EX, and from looking at latch statistics, 100% were for APPEND_ONLY_STORAGE_INSERT_POINT.

Bear in mind that I set up the scenario to show the bottleneck at its worst. In a real-life environment with a mixed workload, the generally accepted guidance for a throughput drop when using snapshot isolation is 10-15%.

Summary

One other potential area that could be impacted by this bottleneck is availability group readable secondaries. If a database replica is set to be readable, all queries against it automatically use snapshot isolation, and all replay of log records from the primary will generate versions. With a high enough update workload coming from the primary and many databases set to be readable, and with parallel redo being the norm for availability group secondaries, the APPEND_ONLY_STORAGE_INSERT_POINT latch might become a bottleneck on an availability group readable secondary too, which could lead to the secondary falling behind the primary. I haven’t tested this, but it’s exactly the same mechanism I’ve described above, so it seems likely. In that case, it’s possible to disable parallel redo using trace flag 3459, but this might lead to worse overall throughput on the secondary.

Putting the availability group scenario aside, unfortunately, not using snapshot isolation is the only way to completely avoid this bottleneck, which isn’t a viable option if your workload relies on the semantics provided by snapshot isolation, or you need it to alleviate blocking problems (as snapshot isolation means read queries don’t acquire share locks which block change queries).

Edit: from the comments below, you *can* remove the latch bottleneck by using ADR in SQL Server 2019 but then performance is much worse because of the ADR overhead. The scenario where the latch becomes a bottleneck because of the high update workload is absolutely not a valid use case for ADR.

2 thoughts on “The APPEND_ONLY_STORAGE_INSERT_POINT Latch”

Rubén Garrigós says:

August 30, 2021 at 8:02 AM

Hi Paul,

Using SQL Server 2019 ADR and the associated PVS will alleviate the latching problem? If the problem is caused by the load of a group of databases, maybe it can help.

On the other side, using the PVS versus the tempdb "volatile version store" seems to be a slower option. Can it be related with the optimized tempdb logging or there is something different with PVS storage (sys.persistent_version_store) compared with the classic tempdb version store which makes it slower by design?

Thanks,

Ruben
Paul Randal says:

September 3, 2021 at 10:21 PM

Yes – when ADR is enabled, the versions are stored in the PVS in the database rather than in the version store in tempdb. And yes, this would be a lot slower (I saw a drop of 75% throughput when I enabled ADR for my test workload). The latch problem disappears, but then you have likely allocation contention in the user database with the same update workload, unless the PVS is put into a filegroup with enough data files to prevent PFS and GAM page contention, plus all the extra overhead from the ADR process itself compared to the fast versioning mechanism in tempdb. Thanks

This site uses Akismet to reduce spam. Learn how your comment data is processed.