Aaron Bertrand

Delayed Durability in SQL Server 2014

SentryOne eBooks

In these books, you will find useful, hand-picked articles that will help give insight into some of your most vexing performance problems. These articles were written by several of the SQL Server industry’s leading experts, including Paul White, Paul Randal, Jonathan Kehayias, Erin Stellato, Glenn Berry, Aaron Bertrand, and Joe Sack.

Free Download

Featured Author

Erin Stellato is a Principal Consultant with SQLskills and a Microsoft Data Platform MVP.

Erin’s Posts

Delayed Durability is a late-breaking but interesting feature in SQL Server 2014; the high-level elevator pitch of the feature is, quite simply:

    "Trade durability for performance."

Some background first. By default, SQL Server uses a write-ahead log (WAL), which means that changes are written to the log before they are allowed to be committed. In systems where transaction log writes become the bottleneck, and where there is a moderate tolerance for data loss, you now have the option to temporarily suspend the requirement to wait for the log flush and acknowledgement. This happens to quite literally take the D out of ACID, at least for a small portion of data (more on this later).

You kind of already make this sacrifice now. In full recovery mode, there is always some risk of data loss, it's just measured in terms of time rather than size. For example, if you back up the transaction log every five minutes, you could lose up to just under 5 minutes of data if something catastrophic happened. I'm not talking simple failover here, but let's say the server literally catches on fire or someone trips over the power cord – the database may very well be unrecoverable and you may have to go back to the point in time of the last log backup. And that's assuming you are even testing your backups by restoring them somewhere – in the event of a critical failure you may not have the recovery point you think you have. We tend not to think about this scenario, of course, because we never expect bad things™ to happen.

How it works

Delayed durability enables write transactions to continue running as if the log had been flushed to disk; in reality, the writes to disk have been grouped and deferred, to be handled in the background. The transaction is optimistic; it assumes that the log flush will happen. The system uses a 60KB chunk of log buffer, and attempts to flush the log to disk when this 60KB block is full (at the latest – it can and often will happen before that). You can set this option at the database level, at the individual transaction level, or – in the case of natively compiled procedures in In-Memory OLTP – at the procedure level. The database setting wins in the case of a conflict; for example, if the database is set to disabled, trying to commit a transaction using the delayed option will simply be ignored, with no error message. Also, some transactions are always fully durable, regardless of database settings or commit settings; for example, system transactions, cross-database transactions, and operations involving FileTable, Change Tracking, Change Data Capture, and Replication.

At the database level, you can use:

ALTER DATABASE dbname SET DELAYED_DURABILITY = DISABLED | ALLOWED | FORCED;

If you set it to ALLOWED, this means that any individual transaction can use Delayed Durability; FORCED means that all transactions that can use Delayed Durability will (the exceptions above are still relevant in this case). You will likely want to use ALLOWED rather than FORCED – but the latter can be useful in the case of an existing application where you want to use this option throughout and also minimize the amount of code that has to be touched. An important thing to note about ALLOWED is that fully durable transactions may have to wait longer, as they will force the flush of any delayed durable transactions first.

At the transaction level, you can say:

COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);

And in an In-Memory OLTP natively-compiled procedure, you can add the following option to the BEGIN ATOMIC block:

BEGIN ATOMIC WITH (DELAYED_DURABILITY = ON, ...)

A common question is around what happens with locking and isolation semantics. Nothing changes, really. Locking and blocking still happen, and transactions are committed in the same way and with the same rules. The only difference is that, by allowing the commit to occur without waiting for the log to flush to disk, any related locks are released that much sooner.

When You Should Use It

In addition to the benefit you get from allowing the transactions to proceed without waiting for the log write to happen, you also get fewer log writes of larger sizes. This can work out very well if your system has a high proportion of transactions that are actually smaller than 60KB, and particularly when the log disk is slow (though I found similar benefits on SSD and traditional HDD). It doesn't work out so well if your transactions are, for the most part, larger than 60KB, if they are typically long-running, or if you have high throughput and high concurrency. What can happen here is that you can fill the entire log buffer before the flush finishes, which just means transferring your waits to a different resource and, ultimately, not improving the perceived performance by the users of the application.

In other words, if your transaction log is not currently a bottleneck, don't turn this feature on. How can you tell if your transaction log is currently a bottleneck? The first indicator would be high WRITELOG waits, particularly when coupled with PAGEIOLATCH_**. Paul Randal (@PaulRandal) has a great four-part series on identifying transaction log problems, as well as configuring for optimal performance:

Also see this blog post from Kimberly Tripp (@KimberlyLTripp), 8 Steps to Better Transaction Log Throughput, and the SQL CAT team's blog post, Diagnosing Transaction Log Performance Issues and Limits of the Log Manager.

This investigation may lead you to the conclusion that Delayed Durability is worth looking into; it may not. Testing your workload will be the most reliable way to know for sure. Like many other additions in recent versions of SQL Server (*cough* Hekaton), this feature is NOT designed to improve every single workload – and as noted above, it can actually make some workloads worse. See this blog post by Simon Harvey for some other questions you should ask yourself about your workload to determine if it is feasible to sacrifice some durability to achieve better performance.

Potential for data loss

I'm going to mention this several times, and add emphasis every time I do: You need to be tolerant to data loss. Under a well-performing disk, the maximum you should expect to lose in a catastrophe – or even a planned and graceful shutdown – is up to one full block (60KB). However, in the case where your I/O subsystem can't keep up, it is possible that you could lose as much as the entire log buffer (~7MB).

To clarify, from the documentation (emphasis mine):

For delayed durability, there is no difference between an unexpected shutdown and an expected shutdown/restart of SQL Server. Like catastrophic events, you should plan for data loss. In a planned shutdown/restart some transactions that have not been written to disk may first be saved to disk, but you should not plan on it. Plan as though a shutdown/restart, whether planned or unplanned, loses the data the same as a catastrophic event.

So it is very important that you weigh your data loss risk with your need to alleviate transaction log performance issues. If you run a bank or anything dealing with money, it may be much safer and more appropriate for you to move your log to faster disk than to roll the dice using this feature. If you are trying to improve the response time in your Web Gamerz Chat Room application, maybe the risk is less severe.

You can control this behavior to some degree in order to minimize your risk of data loss. You can force all delayed durable transactions to be flushed to disk in one of two ways:

  1. Commit any fully durable transaction.
  2. Call sys.sp_flush_log manually.

This allows you to revert to controlling data loss in terms of time, rather than size; you could schedule the flush every 5 seconds, for example. But you will want to find your sweet spot here; flushing too often can offset the Delayed Durability benefit in the first place. In any case, you will still need to be tolerant to data loss, even if it is only <n> seconds' worth.

You would think that CHECKPOINT might help here, but this operation actually does not technically guarantee the log will be flushed to disk.

Interaction with HA/DR

You might wonder how Delayed Durablity functions with HA/DR features such as log shipping, replication, and Availability Groups. With most of these it works unchanged. Log shipping and replication will replay the log records that have been hardened, so the same potential for data loss exists there. With AGs in asynchronous mode, we're not waiting for the secondary acknowledge anyway, so it will behave the same as today. With synchronous, however, we can't commit on the primary until the transaction is committed and hardened to the remote log. Even in that scenario we may have some benefit locally by not having to wait for the local log to write, we still have to wait for the remote activity. So in that scenario there is less benefit, and potentially none; except perhaps in the rare scenario where the primary's log disk is really slow and the secondary's log disk is really fast. I suspect the same conditions hold true for sync/async mirroring, but you won't get any official commitment from me about how a shiny new feature works with a deprecated one. :-)

Performance Observations

This wouldn't be much of a post here if I didn't show some actual performance observations. I set up 8 databases to test the effects of two different workload patterns with the following attributes:

  • Recovery model: simple vs. full
  • Log location: SSD vs. HDD
  • Durability: delayed vs. fully durable

I am really, really, really lazyefficient about this kind of thing. Since I want to avoid repeating the same operations within each database, I created the following table temporarily in model:

USE model;
GO

CREATE TABLE dbo.TheTable
(
  TheID INT IDENTITY(1,1) PRIMARY KEY,
  TheDate DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  RowGuid UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID()
);

Then I built a set of dynamic SQL command to builds these 8 databases, rather than create the databases individually and then muck with the settings:

-- C and D are SSD, G is HDD

DECLARE @sql NVARCHAR(MAX) = N'';

;WITH l AS (SELECT l FROM (VALUES('D'),('G')) AS l(l)),
r AS (SELECT r FROM (VALUES('FULL'),('SIMPLE')) AS r(r)),
d AS (SELECT d FROM (VALUES('FORCED'),('DISABLED')) AS d(d)),
x AS (SELECT l.l, r.r, d.d, n = CONVERT(CHAR(1),ROW_NUMBER() OVER 
  (ORDER BY d.d DESC, l.l)) FROM l CROSS JOIN r CROSS JOIN d)
SELECT @sql += N'
CREATE DATABASE dd' + n + ' ON '
+ '(name = ''dd' + n + '_data'','
+ ' filename = ''C:\SQLData\dd' + n + '.mdf'', size = 1024MB)
LOG ON (name = ''dd' + n + '_log'','
+ ' filename = ''' + l + ':\SQLLog\dd' + n + '.ldf'', size = 1024MB);
  ALTER DATABASE dd' + n + ' SET RECOVERY ' + r  + ';
  ALTER DATABASE dd' + n + ' SET DELAYED_DURABILITY = ' + d + ';'
FROM x ORDER BY d, l;

PRINT @sql;
-- EXEC sp_executesql @sql;

Feel free to run this code yourself (with the EXEC still commented out) to see that this would create 4 databases with Delayed Durability OFF (two in FULL recovery, two in SIMPLE, one of each with log on slow disk, and one of each with log on SSD). Repeat that pattern for 4 databases with Delayed Durability FORCED – I did this to simplify the code in the test, rather than to reflect what I would do in real life (where I would likely want to treat some transactions as critical, and some as, well, less than critical).

For sanity checking, I ran the following query to ensure that the databases had the right matrix of attributes:

SELECT d.name, d.recovery_model_desc, d.delayed_durability_desc, 
  log_disk = CASE WHEN mf.physical_name LIKE N'D%' THEN 'SSD' else 'HDD' END
FROM sys.databases AS d
INNER JOIN sys.master_files AS mf
ON d.database_id = mf.database_id
WHERE d.name LIKE N'dd[1-8]'
AND mf.[type] = 1; -- log

Results:

name recovery_model delayed_durability log_disk
dd1 FULL FORCED SSD
dd2 SIMPLE FORCED SSD
dd3 FULL FORCED HDD
dd4 SIMPLE FORCED HDD
dd5 FULL DISABLED SSD
dd6 SIMPLE DISABLED SSD
dd7 FULL DISABLED HDD
dd8 SIMPLE DISABLED HDD

Relevant configuration of the 8 test databases

I also ran the test cleanly multiple times to ensure that a 1 GB data file and 1 GB log file would be sufficient to run the entire set of workloads without introducing any autogrowth events into the equation. As a best practice, I routinely go out of my way to ensure customers' systems have enough allocated space (and proper alerts built in) such that no growth event ever occurs at an unexpected time. In the real world I know this doesn't always happen, but it is ideal.

I set up the system to be monitored with SQL Sentry – this would allow me to easily show most of the performance metrics I wanted to highlight. But I also created a temporary table to store batch metrics including duration and very specific output from sys.dm_io_virtual_file_stats:

SELECT test = 1, cycle = 1, start_time = GETDATE(), * 
INTO #Metrics 
FROM sys.dm_io_virtual_file_stats(DB_ID('dd1'), 2) WHERE 1 = 0;

This would allow me to record the start and finish time of each individual batch, and measure deltas in the DMV between start time and end time (only reliable in this case because I know I'm the only user on the system).

    Lots of small transactions

    The first test I wanted to perform was a lot of small transactions. For each database, I wanted to end up with 500,000 separate batches of a single insert each:

    INSERT #Metrics SELECT 1, 1, GETDATE(), * 
      FROM sys.dm_io_virtual_file_stats(DB_ID('dd1'), 2);
    GO
    INSERT dbo.TheTable DEFAULT VALUES;
    GO 500000
    INSERT #Metrics SELECT 1, 2, GETDATE(), * 
      FROM sys.dm_io_virtual_file_stats(DB_ID('dd1'), 2);

    Remember, I try to be lazyefficient about this kind of thing. So to generate the code for all 8 databases, I ran this:

    ;WITH x AS 
    (
      SELECT TOP (8) number FROM master..spt_values 
      WHERE type = N'P' ORDER BY number
    )
    SELECT CONVERT(NVARCHAR(MAX), N'') + N'
    INSERT #Metrics SELECT 1, 1, GETDATE(), * 
      FROM sys.dm_io_virtual_file_stats(DB_ID(''dd' + RTRIM(number+1) + '''), 2);
    GO
    INSERT dbo.TheTable DEFAULT VALUES;
    GO 500000
    INSERT #Metrics SELECT 1, 2, GETDATE(), * 
      FROM sys.dm_io_virtual_file_stats(DB_ID(''dd' + RTRIM(number+1) + '''), 2);'
    FROM x;

    I ran this test and then looked at the #Metrics table with the following query:

    SELECT 
      [database] = db_name(m1.database_id),
      num_writes = m2.num_of_writes - m1.num_of_writes, 
      write_bytes = m2.num_of_bytes_written - m1.num_of_bytes_written,
      bytes_per_write = (m2.num_of_bytes_written - m1.num_of_bytes_written)*1.0
        /(m2.num_of_writes - m1.num_of_writes),
      io_stall_ms = m2.io_stall_write_ms - m1.io_stall_write_ms,
      m1.start_time,
      end_time = m2.start_time,
      duration = DATEDIFF(SECOND, m1.start_time, m2.start_time)
    FROM #Metrics AS m1
    INNER JOIN #Metrics AS m2
    ON m1.database_id = m2.database_id
    WHERE m1.cycle = 1 AND m2.cycle = 2
    AND m1.test = 1 AND m2.test = 1;

    This yielded the following results (and I confirmed through multiple tests that the results were consistent):

    database writes bytes bytes/write io_stall_ms start_time end_time duration (seconds)
    dd1 8,068 261,894,656 32,460.91 6,232 2014-04-26 17:20:00 2014-04-26 17:21:08 68
    dd2 8,072 261,682,688 32,418.56 2,740 2014-04-26 17:21:08 2014-04-26 17:22:16 68
    dd3 8,246 262,254,592 31,803.85 3,996 2014-04-26 17:22:16 2014-04-26 17:23:24 68
    dd4 8,055 261,688,320 32,487.68 4,231 2014-04-26 17:23:24 2014-04-26 17:24:32 68
    dd5 500,012 526,448,640 1,052.87 35,593 2014-04-26 17:24:32 2014-04-26 17:26:32 120
    dd6 500,014 525,870,080 1,051.71 35,435 2014-04-26 17:26:32 2014-04-26 17:28:31 119
    dd7 500,015 526,120,448 1,052.20 50,857 2014-04-26 17:28:31 2014-04-26 17:30:45 134
    dd8 500,017 525,886,976 1,051.73 49,680 2014-04-26 17:30:45 2014-04-26 17:32:58 133

    Small transactions: Duration and results from sys.dm_io_virtual_file_stats

    Definitely some interesting observations here:

    • Number of individual write operations was very small for the Delayed Durability databases (~60X for traditional).
    • Total number of bytes written was cut in half using Delayed Durability (I presume because all of the writes in the traditional case contained a lot of wasted space).
    • The number of bytes per write was a lot higher for Delayed Durability. This was not overly surprising, since the whole purpose of the feature is to bundle writes together in larger batches.
    • The total duration of I/O stalls was volatile, but roughly an order of magnitude lower for Delayed Durability. The stalls under fully durable transactions were much more sensitive to the type of disk.
    • If anything hasn't convinced you so far, the duration column is very telling. Fully durable batches that take two minutes or more are cut almost in half.

    The start/end time columns allowed me to focus on the Performance Advisor dashboard for the precise period where these transactions were happening, where we can draw a lot of additional visual indicators:

    small_tx_dashboard
    SQL Sentry dashboard – click to enlarge

    Further observations here:

    • On several graphs, you can clearly see exactly when the non-Delayed Durability portion of the batch took over (~5:24:32 PM).
    • There is no observable impact to CPU or memory when using Delayed Durability.
    • You can see a tremendous impact to batches/transactions per second in the first graph under SQL Server Activity.
    • SQL Server waits go through the roof when the fully durable transactions started. These were comprised almost exclusively of WRITELOG waits, with a small number of PAGEIOLOATCH_EX and PAGEIOLATCH_UP waits for good measure.
    • The total number of log flushes throughout the Delayed Durability operations was quite small (low 100s/sec), while this jumped to over 4,000/sec for the traditional behavior (and slightly lower for the HDD duration of the test).
    Fewer, larger transactions

    For the next test, I wanted to see what would happen if we performed fewer operations, but made sure each statement affected a larger amount of data. I wanted this batch to run against each database:

    CREATE TABLE dbo.Rnd
    (
      batch TINYINT,
      TheID INT
    );
    
    INSERT dbo.Rnd SELECT TOP (1000) 1, TheID FROM dbo.TheTable ORDER BY NEWID();
    INSERT dbo.Rnd SELECT TOP (10)   2, TheID FROM dbo.TheTable ORDER BY NEWID();
    INSERT dbo.Rnd SELECT TOP (300)  3, TheID FROM dbo.TheTable ORDER BY NEWID();
    GO
    
    INSERT #Metrics SELECT 1, GETDATE(), * 
      FROM sys.dm_io_virtual_file_stats(DB_ID('dd1'), 2);
    GO
    UPDATE t SET TheDate = DATEADD(MINUTE, 1, TheDate)
      FROM dbo.TheTable AS t
      INNER JOIN dbo.Rnd AS r
      ON t.TheID = r.TheID
      WHERE r.batch = 1;
    GO 10000
    UPDATE t SET RowGuid = NEWID()
      FROM dbo.TheTable AS t
      INNER JOIN dbo.Rnd AS r
      ON t.TheID = r.TheID
      WHERE r.batch = 2;
    GO 10000
    DELETE dbo.TheTable WHERE TheID IN (SELECT TheID   FROM dbo.Rnd WHERE batch = 3);
    DELETE dbo.TheTable WHERE TheID IN (SELECT TheID+1 FROM dbo.Rnd WHERE batch = 3);
    DELETE dbo.TheTable WHERE TheID IN (SELECT TheID-1 FROM dbo.Rnd WHERE batch = 3);
    GO
    INSERT #Metrics SELECT 2, GETDATE(), * 
      FROM sys.dm_io_virtual_file_stats(DB_ID('dd1'), 2);

    So again I used the lazy method to produce 8 copies of this script, one per database:

    ;WITH x AS (SELECT TOP (8) number FROM master..spt_values WHERE type = N'P' ORDER BY number)
    SELECT N'
    USE dd' + RTRIM(Number+1) + ';
    GO
    
    CREATE TABLE dbo.Rnd
    (
      batch TINYINT,
      TheID INT
    );
    
    INSERT dbo.Rnd SELECT TOP (1000) 1, TheID FROM dbo.TheTable ORDER BY NEWID();
    INSERT dbo.Rnd SELECT TOP (10)   2, TheID FROM dbo.TheTable ORDER BY NEWID();
    INSERT dbo.Rnd SELECT TOP (300)  3, TheID FROM dbo.TheTable ORDER BY NEWID();
    GO
    
    INSERT #Metrics SELECT 2, 1, GETDATE(), * 
      FROM sys.dm_io_virtual_file_stats(DB_ID(''dd' + RTRIM(number+1) + ''', 2);
    GO
    UPDATE t SET TheDate = DATEADD(MINUTE, 1, TheDate)
      FROM dbo.TheTable AS t
      INNER JOIN dbo.rnd AS r
      ON t.TheID = r.TheID
      WHERE r.cycle = 1;
    GO 10000
    UPDATE t SET RowGuid = NEWID()
      FROM dbo.TheTable AS t
      INNER JOIN dbo.rnd AS r
      ON t.TheID = r.TheID
      WHERE r.cycle = 2;
    GO 10000
    DELETE dbo.TheTable WHERE TheID IN (SELECT TheID   FROM dbo.rnd WHERE cycle = 3);
    DELETE dbo.TheTable WHERE TheID IN (SELECT TheID+1 FROM dbo.rnd WHERE cycle = 3);
    DELETE dbo.TheTable WHERE TheID IN (SELECT TheID-1 FROM dbo.rnd WHERE cycle = 3);
    GO
    INSERT #Metrics SELECT 2, 2, GETDATE(), * 
      FROM sys.dm_io_virtual_file_stats(DB_ID(''dd' + RTRIM(number+1) + '''), 2);'
    FROM x;

    I ran this batch, then changed the query against #Metrics above to look at the second test instead of the first. The results:

    database writes bytes bytes/write io_stall_ms start_time end_time duration (seconds)
    dd1 20,970 1,271,911,936 60,653.88 12,577 2014-04-26 17:41:21 2014-04-26 17:43:46 145
    dd2 20,997 1,272,145,408 60,587.00 14,698 2014-04-26 17:43:46 2014-04-26 17:46:11 145
    dd3 20,973 1,272,982,016 60,696.22 12,085 2014-04-26 17:46:11 2014-04-26 17:48:33 142
    dd4 20,958 1,272,064,512 60,695.89 11,795 2014-04-26 17:48:33 2014-04-26 17:50:56 143
    dd5 30,138 1,282,231,808 42,545.35 7,402 2014-04-26 17:50:56 2014-04-26 17:53:23 147
    dd6 30,138 1,282,260,992 42,546.31 7,806 2014-04-26 17:53:23 2014-04-26 17:55:53 150
    dd7 30,129 1,281,575,424 42,536.27 9,888 2014-04-26 17:55:53 2014-04-26 17:58:25 152
    dd8 30,130 1,281,449,472 42,530.68 11,452 2014-04-26 17:58:25 2014-04-26 18:00:55 150

    Larger transactions: Duration and results from sys.dm_io_virtual_file_stats

    This time, the impact of Delayed Durability is much less noticeable. We see a slightly smaller number of write operations, at a slightly larger number of bytes per write, with the total bytes written almost identical. In this case we actually see the I/O stalls are higher for Delayed Durability, and this likely accounts for the fact that the durations were almost identical as well.

    From the Performance Advisor dashboard, we some similarities with the previous test, and some stark differences as well:

    large_tx_dashboard
    SQL Sentry dashboard – click to enlarge

    One of the big differences to point out here is that the delta in wait stats is not quite as pronounced as with the previous test – there is still a much higher frequency of WRITELOG waits for the fully durable batches, but nowhere near the levels seen with the smaller transactions. Another thing you can spot immediately is that the previously observed impact on batches and transactions per second is no longer present. And finally, while there are more log flushes with fully durable transactions than when delayed, this disparity is far less pronounced than with the smaller transactions.

Conclusion

It should be clear that there are certain workload types that may benefit greatly from Delayed Durability – provided, of course, that you have a tolerance for data loss. This feature is not restricted to In-Memory OLTP, is available on all editions of SQL Server 2014, and can be implemented with little to no code changes. It can certainly be a powerful technique if your workload can support it. But again, you will need to test your workload to be sure that it will benefit from this feature, and also strongly consider whether this increases your exposure to the risk of data loss.

As an aside, this may seem to the SQL Server crowd like a fresh new idea, but in truth Oracle introduced this as "Asynchronous Commit" in 2006 (see COMMIT WRITE ... NOWAIT as documented here and blogged about in 2007). And the idea itself has been around for nearly 3 decades; see Hal Berenson's brief chronicle of its history.

Next Time

One idea that I have batted around is to try to improve the performance of tempdb by forcing Delayed Durability there. One special property of tempdb that makes it such a tempting candidate is that it is transient by nature – anything in tempdb is designed, explicitly, to be tossable in the wake of a wide variety of system events. I am saying this now without having any idea if there is a workload shape where this will work out well; but I do plan to try it out, and if I find anything interesting, you can be sure I will post about it here.