If you are deploying SQL Server Availability Groups, one of the important aspects of a successful deployment is monitoring the synchronization of the secondary replica databases with the primary replica. There are multiple ways to monitor replica synchronization in an Availability Group, and this post will show each of them and explain their benefits and drawbacks,
One of the easiest ways to monitor the status of an Availability Group, each of the replica servers, and the availability databases, is through the built-in dashboard in Management Studio. However, the default layout of the dashboard doesn't provide a lot of details, and will need to be customized to show additional information about the replica servers as well as the availability databases. Additional columns can be added to the layout through the Add/Remove Columns link on the dashboard, or through the right-click context menu on any of the existing column headers, as shown below:
For the availability databases, monitoring the Log Send Queue Size (KB), Log Send Rate (KB/sec), Estimated Data Loss (time), Estimated Recovery Time (seconds), and Synchronization Performance (seconds) will give you a better understanding of how data is flowing to the replicas and the overall health of the availability databases. For example, in the screenshot below, I've modified the VM network configuration for SQL03 so that it has higher latency and lower throughput, which is affecting the synchronization of the databases:
Here we can see that there is nearly six minutes of potential data loss for SQL03, and 505 MB of unsent log that is being sent at a rate of 7 MB/sec to the secondary (which, in this case, is an asynchronous secondary). Whereas SQL02 is currently caught up and has no data loss as a synchronous secondary in the configuration.
An alternative to the Availability Group Dashboard is direct querying of the DMVs, which is where the dashboard pulls its information from as a source. The following query shows the current status and synchronization metrics for each database in an availability group:
SELECT ar.replica_server_name, adc.database_name, ag.name AS ag_name, drs.is_local, drs.is_primary_replica, drs.synchronization_state_desc, drs.is_commit_participant, drs.synchronization_health_desc, drs.recovery_lsn, drs.truncation_lsn, drs.last_sent_lsn, drs.last_sent_time, drs.last_received_lsn, drs.last_received_time, drs.last_hardened_lsn, drs.last_hardened_time, drs.last_redone_lsn, drs.last_redone_time, drs.log_send_queue_size, drs.log_send_rate, drs.redo_queue_size, drs.redo_rate, drs.filestream_send_rate, drs.end_of_log_lsn, drs.last_commit_lsn, drs.last_commit_time FROM sys.dm_hadr_database_replica_states AS drs INNER JOIN sys.availability_databases_cluster AS adc ON drs.group_id = adc.group_id AND drs.group_database_id = adc.group_database_id INNER JOIN sys.availability_groups AS ag ON ag.group_id = drs.group_id INNER JOIN sys.availability_replicas AS ar ON drs.group_id = ar.group_id AND drs.replica_id = ar.replica_id ORDER BY ag.name, ar.replica_server_name, adc.database_name;
By querying the DMVs directly on the primary replica, it is easy to get up to date information without waiting for the refresh period of the dashboard in Management Studio. This has been useful a few times while consulting with clients that had a link failure between data centers, or where the connectivity was down for maintenance for a duration of time, and the secondary replicas are in the process of catching up once the connection was restored.
The final native tool for monitoring Availability Group synchronization is Performance Monitor, using the SQLServer:Database Replica performance object. The table below shows the relevant performance counters and their descriptions from Books Online (https://msdn.microsoft.com/en-us/library/ff878356(v=sql.110).aspx):
|File Bytes Received/sec||Amount of FILESTREAM data received by the secondary replica for the secondary database in the last second.|
|Log Bytes Received/sec||Amount of log records received by the secondary replica for the database in the last second.|
|Log remaining for undo||The amount of log in kilobytes remaining to complete the undo phase.|
|Log Send Queue||Amount of log records in the log files of the primary database, in kilobytes, that has not yet been sent to the secondary replica. This value is sent to the secondary replica from the primary replica. Queue size does not include FILESTREAM files that are sent to a secondary.|
|Recovery Queue||Amount of log records in the log files of the secondary replica that has not yet been redone.|
|Redo blocked/sec||Number of times the redo thread is blocked on locks held by readers of the database.|
|Redo Bytes Remaining||The amount of log in kilobytes remaining to be redone to finish the reverting phase.|
|Redone Bytes/sec||Amount of log records redone on the secondary database in the last second.|
|Total kilobytes of log that must be undone.|
One of the challenges and limitations of using Performance Monitor to monitor the environment is that the object is valid only on the instance of SQL Server that hosts a secondary replica. This means that you have to add the counters from each secondary replica into Performance Monitor get a full view of what is happening with all of the secondary databases, where both the AG Dashboard in Management Studio, and the DMV query against the primary replica, provide information about all of the secondary databases in a single location.
As an alternative to the built-in features for monitoring Availability Group synchronization, you can also leverage third party tools like SQL Sentry Performance Advisor, which includes monitoring of Availability Groups as a standard feature. You can read more about this feature in this blog post from Greg Gonzalez that first introduced the feature in version 7.5 of the product.
The Replicas tab in Performance Advisor allows each of the secondary replica servers to be expanded to show the databases and their current synchronization data easily. The WSFC Node/Group Matrix default layout at the top of the dashboard also gives health information about the primary replica's send queue status, the secondary replica's redo queue status, and the flow of data between each of the replica servers. In this example, we can see that the log send queue on the primary is currently sending a large amount of data from SQL01 to SQL03, based on the width of the line between the servers, after the connectivity issues between SQL01 and SQL03 were corrected in the environment. The chart on the right shows the rate at which data is being transferred from SQL01, along with the current size of the Log Send Queue, since that is the replica selected on the left hand side. Clicking on one of the other replica servers in the WSFC Node/Group Matrix layout will also change the chart to match that specific replica's performance metrics on the right hand side.
There are many ways to monitor the performance of data synchronization between replica servers in an Availability Group in SQL Server. The built-in Availability Group Dashboard in Management Studio contains a wealth of information that is easy to access once you know how to customize the layout to show the most important information on the dashboard. It is also possible to use the DMVs directly from the primary replica server to monitor the performance of data synchronization using Transact-SQL, and third party tools like Performance Advisor include monitoring of the data synchronization as well. While Performance Monitor can provide this same information, the fact that the performance counters are only available from the secondary replica server makes it a little more work to get a full view of the entire environment.