SQL server AlwaysOn Availability Group data latency issue on secondary replicas and synchronous state monitoring

This article explains how data synchronization works on SQL Server AlwaysOn Availability Group and some details about how to use sys.dm_hadr_database_replica_states table to check replica states.

I borrowed this daligram from this article which explains data synchronization process on SQL Server AlwaysOn Availability group.

The article has full details about the process. Things are worthing noting here are step 5 and 6.
When a transaction log is received by a secondary replica, it will be cached and then

  • 5. Log is hardened to the disk. Once the log is hardened, an acknowledgement is sent back to the primary replica.

  • 6.Redo process will write the change to the actual database

So for a Synchronous-Commit secondary replica, after step 5, the primary replica will be acknowledged to complete the transaction as “no data loss” has been confirmed on the secondary replica. This means after a transaction is completed, SQL server only guarantees the update has been written to the secondary replica’s transaction logs files rather than the actual data file. So there will be some data latency on the secondary replicas even though they are configured as “ Synchronous-Commit”. This means after you made some changes on the primary replica, if you try to read it immediately from a secondary replica, you might find your changes are there yet.

This is why in our project, we need to monitor the data synchronization states on the secondary replicas. To get the replica states, I use this query:



   rs.is_primary_replica IsPrimary,






from sys.availability_replicas r

inner join sys.dm_hadr_database_replica_states rs on r.replica_id = rs.replica_id

The fields end with “_lsn” are the last Log Sequence Number at different stages.

  • last_received_lsn: the last LSN a secondary replica has received

  • end_of_log_lsn: the last LSN a has been cached

  • last_hardened_lsn: the last LSN a has been hardened to disk

  • last_redone_lsn: the last LSN a has been redone

  • last_commit_lsn: not sure what exactly this one this. But from my test, most of the time, it equals to last_redone_lsn with very rare cases, it is little smaller than last_redone_lsn. So I guess it happens a little bit after redone.

If you run the query on the primary replica, it will returns the information for all replicas

If you run the query on a secondary replica, it will returns the data for itself.

Now we need to understand the format of those LSNs. As this article explained, a LSN is in following format:

  • the VLF (Virtual Log Files) sequence number 0x14 (20 decimal)

  • in the log block starting at offset 0x61 in the VLF (measured in units of 512 bytes)

  • the slot number 1

As the LSN values returned by the query are in decimal, we will have to break them into parts like this:

Now we can compare the LSN values to figure out some information about state of the replica. For example, we can tell how much redo process in behind the logs have been hardened on Replica-2:

last_hardened_lsn - last_redone_lsn = 5137616 - 5137608 = 8.

Note, we don’t need to include the slot number(the last 5 digits) into the calculation. In fact, most of the LSNs in dm_hadr_database_replica_states table do not have slot number. As this document says: they are not actual log sequence numbers (LSNs), rather each of these values reflects a log-block ID padded with zeroes. We can tell this by looking at the last 5 digits of a LSN number. If it always ends with 00001, it means it is not actual LSN.

As from this article we know the block offset in measured in units of 512 bytes, in the example above, the difference between last_hardened_lsn and last_redone_lsn that means there is 8 * 512 = 4096 bytes data is waiting to be written to the data file.

UPDATE: I have just observed that last_redone_lsn could be NULL when there are no replicas acting as primary in the group. It is not always and I don't know under what condition it will NULL. However last_commit_lsn seems always have a value and as I mentioned it has similar value as last_redone_lsn(maybe just slightly behind last_redone_lsn). So  last_commit_lsn might be used when last_redone_lsn is not available. 

This article has much better explanation about how the transaction logs work in Availability group.

Print | posted on Sunday, April 30, 2017 9:58 AM


No comments posted yet.

Your comment:


Copyright © Changhong Fu

Design by Bartosz Brzezinski

Design by Phil Haack Based On A Design By Bartosz Brzezinski