Failed SQL Always On cluster node recovery issue

tjdoc · Post by **tjdoc** » Dec 20, 2024 10:43 am this post

Hi,
We have been testing a scenario whereby there is a VM failure / corruption for a server in a SQL AO cluster pair.
We would then use Veeam to restore the failed virtual machine and 're-connect' it into the cluster.
The flow of this process / test is:

-SQL primary node (active SQL databases) AO replicating to secondary (standby) AO cluster member
-Simulate failure of the active node by shutting it down and disconnecting from the network
-SQL AO failover, so the secondary is now active
-Restore the failed VM from backup (to a new VM but same settings etc)
-Bring the server online (as the standby)
-SQL AO would then sync itself, from active to the new standby and all databases be up to date
-Once all synced up, failover SQL so the new VM becomes the active SQL server again

When we followed this process, we noticed that as the time the backup was taken was too old for SQL AO to automatically start its sync (approx. 18 hours) we needed to use Veeam to replay transaction logs to a more updated time, this was done at the SQL instance level using Veeam SQL Explorer.
This appeared to work ok.

However, the reason for this discussion is that while the transaction logs were being replayed, we noticed that the restore process also dropped the "active" databases on the secondary node, then restored the databases as part of the restore process to the standby.
This was observed by our DBA's in the SQL server logs.

This means that any data written to the secondary (active) server following the failover would have been overwritten / lost as part of the Veeam restore process to recover the failed server.

The Veeam backup being restored was for the failed primary only, we did not try to do anything to the secondary, however it seems Veeam restored data to both.

This was just a test scenario, but our conclusion is if this happened in Production then it appears we would have data loss?
Something does not sound right with it all - maybe we did not use Veeam correctly or there is a process we did not follow, however if anyone has any ideas as to if this behaviour is expected we would be grateful to understand more and be able to create a procedure that allows us to recover a failed node which is part of a SQL always on cluster.

Many thanks for any help...
Tim

Dec 23, 2024 8:42 am

Hi Tim,
there is a lot going on in this note, so let´s dissect it a bit.

a)
VM fails and SQL AlwaysON failover happens to secondary node (becomes active).
You restore the failed VM to an older point, but SQL AlwaysON will not sync.
- Can you please share what errors you got?
- What did you try to mitigate this?

b)
You workaround to restore the SQL server to a newer state in order to allow SQL to sync.
I think this is maybe done wrong as usually the SQL server is recovered completely and becomes the active one then SQL will sync this restored SQL with the original. Likely this is where you saw the overwrite.
If you "Restore to original" the wizard does not give you and option and restores the database in a way that the restore becomes the active database overwriting everything. This is done in that simple way to allow non DBAs to recover databases without making it complicated.
Likely what you need is the wizard with the "restoring to another server" which allows you as well to restore to the same server but leave the transaction recovery open for later applying other changes. https://helpcenter.veeam.com/docs/backu ... ml?ver=120
I do not know how this comes together with the AlwaysOn sync of the DB that is in active state, but you can at least give this option of recovery to your DBA and he should know if this is what he needs.

c)
There is as well the opion that you let the DB sync from the active server from scratch which might be the most simplest after server restore.

R&D Forums

Failed SQL Always On cluster node recovery issue

Re: Failed SQL Always On cluster node recovery issue

Who is online