Agent-based backup of Windows, Linux, Mac, AIX and Solaris machines.
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Potential failover cluster issue

Post by ejenner »

I'm just at the initial stages of dealing with a problem which has occurred overnight.

I may have the wrong end of the stick as I've only spent a few minutes looking at this.

However, it seems from initial inspection that if your file server role has failed over to a different cluster node, then Veeam Agent for Windows will back up your file server as if it had never seen it before.

i.e. it has used all the free space in my repository because it's treating my file server backup as a completely new backup after the file server role flipped to the other node.

Can anyone confirm whether they have seen the same behavior, or is this a unique quirk of my setup?

To rectify it, I'm going to have to delete the backup files which were created, flip my file server role back to the node it was on previously and, fingers crossed, it will do incrementals based on what I have for those jobs in the repository.

Looks pretty serious, I hope I'm mistaken! :(
Dima P.
Product Manager
Posts: 14396
Liked: 1568 times
Joined: Feb 04, 2013 2:07 pm
Full Name: Dmitry Popov
Location: Prague
Contact:

Re: Potential failover cluster issue

Post by Dima P. »

Hi EJ.

To clarify, are you talking about an agent job managed by the backup server with the failover cluster type, or are you referring to a standalone agent installation? Thanks!

Re: Potential failover cluster issue

Post by ejenner »

Managed agent.

Re: Potential failover cluster issue

Post by ejenner »

I've decided to make this official, as I don't want to lose the existing file server backup and would like some advice on the best way to recover the situation - 03256027

Re: Potential failover cluster issue

Post by Dima P. »

Sounds like a good plan. The behavior you described does not look right for the failover cluster job type, so please keep working with our support team to identify the root cause. Please do not forget to share the updates in this thread. Cheers!

Re: Potential failover cluster issue

Post by ejenner »

Well it's turning into a very messy situation. :(

We don't want to lose the backups, as it takes so long to start them again, and of course all our restore points would be lost.

Just to make it clear, this happened because one of our cluster nodes had a blue-screen / STOP error. It happened while the node was being backed up by Veeam... but in the spirit of charity we'll say this could equally happen whenever a STOP error occurs, regardless of what was happening at the time the node crashed. i.e. the fact that it was running a backup at the time isn't important.

So the only reason this has happened is that the cluster wasn't stable. A problem with the cluster caused Veeam to try to back up data from a different cluster node; for some reason it sees it as different data and is doing a full backup to new files in the repository.

Unfortunately, I've taken a couple of wrong turns trying to avoid the loss of all the backups, so we may be forced to start again with new initial backups.

I had an idea that we might be able to create a scale-out repository and add another disk tray to give the jobs room to grow into a new volume. That went OK, and when I merged the existing disk into the scale-out repository, Veeam automatically updated all of the jobs on that repository to point to the scale-out.

When the backups tried to run, the problem jobs finished with warnings saying the data placement policy couldn't be met, which is expected behavior according to support.

So then the plan was to clear space on the first repository so the problematic jobs had room to stretch their legs and complete. This would require moving jobs from one repository to another. However, it seems that after creating the scale-out there is no way to delete it, as the jobs are now dependent on the scale-out repository. You can't go back, you can't go forward!

At the moment we're in a bit of a dodgy position, as the jobs have not been very successful over the last few days. I hope nobody asks for a restore. I don't know how this is going to end, but hopefully it ends sooner rather than later.

Re: Potential failover cluster issue

Post by ejenner »

Well I feel like I've been quite lucky. All of my backups completed successfully last night.

So the only problem I have at this point is that, compared with the same time last week, I have another 11 TB on disk despite not having changed any of the job configurations! If it has happened once, could it happen again? How am I going to get back to the state before the unexpected growth spurt?

Re: Potential failover cluster issue

Post by ejenner »

This topic does not seem to be attracting much interest. I'm not sure if that's because not many people are using Veeam Agent for Windows to back up clustered file servers?

I added a second disk tray to give the backups room to grow and hopefully resolve this issue.

Unfortunately jobs have grown beyond the capacity of that second disk tray and backups have stopped again.

The other potential issue we have is that our file servers have DFS Replication enabled. Occasionally they go out of sync and have to resynchronize. I think that when the node went down and the cluster role moved, it also triggered a DFS resynchronization. Assuming I'm correct about this, it is possibly the reason Veeam saw this as an entirely new backup.

Maybe there is a way to disable CBT and make Veeam treat our data as files: look for the filename and back up using attributes instead of changed blocks? Dunno, just speculating.

Re: Potential failover cluster issue

Post by Dima P. »

EJ,

Thank you for sharing the updates! I have been tracking your progress very attentively, but I trust our support team, as they have more expertise in finding the root of the problem and resolving such issues. I've asked our R&D team to review your case details and share an update with me. Stay tuned; I'll update this thread once I hear anything back.

Re: Potential failover cluster issue

Post by ejenner »

I'm fairly sure the problem to investigate here is how Veeam behaves when a DFS (Microsoft Distributed File System) re-synchronization occurs.

DFS synchronization runs constantly in our environment to make sure our primary file server cluster is synchronized with our standby file server cluster. So both file server clusters hold identical copies of our file server data.

If there is a critical issue such as the one we experienced, where one of the cluster nodes had a STOP error, it can force DFS to recover from potential database corruption by performing a full re-synchronization of all the data.

The issue with Veeam which follows on from this is that Veeam sees the newly synchronized files as new files which aren't already protected.

We have more than 5 million files and it takes 12 days to resynchronize. That isn't ideal, but it's the way our system was designed and we can't change our setup without a new solution being in place. I understand our configuration isn't particularly uncommon; lots of Microsoft customers are running synchronized file server clusters. We are backing up the passive cluster rather than the active one, to avoid loading the active cluster with backup traffic. If resolving this issue meant changing our source to the active cluster, I'm sure we could; we would just have to take greater care not to back up during busy periods, as with many of the other servers we're protecting.

In summary, I think there is an issue here, but I can't think of a way around it other than perhaps backing up the active cluster instead of the passive one. In the meantime, I hope we don't have any resynchronization events... but they do happen occasionally.
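
For anyone who wants to check whether one of these full resynchronizations is actually in progress before a backup runs, the DFSR WMI provider exposes a per-folder state. A minimal sketch, assuming the third-party Python wmi package (pip install wmi) run on the DFSR member itself; the state values are the documented ones for the DfsrReplicatedFolderInfo class:

import wmi  # third-party package wrapping the Windows WMI API

# Documented State values for the DfsrReplicatedFolderInfo WMI class.
DFSR_STATES = {
    0: "Uninitialized",
    1: "Initialized",
    2: "Initial Sync",    # a full (re)synchronization is in progress
    3: "Auto Recovery",   # recovering after a dirty shutdown, e.g. a STOP error
    4: "Normal",
    5: "In Error",
}

conn = wmi.WMI(namespace="root/MicrosoftDFS")
for folder in conn.DfsrReplicatedFolderInfo():
    state = DFSR_STATES.get(folder.State, "Unknown")
    print(f"{folder.ReplicationGroupName} / {folder.ReplicatedFolderName}: {state}")
    if folder.State in (2, 3):
        # In these states the next image-level backup may read far more
        # changed data than usual, so it may be worth postponing the job.
        print("  -> resync in progress; expect a much larger backup")

If a folder reports Initial Sync or Auto Recovery, that would explain a sudden jump in the amount of data the next backup reads.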
Dima P.
Product Manager
Posts: 14396
Liked: 1568 times
Joined: Feb 04, 2013 2:07 pm
Full Name: Dmitry Popov
Location: Prague
Contact:

Re: Potential failover cluster issue

Post by Dima P. »

Hi EJ,
"The issue with Veeam which follows on from this is that Veeam sees the newly synchronized files as new files which aren't already protected."
Since the agent does an image-level backup, that should not be the case. When a failover is performed, the agent should perform a full read of the source disk to make sure the disk is the same, but the backup will be incremental.

We are trying to sum up the case details, so please check my comments below:

- The repository filled up and that caused the cluster backup job to fail. If that's not true, what was the original error in your cluster job?
- You cleaned up the repository (moved the backup files to another repository). Please clarify which backup files were moved (the original backup files created by this job, or other backup files)?
- After the cleanup was performed, the cluster job created a new full backup.
- Can you also clarify: if the original cluster backups were moved to a Scale-out backup repository, did you change the cluster job configuration and point it to a new repository instead of the old one?

Thank you in advance.

Re: Potential failover cluster issue

Post by ejenner »

"- Repository was filled and that caused cluster backup job to fail. If that's not true what was the original error in your cluster job?"

The original error which started everything was a blue-screen (STOP) error and a frozen OS on the first node in the cluster. That's the node which usually holds the file server role. When the STOP error occurred, the file server role moved to the second cluster node. As the failure of the first node wasn't graceful, a full resynchronization of the DFS occurred, replicating all our file server data from a different cluster. We have more than 5 million files on the server and it took 12 days to resynchronize over the WAN.


"- You cleaned up repository (moved the backup files to another repository). Please clarify what backup files were moved (original backup files created by this job or other backups files)?"

From my point of view, I know the 6 jobs we have for that cluster take many days to perform an initial backup. So at (almost) all costs, I wanted to avoid starting a new backup anywhere, as we would be without any backup while an initial backup was happening. What I wanted to do was allow the original jobs to complete. To try and make this happen, I deleted less important backups to make space for the file server jobs. Unfortunately, this didn't create enough space and the jobs still failed. So I didn't move any files.


"- After the cleanup was performed cluster job created a new full backup"

Nope. The new full backup seemed to begin after the STOP error for the cluster node. That's what seemed to trigger a new backup of all our data.


"- Can you also clarify if the original cluster backups were moved to a Scale-out backup repository did you change the cluster job configuration and point it no a new repository instead of the old one?"

I added a new disk tray and merged it with the existing disk tray to create the Scale-out repository. So the original jobs were on the first disk tray; then I added a second physical disk tray and merged the two at the Veeam level to create the new Scale-out. As the jobs continued to run, they started to spread onto the second set of disks. Unfortunately, even the second disk tray filled up.

What I've done to try to work around this problem is to reduce the number of restore points to 5 and turn off the creation of synthetic fulls. As the jobs have been running, they've deleted old restore points and I'm getting free space on the repository again, at the cost of the number of available restore points.
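
For anyone wondering why this frees space, some back-of-the-envelope arithmetic (the sizes here are hypothetical, not our actual figures): with periodic synthetic fulls Veeam deletes a chain only once every point in it has aged out, so two fulls can sit on disk at the same time, whereas forever-forward incremental keeps a single full plus increments:

# Hypothetical sizes, purely for illustration.
full_tb = 20.0       # one full backup file (.vbk)
increment_tb = 0.4   # one daily increment (.vib)

# Forward incremental with weekly synthetic fulls: the previous chain is
# only deleted once every point in it is older than the retention window,
# so two fulls plus their increments can coexist on disk.
with_synthetic_fulls = 2 * full_tb + 12 * increment_tb

# Forever-forward incremental, 5 restore points: one full plus four
# increments; the oldest increment is merged into the full as jobs run.
forever_forward = full_tb + 4 * increment_tb

print(f"weekly synthetic fulls, worst case: ~{with_synthetic_fulls:.1f} TB")
print(f"forever forward, 5 restore points : ~{forever_forward:.1f} TB")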

As mentioned above, I'm a bit worried about continuing with the scale-out repository. I'm not keen on the way it spreads jobs across disk devices. It creates a failure point in the sense that if we lose a disk tray, we potentially damage the integrity of backups on the other storage devices in the scale-out: after the failure of one part of the scale-out, the remaining chains could be missing their full .VBK or some of their increments, and would no longer be a good backup. It seems safer to me to keep an entire backup chain on a single storage system. I'm not sure if that can be set by changing the 'Performance' setting in the storage configuration for the job? Similarly, if we get a runaway job which fills up all the space on the repository, it would only stop the jobs going to one storage device, provided the devices are kept separate.

Re: Potential failover cluster issue

Post by Dima P. »

Thanks for the clarifications. In the case details, I noticed that you are about to remove one disk tray from the Scale-out repository and re-run the job. Please let us know how it goes. Cheers!

Re: Potential failover cluster issue

Post by ejenner » 1 person likes this post

I'm fully clearing out the Scale-out repository now.

I've created new jobs targeted at the active cluster (not the passive cluster, as we've been doing so far). My theory is that the active cluster is not a replica; it's the live data. So if another resynchronization occurs (they do happen), it will be the source that is synchronized to the redundant standby cluster, and the data on the source won't change while a full DFS replication is happening.

To help me switch from one to the other, I am running both backups at the same time: I'm backing up the passive cluster to the Scale-out, and I'm also backing up the active cluster to a standard repository separate from the Scale-out. I had to add another tray of disks (and a processor) to be able to do this, but increasing storage was the only way I could see of storing two copies of the file server data at the same time, so as to keep backups available while building new backups elsewhere.

In a few days I'll have equal levels of backup for the active and passive clusters. Then I can delete the backups of the passive cluster, split the Scale-out back into individual disk trays, and start backing up at that site from scratch with blank repositories.

It's a lot of work having to think creatively to find ways of retaining data whilst also working around these fundamental issues, but as I'm fairly sure we've maintained a restore capability throughout, it has been reasonably enjoyable work.

I can't really blame Veeam for this, as Veeam has to be able to recognize when data is new. If I'm right (not always ;) ) and this is ultimately caused by DFS resynchronization, then the response from Veeam would be to find a way to recommend against backing up passive clusters. My initial instinct was to back up the passive cluster as it isn't normally being used; I couldn't see the potential issue of a resynchronization at the outset.

I don't know if there is any way to detect that the target is a passive cluster at the time the job is being configured, so that users can be warned a resynchronization could cause a second full backup on the repository? It might not be necessary to detect it, just to warn on the screen where the job is configured.

Re: Potential failover cluster issue

Post by Dima P. »

EJ,

Thank you for keeping us up to date. In a separate topic I saw a good catch: configuring the DFS replication group schedule with a 'no replication during certain times' window to make sure DFS is not interfering with backup jobs. It might come in handy in your case as well.
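
A minimal sketch of how such a window could be scripted, assuming the standard DFSR PowerShell module is available on the member server; it builds the 96-character bandwidth string that the Set-DfsrGroupSchedule cmdlet expects (one hex digit per 15-minute slot, '0' meaning no replication) and applies it from Python. The 'FileServers' group name and the 22:00-06:00 backup window are hypothetical:

import subprocess

def schedule_string(block_start_hour: int, block_end_hour: int) -> str:
    """Build a 96-character schedule string: one hex digit per 15-minute
    slot, '0' = no replication, 'F' = full bandwidth."""
    chars = []
    for slot in range(96):
        hour = slot // 4
        if block_start_hour > block_end_hour:  # window wraps past midnight
            blocked = hour >= block_start_hour or hour < block_end_hour
        else:
            blocked = block_start_hour <= hour < block_end_hour
        chars.append("0" if blocked else "F")
    return "".join(chars)

detail = schedule_string(22, 6)  # block replication 22:00-06:00 for backups
for day in ("Sunday", "Monday", "Tuesday", "Wednesday",
            "Thursday", "Friday", "Saturday"):
    # Apply the custom schedule day by day via the DFSR PowerShell module.
    subprocess.run(
        ["powershell", "-NoProfile", "-Command",
         f"Set-DfsrGroupSchedule -GroupName 'FileServers' "
         f"-Day {day} -BandwidthDetail '{detail}'"],
        check=True,
    )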

Re: Potential failover cluster issue

Post by ejenner »

Thank you for the suggestion.

However, it is not a conflict with the standard synchronization that DFS is always doing. The issue is when a total 're'-synchronization occurs, i.e. when all the data is updated with a fresh copy synchronized from the production cluster.
