Unison
Enthusiast
Posts: 96
Liked: 16 times
Joined: Feb 17, 2012 6:02 am
Full Name: Gav
Contact:

Replication job is randomly creating duplicate replica VMs

Post by Unison »

Hi all,
I have had a long-running case with Veeam regarding this issue (almost a year, ever since Veeam v7 went in - Case # 00528944) but to date we have not been able to solve this problem, so I thought I would put it to the community to see if others have seen this, found the cause, and if there is a solution. I have nearly given up finding a solution for this and am hoping that with v8 we will see this issue vanish just as easily as it appeared with v7.

Bit of a run down on the setup and the problem....

We have 2 hosts running all our VMs (VMware 5.1). Veeam backs them up to local storage on the Veeam server (physical box). Backup jobs run fine - they even got much faster with v7.
All of the VMs on those 2 hosts also get replicated to a 3rd host via a Veeam replication job. We are only seeing a problem with replication. Host3 is only used for replication: it has no production VMs running on it, nothing else goes to it and nothing else interacts with it besides Veeam.
All 3 hosts are in the same rack and connected to the same gigabit switch stack.
What we see from time to time is a doubling up of some of the replica VMs at the destination. The replica job only holds 7 restore points for each VM. When the job first runs, it creates each replica VM as you would expect, and the job hums along nicely like that, sometimes for weeks or months, with no problem. Then all of a sudden it will stop using the replica set for a particular VM and will create a whole new replica VM on host3. The job continues humming along as if nothing is wrong; it just doubles up and moves on. So now we see a duplication of that VM on the replication host. In the vSphere client under host3 we see this:

Server1
Server1 (1)
Server1 (2)
Server2
Server3

The doubled-up VM replica just gets a number in brackets and the old ones are left behind. Sometimes we see some VMs double up 3-4 times. It's not always the same VM that gets doubled up and we don't even know when it happens - we will just be in vSphere doing something else and notice that there is a new double-up. "Oh, there's another one".

The other VERY strange thing we see sometimes is a doubling up of a VM with the EXACT same name!!
So we see this in the vSphere client:
Server1
Server1

Server2
Server3

Notice there are 2 Server1's with the exact same name - no brackets! The first time I reported that to Veeam (within this case) they didn't believe the print screen I sent them. They thought it was photoshopped until they remoted in and saw it with their own eyes! 2 VMs with the same name shouldn't even be possible, but it happens here with these replica sets from time to time. (Wish Veeam used a better forum engine so I could post that print screen here :( - come on Veeam, time to use a new forum engine :) )
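
For anyone who wants to spot these double-ups without eyeballing the vSphere client, here is a rough Python sketch. It assumes you already have a plain list of the VM names registered on the replica host (how you export that list is up to you), and the function name `group_replicas` is made up for illustration. It groups names by stripping a trailing " (N)" and reports any base name that appears more than once, which covers both the bracketed duplicates and the exact-same-name case.

```python
import re
from collections import defaultdict

# Matches the " (N)" suffix seen on duplicated replicas, e.g. "Server1 (2)".
SUFFIX = re.compile(r"^(?P<base>.+?) \((?P<n>\d+)\)$")


def group_replicas(vm_names):
    """Group inventory names by base name, stripping any trailing " (N)"."""
    groups = defaultdict(list)
    for name in vm_names:
        match = SUFFIX.match(name)
        base = match.group("base") if match else name
        groups[base].append(name)
    # Only base names with more than one VM indicate a double-up.
    return {base: names for base, names in groups.items() if len(names) > 1}


# Names taken from the listing above.
inventory = ["Server1", "Server1 (1)", "Server1 (2)", "Server2", "Server3"]
print(group_replicas(inventory))
```

One caveat: a VM whose real name legitimately ends in " (1)" would be grouped with its base name too, so treat the output as a list of candidates to check by hand.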


Veeam server is Win7 Pro 64-bit. Replication target is 08R2. Transport mode is set to network rather than auto on the advice of Veeam. I've lost count of the number of Veeam techs that have been on this case, and the list of things we have tried would make this post much longer than it already is - but I am willing to try any of your suggestions if you have seen this issue at your site and been able to resolve it with *something*.

The Veeam techs and I have analysed gigs and gigs of logs from Veeam/VMware and there is just no clue or hint as to when or why this happens.
We have deleted all replica sets several times and had the job recreate them, recreated the job from scratch, turned off parallel processing, applied Veeam patches, changed the target for the replication job and tried a whole collection of other tricks, but we still eventually see this doubling up of replicas.

Really hoping others have seen this and that there might be a solution - though I have all but lost hope in the current release and have all fingers crossed that this problem magically vanishes with v8 :)

Thanks guys
foggy
Veeam Software
Posts: 21071
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Replication job is randomly creating duplicate replica V

Post by foggy »

The first thing that comes to my mind, is that something changes moref ID of the replica VM, making Veeam B&R unable to find it and forcing it to start replication from scratch. If support engineer could somehow confirm that the replica VM moref ID has changed at some point (was different from the one replication job was trying to find), that would point to some external activity to be the reason of such behavior (since VM gets moref ID upon its registration in VI and Veeam B&R does not handle moref ID assignment, except the call to vSphere to register the newly created replica).
Gostev
Chief Product Officer
Posts: 31559
Liked: 6722 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Replication job is randomly creating duplicate replica V

Post by Gostev »

How do you add VMs to the replication job?

If via container (instead of individual VMs), then what is happening is pretty clear. There is some process that un-registers, and then registers your VMs back at the source host, making them get new unique ID (aka moRef). Because of new moRef, the job automatically treats those re-registered VMs as the new VMs, automatically picks up and replicates them over, just as it would with any "legitimate" new VM created on the host. Veeam then correctly resolves the naming conflict by adding (1), (2) etc. to the original VM name. v8 will definitely not change anything about it, so I would recommend that you track down the root cause of these mystery VM re-registrations with VMware support.

This was probably already checked by support, but make sure that you have registered source hosts by adding vCenter (if you have one), as opposed to adding your ESXi hosts as standalone. Because if you do the latter, VMotions will cause this kind of behavior.
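
To make the moRef reasoning above concrete, here is a toy Python model. It is only a sketch of the idea and not how Veeam B&R or vSphere are actually implemented: `ToyInventory`, `plan_replicas` and the `vm-100` style IDs are all invented for illustration. The point it shows is that if VMs are tracked by ID rather than by name, a VM that is un-registered and registered again looks like a brand-new VM, and the name clash is resolved with a " (1)" suffix.

```python
class ToyInventory:
    """Toy stand-in for a host inventory: a fresh ID on every registration."""

    def __init__(self):
        self._next_id = 100
        self.vms = {}  # moref -> VM name

    def register(self, name):
        moref = f"vm-{self._next_id}"
        self._next_id += 1
        self.vms[moref] = name
        return moref

    def unregister(self, moref):
        del self.vms[moref]


def plan_replicas(inventory, known):
    """Return replica names to create for source IDs not seen before.

    `known` maps source moref -> replica name and is updated in place.
    """
    created = []
    for moref, name in inventory.vms.items():
        if moref in known:
            continue  # same ID as last run: keep using the existing replica
        taken = set(known.values())
        candidate, n = name, 0
        while candidate in taken:  # resolve the clash with " (1)", " (2)", ...
            n += 1
            candidate = f"{name} ({n})"
        known[moref] = candidate
        created.append(candidate)
    return created


inv = ToyInventory()
first_ref = inv.register("Server1")
known = {}
print(plan_replicas(inv, known))  # first run creates "Server1"
inv.unregister(first_ref)
inv.register("Server1")           # same name, but a new ID
print(plan_replicas(inv, known))  # second run creates "Server1 (1)"
```

Note that the stale entry for the old ID stays in `known`, which mirrors the old replica being left behind on the target.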

Re: Replication job is randomly creating duplicate replica V

Post by Unison »

foggy wrote:The first thing that comes to my mind, is that something changes moref ID of the replica VM, making Veeam B&R unable to find it and forcing it to start replication from scratch. If support engineer could somehow confirm that the replica VM moref ID has changed at some point (was different from the one replication job was trying to find), that would point to some external activity to be the reason of such behavior (since VM gets moref ID upon its registration in VI and Veeam B&R does not handle moref ID assignment, except the call to vSphere to register the newly created replica).
Hi Foggy,
That is what has been making it more difficult to resolve - Veeam techs have gone down that path regarding moref. I am not exactly sure what Veeam logs when it runs into a missing VM or cannot find the VM with the correct moref, but to date Veeam techs have not been able to find any evidence in the Veeam or VMware logs that suggests that is what is happening. There is nothing in the job log that says "oh, there is no VM with that moref - so I am going to create a new replica set now" - it just seems to happen. No error, no logging about it - then we end up with two VM replicas.
Is this right? Should Veeam be logging this sort of thing?
I am not sure if you can look at this job, see how many Veeam techs have looked at it and how long it's been going for, or how many hours I have spent with a Veeam tech remoted into our system, but it has been a lot of time and not only has no solution been found, but no cause either :(.

There is no other product doing replication, and host3 in particular has nothing else interacting with it besides Veeam. If a VM is created on host3, it's because Veeam created it, and Veeam is only supposed to create a VM if the one it's looking for doesn't exist (which it still does - and there is no complaint from Veeam about a changed moref). We even tried replicating to a different host, just to make sure it wasn't something funky happening with VMware on that host, but the same problem happened with Veeam replication on the other host.

Trust me, and if you can see the job notes you will see it: we have spent countless hours blaming every other thing/system besides Veeam in the hunt for a solution, but nothing has been found and nothing else is at play here. Veeam is the only thing that would cause a change in moref and the only thing that would request the creation of a new VM.

Re: Replication job is randomly creating duplicate replica V

Post by Unison »

Gostev wrote:How do you add VMs to the replication job?

If via container (instead of individual VMs), then what is happening is pretty clear. There is some process that un-registers, and then registers your VMs back at the source host, making them get new unique ID (aka moRef). Because of new moRef, the job automatically treats those re-registered VMs as the new VMs, automatically picks up and replicates them over, just as it would with any "legitimate" new VM created on the host. Veeam then correctly resolves the naming conflict by adding (1), (2) etc. to the original VM name. v8 will definitely not change anything about it, so I would recommend that you track down the root cause of these mystery VM re-registrations with VMware support.

This was probably already checked by support, but make sure that you have registered source hosts by adding vCenter (if you have one), as opposed to adding your ESXi hosts as standalone. Because if you do the latter, VMotions will cause this kind of behavior.
The replication job is populated with VMs via a connection to the vCenter server (in the job: add VM, navigate through the vCenter server into the datacentre, then select all VMs one by one). We also tried creating a replication job where the VMs were added by connecting directly to the host (going around vCenter), but the problem happens with either method. The only difference is that when you go around vCenter the duplicate VMs end up showing in vSphere as '(INVALID)' and greyed out, though the VM files still exist on the datastore; when you go through vCenter, the duplicate VMs stay valid and accessible in vSphere.

What you're both saying is what seems to be happening, but in all this time we have not been able to find any indication of it in the Veeam logs or the VMware logs. If this is what is happening, would you expect to find detail about it in the logs? Which logs? What would you expect them to say?

In your response you mention that Veeam correctly resolves the naming conflict by adding a (1) or (2) at the end of the new VM it creates, but this is not always true. We sometimes end up with VMs that have the exact same name: two valid replicas that are both accessible, with the exact same name. Veeam is creating the VMs, but I didn't think that VMware would allow the creation of two VMs with the same name - another sign that something very strange is happening with replication. I have one still present in the replica set right now (2 replicas with the same name) because the Veeam techs are still looking into this and can't explain it.
VMware have looked at this but push it back onto Veeam, as there are no signs of issues in their logs and it's the Veeam product that seems to be getting confused about which VMs exist and hence creating new ones (when it really doesn't need to).
I understand how the Veeam logic works: the moref doesn't exist, so create a new one. But we have all seen software do things that do not follow its expected logic.

I don't bother with 'blaming' anything or anyone - that's not helpful - I'm just looking for the cause and the solution, wherever that leads :) But no leads can be found, and relying on the logic is not a lead in all cases. My hope for v8 was a long shot but it's almost what we are reduced to - it seemed to start with v7 so maybe it will go away with v8 :) Though I know that's not likely.
tsightler
VP, Product Management
Posts: 6012
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Replication job is randomly creating duplicate replica V

Post by tsightler »

I looked at the case and it appears that you only very recently started using vCenter instead of the host directly. It's not clear to me if you cleaned up all of the replicas completely before you did this.

There's plenty of information in the Veeam logs around MoRef ID changes so I'm sure if the issue was caused by these changes it would have been caught by now, however, I do see the events where the duplicate replicas are created, and in the three cases that I looked at they appear to coincide with a specific failure on the previous run.

Specifically, there are multiple replication runs that end with a failure like this (identifiable information changed):


<33> Error    Failed to reload replica vm '[name '<vm_name>', ref 'vm-2123']'   at Veeam.Backup.Core.CSnapReplicaVm.ReloadVm()
In each and every case where the replica run ended with this error, on the next successful run Veeam failed to find the replica VM and instead created a new VM. Interestingly, it appears that these runs are reported as success within the Veeam HTML report included in the logs.

I'm not sure what's causing the ReloadVm function to fail, however, I suspect this is the cause of the seemingly "random" creation of duplicate replica VMs. Veeam keeps replica details in a custom property in the VMX file on the target VM. Since this property is updated at the end of the job, I'm guessing we have to call ReloadVM to update this information correctly with the host, and it appears that this reload is timing out in some cases. On the next replica run this failure appears to leave Veeam unable to locate the correct VM, since the properly updated property is not available and thus the replica and Veeam DB are no longer in sync. Veeam then goes forward with creating a new replica VM instead.

I'm not 100% sure this is what is happening, but given that I was able to trace this to 3 different VMs, and the exact pattern was obvious in all of those cases, I'm guessing that's the reason for the behavior, and this information should be quite valuable for support.

Assuming this theory is correct, while not handling this failure might be considered a "bug", the root cause would appear to be the failure of this ReloadVM call, which isn't something that I would expect to see on a regular basis in a properly functioning environment (I've not seen it in other environments). I suspect there should be a failed event in vCenter (at least during the time you were using vCenter) showing this reconfig failure (it's a call to the VMware API), so perhaps it might have some additional details as to the failure, or perhaps it did finish but just took longer than Veeam was willing to wait. That should be easy to correlate with the vCenter event logs.
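
As a rough aid for that kind of correlation, a small Python sketch like the one below could pull the reload failures out of a job log. The regular expression is built only from the single error line quoted above, so the exact layout of real log lines (timestamps, prefixes) is an assumption; the function name `find_reload_failures` and the second sample line are made up for illustration.

```python
import re

# Pattern derived from the one error line quoted in this thread; real log
# lines may carry extra prefixes, which is why search() is used below.
RELOAD_FAILURE = re.compile(
    r"Failed to reload replica vm '\[name '(?P<name>.*?)', ref '(?P<ref>[^']+)'\]'"
)


def find_reload_failures(lines):
    """Yield (line_number, vm_name, moref) for each reload failure found."""
    for number, line in enumerate(lines, start=1):
        match = RELOAD_FAILURE.search(line)
        if match:
            yield number, match.group("name"), match.group("ref")


sample = [
    "<33> Error    Failed to reload replica vm '[name '<vm_name>', ref 'vm-2123']'"
    "   at Veeam.Backup.Core.CSnapReplicaVm.ReloadVm()",
    "a hypothetical unrelated log line",
]
for hit in find_reload_failures(sample):
    print(hit)
```

To scan a real file, pass an open file object instead of `sample`. The VM names and refs it reports can then be checked against the VMs that gained a duplicate replica on the following run.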

Re: Replication job is randomly creating duplicate replica V

Post by Unison »

Hi Tom,
Yes, about a month and a half ago we changed back to having the job go through vCenter. It was set up like that in the beginning, then we went through the host directly for a long time, but in an attempt to resolve this issue we went back through vCenter - and as you know, we see the same problem with either method. Neither method caused other issues, nor did either improve this problem.

The Veeam logs are showing you detail around moref issues? You can see the creation of the last 3 duplicate replicas? What was the reason for creating them - did it simply say that the moref couldn't be found, or was there more detail than that? Or was the log line you posted (with the VM name removed) all you could get in the way of a 'reason'?
How many times did you see the log line that you posted? Can you PM me the VM names that had this log entry so I can confirm whether they had or still have a duplicate replica?
What does this error actually mean to Veeam: "Failed to reload replica vm"? Does that just mean it couldn't find the VM with that moref, or does it specifically mean something else that might give a clue as to where to look?
I am not sure if you can find it in the case notes, but Veeam support did find some errors in the logs at one point. They did not correlate with the creation of a duplicate VM, so no more attention was paid to them - i.e. when the errors showed up, no duplicate VMs were created.

I don't ever remember Veeam techs showing me the log entry you just revealed. At least you can find the log entry that shows WHEN this happened, and it has helped you to think about WHY it's happening.

So in the Veeam logs you can see that Veeam failed to find the right replica but was able to create a whole new replica, and even though it did that, the HTML job report showed the job as 'successful'?? This is why we never know or catch when it happens - we just get no alert or indication that something happened. While it is good that Veeam can continue on like this and complete the job, it shouldn't really be considered successful, because the original set couldn't be found and it had to be recreated. That's a pretty drastic step, and an entire VM set going 'missing' deserves an alert or at least a note on the job report.
Is the job supposed to still be considered successful if Veeam has to do a VM re-creation, or should it be showing an alert about that in the job email report?

With your ReloadVM function idea, you are thinking that Veeam calls this to update the VMX, there is a time-out (i.e. it fails to run or complete), and that results in the VMX becoming locked or corrupted or missing some information, so on the next run Veeam creates a new VM. Is there anything in the logs that gives more detail as to what happened with the ReloadVM function - whether it ran at all, what point it got up to?
Can you please send me the log lines and time/date details of those 3 failed ReloadVM calls?
I have not seen a duplicate VM get created in the last few weeks, but there are still a heap of duplicate VMs so a new one would not be obvious. I was thinking of clearing them out again and watching for new ones; I left the old ones there in case Veeam support could use them in the investigation.

We recently had to clear out the events area of our vCenter database so I can't really go back far in history any more. In the past, though, when we have seen double-ups happen, I have captured Veeam and VMware logs at that time going back a few days beforehand, and nothing was found in the VMware logs to show anything going wrong.
However, you seem to have found this ReloadVM function issue in the logs pretty easily, so perhaps the wrong pair of eyes have been looking.

Really appreciate the time you have taken to look at this, Tom (and everyone) - it makes me feel like I should have called on the high level of skill in the community sooner :).
I hope the cause is still discoverable!

Re: Replication job is randomly creating duplicate replica V

Post by tsightler »

I don't see any changes to MoRef IDs that aren't caused by Veeam. The source VMs have the same MoRef IDs even when new replica VMs are created. I was only pointing out that there is plenty of information in the log to determine if changes to the source MoRef IDs were the cause of the duplicate replicas and I don't see any evidence of this. When the duplicate replicas are created the source MoRef remains constant so that's obviously not the problem in your case, at least not in any of the instances I could find.

If it's OK, I'll reach out to you offline as I'm not really comfortable posting any log snippets in the forum and this will likely require significant digging to really find out what's going on.

At this point I think the best thing to do would be to follow the advice in the support case, which is to reconfigure everything one more time from complete scratch, using vCenter for everything, and completely remove the old replicas. Let it run, and when you get the very first duplicate replica grab the logs from both Veeam and vCenter. I don't know if vCenter would actually show an error, but that's important to investigate because, if the ReloadVM completed successfully according to vCenter but Veeam shows it as a failure, that would give even more clues as to the possible root cause. But for now it's just a working theory.

Re: Replication job is randomly creating duplicate replica V

Post by Unison »

Hi Tom,
Thanks for coming back, and yes, I saw your private email this morning so I will jump into that next.
Right - I see your point about the moref information in the logs now. It's good that you can see that the 'expected' cause for an issue like this is not actually causing the problem in this instance. At least we can move on from that and dig deeper now.

Whatever we try and discover offline/in private, I will continue to update this post, and hopefully I can eventually update it with what the cause/solution was all along :)

Re: Replication job is randomly creating duplicate replica V

Post by Unison »

Just updating this post.
To resolve this issue we completed the recommended steps again:

"At this point I think the best thing to do would be to follow the advice in the support case, which is to reconfigure everything one more time from complete scratch, using vCenter for everything, and completely remove the old replicas."

Since doing that a couple of months have passed and there have been no more duplicate VMs created. If they have not appeared by now, I think the issue must have been resolved by recreating the job completely and deleting all existing replicas. A pain in the butt, but it looks like the only fix.

Hopefully we never see this again :)

Re: Replication job is randomly creating duplicate replica V

Post by foggy »

Thanks for reporting this back, Gav. Glad you were finally able to get rid of those duplicate VMs.