V9: random soap fault- vm backup failure

VMware specific discussions

Re: V9: random soap fault- vm backup failure

by PTide » Tue Feb 09, 2016 10:55 am

Hi Alex,

Kindly open a case with our support team and post your case ID here.

Thank you.
PTide
Veeam Software
 
Posts: 3019
Liked: 246 times
Joined: Tue May 19, 2015 1:46 pm

Re: V9: random soap fault- vm backup failure

by kjstech » Wed Feb 10, 2016 1:15 pm 2 people like this post

OK, the next thing support had me test was to re-enable parallel processing, but on one job choose a specific Veeam proxy. Then, in Backup Infrastructure, on that Veeam proxy, specify the Network transport mode.

So I did that as a test, and last night this job completed successfully (about 45 minutes FASTER than usual, too).

This got me thinking... We are an NFS shop here, so why not try the new direct storage access? Both of my Veeam proxies have a second NIC on that storage network and can ping our EMC VNX5200 NFS interfaces. So I went into our EMC VNX5200 storage array configuration and gave both Veeam proxy IP addresses full access to our NFS file systems. I am testing a job right now and it successfully backed up 11 VMs in 15 minutes 17 seconds. The backup had already run a few hours earlier, so there was only 12.8 GB of changed data out of the 745.9 GB processed and 28.5 GB read.
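For anyone who wants to sanity-check the prerequisites first, a small Python script along these lines can confirm that a proxy can reach the filer's NFS interfaces over the storage NIC. This is only a rough sketch: the IP addresses below are placeholders, not our actual VNX interfaces.

Code: Select all
#!/usr/bin/env python3
"""Quick reachability check from a backup proxy to NFS storage interfaces."""
import socket

# Placeholder addresses -- substitute your own filer's NFS interface IPs.
NFS_INTERFACES = ["192.168.50.10", "192.168.50.11"]
PORTS = [111, 2049]  # portmapper and NFS


def tcp_check(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for host in NFS_INTERFACES:
        for port in PORTS:
            status = "reachable" if tcp_check(host, port) else "UNREACHABLE"
            print(f"{host}:{port} -> {status}")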

Tonight I will be excited to see how NFS direct access handles the jobs, and also to check whether our alert-bot text message about our website being inaccessible goes out. With NFS and hot-add, if a machine is not on the same ESXi host as the Veeam proxy (and it won't always be, since we have 2 proxies but 6 hosts), the VM gets stunned when the hot-add disk is released and it loses pings / network connectivity. So I'm hoping that this mode will help, per https://www.veeam.com/kb1681

I know that's drifting a little off topic, but maybe you can try the different transport modes. For me, Network worked, and Direct NFS appeared to work as well on the one job I tested it with.
kjstech
Expert
 
Posts: 142
Liked: 14 times
Joined: Fri Jan 17, 2014 4:12 pm
Full Name: Keith S

Re: V9: random soap fault- vm backup failure

by kjstech » Thu Feb 11, 2016 5:39 pm 2 people like this post

Wow, Direct NFS access is great. We didn't get any text message notifications of machines losing ping during the backup window, and we halved our backup window. The 10 Gb port is now showing close to 10 Gb going to the Exagrid appliance; previously it was a little less than half that.

The only issue is that neither Direct NFS nor the NBD failover could access one VM on the same NFS datastore as other VMs that backed up successfully. The path reported is vnxfs1\SolarWinds Log & Event Manager\SolarWinds Log +Jg-Event Manager.vmdk.

Logs sent on support ticket.

No CBT or SOAP errors at all in this transport mode.
kjstech
Expert
 
Posts: 142
Liked: 14 times
Joined: Fri Jan 17, 2014 4:12 pm
Full Name: Keith S

Re: V9: random soap fault- vm backup failure

by davecla » Sun Feb 14, 2016 9:43 pm

So in my case the SOAP Auth errors just stopped after about a week.
As far as I can tell, nothing changed in the ESX or Veeam environments over that time that could have affected the backup process.

Strange...
davecla
Influencer
 
Posts: 20
Liked: 2 times
Joined: Wed Feb 03, 2016 9:40 pm
Full Name: Dave Clarke

Re: V9: random soap fault- vm backup failure

by Gostev » Mon Feb 15, 2016 10:40 pm

These issues are currently suspected to be caused by intermittent failures in the vCenter SSL certificate validation process, so the root cause of the issue is likely outside of our code. We're currently trying to confirm that in one of the affected environments by running a temporary hotfix that disables certificate validation completely. I've asked the devs to keep me posted on their findings.
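For anyone who wants to look at the certificate their vCenter is actually presenting while this is being investigated, a minimal Python sketch like the one below will dump it for inspection. The hostname is a placeholder, and this is only an illustrative check, not the validation logic in the product.

Code: Select all
#!/usr/bin/env python3
"""Dump the SSL certificate presented by vCenter on port 443."""
import socket
import ssl

VCENTER_HOST = "vcenter.example.com"  # placeholder hostname
VCENTER_PORT = 443

# Skip chain verification so even a self-signed certificate can be inspected.
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

with socket.create_connection((VCENTER_HOST, VCENTER_PORT), timeout=5) as sock:
    with context.wrap_socket(sock, server_hostname=VCENTER_HOST) as tls:
        der_cert = tls.getpeercert(binary_form=True)

# Convert to PEM so it can be compared between runs or inspected with openssl.
print(ssl.DER_cert_to_PEM_cert(der_cert))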
Gostev
Veeam Software
 
Posts: 21390
Liked: 2349 times
Joined: Sun Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: V9: random soap fault- vm backup failure

by kjstech » Wed Feb 17, 2016 4:52 pm 2 people like this post

Thanks Gostev,

We were able to close the case today. Since forcing the transport mode to Direct Storage, we haven't had a single SOAP error.
We're an NFS shop, so we've also reaped huge benefits from no longer having disruptive VM stun times at backup completion caused by the hot-add transport mode. We've also halved our backup window, as the Direct Storage transport has proven to be almost twice as fast in throughput.

We had one issue with a VM that had an ampersand (&) in its name, but the fix was to rename it in vSphere and then Storage vMotion it to another filesystem; Storage vMotion takes care of renaming all the file paths. In Veeam we moved this particular VM to another job tied to that filesystem, and it was successful.
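If you want to catch other VMs with an ampersand (or similar special characters) in their names before they trip a backup, a short pyVmomi sketch like the one below can list them. This is only an example: the vCenter hostname and credentials are placeholders, and it assumes the pyVmomi package is installed.

Code: Select all
#!/usr/bin/env python3
"""List VMs whose names contain characters known to cause path-encoding trouble."""
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

VCENTER = "vcenter.example.com"            # placeholder
USERNAME = "administrator@vsphere.local"   # placeholder
PASSWORD = "changeme"                      # placeholder
SUSPECT_CHARS = set("&%+")                 # characters to flag in VM names

# Skip certificate verification for this quick check (adjust for production).
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

si = SmartConnect(host=VCENTER, user=USERNAME, pwd=PASSWORD, sslContext=context)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if any(ch in vm.name for ch in SUSPECT_CHARS):
            print("Check VM name: " + vm.name)
finally:
    Disconnect(si)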

So in our case, changing the transport mode worked. Initially, disabling parallel processing helped, but support had us run some tests with it re-enabled using a different transport method. I came across the Direct NFS transport support in V9 and gave it a shot. THANK YOU!!!

For reference, we are on the following VMware builds:
ESXi 5.0.0, 2312428
vCenter Server 5.0.0, 2656067

Storage is NFS on an EMC VNX5200.
Backup is to an Exagrid appliance using the Veeam Accelerated Data Mover.
kjstech
Expert
 
Posts: 142
Liked: 14 times
Joined: Fri Jan 17, 2014 4:12 pm
Full Name: Keith S
